
International Experiences in Language Testing and Assessment

Language Testing
and Evaluation
Series editors: Rüdiger Grotjahn
and Günther Sigott

Volume 28
Dina Tsagari
Salomi Papadima-Sophocleous
Sophie Ioannou-Georgiou

International Experiences
in Language Testing
and Assessment
Selected Papers in Memory of Pavlos Pavlou
Bibliographic Information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the
Deutsche Nationalbibliografie; detailed bibliographic data is
available in the internet at

Cover Design:
© Olaf Glöckler, Atelier Platen, Friedberg

Library of Congress Cataloging-in-Publication Data

International experiences in language testing and assessment :

selected papers in memory of Pavlos Pavlou / Dina Tsagari,
Salomi Papadima-Sophocleous, Sophie Ioannou-Georgiou.
pages cm. — (Language Testing and Evaluation ; Volume 28)
ISBN 978-3-631-62192-9
1. Language and languages—Ability testing. 2. Language and
languages—Examinations. 3. Educational tests and measurements.
I. Pavlou, Pavlos Y., 1964- honouree. II. Tsagari, Dina,
editor of compilation. III. Title.
P53.4.I577 2013

ISSN 1612-815X
ISBN 978-3-631-62192-9

© Peter Lang GmbH

Internationaler Verlag der Wissenschaften
Frankfurt am Main 2013
All rights reserved.
Peter Lang Edition is an imprint of Peter Lang GmbH
All parts of this publication are protected by copyright. Any
utilisation outside the strict limits of the copyright law, without
the permission of the publisher, is forbidden and liable to
prosecution. This applies in particular to reproductions,
translations, microfilming, and storage and processing in
electronic retrieval systems.
To the memory of our dear friend and colleague,
Dr Pavlos Pavlou
Table of Contents

Foreword
Dina Tsagari, Salomi Papadima-Sophocleous, Sophie Ioannou-Georgiou

Part I
Problematising Language Testing and Assessment

Expanding the Construct of Language Testing with Regards to Language Varieties and Multilingualism
Elana Shohamy

Social Meanings in Global-Glocal Language Proficiency Exams
Bessie Dendrinos

Towards an Alternative Paradigm in Language Testing Research: Challenging the Existing Regime of Truth
Vanda Papafilippou

Part II
Language Testing and Assessment in Schools

Formative Assessment Patterns in CLIL Primary Schools in Cyprus
Dina Tsagari and George Michaeloudes

EFL Learners’ Attitudes Towards Peer Assessment, Teacher Assessment and Process Writing
Elena Meletiadou

Part III
Language Testing and Assessment in HE

EFL Students’ Perceptions of Assessment in Higher Education
Dina Tsagari

Computer-based Language Tests in a University Language Centre
Cristina Pérez-Guillot, Julia Zabala Delgado and Asunción Jaime Pastor

Assessing the Quality of Translations
Diamantoula Korda-Savva

Formative Assessment and the Support of Lecturers in the International University
Kevin Haines, Estelle Meima and Marrit Faber

Oral Presentations in Assessment: a Case Study
Ian Michael Robinson

Part IV
High-Stakes Exams

High-stakes Language Testing in the Republic of Cyprus
Salomi Papadima-Sophocleous

Quality Control in Marking Open-Ended Listening and Reading Test Items
Kathrin Eberharter and Doris Froetscher

Strategies for Eliciting Language in Examination Conditions
Mark Griffiths

Part V
Language Testing Practices

Assessing Spoken Language: Scoring Validity
Barry O’Sullivan

The Use of Assessment Rubrics and Feedback Forms in Learning
Roxanne Wong

Identifying Important Factors in Essay Grading Using Machine Learning
Victor D. O. Santos, Marjolijn Verspoo and John Nerbonne


Foreword

This publication is an outcome of the 1st International Conference of Language
Testing and Assessment (ICLTA), which took place at the University of Cyprus from
3 to 5 June 2011, dedicated to the memory of
Associate Professor Pavlos Pavlou, a distinguished language testing and assessment
researcher, ELT practitioner, sociolinguist and esteemed faculty
member at the University of Cyprus since 1997. Pavlos passed away on August
22, 2010 after a long and difficult fight against cancer. The Cypriot and international
academic community will miss Pavlos not only as a colleague but also as
a valued friend for his good will, sense of humour and open-mindedness.
Several distinguished scholars, researchers and practitioners in the field of
Language Testing and Assessment (LTA) took part in the aforementioned conference.
One of the plenary speakers and a friend and colleague of Pavlos, Prof.
Elana Shohamy (Tel Aviv University), sent the following personal note to be
published in this volume:

To Pavlos
Pavlos was a friend, a colleague and a ‘neighbor’; after all, we both come from
the same geographical area, that of the Mediterranean, so there has always been
a close affinity along this space. Whenever we saw each other at conferences
around the world we felt a certain closeness, we both come from small countries,
visiting conferences in the big world seeking innovations we could implement in
our own contexts. We often talked about the mutual conflicts, Greek and Turks in
Cyprus, Jews and Arabs in Israel, similarities, differences and special attention
regarding the role that languages played within these conflicts and the connections
with language testing. The last time we met was at AILA in Essen, Germany,
in 2008; I ran into Pavlos as he was walking back from a tour at the Jewish
synagogue in Essen, sharing deep emotions about the experience and having the
conversation at the very spot we were standing, the place where the Jews of Essen
were taken to the concentration camps. Pavlos pointed at the sign that told the
story of the Jews in Essen and suggested that we walk to the synagogue together,
a living memory of the Jewish life and culture in Essen. We toured the external of
the synagogue, being terribly moved by the history of the place. I have not seen
Pavlos since; I was shocked to hear about his untimely death; I was not aware
of his disease. I miss him a lot in discussions on language testing, as a friend, as
a colleague, as a neighbor. We shared lots of agendas of research especially on
the power of tests; here again we come from cultures where tests dominate the
educational systems. We kept talking about putting together a conference on
language testing that will bring together experts from the Mediterranean areas. A
major component of Pavlos’ work was his research and writings (some with Prof.
Andreas Papapavlou, see:
losPavlou.aspx) on issues of bi-dialects as his deep concern related to the caring
and commitment of the difficulties that children and teachers face in circumventing
their own familiar language, the Cypriot Greek in schools while learning and
using standard Greek. This was another nexus of our work, touching familiar
issues in relation to learning of Arabic in Israel where the focus is on Modern
Standard Arabic (MSA) with little, if at all, legitimacy to the spoken language
in schools. These very studies led me to address this topic in this chapter that is
dedicated to the memory of Pavlos and to connect it with language testing. It is
with great sadness that I addressed the ICLTA conference in June of 2011. I do
hope that the writings, research and thoughts of Pavlos will continue to provide
an inspiration to the many researchers in these very important topics.
(Elana Shohamy, Tel Aviv University)

Summary of the contents of the volume

The field of LTA is growing fast in terms of theory and practice, given
the changing nature of LTA purposes and needs worldwide. This has led to a
great deal of research activity and discussion. The present volume aims to present
some of this activity based on a selection of papers on current international
experiences in LTA presented at the ICLTA conference.
Overall, the chapters of this volume focus on exciting topics of theory and
research in LTA, drawn from different parts of the world, such as Cyprus,
Greece, Spain, Romania, the Netherlands, the U.K. and China, and from different
educational levels. The volume is divided into five parts, each consisting of two to
five chapters.
The first part, entitled ‘Problematising Language Testing and Assessment’,
contains three chapters that do exactly that: ‘problematise’ the field. For example,
Elana Shohamy in her chapter argues that language tests follow a narrow view
of languages and urges the testing community to widen its perspective, that is, to
view language(s) as open, flexible, and changing constructs. More specifically,
she argues in favour of acceptance and legitimacy of multiple varieties of
languages and multilingualism and stresses the need to view language as an
interaction between and among multiple languages, especially in the context of
immigration. The author illustrates her points with examples and data and shows how
tests which adopt her proposed perspective can in fact facilitate higher levels of
academic achievement of learners.
In a similar vein, Bessie Dendrinos criticizes high-stakes language proficiency
exams as ‘ideological apparatuses’ and makes a case for ‘glocal’ language
proficiency testing. The author considers the concerns linked with global or
international [English] language testing in the context of the cultural politics of ‘strong’
(and ‘weak’) languages. However, the chapter moves beyond critique and claims
that locally-controlled testing suites may serve as counter-hegemonic alternatives
to the profit-driven global language testing industry. The author stresses that
pro-glocal language testing arguments – where attention is turned from the language
itself to the language users (taking into account their experiences, literacies and
needs) – are politically, economically and also linguistically fair. The author uses
the case of the Greek National Foreign Language Proficiency Examination suite
to illustrate her points.
In her chapter, Vanda Papafilippou argues in favour of different epistemological
approaches to language testing. The author aspires to contribute to the formation
of an alternative paradigm in language testing research and suggests a
critical poststructuralist philosophical framework drawing upon the work of Foucault
and Gramsci, operationalised through two qualitative methods: narrative
interviewing and critical discourse analysis.
Two chapters are included in the second part, entitled ‘Language Testing and
Assessment in Schools’. In the first one, Tsagari and Michaeloudes
explore the types of formative assessment techniques and methods used
in CLIL (Content and Language Integrated Learning) primary classes in Cyprus.
Through content analysis of observational data, the authors chart an array of
techniques that EFL teachers use to scaffold and support learning. The authors
also make suggestions for further research in the field and offer recommendations
for teachers.
Elena Meletiadou’s chapter introduces the readers of this volume to peer
assessment (PA). The author presents the theoretical and research background of
PA and its implementation with EFL high school adolescent students in Cyprus
to explore its impact on writing. The results of her study showed that by engaging
in PA of writing, students improved their writing skills and developed an
overall positive attitude towards PA as well as teacher assessment. The study
expands our understanding of PA and process writing and proposes
their combined use to improve students’ motivation towards development in and
assessment of writing.
The third part, devoted to language testing and assessment in Higher Education,
opens with Dina Tsagari’s chapter, which focuses on the centrality of the
tertiary students’ role, perceptions and approaches to assessment practices. The
results of her study showed that student assessment, in the context of a private
university in Cyprus, does not actively involve students nor does it encourage
substantive and collaborative learning. The author argues for the development
of suitable assessment techniques which create the affordances that support student
involvement, empower their role and eventually strengthen the link between
teaching, learning and assessment.
In their chapter entitled ‘Computer-based Language Tests in a University Language
Centre’, Cristina Pérez-Guillot, Julia Zabala Delgado and Asunción Jaime
Pastor from the Universitat Politècnica de València report on work that relates
to the adaptation of language levels and course contents to the standards defined
in the Common European Framework of Reference for Languages (CEFR).
Specifically, the chapter describes the experience of designing and delivering a
computer-based tool for the administration and management of tests for placement
and achievement purposes, which allows the Language Centre to optimise its
available human and technological resources and expand its knowledge of the
linguistic skills and abilities defined in the CEFR.
In the next chapter, Diamantoula Korda-Savva discusses issues relating to
another interesting LTA area, that of the assessment of translations, in the context
of methodological courses offered by the University of Bucharest (Department of
Classical Philology and Modern Greek Studies and Department of English). This
chapter offers indications towards the development of assessment in the ‘slippery’
area of Translation Studies and investigates the extent to which there is
room for improvement.
Kevin Haines, Estelle Meima and Marrit Faber discuss the English language
assessment and provision for academic staff at a Dutch university. More specifically,
the authors focus on the need to support lecturers in delivering some
of their teaching in English in the context of the internationalization of the university
curriculum and describe assessment procedures at two faculties guided
by the principles of formative assessment and ‘Person-in-Context’. The authors
address the assessment process from a quality perspective, making use of the
LanQua model to evaluate the procedures used. The chapter highlights the positive
engagement of the lecturers, who seem to gain greater autonomy through the
formative assessment processes recommended by the authors.
The last chapter of part three is written by Ian Michael Robinson and set in the
context of tertiary education in Italy, where students’ final subject grade is based
exclusively on a final exam (written and oral), putting great stress on the learners.
The chapter reports results of a project in which students of the second level
university cycle were offered alternative forms of assessment in their English
course, namely peer and teacher-assessed oral presentations carried out in class.
The chapter reports on this experience and examines the value of oral presentations
as a form of testing.
The three chapters of part four explore issues that relate to high-stakes exams.
For example, in her chapter, Salomi Papadima-Sophocleous presents a historical
and evaluative overview of a particular case, that of the high-stakes language
examinations offered by the Ministry of Education and Culture of the Republic
of Cyprus in the last fifty years. The chapter brings to light the need, not only
for constant updating of high-stakes examinations, but also for continuous and
systematic research of language testing in the Cypriot context. The chapter concludes
with a speculative look at possible future improvements of high-stakes
language testing in the specific context.
The topic of quality control in LTA is discussed by Kathrin Eberharter and
Doris Froetscher in the next chapter. The authors describe an attempt to enhance
reliability of marking open-ended items of a standardized national examination
introduced at the end of secondary schooling in Austria. The authors discuss how
a series of measures ensured the reliability of the exam. These included the introduction
of guidelines for standardization, the use of a specially-designed grid for
evaluating items during marking and the introduction of support structures such
as an online helpdesk and a telephone hotline for the live examination.
In the next chapter Mark Griffiths examines a different aspect of high-stakes
examinations, that of eliciting language in high-stakes oral language examinations.
Griffiths analysed video recordings of GESE (Graded Examinations in
Spoken English – Trinity College London) spoken examinations and identified
a range of examiner techniques used to elicit language. The results showed that
examiners adapted and developed prompts, creating elicitation strategies that
represented a range of conversation patterns and roles.
The final part of the volume is dedicated to language testing practices and
includes three chapters. In the first one, Barry O’Sullivan examines a specific
aspect of the testing of spoken language, that of scoring validity, which refers to
‘those elements of the testing process, which are associated with the entire process
of score and grade awarding, from rater selection, training and monitoring
to data analysis and grade awarding’. The author argues that there is a disconnect
between the underlying construct being tested and the scoring system of the test.
He highlights the main problems with current approaches to the assessment of
speaking as seen from the perspective of validation and recommends ways in
which these approaches might be improved.
In the next chapter, Roxanne Wong reports on the preliminary findings of
a pilot study based on the use of assessment rubrics and feedback procedures
in the English Language Centre of City University of Hong Kong. The study
investigated students’ and teachers’ ease of understanding the new rubrics, exemplar
booklets and the use of the marking scheme applied. Initial results of using
assessment for learning appeared positive.
The final chapter, written by Victor D. O. Santos, Marjolijn Verspoo and John
Nerbonne, aims to build a bridge between applied linguistics and language technology
by looking at features that determine essay grades, with a view to future
implementation in an Automated Essay Scoring (AES) system. The researchers
investigated various machine-learning algorithms that could provide the best
classification accuracy in terms of predicting the English proficiency level of the
essays written by 481 Dutch high school learners of English. It was found that
Logistic Model Tree (LMT), which uses logistic regression, achieved the best
accuracy rates (both in terms of precise accuracy and adjacent accuracy) when
compared to human judges.
As can be seen from the foregoing description, this volume of selected papers
from the ICLTA conference has much to offer. We are most sincerely thankful to
our authors for sharing their expertise and experience in LTA theory and practice.

The editors
Dina Tsagari
Salomi Papadima-Sophocleous
Sophie Ioannou-Georgiou

Part I
Problematising Language Testing
and Assessment
Expanding the Construct of Language Testing with
Regards to Language Varieties and Multilingualism

Elana Shohamy
Tel Aviv University

This chapter argues that language tests follow a narrow view of languages as closed and homogeneous
systems and calls for testing to follow a broader perspective and view definitions of language(s)
as open, flexible, and changing constructs. It focuses on two areas. One is the acceptance and legitimacy
of multiple varieties of language – written, oral and other varieties – each for its own functions
but all part of the broader notion of languages as used by people (Papapavlou & Pavlou, 2005). The
second is the focus on multilingualism and the need to view language as an interaction between and
among multiple languages, especially in the context of immigration and the use of multiple languages
where immigrants and indigenous groups continue to use their L-1 along with L-2, creating hybrids
and translanguaging. Data are presented to show how tests of this sort in fact facilitate higher levels of
academic achievement of students in schools.

Key words: construct validation, multilingual tests, multi-dialects, method-trait, hybrids.

1. Expanding the language construct

The main argument of this chapter is that while language tests need to follow
updated and current views of what language is, the reality is that language testing
lags behind as it continues to maintain a view of language that is not on par with
current thinking on ‘language’. In this chapter I will provide evidence of new
definitions of language from various sources along two dimensions – dialects and
languages. I am calling here for an expansion of the perspectives and views of
language by addressing it in the context of multiple varieties and multilingualism.
In the case of Cyprus this means the need to recognize and grant legitimacy
to Greek Cypriot, along with Standard Modern Greek, and the need to incorporate
a number of languages in tests. The views of multiple varieties of language
are in line with the work of Dr. Pavlou of the past years, whose research is of
prime importance in its connection to testing as the testing community overlooks
the relevance and incorporation of language varieties. It is my intention here to
address this and other issues in the context of language testing.


2. Background

An examination of the developments of language testing and assessment since the
1960s reveals that its theories and practices have always been closely related to
definitions of language and its proficiency. The field of language testing is viewed
as consisting of two major components: one focuses on the ‘what’, referring to
the constructs that need to be assessed (also known as ‘the trait’); the other
pertains to the ‘how’ (also known as ‘the method’), which addresses the specific
procedures and strategies used for assessing the ‘what’. Traditionally, ‘the trait’
has been defined by the Applied Linguistics field so that these definitions provided
the essential elements for creating language tests. The ‘how’, on the other hand,
is derived mostly from the field of testing and measurement which has, over the
years, developed a broad body of theories, research, techniques and practices
about testing and assessment. Language testers incorporated these two areas to
create the discipline of language testing and assessment, a field which includes
theories, research and applications and has its own research and publications.
Matching the ‘how’ of testing with the ‘what’ of language uncovers several
periods in the development of the field so that each one instantiated different
notions of language knowledge along with specific measurement procedures.
Thus, discrete-point testing viewed language as consisting of lexical and structural
items so that the language tests of that era presented isolated items in objective
testing procedures. In the integrative era, language tests tapped integrated
and discoursal language; in the communicative era, tests aimed to replicate
interactions among language users utilizing authentic oral and written texts; and in
the performance testing era, language users were expected to perform tasks taken
from ‘real life’ contexts. Alternative assessment was a way of responding to the
realization that language knowledge is a complex phenomenon, which no single
procedure can be expected to assess by itself. Assessing language knowledge,
therefore, requires multiple and varied procedures that complement one another.
While we have come to accept the centrality of the ‘what’ to the ‘how’ trajectory
for the development of tests, extensive work in the past decade points to a
less overt but highly influential dynamic in other directions. This dynamic has
to do with the pivotal roles that tests play in societies in shaping the definitions
of language, in affecting learning and teaching, and in maintaining and creating
social classes. This also means that current assessment research perceives
an obligation to examine the close relationship between methods and traits in
broader contexts and to focus on how language tests interact with societal factors,
given their enormous power. In other words, as language testers seek to develop
and design methods and procedures for assessment (the ‘how’) they also become
mindful not only of the emerging insights regarding the trait (the ‘what’), and its
multiple facets and dimensions, but also of the societal role that language tests
play, the power that they hold, and their central functions in education, politics
and society. Thus, the strong influence of language testing on the teaching
and learning of languages highlights the urgent need to constantly review and
assess the current views of language, the object that is being assessed (Menken,
2008; Shohamy, 2008).

3. Viewing language in updated terms

Relating to the above, I argue here for the expansion of language tests so that
they are in line with current definitions of language, as the applications of tests
have profound influence on learning and teaching. Currently, most language tests
are based on a definition of language which is closed, monolingual, monolithic,
static, standard, and ‘native like’, with very few deviations from the official norms
and with defined and set boundaries. In fact, current views of language perceive it as
dynamic, energetic, diverse, personal, fluid, and constantly evolving. In this day
and age, with migration and globalization, there is recognition of migrant and
global languages as well as multiple language varieties used by various groups,
and there are more flexible rules resulting from language contact. Thus, it is
believed that a number of languages, codes, dialects and modalities exist simultaneously
and harmoniously, resulting in cases of code switching and code mixing
(Wei and Martin, 2009). These are also manifested ‘beyond words’, via multimodal
forms of images, signs, music, clothes, food and other ways of ‘languaging’
(Shohamy, 2006). Further, language is not limited to what people say and
‘read’ but also includes how they choose to represent themselves via language in
the ecology, in public spaces in signs, personal cards and names (e.g., linguistic
landscape and symbolic representations).
This phenomenon is especially relevant with regards to the English language,
the current world lingua franca that varies across different places and spaces,
often manifested in mixed languages with no fixed boundaries, resulting in
fusions, hybrids, and multiple varieties. In fact new ‘Englishes’ are constantly
being created in dynamic and personal ways. These include hybrids, fusions and
mixes of English with L-1s, L-2s and L-n (First Language/s, Second Language/s
and Other Language/s), that flow over local, regional, national, global and transnational
spaces, often referred to as ‘trans-languaging’. These varieties of English
are especially manifested in cyber and public places as shown in the extensive
research of the past decade with regards to the language that is displayed in public
spaces, referred to as Linguistic Landscape (Gorter, 2006; Shohamy and Gorter,
2009; inter alia). It is shown, for example, that the English that is represented in
public spaces assumes new and creative ways in terms of accents, words, tones,
spellings, and a variety of innovative mixes. This language is multi-modal, consisting
of codes, icons, images, sounds and designs, co-constructed harmoniously.
Images 1 and 2 point to such varieties as multiple interpretations of new
Englishes are represented in different settings.

Image 1. A street name in Hebrew and its transliteration in English

In Image 1 we can see how the English on the sign is actually a transliteration of
the Hebrew equivalent, and one wonders about the kind of English this represents:
native, non-native or any other variety. Image 2 displays a text written by
a student, which includes various mixes of Hebrew (the L-1 of the writer) and
English (the L-2) in a communicative text included on a writing test utilizing both
languages.
It is this lack of congruence between methods of language testing and the
current views of language which brings about a need for an expansion of the
construct of language tests to fit these new notions.
The main argument for designing tests that match the current expanded views
of language is that tests are not just isolated events but rather are shown to have a
strong effect on knowledge, on teaching and on identity, far more than any new
curriculum. They also affect what we think of ourselves, as our identity is often
shaped and dictated by our test scores.
Thus, the expansion of language tests, as argued in the next sections, expands
the construct of language in two specific directions. First, language as
multi-dialectical, consisting of multiple varieties, and second, language as multilingual,
based on creative mixes as demonstrated in Image 2. I will first discuss the focus
on multi-dialect as this reflects the extensive research of Pavlou and Papapavlou
(2004) and additional sources.
Image 2. A recipe written by a student in a mixed code of Hebrew and English

4. The multi-dialect

The research by Pavlou, as published in a number of places, with Papapavlou
in articles appearing in ‘Issues of dialect use in education from Greek Cypriot
perspective’ (Pavlou and Papapavlou, 2004), and in a chapter in the edited book
by Liddicoat (2007), entitled ‘Literacy and Language-in-Education Policy in
Bidialectal Settings’, addressed one of the most relevant and important issues
in language education in a large number of contexts worldwide. This refers to
the lack of recognition of multiple varieties, often referred to as ‘dialects’, as
‘languages’, which need to be legitimized, taught, learned and assessed. Issues
of dialect recognition are relevant in many societies where differences exist
between language varieties. These also include differences between native and
non-native varieties, as in the context of English, which is used extensively by
more non-natives than natives in the world today. While the non-native varieties
are accepted in speech and in public spaces, as they reflect actual language practice
based on mixing home languages with English,
these varieties by and large are not recognized by linguists and especially not by
language testers, who determine correctness based on very strict rules. Rarely do
we find tests that legitimize the varieties of the non-natives, even in the situations
of ELF (English as a Lingua Franca), which has promoted the legitimacy of such
non-native varieties of English (Seidlhofer, 2011). Thus, while ELF is viewed as an
accepted variety, it is not possible to find tests which assess proficiency of ELF.
Similar issues relate to terms such as ‘standard’ versus ‘dialects’. As is surveyed
in the Liddicoat (2007) book, situations in which two (or more) varieties
referred to as ‘dialects’ have low status and recognition compared with
what are termed ‘standard languages’ are typical of many settings in the world
and reflect a most natural language use of regional and local varieties as per
sociolinguistic rules of language contact and creativity. Still, in these cases, the
lack of recognition of any variety that does not follow the traditional definition of
‘standard’ implies that these languages do not get recognition in the most powerful
bastion, i.e., language tests. Thus, a variety that is used extensively, as in the case
of spoken Arabic in its multiple varieties in the world, still suffers from low
prestige on the institutional levels. Spoken Arabic, for example, is still not viewed as a
legitimate school language and is overlooked in tests among Arabic learners, in
both first and second language contexts. Tests of spoken varieties, of what is often
termed ‘vernaculars’, are almost non-existent while these varieties are used in
most domains of life even in areas in which they are ‘forbidden’, such as schools.
Numerous examples and cases exist around the world, in relation to Cantonese in
Hong Kong, Ebonics or ‘black English’ in the US and many more.
In this very context it is important to mention the studies by Pavlou and
Papapavlou (2004), Papapavlou and Pavlou (1998; 2005; 2007) and Pavlou and
Christodoulou (2001), as they address in research these very issues of dialects within
the context of Cyprus: the Greek Cypriot Dialect (GCD) vs. Standard Modern
Greek (SMG). In one such study (Papapavlou and Pavlou, 2005; 2007), 133 Greek
Cypriot elementary school teachers from 14 schools were given questionnaires
in order to examine their attitudes towards the use of GCD in the classroom and
teachers’ own behaviour inside and outside the classrooms with regard to these
two language varieties. The researchers examined teachers’ opinions on students’
use of GCD and how this usage affects students’ literacy acquisition, teachers’
attitudes toward GCD and the connection between using GCD and identity.
Results showed that teachers saw it as their duty to correct students’ use of
dialects. ‘Corrections’ meant that students felt that their own natural use of language
is erroneous and substandard. Yet, teachers confessed that they too use GCD
with colleagues outside the classroom, but at the same time GCD was stigmatized as it
was claimed to badly affect the quality of communication. Teachers claimed that they
were hesitant before speaking in the classroom and this hindered them from being
intellectually active and creative.
Teachers were aware of the detrimental consequences of these repeated
corrections on students and claimed not to be in agreement that GCD is an
unsophisticated code. Furthermore, the use of GCD was viewed as a deficit, as lower class,
especially for rural students. Many of the teachers had positive attitudes towards
GCD but opposed it in the classroom. The view that it is an effective means of
communication does not grant it legitimacy as a fully-fledged language. Thus, the
researchers concluded that: ‘… the variety could have a legitimate place in school contexts. It
is therefore problematic that authorities … insist on maintaining and glorifying
a national language at the expense of local dialects, rather than accommodating
both’ (p. 173). They questioned the motives of these policies from political
ideological perspectives and recommended incorporating these studies into thinking
about language policies, as attitudes are instrumental in policy making.
My recommendation and conclusion from these studies, and those done in
other contexts worldwide on different varieties, is that there is a need to grant
legitimacy to these language varieties in language testing and assessment. As
long as there is no legitimacy to these varieties in language testing, they will not
be legitimized. Thus, testers need to take an active role here and to call for the
development of tests that will address both the standard and the local varieties.
This is how I connect this very important work of Pavlou to his most productive
and interesting work in language testing.

5. Multilingualism

The second context where I find that language testing and assessment are not
on par with the new definitions of ‘language’ in its expanded form relates to the
construct of multilingualism. Multilingualism, or additive approaches to
language, are part of life nowadays in many local and global contexts of
immigration, transnationalism and the existence of ethnolinguistic groups within
various local contexts, and these approaches are now addressed in the literature and
in policies worldwide. Yet, these views of language are totally overlooked by the
testing literature, in tests as well as in the different scales such as the CEFR. It is
often the case that two separate languages are taught in the same class:
conversation may take place in specific local languages while reading of texts occurs
in another, especially within academic contexts where English and/or other
languages are dominant. Multilingualism is an integral part of language in public
life. Multilingual competence is encouraged, taught and practiced as two or more
languages are mixed with one another in creative ways not only in separate lan-
guages in the classroom as was discussed above but also in forms of trans-lan-
guaging where one language moves into another creating hybrids and various
mixes as was shown in Image 2. It is often the case that learners born into one lan-
guage learn another, function in a number of languages and thus possess multilin-
gual competencies. More and more evidence emerges nowadays that immigrants,
foreign workers, asylum seekers and refugees continue to use a number of lan-
guages in given contexts throughout their lives (Haim, 2010). In addition, there
are cases where classes in Spanglish emerge in various places in the US. Even
where mixed languages or hybrids have no legitimacy, because some hold
the view that languages should remain separate and closed and regard mixed
varieties as intrusions, there are still many cases where a number
of separate languages are used in the same class: texts are read in English,
conversations are held in Arabic, etc. (see the special issue of Modern Language Journal,
95.iii, 2011 for a number of articles on the topic of multilingualism in schools).
Take the example of the performance of immigrants in school learning for
instance. It has been shown repeatedly that their achievement is usually lower
than the performance of native speakers. As long as tests require non-natives
(e. g., immigrants/minorities) to be like natives, measure them on monolingual
tests, and penalize them if they bring in any language aspects from their home
languages, these groups will remain marginalized. The language and knowledge
they possess is overlooked and rejected (Thomas and Collier, 2002; Menken,
2008; Creese and Blackledge, 2010). This leads to a situation where immigrants
are being given various types of test accommodations such as extra test time,
translation of test language to mother tongue, the use of pictures, etc. in order to
help them perform on academic tests and settle in one language rather than in the
two they possess (Abedi, 2004, 2009; Levi-Keren, 2008).
Indeed, in Graphs 1 and 2 we can see the constant gap between immigrants
and native speakers in tests in Hebrew and in Mathematics in the 9th and 11th
grades respectively. For the students from the former USSR it takes 9–10 years to
perform academically in similar ways to native speakers when the tests are given in a language
they are in the process of acquiring, so it takes them
a long time to close the academic gap. The students from Ethiopia cannot close
the gap at all in the first generation (Levin and Shohamy, 2008; Levin, Shohamy
and Spolsky, 2003).

Graph 1. Academic achievement of 9th grade immigrant students from Ethiopia and the USSR on
Hebrew test according to years of residence
[Graph not reproduced. Panel title: 9th grade Hebrew standard grades according to years of residence. Horizontal axis: years of residence (0-2, 3-4, 5-6, 7-8, 9-10, 11-12). One series is labelled Former USSR.]
Graph 2. Academic achievement of 11th grade immigrant students from Ethiopia and the USSR on
Mathematics tests according to years of residence
[Graph not reproduced. Horizontal axis: years of residence (until 2, 3-4, 5-6, 7-8, 9-10, 11-12).]

Yet, at the same time, we found in our studies (Levin, Shohamy and Spolsky, 2003;
Levin and Shohamy, 2008) that immigrant students who were born in a Russian
speaking context continue to make sense of the world in two languages – their
home language, the one they used in their academic settings in the country
they emigrated from, and the language they are using in the new setting. In
Graph 3 below we can see how immigrants obtain significantly higher scores in
Mathematics tests when these are administered in two languages, a bilingual
version, in Hebrew and Russian. The significance of obtaining high scores in
Mathematics should not be underestimated as research shows how long it takes
immigrants to perform well in various content area subjects because of their
limited proficiency in the language of the educational context in the
country they immigrated to. Thus, the use of two languages on tests affects their
ability to manifest their knowledge in Mathematics: these students achieve
higher scores than those Russian speaking students who are forced to perform in
one language, Hebrew. In other words, when tests are being given in a bilingual mode (e.g.,
Russian and Hebrew), immigrant students obtain higher scores, up to 8 years
beyond immigration. Home languages – in this case Russian – continue to be
central for a very long time and continue to provide an important resource for
acquiring academic knowledge.

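Graph 3. Bilingual tests as enhancing achievements scores.
[Graph not reproduced. Panel title: Math grades according to years of residence. Horizontal axis: years of residence (until 1, 2, 3-4, 5-7, 8 and up). Legend entries, partly lost in the source: ‘Hebrew +’ and ‘Hebrew’.]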


Additional studies along these lines (Rea-Dickins, Khamis and Olivero,
forthcoming) provide further support to this very situation and the need to legitimize
bi- and multilingual tests in order to assess students’ true languages in more
realistic ways, especially in the context of content-based learning, which is the most
dominant mode nowadays in learning English. One case of incorporating two
languages in a high-stakes test is the Greek National Certificate of Language
Proficiency, known by its Greek acronym KPG. It is referred to as a glocal test as it
assesses global languages in a local Greek context (Dendrinos, this volume;
Dendrinos and Stathopoulou, 2010). It includes a component entitled ‘mediation’
where the text is presented in English but the discussion of the text is performed
in Greek; mediation is different from translation/interpretation and is defined as
a process whereby information is extracted from a source text and relayed by
the test taker in a target text of a different social variety, register, code.
Still most tests, as well as rating scales such as the CEFR, do not incorporate
any of these multiple language proficiencies, mixed codes, and other ways of
demonstrating knowledge. While bi- and multilingualism is of prime importance,
tests continue to be based on monolingual knowledge (Canagarajah, 2006; Sho-
hamy, 2001).
Additional extensions can be introduced in expanding the construct of
language, along with a need to design tests that fit these. For example, there is a need to
address issues of multi-modality (New London Group, 1996; Kress and van
Leeuwen, 1996), where language is viewed more broadly, as consisting not only of
words but also of a variety of modalities such as images and sounds. The construct is
well-developed today, especially since the types of texts which appear on the internet are
multimodal, made of sounds, images, moving videos, etc.

6. Conclusions, recommendations

The construct of language has changed and so should tests. Tests need to follow
definitions of language and not dictate them. Yet, tests, given their power, work
against these new views of language: they still strive for native proficiency, since the
goal is high performance on tests, and overlook the rich and creative varieties
that exist within the same language. In spite of multilingual
practices of ‘code switching’ and simultaneous uses of languages, tests view these as
violations, as ‘the wrong answer’. So, all language tests, rating scales and rubrics
such as the CEFR are based on monolingual, mono-variety forms. It is especially
large scale tests that serve monolingual ideologies of one nation, one language,
ignoring these varieties. Rather than leading by pointing to such uses
of languages and developing tests accordingly, testers take on the role of protecting such
‘standard’ ideologies and policing monolingualism and mono-elite varieties.
Following such policies brings about significant loss. Immigrants bring with
them knowledge of the world, varied content, experiences, interpretations – can
we afford to ignore it? Students bring with them different varieties which are very
effective for communication but are downgraded in schools, which works against
learning and knowledge, while students continue to employ their L1 in academic
functioning and in fact all their lives. Furthermore, there is the issue of testing
identity (Lazaraton & Davis, 2008), showing that results of tests deliver a mes-
sage to test takers that this is the reality, a fixed reality that cannot change.
Thus, it is argued here that testers should have professional responsibility
(Davies, 2009). Given that our tests are so powerful that they can determine the prestige
and status of languages, affect the identity of people, marginalize others, lead
to monolingualism or cultivate multilingualism, suppress (or enhance) language
diversity, and perpetuate language correctness and purity, testers should try to think
more broadly about what language means in all these situations and act to prevent
such losses.
Expansion of the language construct implies more inclusive policies, such as
granting credit and recognition to different language varieties, recognizing the
mixed code used by immigrant children and adults, recognizing the literacy and
knowledge that students have, regardless of the medium – the language they use
to deliver it – and expanding language to images, signs, sounds, and a variety of
modes of ‘languaging’, especially in the internet and cyber space era. Language
testers need to re-think the meaning of language, argue with traditional
definitions, and take the side of the test takers, the learners and the ways they use
language in the rich variety of contexts.
These recommendations can contribute to the design of tests that are
more inclusive, more just and of higher levels of fairness and validity, and with
benefits to society. Rather than contesting and arguing with narrow definitions,
testers should comply with the definitions and start creating tests that are more
valid in terms of the construct. I, therefore, encourage people working in
language testing to adopt this broader view of language along the lines mentioned
in this chapter and thus to enhance broader inclusion and more valid and fair
tests. The work of Pavlou researching and critiquing traditional views of stan-
dard language versus other varieties and his work in language testing provide the
foundations we need and legitimate data to strengthen such arguments that can
bring about change.


References

Abedi, J. (2004). The No Child Left Behind Act and English Language learners:
Assessment and accountability issues. Educational Researcher, 33(1), 4–14.
Abedi, J. (2009). Utilizing accommodations in assessment. In E. Shohamy & N. H.
Hornberger (Eds.), Encyclopedia of Language and Education, 2nd Edition,
Vol 7: Language Testing and Assessment (pp. 331–347). Berlin: Springer.
Canagarajah, S. (2006). Changing communicative needs: revised assessment
objectives, testing English as an International language. Language Assess-
ment Quarterly, 3(3), 229–242.
Creese, A. & Blackledge, A. (2010). Translanguaging in the bilingual classroom:
A pedagogy for learning and teaching? The Modern Language Journal, 94,
Davies, A. (2009). Ethics, Professionalism, Rights and Codes. In E. Shohamy
& N. H. Hornberger (Eds.), Encyclopedia of Language and Education, 2nd
Edition, Vol. 7: Language Testing and Assessment (pp. 429–444). Berlin:
Dendrinos, B. & Stathopoulou, M. (2010). Mediation activities: Cross-language
Communication Performance. ELT News, KPG Corner, 249, 12. Retrieved
Haim, O. (2010). The Relationship between Academic Proficiency (AP) in first
Language and AP in Second and Third Languages. PhD dissertation, Tel Aviv
Kress, G. & van Leeuwen, T. (1996). Reading images: The grammar of visual
design. London: Routledge.
Lazaraton, A. & Davis, L. (2008). A microanalytic perspective on discourse,
proficiency, and identity in paired oral assessment. Language Assessment
Quarterly, 5(4), 313–335.
Levi-Keren, M. (2008). Factors explaining biases in Mathematic tests among
immigrant students in Israel. Ph.D. dissertation, Tel Aviv University [in
Levin, T., Shohamy, E., & Spolsky, B. (2003). Academic achievements of immi-
grants in schools. Report submitted to the Ministry of Education. Tel Aviv
University [in Hebrew].
Levin, T. & Shohamy, E. (2008). Achievement of Immigrant Students in Math-
ematics and Academic Hebrew in Israeli School: A large Scale Evaluation
Study. Studies in Educational Evaluation, 34, 1–14.
Liddicoat, A. (ed.) (2007). Language planning and policy: Issues in language
planning and literacy. Clevedon: Multilingual Matters.

Menken, K. (2008). English learners left behind: Standardized testing as lan-
guage policy. Clevedon: Multilingual Matters.
Menken, K. (2009). High-Stakes Tests as de facto Language Education Policies.
In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of Language and
Education, 2nd Edition, Vol 7: Language Testing and Assessment (pp. 401–
414). Berlin: Springer.
Papapavlou, A. & Pavlou, P. (1998). A review of the sociolinguistic aspects of the
Greek Cypriot dialect. Journal of Multilingual and Multicultural Develop-
ment, 19(3), 212–220.
Papapavlou, A. & Pavlou, P. (2005). Literacy and language-in-education policy
in bidialectal settings. Current Issues in Language Planning, 6(2), 164–181.
Papapavlou, A. & Pavlou, P. (2007). Literacy and language-in-education policy
in bidialectal settings. In A. Liddicoat (ed.), Language planning and policy:
Issues in language planning and literacy (pp. 254–299). Clevedon:
Multilingual Matters.
Pavlou, P. & Christodoulou, N. (2004). The use of communication strategies by
students of Greek as a foreign language. Proceedings of the 6th International
Conference on Greek Linguistics (pp. 871–877) [in Greek].
Pavlou, P. & Papapavlou, A. (2004) Issues of dialect use in education from the
Greek Cypriot perspective. International Journal of Applied Linguistics,
14(2), 243–258.
Pavlou, P. and Christodoulou, N. (2001). Bidialectism in Cyprus and its impact on
the teaching of Greek as a foreign language, International Journal of Applied
Linguistics, 11(1), 75–91.
Rea-Dickins, P., Khamis, Z., & Olivero, F., (forthcoming) Does English-medium
instruction and examining lead to social and economic advantage? Promises
and threats: A Sub-Saharan case study. In E. J. Erling & P. Seargeant (Eds.),
English and International Development. Bristol: Multilingual Matters.
Seidlhofer, B. (2011). Understanding English as a Lingua Franca: A complete
introduction to the theoretical nature and practical implications of English
used as a lingua franca. Oxford: Oxford University Press.
Shohamy, E. (2001). The power of tests: A critical perspective on the uses of lan-
guage tests. Harlow, UK: Pearson Educational.
Shohamy, E. (2006). Language policy: Hidden agendas and new approaches.
London: Routledge.
Shohamy, E. (2008). Introduction to Volume 7: Language Testing and Assess-
ment. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of Language
and Education, 2nd Edition, Vol 7: Language Testing and Assessment (pp.
xiii-xxii). Berlin: Springer.

The New London Group (2000). A pedagogy of multiliteracies, designing social
futures. In B. Cope and M. Kalantzis (eds) Multiliteracies, Literacy learning
and the design of social futures (pp. 9–37). London: Routledge.
Thomas, W. & Collier, V. (2002). A national study of school effectiveness for
language minority students’ long-term academic achievement: Final report.
Project 1.1. Santa Cruz, CA: Center for Research on Education, Diversity and
Excellence (CREDE).
Wei, L. & Martin, P. (2009). Conflicts and tensions in classroom codeswitching:
An introduction. International Journal of Bilingual Education and Bilingualism, 12(2), 117–122.

Social Meanings In Global-Glocal Language
Proficiency Exams

Bessie Dendrinos2
National and Kapodistrian University of Athens

If we admit that capitalism is collapsing
(functioning in a growth economy that’s
destroying the earth), we might see a hint of
hope in the development of local artefacts,
produced in micro-economies, based on local
self-sufficiency and fair-trade. Growth-oriented
export/import economies are the dream of the
past, the shock of the present and the nightmare
of the future!
Letter by a reader in the New Internationalist
(March 2009)

Viewing high-stakes language proficiency exams as ideological apparatuses involving processes
which produce and reproduce or resist specific forms of knowledge and communication exchange,
this chapter makes a case for ‘glocal’ language proficiency testing. In doing so, it considers the
concerns linked with global or international [English] language testing in the context of the cul-
tural politics of ‘strong’ (and ‘weak’) languages. However, the chapter moves beyond critique and
claims that locally-controlled testing suites may serve as counter-hegemonic alternatives to the
profit-driven global language testing industry. The pro-glocal language testing arguments – using
the Greek national foreign language proficiency examination suite as a case study – are political,
economic and also linguistic. Specifically, in glocal testing, attention is turned from the language
itself to language users (taking into account their experiences, literacies and needs) and may well
serve multilingual literacy.

Key words: international language testing, language proficiency certification, glocal testing, multi-
lingualism, mediation, literacy (literacies).

1. International English language testing as a ‘self-serving’ project

1.1 The (political) economy of English language proficiency tests

The worldwide English language testing enterprise has established itself as ‘a
commercial condominium trading in a specific knowledge product, the
standardised EFL proficiency test’, says Templer (2004), and it is true. International
English language proficiency testing is indeed a profitable enterprise. For
example, it is estimated that the revenues of the Educational Testing Service alone (an
organisation which markets the well-known TOEFL exam) range from 700 to
800 million dollars each fiscal year. Likewise, other big commercial testing firms
or organisations control and wish to control further English proficiency exams,
investing significantly in this project. The 1990s Corporate Plan of the British
Council, for example, aimed “to secure a substantial share of agency markets for
educational and cultural services overseas”, and to do so “on a full cost-recovery
basis” (quoted in Pennycook, 1994, pp. 157–8). Other goals were to promote
“a wider knowledge of the English language abroad” and also “to increase the
demand for British examinations in English as a foreign language” (ibid).
International proficiency testing is constantly seeking to expand in significant
markets. During the last decade, Eastern Europe and South America were tar-
geted, and still are, but the most recent grand acquisition is China. Yet, certif-
icate-hungry markets like Greece are not ignored. The Greek market is sought
after by a good number of English testing enterprises (about 15 international Eng-
lish language proficiency tests are ‘recognized’), and these often shape the needs
of the social groups that they target. By this, I mean that the international test-
ing industry does not just respond to social needs for testing and certification; it
actually helps shape them. Therefore, well-established and well-respected inter-
national exam suites in Greece, for example, have helped generate and reinforce
the need to certify young learners’ language proficiency even though certificates
of this sort are useless to children who would be better off spending their valuable
time on creative language play instead of spending it to prepare for standardized
language tests. When children do sit for the exams and pass them, Greek parents
are convinced that the certificate they get is the proof they need to be reassured
that their child is learning/has learnt English (cf. Pavlou, 2005). The more pres-
tigious the brand of the certificate, the greater their conviction is – especially if
the branding process exploits the name of a reputable English or American uni-
versity3 and the credentials that it provides. Greek parents are proud to announce
that their child has the ‘Cambridge University’ or the ‘Michigan University’
degree.4 And they are willing to pay more for these ‘international’ exams than
for local alternatives, convinced as they are, as stakeholders elsewhere, that these
are the “only efficient, scientific option[s] for an international assessment mar-
ketplace” (Templer, 2004), and participating in a ‘hegemonic consensus on the
inevitability of it all’ (Mulderrig, 2003).
3 For the ideological operations and effects of branding, see Mitsikopoulou (2008).
4 The various English language proficiency exam suites operating in Greece advertise that
they offer a ‘ptychio’ [degree] rather than a certificate, intentionally blurring the difference
between a university degree and a language certificate.

The exams under consideration are not in reality international, if we under-
stand the term as it has been used in the last two centuries.5 They are actually
national products administered internationally. To label them international is to
purposefully attribute to them the traits of entities that extend across or transcend
national boundaries. But, internationalization should not go hand-in-hand with
forfeiting the right to claim ownership over the language of testing or of language
tests in the language in question.6As a matter of fact, for dominant languages like
English, it is financially beneficial to sustain the construct that language is owned
by its native speakers (cf. Prodromou, 2008).7 Thus, the naturalisation of the idea
that the native speakers of a language have the exclusive prerogative to assess
proficiency in their language is not innocent, nor are decisions of organizations
such as ALTE (the Association of Language Testers in Europe) to offer member-
ship only to country-specific testing organisations, which have developed exams
for the language of their country.
International proficiency testing has grown so much in the last 50 years that it
is worth the attention of international conglomerates which have their fair share
in the commodification of language teaching, learning and assessment, and
the accompanying goods, such as textbook sets, testing paraphernalia, ELT mul-
timedia and fee-paying services such as training or exam preparation courses
(cf. Dendrinos, 1999; Pennycook, 1994). Language testing organisations spend
millions on promoting, advertising, and even on bribing their prospective
customers.8 No wonder McNamara (2002) stresses the need for critical analysis of
‘industrialised language testing’ and suggests that a thorough social critique of
the language-testing project be undertaken. In a similar vein, Templer (2004)
calls for a critical ethnography of the language testing industry which, since the
late 90s, has raised concerns about its ethics.

5 The term ‘international’ refers to an organization relating to or involving two or more nations,
(e. g. an international commission). It also means extending across or transcending national
boundaries (e. g. the international fame one has). Finally, as a noun, it refers to any of several
socialist organizations of international scope formed during the late 19th and early 20th cen-
turies – the first one was formed in London in 1864 as the International Workingmen’s Asso-
ciation aiming to unite all workers for the purpose of achieving political power along the lines
set down by Marx and Engels.
6 For discussion on this issue, see Widdowson (1994).
7 For further discussion regarding the ownership of language by native speakers, see Higgins
8 In Greece, one commercial English testing firm in particular offers (through mobile phone
messages) a price per candidate head to private-tuition language school owners who can guar-
antee a group of at least 15 candidates for each exam administration. It also offers public
and private language school teachers who can bring in candidate customers for their English
exams special package deals such as trips abroad.

Respected testing scholars like Spolsky (1995), Hamp-Lyons (1997 & 2000),
Fulcher (1999), and Shohamy (2001a) have urged that the international testing
community look into and abide by ethical practices. Actually, concern about ethi-
cal issues in testing led the International Language Testing Association to draw
up a Code of Ethics – even though most of the ethical issues posed therein are
rather limited to the testers themselves. That is, they do not really cover insti-
tutional policies and practices such as entrepreneurial tactics sometimes used
by commercial testing, nor do they touch on specific procedures that may lead
to corrupt practices, such as not making a clear line of distinction between the
educational institution that prepares prospective test-takers for a specific exam
and the institution that administers that exam.

1.2 Tests as ideological apparatuses

Undoubtedly, international proficiency language testing has a tradition of tools
that measure forms of knowledge with psychometric accuracy and precision.
Test validity and reliability of assessment are hardly ever questioned by the
general public, and any questions, posed by testing specialists, serve the purpose of
improving the quality of the English language proficiency test tasks, the content
of which is often chosen on the basis of validity criteria, rather than what form of
knowledge is most appropriate and/or desirable in a given social context.
The content of international testing is not questioned either, with regard to
whether or to how well it responds to the experiences, literacies and needs of
local stakeholders. Nor is it read critically for its ideological underpinnings, even
though it is without doubt that tests are not value-free or ideology-free products,
and exam systems are apparatuses involving processes of ideologisation – just as
the discursive practices in the texts of language textbooks do (Dendrinos, 1992),
and just as language teaching practices do. “You may be sobered, as I was”, asserts
Holliday (2005, p. 45), “by the fact that in your everyday teaching, ‘small culture
formation is in many ways a factory for the production of ideology’”.
Texts and images used in tests carry ideological meanings, while the choice
of subject matter and text type, as well as the linguistic choices in texts, construe
worlds and realities for the test taker. Balourdi (2012)9, who critically analyses the
reading comprehension texts of three proficiency exams (two international ones
and the Greek national foreign language exams), makes a data-based argument
illustrating this. One of them – an American proficiency
exam – linguistically constructs a market-orientated world, one in which
advertising has a central role. Specifically, upon very detailed systemic functional
linguistic analysis of a rather sizeable corpus of texts, the researcher finds that
there is consistent use of linguistic patterns which serve as typical features of
advertising discourse and which position the implied reader as a potential
consumer. Linguistic patterns also serve the foregrounding of the supplier-customer
relationship, with money having a key role in that relationship.
9 Amalia Balourdi is a PhD student, working under my supervision, at the Faculty of English
Studies of the University of Athens, who submitted her thesis, entitled World Representations
in Language Exam Batteries: Critical Discourse Analysis of texts Used to Test Reading Com-
By contrast, the reading comprehension texts in the second battery, an English
proficiency exam, construes quite a different world – a world of inner experi-
ence inhabited by British subjects. This becomes apparent as Balourdi (ibid) finds
that lexicogrammatical choices favour behavioural processes, indicating states
of consciousness and mental activity, but also relational attributive processes,
assigning text participants with strong feelings and emotions. It is a world of
subjects who are mainly preoccupied with their personal and professional life,
who make persistent attempts to prove themselves, to achieve personal goals and
fulfill their ambitions through hard work and the use of their creative skills. Work
and travel, which are among their prime interests, are consistently construed as
opportunities for the participants to gain social recognition, fame, and success.
It is somewhat ironic that both the aforementioned proficiency tests, which
are internationally administered, contain texts inhabited by nationally defined
subjects – Americans in the first exam battery and British in the second. The
Greek exams, on the other hand, which are nationally administered,
contain texts inhabited by subjects whose national identity is unmarked. That is,
the subjects in the Greek foreign language exams, which will be discussed more
extensively later, are ‘citizens of the world’ according to Balourdi (ibid).

1.3 Testing for symbolic profit

Having discussed the commercial aspect of international proficiency testing in
some detail, it is important not to shift attention from the fact that the material
turnover from the exams is not the only reward for the testing organizations. In
fact, I should like to suggest that the symbolic gain is just as crucial as the mate-
rial profit.
International language proficiency testing has several self-serving goals,
including that of sustaining the position of the national language it is built to
market and the culture in which the language is immersed. In other words, it
is substantially involved in the larger politics of the internationalisation of language,
so that it is easily exported around the world both as a service and as a
marketable product (cf. Pennycook, 1994, p. 157). To this end, not only are the
implications of the spread of a dominant language neutralised and denigrated,
but also language itself is portrayed as culturally and ideologically neutral (cf.
Dendrinos, 2001; 2002).
Interestingly, the theory of language at the heart of international proficiency
testing is that language is a structural system cut off from ideological inscriptions
and disconnected from its cultural context. Such a construal is serviced best by a
structural view of language, a theory of language as an abstract meaning system
which can be learnt and used irrespective of its contextual use. This theory is
materialised in test tasks and items commonly focusing on lexical and sentence
level meaning, as well as in the assessment criteria for oral and written production,
which favour accuracy over appropriacy. It is also materialised in projects such as
the ‘Cambridge English Profile’, which investigates candidates’ scripts by focus-
ing on the formal properties of language.10

1.4 International proficiency testing as a monolingual project

The final point to be made with regard to international proficiency testing is that
it is by default a monolingual project. It does not involve adjustments to the cul-
tural, linguistic or other needs of particular domestic markets because this would
mean that the same product could not be sold in different cultures. It would need
to be adjusted and involve more than one language, which would complicate mat-
ters from a financial point of view.
Therefore, all international language proficiency tests (paper-based and adap-
tive e-tests) are monolingual, as are diagnostic tests and self-assessment tests
increasingly available, especially for the ‘big’ languages. Shohamy (2011, p. 418),
who argues that “all assessment policies and practices are based on monolingual
constructs whereby test-takers are expected to demonstrate their language profi-
ciency in one language at a time”, claims that these assessment approaches view
language as “a closed and finite system that does not enable other languages to
‘smuggle in’.” And this is true for both international proficiency language testing
and for national language tests in schools and universities.11 They are monolingual
both as examination suites (i. e., they are intended to test a single language)
and as assessment systems (i. e., they are constructed to measure monolingual
competence). Test papers endorse the idea that effective communication is monolingual
(Dendrinos, 2010a) and that proficient users of a language do not use
‘hybrid’ forms, mix languages or codes. As has been discussed elsewhere (Dendrinos,
2010b), commercial English language testing and teaching continue to be
a monolingual venture.

10 The Cambridge English Profile is a project whose goal is to add specific grammatical and
lexical details of English to the functional characterization of the CEFR levels, based on the
Cambridge Learner Corpus, which contains exam scripts written by learners around the world
(Hawkins & Buttery, 2010).
11 In Shohamy’s 2011 paper, from which she is quoted, we understand that her real concern is
with national language testing for immigrants and their unfair treatment by the educational
system which obliges them to take monolingual tests, despite the fact that they are bi- or
multi-lingual speakers “who rarely reach language proficiency in each of the languages that is
identical to that of their monolingual counterparts” (ibid). Yet, she continues, they are “always
being compared to them and thus receive lower scores. Consequently, they are penalized for
their multilingual competencies, sending a message that multilingual knowledge is a liability.”

This monolingualism “is in stark contrast to the current understanding of multilingual
competencies for which various languages and aspects ‘bleed’ into one
another in creative ways” says Shohamy (ibid), who critiques “current monolingual
assessment approaches within a political and social context.” Interestingly,
her statement makes us think that there is a multilingual trend in language
and language education policies worldwide. Unfortunately, however, this is not
so. Current language policies in the U. S. and in other economically powerful
countries are – sadly – predominantly monolingual. Within the European Union,
by contrast, the promotion of multilingualism has been a consistent endeavour over
the last 10 or 15 years. Yet, according to the Common European Framework
of Reference for Languages (CEFR, Council of Europe, 2001, p. 4), it has not yet
“been translated into action in either language education or language testing”.
International proficiency testing is not only intent on measuring the test-taker’s
linguistic competence in a single language, but on measuring it against the ‘ideal
native speaker’. The political and economic dimensions of language teaching,
particularly English language teaching, based on the native-speaker paradigm
have been discussed by various scholars (Dendrinos, 1998, 1999; Phillipson, 1992;
Pennycook, 1994). What has not been widely discussed is the motive behind proficiency
testing targeting linguistic competence that resembles that of the native
speaker.12 I do think that there is an additional motive for testing to aim specifically
at linguistic competence, and that is that it is simply easily measurable through objective
test items. Accurate measurement of communicative or sociolinguistic competence
would involve test items and tasks that would take into consideration the
social and cultural awareness of specific groups of test takers – a requirement that
international proficiency testing simply cannot meet.

12 In the context of the work carried out for the ‘Cambridge English Profile’, the lexical and
grammatical features of the foreign language learners’ language production are measured
against native speaker production. As Hawkins & Buttery (2010, p. 3) assert, they are defined
‘in terms of the linguistic properties of the L2, as used by native speakers that have either been
correctly, or incorrectly, attained at a given level’.

2. Alternatives to international language proficiency testing

Concerned about the cost of language proficiency testing organisations such as
TOEFL and IELTS which, in effect, have become “an EFL testing condominium
or cartel”, Templer (2004) suggests several alternatives such as in-house testing,
low-cost computer-based placement tests, ‘democratic’ international exams with
hands-on local involvement to reduce exam fees, and learner-directed assessment
with the use of tools such as language portfolios. He also considers an idea discussed
by Bérubé (2003, in Templer, 2004), which is to assign ‘testing handicaps’
to different groups of candidates depending on their social background. In addition, he
proposes low-cost local variants, such as the Malaysian University English Test
(MUET), an exam developed in Malaysia, under exclusively local control in SE Asia,
and also recognised in Singapore. However, in suggesting these
alternatives, Templer (ibid) is thinking more of language testing that may or may
not secure a candidate’s position into university. In fact, he is concerned about
the cost of university language entrance tests, which may prevent lower income
students’ access to academia. In considering this issue, he also brings into the
discussion the fact that scores on tests such as TOEFL and IELTS “have assumed
a prime classificatory (and disciplinary) function”.

2.1 ‘Glocal’ alternatives to the international language proficiency testing industry

Concentrating on regional variants for language proficiency testing leading to
certification, which is required for employment or admittance into university
programmes and which counts as ‘cultural capital’ for the holder, this chapter
focuses on what I have been calling glocal (global + local) language proficiency
testing13 (Dendrinos, 2004).

13 In a recent talk Pennycook (2010a) gave, he argued for the need to rethink the relation between
the global and local because “the two terms sit in a very complex relation to each other”
and “this relation is not helped by terminology such as ‘glocal’, since this simply elides the
two ideas”. I think this may be true in some cases, but not in others. Actually, in his talk,
Pennycook did not propose an alternative term. I believe it is rather difficult to find an all-
encompassing notion/term representing all the different social practices and systems, which
may abide by global construals while focusing on the local. However, some of the points he
made in his talk are worth thinking about, including that we consider a model that equates
homogeny with the global and heterogeny with the local (a point I am not sure I agree with) and
that we undertake a thorough exploration not only of globalization but also of localization (and
relocalization in relation to recontextualization and resemioticization), so that the two terms
may (in Pennycook’s words) ‘capture the dynamic of change’. This issue is also discussed in a
book he published that year (cf. Pennycook, 2010b).

Glocalisation involves locally operated schemes, set up to serve domestic
social conditions and needs, which are informed by international research and
assessment practices. The most obvious benefit of glocal exam suites is that they
are low cost alternatives to profit-driven industrialised testing. The less obvious
advantage, but perhaps more important, is their socially interested control over
forms of knowledge and literacy. Therefore, I should like to suggest that they
would constitute a counter-hegemonic option with respect to the acquisition of
knowledge – perhaps conducive to socio-political aspirations for democratic citizenship.
A case in point is the Greek National Certificate of Language Proficiency,
known by its Greek acronym KPG (ΚΠΓ) – a language exam suite that is comparable
to other non-commercial, state-supported testing systems, such as the
French and the Finnish language proficiency exams.14 The basic link between
them is that none of the three are profit-driven commercial exams as they are, all
three of them, controlled by public service organisations. The French exam, however,
is intended to test a single language. Operated by the Centre international
d’études pédagogiques, under the aegis of the French Ministry of Education, it
has been developed to test proficiency in French as a foreign language and it is
mainly for students of French outside of France. The symbolic profit, justifying
the investment of state funds, is not unrelated to cultural politics of French language
promotionism. In a marketised discourse, the official website advertises
the DILF/DELF/DALF exams and claims that certification on the basis of these
exams allows one ‘to opt out of a French university’s language entrance exam’
and that ‘having one of these French certifications looks good on your CV’.

14 Though there are a few similarities between the French and the Finnish exams, the Finnish
National Certificate of Language Proficiency is significantly different in that it is set up for
several languages as second or foreign languages in Finland. Its tests are intended for adults,
and they are developed jointly by the National Board of Education and the University of Jyväskylä.
Much like the French and other European exam batteries, test tasks measure reading,
listening, writing and speaking skills, on the six-level scale, in line with European models.

The French exam, which is similar to other national exams for the certification
of proficiency in their languages as foreign languages, such as the English,
German and Italian, is very dissimilar to the Greek and Finnish national language
exams because the latter two are multilingual suites. They are intended to
test proficiency in several languages as these are used at home and abroad. Both
these suites have been built taking into account domestic needs related to the
languages they include.
The Finnish exam includes tests in Finnish as a foreign language, not as a prod-
uct to be exported, but as a service to those who apply for citizenship and need to
have their language proficiency certified.15 The Finnish exam suite includes other
languages, which are significant in Finland, even if they bring no profit. Swed-
ish is offered because it is the second official language and Saami because it is
an important, indigenous language in Finland. Finally, low-cost exams are also
offered in English, French, German, Italian, Russian and Spanish.
Likewise, the Greek exams are offered in languages which are important for
Greece. Besides English, French, German and Italian, which are the most widely
taught and learnt languages in the country, KPG offers exams in Spanish (a poly-
centric language of growing popularity in Greece), and Turkish (the language of
one of Greece’s major neighbours and the only other language which has been
recognised as a minority language in one area of Northern Greece). The next
language exam to be developed is for Russian, a language spoken by many who
immigrate to Greece from Eastern Europe. Certification in Russian would pro-
vide these immigrants with one additional qualification that might secure them
a job.
There are, of course, differences also between the Finnish and the Greek exam
suites. Two of the essential ones are KPG’s view of language not as a structural
but as a semiotic system and the special support of the exam suite to multilin-
gualism, not only through the assessment system itself, but also because it is
the first such exam battery to legitimate language blending as part of the testing

2.2 Characteristics of the KPG glocal language proficiency exam suite

2.2.1 An overview

The KPG exam suite, governed by the Greek Ministry of Education, was instituted
by law in 1999 and became operational in 2002. In 2003, it launched exams
in the four most widely taught languages in Greece: English, French, German
and Italian. From the very start, the exams used the CEFR as a springboard for
content specifications and the KPG adopted the six-level scale of the Council of
Europe for certification purposes. Test paper development and research related to
test validity and assessment reliability are the responsibility of foreign language
and literature departments in two major universities in Greece.

15 There is also a Greek as a foreign language exam, but it is not a component of the KPG exam
suite. The ‘Examination for the Certificate of Proficiency in Greek’ is developed and administered
by the Centre of the Greek Language, also under the aegis of the Greek Ministry of

The language exams in pen-and-paper form, presently offered twice a year,16
are administered by the Greek Ministry of Education, Lifelong Learning and
Religious Affairs, which ensures exam confidentiality and venue security by using the
support mechanism with which it is equipped for carrying out the national university-
entrance exams. Test paper content must be approved by the Central Examination
Board (henceforth CEB) before being disseminated, through a V. B. I. (vertical
blanking interval) system, to state-selected examination centres throughout the
country. The administration of the exams is directed and regulated by the CEB,
which is appointed by Ministerial decree. Consisting of university professors who
are experts in the foreign language teaching and testing field, the CEB is also
responsible for specifications regarding exam format and structure, as well as
scoring regulations. The CEB also functions in an expert consulting capacity,
advising the Ministry on matters regarding the development and growth of the
system, exam policies and law amendments, new and revised regulations.
The A (Basic User) level exam was developed upon popular demand for young
candidates – as a preparatory step to higher-level exams. It should perhaps be
noted here that the format of the KPG exams is more or less the same for all
exam levels and all languages. As a need has been noted for an adult A-level
certificate, there are plans to launch such an exam in the future –not necessarily
in all KPG languages. The large percentage of candidates sitting for the B-level
(Autonomous User) and C-level (Proficient User) exams in English are teenagers
and young adults, but the picture is different for other languages. For example, in
Italian and Turkish even the A-level candidates are adults.

2.2.2 A low-cost alternative

The KPG exams, initially funded by the state, are indeed an economical alternative
to commercial testing. Unlike international proficiency testing with its
overpriced fees, the KPG Board is concerned about how to make the exams as
affordable as possible to the average Greek candidate. So, apart from the fact
that testing fees are about half the price of commercial international exams, it
was recently decided to develop and administer graded, intergraded paper-based
exams with which a candidate pays a single exam fee, sits for one exam, but may
be certified in one of two levels of proficiency. Each test paper includes tasks for
two levels of language proficiency (A1+A2, B1+B2 and soon C1+C2) and candidates
have a double shot for a certificate – for either the higher or the lower of the
two levels, depending on the level of their performance.

16 Presently, only a paper-based version of the KPG exams is offered, but an e-version is being
developed and the goal is, in 2013, to launch computer adaptive tests in the six KPG languages
in five examination centres throughout the country, equipped to facilitate special needs candidates
to a greater degree than pen-and-paper centres do now. The e-tests will not replace
the paper-based exams but the two options will be offered to cater for candidates who are
computer literate and those who have little computer savvy.

There are a series of other KPG tactical moves aiming at implementing a ‘people’s
economy’ approach.17 One of them is the policy not to assign a numerical
score to the successful candidate, whose proficiency is identified by the highest
certificate s/he has obtained. Therefore, the issue of score expiration does not
arise, as it does for some commercial English proficiency exams, which stipulate
that scores are valid for two years (so that after that the tests have to be taken and
paid for all over again). A second tactical move is to set up examination centres
not only in big cities, but also in towns and on the islands, so that candidates do
not have to bear the additional cost of travelling to the bigger cities as they have
to do for international exams. Where it becomes too expensive to send non-local
examiners to conduct the speaking test, the test is carried out through videoconferencing,
as the Ministry of Education has a direct connection with each of
the exam centres all over Greece.
Law and KPG regulations stipulate that only public schools be used as official
exam centres, that these centres be under the control of local educational authorities,
and that the exam committees, the secretarial assistants and the invigilators
be, all of them, educators working for the public school system, which makes
them more accountable for security measures.
Concern about both providing a low-cost alternative to candidates and contributing
to the sustainability of the system has not led to measures which could
jeopardise the validity and reliability of the exams. Therefore, the KPG has not
resorted to the cost-saving solution that most international proficiency tests opt
for, i. e., having a single examiner to conduct the speaking test. To ensure fair
marking, KPG law and regulations require that oral performance be assessed
and marked by two examiners, who must both be present in the room where the
speaking test is being carried out. Accordingly, a large number of trained examiners
are sent to exam centres all over the country to conduct the speaking test on the
same weekend that the rest of the test papers are administered, though it is a rather
costly solution, which most international exams do not prefer.18 Even though the
option of finishing off in one weekend is ultimately to the candidate’s benefit, the
decision was made for the sake of confidentiality. That is to say, when a speaking
test is administered over the period of say one month, there is always a danger of
the test tasks ‘leaking’ and this would be a serious drawback for a national exam.

17 The term is borrowed from Ransom & Baird (2009).
18 Most international, commercial proficiency tests conduct their speaking test in Greece over a
much longer period of time, with far fewer examiners.

Moreover, at each exam period, trained observers are sent out to different
selected centres, not only to monitor the speaking test procedure, but also to
assess examiners’ conduct during the test, their understanding of the evaluation
criteria and the marking reliability (Delieza, 2011;19 Karavas, 2008;20 Karavas
and Delieza, 2009). Despite the cost, the observation system is systematically
implemented throughout the country, aimed at ensuring marking reliability and
inter-rater agreement.
Special concern with fair marking has led to the KPG using two of its trained
script raters to mark each candidate’s script, as well as script-rater ‘coordina-
tors’ (one for every group of 20) who function both as directors and facilitators
during the script assessment and marking process. This means that each script is
blindly marked by two raters. If the Script Rating secretariat discovers that there
is greater rater disagreement than the system allows, a special screening process
is used.

2.2.3 An ‘egalitarian’ alternative

As was already mentioned, glocal multilingual language testing suites, such as
those that we find in Greece and Finland, have as one of their main purposes to
cater for the linguistic needs of the domestic job and education markets, as well as
for other regional social demands. This is quite a valuable function given that the
‘weak’ languages – those of the economically and/or politically disadvantaged
countries – are excluded from the commercial testing industry. They do not have
shares in the condominium of international testing, built by dominant languages
and, especially, English.
Expanding on the point above, it is important to understand that the languages
of economically underprivileged societies and countries which have limited
international political power are not exportable products or domestic commodities.
Therefore, investment in the development of a language proficiency exam for
foreigners or for domestic use – even if it were possible – does not seem worthwhile.
This is why perhaps none of the Balkan countries, for example, has developed
its own language proficiency certification system, nor is there any big
commercial enterprise taking interest in developing exams to certify proficiency
in, say, Albanian or Bulgarian. It is obvious, in other words, that certification
of language proficiency is tightly linked to language dominance and economic
affluence. And this, in turn, has an important bearing on the status of a language,
while it also has significant impact on the social position of its speakers within
society and on their employment rights.

19 See also a shorter article at:
20 The Handbook can be downloaded from

If proficiency in a language cannot be certified, the language itself does not
count as qualification for employment or promotion in the workplace in Greece
and elsewhere. Given that proficiency in ‘weaker’ languages often has no way of
being certified, the KPG suite with its multilingual policy hopes to include some
of these languages which are important for the Greek society and, in doing so,
to develop psychometrically valid and reliable language measurement tools for
them as well.
As a matter of fact, not only ‘weak’ but also ‘strong’ European languages (if
strength is counted by numbers of speakers in the world) are not easily admitted
into the international testing regime with its well advanced measurement tools
and procedures, while even dominant languages are also lagging behind English,
which is definitely the front runner in the race – and a testing race it is. In other
words, it seems that the dominion of language proficiency testing is as powerful
as the language which it tests. Therefore, language proficiency tests for different
languages do not have the same prestige and it is the case that they are not of the
same quality. Proficiency certificates for some languages may count for some-
thing or other, but according to popular belief they do not count as significant
cultural capital.
Inequity in the quality of testing – at least in psychometric terms – may perpet-
uate to a certain extent the inequality between language proficiency certificates
and between languages per se. After all, tests have been viewed not only as tools
for measuring levels of language competence but also as “tools for manipulating
language” and for “determining the prestige and status of languages” (Shohamy,
Glocal testing uses the knowledge and responds to standards developed through
international proficiency testing (Dendrinos, 2005). It uses this knowledge for the
development of exams in languages where expertise is lacking. Moreover, if the
glocal testing suite is multilingual, this means that it includes several languages
and all of them are treated equally, in the sense that the specifications, the tools,
the assessment criteria and all other products created and procedures followed are
the same for all languages.
A final point to be made presently is that glocal multilingual exam suites abide
by a single code of ethics for all their languages. This code is itself a glocal arte-
fact in the sense that it follows the models proposed by International Language

Testing Association and Association of Language Testers in Europe but makes
adjustments and alterations. Actually, it is Karavas (2011) who convincingly
argues that it is necessary to localise the ethical codes of proficiency testing in
particular and proceeds to present the KPG Code of Ethical Practices, explaining
how the KPG exams adhere to principles of accountability, professionalism and
transparency and make attempts to follow ‘democratic’ principles (cf. Shohamy,

2.2.4 Socially sensitive ideological underpinnings

Balourdi (2012) has provided ample linguistic evidence to support the claim that
there are ideological inscriptions in test texts that linguistically construe different
realities. The American test texts, for example, are inhabited by Americans, who
are involved either in organized leisure-time activities (which combine entertain-
ment with the exploration of the history, culture and the ‘classical’ American
values, brought forth by famous American old stars and performers) or in a com-
mercial world where profit is the main goal of life. By sharp contrast, the thesis
findings clearly show that texts in the KPG exams:
…linguistically construe a world that has special concern for environmental, health and
social issues. The social participants inhabiting the texts are ‘world citizens’ [not British or
American], with a more or less equal distribution of power. These are the main Actors, Sens-
ers, Sayers, and Possessors in material, mental, verbal, and possessive processes in the texts,
where non-human actants exist but which are dominated by human actants that are socially
active citizens, activists, eco-tourists, experts on different matters and authorial figures. The
text participants are members of national and international social groups, interested in environmentally
friendly practices, public hygiene, health and diet, and matters of greater social
interest, or knowledgeable, creative social subjects, aware of what is going on around them.
(ibid, p. 245)

And the writer continues to point out that KPG texts are commonly realized
through third person narratives and the linguistic choices made indicate concern
with objective ‘facts’, related to the reader by the writer, who is positioned as an
expert and adviser. Given that test texts actually address candidates, it is interest-
ing, Balourdi (ibid) says, that the text writer is construed as someone who has
better access to the truth and consequently good knowledge of what has to be
done. The reader, on the other hand, appears to be in need of valid information,
advice and guidance – which an educational institution is obviously supposed to be
able to provide.
The aforementioned and other findings from the same thesis discussed earlier
in this chapter demonstrate in the clearest way that there are ideological meanings
in all texts. However, where language teaching and testing are concerned, given their
impact on young people, the important question raised is who should have control
over the linguistic construction of reality.

2.3 Glocalised proficiency testing, language education and literacy

While glocal tests take into consideration international research findings and
abide by supranational structures such as the CEFR, they make decisions regarding
the test papers that are meaningful for the specific society for which they
are developed. Test developers in a glocal system know (through intuition and
relevant research) candidates’ cultural experiences, areas of world knowledge,
types of literacies and social needs. Inevitably, they take all this into account
when making decisions regarding test form and content, given that there is a central
concern about the types of competences, skills and strategies the groups of
language learners they address need to develop. In other words, one of the most
important characteristics of glocal testing, the KPG exams in particular, is that
attention is relocated: from the language itself (as an abstract meaning system) to
the user (as a meaning-maker).
Consideration of the Greek foreign language user is what makes the KPG test
papers different from the test papers of the well-known international exams in
several ways. For example, there is considerable attention to assessing candidates’
‘language awareness’.21 Actually, among the several different types of tasks
included in the KPG exams, from B1 level onwards, are those whose purpose
is to test language awareness regarding contextually appropriate use of lexicogrammar.
The language awareness component, at the level of discourse, genre
and text, is included in the reading comprehension test paper of the KPG exam.
The correct response requires candidates’ awareness of which language choice is
appropriate in each case, given the linguistic or discursive context.
Meaning is context specific, in the theory of language that the KPG has
adopted. However, KPG reading and listening comprehension test papers con-
centrate less on local meanings, with reference to lexical items and structural
patterns, and more on discourse meanings – both at lower and higher levels of
language proficiency. Furthermore, KPG reading and listening comprehension
papers test the understanding of meanings, as these are semiotically generated
in texts of different genres through verbal, visual and aural modes of production.

21 Language awareness is defined in the KPG exam suite as a more or less conscious understand-
ing of how language operates, in different linguistic, situational and discursive contexts. For
an extensive discussion on the types of language awareness see Little, 1997.

The genre-based approach that the KPG exam suite has adopted is particularly
evident in the writing test paper. As a matter of fact, the higher the level of the
candidate’s language proficiency, the more s/he is expected to have developed
language awareness regarding language choices required so as to produce dis-
course and genre sensitive language. The genre-based approach to writing is also
clearly discernible in the assessment criteria for the marking of scripts.22
There are many more distinctive characteristics in the KPG exams and readers
may be interested in seeing these unique characteristics in the test papers themselves.23
Here, however, we turn to a brief discussion, in the next subsection, of one of
the most distinctive features of the KPG exams.

2.4 Favouring the multilingual paradigm

The European vision for a multilingual supranational state has been articulated
most recently by the European Commission’s Civil Society Platform to promote
multilingualism (Action Plan for 2014–20) and has had relevant support over the
past decade or so with the production of practical guidelines, manuals and funded
projects to facilitate a shift from monolingualism to multilingualism in language
teaching and learning. The CEFR, a vehicle for language teaching, learning and
assessment in a comparable manner, has been a step in this direction. But the road
is still long. To quote the CEFR (Council of Europe, 2001, p. 4) itself:
… the aim of language education [should be] profoundly modified. It [should] no longer be
seen as simply to achieve ‘mastery’ of one or two, or even three languages, each taken in
isolation, with the ‘ideal native speaker’ as the ultimate model. Instead, the aim should be to
develop a linguistic repertory, in which all linguistic abilities have a place. This implies, of
course, that the languages offered in educational institutions should be diversified and stu-
dents given the opportunity to develop a plurilingual competence.

In an effort to fulfil this goal but also because of KPG’s approach to language use
and the theory of language by which it abides, there has been an attempt from the
very beginning to make the shift from a monolingual to a multilingual paradigm.
The latter is based on a view of the languages and cultures that people experience
in their immediate and wider environment not as compartmentalized but
as meaning-making, semiotic systems, interrelated to one another. Such a view
is, according to Shohamy (2011, p. 418), manifested in code switching and in the
simultaneous use of different language functions (e.g. reading in one and speaking

22 Readers interested in the assessment criteria and marking grids may visit the KPG site for the
exam in English at
23 Past papers, in English, are available free of charge at

in another in the process of academic functioning). In this paradigm, people
learn to make maximum use of all their semiotic resources so as to communicate
effectively in situational contexts, which are often bi-, tri- and multi-lingual. In
such settings, people use code switching and ‘translanguaging’ techniques, draw-
ing upon the resources they have from a variety of contexts and languages. They
use different forms of expression in multimodal texts to create socially situated
meanings. Often they also resort to intra- and inter-linguistic as well as intercul-
tural mediation.
It is this rationale that led the KPG suite, from the start, to incorporate
intra- and interlinguistic mediation tasks as an exam component in both the writing
and the speaking tests from B1 level. Mediation as an activity to be taught and
tested was legitimated when included in the CEFR in 2001, but it is understood
somewhat differently by the KPG than in the CEFR. Whereas in the latter it is
understood as very similar to translation and interpretation, the KPG makes a
clear distinction between mediation and translation/interpretation (Dendrinos,
2006), and defines intralinguistic mediation as a process whereby information
is extracted from a source text and relayed in a target text of a different social
variety, register or code. For example, the candidate is provided with a pie chart
showing what percentages of people read which types of books and
s/he is asked to write a report or a news article on this subject. Another example
may be to ask the candidate to read a proverb and explain to the examiner, who
assumes the role of a child, what the proverb means.24
Interlinguistic mediation, on the other hand, involves two languages. It is
defined by the KPG as a process whereby information is extracted from a source
text in L1 (in this case Greek, which is considered to be the candidates’ common
language) and relayed in the target language for a given communicative purpose.
For example, the candidate is asked to read a Greek magazine article about
practical tips on taking care of a pet and to write an email to a friend who’s just
been given a puppy as a present and has no clue about how to take care of it.
At levels A1+A2 candidates are not tested for their interlinguistic skills in
the KPG exams and, therefore, they are not asked to produce a text in the target
language by relaying information from a source text in Greek. However, translanguaging
and the parallel use of languages are exploited in two ways: firstly, the task
rubrics are written in both Greek and in the target language, and secondly there
are tasks in the reading and listening comprehension test papers which require
the candidate to exhibit his/her understanding of a text in the target language by
responding to choice items in Greek.

24 Though intralinguistic mediation tasks are included in the writing test of the KPG exams, they
are not labeled as ‘mediation’ tasks; the term is reserved for interlinguistic tasks.

Glocal proficiency testing is more likely to use translanguaging, parallel use of
two or more languages, as well as linguistic and cultural mediation tasks, whereas
it seems not at all cost-effective to have bi- or multilingual performance tested
in international commercial exam batteries. The moment the product is localised,
be it a test or a textbook, it does not sell globally and makes less profit, if any at all.
Mediation as a testing component, mediation task analysis and the performance
on test tasks by Greek students of English have been the object of systematic
research carried out at the Research Centre for English Language of the
University of Athens (Dendrinos, 2011). A specific research project leading to an
MA dissertation at the Faculty of English Studies of the University of Athens has
shed light on discursively, textually and linguistically hybrid forms that success-
ful communicators use (Stathopoulou, 2009)25 and other papers have discussed
issues linked to the mediation component of the KPG exam.26 Soon, the findings
of another major project will be made available – a PhD thesis entitled Mediation
tasks and mediation performance by Greek users of English, which is being com-
pleted by Stathopoulou.27 Future mediation related projects include work which
aims at providing firstly functional illustrative scale descriptors for mediation
performance and secondly linguistically-based scale descriptors.

3. Conclusion

All the issues raised so far have aimed at substantiating this chapter’s
argument in favour of glocal proficiency language testing. Yet, the most
important line of reasoning is related to the fact that language testing is “a pow-
erful device that is imposed by groups in power to affect language priorities,
language practices and criteria of correctness” (Shohamy, 2006, p. 93). Language
tests have a definite impact on how the knowledge of language and communi-
cation is defined (cf. Shohamy, 2001a, p. 38) and affect people’s attitudes and
understanding of what language is and how it operates. Because of this and other
factors discussed here, it might be wiser that the responsibility for tests, which
function as pedagogical but also as social and political instruments, with an
impact on education and the social order (cf. Shohamy, 2006, p. 93), not be left to
commercial, international testing. The command over the social values inhabit-

25 Visit:
26 For a paper in English, defining mediation and providing analytic documentation of a ‘theory’
of mediation see: Dendrinos, 2006 ( %20
JAL.pdf), and for a popularized paper on the form and function of mediation: Dendrinos &
Stathopoulou, 2010 ( june2010.htm).
27 For the PhD thesis abstract visit:

ing tests and the control over the world embodied in the test texts might be better
off in the care of domestic, non profit-driven educational and testing institutions
that might be better trusted with the power to regulate the social and pedagogical
identity of language learners. Glocal testing (if it is possible to develop glocal
exam suites) can perhaps best serve citizenry as a socially sensitive antidote to
the marketised products of the global testing conglomerates.
An important question, however, following the discussion above is whether
any kind of proficiency testing (glocal or global) should be a constituent of the
school curriculum in one way or another. This is an issue that has come up recently
in Greece, as a demand by foreign language teachers and politicians. Teachers feel
that the foreign language will be viewed as a school subject with more ‘weight’
and students will be even more motivated to do hard work, while the Ministry of
Education has been under pressure to provide opportunity for proficiency certi-
fication within the context of the public school system, so that the average Greek
family does not have to pay so much money28 for preparation classes offered by
private-tuition language schools, most of which are associated directly with inter-
national exam systems leading to language certification.
The other side of the coin, however, is that proficiency language testing con-
stitutes (in the words of Mulderrig (2003) quoted in Templer, 2004) a ‘key trans-
national achievement arena where students are socialised into highly individual-
istic practices of competitive survival and self-responsibility’. There are of course
several other washback effects of language testing on language teaching and
learning, as many scholars have shown (e. g., Alderson & Wall, 1993; Alderson &
Hamp-Lyons, 1996; Prodromou, 2003; Tsagari, 2009). They are not necessarily
all negative. But, with its control over forms of knowledge, testing may distort
the purpose of teaching and learning languages and implant a philosophy of measurable,
result-driven learning. This is especially true of standardised testing that
measures knowledge by numerical scores and is preoccupied with numerical performance
on choice items, requiring certifiable demonstration of language skills
(cf. Brosio, 2003; Hamp-Lyons, 1999; Lynch, 2001).
With this dilemma in mind, education and testing experts in Greece have
advised the Greek Ministry of Education to take a middle-of-the-road deci-
sion. This entails first of all designing a new national curriculum for languages
which contains researched illustrative descriptors of language proficiency on the
6-level scale of the Council of Europe – closely associated with the 6-level can-do

28 It has been estimated, through studies carried out at the Research Centre for English Language
Teaching, Testing and Assessment (RCEL), that the total amount of money spent by Greek
families on private tuition language teaching – mainly in the form of proficiency exam preparation
– and on international exam fees is about 600,000,000 euros a year.

statements included in the KPG specifications.29 The new national language
curriculum is a multilingual construct (a unified curriculum for all languages).
Secondly, it has been decided not to turn regular language teaching classes into
exam preparation lessons but to use financial resources, which have been made
available to the two state universities, to provide teachers and students ICT sup-
port to prepare themselves for the KPG exams, if they wish, during extra-school
hours or at home.
The group effort to develop the new integrated languages curriculum,30 and
the collaboration between language teaching and testing scholars, researchers
and practicing teachers of the different languages has facilitated the birth of a
previously absent academic discourse on foreign language teaching and testing in
Greece (cf. Macedo, Dendrinos & Gounari, 2003). This is perhaps one of the most
important impacts of glocal projects of this nature: The KPG and the national lan-
guages curriculum project have motivated teamwork by people from the different
foreign language didactics traditions. It may contribute to a shift from a monolin-
gual to a multilingual paradigm in the field of language didactics in Greece. Such
examples should be followed elsewhere as mainstream foreign language didactics
and testing are still exclusively monolingual.


References

Alderson, J. C. & Hamp-Lyons, L. (1996). TOEFL preparation courses: a study of
washback. Language Testing, 13 (3), 280–297.
Alderson, J. C. & Wall, D. (1993). Does washback exist? Applied Linguistics, 14
(2), 115–129.
Balourdi, A. (2012). World Representations in Language Exam Batteries: Critical
Discourse Analysis of texts used to test reading comprehension. PhD thesis,
submitted to the Faculty of English Language and Literature, National and
Kapodistrian University of Athens.
Brosio, R. (2003). High-stakes tests: Reasons to strive for better Marx. Journal
for Critical Education Policy Studies, 1(2). Retrieved from http://www.jceps.
Council of Europe. (2001). Common European Framework of Reference for Languages:
Learning, Teaching and Assessment. Cambridge: Cambridge University
Press.

29 For the new languages curriculum, visit
30 The curriculum was developed by a team of 25 applied linguists, junior researchers, teacher-
development professionals and practicing teachers who were working under my academic

Delieza, X. (2011). Monitoring KPG examiner conduct. DIRECTIONS in Lan-
guage Teaching and Testing e-journal. RCEL Publications, University of
Athens, 1 (1).
Dendrinos, B. (1992). The EFL Textbook and Ideology. Athens: Grivas Publi-
Dendrinos, B. (1998). Una aproximacion politica a la planificacion de la ense-
nanza de lenguas extranjeras en la Union Europea. In L. Martin-Rojo &
R. Whittaker (Eds.) Poder-decir o el poder de los discursos (pp. 149–168).
Madrid: Arrefice Producciones, S. L.
Dendrinos, B. (1999). The conflictual subjectivity of the EFL practitioner. In
Christidis, A., F. (Ed.) ‘Strong’ and ‘Weak’ Languages in the European
Union: Aspects of Hegemony. Vol. 2., (pp. 711–727). Thessaloniki: Centre for
the Greek Language.
Dendrinos, B. (2001). A politicized view of foreign language education planning
in the European Union. The politics of ELT. Athens: Athens University Press.
Dendrinos, B. (2002). The marketisation of (counter) discourses of English as
a global(ising) Language. In M. Kalantzis, G. Varnava-Skoura & B. Cope
(Eds.) Learning for the Future: New Worlds, New Literacies, New Learning,
New People. The (Australia): Common Ground Publish-
Dendrinos, B. (2004). Multilingual literacy in the EU: Alternative discourse in
foreign language education programmes. In Dendrinos, B. & Mitsikopoulou,
B. (Eds.) Politics of Linguistic Pluralism and the Teaching of Languages in
Europe (pp. 60–70). Athens: Metaichmio Publishers and University of Athens.
Dendrinos, B. (2005). Certification de competences en langues etrangeres, mul-
tilinguisme et plurilinguisme. In Langue nationales et plurilinguisme: Initia-
tives Grecques (pp. 95–100). Athens: The Ministry of National Education and
Religious Affairs & Centre for the Greek Language.
Dendrinos, B. (2006). Mediation in communication, language teaching and test-
ing. Journal of Applied Linguistics. Annual Publication of the Greek Applied
Linguistics Association, 22, 9–35.
Dendrinos, B. (2010a). The global English language proficiency testing industry
and counter hegemonic local alternatives. Paper delivered at the Colloquium
entitled: “British ELT in existential crisis?” (Convener: R. Phillipson), BAAL
2010 (with R. Phillipson, J. Edge, A. Holliday, R. C. Smith & S. Taylor) at the
42nd Annual BAAL Meeting, University of Aberdeen, 9–11 September.
Dendrinos, B. (Ed.) (2010b). Promoting multilingualism in Europe through lan-
guage education. Report of the Work Group on ‘Language Education’, Euro-
pean Commission: Civil Society Platform to promote Multilingualism.

Dendrinos, B. (2011). Testing and teaching mediation. DIRECTIONS in Language
Teaching and Testing e-journal. RCEL Publications, University of Athens, 1
Dendrinos, B. & Stathopoulou, M. (2010). Mediation activities: Cross-Language
Communication Performance. ELT News, KPG Corner, 249, 12. Retrieved
Fulcher, G. (1999). Ethics in language testing. TEASIG Newsletter, 1 (1), 1–4.
Hawkins, J. A. & Buttery, P. (2010). Criterial Features in Learner Corpora:
Theory and Illustrations. English Profile Journal, 1 (1), 1–23.
Hamp-Lyons, L. (1997). Ethics and language testing. In Clapham, C. & Corson,
D. (Eds.) The Encyclopedia of Language and Education. Vol. 7: Language
Testing and Assessment (pp. 323–333). Dordrecht: Kluwer Academic.
Hamp-Lyons, L. (1999). Implications of the examination culture for (English
Education) in Hong Kong. In V. Crew, V. Berry & J. Hung (Eds.) Exploring
Diversity in the Language Curriculum (pp. 133–41). Hong Kong: Hong Kong
Institute of Education.
Hamp-Lyons, L. (2000). Fairness in language testing. In Kunnan, A. J. (Ed.),
Fairness and Validation in Language Assessment, Studies in Language Test-
ing 9 (pp. 99–104). Cambridge: Cambridge University Press.
Higgins, C. (2003). “Ownership” of English in the Outer Circle: An Alternative to
the NS-NNS Dichotomy. TESOL Quarterly, 37 (4), 615–644.
Holliday, A. (2005). The Struggle to Teach English as an International Language.
New York: Oxford University Press.
Karavas, K. (Ed.) (2008). The KPG Speaking Test in English: A Handbook.
Athens: RCEL Publications, University of Athens.
Karavas, K. (2011). Fairness and ethical language testing: The case of the KPG.
DIRECTIONS in Language Teaching and Testing e-journal. RCEL Publi-
cations, University of Athens, 1 (1).
Karavas, K. & Delieza, X. (2009). On site observation of KPG oral examiners:
Implications for oral examiner training and evaluation. Journal of Applied
Language Studies, 3 (1), 51–77.
Lynch, B. K. (2001). Re-thinking testing from a critical perspective. Language
Testing, 18 (4), 351–372.
Little, D. (1997). Language awareness and the autonomous language learner.
Language Awareness, 6 (2–3), 93–104.
Macedo, D., Dendrinos, B., & Gounari, P. (2003). The Hegemony of English.
Boulder, Co.: Paradigm Publishers.
McNamara, T. (2002). Language Testing. Oxford: Oxford University Press.
Mitsikopoulou, B. (Ed.) (2008). Branding political entities in a globalised World.
Journal of Language and Politics 7 (3), 353–508.

Mulderrig, J. (2003). Consuming education: a critical discourse analysis of
social actors in New Labour’s education policy. Journal for Critical Edu-
cation Policy Studies, 1 (1). Retrieved from
Pavlou, P. (2005). Who else is coming for TEA? Parents’ involvement in test
choice for young learners. In Pavlou P. and K. Smith (Eds) Serving TEA to
Young Learners: Proceedings of the Conference on Testing Young Learners.
University of Cyprus, IATEFL and CyTEA. Israel: ORANIM – Academic
College of Education (pp. 42–52). Retrieved from
Pennycook, A. (1994). The Cultural Politics of English as an International Lan-
guage. London: Longman.
Pennycook, A. (2010a). Language, context and the locality of the local. Plenary
talk at the 42nd Annual BAAL Meeting, University of Aberdeen, 9–11 Sep-
Pennycook, A. (2010b). Language as a Local Practice. London: Taylor & Francis.
Phillipson, R. (1992). Linguistic Imperialism. Oxford: Oxford University Press.
Prodromou, L. (2003). Idiomaticity and the non-native speaker. English Today, 19
(2), 42–48.
Prodromou, L. (2008). English as a Lingua Franca. A corpus based analysis.
London: Continuum.
Ransom, D. & Baird, V. (Eds.) (2009). People First Economics. Oxford: New
Internationalist Publications Ltd.
Shohamy, E. (2001a). The power of tests: A critical perspective on the uses and
consequences of language tests. London: Longman.
Shohamy, E. (2001b). Democratic assessment as an alternative. Language Testing,
18 (4), 373–391.
Shohamy, E. (2006). Language Policy: Hidden agendas and new approaches.
London & New York: Routledge.
Shohamy, E. (2011). Assessing Multilingual Competencies: Adopting Construct
Valid Assessment Policies. The Modern Language Journal, 95 (3), 418–429.
Spolsky, B. (1995). Measured Words: The Development of Objective Language
Testing. Oxford: Oxford University Press.
Stathopoulou, M. (2009). Written mediation in the KPG exams: Source text regu-
lation resulting in hybrid formations. Unpublished dissertation submitted for
the MA degree in the Applied Linguistics Postgraduate Programme, Faculty
of English Studies, National and Kapodistrian University of Athens, Greece.
Templer, B. (2004). High-stakes testing at high fees: Notes and queries on the
International English Proficiency Assessment Market. Journal for Critical
Education Policy Studies, 2 (1). Retrieved from
Tsagari, D. (2009). The Complexity of Test Washback: An Empirical Study. Frank-
furt am Main: Peter Lang.
Widdowson, H. G. (1994). The ownership of English. TESOL Quarterly, 31,

Towards an Alternative Epistemological and
Methodological Approach in Testing Research

Vanda Papafilippou31
University of Bristol, UK

Traditional validity theory has been grounded in the epistemological tradition of a realist philosophy
of science (Kane, 2001; Moss, Girard and Haniford, 2006), where “[o]ne experiences the world as
rational and necessary, thus deflating attempts to change it” (Agger, 1991, p. 109). Hence, the prevail-
ing epistemological choices have quite clear ethical and political consequences (Moss, 1996), as they
‘block’ other perspectives, creating a specific societal and educational reality and reproducing cer-
tain power relations. However, if we want to acquire a broader perspective about the social phenom-
enon of language testing in order to change the political, economic, and institutional regime of the
production of truth, we should employ different epistemological approaches. This chapter presents
such an approach, thus aspiring to contribute to the formation of an alternative paradigm in language
testing research. In particular, what is suggested is a critical poststructuralist philosophical frame-
work drawing upon the work of Foucault (1977a, 1977b, 1980, 1981, 1982, 1984a, 1998, 2000, 2009)
and Gramsci (1971, 2000). The resulting research design includes two qualitative methods: narrative
interviewing and critical discourse analysis, in an attempt to challenge the existing ‘regime of truth’.

Key words: test validity, consequences, critical language testing, epistemology, methodology.

1. Introduction

Test consequences appear to go far beyond the classroom, extending to
society as a whole and to the construction of the modern subject32, since
the subjectivity of the test-taker is argued to be realised only through the test
(Hanson, 2000; McNamara & Roever, 2006). However, as I will argue in the
present chapter, the prevailing epistemological approach in language testing
research – empiricism – constrains us to asking particular kinds of research
questions and, as a result, to obtaining particular kinds of answers. In other words,
the prevailing, widely-used philosophical approach – empiricism – forces us in a
way to remain within a particular ‘regime of truth’ (Foucault, 1977b), thus con-
tributing to the reproduction of certain ideologies, values and conservative, as I
will further argue, politics. Therefore, if we want to challenge the current testing
practices and explore in greater depth the impact of testing, we also have to adopt
alternative epistemological and methodological approaches.
32 By subjectivity/subject I mean the subject positions (re)produced within a particular discourse.
These subject positions are seen not only as a product of a particular discourse but at the same
time subjected to its meanings, power and regulations.

2. Down the rabbit-hole: the power of tests

As Madaus and Horn (2000) observe, testing has become so entrenched in our cul-
ture that most of us simply take it for granted and fail to consider how it influences
and moulds our social, educational, business and moral life. Filer (2000, p. 44) adds
that most people appear not to be aware of the “biases and assumptions of technical
elites that underpin test content and processes”, mostly because of the emphasis on
the ‘scientific’33 nature of examinations. However, it is exactly this scientific, neu-
tral, objective and fair armour of tests that rendered them effective instruments for
perpetuating class differences, gender and ethnic inequalities (Bourdieu & Passe-
ron, 1990; Shohamy, 2001; Spolsky, 1997; Stobart, 2005) as well as for manipulating
educational systems (Broadfoot, 1996; McNamara & Roever, 2006; Shohamy, 2001).

2.1 Tests and their impact on society

Throughout their history, examinations meant economic and social rewards only
for the minority who possessed the ‘right’ (pun intended) social and cultural
capital34, as they have mostly acted as gate-keeping devices that reduce access to
educational and occupational opportunities for the masses (Bourdieu & Passeron,
1990; Broadfoot, 1996; Filer, 2000; Stobart, 2008; Sutherland, 2001). Bourdieu
and Passeron (1990, p. 153) argued that, “those excluded from studying at various
levels of education eliminate themselves before being examined”, and that the
proportion of those whose elimination is masked by the selection overtly carried
out differs significantly according to social class. Moreover, examinations have
been used over the decades in order to control the curriculum and the content of
education, ensuring in this way the preparation of students in the necessary skills
and attitudes for their future roles in capitalistic societies (Broadfoot, 1996).

2.2 Tests and their impact on the individual

It is argued that examinations, apart from naming individuals and assigning them
to categories (A, B, C, ‘fail’), also assign them to prescribed roles (for
example, pass/fail), urging them to think and act accordingly, and classifying them at

33 The single apostrophes are used in order to indicate my disagreement with the particular dis-
34 Bourdieu (1991) defined cultural capital as the knowledge, skills and other cultural acquisitions,
such as educational or technical qualifications, and social capital as the resources
based on group membership, relationships, networks of influence and support.

the same time in a strictly hierarchical manner, for example, As are superior to
Bs (Foucault, 1977a). Hence, examinations in a way constrain us as individuals
to accept what we are supposed to be, a certain ‘identity’ (for example, ‘proficient
user’, ‘weak at maths’) through expressing it in front of everyone (school, family,
employers, and society). Moreover, exams, apart from naming and categorising
people (for example, good/bad student) are also argued to render the individual
a target of social control mechanisms (Foucault, ibid.), as they promote certain
agendas and ideologies (Broadfoot, 1996; Shohamy, 2001) thus contributing to the
acquisition of the legitimate culture and class differences (Bourdieu & Passeron,
1990). So, Hanson (2000, p. 68) appears to be right when arguing that “the individ-
ual in the contemporary society is not so much described by tests as constructed
by them” as the subjectivity of the test-taker is argued to be realised only in the
test itself (Foucault, 1977a; Hanson, 1993; McNamara & Roever, 2006; McNamara
& Shohamy, 2008; Shohamy, 2001; 2007). Therefore, tests’ consequences appear
to go beyond the classroom, beyond the possibly detrimental effects on a person’s
livelihood, as tests appear to influence also the ideological sphere.

2.3 Test consequences and test validity

As we can see, testing occurs in a social and political context (McNamara &
Shohamy, 2008; Messick, 1988; Shohamy, 2001), and testing practices appear to
be enmeshed in webs of power and political interests. For this reason, Messick
introduced the concept of social consequences of measurement outcomes in his
unitary model of validity. For Messick (1996, p. 245), “[v]alidity is not a property
of the test or assessment as such, but rather of the meaning of test scores”. There-
fore, as test validity is perceived as an interpretation, a meaning attributed to a
test and its outcomes, intended and unintended, it cannot be devoid of
values. As Messick (1989, p. 9) argues:
the value implications of score interpretation are not only part of score meaning, but a socially
relevant part of score meaning that often triggers score-based actions and serves to link the
construct measured to questions of applied practice and social policy.

Therefore, test consequences appear to be inextricably related to test validity.

However, have values got a part in test validation?
Test validity, for the most part, has been grounded in the epistemological
tradition of a realist philosophy of science (Kane, 2001; Moss, Girard and Haniford,
2006). Positivism mainly supports the existence of an objective, value-free real-
ity, which can only be revealed with the help of ‘experts’. Implicit in this general
epistemological approach seems to be “the need for centralisation of authority
within a given context” (Moss, 1992, p. 250) which further contributes to the
“disciplining the disciplines” (Lather, 1993, p. 677) because it acts as another
scientific panopticon in which “[i]f one gets out of hand, the gaze of the others is
there to bring it back into line” (Markus, 1998). But should the role of validity be
to constrain our knowledge? Should it be to act as the guarantor of epistemologi-
cal, methodological, and as a consequence, ideological and political, orthodoxy?
Moreover, as Agger (1991, p. 109) argues, in empiricism, “[o]ne experiences the
world as rational and necessary, thus deflating attempts to change it”. Hence, by
employing exclusively quantitative techniques we are ‘pushed’ towards one set of
answers. So, the prevailing epistemological choices have also quite clear ethical
and political consequences (Moss, 1996), as they reproduce certain ideologies
and values and affect the way we understand reality, ourselves and others.

2.4 (Re)defining test validity

I personally view test validity as a fully discursive product (Cherryholmes, 1988),
or as Lather (1993) puts it, an incitement to discourse, as it is shaped, like any
other discourse, by beliefs, ideologies, linguistic and cultural systems, socio-economic
and political interests, and by how power is arranged. Current validity
practices, though, choose not to explore these issues in depth, mainly due to the
restrictions of their epistemological approaches. Nevertheless, as Cherryholmes
(1988, p. 438) argues, “one must take one’s eye off” the test score itself and start
challenging the construct itself, as constructs and their validation “represent
effects of power and the exercise of power”.
An antifoundational approach to test validation also needs to take into consideration
the political processes and the institutions which produce the ‘truth’,
as opposed to our “legacy of positivism” which “continues to reinforce acontextual
and ahistorical scientific narratives of construct validation” (Cherryholmes, 1988,
p. 440). Hence, when examining a test’s validity we should examine not only the
narrow educational context but also the socio-economic and political system in
which the test in question has been constructed and is being used. Moreover, we
should also examine not only the values upon which the test is based, but also the
values that are promoted to the wider society.
Therefore, test validity should encourage researchers to foreground the insuf-
ficiencies of the construct and their research practices, to leave multiple openings
and challenge conventional discursive procedures and practices, putting ethics
and epistemology together (Lather, 1993); to look for new interpretations, new
facts, new principles, to change familiar connections between words and impres-
sions (Feyerabend, 1993); to pay equal attention to the test and the test-taker.

Lastly, as Cherryholmes (1988, p. 452) stresses, “construct validation is bound
up with questions such as these: What justifies our theoretical constructs and
their measurements? What kind of communities are we building and how are we
building them?” or as Shohamy (2001, p. 134) adds, “What are the values behind
the test? […] What kind of decisions are reached based on the test? […] What
ideology is delivered through the test? What messages about students, teachers
and society does the test assume?”
But all these questions, which are in their essence validity questions, can be
answered only by adopting a different epistemological approach, an approach
that would allow us to:
i. “Resist pressure to concentrate on what persons in power have specified as
their chief question” (Cronbach, 1988, p. 7).
ii. Explore also other epistemological possibilities as “social and political forces
are sometimes so salient that we may need a new discipline to deal explicitly
with the politics of applied science” (Messick, 1988, p. 43), and finally,
iii. Transgress taken-for-granted assumptions and generate new questions and
research practices “with no attempt made to arrive at an absolute truth” (Sho-
hamy, 2001, p. 132).

3. Philosophical framework

The proposed philosophical framework draws upon the work of Foucault (1977a,
1977b, 1980, 1981, 1982, 1984a, 1998, 2000, 2009) and Gramsci (1971, 2000)
in order to explore power relationships at both the individual and the societal
level. In particular, Foucault examines in detail what he calls the “microphysics
of power” (1977a, p. 139), that is, the relations of power at local level (Detel, 1996;
Holub, 1992; Masschelein, 2004). Gramsci, on the other hand, is more interested
in the analysis of the relations between structures/institutions and individuals
and how they act collectively. Combining these two thinkers thus enables us to go
beyond the critique of the prevailing testing practices within the given context of
our inquiry and act against “practices of normalisation which secure and develop
the present order” (Gur-Ze’ev, Masschelein and Blake, 2001).

3.1 Ontology

For both Foucault (1970, 1981, 1998) and Gramsci (1971), ‘reality’, as we perceive
it, is nothing more than the product of discursive practices (Dreyfus & Rabinow,
1982), and is created only in a “historical relationship with the men who
modify it” (Gramsci, 1971, p. 346). Therefore, reality is not external to ‘man’,
to human consciousness, rather, it is a product of human creation (Femia, 1981;
Golding, 1992; Gramsci, 1971) and reflects certain historically specific interests,
intentions and desires. ‘Objects’ exist not because they correspond to “some pre-
existent natural classification”, but because they are conceived and categorised
thus due to certain qualities that “man distinguished because of his particular
interests” (Femia, 1981, p. 70). Hence, we cannot take for granted and naturalise
concepts/‘objects’ such as language tests, language proficiency, and skills (i.e.
reading, writing, listening, and speaking), or discursively produced categories of
‘subjects’ (pass/fail, good/bad student, etc.).

3.2 Epistemology

Foucault sees the ‘truth’ “of this world” (Foucault, 1977b, p. 14) as determined by
the discursive practices of a discipline (Foucault, 1970) and as subject to “a constant
economic and political incitation” (Foucault, 1977b, p. 13). Each society has its
own ‘regime of truth’, “its ‘general politics’ of truth: that is, the types of discourse
which it accepts and makes function as true” (Foucault, 1984a, p. 72). For
Gramsci too, ‘truth’ is a value which is socially and historically constructed, gen-
erated by the activity of a certain group, and is always political (Golding, 1992).
Thus, ‘truth’ has power.
However, it is not only ‘truth’ that is shaped by power interests, but ‘knowl-
edge’ too. Both thinkers stress the fact that power and knowledge are intertwined
as it is the mechanisms of power and hegemony that produce the different types
of knowledge (Foucault, 1980; Gramsci, 1971). Knowledge is political and situ-
ated in specific historical, economic, cultural and social contexts, being always
value-laden and intrinsically related to power, with the potential to transform
‘men’ (Foucault, 2000; Gramsci, 1971). Therefore, power, or hegemony, has an
epistemological value, in other words, it is crucially important to take it into con-
sideration when designing our methodology.

4. Implications for the research design

As reality is considered a product of discourse and power relations, in other
words, a human creation that is filtered through our consciousness, I would suggest
the adoption of exclusively qualitative research tools. A qualitative approach
provides rich information about the specific context where tests are administered
as well as the power relations at play. Therefore, it would be able to pursue both
the multiple realities of our participants (Denzin & Lincoln, 2008) and the extent
to which the test-based interpretations have become part of their/our common
sense (Moss, 1998). The two qualitative research tools that I consider as most
appropriate are: narrative interviewing with the stakeholders and critical dis-
course analysis (CDA). By combining these two research tools we will be given
the opportunity to inquire into how subjects acquire a certain perception of the
world and of themselves, a certain ideology. Moreover, in this way, we will be
able to ‘collect’ organic, local knowledge and explore how stakeholders experience
and think about tests. In other words, we will be able to “explore the networks of what is
said, and can be seen in a set of social arrangements,” so as to chart and analyse
the relations between institutions and the various discourses produced (Kendall
& Wickham, 1999, p. 25), as well as the processes through which truth, knowledge
and subjectivity are produced (Fraser, 1989, p. 19).

4.1 Institutions and discourse

For Foucault, institutional structures are a means that power uses, and they contribute
to the production of normalised and docile subjects (Caputo & Yount, 1993). So,
in order to research how subjects acquire a certain perception of the world and of
themselves, we need to examine the role of institutions that “organise, manage or
propagate such cognitions” (van Dijk, 1998, p. 186). That is why it is crucial for us
to compose a corpus of texts that will explain the ‘logic’ behind tests as presented
by the institutions that administer them and indicate the discourses that operate
around and are promoted by these tests in society.

4.2 Critical discourse analysis

Discourse plays an important role in the processes of ‘making up’ people (Ain-
sworth & Hardy, 2004) as it imposes the existing rules of subjectification and
thus a certain way of thinking and conceiving the world (Femia, 1981). CDA,
in particular, attempts to detect how elite groups define ‘reality’, ‘objects’ and
‘subjects’, in our case test-takers, and the role of discourse in the (re)production of
dominance (van Dijk, 1993). Hence, CDA allows us to understand the process of
construction of the category of the test-taker as well as the meanings, ideological
and cultural assumptions that are attached to it.
The approach to CDA I would suggest draws mainly upon the work of Foucault
(1977b, 1977c, 1981, 1984a, 1984b, 2000), Gramsci (1971), Barthes (1975,
1981), de Certeau (1984) and Fairclough (2003). The analysis does not claim any
‘objective’ findings as I believe that if we are to pursue social justice we should do
it without imposing our (predetermined) notions of emancipation (Slembrouck,
2001). All the analysis can ‘claim’ is that it might ‘show’ another possibility,
another perspective. What we should attempt, in other words, is to ‘discover’
what Barthes (1981, p. 40) called the ‘signifiance’ of the text, a term that invokes
the idea of an “infinite labour” of the signifier upon itself, the endless play of the
The approach evolves cyclically and entails twelve ‘steps’:
i. Visual analysis (when applicable)
ii. Identifying the discourses operating in the text (Which are the main parts
of the world or areas of social life that are represented? Which are the main
themes? From which perspective are they represented?)
iii. Ideology (Which words/phrases carry a particular ideological content? What
propositions are neutralised? What conception of the world is presented?
What is presented as ‘common sense’?)
iv. Modes of existence of the discourse(s) (the specific socioeconomic structures
that gave birth and developed this discourse, the relationship of ‘meaning’
and power, the different groups that are implicated, where these discourses
have been used, how they can circulate, who can appropriate them and their
impact on society)
v. Delimitation of ‘objects’ and ‘subjects’ (How are ‘objects’ and ‘subjects’
defined? What are their characteristics?)
vi. Credibility of the discourse(s) (How credible is this discourse? How is this
credibility achieved?)
vii. The emerging relations of power
viii. The regime of truth this discourse belongs to and the apparatus(es) of power
this discourse fits
ix. The macrostructure (historical background and the current socio-economic
x. ‘Silences’ (What is silenced?)
xi. Signifiance (How do I feel as a reader when I ‘enter’ the text? Personal
response to the text)
xii. Self critique

4.3 Narrative interviewing

Traditional, structured or semi-structured, interviewing establishes a priori categories
from which pre-established questions result aiming at ‘capturing’ precise
data, thus attempting to explain the social world (Fontana, 2001). Hence, these
methods assume that there is a kind of truth that can be ‘captured’, while at the
same time, they act as a ‘confessional technology’ (Foucault, 1998), producing a
specific identity by assigning certain characteristics and roles to the confessant.
However, as I have argued in the philosophical framework (Section 3), what is
presented as ‘the truth’ is just one option shaped by power interests and no subject
category is taken for granted.
Narrative interviewing, on the other hand, offers ‘space’ for a story to develop
(Hollway & Jefferson, 2000) and more closely resembles a conversation (Kohler
Riessman, 2008), having its agenda open to development and change as it
is the narrator who directs the conversation (Hollway and Jefferson, ibid.). The
term narrative interviewing in social sciences refers to the gathering of narra-
tives, verbal, oral or visual, in order to see the meaning that people ascribe to
their experiences (Trahar, 2008). Narrative inquiry is constantly gaining ground
as a research method in education (Webster & Mertova, 2007) and it is seen as a
potentially emancipatory tool (Freeman, 2003) as it embraces change by reveal-
ing issues not even touched by traditional approaches (Webster and Mertova,
ibid.). That is why narrative inquiry is mainly used to ‘research’ social groups that
are frequently discriminated against (Lieblich, Tuval-Mashiach and Zilber, 1998),
as it can render stories that are drowned out by more dominant narratives (Atkin-
son & Delamont, 2006; Daya & Lau, 2007). Also, the stories and experiences we
should render audible are those of test takers and the other stakeholders, as in the
testing literature they are often kept silent (Shohamy, 2001).
Therefore, these attributes render narrative interviewing an extremely useful
‘tool’ for inquiring into test consequences for the individual and society, since it
enables us to understand how different stakeholders view themselves, others,
tests and ‘reality’, and can reveal the forms of hegemony in which ‘truth’ operates
and bring new modes of thought into being.

5. Conclusion

In this chapter I argued that an alternative epistemological and methodological
approach is needed in order to inquire in greater depth into the consequences of tests
in society and in the construction of the modern subject. However, this does not
mean that one epistemological approach is really exclusive of the others. ‘Science’
can co-exist with the ‘Humanities’. Hence, test validation is not a matter of either/
or. Feyerabend (1993, p. 31) argues that “unanimity of opinion may be fitting for
a rigid church”, but not for a scientific community, a society: “a free society is a
society in which all traditions are given equal rights, equal access to education
and other positions of power” (ibid., p. 228). Validity may be of a unitary nature,
but this does not necessarily mean that our approaches to it should be so too.
Each researcher is entitled to have her/his opinion, her/his own philosophy. Why
should all testers and validators share the same frame of mind? How productive
and how democratic is that? As Cronbach (1988, p. 14) noted:
Fortunately, validators are a community. That enables them to divide up the
investigative and educative burden according to their talents, motives, and politi-
cal ideals. Validation will progress in proportion as we collectively do our dam-
nest – no holds barred – with our minds and hearts.
Test validity, therefore, requires both our intellectual and emotional engagement
with the study and the text produced, the need to stop thinking the same as
before and acquire a different relationship to knowledge (Foucault, 2000). More-
over, it is our duty to resist dominant discourses and regimes of truth (Cooper
& Blair, 2002), and this has direct implications for thematising, analysing and
presenting our findings. Building thus upon Gustav Landauer’s (2010, p. 214)
famous quote:
The State is a condition, a certain relationship between human beings, a mode
of behaviour; we destroy it by contracting other relationships, by behaving dif-
ferently toward one another […] We are the State and we shall continue to be the
State until we have created the institutions that form a real community.
I would argue that validity is a condition between human beings, truths and
knowledges, a certain way of thinking, and we can subvert it only if we con-
duct research that encourages a constant (re)creation of different relationships
between individuals, institutions, knowledges, because only then can we form a
different – academic and civic – community.


References

Agger, B. (1991). Critical Theory, Poststructuralism, Postmodernism: Their Sociological
Relevance. Annual Review of Sociology, 17, 105–131.
Ainsworth, S., & Hardy, C. (2004). Critical Discourse Analysis and Identity:
Why bother? Critical Discourse Studies, 1(2), 225–259.
Atkinson, P., & Delamont, S. (2006). Rescuing narrative from qualitative research.
Narrative Inquiry, 16(1), 164–172.

Barthes, R. (1975). The Pleasure of the Text. New York: Hill and Wang.
Barthes, R. (1981). Theory of the text. In R. Young (Ed.), Untying the Text: A
Post-Structuralist Reader (pp. 31–47). Boston: Routledge & Kegan Paul.
Bourdieu, P., & Passeron, J.-C. (1990). Reproduction in Education, Society and
Culture. London: Sage Publications.
Broadfoot, P. M. (1996). Education, Assessment and Society. Buckingham, Philadelphia:
Open University Press.
Caputo, J., & Yount, M. (1993). Institutions, normalization and power. In J. Caputo
& M. Yount (Eds.), Foucault and the critique of institutions (pp. 3–26). Uni-
versity Park, Pennsylvania: The Pennsylvania State University Press.
Cherryholmes, C. H. (1988). Construct Validity and the Discourses of Research.
American Journal of Education, 96(3), 421–457.
Cooper, M., & Blair, C. (2002). Foucault’s Ethics. Qualitative Inquiry, 8(4), 511–
Cronbach, L. J. (1988). Five Perspectives on Validity Argument. In H. Wainer
& H. I. Braun (Eds.), Test Validity (pp. 3–17). Hove and London: Lawrence
Erlbaum Associates.
Daya, S., & Lau, L. (2007). Power and narrative. Narrative Inquiry, 17(1), 1–11.
de Certeau, M. (1984). The Practice of Everyday Life (S. Rendall, Trans.). Berke-
ley and Los Angeles: University of California Press.
Denzin, N. K., & Lincoln, Y. S. (2008). Collecting and Interpreting Qualitative
Materials. California: Sage Publications.
Detel, W. (1996). Foucault on Power and the Will to Knowledge. European Jour-
nal of Philosophy, 4(3), 296–327.
Dreyfus, H. L., & Rabinow, P. (1982). Michel Foucault: Beyond Structuralism
and Hermeneutics (2nd ed.). Chicago: The University of Chicago Press.
Fairclough, N. (2003). Analysing Discourse: Textual analysis for social sciences.
London: Routledge.
Femia, J. V. (1981). Gramsci’s Political Thought. Oxford: Clarendon Press.
Feyerabend, P. (1993). Against Method (Third ed.). London: Verso.
Filer, A. (2000). Technologies of Testing. In A. Filer (Ed.), Social Practice and
Social Product (pp. 43–45). London: Routledge.
Fontana, A. (2001). Postmodern Trends in Interviewing. In J. F. Gubrium & J. A.
Holstein (Eds.), Handbook of Interview Research: Context & Method (pp. 161–175).
London: Sage.
Foucault, M. (1970). The Order of Things. London: Routledge.
Foucault, M. (1977a). Discipline and Punish: The Birth of the Prison. London: Penguin
Books.
Foucault, M. (1977b). The political function of the intellectual. Radical Philosophy,
17(Summer), 12–14.
Foucault, M. (1977c). Michel Foucault: Language, Counter-Memory, Practice.
Oxford: Basil Blackwell.
Foucault, M. (1980). Power/Knowledge: Selected Interviews and Other Writings
(C. Gordon, Ed.). Brighton: Harvester Press.
Foucault, M. (1981). The Order of Discourse. In R. Young (Ed.), Untying the Text:
A Post-Structuralist Reader (pp. 48–78). Boston: Routledge & Kegan Paul.
Foucault, M. (1982). The Subject and Power. In H. L. Dreyfus & P. Rabinow
(Eds.), Michel Foucault: Beyond Structuralism and Hermeneutics (pp. 208–
226). Chicago: The University of Chicago Press.
Foucault, M. (1984a). Truth and power. In P. Rabinow (Ed.), The Foucault Reader:
An Introduction to Foucault’s Thought (pp. 51–75). London: Penguin.
Foucault, M. (1984b). What Is an Author? In P. Rabinow (Ed.), The Foucault
Reader (pp. 101–120). London: Penguin.
Foucault, M. (1998). The History of Sexuality: The Will to Knowledge (R. Hurley,
Trans. Vol. 1). London: Penguin.
Foucault, M. (2000). Interview with Michel Foucault. In J. D. Faubion (Ed.),
Michel Foucault: Power/Essential Works of Foucault, Vol III (pp. 239–297).
London: Penguin.
Foucault, M. (2009). The Archaeology of Knowledge. London: Routledge.
Fraser, N. (1989). Unruly Practices: Power, Discourse and Gender in Contempo-
rary Social Theory. Cambridge: Polity Press.
Freeman, M. (2003). Identity and difference in narrative inquiry: A commentary
on the articles by Erica Burman, Michelle Crossley, Ian Parker, and Shelley
Sclater. Narrative Inquiry, 13(2), 331–346.
Golding, S. (1992). Gramsci’s Democratic Theory: Contributions to a Post-Lib-
eral Democracy. Toronto: University of Toronto Press.
Gramsci, A. (1971). Selections from the Prison Notebooks. Edited and Translated
by Quintin Hoare and Geoffrey Nowell-Smith. New York: International Pub-
Gramsci, A. (2000). The Antonio Gramsci Reader: Selected Writings 1916–1935.
New York: New York University Press.
Gur-Ze’ev, I., Masschelein, J., & Blake, N. (2001). Reflectivity, Reflection, and
Counter-Education. Studies in Philosophy and Education, 20, 93–106.
Hanson, F. A. (1993). Testing Testing: Social Consequences of the Examined Life.
Berkeley and Los Angeles: University of California Press.
Hanson, F. A. (2000). How Tests Create What They are Intended to Measure. In
A. Filer (Ed.), Assessment: Social Practice and Social Product (pp. 67–81).
London: Routledge/Falmer.
Hollway, W., & Jefferson, T. (2000). Doing Qualitative Research Differently:
free association, narrative and the interview method. London: Sage.

Holub, R. (1992). Antonio Gramsci: Beyond Marxism and Postmodernism.
London & New York: Routledge.
Kane, M. T. (2001). Current Concerns in Validity Theory. Journal of Educational
Measurement, 38(4), 319–342.
Kendall, G., & Wickham, G. (1999). Using Foucault’s Methods. London & Thousand
Oaks: Sage.
Kohler Riessman, C. (2008). Narrative Methods for the Human Sciences. Thou-
sand Oaks: Sage.
Landauer, G. (2010). Revolution and Other Political Writings (G. Kuhn, Trans.).
Oakland: PM Press.
Lather, P. (1993). Fertile Obsession: Validity After Poststructuralism. The Socio-
logical Quarterly, 34(4), 673–693.
Lieblich, A., Tuval-Mashiach, R., & Zilber, T. (1998). Narrative Research: Reading,
Analysis, and Interpretation (Vol. 47). Thousand Oaks: Sage Publications.
Madaus, G. F., & Horn, C. (2000). Testing Technology: The Need for Oversight.
In A. Filer (Ed.), Assessment: Social Practice and Social Product (pp. 47–66).
London: Routledge/Falmer.
Markus, K. A. (1998). Science, Measurement, and Validity: Is Completion of
Samuel Messick’s Synthesis Possible? Social Indicators Research, 45(1), 7–34.
Masschelein, J. (2004). How to Conceive Critical Educational Theory Today?
Journal of Philosophy of Education, 38(3), 351–367.
McNamara, T., & Roever, C. (2006). Language Testing: The Social Dimension.
Malden: Blackwell Publishing.
McNamara, T., & Shohamy, E. (2008). Language tests and human rights. [view-
point]. Journal of Applied Linguistics, 18(1), 89–95.
Messick, S. (1988). The Once and Future Issues of Validity: Assessing the Mean-
ing and Consequences of Measurement. In H. Wainer & H. I. Braun (Eds.),
Test Validity (pp. 33–45). Hove and London: Lawrence Erlbaum Associates.
Messick, S. (1989). Meaning and Values in Test Validation: The Science and
Ethics of Assessment. Educational Researcher, 18(2), 5–11.
Messick, S. (1996). Validity and washback in language testing. Language Testing,
13(3), 241–256.
Moss, P. A. (1992). Shifting Conceptions of Validity in Educational Measure-
ment: Implications for Performance Assessment. Review of Research in Edu-
cation, 62(3), 229–258.
Moss, P. A. (1996). Enlarging the Dialogue in Educational Measurement: Voices
From Interpretive Research Traditions. Educational Researcher, 25(1), 20–28.
Moss, P. A. (1998). The Role of Consequences in Validity Theory. Educational
Measurement: Issues and Practice, 17(1), 6–12.

Moss, P. A., Girard, B. J., & Haniford, L. C. (2006). Validity in Educational
Assessment. Review of Research in Education, 30, 109–162.
Shohamy, E. (2001). The Power of Tests: A Critical Perspective on the Uses of
Language Tests. Harlow: Pearson Education Limited.
Shohamy, E. (2007). Language Policy: Hidden agendas and new approaches.
London: Routledge.
Slembrouck, S. (2001). Explanation, Interpretation and Critique in the Analysis of
Discourse. Critique of Anthropology, 21(1), 33–57.
Spolsky, B. (1997). The ethics of gatekeeping tests: what have we learned in a
hundred years? Language Testing, 14(3), 242–247.
Stobart, G. (2005). Fairness in multicultural assessment. Assessment in Edu-
cation, 12(3), 275–287.
Stobart, G. (2008). Testing Times: The uses and abuses of assessment. London:
Sutherland, G. (2001). Examinations and the Construction of Professional Iden-
tity: a case study of England 1800–1950. Assessment in Education, 8(1),
Trahar, S. (2008). It starts with once upon a time. [Editorial]. Compare, 38, 259–
van Dijk, T. (1993). Principles of critical discourse analysis. Discourse & Society,
4(2), 249–283.
van Dijk, T. A. (1998). Ideology: a multidisciplinary approach. London: Sage.
Webster, L., & Mertova, P. (2007). Using Narrative Inquiry as a Research Method:
an Introduction to Using Critical Event Narrative Analysis in Research on
Learning and Teaching. Oxon: Routledge.

Part II
Language Testing and Assessment
in Schools
Formative Assessment Patterns in CLIL Primary Schools
in Cyprus

Dina Tsagari35 and George Michaeloudes36

University of Cyprus

Content and Language Integrated Learning (CLIL) is an educational approach which involves the
integration of subject content and foreign language learning. It has been used as an umbrella term to
include forms of bilingualism that have been applied in various countries and contexts for different
reasons. The present research explored formative assessment (FA) practices in a CLIL pilot pro-
gramme in Cyprus. Data was collected through teacher questionnaires and classroom observations.
The results showed that the teachers under study seemed to prioritise content over language while
the FA methods frequently used were questioning and provision of evaluative and descriptive feedback.
Also, instances of code switching and L1 use were not excessive. Finally, the most common pattern
used in classroom interaction was Initiation – Response – Feedback (IRF). Overall, the results
showed, considering learners’ successful responses, that teachers’ FA strategies were effective to an
extent and promoted learning in the CLIL lessons observed.

Key words: CLIL, formative assessment, feedback, classroom interaction, code switching.

1. Introduction

The necessity for the European Commission to ‘create a channel of shared under-
standings in tandem with the acknowledgement of the diversity of the European
models’ (Coyle, 2007, p. 554) and the need ‘to achieve a greater degree of plurilingualism
and […] make Europe the most competitive and knowledge-based economy
in the world’ (De Graaf, Koopman, Anikina and Westhoff, 2007, p. 603) have led
to the development of an Action Plan for language learning. According to it, all
European citizens need to be fluent in their mother tongue plus two other European
languages, known as the MT+2 formula (Marsh, 2003). Various innovative pedagogical
methods were used to implement the plan. The use of CLIL was one of them.
The present chapter explores the ways in which CLIL is implemented in a spe-
cific teaching environment by examining the Formative Assessment (FA) meth-
ods that teachers use. In the sections that follow we first define the nature of CLIL
and FA and then describe the rationale of the methodology employed. Finally, we
present and discuss the results and make suggestions for further research in the
field and recommendations for teachers.


2. Literature review

2.1 CLIL

Content and Language Integrated Learning (CLIL) is a pedagogical approach
in Europe37 (Gajo, 2007; Eurydice, 2006) whereby a school subject is taught
through the medium of a second language. CLIL is currently implemented in
various levels of education in several countries in Europe and in other parts of
the world (Kiely, 2009; Pavlou and Ioannou-Georgiou, 2008). Mehisto, Marsh
and Frigols (2008, p. 9) define CLIL as ‘a dual-focused educational approach
in which an additional language is used for the learning and teaching of both
content and language’. CLIL aims to improve both learner proficiency in subject
matter and second/foreign language learning. This is achieved through commu-
nicative methods and task-based activities which aim at creating an environment
conducive to learning. Teachers of CLIL scaffold learning by using a variety of
resources, visual aids, games, role-play and collaborative problem solving to pro-
mote content and language learning.
Significant research, focusing on learners, has been conducted to examine
CLIL effectiveness. For example, research conducted in Switzerland by the Education
Department (Serra, 2007) showed that those students taught in L2 performed
better in oral and written tests than those taught in L1. In another research
study that explored the effectiveness of CLIL in the Dutch CLIL context, De
Graaf, Koopman, Anikina, and Westhoff (2007) found that ‘students who have
followed a CLIL curriculum reach higher levels of proficiency in English than
their peers, without any negative effects on their academic proficiency in L1 or
on other school subjects’ (ibid, p. 605). In the Basque context, CLIL learners
outperformed non-CLIL learners in all language skills tests and in the grammar test, as
reported by Lasagabaster (2008). Finally, Mehisto et al. (2008), based on research
findings, concluded that ‘students generally achieve the same or better results
when studying in a second language’ (ibid, p. 20). CLIL learners also develop
linguistic awareness, become able to compare languages and make the appropri-
ate decisions and verifications in order to transfer their meaning effectively (ibid).

2.2 Formative assessment

One of the main aims of classroom-based assessment is to provide teachers with the
necessary information on learners’ performance. This type of assessment can be

37 For discussions of CLIL in the US context see Brinton, Snow and Wesche, 2003; Snow, 2013.

‘summative’ and/or ‘formative’ (Rea-Dickins, 2007). Formative assessment (FA),
in particular, is integrated in everyday classroom routines (Leung and Mohan,
2004). Its purpose is to promote and assess learning. During FA procedures, the
teacher is required to adopt a dual role: that of ‘facilitator of language development
and assessor of language achievement’ (Rea-Dickins, 2008, p. 5). This dual role is
achieved through classroom interaction. There are various patterns of classroom
interaction identified in the literature. The most popular pattern is the IRF pattern
(Initiation-Response-Feedback) proposed by Sinclair and Coulthard (1975). According
to it, the teacher initiates a learning opportunity (e. g. by asking a question), the
learners respond to this initiation and then the teacher makes a follow-up move in
response to learners’ previous answers (for further discussion of the advantages
and disadvantages of the IRF see Tsagari and Michaeloudes, forthcoming).
Various teacher-oriented actions such as ‘questioning’, ‘observing’ and giving
‘feedback’ to learners’ responses are identified during IRF instances. Question-
ing is the most common action of the three. In some cases this comprises 70 % of
teachers’ classroom talk (Tsui, 1995). Teachers ask learners questions to retrieve
information reflecting on students’ learning and their teaching effectiveness, to
highlight knowledge gaps and inadequacies, to revise previous subject matter,
etc. Questions are also used to identify any misconceptions, to promote discus-
sion, or to explore areas requiring further clarification (Black, Harrison, Lee,
Marshall and Wiliam, 2003). In an effort to create opportunities for students to
engage in the learning process, teachers also help learners ask questions during
the lesson to obtain appropriate feedback that will enhance learning.
Another very powerful tool that teachers use to gather classroom data is ‘obser-
vation’. Gardner and Rea-Dickins (2002) explain that ‘Teachers […] reported that
observation and the collection of language samples were the most useful means
for monitoring their learners’ language progress’. Teachers observe learners’ atti-
tudes and responses while teaching subject matter. Learners’ comments, inter-
actions and even body language are also observed by teachers to retrieve as much
information as possible to adjust and revise their lesson plans accordingly.
The purpose of both questioning and observation is to identify learners’ level
of achievement that will eventually lead to the provision of appropriate feedback
in order to promote learning. Through the provision of feedback, teachers help and
scaffold learners to achieve desired performances. Tunstall and Gipps (1996), based
on empirical research, created a typology of feedback types divided into the
following two categories: ‘evaluative/judgmental’ feedback (teachers judge learners’
responses by approving or disapproving and rewarding or punishing them), and
‘descriptive’ feedback (teachers provide learners with feedback based on their current
achievement to specify attainment or improvement and to construct achievement
or plan the way forward).

3. Aims of the study

The present study aimed at examining the nature of FA in the CLIL context of
primary schools in Cyprus. It investigated the nature of focus in CLIL lessons
(content and/or language) and examined the types of FA methods and strategies
teachers used. The study was exploratory, as no other empirical study on this
specific topic had been conducted at the time, and was based on the following
research questions:
1. Do teachers focus on subject matter knowledge or L2 language in CLIL lessons?
2. What ways do teachers use to assess learners’ achievement in subject matter?
3. What ways do teachers use to assess learners’ achievement in L2?

4. Methodology of the study

In order to answer the research questions, quantitative and qualitative data were
collected to safeguard the research validity of the study (Cohen, Manion and
Morrison, 2000). In particular, questionnaires were administered to CLIL teachers
and observations of CLIL classes were conducted to gather as much accurate
data as possible in order to triangulate the research results (McDonough and
McDonough, 1997). In the present chapter, given the confines of space, we will
present the results of the classroom observations. However, we will refer to the
results of the teacher questionnaires where a point needs exemplification or
elaboration. The interested reader can refer to Michaeloudes (2009) for a fuller
presentation and discussion of the results of the questionnaire study.
Classroom observations were conducted to gain a clearer
picture of the teaching and FA practices used in CLIL lessons. Cohen et al.
(2000, p. 305) argue that observations can ‘discover things that participants might
not freely talk about in interview situations, move beyond perception-based data
and access personal knowledge’ (see also McDonough and McDonough, 1997;
Sapsford and Jupp, 2006). Non-participant observations were conducted since
active participation in the lesson might have influenced the reactions of teachers
and learners and, therefore, affected the accuracy of the data (Hammersley and
Atkinson, 1983 cited in McDonough and McDonough, 1997). In addition to the
audio-recordings of the lessons, field notes were taken to record
non-verbal actions by teachers and learners and to describe the resources and
materials used during the lesson.

Overall, five lessons were observed and audio-recorded in three of the state
primary schools employing CLIL (see Table 1). All the teachers were female. Two
of them were located in Nicosia (T1 and T3) and one in Limassol (T2). Geography
and Home Economics were the most popular subjects in CLIL implementation at
the time of the study.

Table 1. CLIL lesson observations

Teacher | Lesson (code) | Duration (minutes) | Grade
T1 | Geography 1 (G1) | 40 | 5th
T1 | Geography 2 (G2) | 40 | 6th
T2 | Geography 3 (G3) | 40 | 6th
T3 | Home Economics 1 (HE1) | 80 | 5th
T3 | Home Economics 2 (HE2) | 80 | 6th

4.1 Analysis of data

As soon as the data processing of the tape-recordings (transcription and insertion
of relevant field notes in the transcriptions) was completed, the analysis of the
observational data (content analysis, see Paltridge and Phakiti, 2010) was done
manually using a specially-designed grid which consisted of several categories
(see Table 2).
In the analysis grid, the first column contained the lesson transcript. The
second one contained the field notes taken. The third column indicated whether
the teacher’s focus was on content (C) or language (L) while the fourth examined
the nature of turn-taking that occurred in the lessons based on the IRF model: (I)
was used to code instances of initiation, (R) to code learners’ responses and (F)
when feedback was provided to learners (in the event that there was no R or F
following the teacher’s initiation, the turn was not coded). Feedback was further analysed
in the next column as ‘evaluative feedback’ (E) or ‘descriptive feedback’
(D) following Tunstall and Gipps’ (1996) typology. The final column was used to
identify instances of code switching by learners or teachers (C).

Table 2. Example of analysed observational data

Geography 1: Transcribed lesson

Learning episode | Field notes | C/L | IRF | E/D | C
10. T1: Yes. Very good! You said Switzerland, France, Netherlands … | The learner points at various European countries on the map. | C | F, I | E |
11. L1: Luxemburg. | | | R | |
12. T1: Luxemburg, yes, about here. Yes, Βέλγιο, Belgium. | | | F | D |
13. T1: And this? | | | I | |
14. L1: Πολωνία (Poland). | | | R | | C
15. T1: Yes. Πολωνία, Poland. | | C/L | F | E | C

The number of instances in each column was added up and is presented as percentages
in Graph 1. The coding scheme was checked and piloted with an experienced
language teacher who used it on samples of the transcripts. Agreement
on the interpretation of the codes was high.
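
The tallying step described above can be illustrated with a short Python sketch. The analysis in the study was done manually, so this is only an illustration under stated assumptions and not the authors' procedure. The function names `code_percentages` and `percent_agreement` are hypothetical, and the episode counts are invented: they were chosen only so that the shares match the proportions reported for Graph 1, since the actual number of coded episodes is not given. Simple percent agreement is also an assumption, as the chapter does not state how agreement between the two coders was measured.

```python
from collections import Counter


def code_percentages(codes):
    """Return the share of each code as a percentage of all coded episodes."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: 100 * n / total for code, n in counts.items()}


def percent_agreement(coder_a, coder_b):
    """Percentage of items to which two coders assigned the same code."""
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must code the same items")
    if not coder_a:
        raise ValueError("no coded items to compare")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100 * matches / len(coder_a)


# Hypothetical focus codes for 20 assessment episodes (not the study's data):
# C = content, L = language, C/L = content and language.
focus = ["C"] * 9 + ["L"] * 8 + ["C/L"] * 3
print(code_percentages(focus))

# Hypothetical double-coded sample of four episodes.
print(percent_agreement(["C", "L", "C/L", "C"], ["C", "L", "C", "C"]))
```

The same `code_percentages` function could be applied to any other column of the grid, for example the E/D feedback codes.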

5. Presentation of findings and discussion

5.1 Focus on content or language?

The analysis of the observational data aimed to identify whether lesson focus was
on content and/or language. In a CLIL environment the teacher’s focus of assess-
ment is expected to be on both (Kiely, 2009). Graph 1 presents the percentages of
instances coded as focus on ‘content’, ‘language’ and ‘content and language’ (see
also Table 2, third column).

Graph 1. Focus in CLIL lessons (categories: Content, Language, Content and Language)


Overall, the analysis of the lessons showed that teachers tended to prioritise
content slightly more than language. In 45 % of the assessment episodes identified
(Graph 1), teachers assessed content, while in 40 % they assessed language
and in only 15 % they assessed both content and language. (These results reflect
teachers’ views as expressed in the questionnaires, see Michaeloudes, 2009).
Various reasons can explain why teachers did so. It might be the case that
teachers focused on content more because learning in a CLIL environment
might disadvantage some learners in terms of content learning. Perhaps in
their attempt to achieve learning objectives regarding content, teachers placed
greater emphasis on content (Coonan, 2007). It might also be the case that
the learners’ high proficiency in English (almost at B1 level, CEFR; Council
of Europe, 2001) allowed teachers to place emphasis on content,
as learners could use L2 efficiently. Coyle (2007, p. 549) also stresses that
‘The more advanced the students’ level of foreign language, the less attention
it seems is needed to be paid to linguistic development’. Learners had been
taught English as an independent subject twice weekly from year 4. In informal
discussions, teachers stressed that the majority of learners had also
been attending private afternoon lessons for at least two hours, twice a week,
since year 3, which might also have contributed to their high language level.
In the following subsections, selected extracts from the classroom transcripts
will be presented and discussed to exemplify types of classroom interaction
generated in the lessons that place emphasis on content and/or language. The
text in brackets is a direct translation from L1 to L2 while the underlined text is
field notes that provide extra information. Teachers’ and students’ names were
replaced with pseudonyms for reasons of anonymity.

5.2 Focus on content

The following extract is taken from a lesson on home economics. The teacher is
trying to guide the learners to identify certain content words (Extract 1).
Extract 1. Example of focus on content
133. T3: Yeah … say that or the other one or in Greek … you can give one sen-
tence in Greek …
134. S1: Σίδηρο (iron).
135. S2: Strong.
136. T3: Ναι (yes) strong. What makes us strong with dairy products? What do
we have? Ναι (yes) they make our bones …
137. S1: Σίδηρο (iron).

138. T3: Ναι (yes) σίδηρο, iron and what else..? And? And cal… cal…
139. S1: Calcium.
140. T3: Calcium. Very well!
(Home Economics 1, Teacher 3)
In her attempt to help students reach the desired content knowledge (‘calcium’),
the teacher prompted students to use L1 (turn 133). When students came up with
a correct word, ‘iron’ (turns 134 and 137), she then scaffolded them to find the
desired word, ‘calcium’ (turns 138–139). In this learning episode, and others like it,
content learning seemed to be more important than language. During the numerous
instances of focus on content in the lessons observed, teachers used L2 very often.
Teachers were seeking content achievement rather than language performance
when they prompted learners to answer in L1 (e. g. turn 133). Despite the fact
that learners were taught through the medium of another language, the teachers’
preference for content safeguarded the high standards of achievement for learners.

5.3 Focus on language

Even though focus on language was not as frequent as focus on content (see Graph
1), the teachers employed an interesting array of methods to assess language.
The most common technique was direct questioning, usually to evaluate whether
students knew or could translate a word from L1 to L2 and vice versa. Teachers
usually asked learners for simple translations in cases of new or difficult words, to
reaffirm students’ understanding of English. Very often simple questioning was
used for reassurance (Extract 2).
Extract 2. Example of questioning
212. T2: Do you know? Do you know the name of this animal? It has long teeth in
the front.
213. S1: Κάστορας (beaver).
214. Τ2: Κάστορας (beaver). Very nice! In English, what is the name? B…?
215. S1: Beaver.
216. T2: Very good! Beaver is the national animal of Canada. National animal?
217. S1: Εθνικό ζώο (national animal).
218. T2: Εθνικό ζώο (national animal). Very good!
(Geography 3, Teacher 2)
In this learning episode the content-related answer expanded language learning,
i. e. the targeted unknown word ‘beaver’ (turns 213–215) led to the unknown
phrase ‘national animal’ (turn 216).

Another strategy teachers used while focusing on language was elaboration.
Teachers used this strategy when there was evidence – through student body
language, facial expressions, questions or their silence – that students did not
understand the question or the assigned task. Elaboration very often works as
scaffolding. Ellis (2003, p. 180) defines scaffolding as ‘the dialogic process by
which one speaker assists another in performing a function that he or she cannot
perform alone’.
Scaffolding in the CLIL lessons observed happened when teachers helped
learners to find the desired answer, e. g. by giving more explanations or simplified
tasks. For example, in Extract 3, from a lesson on Home Economics, the teacher
prompts the learner to place a picture in the right place.
Extract 3. Example of elaboration
74. T3: Now, let’s revise the words. Mary! Can you show us where the milk
group is? … Can you show us the milk group? … Where is the dairy
product group? … The milk group. (The student goes to the food pyra-
mid and places the pictures in the appropriate space= points to the right
75. T3: Ah … there it is. Let’s put it up there.
(Home Economics 2, Teacher 3)
In Extract 3, the teacher asks the learner to place the picture in the right place of
the food pyramid as the focus is on content. The learner struggled so the teacher
repeated the question. The teacher did not receive an answer so she further elaborated
by simplifying the question. She used the word ‘milk’ rather than the possibly
unknown word ‘dairy’, thus helping the learner to respond correctly. What is
interesting in this extract is that the learner does not use language in her response.
Instead the learner demonstrates her understanding of the language used in the
teacher’s questions by placing the picture in the right place. Such learning episodes
confirm the results of the teachers’ questionnaires, too, whereby the teachers
reported that they provide explanations and elaborate on complex areas that
learners need help with.
Another interesting aspect of focus on language was the pronunciation of some
‘tricky’ words whereby teachers helped learners to identify and correct pronun-
ciation (Extract 4).
Extract 4. Example of correction of pronunciation
372. T3: No not all. Can you give me the name of one product that doesn’t have
preservative and show it to me please? Yes, John?
373. S1: Tea cakes (PRONUNCIATION ERROR)

374. T3: Tea cakes (Teacher repeats using the correct pronunciation) …. show me
the tea cakes. Show me. Tea cakes are good. They don’t have preserva-
(Home Economics 1, Teacher 3)
After the teacher’s question (turn 372), the learner responded by wrongly pronouncing
the word ‘tea’. This could be attributed to interference from L1 since,
unlike English, written Greek is pronounced phonetically. The teacher corrected
the problem by modeling the correct pronunciation of the word. In fact,
the teacher repeated the word with the correct pronunciation three times
(turn 374) to reinforce learning. (These actions were reported by teachers in the
questionnaires, too.)

5.4 Focus on content and language

Whole class activities were frequently supported by visual aids like pictures. This
is not uncommon. Coonan (2007, p. 636) stressed that ‘Teachers highlight the
importance of materials with regard to content in CLIL lessons’. In the question-
naires teachers also reported using ‘visual aids’, and ‘media’ to help learners over-
come difficulties. Learners were asked, for example, to name objects in pictures
(focus on language), then categorise the objects according to content knowledge.
For instance, in Extract 5 the teacher uses a map to focus on content and language
during a geography lesson.
Extract 5. Use of visual aids
79. T2: Look at this big map here (The teacher points at the World Atlas). I want
you to make sentences in your groups about where Canada is. (She sticks
a strip of paper with a full sentence written on it on the World Atlas) OK?
You can also use your World Atlas. Open your World Atlas. On page
8. Ok? This is the map of the world. Find Canada. (She goes around the
tables to make sure that the groups are working properly) Bravo James!
Bravo to the table of Cancer (name of the group). Ok. You have 30 sec-
onds τριάντα δευτερόλεπτα to make your sentences. It’s very easy.
80. T2: Ok. Are you ready?
81. L1: Yes.
(Geography 3, Teacher 2)

In this learning episode, the teacher provides the learners with a sample sentence
and the necessary words in the appropriate language form and asks learners to
construct their own statement sentences. Learners are expected, through their
content knowledge, to select the correct word in L2. They then form a sentence
using the sample structure provided. By drilling the same pattern to describe the
location of a country on the map, students learned the appropriate L2 expression.
Other instances of equal emphasis on both content and language were included in
group or independent tasks. For example, in Extract 6, learners were asked to work
in groups using L2 or work on their worksheets independently, focusing on content.
Extract 6. Use of group work activity
133. T2: Very good! Now where in Canada do we find these climates? I don’t want
you to guess. I want you to find the place that you find them. I want you to
go to page 51 on your World Atlas (She shows the Atlas to the learners)
Ok? And in your groups study the map of Canada that talks about the
climates of Canada. Ok? And I want you to decide in your groups where
we find this climate, ok? At which point? You have less than a minute.
You have about 40 seconds. Where do we find the Arctic climate? (She
monitors the groups and she helps learners)
134. T2: Ok? Have a look at the colours of the map please. Ok, so I’m listening.
Arctic climate. Yes!
135. L1: To the A
136. T2: At point A. Very nice! Excellent! We have the Arctic climate, αρτικό
κλίμα, a very very cold climate up in the polar zone. Yes, Helen! (She
sticks the labels of the climate on the map)
(Geography 3, Teacher 2)
Extract 6 is an example of a learning episode where a task-based activity is used
to assess and promote content and language learning. In this and other similar
instances, learners were assigned a task and prompted to interact with their part-
ners in order to find the answer. The teacher’s detailed instructions promoted
communication, as the learners were asked to work with their partners to reach
the content learning outcome using the target language. Learners would inter-
act with their peers using short sentences in English. The teacher assessed both
aspects (content and language) simultaneously using the same task in the lesson.
Coonan (2007, p. 634), in research on teachers’ perceptions of CLIL, stressed
that ‘All teachers express a preference for pair/group work’ and that group work
occupies from 30 %-40 % up to 70 % of the lesson. Group work activities are more
effective as they create opportunities to integrate learning.
Another strategy used while focusing on both content and language was the
use of body language, gestures and facial expressions, evident when the teacher
pointed to maps, objects or other displays in the classroom, helping learners rec-
ognise and find the appropriate word. In Extract 7 the teacher uses body language
to scaffold learners.

Extract 7. Use of body language
164. T3: Ah … προστατεύει μας (it protects us). Very good! It protects us. So it’s
good for not getting a cold, right? And for what else? For …?
165. S1: Skin.
166. T3: Skin and … (The teacher points at her nails)
167. S1: Nails.
168. T3: Nails. Very good! Cold, skin and nails.
(Home Economics 1, Teacher 3)
The focus in this particular learning episode was mainly on content, which is why
the teacher also uses L1. As the learners had difficulty finding the target word, the
teacher pointed at her nails (turn 166). The episode thus combines a focus on content (the teacher
scaffolds learners to find a related answer) and on language (the teacher expects
learners to find the content-related answer in L2).

5.5 Code switching

The analysis of the CLIL lessons observed identified instances of code switching
defined as the ‘alternation between two (or more) languages’ (Eldridge 1996, p.
80). The analysis showed that both teachers and learners used code switching for
different purposes. For example, teachers used L1 to give clearer instructions,
explain new subject matter and motivate learners to participate in the learning
process (see Extracts 5, 6). Learners used L1 when they did not know the meaning
of a word or when they did not understand the teacher when she used L2 (Extract
8). On some occasions, learners felt comfortable answering in L1, while on others
they asked the teacher whether they were allowed to use L1 (see Extract 8).
Extract 8. Example of code switching
194. T3: Cheese. Very good. Milk cheese. Does anyone know the word for all
these products that are made of milk?
195. S1: Can I say it in Greek?
196. T3: Yes.
197. S1: Γαλακτοκομικά προϊόντα (dairy products).
198. Τ3: Γαλακτοκομικά προϊόντα (dairy products).
199. T3: Do you know it in English? It starts with D.
200. S1: Dairy products.
201. T3: Very good! Dairy products.
(Home Economics 1, Teacher 3)

In this extract, the learner’s difficulty in answering the question is due to language,
not content. Code switching seems to promote learning as the focus is on
content and gives the learner the chance to prove his/her knowledge of the topic.
In other instances, L1 was used when learners were asked by their teachers to
translate a word or phrase. Such questions were used as comprehension checks.
The learners responded by saying the word in L1 (see Extract 8). Teachers occa-
sionally used L1 when they gave instructions or when they taught new compli-
cated subject matter.
Overall, the use of code switching was not excessive (see also Extracts 3, 4,
6). This could perhaps be attributed to the learners’ good command of L2 gained
over the two years of learning English at state school. Another reason could be
the use of effective methods and strategies. Teachers’ integration of visual aids
in the lesson (see Extracts 3, 5, 6) and the use of body language (see Extract 7)
scaffolded learners’ understanding and acquisition of new vocabulary. Another
reason could be the fact that the observed lessons took place at the end of the aca-
demic year when, as Coonan (2007) suggests, the level of the use of L2 is gradu-
ally increased thus increasing learners’ familiarity with the CLIL lesson routines
and strategies. All learners had completed one year and some were in their second
year of learning English. They were familiar with instructions in L2 and seemed
to respond efficiently using it without the need of regular code switching. Learn-
ers’ code switching occurred when they did not know the equivalent word in
English and provided an answer in L1.

5.6 The IRF pattern

The analysis of the observational data showed that the teaching pattern favoured
by teachers was mainly teacher-fronted. In fact, the majority of the learning
episodes were initiated by the teachers (e. g. Extracts 2, 3, 5, 6), who directed their
lessons to pre-planned learning objectives. The discourse pattern observed in the
lessons echoed the Initiation – Response – Follow-up (IRF) framework (Sinclair
and Coulthard, 1975). This was used later as a basis to code the transcripts of
the lessons observed. However, despite the fixed lesson format and structure fol-
lowed, the CLIL lessons observed were effective. As Coyle (2007, p. 556) argues
‘… teacher – learner questions are a means of engaging learners cognitively and
generating new language use’. To retain a balance and dual focus on both content
and language, teachers followed pre-planned teacher-centered routines. Their
objectives might not have been achieved if the structure of the lesson had been more
flexible, e. g. focusing on either content or language.

The structure of the lessons observed does not mean that learners were
not given the opportunity to express themselves. The analysis showed that
learners felt free to ask questions when they faced difficulties, e. g. Extract 8.
Students also asked for more elaboration, clearer instructions or explanations
of unknown words. They seemed to work in a safe environment where they
could express themselves with confidence. This is evidence that teachers were
not confined to their teaching plans and remained open to students’ questions
during the lessons. As teachers explained in the questionnaires, they wanted
to make sure that they were comprehensible by students while teaching in L2.
Teachers also made sure that negative comments from other learners were
The fact that teachers seriously considered learners’ responses and elaborated
on them was evident in all lessons transcribed (e. g. Extracts 1, 2, 4,
6). According to Gourlay (2005), this is the ‘embedded extension’ of an IRF
episode: the teacher asks a question, the learner responds, and the teacher expands
on the learner’s response with feedback. In the transcriptions, these instances
were identified when feedback was marked with ‘D’ – ‘descriptive’ (see Table
2). Teachers moved from the IRF model to the embedded extension probably
because they identified a misconception or a difficulty amongst learners,
received specific questions from learners or wanted to shift the focus to content
or language.
A very common pattern used in the CLIL lessons observed is the I-R-F-R-F
(see Extracts 6, 9) whereby the teacher asks a question, the learners answer it,
and the teacher then provides learners with feedback, usually ‘evaluative’. Then
another learner follows the routine of this pattern answering the next question.
This pattern occurred while teachers assessed learners on a particular task or when
learners worked independently (e. g. on worksheets).
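
As an illustration of how such interaction sequences could be identified once each turn has been coded as I, R or F, the sketch below groups a flat list of move codes into exchanges, starting a new exchange at every initiation. This is a minimal sketch and not the procedure used in the study, where coding was done by hand. The function name `split_exchanges` and the sample sequence are hypothetical.

```python
def split_exchanges(move_codes):
    """Group a flat list of move codes into exchanges that each start with 'I'."""
    exchanges = []
    current = []
    for code in move_codes:
        # A new initiation closes the exchange collected so far.
        if code == "I" and current:
            exchanges.append("".join(current))
            current = []
        current.append(code)
    if current:
        exchanges.append("".join(current))
    return exchanges


# Hypothetical coded turns: one extended exchange followed by a basic one.
moves = ["I", "R", "F", "R", "F", "I", "R", "F"]
print(split_exchanges(moves))  # ['IRFRF', 'IRF']
```

Counting the resulting strings would then show how often the basic IRF exchange and the longer IRFRF sequence occur in a transcript.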

5.7 Feedback

All instances of feedback identified in the IRF column of the framework (see
Table 2) were categorized as ‘evaluative’ or ‘descriptive’ according to Tunstall
and Gipps (1996).
Evaluative feedback was more commonly used than descriptive feedback (see
Extracts 2, 8, 9). In the analysis of the data this appeared in a variety of forms.
Rewarding words like ‘yes’ and ‘well done’ in combination with repetition of
correct answers were the most common instances of evaluative feedback found.
In the following extract (see Extract 9), the teacher praises learners when they
successfully find countries on the map.

Extract 9. Example of evaluative feedback
52. S1: Italy.
53. T1: Italy. Very good!
54. S2: Italy.
55. T1: Yes.
56. S3: Germany.
57. T1: Very good! And …
58. S4: Belgium.
59. S5: Λουξεμβούργο (Luxemburg).
60. T1: Λουξεμβούργο (Luxemburg). Yes. Here is Luxemburg and …
61. S6: Βέλγιο (Belgium).
(Geography 1, Teacher 1)
In this part of the learning episode, the common forms of evaluative feedback are
clear: the teacher repeats correct answers, uses ‘yes’ to show approval and uses
rewarding words like ‘very good’ to promote learning through praise and reward
and to motivate learners.
Another strategy used was game-like activities: the teacher would allocate
points to student groups to motivate them to achieve the desired learning out-
comes (see Extract 10).
Extract 10. Use of a game-like activity
142. S1: It’s a grain.
143. T3: Very good. One point for this group. Έλα (Come on) Jenny! Is this a
grain food or a non-grain food?
144. S2: Non-grain food.
145. T3: Very good! One point for the other group. Grain food or non-grain
146. S3: Non-grain food!
147. T3: Very good! Έλα (Come on)! Grain food or non-grain food. It’s rice.
(Home Economics 1, Teacher 3)
The use of games such as these engaged learners’ interest. It motivated
them to participate in the learning process by answering content-related questions
using L2 and created an enjoyable atmosphere during the lesson.
As the CLIL lesson requires focus on more than one parameter, when
a learner answered a particular question successfully, the teacher tended
to provide the learner with both evaluative and descriptive feedback.
When an answer with a focus on content, for example, needed more elaboration
to meet language requirements, there was a further expansive prompt
from the teacher.

Extract 11. Example of descriptive feedback
25. T3: So what’s this?
26. S1: Kiwi.
27. T3: Very good! So Kiwi is a … It is a …
28. S1: Fruit!
29. T3: Fruit. Very good! Sit down! Go to your place.
30. S2: Strawberries.
31. T3: Strawberries are …
32. S3: Fruit.
(Home Economics 1, Teacher 3)
This learning episode is characteristic of focusing on both content and language.
The teacher provides the learner with descriptive feedback (turn 27) to specify
attainment for the learner to reach content achievement, e. g. the categorization
of kiwi as ‘fruit’ (turn 28). The teacher was not satisfied with the learner’s first
response (turn 26) as she was looking for both content and language competence,
which she elicited through the provision of descriptive feedback.

6. Conclusion

The results of the study are indicative of the complexity of focus in CLIL les-
sons (also in Snow, 2012). As was seen from the analysis of the data, even in the
learning episodes where content was prioritized, the language used was L2. Con-
versely, when the focus was on language, this was related to content.
In addition, the analysis also revealed a variety of FA methods and strategies
used by teachers to assess either content or language or both areas simultaneously.
In the majority of the instances, the main strategy teachers used to assess content
and language was ‘questioning’. This was used to motivate learners and encour-
age them to use the target language. The most common interaction sequence was
the IRF pattern, which followed a teacher-fronted style of teaching. Teachers also
praised learners very often. This stimulated students’ confidence and motivation.
In addition, teachers provided learners with evaluative feedback in the form of
rewards and descriptive feedback, which expanded the IRF interaction sequences
of the lesson (into IRFRF) and created an open learning environment.
However, even though the FA strategies that teachers employed seemed to
be effective to an extent, more work needs to be done to shed further light on the
nature of FA and its relation to the implementation of CLIL. The results of the
study, valuable as they are, are somewhat limited. One of the factors affecting
the scope of the study was that, due to the innovative nature of CLIL in Cyprus,
a very small number of teachers (17 in total) were using CLIL at the time of the
research. Observations of more CLIL classes taught by different teachers could offer
a clearer picture of the ways in which FA takes place in CLIL lessons. The lack of
time and opportunity to interview the teachers observed after their lessons was
also a limiting factor. The observational data could have been further enhanced
if video-recording of lessons had been allowed. Further research could include parents’
perspectives, too. For example, parents’ satisfaction with feedback on learners’ performance
would be another area to explore, as parents are crucial stakeholders in
the implementation of CLIL. Their comments are valuable as they can lead to the
adjustment of teaching strategies and assessment procedures.
Finally, given the use of FA practices by teachers in the present study, we
would like to highlight the importance of teacher training in FA for the success-
ful implementation of CLIL (for other teacher training aspects related to CLIL
teaching see Pavlou and Ioannou-Georgiou, 2008). We believe that teachers can
become clearer and more confident about the focus of their assessment in their
CLIL contexts (Massler, 2011) if they are given the opportunity to attend profes-
sional development courses that combine FA and CLIL education. For example,
teachers can be trained in applying the Tunstall and Gipps (1996) typology, while
employing FA strategies in their CLIL lessons.
We hope that future research will shed more light on the ways FA is
implemented in CLIL classes in Cyprus and other educational contexts.

References


Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2003). Assessment
for Learning: Putting it into Practice. Maidenhead: Open University Press.
Brinton, D. M., Snow, M. A., & Wesche, M. B. (2003). Content-based second
language instruction. Ann Arbor, MI: University of Michigan.
Cohen, L., Manion, L., & Morrison, K. (2000). Research Methods in Education.
London and New York: Routledge Falmer.
Coonan, C. M. (2007). Insider Views of the CLIL Class Through Teacher
Self-Observation–Introspection. The International Journal of Bilingual
Education and Bilingualism, 10(5), 623–646.
Council of Europe. (2001). Common European Framework of Reference for
Languages: Learning, Teaching, Assessment. Cambridge: Cambridge University
Press.
Coyle, D. (2007). Content and Language Integrated Learning: Towards a Con-
nected Research Agenda for CLIL Pedagogies. The International Journal of
Bilingual Education and Bilingualism, 10(5), 543–562.

De Graaf, R., Koopman, G. J., Anikina, Y., & Westhoff, G. (2007). An Obser-
vation Tool for Effective L2 Pedagogy in Content and Language Integrated
Learning (CLIL). The International Journal of Bilingual Education and Bilin-
gualism, 10(5), 603–624.
Eldridge, J. (1996). Code-switching in a Turkish secondary school. English
Language Teaching Journal, 50(4), 303–311.
Ellis, R. (2003). Task-based Language Learning and Teaching. Oxford: Oxford
University Press.
Eurydice. (2006). Content and Language Integrated Learning (CLIL) at School in
Europe. Brussels: Eurydice European Unit.
Gajo, L. (2007). Linguistic Knowledge and Subject Knowledge: How Does Bilin-
gualism Contribute to Subject Development? The International Journal of
Bilingual Education and Bilingualism, 10(5), 562–581.
Gardner, S., & Rea-Dickins, P. (2002). Focus on Language Sampling: A key issue
in EAL Assessment. London: National Association for Language Develop-
ment in the Curriculum (NALDIC).
Gourlay, L. (2005). OK, who’s got number one? Permeable Triadic Dialogue,
covert participation and the co-construction of checking episodes. Language
Teaching Research, 9(4), 403–422.
Kiely, R. (2009). CLIL-The Question of Assessment. Retrieved from Developing website:
Lasagabaster, D. (2008). Foreign Language Competence in Content and Lan-
guage Integrated Courses. The Open Applied Linguistics Journal, 1, 31–42.
Leung, C., & Mohan, B. (2004). Teacher formative assessment and talk in class-
room contexts: assessment as discourse and assessment of discourse. Lan-
guage Testing, 21(3), 335–359.
Marsh, D. (2003). The relevance and potential of content and language integrated
learning (CLIL) for achieving MT+2 in Europe. Retrieved from ELC Infor-
mation Bulletin website:
Massler, U. (2011). Assessment in CLIL learning. In S. Ioannou-Georgiou & P.
Pavlou (Eds.), Guidelines for CLIL Implementation in Primary and Pre-primary
Education. Nicosia, Cyprus: Ministry of Education, Cyprus Pedagogical
Institute.
McDonough, J., & McDonough, S. (1997). Research Methods for English Lan-
guage Teachers. N. Y.: Arnold.
Mehisto, P., Marsh, D., & Frigols, M. J. (2008). Uncovering CLIL. Content and
Language Integrated Learning in Bilingual and Multilingual Education.
Oxford: Macmillan Publishers Limited.

Michaeloudes, G. (2009). Formative Assessment in CLIL: An Observational
Study In Cypriot Primary Schools. Unpublished MA thesis, University of
Bristol, UK.
Paltridge, B., & Phakiti, A. (Eds.). (2010). Continuum Companion to Research
Methods in Applied Linguistics. London: Continuum.
Pavlou, P., & Ioannou-Georgiou, S. (2008). ‘Η εκπαιδευτική προσέγγιση CLIL και
οι προοπτικές εφαρμογής της στην Δημοτική και Προδημοτική Εκπαίδευση
της Κύπρου’. In E. Ftiaka, S. Symeonidou & M. Socratous (Eds.), Quality in
Education: Research and Teaching. Nicosia: University of Cyprus.
Rea-Dickins, P. (2007). Classroom-based assessment: possibilities and pitfalls. In
C. Davison & J. P. Cummins (Eds.), The International Handbook of English
Language Teaching (Vol. 1, pp. 505–520). Norwell, Massachusetts: Springer
Publications.
Rea-Dickins, P. (2008). Classroom-based assessment. In E. Shohamy & N. H.
Hornberger (Eds.), Encyclopedia of Language and Education (2nd ed., Vol. 7,
pp. 1–15).
Sapsford, R., & Jupp, V. (2006). Data Collection and Analysis (2nd ed.). London:
Sage Publications.
Serra, C. (2007). Assessing CLIL at Primary School: A Longitudinal Study. The
International Journal of Bilingual Education and Bilingualism, 10(5), 582–601.
Sinclair, J., & Coulthard, M. (1975). Towards an analysis of discourse. Oxford:
Oxford University Press.
Snow, M. A. (2012). The Changing Face of Content-Based Instruction (Via
DVC). 50th Anniversary Lecture Series, Cyprus Fulbright Commission, J. W.
Fulbright Center, Nicosia (17 May 2012).
Snow, M. A. (2013). Content-based language instruction and content and lan-
guage integrated learning. In C. Chapelle (Ed.), The Encyclopedia of Applied
Linguistics. Oxford, UK: Blackwell.
Tsagari D. & Michaeloudis, G. (forthcoming) ‘Formative Assessment Patterns in
CLIL Primary Schools in Cyprus’. In Tsagari, D., S. Papadima-Sophocleous
& S. Ioannou-Georgiou (eds) International Experiences in Language Testing
and Assessment – Selected papers in Memory of Pavlos Pavlou. (Language
Testing and Evaluation Series). Frankfurt am Main: Peter Lang GmbH.
Tsui, A. B. M. (1995). Introducing Classroom Interaction. London: Penguin.
Tunstall, P., & Gipps, C. (1996). Teacher Feedback to Young Children in Forma-
tive Assessment: A Typology. British Educational Research Journal, 22(4),

EFL Learners’ Attitudes Towards Peer Assessment,
Teacher Assessment and Process Writing
Elena Meletiadou38
University of Cyprus

Peer assessment (PA) is considered a prominent form of alternative assessment which can promote
student-centred life-long learning (Assessment Reform Group, 1999). Previous research has indi-
cated that PA combined with teacher assessment (TA) and supported by the process approach to
writing can improve learners’ writing skills and their attitudes towards the assessment of writing
(Tsui & Ng, 2000). However, even though research has been undertaken in PA, its use in secondary
education is an under-researched area (Topping, 2010). The present study explored adolescent EFL
learners’ attitudes towards PA, TA and the process approach to writing. A multi-method approach
of qualitative and quantitative data collection and analysis was used to survey 40 EFL students in a
Cypriot State Institute. The results indicated that learners had an overall positive attitude and valued
both PA and TA. They also believed that process writing helped them improve their writing per-
formance. The current study expands our understanding of how adolescent EFL learners perceive
PA, TA and process writing and suggests their use in combination to improve students’ motivation
towards the development and assessment of writing.

Keywords: peer assessment, teacher assessment, process writing, attitudes, secondary education.

1. Introduction

PA is fundamentally an interpersonal process in which a performance grade
exchange is established and feedback is given to and received from students,
with the aim of enhancing the performance of an individual, a team or group as a whole
(van Gennip, Segers, & Tillema, 2009). PA is gaining increased recognition as an
alternative assessment method in higher education (Falchikov, 2004).
The beneficial effects of PA are diverse. PA can create a strong link between
instruction and assessment by forming part of a feedback loop that enables teach-
ers to monitor and modify instruction according to results of student assessment
(Tsagari, 2004). The use of PA also helps students develop certain skills in the
areas of critical reading, analysis, communication, self-evaluation, observation
and self-criticism (Dochy & McDowell, 1997). According to McDowell (1995),
PA aims to empower students in contrast to more traditional methods of assess-
ment which can leave learners feeling disengaged from the overall assessment
process. PA is inspired by social constructivism (which stresses students’
responsibility for their own learning), promotes the active role of students in assessment
and is closely aligned with and embedded in the instructional process (Shepard,
2000). In the assessment literature, it is argued that active students are more
motivated, and, therefore, show more learning gains than passive students (van
Gennip et al., 2009).
The use of PA has increased with the shift to the process approach to writing
(Flower & Hayes, 1981). Process-oriented classrooms “challenge the traditional
practice of teaching writing according to reductionist and mechanistic models”
(Lockhart & Ng, 1995, p. 606). Instead of focusing solely on formal accuracy
and the final product of writing, a process approach instills “greater respect for
individual writers and for the writing itself” (Hyland, 2003, p. 17). Hedgcock
and Lefkowitz (1994) argue that foreign-language students are less motivated to
revise and correct their work since their language classes do not focus extensively
on multiple-draft process-oriented instruction. Research findings show that pro-
cess writing is in general an effective approach in helping students improve their
writing performance and attitudes towards writing at the tertiary, secondary and
primary school levels (Cheung, 1999).
Research has also indicated that students think that teacher assessment (TA) is
as important as PA in developing students’ writing skills (Mendoca & Johnson,
1994). Zhao (2010), who investigated learners’ understanding of PA and TA on
writing, found that students valued TA more than PA but accepted it passively,
frequently without understanding its significance or value. He also suggested that
learners understood PA better and substantiated Lantolf and Thorne’s (2006, p.
286) assertion that peer interaction should be included among participant struc-
tures conducive to learning through the ZPD (zone of proximal development),
especially in secondary settings. PA and TA can complement each other, with
students at times being more adept at responding to a student’s essay than teach-
ers, who tend to judge the work as a finished product (Caulk, 1994).
Although numerous studies underscore the role and value of PA in writing
instruction, performance and autonomy in learning (McIsasc & Sepe, 1996;
Plutsky & Wilson, 2004), there are a number of conflicting issues that need to
be further explored.
Firstly, the use of PA and TA in secondary education has not yet been widely
investigated (Tsivitanidou, Zacharia, & Hovardas, 2011). Although some studies
consider PA to be suitable for young learners (Shepard, 2005), several research-
ers claim that PA is more suitable for older learners (Jones & Fletcher, 2002).
Secondly, the literature has shown mixed findings regarding learners’ attitudes
towards PA, TA and the process approach to writing (Cheung, 1999; Lee, 2008;
Strijbos, Narciss, & Dünnebier, 2010).
The above diverse findings motivated the present researcher to use PA, TA and
process writing with adolescent EFL learners and further explore their attitudes
towards these three methods. The present study aimed to build on previous
studies (Cheng & Warren, 1997; Topping, 1998) and was guided by the following
research questions:
• What are adolescent EFL learners’ attitudes towards TA?
• What are adolescent EFL learners’ attitudes towards process writing?
• What are adolescent EFL learners’ attitudes towards PA?

2. The methodology of the present study

2.1 Participants and educational context

The participants in the present study were forty EFL learners in Cyprus (17 girls
and 23 boys, aged 13–14). These adolescent students learned English
as a foreign language in a State Institute in Nicosia, the capital city of Cyprus.
These Institutes are run by the Ministry of Education and Culture of the Republic
of Cyprus and aim at teaching learners EFL, among other subjects, and preparing
them for international exams.
The learners were Greek-Cypriots who had been taught by the same teacher
for 5 years and were attending a general English course of eight months’ duration
(from mid-September 2009 to mid-May 2010). Most of these learners had failed
the previous year’s final exams and the two formal tests they had to take per
semester. After several discussions with the learners, which took place prior to
the current study, the researcher concluded that they had a rather negative attitude
towards writing and the assessment of writing.
The researcher decided to employ PA combined with TA and process writing
to explore whether this combination could improve learners’ attitudes towards
writing and the assessment of writing.

2.2 Procedure and research instruments

The researcher conducted a study over a period of four months (January–April
2010; see Table 1) to explore the impact of PA on adolescent EFL learners’
attitudes towards PA of writing. Two groups of twenty intermediate EFL students
took part in the study once a week for two teaching sessions (45 minutes each),
which added up to approximately 24 teaching sessions. The learners were involved
in two rounds of composition writing for each of the three writing tasks (see
Table 1).

Table 1: The schedule of the study

Training of groups (180 min.) and piloting of research instruments

Writing Task 1
Lesson 1     Writing the first draft of a narrative essay
Lesson 2     Feedback
Lesson 3     Writing the second draft
Lesson 4     Feedback and whole-class discussion

Writing Task 2
Lesson 5     Writing the first draft of a descriptive essay
Lesson 6     Feedback
Lesson 7     Writing the second draft
Lesson 8     Feedback and whole-class discussion

Writing Task 3
Lesson 9     Writing the first draft of an informal letter
Lesson 10    Feedback
Lesson 11    Writing the second draft of an informal letter
Lesson 12    Feedback, administration of the questionnaire and whole-class discussion

Three data collection procedures were employed in the present study. These were
class observations, whole-class discussions and questionnaires. For the class
observations and the whole-class discussions, the teacher filled in two specific
forms (Appendix I and II).
Teacher observations were carried out during classwork and recorded on the
teacher observation forms (Appendix I). The teacher observed students’ reactions
towards process writing, TA, PA and their overall behaviour during the feedback
sessions.
The whole-class discussions (Appendix II) were conducted at the end of every
writing task (Table 1). Semi-formal conversations were initiated during the last
half an hour of all feedback sessions on writing and students were invited to
answer and discuss a list of questions (Appendix II). Students’ mother tongue
(L1) was employed at this stage in order to make sure that all students, even weak
ones, could participate. The teacher kept notes during the whole-class discussions.
A questionnaire (Appendix III) was also employed to explore students’ atti-
tudes towards process writing, TA and PA. Students were asked to respond to
this questionnaire at the end of the study. The questionnaire consisted of three
questions. Students were reassured that the questionnaire would be anonymous
so as to encourage them to respond as sincerely and freely as possible. All
statements required students to respond using a three-point Likert-type scale ranging
from “negative” to “positive”.
The first question examined learners’ attitudes towards process writing which
can be a feasible solution to improving the students’ interest and success in writing
(Ho, 2006). The second question referred to students’ feelings when they get
feedback from their teacher. Previous research has indicated that teacher feedback is
often misinterpreted by students as it is associated with discourse that is not directly
accessible to students (Yang, Badger, & Yu, 2006). The third question was related
to students’ feelings when they get feedback from their peers. Wen and Tsai (2006)
report that students generally display a liking for PA activities because these activ-
ities provide an opportunity for comparison of student work. All instruments were
piloted with a similar group of students before using them in the present study in
order to make sure they were adequate for adolescent EFL learners (Table 1).
To sum up, 24 teacher observations, 6 whole-class discussions and 40 question-
naires were employed in order to investigate students’ feelings towards PA, TA
and process writing and the difficulties they faced during the implementation of
these methods in their EFL writing classes.

3. Findings

The teacher examined all the notes that she kept from the observations and the
whole-class discussions and identified themes via content analysis (Patton, 1987).
A descriptive name was provided to each theme. The themes were then divided
into positive and negative and are presented in the following sections.

3.1 Positive findings

The positive themes that emerged from the content analysis were the following:
(a) learners’ increase of motivation, (b) learners’ preference for TA, and (c) the
need for PA and TA.

3.1.1 Learners’ increase of motivation

The teacher observed that all learners enjoyed writing in drafts. They also com-
mented on the fairness of PA because they had a second chance to reflect on and
improve their essays after receiving both TA and PA. Students seemed more eager
to revise their work and more confident in their redrafts. They also felt proud that
they were able to play the teacher’s role and assess their peers’ compositions. As
can be seen in the following extract, students appreciated the fact that they could
get an insight into their peers’ work. It helped them identify some of the skills
they needed to focus on in their essay writing (also in Hanrahan & Isaacs, 2001).
Learners prefer writing in drafts and receiving more feedback because it is fairer. They can
correct more mistakes and get a better grade since they get more feedback. Learners seem
very confident and happy to revise their peers’ work. They seem to value their peers’ feed-
back and they take it into consideration when revising their drafts. They also appreciate their
peers’ work and look forward to comparing their work with that of their peers’. They benefit
as learners because they can see what both more knowledgeable and weaker students can do.
They become aware of other students’ standards. (WT1, G1, G2, L3, L4)39

During the whole-class discussions, the students felt that they had acquired a
reasonable grounding in PA procedures and were favourably disposed to partici-
pating in PA in the future. They thought that PA was a very interesting and inno-
vative way to improve their writing skills since they were able to revise their work
after receiving explicit feedback. Finally, as can be seen in the extract below,
learners felt that PA is quite an easy method to use in class since they had received
adequate training and had the teacher’s continuous support.
PA is really interesting. I love the PA forms. They provide me with an outline of what I have
to keep in mind when writing an essay. I understand a lot more about my mistakes and I can
correct them more easily. I feel more confident in myself. Not being able to revise my work
and receive more explicit feedback makes me feel a bit disappointed and bitter since I think
that I miss the chance to show that I can write better. I also understand more about how to
write an essay since I have to assess my peers’ work keeping in mind some criteria. I find the
whole PA procedure quite easy since I have been adequately trained and my teacher is always
here to support me whenever I need it. (WT2, G1, S3)

3.1.2 Learners’ preference for TA

The teacher observed that learners preferred TA over PA because the teacher is
considered to be the expert in the English language and can support students in
their efforts to improve their command of the language. As implied in the follow-
ing extract, weak students tend to rely considerably on their teacher’s help and
frequently need his/her guidance.
Learners often seem to prefer teacher feedback over peer feedback when they have to make a
choice. The teacher is always the expert. He/she knows more. He/she can also show them how
to improve their work. The teacher can clarify things for them, provide them with hints about
how to correct their work and what things they have to study in order to fill in their gaps. They
are often dependent on him/her and they need his/her continuous support and help. Especially
weak students often resort to his/her help and request his/her advice since they are uncertain
about the quality of their work. Learners do not think their peers can provide the same kind
of support to them since they are less knowledgeable than the teacher. (WT2, G1, G2, L6, L8)

39 WT: writing task, G: group, L: lesson, S: student

When students were asked to talk about their feelings towards TA, they
confessed that teacher feedback was valuable as it has the potential to help students
improve their achievement. The students also believed that TA was more accurate and
trustworthy (also in Yang et al., 2006). As stated in the extract below, the students
felt that the teacher can provide them with remedial teaching in order to better
comprehend different points, i.e. in the grammar of the language, which often
seem quite obscure to them.
TA is essential in order for me to understand how I can improve my writing. The teacher is
the expert who we need to consult in order to improve our English. The teacher can always
fix things when we mess up everything. She is always ready to help us and provide us with
remedial teaching to overcome our difficulties. We cannot possibly learn how to write Eng-
lish properly without her help. She knows everything. (WT3, G1, S2)

3.1.3 Learners’ need for PA and TA

During the whole-class conversations, most students were quite eager to use PA
in the future in other subjects as well because they felt that PA was a challenging
and effective method at the same time. They also expressed their wish to use a
combination of TA and PA since they seemed to complement each other. Dochy,
Segers, and Sluijsmans (1999), in a review of the literature of sixty-three stud-
ies, also found that a combination of different assessment tasks, including PA,
encouraged students to become more responsible and reflective. As seen in the
following extract, learners thought that TA was superior to PA but PA allowed
them to communicate with each other and explore better solutions to writing
problems.

I would like to use PA in the future. It is interesting to be able to assess my peers’ work and
see similarities with my own essays. I learn a lot and improve as a writer at the same time. I
think we should assess our peers’ work in other lessons as well. Some of my peers’ comments
help me understand better the teacher’s comments. Peer comments also seem to complement
TA. They point out things that teachers do not seem to notice. The language that students
use is also simpler and easier. I think, all in all, I like receiving assessment both TA and PA.
(WT3, G1, S18)

3.2 Negative findings

The researcher identified three negative themes during the content analysis: (a)
complaints about TA, (b) difficulty in using PA, and (c) doubt about ability to
peer-assess.

3.2.1 Complaints about TA

Regarding TA, students often seemed to find it difficult to revise their work
because they were unable to understand the teacher feedback. They felt that it
was vague and hasty. Most of them admitted that they valued TA and integrated
it in their revised drafts but sometimes without fully understanding it (also in
Gibbs & Simpson, 2004). As can be seen in the extract below, they consulted
peer comments when they were confused by teacher comments since they found
peer feedback to be more helpful.
Students in both groups sometimes seemed confused about the teacher comments. They
thought they were so complicated. Students often did not understand why and how they
should correct their mistakes. They tried to correct their work according to the teacher com-
ments but they sometimes asked for clarification or did not correct their work at all. They
often resorted to peer comments when they didn’t understand the teacher comments. Students
seemed to understand peer comments more easily than teacher comments because their peers
used simpler language and were more straightforward. (WT1, 2, G1, G2, L2, L3, L4, L8)

In the whole-class discussions, the learners complained that teacher comments
were often incomprehensible. Lee (2007) also stressed that students sometimes
copied teacher feedback into their redrafts without thinking about why; as a
result, students made similar mistakes in their subsequent writing tasks. Stu-
dents in the current study expressed their wish for face-to-face support from their
teacher in order to clarify any ambiguous points in their feedback.
Reading the teacher’s comments and corrections does not always help. They lack in detail.
I do not know, maybe it is my fault, after all. I do not understand some of her comments or
how to revise them. I have to try really hard. I would prefer to have the chance to talk about
the things I do not understand even for two minutes. Sometimes, I do not know how to revise
my essay. I would like my teacher to be able to explain some things to me. (WT2, G2, S15)

3.2.2 Difficulty in using PA

When assessing their peers’ essays, the learners sometimes asked for additional
information, i.e. clarification, or for assistance from the teacher. They felt proud
that they were allowed to do their teacher’s job. However, weak learners
sometimes felt under pressure. They occasionally complained that they did not have
enough time to revise their work (also in Mok, 2010). A few relatively weak stu-
dents expressed their concern that PA was sometimes difficult and time-consum-
ing although it helped them improve their writing performance. They emphasized
their need for encouragement and more practice.
Weak learners seemed eager to assume the role of the teacher but often needed help in
understanding what was expected from them. They sometimes complained that the task was
demanding and time-consuming. They had trouble concentrating on revising their work.
They were not used to doing that. They seemed eager to assume the role of the teacher but
often needed help in order to understand what was expected from them. A few weak students
also pointed out that they needed more training in order to make more effective revisions.
(WT3, G2, L10, L11, L12)

During the whole-class discussions, weak students complained that using PA in
their classes was sometimes too demanding, although it tended to “accelerate their
learning” (Nicol & Milligan, 2006). They stressed the fact that they had to consult two
types of feedback and this was both tiring and time-consuming (also in Cheng &
Warren, 1999). They were used to receiving assessment only from their teacher.
Nevertheless, they felt that PA helped them improve their essays and their grades.
In the following extract, they expressed their wish for more training in PA from
their teacher in order to ensure successful implementation of PA in their classes.
Revising my work is often a difficult process. I think I need more support from my teacher
and more training in revision strategies before I can revise my work effectively. Things were
much easier when I just had to write an essay once and received one type of feedback. Of
course my grades were lower as well since I could not really improve my work. I like PA but
it takes so much more time to study both the teacher’s comments and the completed PA form.
Sometimes, it is difficult to compare the comments from my peer and my teacher. It is so
tiring. I do benefit from PA but I think I need more help from my teacher in order to be able
to use PA effectively. (WT1, G1, S2)

3.2.3 Doubt about ability to peer-assess

Overall, students felt that PA had a positive impact on student achievement and
motivation. However, some learners doubted their ability to assess other students’
essays as reliably as the teacher. As indicated in the extract below, several stu-
dents also questioned weak learners’ ability to assess other students’ drafts (also
in van Gennip, Segers, & Tillema, 2010) claiming that they might not have
understood or could have misinterpreted the assessment criteria. Finally, the
learners reported improvement in the effectiveness of learning and an increase in their
confidence in writing due to the development of their metacognitive and critical
appraisal skills (also in Falchikov, 2005).
I don’t think I can really be a teacher and assess my peers’ work reliably but I try. I doubt
whether weak learners can understand the PA criteria and assess my work as reliably as my
teacher. Of course, since PA is used in combination with TA, I do not mind. In the past, I hated
writing essays, I hardly ever read the teacher’s comments and I did not improve my writing
skills at all. Now, I look forward to revising my draft since I have the chance to improve my
grade and succeed in the end-of the year exams. I also understand more about how to write an
essay since I have to assess my peers’ work keeping in mind some criteria. (WT2, G2, S20)

3.3 Questionnaire

The questionnaire presented some very interesting findings (Table 2). Learn-
ers had an overall positive attitude towards process writing (Table 2) because it
helped them arrive at a product of good quality and improve their marks (also in
Section 3.1.1). The findings also confirm previous research which has produced
positive results (Pennington, Brock, & Yue, 1996).
Regarding TA, students’ response was mainly positive (Table 2) since stu-
dents generally favoured teacher feedback (also in Ferris, 1995). The results of
the present study indicated that written teacher feedback remains an important
component of the ESL/EFL academic writing classroom (also in Paulus, 1999).
Moreover, learners had positive feelings towards PA (Table 2). According to
previous research, the majority of students found value in rating other students’
compositions (Mendoca & Johnson, 1994). In fact, students seemed to consider
TA and PA as equally important (Table 2). Many researchers also stress the com-
plementary functions of TA and PA (Jacobs, Curtis, Braine, & Huang, 1998; Tsui
& Ng, 2000). Finally, all these findings were confirmed by the teacher obser-
vations and whole-class discussions, which were described earlier in this chapter
(see Sections 3.1 and 3.2).

Table 2: Data regarding the questionnaire

List of questions                                                  Learners’ response
1. Do you like process writing and using drafts?                   75 % positive
2. Do you like getting assessment from your teacher?               90 % positive
3. Do you like getting/providing assessment from/to your peer?     93 % positive

4. Discussion of the results and limitations

The main aim of the current study was to investigate students’ perceptions of PA,
TA and process writing. In relation to the first research question, the current study
confirmed the primacy of TA. The priority given to teacher feedback is explained
by learners’ affective preference for teacher feedback over peer feedback (Cho &
MacArthur, 2010). A good teacher draws on experience and skills that are not
available to pupils, i.e. superior knowledge (Sadler, 1998).
Nevertheless, peer comments enhance a sense of audience, raise learners’
awareness of their own strengths and weaknesses, encourage collaborative learn-
ing, and foster the ownership of text (Tsui & Ng, 2000). Students complained
that teacher comments often lacked in detail. They frequently resorted to peer
feedback in order to understand the teacher comments and to locate any more
problems with their drafts. Previous research has also stressed that both TA and
PA are imperative and supportive to language learners when learning how to
write (Lin & Chien, 2009).
Regarding the second research question, students seemed positively dis-
posed to process writing (Table 2), since young writers benefit from the struc-
ture and security of following the writing process in their writing (Gardner &
Johnson, 1997). They thought that writing in drafts was fairer than submitting
only one draft. They felt that they had an opportunity to reflect and solve their
writing problems and thus develop both their accuracy and fluency. Process
writing increased their self-confidence by creating a supportive writing environment
(also in Graham & Perin, 2007) and taught them, in combination with
PA, how to assume responsibility for their own learning by correcting their
essays. However, weak students complained that writing only one draft was
easier although they admitted that it did not help them improve their writing
skills or their grades.
In relation to the third research question, the findings indicated that using
PA is associated with a long list of benefits. These include greater responsibility
for learning and more student independence, both aspects of increased learner
autonomy, as well as confidence, higher enthusiasm, and motivation. The present
study stressed that the best-liked features of PA among students are: a) increased
awareness of what the task of essay writing actually is; b) the benefit of reading a
peer essay; and c) learning about mistakes and the possibility of subsequent improvement
(also in Falchikov, 1986). Strachan and Wilcox (1996) also found similar
responses when using PA with university students, who thought that the PA process
was fair, valuable, enjoyable and helpful. However, PA also causes negative
feelings in students, such as confusion, doubt regarding their own abilities, resentment
about their peers’ skills and the tiredness that (mostly) weak students experience due
to the complexity of the approach.
The current study adds one more positive and one negative feature to the exist-
ing literature regarding PA. Peer comments seem to be used extensively by stu-
dents in order to clarify teacher feedback. Students claim that TA is frequently
too general, aimed at the essay as a whole. Moreover, weak students also complain
that consulting two types of assessment (TA and PA) at the same time is
often tiring for them. They claim that it is difficult to compare and sometimes
choose between the two kinds of feedback.
A more striking contribution of the current study lies in its identification of
a positive relationship between the combination of TA, PA and process writing
and the students’ goals of improving their grades and succeeding in the end-of-year
exams. Students showed increased interest in PA and process writing as
they realized that it could help them improve their grades. Student involvement
in assessment, focused on the development of skills to self-regulate performance,
may be facilitated by drawing on the strong motivational force of grades and
exams (also in Bryant & Carless, 2010). In so doing, a combination of the afore-
mentioned methods may encourage examination preparation techniques, which
move beyond memorization. For instance, through PA and process writing, stu-
dents develop their revision skills and improve the quality of their essays on their
own. In this way, they can get better marks in the end-of-year exams.
The teacher had not anticipated that students would make any explicit connection between
TA, PA, process writing and summative assessment, or develop an awareness of
the potential of these methods for examination preparation. This finding indicates
how PA strategies can be adapted to suit the needs of a particular local setting,
and reinforces the point of Kennedy and Lee (2008) that formative assessment
cannot be treated in isolation from, or as an antidote to, the dominance of sum-
mative assessment.
However, this is a small-scale case study with limited claims to generalisability.
A number of cautions must be made. First, the sample of EFL students was small
and focused only upon Cypriot adolescent learners of EFL writing. Second, the
results record only the students’ perceptions at the time of the study. Third, stu-
dents’ perceptions originated from a limited understanding of secondary assess-
ment models. Future studies need to be conducted to confirm the current findings.
Whether or not similar results can be obtained in different settings, i.e. with more
learners and other types of populations, deserves further empirical investigation.
All in all, and taking into careful consideration the overall positive findings of
the study, it is acknowledged that PA, as an informal classroom-based assessment
method, can have a positive washback effect on learners. PA, when used in combination
with process writing and TA, seems to increase adolescent EFL learners’
motivation towards writing and the assessment of writing.

5. Pedagogical implications of the study

The current study has shown that secondary students have overall positive attitudes
towards PA, TA and process writing. Its findings have several implications for
teachers and educational researchers interested in implementing these three methods.
Concerning teachers’ frequently obscure comments, it is suggested that by
providing timely and detailed feedback, which is also combined with brief one-
to-one teacher-student conferences, teachers could greatly improve the value of
feedback they provide to their students.
Teachers or future researchers attempting to employ these methods with adoles-
cent learners should: a) create a learner-friendly environment in which PA and pro-
cess writing would be a normal part of students’ everyday routines, b) provide them
with ample training before the implementation of these methods in their classes,
and c) offer continuous support especially to weak students who will be asked to
take a more active role in their own learning. These learners might consider PA and
process writing as quite challenging new approaches since students have ‘more say
in how they approach their learning and its assessment’ (Williams, 1992, p. 55).
Moreover, educational authorities should provide sufficient training and con-
tinuous support to teachers in order for them to implement PA and process writ-
ing successfully in their classrooms. Continuous training regarding writing and
the assessment of writing should also be provided to both teachers and students
since writing is a highly valued skill in today’s competitive labour market (Cushing
Weigle, 2002). PA should not become one more form of assessment which is
included in the secondary school curricula in theory only; it should be used in
practice by highly trained EFL teachers who are knowledgeable in assessment. This is the
only way that ‘assessment for learning’ can be successfully promoted in EFL
writing classrooms.
Previous studies have also advocated certain steps to alleviate students’
negative perceptions of PA and process writing. These include: (a) more PA and, in
our case, process writing experience (Wen & Tsai, 2006); (b) clarity about the PA
criteria; and (c) guidance in regard to PA and process writing (Falchikov, 2005).
Moreover, information security, such as double-blind peer rating, may be a key to
positive feelings about peer rating (Saito & Fujita, 2004).
Graham and Perin (2007) have indicated that, in order to prepare and help weak
learners use the process approach to writing, teachers should: a) try to provide learners
with good models for each type of writing that is the focus of instruction, b)
develop instructional arrangements in which adolescents work together to plan,
draft, revise, and edit their compositions, and c) teach adolescents strategies for
planning, revising, and editing their compositions. This is an especially powerful
method for adolescents who are struggling writers but it is also effective with ado-
lescents in general (ibid, 2007).
To sum up, the present study suggests the use of PA and process writing as com-
plementary practices to TA to improve adolescent learners’ writing performance
and attitudes towards writing and the assessment of writing (Meletiadou, 2011).

6. Conclusion

This study has explored the application of a combination of PA, TA and process
writing with adolescent EFL learners. It has identified some difficulties that may be
expected and a great number of benefits that learners gain, particularly in respect
of adolescent learners’ perceptions of EFL writing and the assessment of writing.
These benefits include: achievement (as expressed in marks, grades, etc.), learning
benefits as perceived by the students involved, and the beliefs students hold about
PA, TA and process writing (van Gennip et al., 2009). The present study has also
presented evidence against common negative conceptions about PA and process
writing, i. e. the inability of adolescent EFL learners to successfully engage in PA
and process writing (Falchikov & Boud, 1989). It has argued that there should be
greater use of PA and process writing in combination with TA in adolescent learners’
EFL writing classes because, as Jones and Fletcher (2002) indicate, the benefits
of doing so outweigh impediments and arguments to the contrary.
Hopefully, the current study, despite its obvious limitations, can
enable teachers and researchers to understand how adolescent EFL students of
writing perceive PA, TA and process writing and will inform the development of
future assessment strategies. Educators and researchers are encouraged to imple-
ment more PA activities and then to acquire more insights about the effects as
well as the concerns of using PA with adolescent EFL learners for educational


References

Assessment Reform Group. (1999). Assessment for Learning: Beyond the Black
Box. Cambridge: University of Cambridge, School of Education.

Bryant, A., & Carless, D. R. (2010). Peer assessment in test-dominated setting:
Empowering, boring or facilitating examination preparation? Educational
Research for Policy and Practice, 9(1), 3–15.
Caulk, N. (1994). Comparing teacher and student responses to written work.
TESOL Quarterly, 28(1), 181–188.
Cheng, W., & Warren, M. (1997). Having second thoughts: Student perceptions
before and after a peer assessment exercise. Studies in Higher Education, 22,
Cheng, W., & Warren, M. (1999). Peer and teacher assessment of the oral and
written tasks of a group project. Assessment and Evaluation in Higher Edu-
cation, 24(3), 301–314.
Cheung, M. (1999). The process of innovation adoption and teacher development.
Evaluation and Research in Education, 13(2), 55–75.
Cho, K., & MacArthur, C. A. (2010). Student revision with peer and expert
reviewing. Learning and Instruction, 20(4), 328–338.
Cushing Weigle, S. (2002). Assessing writing. Cambridge: Cambridge University
Dochy, F., Segers, M., & Sluijsmans, D. (1999). The use of self-, peer and co-
assessment in higher education: A review. Studies in Higher Education, 24(3),
Dochy, F. R. C., & McDowell, L. (1997). Assessment as a tool for learning. Stud-
ies in Educational Evaluation, 23, 279–298.
Falchikov, N. (1986). Product comparisons and process benefits of collaborative
peer group and self assessments. Assessment & Evaluation in Higher Edu-
cation, 11, 146–165.
Falchikov, N. (2004). Involving students in assessment. Psychology Learning and
Teaching, 3, 102–108.
Falchikov, N. (2005). Improving assessment through student involvement: Practi-
cal solutions for aiding learning in Higher and Further Education. London:
Falchikov, N., & Boud, D. (1989). Student self-assessment in higher education: A
meta-analysis. Review of Educational Research, 59, 395–430.
Ferris, D. R. (1995). Student reactions to teacher response in multiple-draft com-
position classrooms. TESOL Quarterly, 29, 33–53.
Flower, L., & Hayes, J. R. (1981). A cognitive process theory of writing. College
Composition and Communication, 32(4), 365–387.
Gardner, A., & Johnson, D. (1997). Teaching personal experience narrative in
the elementary and beyond. Flagstaff, AZ: Northern Arizona Writing Project

Gibbs, G., & Simpson, C. (2004). Conditions under which assessment supports
students’ learning. Learning and Teaching in Higher Education, 1, 3–31.
Graham, S., & Perin, D. (2007). A meta-analysis of writing instruction for adolescent
students. Journal of Educational Psychology, 99(3), 445–476.
Hanrahan, S., & Isaacs, G. (2001). Assessing self- and peer assessment: The stu-
dents’ views. Higher Education Research and Development, 20(1), 53–70.
Hedgcock, J., & Lefkowitz, N. (1994). Feedback on feedback: Assessing learner
receptivity to teacher response in L2 composing. Journal of Second Language
Writing, 3, 141–163.
Ho, B. (2006). Effectiveness of using the process approach to teach writing in six
Hong Kong primary classrooms. Working Papers in English and Communi-
cation, 17, 1–52.
Hyland, K. (2003). Second language writing. New York: Cambridge University
Jacobs, G., Curtis, A., Braine, G., & Huang, S. (1998). Feedback on student writ-
ing: Taking the middle path. Journal of Second Language Writing, 7(3), 307–
Jones, L., & Fletcher, C. (2002). Self-assessment in a selective situation: An eval-
uation of different measurement approaches. Journal of Occupational and
Organizational Psychology, 75, 145–161.
Kennedy, K. J., & Lee, J. C. K. (2008). The changing role of schools in Asian soci-
eties: Schools for the knowledge society. London: Routledge.
Lantolf, J. P., & Thorne, S. L. (2006). Sociocultural theory and the genesis of
second language development. Oxford: Oxford University Press.
Lee, C. (2008). Student reactions to teacher feedback in two Hong Kong second-
ary classrooms. Journal of Second Language Writing, 17, 144–164.
Lee, I. (2007). Feedback in Hong Kong secondary writing classrooms: Assess-
ment for learning or assessment of learning? Assessing Writing, 12(3), 180–
Lin, G. H. C., & Chien, P. S. C. (2009). An investigation into effectiveness of peer
feedback. Journal of Applied Foreign Language Fortune Institute of Technol-
ogy, 3, 79–87.
Lockhart, C., & Ng, P. (1995). Analyzing talk in ESL peer response groups:
Stances, functions and content. Language Learning, 45, 605–655.
McDowell, L. (1995). The impact of innovative assessment on student learning.
Innovations in Education and Training International, 32, 302–313.
McIsaac, C. M., & Sepe, J. F. (1996). Improving the writing of accounting students:
A cooperative venture. Journal of Accounting Education, 14(4), 515–

Meletiadou, E. (2011). Peer Assessment of Writing in Secondary Education: Its
Impact on Learners’ Performance and Attitudes. Department of English Stud-
ies. University of Cyprus. Nicosia.
Mendoca, C., & Johnson, K. (1994). Peer review negotiations: Revision activities
in ESL writing instruction. TESOL Quarterly, 28(4), 745–768.
Mok, J. (2010). A case study of students’ perceptions of peer assessment in Hong
Kong. ELT Journal, 65(3), 230–239.
Nicol, D., & Milligan, C. (2006). Rethinking technology-supported assessment
in terms of the seven principles of good feedback practice. In C. Bryan & K.
Clegg (Eds.), Innovative Assessment in Higher Education (pp. 1–14). London:
Patton, M. Q. (1987). How to use qualitative methods in evaluation. Newbury
Park, CA: Sage.
Paulus, T. M. (1999). The effect of peer and teacher feedback on student writing.
Journal of Second Language Writing, 8(3), 265–289.
Pennington, C., Brock, N., & Yue, F. (1996). Explaining Hong Kong students’
response to process writing: An exploration of causes and outcomes. Journal
of Second Language Writing, 5(3), 227–252.
Plutsky, S., & Wilson, B. A. (2004). Comparison of the three methods for teach-
ing and evaluating writing: A quasi-experimental study. The Delta Pi Epsilon
Journal, 46(1), 50–61.
Sadler, D. R. (1998). Formative assessment: Revisiting the territory. Assessment
in Education, 5(1), 77–84.
Saito, H., & Fujita, T. (2004). Characteristics and user acceptance of peer rating
in EFL writing classrooms. Language Teaching Research, 8(1), 31–54.
Shepard, L. A. (2000). The role of assessment in a learning culture. Educational
Researcher, 29(7), 4–15.
Shepard, L. A. (2005). Linking formative assessment to scaffolding. Educational
Leadership, 63(3), 66–70.
Strachan, I. B., & Wilcox, S. (1996). Peer and self assessment of group work:
developing an effective response to increased enrolment in a third year course
in microclimatology. Journal of Geography in Higher Education, 20, 343–
Strijbos, J.-W., Narciss, S., & Dünnebier, K. (2010). Peer feedback content and
sender’s competence level in academic writing revision tasks: Are they criti-
cal for feedback perceptions and efficiency? Learning and Instruction, 20(4),
Topping, K. J. (1998). Peer assessment between students in college and university.
Review of Educational Research, 68, 249–276.

Topping, K. J. (2010). Methodological quandaries in studying process and out-
comes in peer assessment. Learning and Instruction, 20(4), 339–343.
Tsagari, D. (2004). Is there life beyond language testing? An introduction to alter-
native language assessment. CRILE Working Papers, 58 (Online). Retrieved
Tsivitanidou, O. E., Zacharia, C. Z., & Hovardas, T. (2011). Investigating sec-
ondary school students’ unmediated peer assessment skills. Learning and
Instruction, 21, 506–519.
Tsui, A., & Ng, M. (2000). Do secondary L2 writers benefit from peer comments?
Journal of Second Language Writing, 9(2), 147–170.
van Gennip, N. A. E., Segers, M. S. R., & Tillema, H. H. (2009). Peer assessment
for learning from a social perspective: The influence of interpersonal vari-
ables and structural features. Educational Research Review, 4(1), 41–54.
van Gennip, N. A. E., Segers, M. S. R., & Tillema, H. H. (2010). Peer assessment
as a collaborative learning activity: The role of interpersonal variables and
conceptions. Learning and Instruction, 20(4), 280–290.
Wen, M., & Tsai, C. (2006). University students’ perceptions of and attitudes
toward (online) peer assessment. Higher Education, 51(1), 27–44.
Williams, E. (1992). Student attitudes towards approaches to learning and assess-
ment. Assessment and Evaluation in Higher Education, 17, 45–58.
Yang, M., Badger, R., & Yu, Z. (2006). A comparative study of peer and teacher
feedback in a Chinese EFL writing class. Journal of Second Language Writ-
ing, 15(3), 179–200.
Zhao, H. (2010). Investigating learners’ use and understanding of peer and teacher
feedback on writing: A comparative study in a Chinese English writing class-
room. Assessing Writing, 15(1), 3–17.

Appendix I
Teacher observation form

Issues for all groups Observation Notes

Students’ reactions when receiving TA

Students’ reactions when receiving/providing PA

Students’ reactions when writing in drafts

Any topics raised during observation

Appendix II
Whole-class discussion form

List of questions Notes

1. How do you feel about using drafts and revising your work?

3. How do you feel about the feedback that the teacher gives you?

4. How do you feel about the feedback that your peer gives you?

5. Do you like PA and want to participate in it in the future?

Appendix III

List of questions Negative Neutral Positive

1. Do you like process writing and using drafts?

2. Do you like getting assessment from your teacher?

3. Do you like getting/providing assessment from/to your peers?

Part III
Language Testing and Assessment
in HE
EFL Students’ Perceptions of Assessment
in Higher Education

Dina Tsagari40
University of Cyprus

Research studies into students’ perceptions of assessment and its effect on achievement are limited
(Brown, 2011; Dorman, Fisher & Waldrip, 2006; Struyven, Dochy & Janssens, 2002; 2005; Weekers,
Brown & Veldkamp, 2009). Yet, the literature highlights the centrality of the students’ role,
perceptions and approaches to learning for achievement (Drew, 2001; Sambell, McDowell & Brown,
1997). This chapter reports the findings of a study that investigated university students’ views of the
assessment practices used in their EFL courses and compared these to samples of the actual assess-
ments used. The results revealed that student assessment in the present context does not actively
involve students nor does it encourage substantive and collaborative learning. The chapter argues
for the creation of suitable assessment affordances that support student involvement, empower their
role and eventually strengthen the link between teaching, learning and assessment.

Key words: learner-centredness, student perceptions, higher education, data triangulation, assessment
literacy.

1. Introduction

In today’s information age, the demand for higher levels of literacy skills and critical
thinking in the professional arena has increased. Such competences require
students to actively engage in and monitor their learning. This move towards student
involvement and learner-centredness requires a fundamental change in the
positioning of students in teaching and learning. To align with the students’ new
role as ‘partners’ (Stiggins, 2001), teachers need to view learning from the students’
perspective (Horwitz, 1989). Curriculum designers also need to consider
students’ views and experiences of their learning environment (Ekbatani, 2000;
Lindblom-Ylänne & Lonka, 2001).
In language testing and assessment (LTA), students are also seen as ‘important
stakeholders’ (Erickson & Gustafsson, 2005). Indeed, good practice in LTA41 recommends
that students’ views of the assessment procedures be taken into account,
as they contribute valuable insights into the development and validation of assessment
instruments and procedures (Cheng & DeLuca, 2011; Huhta, Kalaja & Pitkänen-Huhta,
2006; Xie, 2011). More specifically, it is stressed that the inclusion
of students in the testing and grading cycle enhances the validity of student assessments
(Dancer & Kamvounias, 2005; Sambell et al., 1997). For example, assessment
instruments and procedures that yield unreliable and invalid results and failure
rates can be avoided (see Cheng & DeLuca, 2011; Falchikov, 2003; Xie, 2011).
As Fox and Cheng (2007) also stress, test-taker ‘… accounts have the potential to
increase test fairness, enhance the validity of inferences drawn from test performance,
improve the effectiveness of accommodation strategies, and promote positive
washback’ (ibid, p. 9).

41 The EALTA Guidelines for Good Practice in Language Testing and Assessment, see http://

Sharing assessment criteria and procedures with students is also said to
facilitate learning, student participation and metacognition, and to enhance motivation
(Anderson, 2012; Black & Wiliam, 2006, 2009; Ferman, 2005; Frey, 2012). Alternative
assessment methods such as self- and peer-assessment, for example, are
tools that can strengthen student involvement (Anderson, 2012; Black, 2009;
Falchikov, 2003; Finch, 2007; Meletiadou and Tsagari, forthcoming; Ross, 2006;
Stiggins, 1994). Students’ views of assessment, in particular, are important vari-
ables that influence (positively or negatively) their effort, efficacy, performance
and attitudes towards the subject matter and, importantly, their learning (Dorman
et al., 2006; Van de Watering, Gijbels, Dochy & Van der Rijt, 2008).
Student involvement and reflection in LTA can also provide teachers with information
about their own instruction (Carr, 2002; Ekbatani & Pierson, 2000).
Teachers can thus gain insights into areas that seem problematic during instruction
(Wenden, 1998), can develop habits of reflection and self-evaluation too, record
progress in preparation, implementation and evaluation and yield results derived
through consensus (Williams, 2011). This further supports the dialectic relation-
ship between learning, teaching and assessment as it replaces the rigid teacher-to-
student assessment pattern that is present in most of the educational contexts to date.
However, despite discussions about students’ centrality in LTA, little empiri-
cal evidence exists that supports whether students’ attitudes and perceptions
of assessment are taken into consideration or whether students’ involvement in
assessment processes is active (Black and Wiliam, 1998; Gijbels & Dochy, 2006;
Sambell et al., 1997; Struyven et al., 2002). Research has mainly focused on
aspects such as students’ language attitudes and their impact on language learn-
ing (Lee, 2001; Karahan, 2007; Yang & Lau, 2003), the influence of language
assessment on teachers and teaching methods (Cheng, 2005; Tsagari, 2009; Wall,
2005) and teachers’ practices and perceptions towards LTA (Cheng, Rogers &
Hu, 2004; Fulcher, 2012; Tsagari, 2011a; Vogt & Tsagari, forthcoming).
Given the well-documented evidence that assessment has a profound effect
on students (Gosa, 2004; Tsagari, 2009) and on the inculcation of positive attitudes to
language learning (Dorman et al., 2006), it was both timely and opportune to
examine students’ perceptions of assessment.

2. The Study

2.1 Aims of the study

Motivated by the above literature, the present study set out to explore EFL stu-
dents’ perceptions towards LTA practices in the context of higher education (HE).
The study was based on the following research questions:
▪ What are the types of assessment used to evaluate EFL students’ language
skills in HE?
▪ How do students perceive the purposes and practices of these assessments?
▪ Are students actively engaged in their assessment?

2.2 Survey instruments

To answer the research questions posed, triangulation was employed during data
collection (Paltridge & Phakiti, 2010). First, a survey questionnaire was adminis-
tered to undergraduate EFL students of a private university in Nicosia, Cyprus42.
This comprised four parts: a) students’ profile, b) students’ perceptions of the
importance of language skills/areas, c) students’ views of the purposes, methods
and techniques used for the assessment of these skills/areas, and d) students’
satisfaction with the assessment used. The questionnaire included 17 questions.
These contained both Likert scale (five-point)43 and open-ended questions.
Samples of the assessment methods used were also collected and analysed.
These mainly comprised written tests designed and administered internally in
the tertiary institution under study (referred to as ‘in-house tests’). The tests were
designed centrally by a team of EFL teachers of the institution and used as a
primary means of assessing students’ language abilities in EFL. For the purposes
of the present study, the test samples were analysed in terms of the frequency of
types of language skills/areas tested and types of tasks used. The tests, adminis-
tered during the academic year 2010–2011, consisted of 12 achievement (end-of-
term) and 18 progress tests. They yielded 94 test sections and 255 tasks in total.

42 For information about the educational system and EFL in Cyprus see Lamprianou (2012) and
Pavlou (2010).
43 For example, students were required to place statements on 5-point Likert scales such as:
1=unimportant … 5=very important or 1= never … 5=very often.

2.3 Profile of the students

Participants of the study were 141 HE male and female students (20 to 30 years
old) (see Table 1).

Table 1. Characteristics of the participants

Age range                                                         Gender
20–22                                                    52.5 %   Female       68.1 %
23–30                                                    32.6 %   Male         31.9 %
No answer                                                14.9 %

Programmes of study                                               Letter grade in final exam
BEd in Primary Education – ‘Primary’                       32 %   A            22.7 %
Diploma in Business Studies – ‘Business’                   23 %   B            27.7 %
Diploma in Secretarial Studies – ‘Secretarial’             14 %   C            14.2 %
Diploma in Computer Studies – ‘Computer’                   13 %   D            20.6 %
Diploma in Graphic and Advertising Design – ‘Graphic’      11 %   E             1.4 %
                                                                  F             2.8 %
No answer                                                   7 %   No answer    10.6 %

The students, who were in the second or third year of their studies, were
attending five different study programmes at the time of the study. Even though
the majority of the students (65 %) received a passing grade (A–C) in the final EFL
exam (see Table 1), not all of the students were successful. In fact, about a quarter of
them failed their final tests.
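
The pass and fail shares quoted above follow from the grade distribution in Table 1. The short Python check below is only an illustrative sketch: the percentages are copied from Table 1, while the variable names and the treatment of grades D–F as failing grades are assumptions made for the example (the text states only that A–C are passing grades).

```python
# Letter-grade distribution in the final exam, in percent of participants (Table 1).
grade_share = {
    "A": 22.7, "B": 27.7, "C": 14.2,
    "D": 20.6, "E": 1.4, "F": 2.8,
    "No answer": 10.6,
}

# Assumption for this sketch: A-C count as passing (as stated in the text),
# D-F count as failing.
passed = round(sum(grade_share[g] for g in ("A", "B", "C")), 1)
failed = round(sum(grade_share[g] for g in ("D", "E", "F")), 1)
total = round(sum(grade_share.values()), 1)

print(f"Passed (A-C): {passed} %")
print(f"Failed (D-F): {failed} %")
print(f"All categories: {total} %")
```

Rounded to whole numbers, the first sum corresponds to the 65 % reported in the text and the second to roughly a quarter of the participants.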

2.4 Analysis of the data

The responses to the survey questionnaire were analysed using SPSS 17.0 (Sta-
tistical Package for the Social Sciences). For the analysis of the Likert scale
questions, mean values (M) and standard deviations (SD) were calculated so that
more efficient comparisons could be made. Finally, the results of the analysis of
the test samples, presented in percentages, were compared and contrasted with
students’ perceptions of their assessment as these were depicted in the questionnaire.
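
The descriptive statistics named in this section (N, M and SD for each Likert item) can be illustrated with a few lines of Python. This is a minimal sketch and not the authors' SPSS procedure: the function name and the sample answers are hypothetical, and the sketch assumes the sample standard deviation (n − 1 in the denominator) and that unanswered items are simply excluded.

```python
from statistics import mean, stdev


def describe_likert(responses):
    """Return (N, M, SD) for one five-point Likert item.

    Unanswered items are passed as None and excluded, which is why N can
    differ from item to item.
    """
    valid = [r for r in responses if r is not None]
    if any(r not in (1, 2, 3, 4, 5) for r in valid):
        raise ValueError("responses must be integers from 1 to 5")
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return len(valid), round(mean(valid), 2), round(stdev(valid), 2)


# Hypothetical answers to one frequency item (1 = never ... 5 = very often).
hypothetical_item = [5, 5, 4, 5, 3, None, 5, 4]
n, m, sd = describe_likert(hypothetical_item)
print(f"N={n}  M={m:.2f}  SD={sd:.2f}")
```

Reporting N alongside M and SD makes it visible when some respondents skipped an item, as happens for several rows in the tables of the Results section.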

3. Results

3.1 Purposes of assessment

Table 2 presents the range of the assessment types used in the tertiary institution
as mentioned by students. The results show that written tests prevail over other
assessment methods used. Assignments and projects are occasionally used to
assess students’ language while alternative forms of assessment such as diaries,
self-assessment and portfolios are infrequently used.

Table 2. Types of assessment

Types of assessment     N     M      SD

Written tests           141   4.56   0.83
Assignments             140   2.64   1.49
Projects                141   2.01   1.31
Writing diaries         140   1.63   1.15
Self-assessment         139   1.60   1.11
Portfolios              141   1.45   0.99

With regard to purposes of assessment, students believe that language assessment
in their context is mostly used for measurement and administrative purposes,
such as deciding on grades, as well as for learning or teaching purposes
(Table 3).

Table 3. Purposes of assessment

Purposes of assessment Ν M SD
To measure your ability to understand and use the English language 140 3.99 1.06
To measure the progress you have made 141 3.79 1.06
To decide on term and final grades 139 3.62 1.24
To identify your strong and weak points 141 3.58 1.18
To see whether teaching has been successful 140 3.50 1.18
To decide whether a unit/structure needs revision 140 3.44 1.28
To provide you with information about your progress 140 3.16 1.34

Table 4. Students’ perceived importance of assessment purposes

Assessment purposes
  Programme of Study   M      SD       Programme of Study   M      SD       Sig.

To measure your ability to understand and use the English language (F4,125=2.882, p=0.025)
  Business*            4.41   0.875    Primary              3.61   1.125    0.031

To measure the progress you have made (F4,126=6.389, p<0.001)
  Business             4.47   0.507    Primary              3.76   1.131    0.049
  Business             4.47   0.507    Computer             3.44   1.199    0.017
  Business             4.47   0.507    Secretarial          3.20   1.056    0.001

To decide on term and final grades (F4,124=3.522, p=0.009)
  Computer             4.33   1.085    Secretarial          3.05   1.433    0.036

To see whether teaching has been successful (F4,125=3.824, p=0.006)
  Primary              3.67   1.022    Secretarial          2.60   1.188    0.021
  Business             3.72   1.301    Secretarial          2.60   1.188    0.023

To decide whether a unit/structure needs revision (F4,125=3.392, p=0.011)
  Primary              3.73   1.208    Computer             2.72   1.320    0.030
  Business             3.75   1.136    Computer             2.72   1.320    0.038

To provide you with information about your progress (F4,125=7.843, p<0.001)
  Primary              3.32   1.393    Secretarial          2.00   1.257    0.003
  Business             3.81   0.965    Secretarial          2.00   1.257    <0.001
  Graphic              3.88   0.885    Secretarial          2.00   1.257    0.024

Student N size: Business=32; Computer=18; Graphic=16; Primary=45; Secretarial=20

However, even though there was very little difference in the way students as a
whole viewed the purposes of assessment (see Table 3), differences among student
subgroups (e.g. programmes of study) were evident (see Table 4). Analysis of
variance (one-way ANOVA) indicated that statistically significant differences
were present among various student subgroups with regard to the purposes of
assessment. For example, ‘Business’ students believed to a greater extent than
‘Primary’ students that assessment is used to measure their ability to understand
and use the English language. In addition, ‘Business’ students believed more
strongly than ‘Primary’, ‘Computer’ and ‘Secretarial’ students that assessment is
used to measure the progress they have made (see Table 4 for further comparisons
across ‘Programmes of Study’). This finding suggests that there might be some
relationship between programmes of study and perceived purposes of assessment.
In other words, attending a specific programme of study might have an effect on
the way students view assessment, but this needs to be investigated further in
relation to other factors such as the gender or age of students in each
programme of study.
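
The p-values printed next to the F statistics in Table 4 can be cross-checked from the F value and its degrees of freedom alone. The following is a minimal sketch, not the procedure used in the study: it assumes a one-way ANOVA over five programmes of study (numerator df = 4), for which the upper-tail probability of the F distribution has a simple closed form. The function name `anova_p_value_df4` is hypothetical and introduced only for illustration.

```python
import math


def anova_p_value_df4(f_stat: float, df_within: int) -> float:
    """Upper-tail p-value of an F statistic with 4 numerator df.

    For F ~ F(4, d2): P(F > f) = x**a * (1 + a * (1 - x)),
    where x = d2 / (d2 + 4 * f) and a = d2 / 2.
    This closed form only holds when the numerator df is 4
    (five groups); it is not a general F-distribution routine.
    """
    if f_stat < 0 or df_within <= 0:
        raise ValueError("f_stat must be >= 0 and df_within > 0")
    x = df_within / (df_within + 4.0 * f_stat)
    a = df_within / 2.0
    return math.exp(a * math.log(x)) * (1.0 + a * (1.0 - x))


# F statistics and within-group df as printed in Table 4.
reported = [
    ("ability to understand and use English", 2.882, 125),
    ("term and final grades", 3.522, 124),
    ("whether teaching has been successful", 3.824, 125),
    ("whether a unit/structure needs revision", 3.392, 125),
]

for label, f_stat, df_within in reported:
    p = anova_p_value_df4(f_stat, df_within)
    print(f"{label}: F(4,{df_within})={f_stat} -> p={p:.3f}")
```

Under these assumptions the computed values agree with the p-values printed in Table 4 to within rounding. The check says nothing about the pairwise comparisons in the Sig. column, which depend on the post-hoc procedure used.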
With regard to the importance of the various language skills, the results revealed
that students consider speaking the most important language skill and grammar
the least important (Table 5).

Table 5. Students’ perceived importance of language aspects

Language aspects N M SD

Speaking 4.30 1.16

Writing 4.15 1.03

Reading 4.13 1.08

Vocabulary 4.12 1.07

Listening 3.99 1.29

Grammar 3.90 1.17

However, it seems that there is a discrepancy between students’ perceptions
of aspects of learning and the actual content of their tests. Analysis of the test
samples showed that speaking and listening were not included in the in-house
tests (see Table 6), despite the importance placed on these by the students
(Table 5).

Table 6. Language skills/areas in sample tests

Language skills/areas Percentage

Writing 37.25

Grammar 27.84

Vocabulary 18.3

Reading 16.86

Speaking -

Listening -

In contrast, writing and grammar were areas that received a lot of attention
in the tests analysed (Table 6) despite the lower importance placed on these
language areas by students, especially in the case of the latter (Table 5).
Finally, with regard to grading, students believed that various aspects of learning
were taken into consideration when grades were assigned (Table 7). Even though
the most prominent aspects appear to be test results and language performance,
non-test performance aspects, such as class attendance and respect towards the
teacher (also in Dancer & Kamvounias, 2005; Newfields, 2007), are also taken into
consideration in the allocation of grades, according to the students.

Table 7. Aspects taken into consideration in final grades

Aspects of learning N M SD

Test results 4.27 1.01

Ability to understand and use the English Language 4.14 1.05

Effort made (participation, preparation, interest) 3.81 1.21

Attendance in the course 3.58 1.23

Respect to the teacher 137 3.45 1.44

Assessment of reading
Analysis of the reading components of the tests revealed that discrete-point items
such as ‘multiple-choice’ were the most frequently used task types for the assessment
of reading, followed by ‘open-’ and ‘closed-ended’ types (Table 8). Task
types such as ‘taking notes’, ‘summary’ or ‘multiple-matching’ were used the
least, while ‘information-transfer’ and ‘translation’ were not used at all.

Table 8. Sample tests of reading

Test tasks Percentage

Multiple-choice 48.83

Open-ended 30.25

Closed-ended* 11.62

Taking notes 4.65

Summary 2.32

Multiple-matching 2.32

Information-transfer -

Translation -

* Yes/No answers

Table 9 presents students’ views of the task types of the reading tests, which are,
to a great extent, in agreement with the analysis of the actual reading test tasks
(Table 8).

Table 9. Students’ perception of reading task types

Test tasks N M SD

Multiple-choice 3.46 1.38

Open-ended 3.28 1.55

Closed-ended 3.14 1.42

Information-transfer 2.96 1.46

Summary 2.85 1.47

Multiple-matching 2.79 1.40

Taking notes 2.56 1.41

Translation 2.26 1.46

With regard to students’ preference of reading tasks, the most popular task types are
objectively-scored items such as ‘multiple-choice’ and ‘closed-ended’ questions
(Table 10), which is in accordance with the types of test tasks analysed (see Table 8).
‘Translation’, ‘taking notes’ and ‘summary’ tasks are less favoured by students.

Table 10. Students’ preferences of reading task types

Test tasks N M SD

Multiple-choice 4.19 1.17

Closed-ended 3.78 2.99

Multiple-matching 138 3.58 1.32

Open-ended 124 3.17 1.37

Information-transfer 132 3.16 1.36

Translation 2.78 1.51

Taking notes 2.67 1.49

Summary 138 2.47 1.37

Assessment of writing
Table 11 presents the writing tasks used in the tests in rank order. These
include both free-writing tasks, e.g. ‘essay writing’ and ‘letters/reports’,
and controlled-writing types such as ‘sentence joining’ and ‘summary
writing’.

Table 11. Sample tests of writing

Test tasks Percentage

Essay writing 48.42

Letters/reports 31.57

Sentence joining 3.15

Summary writing 1.05

Paragraph writing -

Editing a sentence or paragraph -

Comparison of the results (see Tables 11 and 12) shows that there is agreement
between students’ perception of test types and actual task types included in the
tests analysed. As in the case of reading, students appear to have quite a clear
picture of the writing task types used in their tests.

Table 12. Students’ perception of writing test tasks

Test tasks N M SD
Essay writing 3.88 1.24
Letters/reports 3.88 1.24
Summary writing 3.16 1.32
Sentence joining 3.06 1.34
Paragraph writing 2.90 1.39
Editing a sentence or paragraph 2.89 1.38

Analysis of students’ preference of writing tasks (Table 13) showed that students
enjoy writing tasks such as ‘letters/reports’ and ‘essay writing’ rather than ‘summary’
or ‘paragraph writing’, which is, to a good extent, in agreement with the
types of writing tasks used in the tests of writing.

Table 13. Students’ preferences of writing task types

Test tasks N M SD

Letters / reports 137 3.18 1.42

Essay writing 140 3.06 1.42

Sentence joining 137 2.94 1.47

Editing a sentence or paragraph 137 2.83 1.35

Summary writing 139 2.69 1.35

Paragraph writing 136 2.66 1.41

Assessment of grammar
Analysis of the test tasks indicated that objectively-scored items such as ‘gap-
filling’, ‘multiple-choice’ and ‘transformations’ are mostly used for the assess-
ment of grammar (Table 14).

Table 14. Sample tests of grammar

Task Types Percentage

Gap-filling 42.29

Multiple-choice 23.94

Transformations 15.49

Sentence correction 9.85

Matching 1.40

True/false -

Translation of sentences -

The results also showed that there is some agreement between students’ under-
standing of test content and actual test content of grammar tests. Students reported
that the most frequent tasks are ‘multiple-choice’ and ‘true/false’ (Table 15).

Table 15. Students’ perception of grammar test tasks

Test tasks N M SD

Multiple-choice 3.74 1.48

True/false 3.68 1.40

Gap-filling 3.66 1.29

Transformations 141 3.41 1.45

Sentence correction 3.18 1.43

Matching 3.17 1.42

Translation of sentences 2.22 1.43

Tasks such as ‘True/false’, ‘multiple-choice’ and ‘matching’ activities are also the
most preferred task types among students (Table 16) but these do not correspond
to the actual types of written tasks (see Table 14).

Table 16. Students’ preferences of grammar task types

Test tasks N M SD

True/false 4.27 1.14

Multiple-choice 4.19 1.17

Matching 141 3.43 1.45

Gap-filling 3.12 1.54

Transformations 3.08 1.40

Sentence correction 139 3.02 1.40

Translation of sentences 137 2.76 1.54

Assessment of vocabulary
Finally, the tasks used for the assessment of vocabulary are objectively-scored
items, such as ‘word building’, ‘synonyms/antonyms’, ‘multiple-choice’, ‘gap-
filling’, etc. (see Table 17).

Table 17. Sample tests of vocabulary

Task types Percentage

Word building 34.78

Synonyms/antonyms 33.33

Multiple-choice 17.3

Gap-filling 10.86

Sentence formation 6.52

Matching -

Translation -

Despite small differences, there is an overall agreement between the task types
used in the sample tests and the ones students perceive are included in tests of
vocabulary (Table 18).

Table 18. Students’ perception of vocabulary test tasks

Test tasks N M SD

Synonyms/antonyms 3.71 1.46

Multiple-choice 3.45 1.46

Gap-filling 3.32 1.41

Word building 141 3.11 1.49

Matching 3.06 1.53

Translation 2.81 1.51

Sentence formation 2.80 1.50

‘Multiple-choice’, ‘matching’ and ‘synonyms/antonyms’ are the most preferred
task types (Table 19). This is in accordance with tasks students think are included
in the tests of vocabulary (Table 18). ‘Translation’ and ‘sentence formation’ are
tasks that are the least preferred by students.

Table 19. Students’ preferences of vocabulary task types

Test tasks N M SD

Multiple-choice 3.95 1.29

Matching 140 3.44 1.41

Synonyms/antonyms 3.33 1.44

Word building 138 3.23 1.45

Gap-filling 3.12 1.40

Translation 2.99 1.49

Sentence formation 138 2.82 1.45

Finally, further analysis of the data (calculation of means and t-tests) revealed that
gender might be playing an important role in overall task preference (also in Dan-
iels & Welford, 1992). The statistically significant differences between males/
females (see Table 20) showed that female students seem to prefer certain skills/
types of tasks more than males. More specifically, female students show a prefer-
ence for open-ended tasks that require extra time, cognitive and writing load for
the assessment of language skills such as reading, writing and grammar (also in
Levine & Geldman-Caspar, 1996) compared to male students.

Table 20. Skill/task preference by gender

Skill/task                 Gender   N    M      S.D.   Mean difference   t        Sig.

Taking notes               Male     44   2.23   1.30   -0.671            -2.493   0.014
                           Female   89   2.90   1.53

Translation                Male     44   2.39   1.42   -0.602            -2.184   0.031
                           Female   89   2.99   1.53

Projects                   Male     44   1.84   1.24   -0.844            -3.310   0.001
                           Female   89   2.69   1.64

Paragraph writing          Male     45   2.13   1.22   -0.800            -3.199   0.002
                           Female   91   2.93   1.44

Summary writing            Male     45   2.29   1.39   -0.605            -2.509   0.013
                           Female   94   2.89   1.30

Student portfolio          Male     44   1.50   0.93   -0.678            -3.225   0.002
                           Female   90   2.18   1.49

Translation of sentences   Male     45   2.33   1.45   -0.645            -2.332   0.021
                           Female   96   2.98   1.55
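
The t values in Table 20 can be approximately recomputed from the group sizes, means and standard deviations printed in the table. The sketch below is illustrative only and assumes independent-samples t-tests; the chapter does not state whether equal variances were assumed, so both the pooled (Student) and the Welch form are shown. The function names `pooled_t` and `welch_t` are hypothetical, and because the table reports rounded means, small deviations from the printed t values are expected.

```python
import math


def pooled_t(n1, m1, sd1, n2, m2, sd2):
    """Student's t (equal variances assumed) from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (m1 - m2) / se, df


def welch_t(n1, m1, sd1, n2, m2, sd2):
    """Welch's t (equal variances not assumed) from summary statistics."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    se = math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return (m1 - m2) / se, df


# Rows from Table 20: (male N, M, SD, female N, M, SD).
taking_notes = (44, 2.23, 1.30, 89, 2.90, 1.53)
projects = (44, 1.84, 1.24, 89, 2.69, 1.64)

t_pooled, df_pooled = pooled_t(*taking_notes)
t_welch, df_welch = welch_t(*projects)
print(f"Taking notes (pooled): t={t_pooled:.2f}, df={df_pooled}")
print(f"Projects (Welch): t={t_welch:.2f}, df={df_welch:.1f}")
```

With the printed figures, the pooled form comes close to the reported t for ‘Taking notes’, while the reported t for ‘Projects’ is closer to the Welch form. Which variant was applied to each row is an inference from the table, not something the chapter states.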

Overall students’ views

In the last part of the questionnaire students were asked to provide their overall
opinion about assessment in the EFL classes. Even though the majority of the
students were satisfied with their assessment (74.5 %), students made some inter-
esting suggestions. They said they would welcome the assessment of listening
and speaking but would rather have less grammar-oriented exams, shorter final
exams, and work more frequently with projects and portfolios.

4. Summary and discussion

The aim of the present study was to explore and document students’ views
towards the assessment practices used in their EFL courses. The results showed
that assessment used in the present context is limited in its scope and coverage
in that it reflects an emphasis on the product of learning (summative assessment)
rather than process towards learning (formative assessment) (Rea-Dickins, 2007,
2008). In other words, student assessment does not appreciate alternative types of
assessment as it is mainly conducted through product-oriented approaches such
as achievement written tests. The latter also tend to utilise specific language skills
and areas (e. g. writing, reading, grammar and vocabulary) and exclude others
(e. g. speaking and listening). The test types used are of limited scope too, e. g.
mainly objectively-scored items such as ‘multiple-choice’ and ‘gap-filling’ focus-
ing on the ‘right’ answer (Morrow, 1979; Rea-Dickins, 2000). However, this lim-
ited scope in terms of language content and assessment techniques does not fully
represent the range of language skills and strategies that students are expected to
use especially at this level and thus cannot measure complex cognitive operations
(Cheng & DeLuca, 2011; Frey, 2012). Actually, the assessment employed does not
represent real-life integration of skill use and is likely to inadequately prepare
students to use English for real-life purposes (Finch, 2002, 2007). Within this
assessment environment, it seems that students may perform well in their tests
but will not acquire the skills that are necessary after exams, that is, real-life
communication skills (Deng & Carless, 2010; Tsagari, 2009). Therefore, within the
present assessment context, students may succeed in the in-house tests but this
does not necessarily mean that they learn how to use the language (Clark, 2006;
Finch, 2007). If instruction and learning are to be controlled by testing techniques
such as the ones recorded, these are likely to fail to inspire students’ intrinsic
motivation towards learning. Finally, the assessment methods in use cannot lead
to valid interpretations of students’ language abilities in that they cannot support
teachers’ instructional aims or students’ learning needs.
This situation creates an important discrepancy between the desired “increased
emphasis” on communication skills necessary in higher education (Doherty,
Kettle, May & Caukill, 2011) and the current assessment tasks used as part of stu-
dents’ assessment. Assessment, ideally, measures performance on real-life tasks
thus allowing us to make inferences about examinee ability in real life (Bachman,
2005; Bachman and Palmer, 2010; Kane, 2006, 2012; Lewkowicz, 2000; Mis-
levy, Almond & Lukas, 2003; Sambell, et al., 1997). This involves the design of
assessments that are based on students’ present and future needs, such as profes-
sional and social interaction (Doherty et al., 2011; Lumley & O’Sullivan, 2005).

However, we did not see any attempt to link students’ needs to test content and
format in the present study. On the contrary, the tests used mostly comprised
formats usually employed in external standardized testing (Cheng, Rogers &
Wang, 2008; Andrews, Fullilove & Wong, 2002; Qi, 2004; Tsagari, 2009). The
possibility of the influence of large-scale standardized tests on local assessments
is also claimed in several other publications (see Lamprianou, 2012;
Papadima-Sophocleous, forthcoming; Pavlou and Ioannou-Georgiou, forthcoming;
Tsagari, forthcoming 1). However, further empirical investigation is needed to
validate this claim and to explore the possible influence of other factors, such
as teacher training, that might also have an effect on such practices.
Overall, assessment is part of the fabric of classrooms and students are respon-
sive to the characteristics of assessment. They want assessment to be consistent
with their learning. However, the results showed a mismatch between students’
needs/preferences and assessment practices. For example, even though speaking
and listening were important aspects of language learning for the students,
these were not represented in the in-house tests, nor were students’ preferences
for certain types of tasks taken into consideration. Assessment tasks that
do not support student learning are likely to have a detrimental effect on the
confidence of students in successfully performing learning tasks (Dorman et
al., 2006). In addition, alternative assessment methods, which can help students
reflect, make judgments and enhance lifelong learning skills (Carr, 2002; Sambell
et al., 1997) were not employed despite students’ preferences.
The narrow range of assessment methods used in the present context does not
allow for a collaborative role for students either. The results showed that students
are not involved in decision-making about their own assessment. Therefore,
assessment in the present context seems to be done to the students rather than with
them (McMillan & Workman, 1998). Against this backdrop and despite the possible
practical or other educational factors (e.g. teacher training), the reality for
students is one of exclusion from the assessment process while forms and tasks
of assessment are largely decided by teachers and administrators (also in Cheng
& Wang, 2007; Dorman et al., 2006). This discrepancy is likely to influence
students’ approach to learning, too (Brookhart & Bronowicz, 2003). Given the
assessment practices in place, students are likely to receive inadequate constructive
feedback about their progress and learning (also in Tarnanen & Huhta, 2011;
Tsagari, 2011b; Tsagari, forthcoming 2; Vlanti, 2012) as a result of the reliance
on ‘assessment of learning’ rather than ‘assessment for/as learning’ (Rea-Dickins,
2007, 2008; Cheng, 2011).
‘Grading, feedback, and reporting of student achievement are key elements
that support learning’ (Cheng & Wang, 2007, p. 85). Of these areas, teachers’
grading practices (including feedback) have received particular attention in the
literature (McMillan & Workman, 1998) due to the fact that grades have important
consequences for students. In the present study students believe that a
“hodgepodge” of factors, both objective and subjective, is used to grade them,
e.g. academic performance (such as students’ results on tests) as well as non-test
performance aspects and behaviour (such as effort, participation, preparation,
interest, respect towards the teacher and class attendance) (also in Brookhart,
1993; Cizek, Rachor & Fitzgerald, 1995; Dancer & Kamvounias, 2005; McMillan
& Lawson, 2001; Newfields, 2007). This could, to a certain extent, express the
teachers’ tendency to moderate test results by assessing extra-linguistic aspects
in the informal setting of classrooms (Ferman, 2005). The findings also highlight
the need for more consistency in grade allocation, which will provide students
with clearer messages about what is important and allow transparency for stu-
dents in the assessment process (McMillan, Myran, & Workman, 2002). Never-
theless, before any definite conclusions are drawn here, there is a need to explore
the rationale behind the allocation of final grades by asking the teachers directly
or examining the assessment requirements of the language course syllabi.
The results also revealed some further interesting tendencies. For instance,
students of different study programmes expressed varying views with regard to
the way they interpreted assessment purposes. Harris (1997) also stresses that
the perceptions university students carry from the subject areas they study have
an impact on learning a foreign language. Gender also seems to affect
students’ preferences of assessment types. Struyven et al. (2002) also refer to
gender differences, with males having a more positive attitude towards
multiple-choice formats and female students showing a preference for essay exams
(see also Tarnanen & Huhta, 2011). However, exciting as these results are, we cannot
claim with a high degree of certainty that programmes of study or gender are
variables that affect students’ preferences in assessment. This is worth
investigating in further research (Lumley & O’Sullivan, 2005).
In conclusion, what this study has shown is that students’ knowledge and pref-
erences of assessment are valuable in that they mirror their needs in language
learning and stress the incongruity or deficiencies located in the assessments they
experience. Students in the sample were aware of their learning needs, the various
aspects that determine their final grade and, to a certain degree, of test content
(Ross, Rolheiser & Hogaboam-Gray, 2002). This is a positive finding as students
can plan their study and set realistic goals for themselves (Genesee and Upshur,
1996). Therefore, students, especially (but not only) at tertiary level, should be
actively involved in their assessment, on condition that appropriate support and
training are provided (Smith, Worsfold, Davies, Fisher & McPhail, forthcoming;
Watanabe, 2011).

Nevertheless, further research needs to address the issues raised in this study
in more detail, e.g. explore the interface between students’ perception of assessment
and their performance (Dorman et al., 2006) and investigate whether and
how factors such as gender or programmes of study influence students’ assessment
preferences, performance and learning. The present study investigated
assessment practices on the basis of students’ self-reports, which are likely to
differ from actual practices, as noted by Brookhart (1994) and McMillan &
Workman (1998). Future studies should triangulate data from students’ and teachers’
assessment perceptions and practices in the classroom with classroom observations
as well as examination of curriculum documents and materials related
to student assessment and evaluation. Research methods such as questionnaires
(McDowell, Wakelin, Montgomery & King, 2011), in-depth interviews or narra-
tives (Cheng & DeLuca, 2011; Stralberg, 2006) and journals (Tsagari, 2009) with
either individuals or focus groups (Carless, 2011) can prove helpful instruments
in the investigation of students’ perspective in LTA.

5. Concluding remarks

The study highlights the importance of reconfiguring assessment to promote
productive student involvement and stresses the need for learners, teachers and
administrators to collaboratively design effective assessment procedures and use
the results from well-designed assessments to improve learning. It is the thesis of
this chapter that students, especially at tertiary level, need to become active par-
ticipants and decide what is of importance to them and how they can reach their
goals (Clark, 2006; Sambell et al., 1997). It is, therefore, proposed that an assess-
ment culture of collaboration is established that will eventually lead to increased
metacognitive awareness of students’ learning and engagement in a critical but
healthy and effective assessment environment that can result in greater improve-
ment in language learning (Carless, 2011).
However, to be effective, assessment practices also need to acknowledge exist-
ing practices and beliefs of students and teachers based on specific socio-cultural
settings. Such settings also need to consider the interface between assessment
practices and practices used in the mainstream literature and be contextually
grounded in formative assessment practices (also in Carless, 2011).
Furthermore, for students to be actively involved in LTA, it is essential that
they become ‘assessment literate’, that is, they gain “a sound knowledge and
understanding of the principles and practice of assessment” (Taylor, 2009, p. 25)
and become aware of the purposes of assessment (Black & Wiliam, 1998; Finch,
2002; Sadler, 1989). Raising student assessment literacy will eventually emancipate
and empower students, e.g. allow them to design their own route to their
learning destinations, take decisions about their study, establish their own focus
and aims and thus become active and responsible learners (Black and Wiliam,
2006, 2009).
It is the teachers’ and institutions’ (as policy-makers) responsibility to train
learners in assessment procedures (Nunan, 1988; Ross, 2006; Tudor, 1996; Wata-
nabe, 2011) and create the appropriate affordances that will allow active and sus-
tained student involvement. However, to be able to do so, learners need help and
guidance (Watanabe, 2011) as well as information about course objectives and
other aspects of the curriculum and a clear idea of the assessment objectives and
scoring criteria.
It is the hope of the writer that the present study has raised awareness about
student involvement in LTA practices and decision-making among professionals
in the field, and that it will lead to greater student involvement in LTA and
promote further research and innovation in the field.

Acknowledgements

I would like to thank my colleagues, Dr Dimitris Evripidou and Ms Stavroula
Vlanti, for their help in collecting the data and their comments on the initial drafts
of this chapter.


References

Anderson, J. N. (2012). Student Involvement in Assessment: Healthy Self-Assessment
and Effective Peer Assessment. In Coombe, C., S. Stoynoff, B.
O’Sullivan & P. Davidson (Eds.), The Cambridge Guide to Second Language
Assessment. Cambridge: Cambridge University Press (pp. 187–197).
Andrews, S., Fullilove, J., & Wong, Y. (2002). Targeting washback: a case-study.
System 30(3), 207–223.
Bachman, L. F. (2005). Building and supporting a case for test use. Language
Assessment Quarterly, 2(1), 1–34.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Devel-
oping language assessments and justifying their use in the real world. Oxford:
Oxford University Press.
Black, P. (2009). ‘Formative assessment issues across the curriculum: the theory
and the practice.’ TESOL Quarterly, 43(3), 519–523.
Black, P. & Wiliam, D. (1998). Inside the black box: raising standards through
classroom assessment. London: School of Education King’s College.
Black, P. & Wiliam, D. (2006). Assessment for learning in the classroom. In J.
Gardner (ed), Assessment and Learning, London: Sage (pp. 9–25).

Black, P. & Wiliam, D. (2009). Developing the theory of formative assessment.
Educational Assessment, Evaluation and Accountability, 21(1), 5–31.
Brookhart, S. M. (1993). Teachers’ grading practices: Meaning and values. Jour-
nal of Educational Measurement, 20, 123–142.
Brookhart, S. M. (1994). Teachers’ grading: Practice and theory. Applied Mea-
surement in Education, 7, 279–301.
Brookhart, S. M. & Bronowicz, D. L. (2003). I don’t like writing. It makes fingers
hurt: students talk about their classroom assessments. Assessment in Edu-
cation, 10(2), 221–242.
Brown, G. T. L. (2011). Self-regulation of assessment beliefs and attitudes: A
review of the Student’s Conceptions of Assessment inventory. Educational
Psychology, 31(6), 731–748.
Carless, D. (2011). Reconfiguring assessment to promote productive student learn-
ing. Paper presented at New Direction: Assessment and Evaluation Sympo-
sium organized by British Council. Kuala Lumpur, Malaysia (6–8 July 2011).
Carr, S. C. (2002). Self-evaluation: involving students in their own learning.
Reading and Writing Quarterly, 18, 195–199.
Cheng, L. (2011). Supporting Student Learning: Assessment of Learning and
Assessment for Learning. In Tsagari, D. & I. Csépes (Eds.), Classroom-based
language assessment. Frankfurt: Peter Lang (pp. 191–203).
Cheng, L. (2005). Changing language teaching through language testing: a
washback study. Cambridge: Cambridge University Press.
Cheng, L. & DeLuca, C. (2011). Voices from test-takers: further evidence for
language assessment validation and use. Educational Assessment, 16(2), 104–
Cheng, L., Rogers, W. T. and Wang, X. (2008). Assessment purposes and pro-
cedures in ESL/EFL classrooms. Assessment & Evaluation in Higher Edu-
cation, 33(1), 9–32.
Cheng, L., Rogers, T. & Hu, H. (2004). ESL/EFL instructors’ Classroom Assess-
ment Practices: Purposes, Methods, and Procedures. Language Testing, 21(3),
Cheng, L. & Wang, X. (2007). Grading, Feedback, and Reporting in ESL/EFL
classrooms. Language Assessment Quarterly, 4(1), 85–107.
Cizek, G. J., Rachor, R. E., & Fitzgerald, S. (1995). Further investigation of teach-
ers’ assessment practices. Paper presented at the annual meeting of the Amer-
ican Education Research Association, San Francisco, CA. (ERIC Document
Reproduction Service No. ED384613).
Clark, I. (2006). Assessment for learning: assessment in interaction. Essays in
Education, 18, 1–16.

Dancer, D. & Kamvounias, P. (2005). Student involvement in assessment: a proj-
ect designed to assess class participation fairly and reliably. Assessment &
Evaluation in Higher Education, 30(4), 445–454.
Daniels, S. & Welford, G. (1992). As they like it: Pupil preferences, self-estimates
and performance scores on science tasks. Evaluation & Research in Edu-
cation, 6(1), 1–12.
Deng, C. & Carless, D. R. (2010). Examination preparation or effective teaching:
conflicting priorities in the implementation of a pedagogic innovation. Lan-
guage Assessment Quarterly, 7(4), 285–302.
Doherty, C., Kettle, M., May, L. & Caukill, E. (2011). Talking the talk: oracy
demands in first year university assessment tasks. Assessment in Education:
Principles, Policy & Practice, 18(1), 27–39.
Dorman, J. P., D. L. Fisher & B. G. Waldrip (2006). Classroom environment, stu-
dents’ perceptions of assessment, academic efficacy and attitude to science:
a LISREL analysis. In Fisher, D. L. & Khine, M. S. (Eds.), Contemporary
approaches to research on learning environments: worldviews. Singapore:
World Scientific (pp. 1–28).
Drew, S. (2001). Perceptions of what helps learn and develop in education. Teach-
ing in Higher Education, 6(3), 309–331.
Ekbatani, G. V. (2000). Moving Toward Learner-Directed Assessment. In
Ekbatani, G. V. & H. D. Pierson. Learner-directed Assessment in ESL.
Lawrence Erlbaum: Mahwah NJ (pp. 1–12).
Ekbatani, G. V. & Pierson, H. D. (2000). Epilogue: Promoting Learner-Directed
Assessment. In G. V. Ekbatani, & H. D. Pierson. Learner-directed Assessment
in ESL. Lawrence Erlbaum: Mahwah NJ (pp. 151–152).
Erickson, G. & Gustafsson, J. (2005). Some European students’ and teachers’
views on language testing and assessment: a report on a questionnaire survey.
Retrieved from
Falchikov, N. (2003). Involving students in assessment. Psychology Learning and
Teaching, 3(2), 102–108.
Ferman, I. (2005). Implementing performance-based assessment in the EFL
classroom. ETAI Forum, English Teachers’ Association in Israel, XVI(4).
Finch, A. (2002). Authentic assessment: implications for EFL performance test-
ing in Korea. Secondary Education Research, 49, 89–122.
Finch, A. (2007). Involving language learners in assessment: a new paradigm.
Retrieved from

Fox, J. & Cheng, L. (2007). Did we take the same test? Differing accounts of the
Ontario Secondary School Literacy Test by first and second language
test-takers. Assessment in Education, 14(10), 9–26.
Frey, B. B. (2012). Defining authentic classroom assessment. Practical Assess-
ment, Research & Evaluation, 17(2), 1–18.
Fulcher, G. (2012). Assessment Literacy for the Language Classroom. Language
Assessment Quarterly, 9(2), pp. 113–132.
Genesee, F. & Upshur, J. A. (1996). Classroom-based evaluation in second lan-
guage education. Cambridge: Cambridge University Press.
Gijbels, D. & Dochy, F. (2006). Students’ assessment preferences and approaches
to learning: can formative assessment make a difference? Educational Stud-
ies, 32 (4), 399–409.
Gosa, C. M. C. (2004). Investigating washback: a case study using student dia-
ries. Unpublished PhD thesis, Department of Linguistics and Modern English
Language, Lancaster University, Lancaster, England.
Harris, M. (1997). Self-assessment of language learning in formal settings. ELT
Journal, 51(1), 12–20.
Horwitz, E. K. (1989). Facing the blackboard: student perceptions of language
learning and the language classroom. ADFL Bulletin, 20(3), 61–64.
Huhta, A., Kalaja, P. & A. Pitkänen-Huhta (2006). Discursive construction of
a high-stakes test: the many faces of a test-taker. Language Testing, 23(3),
Kane, M. (2006). Validity. In R. L. Brennan (Ed.), Educational Measurement (4th
ed., pp. 17–64). Westport, CT: Praeger Publishers.
Kane, M. (2012). Validating score interpretations and uses. Language Testing,
29(1), 3–17.
Karahan, F. (2007). Language attitudes of Turkish students towards the English
language and its use in Turkish context. Journal of Arts and Sciences, 7,
Lamprianou, I. (2012). Effects of forced policy-making in high stakes exami-
nations: the case of the Republic of Cyprus. Assessment in Education: Prin-
ciples, Policy & Practice, 19(1), 27–44.
Lee, J. F. K. (2001). Attitudes towards debatable usages among English language
teachers and students. Journal of Applied Linguistics, 6(2), 1–21.
Levine, T. & Geldman-Caspar, Z. (1996). Informal Science Writing Produced by
Boys and Girls: Writing Preference and Quality. British Educational Research
Journal, 22(4), 421–439.
Lewkowicz, J. A. (2000). Authenticity in language testing: some outstanding
questions. Language Testing, 17(1), 43–64.

Lindblom-Ylänne, S. & Lonka, K. (2001). Students’ perceptions of assessment
practices in a traditional medical curriculum. Advances in Health Sciences
Education, 6(2), 121–140.
Lumley, T. & O’Sullivan, B. (2005). The effect of test-taker gender, audience
and topic on task performance in tape-mediated assessment of speaking. Lan-
guage Testing, 22(4), 415–437.
McDowell, L., Wakelin, D., Montgomery, C. & King, S. (2011). Does assessment
for learning make a difference? The development of a questionnaire to explore
the student response Assessment & Evaluation in Higher Education, 36(7):
McMillan, J. H. & Lawson, S. R. (2001). Secondary science teachers’ classroom
assessment and grading practices. Educational Measurement: Issues and
Practice, 20, 20–32.
McMillan, J. H., Myran, S. & Workman, D. (2002). Elementary teachers’ class-
room assessment and grading practices. Journal of Educational Research, 95,
McMillan, J. H. & Workman, D. (1998). Classroom assessment and grading
practices: A review of literature. Richmond, VA: Metropolitan Educational
Research Consortium, Virginia Commonwealth University. (ERIC Document
Reproduction Service No. ED453263)
Meletiadou E. & Tsagari, D. (forthcoming). Investigating the attitudes of adoles-
cent EFL learners towards peer assessment of writing. In D. Tsagari (Ed.),
Research in English as a Foreign Language in Cyprus Volume II. Nicosia
University Press: Nicosia, Cyprus.
Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2003). A brief introduction to Evi-
dence-centered Design (Research Report). Princeton, NJ.
Morrow, K. (1979). Communicative language testing: revolution or evolution?
In Brumfit, C. J. & Johnson, K. (Eds.), The communicative approach to lan-
guage teaching. Oxford: Oxford University Press (pp. 143–157).
Newfields, T. (2007). Engendering assessment literacy: narrowing the gap
between teachers and testers. Assessing Foreign Language Performances:
Proceedings of the 2007 KELTA International Conference. August 25, 2007.
College of Education, Seoul National University (pp. 22–36).
Nunan, D. (1988). The learner-centred curriculum. Cambridge: Cambridge Uni-
versity Press.
Paltridge, B. & Phakiti, A. (Eds.) (2010). Companion to Research Methods in
Applied Linguistics. London and New York: Continuum
Papadima-Sophocleous, S. (forthcoming). High-stakes Language Testing in the
Republic of Cyprus. In D. Tsagari, S. Papadima-Sophocleous, & S. Ioan-

nou-Georgiou (Eds.), Language Testing and Assessment around the Globe:
Achievements and Experiences (tentative title). Frankfurt: Peter Lang.
Pavlou, P. (2010). Preface. In P. Pavlou (Ed.), Research on English as a Foreign
Language in Cyprus. University of Nicosia. Nicosia, Cyprus (pp. x-xx).
Pavlou, P. & Ioannou-Georgiou, S. (forthcoming). The Use of Tests and Other
Assessment Methods in Cyprus State Primary School EFL Classes. In D.
Tsagari (Ed.), Research in English as a Foreign Language in Cyprus. Volume
II. University of Nicosia Press. Nicosia, Cyprus.
Qi, L. (2004). Has a high-stakes test produced the intended changes? In L. Cheng
& Y. Watanabe with A. Curtis (Eds.), Washback in language testing: research
contexts and methods. New York: Erlbaum (pp. 171–90).
Rea-Dickins, P. (2008). Classroom-based assessment. In E. Shohamy & N. H.
Hornberger (Eds.), Encyclopedia of language and Education (2 Ed., Vol. 7,
pp. 1–15).
Rea-Dickins, P. (2007). Classroom-based assessment: possibilities and pitfalls. In
C. Davison & J. P. Cummins (Eds.), The International Handbook of English
Language Teaching. Norwell, Massachusetts: Springer Publications (Vol. 1,
pp. 505–520).
Rea-Dickins, P. (2000). Classroom assessment. In T. Hedge. Teaching and learn-
ing in the language classroom. Oxford: Blackwell Publishing (pp. 375–401).
Ross, J. A. (2006). The reliability, validity, and utility of self-assessment. Practi-
cal Assessment Research & Evaluation, 11(10), 1–13.
Ross, J. A., Rolheiser, C. & Hogaboam-Gray, A. (2002). Influences on student
cognitions about evaluation. Assessment in Education, 9(1), 81–95.
Sadler, D. R. (1989). Formative assessment and the design of instructional sys-
tems. Instructional Science, 18(2), 119–144.
Sambel, K., McDowell, L. & Brown, S. (1997). ‘But is it fair?’: an exploratory
study of student perceptions of the consequential validity of assessment. Stud-
ies in Educational Evaluation, 23(4), 349–371.
Smith, C. D., Worsfold, K., Davies, L., Fisher, R. & McPhail, R. (forthcoming).
Assessment literacy and student learning: the case for explicitly developing
students’ ‘assessment literacy’. To appear in Assessment & Evaluation in
Higher Education.
Stiggins, R. (1994). Student-centered classroom assessment. Ontario: Macmillan
College Publishing Co.
Stiggins, R. J. (2001). Student-involved classroom assessment. New Jersey: Mer-
rill Prentice Hall.
Stralberg, S. (2006). Reflections, Journeys, and Possessions: Metaphors of
Assessment Used by High School Students. Teachers College Record, Date
Published: July 05, 2006 ID Number: 12570,

Struyven, K., Dochy, F. & Janssens, S. (2002). Students’ perceptions about assess-
ment in higher education: a review. Paper presented at the Joint Northumbria/
EARLI SIG Assessment and Evaluation Conference: Learning communities
and assessment cultures, University of Northumbria at Newcastle.
Struyven, K., Dochy, F., & Janssens, S. (2005). Students’ perceptions about eval-
uation and assessment in higher education: A review. Assessment & Evalu-
ation in Higher Education, 30(4), 325–341.
Tarnanen, M. & Huhta, A. (2011). Foreign language assessment and feedback
practices in Finland. In D. Tsagari, & I. Csépes (Eds.), Classroom-based lan-
guage assessment. Frankfurt: Peter Lang (pp. 129–146).
Taylor, L. (2009). Developing assessment literacy. Annual Review of Applied Lin-
guistics, 29, 21–36.
Tsagari, D. (forthcoming 1). Investigating the face validity of Cambridge ESOL
exams in the Cypriot context. Cambridge ESOL Funded Research Program
Round 3, Research Programmes 2011, University of Cambridge, ESOL Exam-
inations, UK.
Tsagari, D. (forthcoming 2). Classroom evaluation in EFL state-schools in Greece
and Cyprus: towards ‘assessment literacy’. To appear in Journal of Applied
Linguistics, JAL, Annual Publication of the Greek Applied Linguistics Asso-
Tsagari, D. & Csépes, I. (Eds.) (2011a). Classroom-based language assessment.
Frankfurt: Peter Lang.
Tsagari, D. (2011b). Investigating the ‘assessment literacy’ of EFL state school
teachers in Greece. In D. Tsagari, & I. Csépes (Eds.), Classroom-based lan-
guage assessment. Frankfurt: Peter Lang (pp. 169–190).
Tsagari, D. (2009). The Complexity of Test Washback: An Empirical Study. Frank-
furt am Main: Peter Lang GmbH.
Tudor, I. (1996). Learner-centredness as language education. Cambridge: Cam-
bridge University Press.
Van de Watering, G., Gijbels, D. Dochy F. & Van der Rijt, J. (2008). Students’
assessment preferences, perceptions of assessment and their relationships to
study results. High Education, 56(6), 645–658.
Vlanti, S. (2012). Assessment practices in the English language classroom of
Greek Junior High School. Research Papers in Language Teaching and
Learning, 3(1), 92–122. Retrieved from
Vogt, K. & Tsagari, D. (forthcoming). Assessment Literacy of Foreign Language
Teachers: Findings of a European Study. To appear in Language Assessment

Wall, D. (2005). The Impact of High-Stakes Examinations on Classroom Teach-
ing: A Case Study Using Insights from Testing and Innovation Theory. Cam-
bridge: Cambridge University Press.
Watanabe, Y. (2011) Teaching a Course in Assessment Literacy to Test Takers: Its
Rationale, Procedure, Content and Effectiveness. Cambridge ESOL: Research
Notes, 46, 29–34. Retrieved from
Weekers, A. M., Brown, G. T. L., & Veldkamp, B. P. (2009). Analyzing the dimen-
sionality of the students’ conceptions of assessment inventory. In D. M. McIn-
erney, G. T. L. Brown, & G. A. D. Liem (Eds.), Student perspectives on assess-
ment: What students can tell us about assessment for learning. Charlotte, NC:
Information Age Publishing (pp. 133–157).
Wenden, A. L. (1998). Metacognitive knowledge and language learning. Applied
Linguistics, 19(4), 515–537.
Williams, J. (2011). Research Based Approaches to Assessment and Evaluation
Research. Paper presented at New Direction: Assessment and Evaluation
Symposium organized by British Council. Kuala Lumpur, Malaysia (6–8 July
Xie, Q. (2011) Is Test Taker Perception of Assessment Related to Construct Valid-
ity? International Journal of Testing, 11( 4), 324–348.
Yang, A. and Lau, L. (2003). Student attitudes to the learning of English at sec-
ondary and tertiary levels. System, 31(1), 107–123.

Computer-based Language Tests in
a University Language Centre

Cristina Pérez-Guillot
Julia Zabala Delgado
Asunción Jaime Pastor
Centro de Lenguas. Universitat Politècnica de València

The Language Centre of the Universitat Politècnica de València has been progressively adapting
its language levels and course contents to the new educational standards defined in the Common
European Framework of Reference for Languages (CEFR) (Council of Europe, 2001). This chap-
ter describes the advantages of computer-based tests for the development of placement as well as
achievement tests, which comply with the language competences established by the CEFR. The
computer tool to be described in this chapter is based on a previous experience that assisted in suc-
cessfully implementing a paper-based placement test based on the course contents and levels that
were being delivered and set at the Language Centre. The original placement test underwent a
number of changes to include the linguistic competences defined by the CEFR, and has also
been progressively modified for partial use in the development of achievement tests. The use of this
computer-based tool for the administration and management of placement and achievement tests has
allowed us to optimise our available human and technological resources and expand our knowledge
of the linguistic skills and abilities defined in the CEFR.

Key words: computer-based tool, placement test, achievement test, CEFR, language centres.

1. Introduction

The Common European Framework of Reference for Languages (CEFR) (Council
of Europe, 2001) provides a common basis for the elaboration of language
syllabuses, curriculum guidelines, tests, examinations, etc. across Europe. It
describes in a comprehensive way what language learners have to learn to do in
order to use a language for communication and what knowledge and skills they
have to develop so as to be able to communicate effectively.
Among the uses of the CEFR, of particular importance is its use for the plan-
ning of language certification in terms of the content syllabus of examinations
and assessment criteria. To implement the CEFR system, Language Centres (and
other educational institutions for that matter) have gradually adapted the design
and content of their courses to the new definition of language levels.


Regardless of the differences in the manner of implementing the CEFR levels,
a general problem common to most Language Centres is how to quickly and
effectively determine the knowledge level of students for the purpose of their
enrolment in courses at the beginning of the academic year and also for the pur-
pose of certification of language competences by means of achievement tests at
the end of each academic term.
After studying the different options available, we considered that the best
solution was to use a computer-based test, given its flexibility of use and the
rapid analysis of results. Although there are commercial programmes available
for determining the students’ level, their cost is usually high (similar observations
can be found in Taylor, Kirsch, Jamieson & Eignor, 1999; Hinkelman & Grose,
2004; Soh, Samal, Person, Nugent & Lang, 2004). Also, commercial tests and
exam software tend to consider only the European Framework levels. Therefore,
tests on the market are unlikely to be adapted to the characteristics of our
institution, the Language Centre (CDL) of the Universitat Politècnica de València
(UPV). At the CDL, we have developed a computer-based test that allows the
assessment of a large number of students in a reliable and fast manner. The
advantages of a custom-designed test include greater flexibility, the immediate
release of results, lower stress for the candidates, management of large volumes of
students, optimisation of time and human resources, lower costs and adaptation
to the specific requirements of the centre (see also Cubeta, 2008; Sartirana, 2008).
In this chapter we will describe the procedure followed in the implementation
of computer-based language tests in the present context of inquiry, namely a
University Language Centre.

2. Considerations for test design

Our first experience involved placement tests, since we had several years of expe-
rience in this field, and the feedback obtained was used to implement our tool for
achievement tests.
The design of placement or achievement tests for students in a Language Centre
such as the CDL takes into account the need for the quick assessment of large
numbers of students without giving rise to high financial and human resource costs.
Creating effective tests is one of the main challenges faced by both teachers
and Language Centres, not only regarding content and design but also in terms of
validity and reliability.
A key aspect to take into account when developing a test is to determine the
purpose for which it is designed, that is, its usefulness: “The most important
consideration in designing a test is the use for which it is intended so the most
important quality criterion is its usefulness” (Bachman & Palmer, 1997, p. 17). Indeed,
the design of a test is a two-way process in that the same type of test can be used
for different objectives and in fact its purpose may determine the type of test to
be used. Depending on their purpose, tests can be used either as a source of infor-
mation for decision making in an educational setting or as a tool for investigating
specific aspects of language (Bachman, 1990, p. 52). The purpose of placement
tests is to distribute students into groups based on their language skills before the
course starts (Jean Jimenez, 2008, p. 84), whereas achievement tests measure the
outcomes of students after instruction (Bachman, 1990, p. 70).
Regardless of the type of test we are using, tests should meet two basic
requirements: validity and reliability (Bachman & Palmer, 1997). According to
Alderson, Clapham and Wall (1995), validity refers to the extent to which a test
measures what it is intended to measure, i.e. its purpose, whereas reliability refers to
the consistency of test results in that a reliable test gives consistent results across
a wide range of situations.
With regard to validity, the question is not whether a test is valid or invalid
but rather what its level of validity is (Caruso, 2008, p. 38). In fact, as pointed
out by Alderson et al. (1995), the validity of a test depends not on the test itself
but rather on what it is used for. Therefore, the structure of an achievement test,
whose purpose is to assess the attainment of the course learning objectives, will
not be the same as that of a placement test, which serves to distribute students in
the different courses based on their knowledge (Hughes, 1989).

3. Context

The Centro de Lenguas of the Universitat Politècnica de València is a University
Centre, which offers language services to the university community. One of its
main lines of action is to provide supplementary training in foreign languages so
as to facilitate the integration of its members in the European Higher Education
Area, which is an essential part of the philosophy of internationalisation charac-
terising the UPV.
The structuring of the courses offered by the CDL is governed by the aca-
demic calendar, which in Spanish universities is divided into two semesters, fall
semester from September till December and winter semester from February till
May, with an exam period in the month of January. Therefore, there is a need to
test a large number of new students in both the first and second semester in a very
short period of time, especially in the case of the first semester due to its shorter
duration (three teaching months).

Several attempts had been made to successfully place students into groups.
Our first attempt involved oral interviews, where only communicative skills were
evaluated. This procedure proved to be useful although incomplete, since other
skills could not be assessed and the range of vocabulary used was limited. It also
proved to be very costly and time consuming (a month was required to complete
the process).
In order to overcome these drawbacks, a paper-based test was designed to be
administered to all students. The paper test consisted of a multiple choice gram-
mar section and a written section. Those who demonstrated that they had a level
higher than B1 were then required to do an oral interview whereas students scor-
ing below B1 level were directly assigned to a course without the need for an oral
interview. These changes substantially improved the process and reduced both
time and costs. Furthermore, this led to a more balanced learning environment,
since by evaluating other skills besides speaking, the differences in the level of
the students in each group were smaller and thus their learning needs were very
similar.
Nevertheless, since the test was human-corrected, it was still time consuming
and the process was not as optimised as initially desired. Therefore, we found that
the best solution for our centre was a combination of a computer-based test with
integrated listening tasks, followed by oral interviews with human examiners.
Our decision was based on the fact that we had about 700 students each semester
whose level had to be evaluated in a maximum of one week so as to be able to
keep up with schedule constraints. We considered this solution to be cost effec-
tive, flexible to use and suitable for rapid processing of results. Furthermore, since
our paper-based test was based on a combination of multiple-choice questions, it
meant that our items could be incorporated into a computer platform for machine
testing with minimal modifications.
To this end, a thorough study of different computer systems was carried out
to select the most suitable software for our purposes. Such software
would need to enable us to specify the number, type (grammar, syntax,
vocabulary and so on) and the CEFR level of the questions needed to accurately
determine the level of each student taking into consideration that automatically
corrected tests allow for only minimal subjective deviations. We should mention
that one of the factors that were taken into account was the need for accuracy,
since there was no human corrector to supervise the results.
With respect to the computer software chosen, we needed it to be inexpensive
since our language centre belongs to a public university and financial resources
are limited, easily adaptable to our needs (e. g. possibility of including multime-
dia clips), user friendly for administrators and with a customisable interface to
conform to our corporate image.

The PARIS platform (see footnote 47), which is proprietary UPV software for developing
exams and analysing items and their results, was finally considered to be the most
appropriate since it covered the aforementioned needs and could be customised
with no need for copyright considerations since it already belonged to the univer-
sity. Furthermore, and even if our initial aim was to use multiple choice and short
answer questions for our tests, PARIS offered a wider range of options to choose
from: multiple choice questions (single or multiple answers possible, admitting
true or false questions), short answer questions (including embedded answers
such as gap filling or cloze tests), long answer questions (in which the examiner
can introduce the word or combination of words that the answer must contain),
open answer (an automatic correction is not possible in this case, but the examiner
can introduce feedback for the student), matching questions (matching one or
multiple concepts), sequencing (establishing the order of words or sentences), and
numerical questions (in which the solution is a single value and needs to satisfy
one or more logical conditions such as ‘>n1 O <n2’ or ‘=n1 Y =n2’, where O and Y
are the Spanish connectives for OR and AND).
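
The logical conditions attached to numerical questions can be pictured with a short sketch. The Python fragment below is illustrative only and is not PARIS code: the function name `check_numeric_answer`, the list-of-pairs representation of the conditions and the set of comparison symbols are all assumptions, with the connectives O and Y read as OR and AND.

```python
import operator

# Comparison symbols mapped to Python functions (illustrative subset).
COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
}


def check_numeric_answer(value, conditions, connective="Y"):
    """Return True if `value` meets the conditions.

    conditions: list of (symbol, bound) pairs, e.g. [(">", 3), ("<", 7)]
    connective: "Y" (AND) or "O" (OR), following the notation in the text
    """
    results = [COMPARATORS[symbol](value, bound) for symbol, bound in conditions]
    if connective == "Y":
        return all(results)
    if connective == "O":
        return any(results)
    raise ValueError(f"unknown connective: {connective!r}")
```

Under these assumptions, an answer of 5 satisfies ‘>3 Y <7’, whereas an answer of 9 satisfies only the weaker ‘>3 O <7’.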
The number of questions used in the test was based on our previous non-com-
puterised experiences. The number of 50 questions was carefully chosen since
we needed a large enough number of items to cover all CEFR levels but not too
many as this would require more time than desirable for test completion. All our
questions had been formulated according to can-do competences defined in the
CEFR for each of the levels evaluated, and had been used for final course exams
for several years. The questions that had not been answered correctly by most of
the students in the level were eliminated from the database, since we considered
that their validity was compromised. So the questions that were finally included
in the database for each of the levels were those that had been satisfactorily
answered by most of the students in that level during their final exam.
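
The filtering rule described above can be sketched as follows. This is a hedged illustration rather than the procedure actually run at the CDL: “most of the students” is assumed here to mean a proportion of correct answers strictly above 0.5, and the function names, the item identifiers and the layout of the response log are hypothetical.

```python
def facility(responses):
    """Proportion of correct answers; `responses` is a list of booleans."""
    return sum(responses) / len(responses)


def retain_items(response_log, threshold=0.5):
    """Keep the ids of items whose facility is strictly above `threshold`.

    response_log: dict mapping item id -> list of booleans (True = correct),
    one entry per student who sat the final exam at the item's level.
    """
    return [
        item_id
        for item_id, responses in response_log.items()
        if responses and facility(responses) > threshold
    ]


# Hypothetical final-exam data for three B1 items.
log = {
    "B1-017": [True, True, False, True],     # 0.75 -> retained
    "B1-022": [False, False, True, False],   # 0.25 -> eliminated
    "B1-030": [True, False],                 # 0.50 -> eliminated (not "most")
}
kept = retain_items(log)
```

The threshold is a parameter precisely because the text does not state an exact cut-off.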
Since our initial goal was to accurately determine the level of as many students
as possible in a minimal amount of time and at the lowest possible cost, the
integrated listening tasks would be automatically corrected (single-answer multiple
choice questions) and would help to further adjust the results and thus eliminate
the need for some of the oral interviews, reducing the need for human correctors
and optimising the whole process in terms of time and cost.

47 Programa para la Confección y Realización de Exámenes, Análisis y Obtención de Resultados
de Items – Software for Test Development and Data Analysis.

4. Methodology

The computer tool was developed according to the needs and constraints specified
above and consequently it was designed within a yearly timeline which included
several interrelated phases, as well as the use of placement tests for course assign-
ment at the beginning of the teaching period and achievement tests after course
completion in each semester.

4.1 Design of the placement test

The test items, as well as the final tests, were developed by a group of six expert
teachers from the Language Centre with long experience in language teaching
and testing, who are familiar with the European Framework of Reference for
Languages. All the members of the team involved in the project are also members
of official certificate examiner boards (e. g. Cambridge certificates, UPV Official
Accreditation, Chamber of Commerce of Paris, among others) and are well aware
of the specific needs of a university community with a technical profile.
The software was developed by computer experts working closely with Lan-
guage Centre personnel so as to be able to comply with the specifications of the
institution. As mentioned above, a custom-designed computer-based test had
several advantages that fit the constraints of our institution, among them its
flexibility, rapid processing of results and cost and time effectiveness. Flexibility was
an important factor to take into consideration since the use of a large bank of
items allowed us to easily generate different tests to discourage cheating
during an exam session and in between sessions. Moreover, with very few modi-
fications in the type and levels of the questions chosen from the database, the pro-
gramme could be used to develop not only placement tests but also achievement
tests. The rapid processing of results allowed for large numbers of students to be
tested simultaneously, obtaining the results almost immediately and assigning a
level to each student accordingly. Cost and time effectiveness were evident since
once the database of questions had been generated, a single teacher to monitor the
classroom and to revise results and an IT technician for technical problems and
processing were all that was needed for examining around 150 students per ses-
sion. The size of the computer classroom was the only limitation for the number
of students admitted per session.
The computer-based test consists of two parts: a computer-based section with
50 multiple choice and cloze questions testing grammar, vocabulary and syntax,
and a listening test consisting of three passages for levels B1, B2 and C1 with
multiple-choice questions. Students who score above the B1 level are interviewed
in pairs in order to judge more precisely which course best suits their skills.
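
The routing rule for the oral interview can be expressed as a simple comparison on the ordered CEFR scale. The sketch below is an assumption-laden illustration: the mapping from raw scores to a provisional level is not given in the text and is therefore left out, and the names `CEFR_ORDER` and `needs_interview` are hypothetical.

```python
# CEFR levels in ascending order of proficiency.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def needs_interview(level, cutoff="B1"):
    """True if the provisional level is strictly above the cutoff level."""
    return CEFR_ORDER.index(level) > CEFR_ORDER.index(cutoff)
```

A candidate provisionally placed at B2 would be called for a paired interview, while one placed at B1 or below would be assigned to a course directly.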
The grammar, vocabulary and syntax questions for each level were extracted
from the illustrative scales defined by the CEFR which describe language
reception (listening and reading), interaction (spoken and written), and pro-
duction (spoken and written) and were further linked to the can-do competence
descriptors given for each level. Each question was classified according to the
level, type of competence assessed, category (grammar, syntax, vocabulary) and
can-do descriptor covered. The passages for the listening texts were chosen and
recorded according to CEFR specifications for each of the levels. For example,
for B1 the CEFR states: “Can understand straightforward factual information
about common everyday or job related topics, identifying both general messages
and specific details, provided speech is clearly articulated in a generally familiar
accent” (Council of Europe, 2001, p. 8). A database of approximately 1000 items
was generated to allow for the creation of different multileveled test models for
each testing session.
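
A minimal sketch of how a classified item bank can yield different multi-level tests for each session is given below. It is not the PARIS implementation: the `Item` fields mirror the classification described above, but the class, the function `assemble_test` and the per-level blueprint are assumptions, since the text does not state how the 50 questions are distributed across levels.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    level: str        # CEFR level, e.g. "A2", "B1"
    category: str     # "grammar", "syntax" or "vocabulary"
    descriptor: str   # can-do descriptor the item is linked to


def assemble_test(bank, items_per_level, seed=None):
    """Draw a random multi-level test from the bank.

    items_per_level: hypothetical blueprint such as {"A2": 10, "B1": 10}.
    A fixed `seed` reproduces the same test; omit it for a fresh draw.
    """
    rng = random.Random(seed)
    test = []
    for level, count in items_per_level.items():
        pool = [item for item in bank if item.level == level]
        if len(pool) < count:
            raise ValueError(f"not enough {level} items in the bank")
        # Sampling without replacement: no item appears twice in one test.
        test.extend(rng.sample(pool, count))
    return test
```

Because each session draws a different sample from the same classified pool, two tests share a blueprint without sharing all their items.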
Additionally, as we had to divide CEFR levels into two parts to adapt our
courses to the two university terms, part A being offered during the Fall Semester
and part B during the Winter Semester, a large database of items facilitates the
development of placement tests tailored to each academic semester. Furthermore,
the flexibility of the tool allows tests to be designed at exact levels of language
proficiency for use in specific purpose courses. Having the questions divided not
only by level but also by type of competence, category and can-do descriptor
assessed allowed us to design placement tests for specific courses with minimum
extra cost and with the same advantages mentioned above: tests were designed
quickly, results were easily processed and the staff required was minimal.

4.2 Design of the achievement test

Our placement tests have been running for two years now with more than 3000
students placed successfully, as the final exams for the courses as well as the
reports received from the teachers indicate.
The success of the computer tool encouraged us to adapt the tests for assessing
the outcomes of students after completing the different courses offered by the CDL.
As the items in the initial database used in the placement tests were classified
according to the different CEFR levels, grammar aspects covered and level of
difficulty, it was easy to select the items which had to be included in the achievement
tests in the ‘Use of English’ section. However, we had to restructure the Listening
section and add a Reading and a Writing section to the programme.
The Listening section included two recordings for each CEFR level. The
questions were of multiple-choice or one-word answer type, with a total of 10
questions per task and a total length of 20 minutes for both tasks.
Similarly, a new Reading section was added to the programme, consisting of
two tasks, which varied in length and difficulty depending on the level being
assessed. The types of items used to develop the Reading section included
matching questions (e.g. matching text and figures in the case of levels A2 or
B1; matching opinions for levels B2 or C1), reordering sentences according to a
sequence of events (for levels B1 and B2), one- or two-word answers (for levels A2 to
B2), true/false items (B2 to C1), and multiple-choice questions (A2 to C1). The time
allocated to this section also varies depending on the CEFR level assessed, from
20 minutes for the A2 level to 60 minutes for the C1 level.
As regards the Writing section, although students have to do it through the
computer, the final correction of this section has to be done by human correctors,
who are sent the candidates’ writings anonymously to avoid any bias. This section
consists of two tasks, whose length and difficulty vary for the different CEFR
levels. The time allocated to complete this section ranges from 30 minutes for A2
to 60 minutes for C1.
The assessment of the Speaking skill is done by the teachers who have delivered
the course. The instructors have had the opportunity to monitor the students’
progress throughout the course and so can decide whether each student has achieved
the corresponding level. This way of assessing speaking has proved to be more
effective than using oral interviews, as it substantially reduces students’ stress,
increases their confidence and takes into account the overall speaking performance of
the candidate in many different situations.

4.3 Test management

Although the computer lab of the CDL has 24 workstations, only 20 are activated,
leaving four computers free for the purpose of solving technical problems.
Both placement and achievement tests comprise several sections, each of
which has a different time limit depending on the type of activity and the level
of the exam. The system consists of a single interface used for all tests and it has
a timer that displays the amount of time left for the completion of each section.
Students can review their answers as long as there is time remaining.

Following the completion of the test, the data is imported into an Excel spread-
sheet, where the initial codes assigned to the students to provide anonymity are
matched to their IDs to obtain the results of the Reading and Listening sections.
The Writing and Speaking sections are evaluated by human markers, who are
professionals with extensive experience in grading written and oral
exams. To ensure that all the oral examiners apply the same evaluation
criteria, they are provided with correction guidelines. The correction guidelines
provide descriptors regarding the competences and skills students are expected
to have. These refer to both specific competences and their global achievement.
Concerning the specific characteristics of the interface used as seen by stu-
dents, teachers and administrators, Figure 1 shows the test design screen used for
the input of the parameters that control the administration of a particular session
in real time. The interface allows us to control the duration of the test, the devel-
opment of the testing session (starting, stopping and re-starting of the test), etc.
There is also an option to prevent cheating that allows the examiner to rotate the
questions or the order of the answers to each question.

Figure 1. Test design screen
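
The anti-cheating option described above, rotating the questions or the order of the answers, can be sketched as a pair of shuffles. The fragment is an illustration under stated assumptions, not a description of PARIS internals: the dictionary keys `stem`, `options` and `answer` and the function name `rotate_test` are hypothetical.

```python
import random


def rotate_test(questions, seed=None):
    """Return a copy of the test with questions and options reordered.

    questions: list of dicts with keys "stem", "options" (list of str)
    and "answer" (the text of the correct option).
    """
    rng = random.Random(seed)
    rotated = []
    # rng.sample over the full list yields a shuffled copy of the questions.
    for question in rng.sample(questions, len(questions)):
        options = rng.sample(question["options"], len(question["options"]))
        rotated.append({
            "stem": question["stem"],
            "options": options,
            # Store the key by position so marking survives the reordering.
            "answer_index": options.index(question["answer"]),
        })
    return rotated
```

Recording the position of the correct option after shuffling is the design point: automatic marking keeps working however the options are displayed to each candidate.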

Figure 2 displays the initial screen on the student’s interface, where students can
choose the test they are going to take and the part of the test they want to take
first, although they are advised to take the test in the assigned order. In the case
of the example, there are two different level tests (A2 and B1) available on the
screen, which allows different groups of students to take different tests in the
same exam session.

Figure 2. Test selection screen

Figure 3 shows a question as seen on the examiner’s interface before test admin-
istration. The screen shows the date when the question was created, the level,
author and type of item (grammar, vocabulary, etc.) and there is a field for keywords,
which will facilitate searching for this particular item in the future. On
this screen, the examiner can enter the value of the question and assign a
negative value to wrong answers, to be automatically calculated by the computer.
There is also a field for the level of difficulty assigned to the item,
which is established based on the number of right and wrong answers given to the
question during test administration.

Figure 3. Examiner’s interface
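
The automatic calculation of item values and negative values for wrong answers can be sketched as follows. This is a minimal illustration, not the programme's actual routine: the function name `score_test`, the field names and the rule that blank answers score zero are assumptions made for the example.

```python
def score_test(responses, items):
    """Sum item values for right answers and subtract penalties for wrong ones.

    `items` maps an item id to a dict with the hypothetical fields "key"
    (correct option), "value" (points for a right answer) and "penalty"
    (points deducted for a wrong answer). `responses` maps an item id to
    the option chosen, or None if the item was left blank.
    """
    total = 0.0
    for item_id, item in items.items():
        answer = responses.get(item_id)
        if answer is None:
            continue  # assumption: a blank answer neither earns nor loses points
        if answer == item["key"]:
            total += item["value"]
        else:
            total -= item["penalty"]
    return total


# Hypothetical three-item section: one right, one wrong, one blank
items = {
    "q1": {"key": "b", "value": 1.0, "penalty": 0.25},
    "q2": {"key": "a", "value": 1.0, "penalty": 0.25},
    "q3": {"key": "d", "value": 2.0, "penalty": 0.5},
}
responses = {"q1": "b", "q2": "c", "q3": None}
print(score_test(responses, items))  # 1.0 - 0.25 = 0.75
```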

The question screen seen on the student’s interface is a simplified version of the
examiner’s screen, with a straightforward layout that allows students to go back
and forth within the test provided there is time remaining. At the end of the test,
the students need to click on ‘stop the test’ to confirm their answers and send the
results, which will be stored for further processing.
Finally, Figure 4 presents the results screen on the examiner interface where
each student’s name is paired to his/her ID. There is a column for the level of the
test taken and separate columns for the grades obtained in each section.
Figure 4. Screen showing processed results

As can be seen from Figures 1–4, the interface is user friendly and little IT
assistance is needed for everyday use, although some IT help is required for
managing the databases. Students do not need any previous
knowledge of the programme, nor do they require specific computer skills,
which facilitates test administration.
Additionally, after each test the programme offers a wide range of options for
a better fitting of the scores, such as elimination of ill-defined items, selection of
the rating scale and recalculation of the difficulty and discrimination indices to
assess the validity of each item (Figure 5).

Figure 5. Rating options

Another important advantage of the tool is the display of results in different for-
mats for further analysis. In this sense, the results can be downloaded to an Excel
spreadsheet which shows the data in different ways, for example, a list of all stu-
dents with their answers, each student’s right/wrong/blank answers, total score
and answers selected, sum of the item values, the final weighted values and difficulty
and discrimination indices for each item. As an illustration, Figure 6 shows
the percentage of right, wrong and blank answers and the indices of the items in
the Use of English Section of the A2 test.

Figure 6. Percentage of right, wrong and blank answers and difficulty and discrimination indices
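
The text does not state how the difficulty and discrimination indices are computed. The sketch below assumes the common classical definitions: difficulty as the proportion of right answers, and discrimination as the difference in that proportion between the highest-scoring and lowest-scoring groups of students (27% each by default). The function name `item_analysis` and the data layout are hypothetical.

```python
def item_analysis(responses, key, group_fraction=0.27):
    """Compute per-item percentages and classical indices.

    `responses` holds one list per student with the option chosen for each
    item (None for a blank); `key` lists the correct options.
    """
    n_students = len(responses)
    n_items = len(key)
    # 1 for a right answer, 0 for a wrong or blank one
    correct = [[1 if r[i] == key[i] else 0 for i in range(n_items)] for r in responses]
    totals = [sum(row) for row in correct]
    # Rank students by total score to form the upper and lower groups
    order = sorted(range(n_students), key=lambda s: totals[s], reverse=True)
    k = max(1, round(n_students * group_fraction))
    upper, lower = order[:k], order[-k:]

    results = []
    for i in range(n_items):
        right = sum(row[i] for row in correct)
        blank = sum(1 for r in responses if r[i] is None)
        wrong = n_students - right - blank
        p_upper = sum(correct[s][i] for s in upper) / k
        p_lower = sum(correct[s][i] for s in lower) / k
        results.append({
            "pct_right": 100 * right / n_students,
            "pct_wrong": 100 * wrong / n_students,
            "pct_blank": 100 * blank / n_students,
            "difficulty": right / n_students,     # assumed: proportion correct
            "discrimination": p_upper - p_lower,  # assumed: upper minus lower group
        })
    return results


# Hypothetical data: four students, two items
answers = [["a", "b"], ["a", "c"], ["a", None], ["b", "c"]]
for row in item_analysis(answers, key=["a", "b"]):
    print(row)
```

Under these definitions, an item with a discrimination value close to zero or below would be a candidate for elimination or revision, which is the kind of decision the rating options are described as supporting.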

The data can also be represented graphically using charts. This permits the rapid
observation of results, highlights any possible error and enables examiners to redefine
the weights allocated to each item (see Figure 7).
As we have seen, the software used can be easily modified to provide a reliable
set of results as it permits the modification of the scoring scale and item weights
after the administration of each test. The data obtained after each test is also used
to redefine the attributes of the items in the database depending on the performance
of each item, so that some items will be eliminated from the bank and from the
CEFR level originally assigned to them, while others will be modified accordingly.
Figure 7. Graphical representation of data

5. Future development

The new degrees offered by Spanish universities require students to possess a B2
level of language competence when they finish their university studies. To respond to
this new context, Spanish language centres have developed a common framework
for the design of language certificates.
In this sense, the Spanish Association of University Language Centres
(ACLES) is in the process of creating a single language certificate that would
assess language competence at CEFR levels and be recognised nationally and
internationally (Asociación de Centros de Lenguas en la Enseñanza Superior, 2011).
The Language Centre of the UPV played an active role in this process and
decided to use the programme designed for placement and achievement tests to
develop a computer-based language exam that could be used for certification pur-
poses. The requirements for this certificate are as follows:

• The four major skills (oral and written expression, oral and written comprehension)
have to be evaluated, either in an integrated way or separately.
• All examination tasks have to be based on real-life tasks and, whenever
possible, textual material has to be taken from real sources, depending on the
level to be evaluated.
• Whatever the level being certified, the duration of examinations should be no
less than 70 minutes and no more than 250 minutes and, wherever possible, the oral
examination should consist of an interview with two examiners.
With this in mind, the Certification test will consist of a writing section, a
reading comprehension section, a speaking section and a listening section.
Nonetheless, we need to take into consideration the limitations of our tool,
since it had originally been created for placement and achievement testing, and
therefore, a combination of automatic correction for the reading and listening
sections and human correction for the speaking and writing sections needs to
be implemented.
Additionally, the Listening and Reading sections will have to be re-designed
to adapt to the specifications of the Certificate regarding aspects such as time
length, text types, and question types, among others.
Similarly, future research will focus on incorporating the Speaking tasks in
the programme as this will greatly facilitate test administration while reducing

6. Conclusions

Using computers for both placement and assessment of language proficiency
makes it possible to optimise the management of a large number of students, reducing costs,
time and human resources, while increasing student satisfaction. Similarly, the
flexibility of the tool permits the fast creation of tests for different purposes.
The use of computer-based level tests for the placement of students according
to their language proficiency is the best solution in a context such as Language
Centres managing large numbers of students in limited periods of time. It offers
flexibility, rapid processing of results and cost and time effectiveness. Similarly,
the use of a computer tool for testing the students’ outcomes at the end of courses
also proved to be an objective method, as it was a way of avoiding teachers’ bias
when marking exams.
The significant benefits obtained from the use of the system have encouraged
us to extend it to the accreditation of language competence, as we consider
that it will reduce costs and time, both of
which are paramount for universities nowadays. This is reflected in the dramatic
reduction in the number of teachers required to deliver and mark the test (from 25
to 6, just for the speaking and writing sections).
However, there are still some limitations, since the number of tests delivered
depends on the number of workstations available, requiring many sessions to
handle all the candidates. Additionally, technical problems may arise during test
delivery and have to be solved.
In the near future we will also consider the possibility of incorporating the
speaking and writing sections, which are currently paper-based, so that candidates’
responses can be recorded and saved electronically and then sent to human
correctors, who would evaluate
them and report the results back to administration to calculate the overall final
mark. This will allow us to use our tool for the implementation of language certification
exams in tertiary education as a requirement of the university degrees in
compliance with the Bologna process.


Asociación de Centros de Lenguas en la Enseñanza Superior. (2011). Modelo de
Acreditación de Exámenes de ACLES. Madrid: Gráficas Chiqui.
Alderson, J. Ch., Clapham, C., & Wall, D. (1995). Language Test Construction
and Evaluation. Cambridge: Cambridge University Press.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing.
Oxford: Oxford University Press.
Bachman, L. F., & Palmer, D. (1997). Language testing in practice: designing and
developing useful language tests. Oxford: Oxford University Press.
Caruso, A. (2008). Designing an achievement test for University students: a look
at validity and reliability. Testing in University Language Centres. Quaderni
di Ricerca del Centro Linguistico d’Ateneo Messinesse. Messina: Rubbettino
Editore, 37–48.
Council of Europe. (2001). Common European Framework of Reference for Languages:
Learning, Teaching, Assessment. Cambridge: Cambridge University Press.
Cubeta, G. (2008). Towards mobile multimedia-based teaching and testing: an
interim progress report. Testing in University Language Centres. Quaderni
di Ricerca del Centro Linguistico d’Ateneo Messinesse. Messina: Rubbettino
Editore, 49–63.
Hinkelman, D., & Grose, T. (2004). Placement Testing and Audio Quiz-Making
With Open Source Software. Proceedings of CLaSIC. Current Perspectives
and Future Directions in Foreign Language Teaching and Learning, 974–981.

Hughes, A. (1989). Testing for Language Teachers. Cambridge: Cambridge Uni-
versity Press.
Jean Jimenez, D. R. (2008). An investigation into the factors affecting the con-
tent and design of achievement tests. Testing in University Language Centres.
Quaderni di Ricerca del Centro Linguistico d’Ateneo Messinesse. Messina:
Rubbettino Editore, 83–96.
Sartirana, L. M. (2008). Dall’esame su carta all’ esame su computer: analisi
della prova di inglese informatizzata presso il Servizio Linguistico d’Ateneo
dell’Università Cattolica. Testing in University Language Centres. Quaderni
di Ricerca del Centro Linguistico d’Ateneo Messinesse. Messina: Rubbettino
Editore, 123–138.
Soh, L. K., Samal, A., Person, S., Nugent, G., & Lang, J. (2004). Designing,
Implementing, and Analyzing a Placement Test for Introductory CS Courses.
Proceedings of the 36th SIGCSE Technical Symposium on Computer Science
Education, 505–509.
Taylor, C., Kirsch, I., Jamieson, J., & Eignor, D. (1999). Examining the relation-
ship between computer familiarity and performance on computer-based lan-
guage tasks. Language Learning, 49 (2), 219–274.

Assessing the Quality of Translations

Diamantoula Korda-Savva48
University of Bucharest

The present chapter sheds light on issues relating to the assessment of translations in the context of
methodological courses. Assessment in translation has always been a thorny issue, because passing
judgment over a translated text is quite a subjective act. On the one hand, there is a range of param-
eters, which determine the quality of a translated work (the acceptability of the language used, the
fidelity to the source language, terminological adequacy, etc.). On the other hand, assessment can
take place in the middle or at the end of the translation process and in a number of types of texts:
literary, technical or general texts. This chapter will touch upon issues relating to all the above and
aspires to give some indications towards the development of assessment in the ‘slippery’ area of
Translation Studies. The research on which this chapter is based was carried out in the University
of Bucharest, Department of Classical Philology and Modern Greek Studies and Department of
English. The main aim was to describe the current situation in Translation courses taught at the
University, and to investigate the extent to which there is room for improvement.

Key words: translation quality, assessment, professional/academic standards, pedagogy.

1. Introduction

The notion of assessment points to a large area of study and international interest.
This is so because assessment is inevitably involved in every human action: from
drawing a picture or working efficiently in the office to taking a language exam.
It is the way in which we can tell whether an action is successful or not, whether a
form of behavior is acceptable or not, etc. Assessment, being such a big umbrella
term, however, can be very subjective and arbitrary, if one does not set a list of
valid criteria, which would limit any subjective and vague judgments. The idea
of assessment is also inevitably present in the area of Translation Studies, which is
the area that this chapter will attempt to explore.
The chapter is divided into three parts: the literature review, where I refer
to a few key figures in Translation Studies who have discussed the
issue of assessing the quality of translations; the second part, where I explain how
professional and academic standards endeavor to define the assessment of translations,
either in a quantitative or in a qualitative way; and the last part,
which discusses the methodology and the findings of a survey I conducted in the
educational context in which I work.


2. Literature review

2.1 The complexity of the translation task

I would like to start with a truism: translation is a complicated task. It involves
many different factors that one has to take into account, such as the sender of the
original message, the content of the message, the package (as it were) of the message,
the recipients of the translated (target) text, the context or, in other words, the receiving
culture (translation is culture bound), and the client or commissioner of the
translation at hand. The complicated nature of this task is also reflected in the
multiplicity of metaphors employed to picture it.
The word ‘translation’ includes within itself a picture: of something that is
being carried across (from the Greek origin of the word ‘translating’= μεταφέρειν).
Across the centuries the task of translation has been likened to different objects
like a mirror where the original text is reflected, a copy of an original painting,
a carpet whose two sides differ a little. All of them inextricably involve also the
idea of loss of meaning or form, which has been the subject of endless discus-
sions among translation theorists and practitioners. Another vivid picture intro-
duced by Chesterman (1997, p. 8) is that of a pendulum swinging from one end to
another: that is from attention to the Source Text to attention to the Target Text.
These pictures function not only as a reflection of the nature of translation itself
but also as a mirror of the theory behind it and the complexity of the task at hand.
Assessing the quality of translations is present at all stages of the translat-
ing process: from understanding and interpreting the source text, right through
knowledge acquisition, documentation and transferring the form and content
into the target language. Gile (1995, p. 101ff) notes in the Sequential Model he
has devised that the translator runs successive tests in formulating, first of all, a
meaning hypothesis and testing it for appropriateness in the context of the origi-
nal text. If the first test fails, then he/she has to go over the same loop again and
again until he/she reaches a plausible meaning. The same tests are run for the reformulation
of the meaning arrived at in the Target Language. All this happens in a
continuum: in an educational context, tutors of translation constantly run control
checks on their students’ translations, and later on
professional translators check the translations of their trainees in a professional
and highly competitive context.
So when referring to assessment I will refer both to the process followed by
teachers in a translation classroom and to the assessment of translations in a more
professional context by reviewers and translation institutions to evaluate the
translated outcome of their trainees.

Having made this differentiation, from a review of the literature, which by no
means is complete or exhaustive, it is observed that different scholars place their
emphasis on different factors. In the following section I make reference to some
of them.

2.2 The scholars’ voice

Bassnett maintains that “any assessment of translation can only be made by taking
into account both the process of creating it and its function in a given context”
(1980, pp. 9–10, my emphasis). She goes on to underline that every translation is
culture bound. This, of course, opens up a huge area of discussion where Cul-
tural Studies are involved and power relations are introduced. We can talk about
hegemonic cultures and how translation has been manipulated for the sake of
controlling people’s fates by shaping their language practices, ideologies and
sometimes even the production of their literature. Bassnett’s proposal reverber-
ates with words that have been landmarks in the Theory of Translation Studies.
For instance, the word ‘process’ triggers off the idea of Think Aloud Protocols
(TAPs) that try to discover what goes on inside the translator’s ‘black box’ (cf.
Tirkkonen-Condit 1989, Bell 1991, Kussmaul 1995). If one wants to evaluate the
quality of the product, it is wise to have a clear idea of the process that took place
before the actual outcome. As for the word ‘function’, it reminds us of the idea
of Skopos theory, which promotes the function of the translated text, known to
the translator beforehand (cf. Vermeer, 1989). Theorists consider the Skopos of
the translation as the yardstick against which the quality of the translation can
be assessed. This significantly shifts the emphasis away from the comparison between
the original text and its translated outcome, with two consequences:
the ‘dethroning’ of the Source text and the independent status of the Target text.
Bassnett goes a step further by commenting that
the problem of evaluation in translation is intimately connected with the […] problem of the
low status of translation which enables critics to make pronouncements about translated texts
from a position of assumed superiority. The growth of Translation Studies as a discipline,
however, should go some way towards raising the level of discussion about translations, and
if there are criteria to be established for the evaluation of a translation, those criteria will be
established within the discipline and not from without. (p. 10)

In the following sections it will be seen that Bassnett’s proposal and wish for the
independence of Translation Studies has been welcomed and celebrated.
Newmark (1988), in his attempt to chart this area of quality assessment, proposes
two approaches to evaluating a translation: the functional and
the analytical. The first is subjective and vague, and therefore unreliable,
whereas the second is detailed and reveals the mistakes that lead a translation
to be judged bad. He notes verbatim: “just as a bad translation is easier
to recognize than a good one, so a mistake is easier to identify than a correct or a
felicitous answer” (ibid, p. 189). Here one could write volumes about the usefulness
of errors in translation and how Error Elimination is a process that reflects
and diagnoses problems that could be explained to trainee translators and act
in a prescriptive way. Newmark concludes: “Ultimately standards are relative,
however much one tries to base them on criteria rather than norms” (ibid, p. 192).
This last suggestion paves the way towards elaborating more on the idea of
norms. The Israeli scholar Toury (1978/2000) first introduced the idea of norms
in Translation Theory. Chesterman (1997), however, gave a twist to this idea in
terms of the assessment of translations (ibid, pp. 64–70). He refers to the following
categories of norms:
• Expectancy norms (relating to the product of translation) refer to norms that are established
by the expectations of readers of translation concerning what a translation should be like.
• Professional norms (relating to the process of translation), which could further be analysed into:
  • the accountability norm: a translator should act in such a way that the demands of loyalty
  are appropriately met with regard to the original writer, the commissioner of the translation,
  the translator himself or herself, the prospective readership and any other relevant parties.
  • the communication norm: a translator should act in such a way as to optimize communication,
  as required by the situation, between all the parties involved.
  • the relation norm: a translator should act in such a way that an appropriate relation of relevant
  similarity is established and maintained between the source text and the target text.

Of course, norms are highly culture bound – what is considered right in one cul-
ture may be offensive in another – and so is translation. Therefore, it appears that
both norms and translation practices have a relative nature. Professional bodies
have tried to circumvent this obstacle by setting their own concrete criteria and
restrictions. This is what the following section will exemplify.

3. Assessment standards

3.1 Industrial-professional standards

Quite a few international professional mechanisms have
been established so as to provide an objective benchmark against which to measure
the quality of different services, including translation. Here is what is quoted
on the site of ISO 9001–2008 GMS49.


ISO 9001–2008 GMS is the latest quality management system standard. It
specifies requirements for a quality management system when an organization
needs to:
• Demonstrate its ability to consistently provide a product that meets customer
and applicable statutory & regulatory requirements, and
• Address customer satisfaction through the effective application of the system,
including processes for continual improvement of the system and the assur-
ance of conformity to customer and applicable regulatory requirements.
There is also ASTM50 (American Society for Testing and Materials), which provides on
its site a guide that summarises its aims and practices. Here are
some parts of the articles in the guide:
1.1 [ASTM] identifies factors relevant to the quality of language translation
services for each phase of a translation project. The guide is intended for
use by all stakeholders, with varying levels of knowledge in the field of translation.
1.2 This guide is designed to provide a framework for agreement on specifications
for translation projects. Within this framework, the participants in a
service agreement can define the processes necessary to arrive at a product
of desired quality to serve the needs and expectations of the end user.
1.5 Translation can be viewed in a number of contexts.
1.5.1 One is that of globalization, internationalization, localization, and translation
(GILT), which takes products or services created for one audience
and makes them suitable to various foreign language audiences, whether in
the home country or around the globe51.
Another association is EUAATC, the European Union of Associations of Translation
Companies52, which gives guidelines for the highest standards for quality
translations. It also hints at the training of translators. This last area is not stressed
enough despite its obvious importance. This realization makes me take a side step
and think about translation from a different starting point: that of academia.

51 It should be noted that it is not clear from the way this guide is written whether it takes into
account the hegemonic status of some cultures that issue guides like this over other cultures
that have to follow these guidelines.

3.2 Academic – scholarly standards

The tools used to assess the quality of translations, mostly devised by scholars
for use in the classroom or in academia, have been quite inventive, albeit
ambitious. For instance, it is very interesting to follow the logic of a very
imaginatively written work entitled Can Theory Help Translators? A dialogue
between the Ivory Tower and the Wordface (Chesterman and Wagner, 2002). This
book is written in the form of a dialogue between two people who deal with
translation but are interested in it from different perspectives: Chesterman
represents the academic side and Wagner the body of professional
translators. They identify four different approaches
to assessing translation quality (pp. 82–83):
• Comparing the TT with ST
• Comparing the TT with parallel texts
• Measuring the reactions of general typical readers (experimental works, time
taken to read a translation compared to original writing, understanding performance,
gap-filling exercises in TTs, etc.)
• Trying to get at the decision making process during the translating itself.
Other methods are also ‘reader response-based’ and they may include reporting
tests, performance tests, comprehension tests, etc. (Melis & Albir, 2001).
All of the above constitute ways of assessing the quality of translation in a
more or less objective way. However, there is always room for improvement. On
the one hand, although the industrial standards mentioned above constitute a very
systematic way of assessing translations, they should also be extended to include
the personal factor. On the other hand, all the mechanisms enumerated as testing
tools for the assessment of quality in academia still remain in the sphere of aca-
demic research, which needs to be adjusted to the reality and more specifically to
the translation classroom.

4. The methodology of the research conducted

From my personal experience I often ask myself about the methods I should use
in order to check my students’ translations and give them appropriate feedback.
At the same time, they have to be reminded that translation is a very serious task.
It is not just a theoretical concept but a need for people’s communication around
the globe. Therefore, we also need to emphasize to future translators that a bad
translation is not only aesthetically ugly but also potentially dangerous (e. g. the
instructions leaflet in a medicine box).

Being personally immersed in a translation environment I was intrigued to
find out what teachers and students alike felt about the issue of assessing their
work. The gain of such an insight would be double: both for the teachers and
the students and the discipline of Translation Studies as a whole – especially
its pedagogy. So I started my research aiming at revealing both parts’ views on

4.1 The context of the research

I am currently teaching Translation Studies at the University of Bucharest, Romania,
in the Department of Classical Philology and Modern Greek Studies. The
students I teach work towards a two-year Master’s programme on Translation
and Interpreting which specializes in Greek and Romanian. A variety of courses
are offered which include: Audiovisual Translation, Theory and History of Trans-
lation, Methodology, Technical Translation (legal, economic texts), Literary
Translation, Interpreting, Doing Research in Translation Studies, etc.
The sample of my respondents consisted of postgraduate students of the M. A.
programme mentioned above (nine students) together with their two teachers.
Since this sample was quite limited, I also included undergraduate students of
Translation Studies (39 students) who belong to the Department of English at the
University of Bucharest and two of their tutors53. In Appendix I (Questionnaire
for tutors) and Appendix II (Questionnaire for students), one can see in detail the
questions included in the Questionnaires that were distributed in December 2011.
The reason for giving out two sets of questions was to gain a clearer picture of the
frame of mind in which tutors work and the extent to which this is reflected in and/or
related to their students’ answers.
Both questionnaires (to the teachers and their students) had four parts. Part
I gave a profile of each respondent (e. g. sex, year of studies, Department they
belong to, etc.). Part II revealed the practice followed in terms of translation of
texts (what types of texts they translate, the amount of work done per week, the
stages they actually follow in order to translate, etc.). Part III sought to estab-
lish how the respondents check their translation before they bring it to class and
after they see it with their tutor and/or classmates. Part IV comprised open-ended
questions exploring the respondents’ ideas about the improvements to be made in
the ways translations are assessed.

53 It is worth mentioning that the teachers of translation were very reluctant to commit themselves
to answering the questionnaire. This could be interpreted as a way to avoid self-reflection and
insight into their everyday practices. I felt awkward to insist on asking them to participate.

4.2 The results of the research and preliminary analysis

When reading the answers to both questionnaires, in the same avid manner that
any researcher reads the data collected, I started discovering certain patterns that
emerged and are worth discussing in the following paragraphs. The analysis I
followed was mostly qualitative.
Translation Units. Most of the student responses focused on the lower
linguistic levels such as vocabulary, grammar and, more particularly, the tenses.
For example, many students typically repeated: “it is important to look at the
tenses”, “I am interested in the tense of the verbs”, “the axis of tenses has
to be observed”, “we need to explain better the words, the phrases”, etc. Only
very rarely, if at all, was there an answer that would refer to the text-type of the
text to be translated, the semantics of the text (coherence and cohesion) or the
pragmatics of it (the Source language culture, the Target language culture, the
literary, economic, legal conventions, etc.). So assessment was mainly perceived
at the level of correct grammatical structures, especially tenses and vocabulary.
In other words, only part of the whole text was touched upon instead of a more
global textual approach.
Teacher-Centred Practices. It was very obvious that the students relied very
much on the help of the teacher, whom they described as ‘an authorized person’,
a person they ‘trust’. Here are some indicative answers: “I would like to see a
translation done by the teacher”, “I compare my translation with another already
verified by my teacher”, “the best way is to discuss it with a teacher”, etc. However,
in the last part of the questionnaire their answers indicated that they wished
for other methods of assessment like sharing translations with their classmates
(peer assessment), working in smaller groups, checking their translations against
parallel texts in their mother tongue or even self-assessment (translating the same
text in different ways, checking different options).
Teachers’ Expectations vs Students’ Practice. Comparing the answers provided
by the students and their tutors I detected a discrepancy. The tutors had laid out
a meticulous plan of the perfect way to translate and assess a translated text (e. g.
read carefully, look up words/phrases the students do not know, check coherence,
cohesiveness, read again), whereas the students seemed to focus mostly on get-
ting “the tenses of the verbs” right. This may be explained if we bear in mind the
fact that the majority of the students were in their first year of studies and had not
yet mastered a rigid practice for their translations nor the necessary vocabulary
to describe it.
First Year Students vs Postgraduate Students. As expected, postgraduate
students used specialized vocabulary to express their thoughts in a more accurate
way (e. g. “we need to check the naturalness of style in the Target Language
and be faithful to the source text at the semantic level, the pragmatic level and
the target culture’s mentality”), as opposed to the first year students, who were
very descriptive in expressing what they wanted (e. g. “analyze the text two
or three times to see if it has the same meaning as the original”). Postgraduate
students were also the ones who raised more issues in terms of the assessment
methods proposed. Here is what one of them wrote: “it would be advisable to
have a specialist analyzing the translation (an engineer for a technical text, a legal
attorney for a legal text) and a translator to analyze the feedback in order to avoid
future similar mistakes”. They would also welcome the idea of “beginner transla-
tors to do practice with experienced translators that could act as [their] guides”.
Both of the observations made above suggest that, with proper
guidance, the students’ horizons are widened and their translation assessment
methods become more demanding.
Process vs Product. The majority of the student respondents (mostly undergraduates)
were overly concerned with getting the translated text ‘right’ (by fixing an
ending, using the right word/expression in the Target Language, doing
“more vocabulary exercises” or longing for “bigger dictionaries”). As a result,
they paid little attention to the process that leads to the much desired ‘perfect’
result. Their anxiety and nervousness levels were consequently high, which
prevented them from working freely and in a relaxed and productive way.

5. Conclusion

My main aim in this research was to investigate how students felt they benefitted
from their classes in translation in terms of the assessment methods used. The
results are indicative of people’s awareness that assessment in translation is quite
difficult to achieve and to pin down in a clear and “scientific” way. According to
one of the findings mentioned above, the translation classroom seems to be overtly
teacher-centered. This means that if the teachers’ methodology is restricted and
not explorative in nature, the students will not become confident or acquire a
sense of independence that is essential for future translators. Monolithic models
in which the tutor is the figure of authority (cf. “I need a good teacher who
knows how to explain me [sic] the mistakes”) will only limit the field of creativity
on the part of the students. What is needed is descriptive rather than prescriptive
guidance accompanied by detailed feedback and discussion. The students themselves
pointed, albeit sporadically, towards the pieces of a bigger picture, which seems
to pave the way to fairer methods of assessment. They mentioned peer assessment,
teacher assessment, self-assessment, portfolios, comparisons among the
students’ finished translations, comparison between translations and published
translations, text analysis, use of parallel texts, keeping a diary recording their
progress, consulting experts and professional translators.
Apart from the points touched upon so far, there could be a host of other areas,
which would also lend themselves to interesting investigation and discussion in
order to broaden the scope of the research carried out. Therefore, further research
could embrace areas like:
• gender differences: to explain the differences between the answers given by the
female respondents, which were so stereotypical, and those given by their male
counterparts, which were far more creative.
• cultural differences: to take on board the culture in which this research is
embedded. The educational system of Romania may influence the style of teaching
and assessing.
• absence of Skopos: to explain why students of translation most of the time
work in a “vacuum”: they do not know why they are translating a text. In other
words, the Skopos according to Vermeer is missing.
• lack of a link between translation activity in the classroom and the global picture of
translation: to investigate why the students’ translating activity is not linked
with the real world and the standards that are set by professional
boards all over the world.
• choice of texts to be translated: to examine whether translation is manipulated
to promote certain values and reject others54.
The whole area of assessment of the quality of translations is fascinating and
in need of more empirical research and investigation.


54 I wish to thank all the tutors and the students who belong to the Department of English,
Faculty of Foreign Languages and Literatures and the Department of Classical Philology and
Modern Greek Studies, University of Bucharest for their insightful comments and their offer
of help. Without them this research would not have taken place.

References

Bassnett-McGuire, S. (1980). Translation Studies. London and New York:
Bell, R. (1991). Translation and Translating. London: Longman.
Chesterman, A. and Wagner, E. (2002). Can Theory Help Translators? Manchester:
St. Jerome Publishing.
Chesterman, A. (1997). Memes of Translation. Amsterdam: John Benjamins Publishing.
Gile, D. (1995). Basic Concepts and Models for Interpreter and Translator Training.
Amsterdam: John Benjamins Publishing.
Kussmaul, P. (1995). Training the Translator. Amsterdam: John Benjamins Publishing.
Melis, N. M. and Albir, A. H. (2001). Assessment in Translation Studies: Research
Needs. Meta, XLVI, 2, 272–287.
Newmark, P. (1988). A Textbook of Translation. New York and London: Prentice
Tirkkonen-Condit, S. (1989). Professional vs. Non-Professional Translation: A
Think-Aloud Protocol Study. In C. Séguinot (Ed.), The Translation Process.
Toronto: H. G. Publications, School of Translation, York University. 73–85.
Toury, G. (1978/2000). The nature and role of norms in literary translation. In L.
Venuti (Ed.), The Translation Studies Reader. London and New York: Routledge.
198–211.
Vermeer, H. J. (1989). Skopos und Translationsauftrag – Aufsätze. Frankfurt am
Main: IKO.

Appendix I: Questionnaire Distributed to Translation Tutors
Female male
Teaching undergraduate students Department of studies:
Teaching postgraduate students Department of studies:
What type of texts do you usually ask your students to translate?
legal texts technical texts economic texts literary texts audiovisual
translation (subtitling etc.)
Do you ask them to translate into their mother tongue, into another foreign language,
or both?
Describe the steps that your students should ideally follow when they are translating.
Step 1
Step 2
Step 3
Step 4
Step 5 (add as many steps as you think are fit for their usual procedure)
Which of the following do you teach your students to pay attention to when you give
them homework, BEFORE discussing their translations with you?
Correctness of the target text in terms of grammar
Correctness of the target text in terms of semantics
Correctness of the target text in terms of pragmatics
Fidelity to the source text
Naturalness of style in the target language
Correctness of the terms used (in technical, legal texts etc.)
Other (please specify)________________________________________________________
When you check their translations AFTER they finish, which of the following methods
do you use?
Do you explain?
Do you analyse?
Do you compare their translation with those of their classmates?
Do you compare their translations with parallel texts originally written in their mother tongue?
Do you compare their translation with translations which have already been published?
Other (please specify)
In your opinion which is the best way to check the quality of a translation?
Does the text type you are checking (legal, technical, literary text) influence your
assessment methods?
Do you feel you assess the quality of a translation differently when a text is set for
homework or an exam question?
Thank you very much for your time and your valuable responses.


Appendix II: Questionnaire Distributed to Translation Students
Female male
Undergraduate student year 1 2 3 Department of studies:
Postgraduate student year 1 2 Department of studies:
Languages that you study:

How many texts do you translate in a week for your classes?

more than 6 5-6 3-4 1-2 less than 2
What type of texts do you usually translate?
legal texts technical texts economic texts literary texts audiovisual
translation (subtitling etc.)
Do you translate into your mother tongue, into another foreign language, or both?
Describe the steps you are following when you are translating
Step 1
Step 2
Step 3
Step 4
(add as many steps as you think are fit for your usual procedure)
Which of the following do you check in your translation BEFORE discussing it with
your teacher?
Correctness of the target text in terms of grammar
Correctness of the target text in terms of semantics
Correctness of the target text in terms of pragmatics
Fidelity to the source text
Naturalness of style in the target language
Correctness of the terms used (in technical, legal texts etc.)
Other (please specify)________________________________________________________
When you check your translation AFTER you finish, with your teacher and classmates,
which of the following methods do you use?
Do you explain?
Do you analyse?
Do you compare your translation with those of your classmates?
Do you compare your translations with other similar texts originally written in your mother tongue?
Do you compare your translation with translations which have already been published?
In your opinion which is the best way to check the quality of a translation?
What recommendations would you have for improving the methods so far used in
checking your translations?
Thank you very much for your time and your valuable responses.


Formative Assessment and the Support of Lecturers
in the International University

Kevin Haines, Estelle Meima and Marrit Faber

Language Centre, University of Groningen

This chapter discusses the English language assessment and provision for academic staff at a Dutch
university. Lecturers need support in delivering some of their teaching in English in the context of
the internationalization of the university curriculum. We present case studies describing our assess-
ment procedures at two faculties, describing how our provision is guided by the principles of for-
mative assessment and ‘Person-in-Context’. This results in an assessment design that involves and
motivates the lecturers, who value the feedback given and appreciate the authenticity of the assess-
ment procedures. We address the assessment process from a quality perspective, making use of the
LanQua model to evaluate the procedures at the two sites. This helps us to highlight problems that
have arisen during our initial use of the procedures and to suggest adaptations that will alleviate
these problems in future use at these and other faculties. We conclude that the prime value of our
approach lies in the positive engagement of the lecturers, who gain greater autonomy through the
formative assessment process. Finally, we outline opportunities for the consistent structural exten-
sion of these procedures across the university.

Key words: internationalization, formative assessment, context, support, autonomy.

1. Introduction

European Education ministers have established a goal that 20% of all graduates
should have been ‘mobile’ during their studies by the year 2020 (Bal, 2009). Such
mobility results in the use of a common language or lingua franca at universities,
and in the case of the University of Groningen that language is usually English.
The university already provides the majority of its Master’s and PhD programmes
through the medium of English, while English streams are also well-established in
the Bachelor degree programmes at certain faculties, such as the Faculty of Eco-
nomics and Business (Haines & Ashworth, 2008). Meanwhile, other faculties are
developing English as Medium of Instruction (EMI) provision in line with the uni-
versity’s Strategic Plan 2010–2015.
The development of such programmes lies at the core of the international-
ization process, and creates a linguistic environment that is challenging to both
learners and their instructors. If the level of English amongst students is not high
enough, their chances of success in the EMI programme will diminish, with
research indicating that learners need to pass from a B2 to a C level in order to
function successfully in higher education (Green, 2008). In this context, the pri-
mary focus in terms of language provision tends to be placed on student learning
goals. However, the demands of the international environment on the language
skills of academic staff also represent a major issue for universities. From a qual-
ity perspective, therefore, it is important that transparent and consistent method-
ologies are developed that deal with the language needs not only of students but
also of the academics involved in delivering these international programmes.
One argument proposes that the level of academic content may diminish with
internationalization because lecturers58 have greater difficulty presenting content
in English, which is often their second or even third language. Content instructors
are, therefore, often expected to have a level of English that is as close as possible
to the level of an academic using their first language. In our experience, however,
the understanding of this level varies widely amongst the various parties involved
in the internationalization process. Information gleaned through student course
evaluations and professional development procedures reveals that the students,
the lecturers themselves and their programme managers and administrators often
have very different perspectives on the level that is desirable. This is particularly
apparent in discussions about pronunciation and accents, but extends to other
areas of language use, including specific issues with first language interference.
Often, the level expected of lecturers is also compared to native-speaker level,
which can be an unhelpful benchmark for language development. For example,
Murphey, Chen, and Chen (2004) explain that “language learning activities that
posit the native speaker as an ideal model are often in danger of creating dis-
identificatory moments of non-participation and marginalization” (Murphey et
al., 2004, p. 85). Indeed, the use of the native-speaker of English as a language
benchmark for the assessment and learning of L2 academics can lead to the estab-
lishment of unrealistic and inappropriate learning goals.
The setting of such unrealistic or inconsistent goals for academic staff represents
a barrier to their linguistic development, resulting in frustration and potentially in
resentment of the assessment process. Not only does it raise unacceptable expec-
tations by implying a need for near perfection, but it may also be seen as a threat,
precluding the opportunity for development on-the-job. If lecturers are not given the
opportunity to lecture in English until they have reached a certain level, how will
they ever be able to practise and develop their second language skills in the teaching
context? How will they develop the confidence to extend their language use, given
that such extension through practice will inevitably involve making, and learning
from, language errors? Also, if for perceived reasons of quality, programme managers
were to exclude from teaching those lecturers whose English was deemed to be
inadequate, we would be in danger of ‘throwing out the baby with the bath water’.
Many of our lecturers demonstrate the ability to compensate for their language
deficiencies with excellent didactic and presentation skills, as well as intercultural
competences (Smiskova, Haines & Meima, 2011). We should also consider the resulting
loss of knowledge to the EMI programme, given that lecturers are hard to replace
because of their expertise in specific fields.

58 We use the term ‘lecturer’ in this article in its broadest sense to refer to any teaching member
of the academic staff of the university. Such staff may indeed lecture to mass audiences, or
they may use other forms of pedagogical interaction such as seminars, tutorials and coaching.

To counteract fears that EMI programmes may have a lower academic quality
than Dutch programmes, faculties often start by requesting or requiring their staff
to take English tests. This course of action produces some resistance amongst aca-
demics if they expect to be tested in a formal way. For example, they might initially
expect the assessment to involve formal tests of the type they experienced in lan-
guage education at school, something they may not have encountered for 20 or 30
years. Or they may fear that they will be required to perform language tasks that
are only indirectly related to their current needs. Such resistance to testing is under-
standable in a context in which academics have limited time and are under increas-
ing pressure, for instance to carry out research and to publish in high-level journals.
Furthermore, teaching through English still represents a relatively small proportion
of the teaching tasks for many lecturers, and the design of the curriculum means
that they often only need to teach through English at certain points in the academic
year. These issues combine to make the task of assessing the second language level
of academic staff both sensitive and complex. Our aim at the Language Centre at the
University of Groningen is to assess academic staff in a non-confrontational way,
taking an easily accessible and low-threshold approach, through which we hope to
maximize the level of empowerment experienced by the test-taker.
This chapter aims to show how the tester and test-taker can co-construct a posi-
tive and formative assessment process that is flexible to the demands of the teach-
ing context. This involves building bridges with academics by carrying out assess-
ments in the real-world context of the day-to-day workings of the university. We
believe that assessing authentic language use, as demonstrated by the cases below,
engages the academic staff, breaking down resistance to the assessment process
and nurturing a growth in personal confidence. This in turn engenders learning
autonomy. We show that the key to success lies not only in the assessment design,
but also in a thorough understanding of the context in which this assessment takes
place. This is closely related to Spence-Brown’s observation that “the way in which
individuals frame a task can greatly affect the authenticity of implementation and
interaction, and thus the validity of the task” (Spence-Brown 2001, p. 465). For
instance, we have learned that a positive outcome depends on the manner in which
the task is described and the extent to which the academic staff recognize that task
as meaningful or purposeful to them. In this case, they frame the task as authentic,
which motivates them and results in their active involvement in the task. A positive
outcome is expressed in terms of the willingness of academic staff to take part in
the assessment process and to make active use of the feedback they receive by trans-
ferring it into their daily practice. In this chapter, we look at the ways in which we
have worked with university lecturers, encouraging them to take a pro-active role in
the process of assessing and developing their own language skills, building greater
learner autonomy in the process.

2. Towards consistent assessment

The Common European Framework of Reference (CEFR) (Council of Europe,
2001) has been widely adopted over the last decade as a tool for maintaining
consistency in language assessment in Higher Education settings across Europe.
In this time, the Language Centre at the University of Groningen has established
a methodology that relies on task specification, the development of language
samples, and regular standardisation sessions with language teachers. This
approach, informed by the CEFR Manual (2009) and reported in Lowie, Haines,
and Jansma (2010), works within the context of many of our courses for students
but is more problematic when applied to the types of diagnostic help needed by
academic staff. We have found that the CEFR does not readily lend itself to the
specific language required by lecturers in their teaching context. This difficulty
is underlined by North’s proposal of a D level above C2 to meet the needs of such
“well-educated non-native speakers” (North, 2010).
Considering the value of the CEFR to our procedures, as we turned to
the task of assessing academic staff, led us to the broader question of precisely what
we should be assessing. Coleman argues that “academic discourse, like any other
discourse, is culturally bound, and translation into English implies more than merely
linguistic change” (Coleman, 2006, p. 10). This suggests that the communicative
needs, and hence the assessment needs, of lecturers in EMI programmes encompass
not just another level but also another dimension of language use, possibly incorpo-
rating references to pedagogical and intercultural skills. Solutions to this complex
situation are emerging in practice, as described by researchers such as Wilkinson:
“The respondents … reported that it was beneficial to adapt instructional techniques
in English medium teaching” (Wilkinson, 2005, p. 4). Changes in teaching style,
and in particular a move away from lecturing mode, can result in a more naturalistic
and constructive linguistic interaction between instructor and learner. This
corresponds with Vaughan Stoffer’s observation that “interactivity can safeguard the
quality of education in classrooms mostly comprised of L2 speakers” (2011, p. 69).
This work confirms the need to provide support that goes beyond language in iso-
lation and considers both the pedagogical role of the lecturer and the intercultural
environment in which he/she is being asked to function. It also confirms the need
for a model or framework to support and structure the process of such provision.
We have turned to a model that provides a foundation for the quality of the pro-
cess. Our use of the LanQua Quality Model (2010) as a reference tool in design-
ing support for lecturers and coaches is discussed extensively in Smiskova et
al. (2011). This model, designed for European Higher Education, comprises five
stages: (1) Planning, (2) Definition of Purpose, (3) Implementation, (4) Monitor-
ing and Evaluation, and (5) Adaptation.
The iterative cycle of the LanQua model (Figure 1) is valuable to us primarily
because it encourages us to work reflectively when monitoring the implementation
of assessment procedures in context. This ensures that our procedures are valid in
the sense that they consistently test the language skills that need to be tested. In
particular, we have aimed at authenticity, which Bachman & Palmer have defined
as “the degree of correspondence of the characteristics of a given language test task
to the features of a target language use (TLU) task” (Bachman & Palmer, 1996, p.
23). In our setting, this means testing the skills that are most relevant to the tasks
that academic staff, such as lecturers, carry out in their daily practice. A further
advantage of the LanQua model is that it implies a bottom-up approach, which lends
itself to the involvement and motivation of lecturers through evaluative procedures.

Figure 1. The LanQua Quality Model

In order to ensure that the involvement and motivation of the lecturers remains
central to the assessment process, we have emphasised two complementary
guiding principles. Our first principle is that the assessment of lecturers should
be formative. In the context of language assessment, the assessment should be
formative in the sense that “it produces a change of some kind for the learner”
(Rea-Dickins, 2006, p. 168). Formative assessment provides the means by which
a learner can develop their language skills, in our case with the feedback and
diagnostic support of a language assessor. Our second guiding principle is the
Person-in-Context perspective, which Ushioda (2009) describes as follows:
A focus on real persons, rather than learners as theoretical abstractions; a focus on the agency
of the individual person as a thinking, feeling human being … a focus on the interaction
between this self-reflective intentional agent, and the fluid and complex system of social
relations, activities, experiences and multiple micro- and macro-contexts in which the person
is embedded, moves and is inherently part of. (ibid, p. 220)

The Person-in-Context principle encourages us to work with the academic staff
in assessing their real, contextual language needs, while also challenging us to
define where the formative learning priorities lie within the complexity of these
needs. This chapter shows how we translate these guiding principles into prac-
tice when assessing academic staff. We consider such assessment to be an itera-
tive process, which undergoes the continual evaluation and revision conveyed
by the LanQua model in relation to the dynamic needs of the individual learner
in context. This means that the cases we report below are also inevitably works-
in-progress which will be developed in various ways in the future in response
both to changes in the environment in which they take place and to the feedback
provided by the various stakeholders in the assessment process.

3. Case studies at the University of Groningen

The following cases illustrate how we are able to carry out the assessment
requested by two faculties at the University of Groningen, while maintaining a
formative approach, which is learner-centered in the sense of Person-in-Context.
We discuss our assessment of the spoken English of academic staff at two inter-
national programmes, one in the Faculty of Social Sciences (SOCSCI) and the
other in the Faculty of Medical Sciences (MEDSCI). These cases represent two
of our most recent efforts to assess and support the second language of lecturers.
We have structured our description of the cases under the headings provided by
the LanQua model, integrating both cases under each heading so that the reader
may make immediate comparisons.

3.1 Planning

In some departments, an EMI programme has already been implemented or partially
implemented before the lecturers are assessed. For example, a department
at SOCSCI asked the Language Centre to assess the language levels of lectur-
ers who were already teaching through English. The department had spent three
years previously preparing for the EMI programme and yet it was only in the
third year of the project that lecturers became used to the idea of teaching through
English and agreed to be assessed. In order to avoid creating resistance through
imposition, lecturers were given the opportunity to volunteer for this activity. A
group of eight lecturers who were already teaching through English volunteered
for this trial round, and after the positive evaluation of the trial round, the depart-
ment initiated a second round of staff assessments with a new group of volunteers.
In other departments, English support for academic staff is integrated into
the design of the EMI curriculum development project at an earlier stage, which
means that the lecturers are not yet actually teaching in English when the assess-
ment takes place. For instance, a department at MEDSCI decided to offer its staff
the opportunity to deal with potential language issues before the programme com-
menced. Assessments were held approximately a year in advance of the launch of
the programme and, for the reasons outlined above, academic staff were given the
opportunity to volunteer for these assessments.

3.2 Purpose

The main objective at the SOCSCI department was to evaluate the extent to which
the lecturers were able to perform their required tasks in English. Although lec-
turers need to do a substantial amount of writing in English, it was not felt that
this should have the highest priority, as written work can be checked by the Trans-
lation and Correction services before being sent out to the students. Delivering
lectures has a different dynamic, as lecturers need to be rather spontaneous while
delivering the information in a clear, understandable way so that students, for
whom English is generally also a second language, do not become confused or
At the departments in both faculties, the purpose of these assessments was to
divide the staff into three categories on the basis of their presentations. The first
category (A) consisted of staff whose spoken English would not cause difficul-
ties when teaching. This is not to say that their English was perfect, but that they
could make themselves understood in a clear and enthusiastic presentation and
that they did not have significant structural language problems. Such lecturers
often achieve this goal by combining their use of language with other pedagogical
and communicative strategies, so that they maintain their credibility as a teacher.
The second category (B) consisted of staff who would benefit from some language
support in specifically defined areas. It was felt that lecturers in this category
were capable of teaching clearly in English, but that they either had specific struc-
tural language problems or their message lost impact because of issues with their
language use. Such lecturers often feel insecure or uncomfortable when teach-
ing in English and sometimes refer to a perceived loss of personality, although
their use of English does not generally lead to misunderstandings. Our aim is to
increase their sense of confidence through specific language work and associ-
ated reflection and practice of strategies for effective communication when using
a second language. The third category (C) consisted of those academic staff in
need of extensive and often immediate language assistance, whose use of spoken
English might produce misunderstandings and negatively affect the learning process
of their students. At SOCSCI, we were able to prioritize immediate support
for any staff in category (C), while planning support for staff in category (B) at
moments in the year when English became an issue. At MEDSCI, staff in categories
(B) and (C) had the time to work on their deficiencies before the actual English
programme began.

3.3 Implementation

At SOCSCI, it was felt that the assessment task should have as low a threshold
level as possible. Our aim was to assess an authentic example of spoken language
in context, and we did not wish to put the lecturers under the extra pressure of
having an outside observer present. Therefore, we chose to view a recording of
the lecture instead of attending. This strategy had further advantages. First of all,
lectures were already being recorded so that students could view them again at
a later date. For this reason, the lecturers had grown accustomed to the presence
of the video camera. Secondly, the lecturers were free to choose which record-
ing they sent us, although in practice it turned out that they only gave one or two
lectures per year in English, so the choice was actually quite limited.
After receiving the SOCSCI videos, we spent approximately thirty minutes
viewing various parts of the lecture, at the beginning, middle and end. This
allowed us to place the lecturer into one of the three categories (A, B or C) described
above. It also allowed us to provide some initial diagnostic support in the form of
examples of language use. Both positive and negative examples of language use
were recorded in a feedback report by the assessor; the time in the video at which
each example occurred was also noted. Not only would the project leader receive
a copy of these reports, but all lecturers were also sent a copy along with the link
to their video, making the process transparent and empowering the lecturer to
make full use of the feedback. See the appendix for an example of an assessment
report for a SOCSCI lecturer in the C category59.
At MEDSCI, the members of staff were not yet teaching through English at
the moment of the assessment, so the set-up for these assessments differed from
the previous case study. Lecturers were asked to prepare and deliver a ten-minute
recorded presentation on materials they anticipated using in the future, thereby
making the topic contextually relevant and meaning that the presenter could later
recycle the materials for use in their course. The presentation took place in the
presence of two assessors, with an audience of four or five peers. After each pre-
sentation, there was a five-minute content-based question-and-answer session,
during which the peers in the audience interacted with the presenter, taking a
similar role to that of a student audience. As everybody in the group had to give a
presentation, there was a high degree of empathy in the audience interaction with
the presenter, and the questions tended to be constructive.
Before the MEDSCI staff delivered their presentations, we asked them to
describe their experience in preparing the presentation. Many of them conceded
that it had taken them longer to prepare than they had anticipated. They were also
asked to describe their strengths and weaknesses and what they hoped to achieve
in terms of language development when lecturing through English. For example,
one lecturer mentioned that she would like to be able to speak with the same
humour that she used when lecturing in Dutch. After the presentation and the
question-and-answer session, we encouraged the lecturers to describe the experi-
ence from their own perspective before feedback was given. The audience in this
procedure created added value, giving an insight into how the lecturers would
cope with the need to respond spontaneously to questions. In the recordings of
the SOCSCI lectures, the question-and-answer session was often not recorded,
meaning that we were not able to give feedback on how the lecturers would spon-
taneously respond to questions. This is a point that will be addressed in future
assessments at SOCSCI.
During the procedure at MEDSCI, the groups were kept small (four or five
people) in order to keep the threshold to a minimum. Only the two assessors were
permitted to give immediate feedback. We felt that it would have created insecu-
rity if the assessors had left without giving the presenters an immediate impres-
sion of their performance, but the feedback given at this moment was kept fairly
general and was always formulated in relatively positive terms due to the
sensitivity of the process. After this, we also reviewed parts of the video and wrote a
report similar to those produced for SOCSCI, specifically outlining the aspects
that had gone well and those that could be improved. Once again, the reports
were sent to the presenter in order to maintain transparency and to encourage

59 The report in the appendix has been made anonymous and is presented in a slightly edited
form to protect the identity of the lecturer. It is used with his permission.

3.4 Monitoring and adaptation

When lecturers were assessed as needing support (categories b and c), they were
advised to attend an intake interview at the Language Centre. This intake session
was an informal discussion during which the most appropriate form of support
was decided. The video and assessment report were used as a reference point,
but the lecturer also had the opportunity to discuss the development of his/her
English in a wider context. For instance, if they felt that the video material was
in some way atypical of their teaching situation, they had the opportunity to
describe their actual needs in more detail. While some lecturers joined courses
in small groups, other lecturers followed individual coaching routes. The video
and assessment report were made available to their teacher or coach as input at
the start of this process.
After the intake sessions at both sites, it was the responsibility of the lectur-
ers to actively seek out individual coaching sessions with a language instructor
or to join a special group session if recommended, at a time convenient to them.
Although this was taken seriously, we noticed that take-up could be improved. The
value of the assessment will be lost if individual lecturers and their departments
do not follow through on the advice and feedback given. Time is evidently an
important issue here. While we would like to encourage a pro-active approach,
we recognize that it may be more practical in future to place some lecturers in
appropriate courses rather than waiting for them to take the initiative. Lectur-
ers have complex professional lives, balancing conferences with ever-changing
schedules, and sometimes working at more than one institution. This creates
planning issues that will undoubtedly remain a point of tension, and it may lead
us in future to use more flexible combinations of individual coaching together
with workshops targeting specific language needs.
The main result of the procedures described above is the positive response
amongst academic staff at these departments to the idea of English language
assessments. By way of illustration, SOCSCI have now set up a second round of
assessments with more volunteers. These staff members had heard positive word-
of-mouth reports from colleagues who had already been assessed. Most of the
lecturers assessed felt that they could identify with the feedback provided. The
diagnostic help appears not only to be appreciated in terms of the information it
has provided, but it has also engaged these staff in the process of learning English.
This engagement is the product of the contextual relevance of the procedure
and the feedback provided, and we believe that it provides the motivational foun-
dation for academic staff to move forward in the autonomous development of
their English. The lecturers at both sites were generally pleased with the assess-
ment feedback and acknowledged the benefits for their teaching. We experienced
a culture of active participation and cooperation; and this high level of engage-
ment is illustrated by the fact that several lecturers have taken the initiative to
include the video evidence and feedback from the English assessment in the pro-
fessional development portfolio (see section 4 below). Although we are pleased
with this form of assessment and its results, we have discovered that we need to
make our existing language provision more readily accessible to this group of
language learners.

4. Discussion

The following issues emerged from evaluation of these assessment procedures,
and we are currently working on revised structures in which we take these aspects
of the process into account.
Firstly, we noticed that many lecturers teach mainly through Dutch, with only
a limited number of high-profile lectures per year in English. This is because they
are experts in specific fields and are invited to give guest lectures in English on
their specific field of expertise. Despite the limited number of lectures, they are
expected to produce a high level of English; yet for much of the year this need
has a lower priority for them than more pressing work. In the absence of a sense
of urgency, they are not motivated to initiate involvement in an English course.
Secondly, the pre-recorded lectures at SOCSCI do not give us the opportu-
nity to view teacher-student interaction, meaning that the assessor sometimes
has to speculate about how the instructor would perform in such a situation. At
MEDSCI, there is an audience consisting of a small number of largely empathetic
peers, which contrasts to some extent with the dynamics of an authentic student
Thirdly, the issue of how and whether to separate fine language skills and
didactic skills remains a point of debate both amongst assessors and in the client
faculties. The Language Centre is perceived to be a provider of language support,
and another unit within the university is formally responsible for the didactic
training of academic staff, many of whom have risen through the ranks of
academia without receiving formal didactic training. Despite this organizational
division of tasks, it is clear that didactic skills and language skills overlap consid-
erably in the classroom or lecture theatre.
This brings us to one of the key issues that emerges from the above cases,
which is the need to ensure transparency to all stakeholders, including the other
units at the university that are involved in the training of lecturers. This transpar-
ency would be supported if the assessment were implemented within, and sup-
ported by, a systematic structure that is recognized by all parties. Such a structure
would also ensure a consistent approach, as the current approach is reliant on the
commitment of managers at specific faculties and hence vulnerable to changes in
key personnel. Furthermore, a common structure would give us the opportunity
to extend our approach more widely as internationalization becomes more firmly
embedded across the university. In this spirit of cooperation with the teacher
training unit, we are currently preparing a pilot course for lecturers in which
language skills and intercultural skills will be integrated.
Finally, we note that the university is implementing a University Teaching
Qualification (Basiskwalificatie Onderwijs or BKO in Dutch), which will be
a precondition for offering a permanent contract. The BKO involves every
lecturer producing a portfolio as evidence of teaching competence, needs and
progress. As the approach described above lends itself to a portfolio-based
structure, this represents an opportunity for us to consolidate the work we have
done in a way that is meaningful for lecturers. Those lecturers who need to use
English in their teaching can add the video and the assessment report to their
BKO portfolio as illustrative material, as some lecturers have voluntarily done
already. They can also provide references to work produced during courses or
individual coaching sessions at the Language Centre as evidence of achieve-
ment, which will make the link between the assessment and the subsequent
support more explicit. Our approach already encourages each lecturer to take
a pro-active approach to their language development, and we believe that the
procedures described above provide a firm foundation for the structuring of
that development in a portfolio.

5. Conclusion

This chapter has provided illustrations through two case studies of our forma-
tive assessment of the English language skills of lecturers who need to teach
through English. We have emphasized that our approach is a work-in-progress
that is subject to the iterative processes of evaluation and revision described
in the LanQua Quality Model. We have also recognized that the design of
the assessment procedures will differ depending on the local setting. What is
important is that the assessment is set up in such a way that it is meaningful to
the lecturers who are being tested. In general, this means that the assessment
task should be founded on an authentic situation that lecturers readily recog-
nize as relevant to their teaching context, with the result that the lecturers also
have confidence that the feedback provided will be valuable to them. The task
is even more appreciated if it involves the busy lecturers in producing some-
thing that they can later recycle in their teaching. Clearly, therefore, the level of
attention invested in the task design is a prerequisite both to the engagement of
the lecturers in the task and to the ability of the assessor to provide meaningful
feedback. Only if the detailed feedback is meaningful can we claim that
our test is formative in the sense that it produces change in the person being
tested. Finally, it is essential that the advice produced during the assessment
procedure is followed through, with the lecturer accessing courses or coaching
when applicable. To maximize the chance of this follow-through occurring, the
assessment procedures need to be transparent to all stakeholders involved in
the personal development of the individual lecturer.


References

Bachman, L. F., & Palmer, A. S. (1996). Language Testing in Practice. Oxford:
Oxford University Press.
Bal, E. (2009). Eerst was er Bologna, nu is er Leuven. Transfer, 8, 12–14.
Coleman, J. A. (2006). English-medium teaching in European higher education.
Language Teaching, 39, 1–15.
Council of Europe. (2009). Relating language examinations to the Common
European Framework of Reference for languages: learning, teaching, assess-
ment (CEFR): A manual. Strasbourg: Language Policy Division.
Council of Europe. (2001). Common European Framework of Reference for
Languages: Learning, teaching, assessment. Cambridge: Cambridge University Press.
Green, A. (2008). English Profile: functional progression in materials for ELT.
Cambridge ESOL Research notes, Issue 33, 19–25.
Haines, K., & Ashworth, A. (2008). A reflective approach to HE language provi-
sion: integrating context and language through semistructured reflection. In
Wilkinson, R. & Zegers, V. (Eds.). Realizing Content and Language Inte-
gration in Higher Education (pp. 201–211). Maastricht: Maastricht University.
Lanqua (2010). Lanqua Quality Model. Retrieved November 4, 2011, from http://

Lowie, W. M., Haines, K. B. J., & Jansma, P. N. (2010). Embedding the CEFR in
the academic domain: Assessment of language tasks. Procedia Social and
Behavioral Sciences, 3, 152–161.
Murphey, T., Chen, J., & Chen, L. (2004). Learners’ constructions of identities
and imagined communities. In P. Benson & D. Nunan (Eds.), Learners’ Stories
(pp. 83–100). Cambridge: Cambridge University Press.
North, B. (2010, November). The Core Inventory (British Council seminar, 10
November, 2010). Retrieved November 4, 2011, from http://www.teachingeng-
Rea-Dickins, P. (2006). Currents and eddies in the discourse of assessment: a
learning-focused interpretation. International Journal of Applied Linguistics,
16(2), 163–188.
Smiskova, H., Haines, K., & Meima, E. (2011). Key considerations for the quality
of language provision for academic staff in the English-medium programmes
(EMI) at a Dutch university. Fremdsprachen und Hochschule 83/84, 87–100.
Spence-Brown, R. (2001). The eye of the beholder: authenticity in an embedded
assessment task. Language Testing, 18(4), 463–481.
Ushioda, E. (2009). A person-in-context relational view of emergent motivation,
self and identity. In Z. Dörnyei & E. Ushioda (Eds.), Motivation, Language
Identity and the L2 Self (pp. 215–228). Bristol: Multilingual Matters.
Vaughan Stoffer, R. (2011). Internationalisation, Language Policy and Gover-
nance. Unpublished Master’s thesis. Groningen: University of Groningen.
Wilkinson, R. (2005). The impact of language on teaching content: views from
the content teacher. Retrieved November 10, 2011, from http://www.palme-

Oral Presentations in Assessment: a Case Study

Ian Robinson60
University of Calabria

In universities in Italy it is normal to find that the students’ final subject grade is based exclusively on
a final exam. In English language courses this might take the form of a written and an oral exam. This
tends to put a great stress on the learner as the mark is given based on how the learner performs at a
given moment. This chapter looks at a project in which students of the second level university cycle 61
were given alternative forms of assessment in their English course. One of the principal forms of test-
ing involved assessing oral presentations carried out in class. During the course the class was guided in
oral presentation design and then asked to give a presentation on a subject related to their degree major,
which, in this case, was Social Services. These presentations were done on an individual basis and were
presented to the whole class. The process also involved peer-assessment of each student’s presentation.
The present chapter examines this experience and the value of oral presentations as a form of testing.

Key words: oral presentation, assessment, peer-evaluation, visuals.

1. Introduction

This chapter focuses on a project that investigated how oral presentations can
be used as an assessment tool in university language courses. The project involved
students being taught how to give oral presentations as part of their language course
and then their presentations being evaluated as part of their final course assessment.
This chapter will give a concise overview of the literature of assessment and of
oral presentations. It will then relate the experience of students in an Italian uni-
versity using presentations in assessment and will conclude with some remarks
about further possibilities for research in this area.
The aim of the project was to use oral presentations as part of final assessment
and investigate the use of university students’ peer-assessment.

2. The context

The students involved in this particular project were in their second-cycle
university degree course in Social Work and Social Policies at the University
of Calabria in southern Italy. There were 60 students regularly attending
lessons. The majority of the students had already completed the three-year
undergraduate course at this university and had therefore already completed the
three English courses that were a compulsory part of the course. The present
English course involved 24 hours with the teacher responsible for the course (the
author of this chapter) and a further 10 hours of lessons with a language tutor.
The tutor concentrated mainly on how to write an academic paper in English. The
main course concentrated on English for Academic Purposes (EAP) and English
for Vocational Purposes (EVP) and involved teaching reading skills, with texts
concerned with social work and social workers.

61 Italian universities operate with a 3 plus 2 system. Undergraduates should graduate in three
years; they can then continue with a two-year specialised degree course. Graduates from this
two-year course can have access to a Master’s Degree.

One of the language needs for graduates from the degree course in Social
Work is that of being able to deliver oral presentations. This does not only require
linguistic competence but is also a skill in itself. Even in prestigious international
conferences, oral presentation skills are not ones that all presenters have mas-
tered. As graduates in Social Work will most likely at some time in their future be
called on to give oral presentations, it was felt that they should learn the relevant
skills. The fact that they would learn these in and through English was a further
challenge for them. It was also hoped that introducing the oral presentation
as an alternative form of assessment, and allowing students to do this during the
last few lessons of the course rather than during the official exam period, would
ease student anxiety. In the past, for this degree course, the final assessment of
the English course had been held on a fixed day, and it was an assessment of
students’ performance at that particular moment on a test they had not seen before.
The students used to have an exam that involved reading comprehension items,
language-in-context items and a short writing task. Students who passed this part
of the exam were then admitted to an oral component that consisted of discussing
a topic related to social work. The exam format has now changed so that the
oral presentation gives students the chance to prepare the oral part of their exam
well beforehand and do it before the written exam. The other part of the exam that
changed is the writing section, which now reflects what is done with
the tutor and is completed as part of the course work.

3. Literature review

Assessment has always been an integral part of English Language Teaching
(ELT) and much has been written on the subject. A lot of the literature concerns
how to improve forms of assessment. Indeed, Küçük and Walters (2009) drew
attention to the problems of testing by entitling their article “How good is your
test?” Much of the literature aims to guide people in improving testing styles.
Douglas (2010, p.145) calls for those involved in language testing to “introduce
a degree of creativity and personality into the tests they design” and to
“design tests that more accurately reflect communicative activities in
classrooms and real-world language use situations”.
In this project the type of assessment under consideration is that of oral pre-
sentations, a “real-world language use”. Jordan (1997, p.193) notes that oral pre-
sentations are one of the activities in the field of speaking for academic purposes,
which would therefore make it a legitimate area of activity and assessment in a
university degree course. This type of “performance” would fall into Bachman
and Palmer’s (2010, p.216) second assessment construct definition in which the
“language ability and topical knowledge [are] defined as a single construct”. It
was felt that using oral presentations in this way would be congruent with Wid-
dowson’s (2003, p.173) plea that “in language teaching and testing our focus of
attention should be on the language as a resource of making meaning” and that
(ibid, p.174) “there is more to linguistic competence than a knowledge of gram-
mar, and more to language capability than linguistic competence”. These ideas
are heeded in this project by linking “language capability” to that of the language
used in oral presentations.
Types of assessment are also important. Brown (2001, p.415) argues the
case for self-evaluation and also asks “[a]nd where peers are available to render
assessment, why not take advantage of such additional input?”. He goes on to
cite research he had done with Hudson that showed as advantages for these types
of assessment the “speed, direct involvement of students, the encouragement of
autonomy, and increased motivation because of self-involvement in the process
of learning”. The benefits of peer-assessment are also noted by Mok (2011, p.231):

If thoughtfully implemented, it can facilitate students’ development of various learning and
life skills, such as learner responsibility, metacognitive strategies, evaluation skills, and a
deeper approach to learning. Although the benefits of peer assessment have, to some extent,
been recognized, there are a number of challenges within the classroom practice of the assess-

Luoma (2004, p. 189) also noted that

the motivation for peer in educational settings is more than making students attend to what
is going on in the classroom when they are not communicating themselves, although that
is one of the advantages. It can help learners become more aware of their learning goals,
learn through evaluation, and learn more from each other. Peer evaluation is not a panacea,

Luoma (ibid) also states that students might not be well placed to assess linguistic
criteria ‘because students are not as adept at language analysis as teachers or
raters, whereas task-related criteria may prove more effective’. It is for this reason
that peer-evaluation seemed suitable for oral presentations.
There is also a growing literature concerning oral presentations; some of this
started from the world of business such as Atkinson’s (2008) work on how to move
away from dense texts on slides, Duarte (2008, 2010) looking at how to design
and structure a presentation to create the optimum effect, Etherington (2009)
on how to overcome nerves and gain self-confidence, Gallo (2010) on using a
very good presenter (Steve Jobs) as a role model, Reynolds (2008; 2010; 2011)
on designing simple, effective presentations, and Williams (2010), who looks at
how even people who have little or no design background can produce quality
There is also literature concerning the ELT world. Harrington and LeBeau
(2009), for example, break an oral presentation down into three parts: the physical
message, the visual message and the story message. Each of these stages needs to
be considered. Much of the literature aims to break away from presentations that
people have suffered in the past. As Reynolds notes (2008, p.13) “research sup-
ports the idea that it is indeed more difficult for audiences to process information
when it is being presented to them in spoken and written form at the same time”
and he talks of “death by PowerPoint”, the type of slide show presentation
that has many bullet points on each slide, with each bullet point containing
many words. Reynolds (2008, p.68) refers to text-rich PowerPoint presentations
as ‘slideuments’ and tries to steer people away from them. The developmental
molecular biologist Medina (2008, p.84) points out that “the brain cannot multi-
task”, therefore it would be difficult for it to accept different types of input at the
same time. Medina (ibid, p.240) also writes that “vision is by far our most domi-
nant sense, taking up half of our brain’s resources” and that “we learn and remem-
ber best through pictures, not through written or spoken words.” Ideas like this
have influenced people such as Reynolds (2008; 2010; 2011) and Duarte (2008) to
describe efficient ways of giving oral presentations. Their work originally started
by paying a lot of attention to presentations that use some form of computer sup-
port but then moved on to describe the importance of the story and how to get and
keep people’s attention. Reynolds (2008, p.7) notes that “we all know, through
our own experience that the current state of presentations in business and aca-
demia causes its own degree of ‘suffering’; for audiences and presenters alike”.
According to him there are three principles in presentation: “restraint, simplicity,
and naturalness”. He goes on to suggest how we can give effective presentations
following these principles. When he or Duarte refer to images they do so with
a similar insight to Roam (2008, p.11) who says “we can use the simplicity and
immediacy of pictures to discover and clarify our own ideas and use those same
pictures to clarify our ideas for other people, helping them discover something
new for themselves along the way”.
It is not necessary for people to rely on slideware technology. Indeed, the later
work of both Duarte (2011) and Reynolds (2011) moves away from PowerPoint
and focuses on what Harrington and LeBeau (2009) had identified as the story
aspect of the presentation. Heath and Heath (2007) say that going through a
presentation full of statistics has no impact; it is when there is an emotional story
behind it all that the message you want to transmit will ‘stick’ in the audience’s
This study therefore looks at using modern ideas of oral presentations to teach
the skills necessary for the undergraduates to be able to give presentations in a
clear way. It then goes on to use this skill and evaluate the performance as part of
the students’ final course evaluation for the English course.

4. Methodology

The present study took place in the English course for students of the Social Policy
and Social Work programme offered at the University of Calabria in southern
Italy in the academic year 2010/2011. The students were taught, in English, how
to give an oral presentation. The main points from the literature were explained
to them in class with a focus on the work of Garr Reynolds (2008; 2010; 2011).
That meant trying to encourage them to emotionally connect with the audience
by telling a story, rather than relating dry facts. The students were shown how to
structure a simple presentation with an introduction, main points and a conclusion.
The need for clarity was repeated several times as it is a key factor in any
presentation, as is finding a way to obtain and then to keep the attention of the
audience. The work of Medina (2008), Reynolds (2008; 2010; 2011) and Duarte
(2010) was used to help students avoid creating PowerPoint ‘slideuments’ and
instead use clear images to support their oral presentation.
At first the students performed one-minute presentations about themselves to a
small group of classmates. This gave them the confidence to start speaking in this
L2 in front of others. The next presentation was a very short greeting to the whole
class. This meant that they had to come to the front of the class and say hello and
a few other words of greeting, or whatever they wished to say. The aim of this was
for them to gain confidence and to prove to themselves that they were capable of
talking to a large audience.
For their final oral presentation the students were all asked to prepare a short
oral presentation on a theme connected with their degree course. This resulted
in presentations on such things as “mental health”, “adoption”, “immigration”,
“prostitution”, “rights of children”, “domestic violence” and other such topics.
They were told that they could use PowerPoint (or other slide software such as
Keynote), if they wished, but that it was not compulsory. However, it was
suggested that it might be a useful tool for them, especially if they learnt to use the
notes section, which allows a person to use the computer screen to display the
projected image and the student’s written notes, which are not projected during
the presentation stage.
While each student gave her or his presentation, the rest of the class acted
as the audience and also as peer evaluators. Completing the 60 presentations
took several lessons. Each student completed an evaluation sheet
at the end of every presentation (see Appendix). A seven-point Likert scale was
used, with 1 being the score for “not good” and 7 for “very good”. The teacher
also completed a similar form. The form for the students was bilingual, in Italian
and English, and any comments could be made in either language.

5. Results and discussion

The students did not all complete or hand in their evaluation sheets. The mini-
mum number of evaluations for a presentation was five, while the maximum was
thirty62. However, the overall average was calculated for each presentation based
on all the evaluations available for all the parts of the assessment forms.
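The averaging step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure actually used in the project: the function name `presentation_average`, the dictionary layout of an evaluation sheet and the sample ratings are all hypothetical. The sketch pools every rating found on every returned sheet for one presentation, so incomplete sheets and presentations with few returned sheets are handled in the same way.

```python
from statistics import mean


def presentation_average(sheets):
    """Return the mean of all ratings on all returned sheets for one presentation.

    Each sheet is a dict mapping a question label to a rating on the
    seven-point scale (1 = "not good", 7 = "very good"). Questions left
    unanswered are simply absent from the dict. Returns None when no
    ratings were returned at all.
    """
    ratings = [score for sheet in sheets for score in sheet.values()]
    if not ratings:
        return None
    return mean(ratings)


# Hypothetical data: two returned sheets, the second one incomplete.
sheets = [
    {"q1": 4, "q2": 5, "q3": 4, "q4": 3, "q5": 4},
    {"q1": 3, "q3": 4, "q5": 5},
]
overall = presentation_average(sheets)
```

Pooling ratings in this way gives each answered question equal weight; averaging per sheet first and then across sheets would weight incomplete sheets differently, and the chapter does not say which variant was applied.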
The evaluation form contained five questions, asking students to assess the
following:
• Question 1, the physical aspect of the presentation: posture, use of gestures,
eye contact.
• Question 2, the use of visual support (if any) in the presentation: pictures
or computer images.
• Question 3, the use of voice in the presentation: clarity of speech, intonation,
and volume.
• Question 4, the ease of understanding the presentation: clarity, structure, etc.
• Question 5, an overall evaluation of the presentation.
The results can be seen in Table 1. The students’ peer-evaluations are all shown
as an average. The results from the teacher’s evaluations are also shown as aver-
age scores.

62 One drawback perhaps of the design of this project was that the students obviously did not feel
the obligation to complete the evaluation forms. The low return rate of evaluation forms needs
to be addressed in future projects.

Table 1. Average scores for the five evaluation questions

Question                  Average student assessment   Teacher’s assessment
1 – physical aspects      3.72                         3.22
2 – visual aspects        4.17                         3.85
3 – use of voice          3.88                         4.1
4 – understandability     3.9                          4.03
5 – overall assessment    3.94                         3.72

As a seven-point scale had been used, 3.5 refers to the middle point, at which the
assessor expresses neither a positive nor negative assessment. Seven shows a very
positive evaluation and 1 an extremely negative one.
The form also gave space for the students to write comments on the good
points, on the weak points and in general about the presentation they had just
observed. There were a total of 243 positive comments, 140 comments on weak
points and 81 general comments.
Peer-evaluation was very similar to the teacher’s evaluation. According to Table 1,
peer-assessment gave slightly higher values on questions one and two and on the
overall impression mark, but not on question three (use of voice) or question four,
the one about ease of understanding and clarity. It is possible that the teacher,
being a native English speaker, found the presentations easier to understand than
the students, who were listening to an L2. It should be remembered that Luoma (2004) stated that
students were “not as adept at language analysis as teachers or raters”. Only in
question one did the students receive an average mark that was below the mid-
point of 3.5. This is indeed the point at which the evaluation becomes more nega-
tive than positive. Here, the teacher gave a slightly negative evaluation of the stu-
dents’ average performance. This means that aspects such as intonation, volume
and vocal expression were not assessed positively. A factor involved here is that
some students read the presentation word for word and others had memorised it
by heart. These methods tend to flatten the expressive quality in presenters who
have little presenting experience and/or are nervous. All the other averages were
above mid-point and so were positive. If we accept the teacher here as the ‘expert’
whose evaluation should be accepted, then this homogeneity of scores would point
to peer rating being a valid way to assess. One aspect that is impossible to show
in a written document like this is the quality and the enthusiasm demonstrated in
the presentations. The fact that the number of positive comments far outweighed
the negative ones could be used to suggest the quality of the presentations.

As a part of their overall evaluation of the English course students were asked
to comment on aspects of the course that they enjoyed, aspects they did not enjoy
and to make comments for possible changes in the future. Of the sixty students
only 35 completed this part of the evaluation form, and some of these did not
write comments in all the sections. Although the results of this part of the
evaluation form are not connected to this work, some comments are
pertinent. Five students commented on the fact that they had to do a lot of work
to obtain the credits for this course; the workload of having to write a paper, do
a reading comprehension paper and give an oral presentation was deemed excessive.
Eighteen students agreed with what this student said63: ‘At first I did not think I
could do the oral presentation. Now, I am happy that we did it.’ Fifteen students
wrote about their feelings for the final assessment along the lines of this example:
‘It was good not to have a big final exam in the exam session.’ It would have been
useful to have had a more detailed follow-up questionnaire focusing on students’
views of oral presentations and peer-evaluation. However, maybe these com-
ments can be interpreted as saying that the students, at least a number of those
concerned enough to make comments, found this work useful and helpful in their
studies as well as removing a moment of stress from their general exam period. It
also allowed the students to be in control of part of their assessment as they had
the opportunity to prepare in advance.

6. Conclusion

Presentations usually happen at key moments, e. g. sometimes they are a sales
pitch for a new product or project, and sometimes they try to have an effect on
what is to come. Sometimes they are at a conference where people are presenting
results of research work. Whenever they take place, they are usually important
moments for the presenter. They are also potentially moments of great interest or
of boredom for the audience. Given their importance, it would seem obvious that
effort should be put into making them as effective as possible. There is literature
about how the brain can digest information and this should be used to inform
our presentations. An oral presentation must be different from a document that
people can read, be it on paper or projected onto a screen, as there is a human ele-
ment involved – the presenter. It should be a time for clear communication. This
project addressed the need that students have to give presentations in academic
or future work settings and reported on how they were taught to give effective
presentations in an L2. As this study took place in an academic setting, it was
decided to use the oral presentation, as well as peer-evaluation, as part of the
final assessment for the English course the students had to take.

63 Translated from Italian by the author of this chapter.

The results shown above indicate that peer-evaluation can be a useful tool, as
it generally coincided with the teacher’s evaluation, and that using oral presen-
tations as part of the final assessment was considered valid64 by the students.
The length of time it takes to arrive at the final average peer-assessment score
might be considered one of the ‘challenges within the classroom practice of the
assessment’ that Mok (2011, p.231) noted. A future project would have to find
a way to speed this up, and would also have to investigate how to receive more
constructive feedback from the peer-evaluation forms, as this was a weakness of
the project design of this study.
The project proved valuable in meeting one of the needs of the students, that
of giving oral presentations in an effective way, and in providing a useful way
of evaluating which could bring the advantages for learners that Brown (2001),
above, noted ‘speed, direct involvement of students, the encouragement of auton-
omy, and increased motivation because of self-involvement in the process of
learning’ (ibid, p. 415). It can also be claimed that the variety of topics and per-
sonal effort put into making these topics the theme of oral presentations which
were used for assessment purposes would meet Douglas’ (2010) call to “introduce
a degree of creativity and personality into the tests” (ibid, p. 145).


References

Atkinson, C. (2008). Beyond Bullet Points. Washington: Microsoft Press.

Bachman, L. & Palmer, A. (2010). Language Assessment in Practice. Oxford:
Oxford University Press.
Brown, H. D. (2001). Teaching by Principles: An interactive approach to Lan-
guage Pedagogy 2nd ed. New York: Longman.
Douglas, D. (2010). Understanding Language Testing. London: Hodder Edu-
Duarte, N. (2008). Slide:ology The art and science of creating great presen-
tations. Sebastopol: O’Reilly Media Inc.
Duarte, N. (2010). Resonate. New Jersey: John Wiley and Sons.
Etherington, B. (2009). Presentation Skills for Quivering Wrecks. London: Mar-
shall Cavendish Business.
Gallo, C. (2010). The Presentation Secrets of Steve Jobs. New York: McGraw

64 Validity is seen here as the fact that students accepted this form of assessment.

Harrington, D. & LeBeau, C. (2009). Speaking of Speech new edition. Tokyo:
Heath, C. & Heath, D. (2007). Made to Stick: Why some ideas Survive and
others Die. New York: Random House.
Jordan, R. R. (1997). English for Academic Purposes. Cambridge: Cambridge
University Press.
Küçük, F. & Walters, J. D. (2009). How good is your test? ELT Journal, 63(4):
Luoma, S. (2004). Assessing Speaking. Cambridge: Cambridge University Press.
Medina, J. (2008). Brain Rules: 12 principles for Surviving and Thriving at Work,
Home and School. Seattle: Pear Press.
Mok, J. (2011). A case study of students’ perceptions of peer assessment in Hong
Kong. ELT Journal, 65(3): 230–239.
Reynolds, G. (2008). PresentationZen. Berkeley: New Riders.
Reynolds, G. (2010). PresentationZen. Berkeley: New Riders.
Reynolds, G. (2011). The naked presenter. Berkeley: New Riders.
Roam, D. (2008). The Back of the Napkin. New York: Portfolio, Penguin.
Widdowson, H. G. (2003). Defining Issues in English Language Teaching. Oxford:
Oxford University Press.
Williams, R. (2010). The Non-Designer’s Presentation Book. Berkeley: Peachpit

Part IV
High-Stakes Exams
High-stakes Language Testing
in the Republic of Cyprus

Salomi Papadima-Sophocleous65
Cyprus University of Technology

There has been substantial research in second language testing in different parts of the world such as
the US, UK, and Australia but little has been written about language testing in Cyprus. This chapter
aims to offer a historic and evaluative overview of high-stakes language examinations offered by
the Examination Service (ES) of the Ministry of Education and Culture of the Republic of Cyprus
in the last fifty years and some suggestions for improvement. The study records and examines the
different types of ES high-stakes language examinations from 1978 to 2010. A closer, comparative
look at all high-stakes language examination papers of the ES from a specific year (2009)
reveals that such examination practices have so far been based mainly on similar, mostly
high-stakes, international English examinations such as the First Certificate exam
(Cambridge ESOL), and not on an analysis of local needs. The present investigation also brought to
light that there is a need, not only for constant updating of the existing high-stakes examinations, but
also for continuous and systematic research in language testing in the Cyprus context. The chapter
concludes with a speculative look at possible future improvements of high-stakes language testing
in Cyprus.

Key words: history, second language testing, high-stakes language examinations.

1. Introduction

During the three successive periods in the history of language testing (Spolsky,
1976), Cyprus was undergoing turbulent historical, political, social and
other changes, which affected its education system and, as a consequence, its
language assessment and testing practices. For this reason, the evolution of language
testing in the Republic of Cyprus has not yet been thoroughly researched. The
present investigation is a step towards this end, in other words towards researching
and systematically recording this history in order to understand the context in
which examinations have evolved and developed to the present day. More specifically,
the aims of this chapter are twofold. The first is to offer an overview of high-stakes
language examinations administered by the Examination Services of the Ministry
of Education and Culture of the Republic of Cyprus over the last 50 years (1960–2010).
This period coincides with two major events: the beginning of the modern history of
language testing (Spolsky, 1976) and the establishment of the Republic of Cyprus
as an independent country, after a long period of Turkish (1571–1878) and British
Rule (1878–1960). The second is to offer a first evaluative review
of these high-stakes language examinations and some suggestions for their
improvement.

2. Background context of the history of language testing

2.1 Historical periods in the recent development of language testing

In his first attempt to identify historical periods in the recent history of language
testing, Spolsky (1976) identified three periods: the ‘traditional’ or ‘pre-scientific’,
the ‘psychometric-structuralist’ and the ‘psycholinguistic-sociolinguistic’. While
the first two marked the beginning of the history of language testing and
brought considerable changes (Davies, 2003; Lado, 1961; Spolsky 1995,
2008), it is during the third period that language testing has evolved substantially
in different aspects such as validity, reliability, language proficiency,
authenticity, alternative assessment, new technologies, assessment for academic
purposes, washback, impact, ethics, politics, and mediation (Alderson, 1991,
1998; Bachman 1990, 1991; Canale, 1983; Canale and Swain, 1980; Dendrinos,
2006; Oller, 1979; Pennycook, 2001; Shohamy, 2001; Spolsky 1995, 2008; Weir,

2.2 Historical overview of the linguistic landscape in the Republic of Cyprus

While in the rest of the world there was continuous development in second language
testing long before 1960, this was not the case in Cyprus. This was due to
the fact that during most of its history Cyprus was under foreign rule and had not
experienced any long period of independence. However, Cyprus has always been
a crossroads of languages and civilisations (Lyssiotis, Nicolaidou and Georgiou,
2010; Mallinson, 2010; Maratheftis, 1992; Michael and Nicolaidou, 2009; Miltiadou,
2009, 2010). Greek has been the most prominent language since ancient
times (1500 B. C.) and has been preserved to this day; French was used during medieval
times (1192–1489), and Italian during the Venetian Rule (1489–1571). Turkish
was used during the Ottoman Rule (1571–1878) (Filippou, 1930) and English was
introduced during the British Rule (1878–1960) (Constantinides, 1930; Myrianthopoulos,
1946; Spyridakis, 1954; Pedagogical Institute of Cyprus, 2003). Assyrians,
Persians, Arabs and others also spent some time in Cyprus at different
times for various reasons and left their linguistic and cultural mark on the island.

All these languages were constantly tested in everyday situations such
as commerce and administration through the centuries, not in a monolingual but
in a multilingual, authentic way and through mediation (Dendrinos,
2006). After 1960, Greek (one of the two official and most widely spoken languages of
the Republic of Cyprus), English and French (1963–1964) were taught in government
schools of the Republic. Other languages such as German, Spanish, Russian
and Turkish (the second official language of the Republic) have since been
gradually added as electives in the secondary school curriculum (Ministry of
Education and Culture Cyprus, 2004; Language Policy Division Strasbourg and
Ministry of Education and Culture Cyprus, 2003–2005).
During Spolsky’s pre-scientific period, before the 1950s, there was little
preoccupation in Cyprus with language testing, let alone with research in the
area. During this time, Greek Cypriots perceived education not only as a
means for personal and social advancement, but also as a weapon for national
survival. They were responsible for their own education (Maratheftis, 1992, p.
12–13) during the Ottoman (1571–1878) and the British Rule (1878–1959) until
1923. Then the English Education Office took full responsibility and the system
became centralised (Maratheftis, 1992, p. 16). For the pre-scientific period,
there is hardly any recorded evidence regarding language learning methods and
testing in Cyprus. It was only with the proclamation of the Republic of Cyprus
(1960), which coincided with the beginning of the third and most recent period
of the history of language testing, that high-stakes language testing started
evolving in the Republic.

2.3 Language testing research in the Republic of Cyprus

Most research and publications in the Republic of Cyprus deal with Education in
general (Pylavakis, 1929; Constantinides, 1930; Myrianthopoulos, 1946; Spyridakis, 1954;
Persianis, 1998, 2006; Karagiorges, 1986; Persianis and Polyviou,
1992). These have little to say about assessment and testing in the Republic of
Cyprus in general (Karagiorges, 1986; Maratheftis, 1992) or about language
assessment and testing in particular. It is only in the last 15 years that research
started being conducted in the area of language testing in the Republic of Cyprus
(Pavlou, 1998; Pavlou and Ioannou Georgiou, 2005; Papadima-Sophocleous,
2007, 2008; Papadima-Sophocleous and Alexander, 2007; Tsagari and Pavlou,
2008). Nevertheless, there has hardly been any research regarding the history of
language testing in the Republic of Cyprus or any evaluative research of high-
stakes language examinations.

3. Overview of government high-stakes language
examination practices

Up until 1960, there is not much evidence of prior existence of high-stakes exami-
nations. Testing was mainly administered in class and set by classroom teachers.
Soon after the 1960 proclamation of the Republic of Cyprus, evidence of high-
stakes testing administration was recorded.

3.1 Secondary education and examinations

The system of Common Examinations in the Republic of Cyprus was first
implemented for the Gymnasium entrance examination in July 1965. This
examination was common to all students and tested them on what they had
studied in the fifth and sixth grades of primary school. The entrance exami-
nations from primary to secondary school were abolished in 1974 (Maratheftis,
1992, p. 41).
The Academic Certificate (Akadimakion Apolytirion), introduced in 1964–1965,
was “awarded after a successful examination, a type of academic qualification
similar to the French Baccalaureat” (Karagiorges, 1986, p. 50). The exami-
nation for the attainment of the Academic Certificate was also partially used as
an “entrance examination for those tertiary institutions in Greece for which stu-
dents’ entrance was limited” (Annual Report 1966–1967, p. 16–17).
In June 1966 school-based Common Examinations (Eniaies Exetaseis) were
piloted for the main subjects (these subjects are not specified in the Annual
Report 1966–1967) in grades 3 and 6 of all secondary schools (Republic of
Cyprus, Ministry of Education, Annual Report 1966–1967, p. 22). These exam-
inations were based on the curriculum taught and were prepared by inspec-
tors. They were then sent to the school principals who set the examinations
of each subject on the same day in all secondary schools. According to the
Annual Report 1966–1967, this type of examination was new for the education
system of the Republic of Cyprus. There is no mention, however, of any foreign
language testing component in any of the abovementioned examinations even
though English was re-instated in primary grades 5 and 6 in 1965 (Maratheftis,
1992, p. 37) after the end of the British Rule and the proclamation of the Repub-
lic of Cyprus, and French was introduced as a school subject in 1963–1964
(Annual Report 1966–1967, p. 24).
In 1972, the General Examinations (Genikes Exetaseis) at the end of the school
year were abolished (Maratheftis, 1992, p. 39).

3.2 Tertiary education and examinations

Although the oldest Examination Guide found that evidences the administration of
entrance examinations for Tertiary Institutions in Greece dates from 1978, other
evidence suggests that these must have been administered earlier than that.
In the 1966–1967 Annual Report (p. 22), it is noted that a Common Examination
system was also implemented for the university entrance examinations of July
1966 and 1967. These examinations did not, however, include a foreign language
component. Maratheftis (1992, p. 38) also mentions that in 1968 the Entrance
Examinations for Universities in Greece were introduced. These examinations
continued until 2005. The same examinations were also used for entrance to the
University of Cyprus from 1992 to 2005.
In 2006 the Common Examinations (CE) and the University Entrance Exami-
nations (UEE) merged into one named Pancyprian Examination (PE). This new
high-stakes examination has since then been used both as a school-leaving exam-
ination and as a university entrance selection (Lamprianou, 2012).

4. High-stakes language examinations and
the Examination Service

The Examination Service (ES) of the Directorate of Higher Education of the Min-
istry of Education and Culture in the Republic of Cyprus has the responsibility
of the organisation, development, administration, and marking of high-stakes
examinations as well as the publication of the annual Examination Guides and
the maintenance of the Examination Service website66.
The Examination Guides (1978–2011), published annually, and the ES website
are the main and most reliable sources for this research. The earliest Guide copies
found date from 1978 and 1979. The rest date from 1987–1988 to 2011–2012. They
include an overall description of the ES examinations in each subject area, includ-
ing that of languages. From 1998, they also include examination papers. From
these Guides, it is possible to establish the development of the format and content
of the examinations. The ES website provides electronic archives of all the types
of high-stakes examinations offered by the ES.
The Examination Service offers the following types of high-stakes examinations:
(a) Exam Type A: The Pancyprian Examinations. These are high-stakes examinations
and the largest in scale in the Republic, in terms of test-taker participation.
These serve both as School Leaving Certificate and as University
Entrance Examinations. They test a great number of high school subjects
including languages;
(b) Exam Type B: The Language Examinations for Greeks Living Abroad.
Greece and the Republic of Cyprus have a long history of migration. This
resulted in many Greeks residing in many countries in the world. The Min-
istry of Education and Culture of the Republic of Cyprus offers special lan-
guage examinations for such people. The aim is to establish the level of their
knowledge and competence in languages such as Greek, English, French,
German, Italian and Spanish.
(c) Exam Type C: For Appointment or Promotion in the Police Force. The ES,
in collaboration with the Police Force, is also responsible for preparing and
marking the Examinations for appointment/promotion in the Police Force
(Greek, English, French, German, Italian, Spanish, Russian, Turkish and
Bulgarian) (ES, Ministry of Education and Culture website).
(d) Exam Type D: The Greek and English Examinations for the Certification
of Language Competence at ‘Very Good’ and ‘Excellent Knowledge’. Such
certification is required for the appointment or promotion in the government
and educational services.
(e) Exam Type E: The Examinations for Appointment in the Public Sector (city
councils, water boards, semi-government organisations, and any other legal
public entities) at Α2-5-7 scales, which are ranks used to classify civil ser-
vants. These examinations test Greek, English, French and German.
(f) Exam Type F: The Examinations for Appointment in the Public Sector (city
councils, water boards, semi-government organisations, and any other legal
public entities) at Α8-10-11 scales, which are ranks used to classify civil ser-
vants. These examinations test Greek, English, French and German.
(g) Exam Type G: The Examination Service is also responsible for preparing and
marking the Examinations for appointment/promotion in the public sector,
semi-government organisations and city councils (Greek) (ES, Ministry of
Education and Culture website) (see Table 1).

Table 1. High-stakes ES language examinations

Columns:
(1) Pancyprian exams
(2) Certification of language competence of Greeks from abroad
(3) For appointment or promotion in the Police Force*
(4) For certification of competence: ‘Very good’ / ‘Excellent’*
(5) For appointment in the Public Sector (Α2-5-7 scales)*
(6) For appointment in the Public Sector (Α8-10-11 scales)*
(7) For appointment in the Public Sector and semi-gov. org., Municipalities etc.*

Language (Mother tongue: M / Foreign: F)   (1)  (2)  (3)  (4)  (5)  (6)  (7)

Greek (M)                                   ✓    ✓    ✓    ✓    ✓    ✓    ✓
English (F)                                 ✓    ✓    ✓    ✓    ✓    ✓    -
French (F)                                  ✓    ✓    ✓    -    ✓    ✓    -
German (F)                                  ✓    ✓    ✓    -    ✓    ✓    -
Italian (F)                                 ✓    ✓    ✓    -    -    -    -
Spanish (F)                                 ✓    ✓    ✓    -    -    -    -
Russian (F)                                 ✓    -    ✓    -    -    -    -
Turkish (F)                                 ✓    -    ✓    -    -    -    -
Bulgarian (F)                               -    -    ✓    -    -    -    -

*(Examinations organised in collaboration with public services, semi-public organisations,
municipalities, etc.)

This study examines all the language examinations (Greek, English, French,
German, Italian, Spanish, Russian, Turkish and Bulgarian) administered by the
Examination Services of the Ministry of Education and Culture of the Republic
of Cyprus. These examinations are set and marked by inspectors and teachers, in
cooperation with specialists from universities.
For the purpose of this chapter, two types of high-stakes ES examinations are
examined:
(a) The High-stakes Pancyprian Language examinations, dating from 1978 to
2010 (examination papers not found: 1980–1986);
(b) Thirty-four 2009 ES examination language papers, which examine Greek,
English, French, German, Italian, Spanish, Russian, Turkish and Bulgarian
(language examinations for school leavers and other organisations such as
the police, government and semi-government organisations, and municipalities).
5. High-stakes Pancyprian examinations

A closer view of the Examination Guides and the ES website gave insights into
aspects of Pancyprian examinations such as their historic development, lan-
guages examined, exam format and content, mark allocation, areas assessed, text
types and their rhetoric styles, language level of competence, types of activities
for each language and comparison amongst language examinations, text and test
length, types of components, format, and number of words for written texts. The
first guides found, published by the Higher Education Directorate of the Ministry
of Education and Culture, date from 1978.

5.1 Language requirement for tertiary studies, High-stakes language
examinations and Examination Guides

Language examinations were required for specific university fields of study such
as philology and law (Examination Guide, 1978–1979). According to the Guides,
from 1990 onwards, more and more fields of study required a language exami-
nation. For example a language examination was required for entrance purposes
in the fields of English, French, German and Italian Language and Literature and
for primary teaching, and nursing in 1990–1991 examinations, for translation
and interpretation (1991–1992), for foreign language and philology (1992–1993),
Hotel studies (1993), modern language and European studies (1996) and after
that, for fields of studies such as Social and Political Sciences (1996), Physical
Education, Economics, Physics and Chemistry (1999), etc.

According to Guides found as early as 1978, high-stakes language examinations
tested English, French, German and Italian. Gradually, other languages
were included in these high-stakes examinations: Turkish (1992), Spanish (1998)
and Russian (2005).
In the 1978–1979 language examinations, the papers had the same format and
content in all languages examined (English, French, Italian, and German). For the
writing component, examinees were given at least 4 topics to choose from and
were required to develop topics of different types (narration, description, devel-
opment of ideas). The length required of the essay was 250 words. In the section
“Text for comprehension”, a text was given with comprehension questions, and
grammar and vocabulary completion, change and multiple-choice exercises.
Although the guides of the years 1980 to 1988 were not found, the format
and content of the 1988–1989 and 1989–1990 examination guides remain closely
similar to the 1978–1979 Guide. The assumption is that, although some minor
changes were evident regarding aspects such as the number of words for the essay
or the number of essay topics, all language exams, from 1978 to 1990 were gen-
erally very similar in format and content: essay writing, reading comprehension
and exercises (grammar and syntax).

5.2 Language examination features

There were three distinctive periods of change in the high-stakes language exam-
inations: from 1978 to 1997 (earlier period), from 1998 to 2005 (middle period),
and from 2006 to 2010 (recent period).
During the early period, the format and content of English, French, Ital-
ian and German examinations remained more or less the same, although some
minor changes were evident regarding aspects such as the number of words for
the essay or the number of essay topics. The examination papers included two
parts (essay writing and reading comprehension) and remained the same until
1988–1989. From 1989 to 2005, they comprised three parts (essay writing, reading
comprehension, and grammar and vocabulary exercises). From 2006 to 2010,
English, French, German, Italian, and Spanish added a fourth component, listening
comprehension. In the middle period, the mark allocation was divided as follows:
30 points for essay writing, 30 points for reading comprehension and 40 points for
grammar and vocabulary exercises, with minor variations in the allocation of points
from one language to another:

Table 2. Middle period high stakes language examinations mark allocation

Essay Writing Reading comprehension Grammar and Vocabulary

30 points 30 points 40 points

From 2006, in the examinations including listening comprehension, the mark
allocation differed by language. For French, German and Spanish it was 30 points
for writing, 30 points for reading comprehension, 20 points for grammar and
20 points for listening comprehension; for English, 25 points for writing, 40 points
for reading comprehension, 15 points for grammar and 20 points for listening
comprehension; and for Italian, 30 points for writing, 25 points for reading
comprehension, 25 points for grammar and 20 points for listening comprehension:

Table 3. Recent period high stakes examinations mark allocation in different language examinations

Essay Writing Reading Comprehension Grammar Listening


French 30 points 30 points 20 points 20 points

German 30 points 30 points 20 points 20 points

Spanish 30 points 30 points 20 points 20 points

English 25 points 40 points 15 points 20 points

Italian 30 points 25 points 25 points 20 points

The study of Tables 2 and 3 reveals the similarities and differences in the mark
allocation in the different periods given to the different language examinations.
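The allocations in Tables 2 and 3 can be compared more easily when each component is expressed as a share of the paper's total. The following is a minimal sketch in Python, assuming the point values exactly as given in the two tables. The names MIDDLE_PERIOD, RECENT_PERIOD, total_points and component_shares are hypothetical labels chosen for this illustration.

```python
# Mark allocation in points, copied from Table 2 (middle period).
MIDDLE_PERIOD = {
    "essay writing": 30,
    "reading comprehension": 30,
    "grammar and vocabulary": 40,
}

# Mark allocation in points, copied from Table 3 (recent period).
RECENT_PERIOD = {
    "French": {"essay writing": 30, "reading comprehension": 30, "grammar": 20, "listening": 20},
    "German": {"essay writing": 30, "reading comprehension": 30, "grammar": 20, "listening": 20},
    "Spanish": {"essay writing": 30, "reading comprehension": 30, "grammar": 20, "listening": 20},
    "English": {"essay writing": 25, "reading comprehension": 40, "grammar": 15, "listening": 20},
    "Italian": {"essay writing": 30, "reading comprehension": 25, "grammar": 25, "listening": 20},
}


def total_points(allocation):
    """Sum the points of all components of one examination paper."""
    return sum(allocation.values())


def component_shares(allocation):
    """Express each component as a fraction of the paper's total points."""
    total = total_points(allocation)
    return {component: points / total for component, points in allocation.items()}


if __name__ == "__main__":
    print("Middle period total:", total_points(MIDDLE_PERIOD))
    for language, allocation in RECENT_PERIOD.items():
        shares = component_shares(allocation)
        summary = ", ".join(f"{c}: {s:.0%}" for c, s in shares.items())
        print(f"{language} (total {total_points(allocation)}): {summary}")
```

With these figures every paper totals 100 points, and listening comprehension carries 20 points in each recent-period paper, so the differences between languages lie in the other three components.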
Another high-stakes language examination feature that showed similarities
and differences was time allocation. The time allocation was indicated
for the first time in all examinations in 1998 (middle period): two hours and
thirty minutes for English and two hours for all other languages. Time allocation
changed during the recent period. In 2006 it changed to three and a half hours
for English, to three hours for French, and to two and a half hours for German,
Italian, Spanish and Russian, with Turkish following in 2007. It changed to three
hours and fifteen minutes for French, German and Spanish in 2008 and for Italian
in 2009, when the listening comprehension component was added.
The study of Table 4 helps in drawing some conclusions regarding the required
length for essay writing: The first conclusion is that from 1978 to 1997, the essay
length of the languages examined remained 250 words. The second conclusion
is that from 1998 to 2010 the length of the essay varies for each language. This
ranges from 180 to 350 words, with English being mostly the longest and German,
Italian, Turkish and Spanish being the shortest. The third conclusion is that the
changes in length vary in a range of 50 words for most, with the distinct exception
of English, the length of which ranges from 250 to 350 words, through the years.
The last observation is that, in 2010, the maximum number of words for all essay
writing is 200 except English. The difference, of course, is linked with the fact
that the level of competence expected in English is higher than that of the rest of
the languages.

Table 4. Required length for essay writing over time (w: words)

1978–1997 1998–2005 2006–2010

English 250 w 300–350 w 250–300 w

1978–1997 1998–2001 2002–2006 2007–2010

French 250 w 180–200 w 220–250 w 180–200 w

1978–1997 1998–2007 2008–2010

German 250 w 120–150 w 150–200 w

1978–1997 1999–2003 2004–2010

Italian 250 w 250 w 150–200 w


Turkish 150–200 w


Spanish 150–200 w

Table 5 shows that from 1978 to 2010, for English, French, German, Italian and
Russian, the writing task mostly used has been essay writing. However, since
2006, for Italian, Spanish and Turkish, candidates are required to produce varied
types of texts such as letter and email. The rhetoric styles of the written text
types ranged from narration to description, development of ideas, opinions and
arguments. A more detailed study of the writing tasks also revealed that, from
the point of view of the test writer, there is no clear understanding of the difference
between text types and rhetoric styles. In the writing task instructions, these
are used interchangeably as if they were the same thing. The following are some
examples:

• “Topics are related to the following types of speech: narration, description,
development of ideas and letter” (English examination paper, 2000).
• “Narration, description, communication or development of ideas essay”
(French examination paper, 2005)
• “Topics may include formal or semi-formal letter, essay, description/narration
of experiences, happenings or processes” (German examination paper, 2000).
The text types the examinees are required to produce gradually become more
authentic-like in some examinations (Spanish, Italian), in other words relevant to
the everyday needs of the candidates, for example letter, email, weekly schedule,
card, article. In other cases, examinees are simply asked to write an essay or a
composition, a type of text that is pedagogical rather than real-life in nature,
although it deals with current topics (for example family, school, the
importance of learning a language): “Write an essay about how you spend your
weekdays and weekends, what you do every morning and evening, what do you
do on Saturdays and Sundays” (Russian examination paper, 2010). Moreover, the
tasks presented to the examinees are mainly pedagogical in nature (develop your
ideas on a topic) rather than real-life ones (for example ‘you are hosting a Comenius
student from the UK for five days. Prepare a daily schedule of the times you
would show him around’).

Table 5. Writing task text types, rhetoric styles and kinds of activities

     | 1978–1990    | 1990–1997                | 1998–2005                            | 2006–2010
EN   | Essay        | Essay                    | Essay                                | Composition; Letter writing
FR   | Essay        | Essay                    | Essay                                | Letter, travel article
GER  | Essay        | Essay                    | Personal / semi-formal letter; Essay | 2006: Letter; 2007–2010: essay
IT   | Essay        | Essay                    | Essay                                | 2009: Essay, letter; 2010: Weekly schedule
TURK | Not examined | Description not provided | Description not provided             | Essay; 2009: Letter
SPA  | Not examined | Not examined             | 2004: Essay                          | 2008: cards, articles, emails, response to
RUS  | Not examined | Not examined             | 2005: Essay                          | Essay

The texts used for reading (e. g. brochure, advertisement, weather report), and lis-
tening comprehension (e. g. telephone conversation, announcement, and conver-
sation) are gradually becoming more authentic. However, this is not consistently so
in all language examinations. Moreover, the tasks are mainly of pedagogical nature
(for example multiple-choice, true-false) rather than authentic or authentic-like.
The texts used as prompts for language use exercises are gradually becoming more
authentic-like or authentic (e. g. a postcard, an email, a letter, a dialogue about
going to the cinema, Italian examination paper 2009); they are not easy to
categorise, and they deal with current topics. Moreover, the tasks and questions
given to the examinees are mainly of a pedagogical nature (e. g. fill in the blanks)
rather than authentic-like. In addition, the language competence level expected is
not indicated in any examination except the 1978 English paper, where the level is
described as ‘advanced’ without any further information, such as the language
competence level system and the criteria used to determine this level. As a result,
there is no clear evidence of comparability across the language examinations.

6. Study of 2009 ES language examinations

A closer, comparative look was taken at the characteristics of the thirty-four
language examination papers of one specific year, 2009 (see Figure 1). Their
similarities and differences were identified as regards the following aspects:
language, structure, duration, content (written and oral production and
comprehension, linguistic elements: grammar, syntax and vocabulary), topics, marks,
language used for instructions and language level of performance expected.

Figure 1. Number of 2009 examination papers per language

The thirty-four examination papers covered nine different languages.
The study of the ES language examinations revealed that they were not all
introduced at the same time: Greek, English, French, German and Italian from as
early as 1978, as indicated by the Guides found, Turkish in 1992, Spanish in 1998,
Russian in 2005 and Bulgarian (date not found), the latter only for police
examinations. The study also revealed that the language examinations had different
purposes: for example, they tested candidates’ linguistic competence for school
certificate attainment, for university entry, or for appointment or promotion in
different government or semi-government bodies, the police or local councils, etc.
The 34 papers comprised 7 Greek, 6 English, 5 French, 5 German, 3 Italian,
3 Spanish, 2 Russian, 2 Turkish and 1 Bulgarian.
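
The per-language counts reported above can be cross-checked with a short tally. The following Python sketch is only an illustration and is not part of the original study: the variable names are hypothetical, and the figures are those given in the text.

```python
# Number of 2009 ES examination papers per language, as reported in the text.
# The dictionary name and layout are illustrative assumptions.
papers_2009 = {
    "Greek": 7,
    "English": 6,
    "French": 5,
    "German": 5,
    "Italian": 3,
    "Spanish": 3,
    "Russian": 2,
    "Turkish": 2,
    "Bulgarian": 1,
}

# Total number of papers and number of languages covered.
total = sum(papers_2009.values())
languages = len(papers_2009)

# Share of each language in the sample, as a percentage rounded to one decimal.
shares = {lang: round(100 * n / total, 1) for lang, n in papers_2009.items()}

print(f"{total} papers across {languages} languages")
for lang, share in shares.items():
    print(f"{lang}: {papers_2009[lang]} papers ({share} %)")
```

The tally confirms that the nine per-language counts add up to the thirty-four papers mentioned in the text.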
The 2009 language examination papers varied in structure. The study revealed
that 15 % of the 2009 ES language examination papers were divided into four
sections, 70 % into three and 15 % into two. The duration of each ES 2009
language examination also differed (see Table 6).

Table 6. Duration for each language examination

Duration of examination papers per language (hours)

1.30 2 2.15 2.30 3 3.15 4.15 Not
Language hours hours hours hours hours hours hours mentioned
Greek 1 2 1 2 1
English 3 3
French 1 1 1 1 1
German 2 1 1 1
Italian 1 1 1
Spanish 1 1 1
Russian 1 1
Turkish 2
Bulgarian 1

The study revealed that the duration of the language examinations varied from
1 hour 30 minutes to 4 hours 15 minutes. The difference in duration was evident
not only between the examinations of different languages (Greek, English, French,
German, Italian, Spanish, Russian, Turkish and Bulgarian), but also between
examinations of the same language (e. g. the duration of the Greek examination
ranged from 1 hour 30 minutes to 3 hours). This shows substantial inconsistency
in the duration of the language examinations, which can be taken to mean
inconsistency in measuring candidate linguistic competence across examinations.
Generally, all language examinations assessed the production and comprehension
of written texts and the knowledge of specific linguistic aspects (grammar, syntax
and vocabulary). The Pancyprian examinations of five of the seven languages
(English, French, German, Italian and Spanish, but not Turkish and Russian) also
examine listening comprehension. Finally, the Pancyprian examination of Greek
also tests knowledge of Greek literature. However, these linguistic skills and
knowledge are not always examined in the same section of the paper. For example,
in some examination papers, knowledge of some linguistic elements such as
grammar or vocabulary is integrated with reading comprehension.
The section that focuses on the production of written language is referred to
in different ways (Figure 2). In most cases, the production of writing is examined
in the form of an essay instead of an authentic discourse form (article, letter, report,
speech, email, etc.). The only exception is found in the Pancyprian examinations.
The reading comprehension section is also referred to in different ways:
‘comprehension of written texts’, ‘text for comprehension’ and ‘text for reading –
comprehension’. In the Greek language Pancyprian examination paper, it is referred
to as ‘unknown text’.

Figure 2. Production of written language

The comprehension of written texts is tested through a variety of activities:
traditional type questions (straightforward comprehension questions for main points
without a specific communicative purpose, grammar focused) (11 papers), multiple
choice questions, true/false, word and expression matching, development of the
main idea of a text in a paragraph, scanning and recording evidence, scanning
for the rhetoric questions, filling in blanks in sentences with multiple choice
options, and open or closed-type general questions and vocabulary exercises. They
deal with different aspects of reading through which the comprehension of the text
is achieved: comprehension (9 papers), text summary (6 papers), evidencing,
development, meaning, style, reasons, arguments, personal opinion, topic, conclusion
rewriting and vocabulary questions dealing with arguments and ideas found in
the text.
The texts include mainly newspaper articles. In some languages such as Ital-
ian, Spanish and Turkish, some other text types such as letters, emails and adver-
tisements are used.
From the total of 34 examination papers analysed, only five (Pancyprian: Eng-
lish, French, German, Italian and Spanish) examine listening comprehension and
only one (Certification of very good knowledge of Greek language) examines
All examinations have a section that examines grammar, syntax and vocabulary.
Examinees are tested on synonyms, opposites, etymology and division of
compound words, formation of words from their derivatives, noun, adjective and
verb conjugations, spelling, expressions, syntax, etc.
The topics include mainly current social topics and issues such as environ-
ment, drugs, and friendship. Some deal with topics deriving from the examinees’
learning environment or workplace such as school, criminal arrest, etc.
The length of the texts in the language examinations varies from 150 to 550 words.
Although some texts are as short as 150 words and some as long as 550 words,
most are between 250 and 300 words long:

Figure 3. Text length

The language of instructions is usually that of the language tested. The only
exception is the examinations for Greeks from overseas (English, French, German,
Italian, Spanish). In these language examinations, instructions are given in Greek.
The language level is not clearly indicated for any of the language examinations.
Some, such as English or Greek, specify it in a vague way, without any description
of what a ‘Very good’ or ‘Good’ level means. Although the CEFR has started being
used in the teaching of some languages, there has been no attempt to formally and
systematically align formal language testing in the Republic of Cyprus with the
CEFR scales. Alignment is mainly done intuitively, based on “examiners’ experience
in teaching and testing languages” and on high-stakes testing models such
as First Certificate (Cambridge ESOL). Levels are determined subjectively, and
reference is made to Beginner, Intermediate and Advanced levels, without any
specific description of what they represent. The ministry language inspectors regard
the Pancyprian English language test as equivalent to CEFR B1; however, the exams
are not clearly aligned to CEFR levels or any other similar framework.

7. Suggestions for improvement of high-stakes language
examinations

The present evaluative and comparative review of the high-stakes language
examinations revealed that language testing in the Republic of Cyprus has mainly
been influenced by the changes in language testing reflected in well-known
high-stakes exams, mainly from the English-speaking world, used in the last 50 years,
rather than by research and analysis of local needs. It also revealed the need to
upgrade language testing practices and to align them systematically with current
language testing theories and practices based on empirical research. More
specifically, the following aspects need to be improved in the high-stakes ES
examinations:
1. The language examinations need to be similar in format (types of activities),
content (language and texts) and expectations (level of language competence),
based on current theories and practices. There should be a “generic core of
aims and objectives to ensure consistency in expectations across the language
specialisms” (Language Policy Division Strasbourg and Ministry of Education
and Culture Cyprus, 2003–2005, p. 23) and comparability.
2. All language examinations should use the same language competence refer-
ence framework.
3. Topics should be enriched and go beyond general everyday topics. They should
include topics related to examinees’ interests, personal, social, educational,
and professional life, as suggested by literature (Council of Europe, 2001).

4. Oral and written text types should consistently reflect the types used in
examinees’ personal, educational, social and professional settings and they should
be appropriate to those contextualised and varied settings.
5. All examinations should test all areas of language competence such as listen-
ing comprehension and speaking so that a more comprehensive evaluation of
candidate language competence can be made.
6. Testing of linguistic knowledge should not constitute a separate and isolated
section of the examination paper. It should be relevant to the content and con-
text and constitute an integral part of the other sections of the examination.
7. Language examinations for specific purposes should be tailored according to
the needs of particular groups. In other words, they should test examinees
on content (topics, texts, roles, audience, etc.) relevant to their profession (for
example police, water board, etc.). For example, the examination for the police
force should draw from oral and written texts used by the police in real situ-
ations (oral or written reports to superiors, form completion, etc).
8. All language examinations (French, Italian, Spanish, Russian, German and
Turkish) for a specific purpose, e. g. for the police, should be compatible; they
should have the same structure, duration, marking, and should test the same
skills and aim at the same level and expected outcomes.

8. Conclusions

This study offered an insight into the history of the high-stakes language exami-
nations offered by the Examination Services (ES) of the Ministry of Education
and Culture of the Republic of Cyprus over the last 50 years (1960–2010). From
the research conducted and reported here, it has been made clear that more
research and systematic recording of this history is important in order to under-
stand the historical and educational circumstances under which high-stakes lan-
guage examinations have developed in the Republic of Cyprus and where they
should be directed in the future.
The first records of ES high-stakes language examinations found date back to
1978. The study of the different types of ES high-stakes language examinations
from 1978 to 2010 established the following: languages were included in the
high-stakes examinations programme from the very beginning. However, each
language was included in the high-stakes examinations at a different time:
English (1978), French, German and Italian (1988), Turkish (1993), Spanish (1997),
and Russian (2005). There were three distinctive periods of change in the
examinations: early period (1978–1997), middle period (1998–2005) and recent
period (2006–2010).

During the first two periods, the basic structure of all examinations consisted
of testing writing, reading comprehension (including vocabulary) and language
use (vocabulary, grammar, structure, and expressions). During the third period,
listening comprehension was added for English, French, German, Italian and
Spanish. On the whole, although some changes have been made through the
years, these seem to be minor. The main changes were: an increase in the number
of words required in written production; the number of topic choices; mark
allocation/weighting; a slight change in the focus of written production; a more
communicative nature of activities; the use of more authentic texts; and some
gradual variation of activity types.
The format, content and other features of the examinations derive mainly from
similar examinations from other countries. During 1978–1997, they were all mainly
influenced by high-stakes English examinations from overseas. Later, particularly
from 2006, each examination followed its own development, often influenced by
high-stakes examinations of its respective language. English, for example, is
influenced by examinations such as the Cambridge ESOL First Certificate Exam
(Language Policy Division Strasbourg and Ministry of Education and Culture
Cyprus, 2003–2005, p. 35). French draws on DELF and other similar publications
(DELF Junior Scolaire B1 – 200 activités – CLE International, examination
paper 2006, p. 2). Some of them use authentic materials and state their
source. The conclusion drawn is that in general they were based on high-stakes
language examination practices of other countries during all three periods. As a
result, they reflected the same theories and practices in language testing as those
examinations, or earlier, more traditional practices.
There is no clear indication anywhere in the documentation that any of these
examinations are based on a specific framework, for example the Common Euro-
pean Framework of Reference (CEFR) for languages (Council of Europe, 2001).
It is evident that there is a need to work towards conformity in both the high-
stakes examination practices in each language and the practices across all lan-
guages. On the whole, the examinations need to become more communicative,
reflecting text types and tasks from real life rather than being of pedagogical
nature. All activities need to be given communicative, contextualised and mean-
ingful scenarios, where examinees have real-life like roles and engage themselves
in authentic-like writing, reading comprehension and listening activities.
In addition, it is evident that practices should move beyond being informed
only by practices of high-stakes examinations in each language and move towards
combining practices with sound research at local level, needs analysis of the
specific context and development of high-stakes language examinations that would
reflect the local needs and context. This inevitably means cooperation of practising
teachers and existing test writers and examiners with specialists in the area of
language testing and high-stakes examination development, as well as with
specialists in training experienced language teachers in high-stakes examination
processes.
Moreover, a comparative study of all high-stakes language examination papers
of the ES of 2009 revealed the similarities and differences amongst them as well
as their close resemblance with examinations such as Cambridge ESOL First
Certificate. It was evident that high-stakes language examinations in the Repub-
lic of Cyprus need to be based on analysis of local needs, and that continuous
research needs to be conducted in this area. Furthermore, such research needs to
be extended to other areas such as formative and summative assessment and test-
ing in the classroom, not only in government settings but also in private schools,
universities, colleges and other educational settings. Aspects of such research
could include, not only the test content, but also exam administration practices,
testers’ training, test validity and reliability, ethics, etc. Such research will not
only shed light and bring improvement to testing practices locally, and develop
research in testing in the Republic of Cyprus, but also add a missing piece in the
mosaic of the history of testing and assessment in general.

I would like to thank Andreas Makrides for permitting access to the Examination
Guide volumes kept in his Archives, and my colleague, Anastasia Pek for her help
in finding part of the bibliography.


Alderson, J. C. (1998). Developments in Language Testing and Assessment, with
special reference to information technology. Forum for Modern Language
Studies XXXIV (2), 195–20.
Alderson, J. C. (1991). Language testing in the 1990s: How far have we come?
How much further have we to go? In S. Anivan (Ed.), Current Developments
in Language Testing (pp. 1–26). Anthology Series 25. Singapore: SEAMEO
Regional Language Centre (RELC).
Bachman, L. F. (1990). Fundamental Considerations in Language Testing.
Oxford: Oxford University Press.
Bachman, L. F. (1991). What does language testing have to offer? TESOL Quar-
terly, 25(4), 671–704.
Canale, M. & Swain, M. (1980). Theoretical bases of communicative approaches
to second language teaching and testing. Applied Linguistics 1, 1–47.

Canale, M. (1983). From communicative competence to communicative language
pedagogy. In J. C. Richards & R. W. Schmidt (eds.), Language and Communi-
cation (pp. 2–27). London: Longman.
Constantinides, K. A. (1930). English Rule in Cyprus in 1878. Lefkosia.
Council of Europe (2001). Common European Framework of Reference for Lan-
guages: Learning, teaching, assessment. Cambridge: Cambridge University
Press / Council of Europe.
Davies, A. (2003). Three heresies of language testing research. Language Testing,
20(4), 355–368.
Dendrinos, B. (2006). Mediation in Communication, Language Teaching and
Testing. Journal of Applied Linguistics, 22, 9–35.
Filippou, L. (1930). Greek Letters in Cyprus during Turkish Rule (1517–1578).
Karagiorges, A. (1986). Education Development in Cyprus, 1960–1977. Nicosia.
Lado, R. (1961). Language testing: The construction and use of foreign language
tests. London: Longman.
Lamprianou, I. (accepted, scheduled for 2012). Effects of forced policy-making
in high stakes examinations: the case of the Republic of Cyprus. Assessment
in Education: Principles, Policy & Practice, Special issue: “Impact of High
Stakes Examinations”.
Language Education Policy Profile. CYPRUS. (2003–2005). Strasbourg: Language
Policy Division, Cyprus: Ministry of Education and Culture, Republic
of Cyprus.
Lyssiotis P. (2009). About Cyprus Education. Public Information
Lyssiotis, P. Nicolaidou, A. & Georgiou M. (2010). The Republic of Cyprus, An
Overview. Republic of Cyprus: Press and Information Office.
Mallinson, W. (2010). Cyprus, A Historical Overview. Republic of Cyprus: Press
and Information Office.
Maratheftis, M. (1992). The Cyprus Educational System. Nicosia. n. p.
Michael, M. & Nicolaidou, A. (2009). About Cyprus. Republic of Cyprus: Press
and Information Office.
Miltiadou, M. (2009). The Cyprus Question, A Brief Introduction. Republic of
Cyprus: Press and Information Office. Greek version.
Miltiadou, M. (2010). The Cyprus Question, A Brief Introduction. Republic of
Cyprus: Press and Information Office.
Ministry of Education and Culture, Examination Service, Republic of Cyprus
website. Retrieved from
Ministry of Education, (1967). Annual Report 1966–1967 (with reference and
statistics for the years 1962–1966) Nicosia: Republic of Cyprus

Ministry of Education. (1978). Entrance Examinations for Tertiary Institutions in
Greece. Examination content. Nicosia, Cyprus.
Ministry of Education. (1979). Entrance Examinations for Tertiary Institutions in
Greece. Examination content. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (1988). Examinations
Guide for Higher and Tertiary Educational Institutions. For the Academic
Year 1988–1989. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (1989). Examinations
Guide June 1989 for Higher and Tertiary Educational Institutions. For the
Academic Year 1989–1990. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (1990). Examinations
Guide June 1990 for Higher and Tertiary Educational Institutions. For the
Academic Year 1990–1991. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (1991). Examinations
Guide June 1991 for Higher and Tertiary Educational Institutions. For the
Academic Year 1991–1992. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (1992). Examinations
Guide June 1992 for Higher and Tertiary Educational Institutions. For the
Academic Year 1992–1993. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (1993). Examinations
Guide June 1993 for Higher and Tertiary Educational Institutions. For the
Academic Year 1993–1994. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (1994). Examinations
Guide June 1994 for Higher and Tertiary Educational Institutions. For the
Academic Year 1994–1995. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (1995). Examinations
Guide June 1995 for Higher and Tertiary Educational Institutions. For the
Academic Year 1995–1996. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (1996). Examinations
Guide June 1996 for Higher and Tertiary Educational Institutions in Cyprus
and Greece. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (1997). Examinations
Guide 1997 for Higher and Tertiary Educational Institutions in Cyprus and
Greece. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (1998). Examinations
Guide June 1998 for Entrance in Higher and Tertiary Educational Institutions
in Cyprus and Greece.
Ministry of Education and Culture, Examinations Service. (1999). Examinations
Guide June 1999 for Entrance in Higher and Tertiary Educational Institutions
in Cyprus and Greece. Volume A. General information, tested subjects, content,
examination programme. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (2000). Examinations
Guide June 2000 for Entrance in Higher and Tertiary Educational Insti-
tutions in Cyprus and Greece. Volume A. General information, tested sub-
jects, content, examination programme. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (2001). Examinations
Guide June 2001 for Entrance in Higher and Tertiary Educational Institutions
in Cyprus and Greece. Volume A. General information, tested subjects, con-
tent, examination programme. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (2002). Examinations
Guide June 2002 for Entrance in Higher and Tertiary Educational Institutions
in Cyprus and Greece. Volume A. General information, tested subjects, con-
tent, examination programme.
Ministry of Education and Culture, Examinations Service. (2003). Examinations
Guide June 2003 for Entrance in Higher and Tertiary Educational Institutions
in Cyprus and Greece. Volume A. General information, tested subjects, con-
tent, examination programme. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (2004). Examinations
Guide June 2004 for Entrance in Higher and Tertiary Educational Institutions
in Cyprus and Greece. Volume A. General information, tested subjects, con-
tent, examination programme. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (2005). Examinations
Guide June 2005 for Entrance in Higher and Tertiary Educational Institutions
in Cyprus and Greece. Volume A. General information, tested subjects, con-
tent, examination programme. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (2006). Pancyprian
Examination Guide 2006. Volume B. Examinations content, Sample exami-
nation papers for Lyceums and Technical Schools. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (2007). Pancyprian
Examination Guide. Volume A. General information, tested subjects, content,
examination programme. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (2008). Pancyprian
Examination Guide 2008. Volume A. General information, tested subjects,
content, examination programme. Nicosia, Cyprus.
Ministry of Education and Culture, Examinations Service. (2009). Pancyprian
Examination Guide 2009. Volume A. General information, tested subjects,
content, examination programme. Nicosia, Cyprus.

Ministry of Education and Culture, Examinations Service. (2010). Pancyprian
Examination Guide 2010. Volume A. General information, tested subjects,
content, examination programme. Nicosia, Cyprus.
Ministry of Education and Culture. (2004). Country Report Cyprus. Language
Education Policy Profile. Cyprus: Ministry of Education and Culture.
Myrianthopoulos, K. K. (1946). Education in Cyprus during British Rule (1878–
1946). Lemesos.
Oller, J. W., Jr. (1979). Language Tests at School: A Pragmatic Approach. London:
Papadima Sophocleous, S. & Alexander, C. (2007). NEPTON System Overview
and Functionality. Teaching English with Technology, 7(2), 1–14. Retrieved
Papadima Sophocleous, S. (2007). New English Placement Test Online
(NEPTON): combining theories with application realities, CLESOL 2006
Proceedings Origins and Connections: Linking Theory, Research and Prac-
tice, Aotearoa, New Zealand. CD publication.
Papadima Sophocleous, S. (January 2008). A hybrid CBT and a CAT-based New
English Placement Test Online (NEPTON). CALICO Journal, 25(2), 276–304.
Pavlou, P. (1998). Testing oral proficiency in Greek as a foreign language. In Pro-
ceedings of the 12th International Symposium on Theoretical and Applied
Linguistics (pp. 132–144), Department of Theoretical and Applied Linguis-
tics, School of English, Aristotle University, Thessaloniki, Greece.
Pavlou, P. & Ioannou-Georgiou, S. (2005). The use of tests and other assessment
methods in Cyprus state primary schools EFL. In P. Pavlou & K. Smith (Eds.),
Serving TEA to Young Learners: Proceedings of the Conference on Testing
Young Learners (pp. 54–72). University of Cyprus, International Association
of Teachers of English as Foreign Language (IATEFL) and the Cyprus Teach-
ers of English Association (CyTEA). Israel: ORANIM – Academic College of
Pedagogical Institute. (2003). Pedagogical Research. Volume 36. Evaluation of
teaching French in the Secondary Schools. Cyprus: Ministry of Education
and Culture, Pedagogical Institute.
Pennycook, A. (2001). Critical applied Linguistics: A critical introduction.
London: Routledge.
Persianis, P. & Polyviou, P. (1992). The History of Education in Cyprus. Text and
Sources. Nicosia: Paedagogical Institute of Cyprus.
Persianis, P. (1998). The Education history of girls in Cyprus – A study on the
social and educational development of Cyprus. Nicosia.

Persianis, P. (2006). Comparative History of education in Cyprus. Athens: Guten-
Pylavakis, A. (1929). Lemesos and its schools. Lemesos.
Shohamy, E. (2001). The power of tests: A critical perspective of the uses of lan-
guage tests. London & New York: Longman.
Spolsky, Bernard (1976). Language Testing: Art or Science. In Gerhard Nickel
(Ed.) Proceedings of the Fourth International Congress of Applied Linguis-
tics, 3, (pp. 9–28). Stuttgart: HochschulVerlag.
Spolsky, B. (1995). Measured words: the development of objective language test-
ing. Oxford: Oxford University Press.
Spolsky, Bernard. (2008). Language assessment in historical and future perspec-
tive. In Elana Shohamy & Nancy Hornberger (Eds.), Encyclopedia of lan-
guage and education. Second ed. 7: Language testing and assessment, (pp.
445–454). New York: Springer Science.
Spyridakis, G. (1954). Education Policy of the British Government in Cyprus
(1878–1954). Lefkosia: Cyprus Ethnarchy Office.
Tsagari, D. & P. Pavlou. (2008). Assessment Literacy of EFL Teachers in Greece
and Cyprus. In Burston, J., M. Burston, E. Gabriel and P. Pavlou (Eds.), Lan-
guages for intercultural dialogue (pp. 243–256). European Parliament Office
in Cyprus and Ministry of Education and Culture of the Republic of Cyprus.
Weir, C. J. (1990). Communicative language testing. London: Prentice Hall.

Quality Control in Marking Open-Ended Listening and
Reading Test Items

Kathrin Eberharter67
Bifie68 , Austria

Doris Frötscher69
University of Innsbruck, Austria

When using open-ended listening or reading items in large-scale language tests, the main challenge
is reliable marking of the students’ responses. Because such items enable us to measure an important
aspect of our construct, the understanding of main ideas, we accepted the challenge posed by their
marking. This chapter relates what we believe to be innovative attempts to enhance the reliability
of marking open-ended items. It is situated in the Austrian context, where a standardized national
examination is being introduced at the end of secondary schooling. Both the listening and read-
ing sections include open-ended items, which potentially allow a range of acceptable responses.
We addressed quality control for marking such items in two phases of our test development cycle.
Firstly, we enhanced the central marking of field-tested items by introducing guidelines for stan-
dardization, and by using a grid for evaluating items during marking. Secondly, we established
support structures for the live examination, which is marked de-centrally by the class teachers. To
promote standardization, we offer an online helpdesk and a telephone hotline, which teachers refer
to when dealing with unanticipated student responses. All these steps have contributed to the reli-
ability of marking open-ended items in our exam.

Key words: open-ended items, marking, quality control, reliability, decision-making.

1. Introduction

This chapter is an account of our experiences and achievements in overcoming
the challenge of marking open-ended items. After outlining the particular context
of the Austrian school-leaving examination, we address the characteristics, ben-
efits and shortcomings of open-ended items as described in the testing literature,
and report on research that has so far been conducted. In the fourth section of the
chapter, we explain why open-ended items are used in our examination. The fifth
section outlines steps we have introduced to standardize the marking of field-
tested items by our item writers, and the sixth section explains how we supported
the marking of the operational version of the test by teachers across the country.
We conclude our chapter by providing recommendations for test developers using
open-ended items, and by describing future directions of this work.

68 Federal Institute for Education Research, Innovation and Development of the Austrian school

2. Context

The Austrian exit-level examination in secondary schools, called Matura, is a
high-stakes examination consisting of written and oral tests. Every student who
passes this exam is entitled to study at an Austrian university. Each candidate has
to opt for one foreign language examination, either written or oral, or both. The
most commonly taught foreign languages in Austria are English, French, Italian
and Spanish.
When in 2004 the Austrian curriculum for foreign languages was changed
to include descriptors from the Common European Framework of Reference
for Languages (CEFR), it was realized that this change in curriculum would
also require a change in terms of the exit-level examination. Starting in 2007, a
government-funded project initiated by the University of Innsbruck and school
teachers developed the first standardized Listening and Reading papers for the
2008 Matura exam, which was piloted in participating schools. Encouraged by
the positive take-up of the project, efforts to reform the Matura on a large scale
were increased. In 2010, the Austrian Parliament passed a law introducing a stan-
dardized Matura for all academic secondary schools by 2014 (Austrian Ministry
of Education, 2010). Until the examination becomes obligatory, pilot versions of
the new Matura are offered to schools to support a gradual implementation of the
Before the reform, the Matura was an entirely decentralized examination. All
decisions were taken by the individual class teachers, who produced, administered
and marked the exam, and were only supervised loosely by school authorities.
The reformed model of the written Matura for English, French, Spanish and
Italian is centrally developed by the University of Innsbruck and consists of four
paper-and-pencil sections: Reading, Language in Use, Listening and Writing.
The test is standardized both in the sense that it is developed to measure a particular
standard of language ability, as described through the CEFR levels, and in the quality
control mechanisms it undergoes. The CEFR level of the English tasks, after
usually eight years of learning the language, is B2. For the second foreign lan-
guages (French, Spanish and Italian) the tasks are developed at B1 and B2 level
because students can study these languages for either four or six years.
Since 2008, the number of secondary schools opting to pilot the reformed
exam has increased every year. In 2011, 14,500 students in 300 schools, which
represent about 80 percent of Austria’s schools, took the standardized English
Listening and Reading papers for their live examination. Each year, the test is
administered at least three times, of which one is the main administration and the
other two resits.

3. Testing receptive skills through open-ended items

Alderson, Clapham & Wall (1995, p. 57) define short-answer questions as “items
that are open-ended, where the candidates have to think up the answer for them-
selves. The answers may range from one word or a phrase to one or two sen-
tences.” According to the Dictionary of Language Testing (Davies et al., 1999)
the term open-ended is synonymous with constructed-response, and such items can take
various shapes (p. 32). In the context of this chapter, short-answer questions will
be referred to as ‘open-ended items’ and comprise cloze items and short-answer
questions with answers that are no longer than four words.
Several advantages of using open-ended items in reading and/or listening tests
have been pointed out (Alderson, 2000; Alderson, Clapham & Wall, 1995; Buck,
2001; Khalifa & Weir, 2009; McNamara, 2000). McNamara (2000, p. 30), for
instance, explains that open-ended items do not “constrain” the candidate, and
confer more responsibility for the response on the test taker, which “may be per-
ceived as in some ways more demanding and more authentic”. They also reduce
the likelihood that the candidate has provided the correct answer without having
understood the text (Alderson, 2000, p. 227; Khalifa & Weir, 2009, p. 87), for
example by guessing (McNamara, 2000, p. 30), or elimination (Alderson, 2000,
p. 227). On a more practical level, Buck (2001, p. 122) states that “it is widely
believed in educational testing that it is easier to write constructed response items
[…] [while] it is harder to write selected response items”.
The main disadvantage most frequently associated with this test format is the
increased complexity of marking candidate responses. According to Alderson et
al. (1995, p. 59), marking open-ended items is especially difficult because “there
are frequently many ways of saying the same thing, and many acceptable alter-
native answers, some of which may not have been anticipated by the item writer”.
This can result in judgments on the part of the marker, which become increasingly
difficult with the length of the required response (Hughes, 2003, p. 48). Marking
open-ended items reliably needs time, funding and qualified personnel (Alder-
son, 2000, p. 200). Buck (2001, p. 139) and Alderson (2000, p. 227) add that it is
difficult to produce qualitatively good open-ended items. In light of these short-
comings, Khalifa and Weir (2009, p. 87) conclude that it is justifiable for exami-
nation boards to prefer to use multiple-choice questions over open-ended formats.

To overcome the challenges inherent to testing receptive skills with open-
ended items, several provisions can be made. Buck (2001, p. 138) and Hughes
(2003, p. 166), for instance, suggest keeping the expected responses short as
a quality feature. This is thought to make scoring faster (Buck, 2001, p. 138).
Alderson (2000, p. 227) and Khalifa and Weir (2009, p. 87) call for items that are
so carefully worded that the range of acceptable answers is limited and foresee-
able. Hughes even suggests that the best items have only one single acceptable
answer (2003, p. 144). Furthermore, the need for thorough field testing is stressed
by several authors (Alderson, 2000, p. 227; Alderson, Clapham & Wall, 1995, p.
59; Buck, 2001, p. 139). Alderson et al. recommend the careful recording of any
unexpected answers and whether they were accepted or not during marking ses-
sions. They caution, however, that responses from field testing may not cover all
possible acceptable responses because in a larger population further responses
are likely to appear (1995, p. 107). Alderson (2000, p. 227) underlines the impor-
tance of a complete marking scheme to attain higher objectivity, but also of flex-
ibility in allowing unexpected though acceptable answers.
Buck (2001) argues that using open-ended items in accordance with the rec-
ommendations outlined above leaves test developers in the dilemma of choosing
between a narrower or broader variant. While items with short, unambiguous
responses are easy to mark, they limit the scope of what can be tested to “super-
ficial understanding of clearly stated information” (p. 140). On the other hand, an
assessment of deeper understanding through such items, which is also desirable for
test developers, makes marking more complex as it involves the difficult task of
delineating acceptable answers. Buck identifies two problems associated with this
process: “firstly, determining what constitutes a reasonable interpretation of the
text, and secondly, what constitutes a sufficient response to the question” (p. 140).
Until now, research into the marking of open-ended items has been scarce. To
our knowledge, Harding and his colleagues are the only ones who have addressed
the issue of marker behaviour and decision-making for the receptive skills (Hard-
ing & Ryan, 2009; Harding, Pill & Ryan, 2011; Harding & Pill, 2011). Harding
and Ryan’s (2009) study in the context of the Occupational English Test identified
three types of marker decisions:
1) Decisions regarding spelling
2) Decisions regarding the correctness of an overelaborate response
3) Decisions regarding the adequacy of a response
Based on their analyses, they recommend that markers work in the same room
to enable discussion, and emphasize that there should be more dialogue between
assessors and test developers during marking (2009, pp. 112–113). Gathering
feedback from markers for future task development is also suggested.

4. The use of open-ended items in the Austrian Matura

When it comes to making decisions about the test methods to be used in an exami-
nation, the characteristics of a test method and whether it is capable of measuring
the intended aspects of the construct need to be taken into consideration. As Alder-
son et al. (1995, p. 45) explain, “particular test methods will lend themselves to
testing some abilities, and not be so good at testing others”. In the Austrian context,
the team responsible for developing the Matura had to face two questions regarding
open-ended items: whether they should be used at all, and whether a narrow or a
broader variant (as contrasted by Buck, see above) should be used. Firstly, it was
decided that open-ended items should be included in the Matura because they offer
more insight into the test taker’s understanding of the text. We also
feared that the exclusive use of objective items would promote negative washback.
Secondly, we decided not to restrict ourselves only to the narrow variant (short,
unambiguous responses and simple facts). The broader variant allows for targeting
deeper understanding of the text as is called for by the test specifications based on
the CEFR levels. The global descriptors for an independent language user comprise
the understanding of “main ideas” (CEFR level B2) or “main points” (CEFR level
B1) (Council of Europe, 1996, p. 24). Thus, we accepted the difficulties entailed by
the use of open-ended items and developed strategies to cope with them.
The Austrian Matura currently includes two types of open-ended items: short
answers to questions, and gap-filling/sentence-completion items. The responses
may not be longer than four words in order to be accepted during the field-testing
phase. All Reading and Listening test booklets in a live administration usually
include one or two tasks with open-ended items.

5. Marking field-tested open-ended items

The first phase in our test development cycle where open-ended items are marked
is after field-testing. All our tasks undergo at least one field test with a population
of 100 or more students per task. The items are analyzed in terms of both classi-
cal test theory and item response theory. The resulting psychometric properties
of the items are used as a basis for selecting tasks for Standard Setting70, and
subsequently, for the live examination.
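To illustrate the kind of classical test theory statistics referred to above, the following is a minimal sketch in Python. It assumes dichotomously scored items (1 = accepted, 0 = rejected) and computes two common indices, the facility value and the point-biserial discrimination. The function names and the choice of indices are our own illustration and are not taken from the project's actual analysis procedure.

```python
from statistics import mean, pstdev


def item_facility(item_scores):
    """Proportion of candidates whose response to the item was accepted (scores are 0 or 1)."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item score and the total test score.

    Assumption: total_scores includes the item itself (uncorrected index).
    Returns None when the correlation is undefined (no variance).
    """
    p = mean(item_scores)
    sd_total = pstdev(total_scores)
    if p in (0, 1) or sd_total == 0:
        return None
    mean_correct = mean(t for s, t in zip(item_scores, total_scores) if s == 1)
    mean_incorrect = mean(t for s, t in zip(item_scores, total_scores) if s == 0)
    return (mean_correct - mean_incorrect) / sd_total * (p * (1 - p)) ** 0.5
```

An item with a very low or negative discrimination value would be a candidate for closer inspection before the task is considered for Standard Setting.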

70 Standard Setting is a procedure in which a panel of judges, by applying descriptors of the
CEFR, arrives at cut scores for a particular test, that is, the minimum score a candidate needs
to achieve to pass the test and be classified in terms of the CEFR level of the test (North,
Figueras, Takala, Van Avermaet, & Verhelst, 2009, p. 58).

As recommended by Alderson (2000, p. 199), the marking of the field tests
is performed by our item writers in centrally run sessions. These sessions usu-
ally take place twice a year and involve about 15 item writers for English, two
native speakers, and four language testing staff members from the University of
Innsbruck as moderators. They mark about 100 tasks for Listening and Reading
during a period of four to five days. While central marking entails higher demands
with regard to logistics and costs, it ensures higher reliability than decentralized
marking (Alderson, 2000, pp. 199–200). We found that central marking has the
advantage that decisions can be discussed in groups and communicated easily
between groups, which helps to ensure reliable marking.
High quality marker decisions are of utmost importance in this phase because
they have effects on the test statistics and subsequent task selection for oper-
ational versions, and are the basis for the marking scheme used in the live exami-
nation. To enhance the quality of these sessions, we implemented two measures:
marking guidelines and an item analysis grid.

5.1 Quality measure 1: marking guidelines

To ensure that our item writers follow standard principles during marking, we
drafted guidelines based on Harding and Ryan’s (2009) work. The first part
describes the procedure: item writers work in groups of three to five markers,
and if possible, the item writer of the particular task is part of that group. As sug-
gested by Alderson et al. (1995, p. 107), markers first do the task as a test taker,
check the marking scheme and discuss the group responses. Then they mark item
by item (i. e. mark the first item for all answer sheets before moving on to the next
item) to increase consistency and reliability. Each new response (i. e., a response
which is not in the original key) is discussed in the group and the consensus
decision is recorded. Should the group face problems with certain responses, it
seeks the advice of the native speaker team and language testing staff members.
The second part of the marking guidelines document outlines the principles
item writers should base the marking on. For example, alternative spelling or
wording as well as grammatical errors are acceptable as long as they do not
impede communication. Also, responses are acceptable if they indicate under-
standing of the text, even if they do not fit the sentence/gap grammatically (see
Buck, 2001, p. 73). On the other hand, responses containing both correct and
incorrect information, or responses giving ideas not mentioned in the text are to
be rejected.
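The item-by-item procedure described above can be sketched as follows. This is a hypothetical illustration in Python, not the tool used in the project: the class MarkingKey, the function mark_item and the light normalisation (lower-casing, collapsing whitespace) are our own assumptions. Judgments about spelling or wording that go beyond such trivial differences remain with the group of markers.

```python
def normalise(response):
    """Lower-case and collapse whitespace so trivially different entries match."""
    return " ".join(response.lower().split())


class MarkingKey:
    """Hypothetical marking scheme for one item: known responses and their decisions."""

    def __init__(self, accepted, rejected=()):
        self.decisions = {normalise(r): True for r in accepted}
        self.decisions.update({normalise(r): False for r in rejected})

    def lookup(self, response):
        """Return True or False for a known response, None if it is not yet in the key."""
        return self.decisions.get(normalise(response))

    def record(self, response, accept):
        """Store the group's consensus decision on a new response."""
        self.decisions[normalise(response)] = accept


def mark_item(key, responses):
    """Mark one item across all answer sheets.

    responses maps a candidate id to that candidate's response.
    Returns the marks given so far and the candidates whose responses
    are not in the key and therefore need to be discussed in the group.
    """
    marks, to_discuss = {}, []
    for candidate, response in responses.items():
        decision = key.lookup(response)
        if decision is None:
            to_discuss.append(candidate)
        else:
            marks[candidate] = int(decision)
    return marks, to_discuss
```

Once the group has reached a consensus on a new response, record() adds it to the key, so that the same response is treated identically on every later answer sheet.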
After working with the marking guidelines, we asked the item writers for feed-
back in a questionnaire. Responses (n=13) showed that the procedure was helpful
(e. g. “it was helpful to follow the sequence”; “it ‘systematized’ the correction”),
and that the principles for decision-making were found useful.

5.2 Quality measure 2: item analysis grid

The purpose of the analysis grid is to document any items that were found to
be problematic in the light of their marking after they were field-tested. This was
found to be necessary in order to a) identify items which should be eliminated,
irrespective of good statistical properties, before the task is used in the live
examination, and b) document problems which need to be addressed when revising
the task before a second field test. As Alderson et al. (1995, pp. 200–201) explain,
test developers must “explore to what extent the responses reveal processes and
outcomes which are what was intended, and to what extent they reveal differ-
ent processes and outcomes”. The grid we developed includes seven questions
for a systematic analysis of items, focusing on aspects such as unanticipated but
acceptable responses, range of responses, or problems encountered while mark-
ing. For each item they mark, item writers are asked to consider these questions:
1. How broad is the spectrum of acceptable responses?
2. Have candidates given unexpected but acceptable responses (new concepts)?
3. How time intensive was the correction?
4. Is spelling an issue?
5. Can the item be answered using 4 words?
6. Further observations?
7. Does the item need to be revised?
Although the grid introduced an extra step in the workflow at central marking, we
are confident that it is worth the effort because it provides valuable feedback on
the quality of the items and how practicable their marking is. Our experience has
shown that the questions in the grid help item writers to identify and document
problematic items, and generate suggestions for their revision.
Moreover, as questionnaire responses demonstrate, the grid offers a learning
opportunity for item writers. They stated that it “showed the weaknesses of the
items more clearly”. For example, they learned to avoid “a too broad range of pos-
sible answers”, or items which “can be answered from other parts of the text”. The
grid thus proved to be useful not only for monitoring the quality of the items that
are marked but also for the training effect it has on the item writers. In particular,
it guides item writers to reflect on their work and fosters their awareness about
critical issues with open-ended items. We therefore decided to continue using the
grid for future central marking meetings.

Evidence from our context, therefore, confirms the point made by Alderson (2000,
p. 201), who explains that good tests require field testing, analysis and revision, and that “a
very important consideration, and part of this revision process, is the potential and
frequent discrepancy between expected responses and actual responses”. The grid
and the markers’ comments are stored electronically in the same document as the
task and subsequent statistical output tables. This allows for what was found to be
necessary in the Hungarian exam reform context: “some means should be found
to record comments on tasks/items made by markers, for future reference after the
statistical analysis is complete” (Szabó, Sulyok & Alderson, 2000, p. 62).
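As an illustration of how the seven grid questions could be stored electronically alongside a task, the following Python sketch defines one record per item. The field names, the item identifiers and the JSON export are hypothetical; the chapter does not specify the format in which the grid is actually stored.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class GridEntry:
    """One row of the item analysis grid (field names are illustrative)."""
    item_id: str
    response_spectrum: str            # Q1: how broad is the spectrum of acceptable responses?
    unexpected_acceptable: bool       # Q2: unexpected but acceptable responses (new concepts)?
    marking_time: str                 # Q3: how time intensive was the correction?
    spelling_issue: bool              # Q4: is spelling an issue?
    answerable_in_four_words: bool    # Q5: can the item be answered using 4 words?
    observations: str = ""            # Q6: further observations
    needs_revision: bool = False      # Q7: does the item need to be revised?


def items_to_revise(entries):
    """Return the ids of all items the markers flagged for revision."""
    return [entry.item_id for entry in entries if entry.needs_revision]


def to_json(entries):
    """Serialise the grid so it can be kept with the task and its statistics."""
    return json.dumps([asdict(entry) for entry in entries], indent=2)
```

Keeping the grid in a structured form of this kind would make it straightforward to list all flagged items once the statistical analysis is complete.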

6. Teacher marking support

A second focus of our work on marking regards the live examination. An issue
specific to the Austrian context is the fact that the test is developed and deliv-
ered in a standardized fashion, yet the performances of the live administration
are marked de-centrally by the class teachers. This poses a potential threat to
the reliability of the marking and even the resulting test scores, because the
way in which teachers react to unanticipated answers cannot be controlled. As
Alderson argues, “teachers’ judgments can be biased and unreliable and unac-
ceptably variable for high-stakes decisions” (2000, p. 199). Even though the
marking schemes we distribute among the teachers are based on actual test
taker responses from field tests (n≥100) and include justifications written by
the item writers, experience has shown that unanticipated responses occurred
frequently. From this it was clear that further marking support would be neces-
sary to ensure that all the efforts that go into designing and implementing the
test would not be undone by having the marking carried out by the teachers.
Therefore, in an attempt to improve marking reliability, two marking support
systems were designed and implemented: a hotline and an online helpdesk. The
support structures are accessible for teachers of all four languages for which the
tests are developed and both the hotline and the online helpdesk are operated by
the test developers and native speakers. Both services are provided for all main
administrations of the test, and teachers are informed about how to access them
in the test package and the marking schemes. For resits, the hotline has proven
to be sufficient.

6.1 Hotline

The telephone hotline was introduced in 2008, the first year of administration. Its
purpose is to provide support to teachers who are unsure about the acceptability
of a student response. After the administration of the English examination, two
hotline sessions are available to teachers during which they can enquire about the
acceptability of their students’ responses. The hotline is staffed by a team of native
speakers (n=4) working on two telephone lines. These native speakers also support
the project as proofreaders and during central correction. They are, thus, familiar
with the principles of language test development. In addition to their experience
and language competence, they are provided with guidelines on what qualifies as
an acceptable answer. To prepare for the hotline, the native speakers do the tasks
themselves as test takers. During the hotline sessions, they answer the calls and take
decisions about the teacher enquiries as a team. The two telephones are in the same
room, which increases the stress level, but is necessary to make sure that a response
accepted by one team can immediately be communicated to the other team. Also, if
one team happens to be off the phone, they can support the other team with the deci-
sion-making. In addition to the native speakers, a research team member is present
at all times to record any decisions that are taken and to moderate discussions.
Right from the beginning, the teachers made such great demands on the hotline
that it immediately gave rise to certain difficulties. One challenge was that initially
many of the teachers used the hotline to address problems and questions and also
fears they had concerning the new test. This changed rapidly when in 2010 the new
examination was made mandatory. The acceptance of the test increased and fewer
teachers used the hotline for complaints or feedback. Another challenge was caused
by the legal situation in Austria, which currently requires teachers to finish the
marking within one week. This coincided with an annually rising number of test
takers in the piloting phase71. As a result, the project had to expect higher numbers
of telephone calls than could reliably be dealt with within such a short period of
time. In 2009, therefore, an online helpdesk was developed and implemented.

6.2 Online helpdesk

The online helpdesk is a database system through which the teachers communicate
unanticipated student responses to the test developers. The enquiries are answered
by the same native speaker team that also works at the telephone hotline.

71 In the first three years of the project, the number of students taking the English listening and
reading papers increased from about 2,900 to more than 14,600.

After the end of the examination, the interface, which is a web-based form, is
opened for about two days. The teachers may enter each single student response
that they are unsure of, and their email address. Then the helpdesk is closed, i. e.
teachers can post no more enquiries. In the next step, all enquiries are answered
systematically during a plenary session with the native speakers and a member
of the research team. This involves doing each of the tasks as test takers first
and then discussing all enquiries item after item. If the team finds it necessary
to explain certain decisions, comments can be added to each individual enquiry.
Finally, the accepted and not accepted responses and comments are checked for
their consistency one more time before they are entered into the database. About
four days after the examination the teachers receive an automatically generated
email from the helpdesk including the answers to their particular enquiries and,
in certain cases, also comments from the team.
The introduction of the online helpdesk has proven to be immensely helpful
to the hotline team and test developers, but was also taken up positively by the
teachers. Table 1 below shows the numbers of enquiries per skill for the three
large English administrations in spring 2011 (total population = 14,616).
It frequently happens that different teachers post the same enquiries. Therefore,
the number of enquiries is given twice in the table: firstly, as the number
of differently worded enquiries, which is adjusted to exclude multiple entries,
and secondly, as the number of all enquiries, which sums up all the enquiries
that were posted by teachers. As can be seen from the table, the number of
enquiries is directly related to the number of candidates per administration,
showing that an increase in population leads to an increase in unanticipated
student responses.

Table 1. Number of enquiries for English main administration, spring 2011.

Administration    Skill                 Differently worded   All enquiries
Administration 1  Listening (15 items)                 569             857
                  Reading (9 items)                    179             274
                  Total                                748           1,131
Administration 2  Listening (13 items)               1,759           3,223
                  Reading (9 items)                  1,216           2,314
                  Total                              2,975           5,537
Administration 3  Listening (19 items)                 141             163
                  Reading (10 items)                    58              64
                  Total                                199             227
Total (n=14,616)                                     3,922           6,895
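The two counts reported in Table 1 can be illustrated with a short Python sketch. It assumes that each enquiry is available as a pair of item identifier and response text, and that two enquiries count as the same wording if they are identical after lower-casing and collapsing whitespace. This matching rule is our own assumption for the purpose of illustration; the criterion actually used to exclude multiple entries is not specified here.

```python
from collections import Counter


def count_enquiries(enquiries):
    """Count differently worded and all enquiries.

    enquiries is a list of (item_id, response_text) pairs as posted by teachers.
    """
    counts = Counter(
        (item_id, " ".join(text.lower().split())) for item_id, text in enquiries
    )
    return {
        "differently_worded": len(counts),   # each distinct wording counted once per item
        "all": sum(counts.values()),         # every posted enquiry counted
    }
```

The same wording posted for two different items is counted twice, since each enquiry refers to a specific item.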

After each main administration of the new Matura, teachers are asked to fill in
a questionnaire regarding the examination. Data collected through this questionnaire
showed that for this administration in spring 2011, 54 % of the teachers
(n=137) reported having used the helpdesk, of whom 97 % found it useful.

6.3 Lessons learned

Experience with both support structures has shown that teachers were in need of
guidance when it came to marking the open-ended items. Many of the enquiries
and telephone conversations indicated that the teachers had not yet grasped the
construct behind the Listening and Reading tests and wanted to penalize lan-
guage mistakes like grammatical or spelling mistakes.
After the first administration with hotline and helpdesk in place, we saw that a
sequence of having the online helpdesk before the hotline was beneficial for sev-
eral reasons. Firstly, the helpdesk is not a real-time communication with external
test users and, therefore, all decisions can be discussed in depth within the team
and, later, checked for uniformity before they are sent to the teachers. This
helps to reliably deal with many marking decisions within a short period of time.
Secondly, the records of all the decisions taken at the helpdesk provide a more
elaborated basis than the original marking scheme and are, therefore, very useful
to reduce stress and enhance reliability for decision making during the hotline.
Additional tweaks to the database and its functions have also made life easier for
the test developers because decisions from the helpdesk can be exported, sorted,
searched and archived for later reference.
The procedures we have adopted are efficient in ensuring that, even though
there is no central marking of the live exam, standard decisions are available
for each individual class teacher. Also, the helpdesk and the hotline seem cost
efficient compared to what Alderson (2000, pp. 199–200) discusses as alternatives
to central marking sessions: regional groups of markers or double-marking. In
addition, our system involves an empowerment of the teachers, as it is they who
mark the school-leaving exam, albeit with support.

7. Conclusion and future directions

Our experience has shown that the marking of open-ended items for the receptive
skills deserves a closer look. To make the marking of such items more reliable in
the context of the Austrian Matura, we introduced four quality control mecha-
nisms in two phases of our test development cycle. Firstly, for the central marking
of field-tested items, we developed guidelines and procedures to enhance prin-
cipled decision-making, and a tool to evaluate items in the light of their marking.
These procedures are also believed to create a learning opportunity for the item
writers involved in the central marking. Secondly, we made central decisions
available for the de-centralized marking of the live examination through a hotline
and an online helpdesk. An important quality factor in all these steps was the
implementation of team decisions. We are confident that our measures have led
to more principled and reliable marker decisions, both in the field tests and the
live exam, and can, therefore, recommend them to other test developers faced
with the challenge of marking open-ended items. Our experience shows that with
such mechanisms in place, even large-scale tests might be able to include open-
ended items without restricting the items to asking for short, unambiguous facts
instead of main points or main ideas. In our case, the benefits of including this
item format outweigh the cost needed to achieve reliable marking.
We are aware of the fact that to secure an ongoing benefit from the measures
introduced, two issues need to be considered. Firstly, we need to ensure that
practicality of marking, for example by trying to limit the number of enquiries at
the helpdesk/hotline, does not become overly important. While limiting enquiries
makes marking easier, it might lead to a restriction of construct representation. Secondly,
when it comes to the operational version of the examination as of 2014, it needs to
be ensured, in the interest of fairness to the test takers, that the use of the hotline
and the helpdesk and adherence to their decisions are made obligatory, and that quality
control of teacher marking (e. g., by checking a random sample) is put into place.
At present, this is not yet the case.
The limitations of this chapter are inherent to its character. While this is an
account of what we believe to be good practice in marking open-ended items, it
is not an empirical study proving these efforts to be successful. Research into the
effects of the measures introduced would need to verify the increased reliability
we claim they promote.
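One way in which such research could quantify marking reliability is to have a random sample of scripts marked by a second marker and to compare the two sets of marks. The Python sketch below computes simple percentage agreement and Cohen's kappa for this purpose. It is offered only as an illustration of a widely used agreement statistic under the assumption of dichotomous marks; it is not a procedure that has been applied in the project.

```python
def percent_agreement(marks_a, marks_b):
    """Proportion of responses on which two markers gave the same mark."""
    agreed = sum(a == b for a, b in zip(marks_a, marks_b))
    return agreed / len(marks_a)


def cohens_kappa(marks_a, marks_b):
    """Cohen's kappa: agreement between two markers corrected for chance.

    Returns None when chance agreement is 1 (both markers used a single
    identical category throughout), in which case kappa is undefined.
    """
    n = len(marks_a)
    observed = percent_agreement(marks_a, marks_b)
    categories = set(marks_a) | set(marks_b)
    expected = sum(
        (marks_a.count(c) / n) * (marks_b.count(c) / n) for c in categories
    )
    if expected == 1:
        return None
    return (observed - expected) / (1 - expected)
```

Both functions expect two lists of equal length in which position i holds the two markers' decisions on the same response.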
In the future, we will continue this work by further developing the helpdesk
system as well as the marking guidelines and the item analysis grid. We are also
planning to evaluate the effect of the new item analysis grid on task develop-
ment. To bring the decision-making during the marking of the field tests and the
helpdesk and hotline even more into line, we plan to include our native speaker
teams more systematically during the marking of field tests. Moreover, we intend
to design a training tool to be used in a pre-marking hot housing session with the
aim of familiarizing markers with the guidelines and giving them feedback on
their decisions.


References

Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.
Alderson, J. C., Clapham, C. & Wall, D. (1995). Language test construction and
evaluation. Cambridge: Cambridge University Press.
Austrian Ministry of Education. (2010). Bundesgesetzblatt für die Republik
Österreich. 52. Bundesgesetz: Änderung des Schulunterrichtsgesetzes. (NR:
GP XXIV RV 714 AB 763 S. 70. BR: AB 8342 S. 786.) Retrieved from http://
Buck, G. (2001). Assessing Listening. Cambridge: Cambridge University Press.
Council of Europe. (1996). Common European framework of reference for lan-
guages: Learning, teaching, assessment. Strasbourg, France: Author.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999).
Dictionary of Language Testing. Cambridge: Cambridge University Press.
Harding, L. & Ryan, K. (2009). Decision-making in marking open-ended listen-
ing test items: The case of the OET. Spaan Fellow Working Papers in Second
or Foreign Language Assessment, 7, 99–114.
Harding, L., Pill, J. & Ryan, K. (2011). Assessor decision making while marking a
note-taking listening test: The case of the OET. Language Assessment Quarterly,
8(2), 108–126.
Harding, L. & Pill, J. (2011, May). The virtuous assessor. Markers’ decisions
as ethical decisions. Paper presented at the 8th Annual EALTA Conference,
Siena, Italy. Presentation retrieved from
Hughes, A. (2003). Testing for Language Teachers. Second Edition. Cambridge:
Cambridge University Press.
Khalifa, H. & Weir, C. J. (2009). Examining reading. Research and practice in
assessing second language reading. Cambridge: Cambridge University Press.
McNamara, T. (2000). Language testing. Oxford: Oxford University Press.
North, B., Figueras, N., Takala, S., Van Avermaet, P., & Verhelst, N. (2009).
Relating language examinations to the common European framework of
reference for languages: learning, teaching, assessment (CEFR). A Manual.
Strasbourg: Council of Europe.
Szabó, K., Sulyok, A. & Alderson, J. C. (2000). Chapter 6. Marking the Pilot
Tests. In J. C. Alderson, E. Nagy & E. Öveges (Eds.), English Language Education
in Hungary. Part II. Examining Hungarian Learners’ Achievements
in English (pp. 60–74). Budapest: British Council Hungary.

Strategies for Eliciting Language
in Examination Conditions

Mark Griffiths
Trinity College London

There is a long tradition in oral language examinations of employing structured interviews with
pre-scripted, fixed examiner contributions. This chapter examines an alternative format, the semi-
structured interview, in which examiners adapt their test plans around candidates’ input, in order
to elicit target language and communicative skills related to a chosen level. The Trinity College
London Graded Examinations in Spoken English (GESE) are the subject of study. Formally calibrated
to the Common European Framework of Reference (Council of Europe, 2001), GESE examinations
are set out in 12 levels, which Trinity refers to as ‘Grades’ (e.g. ‘Grade 1’ to ‘Grade 12’).
By analysing video recordings of 32 spoken examinations from the 12 GESE levels, this research
identifies a range of examiner techniques that have evolved as a result of the examiners’ training and
experience in adapting test plans to elicit target language skills. 14 examples of different elicitation
techniques were identified from 374 minutes of recorded material. Of particular interest is the find-
ing that contrary to what we might expect from oral proficiency testing interviews, direct questions
are only used part of the time, with examiners adapting and developing prompts, creating elicitation
strategies that represent a range of conversation patterns and roles.

Key words: semi-structured, oral, examination, elicitation, prompt.

1. Introduction and context

The context for this chapter is the oral language examination for non-native
speakers of English. Such assessments usually take the form of an interview. Van
Lier (1989) asserted that the oral proficiency interview comes as close as possible
to being an acceptable surrogate for real-life oral assessment (ibid, p. 494). In the
early 21st century, we live in a time when academic and commercial demand has
never been higher for oral proficiency assessments that offer insights into can-
didates’ communicative strengths. However, whilst a range of oral assessments
has been developed across the commercial and academic fields, not all assess-
ment interviews are structured in the same way, in particular where examiner role
and input are concerned. Many of these oral language examinations are designed
around highly-structured interviews in which a series of prompts is delivered to
the candidate(s) by the examiner. In such examinations, the examiner prompts are
typically direct questions or instructions to the candidate(s) to perform a verbal
task. All questions, instructions and examiner interaction are decided upon prior
to the examiner and candidate(s) meeting and before the oral communication
takes place. There also now exist oral examinations based solely on pre-recorded examiner
prompts, which are delivered by computer, without the examiner being present
or participating in the conversation that is being generated. Proponents of these
pre-written assessment frameworks are likely to point to the benefits of employ-
ing this method of examination and interview structure, which may include:
economies of scale in terms of test production; potential ease of examination
administration; tightly-controlled production-line consistency and uniformity of
examiner input and examiner response, regardless of candidate contributions.
This uniformity of examiner talk is often identified as a major contribution to
an assessment’s reliability, based on each candidate being exposed to the same
stimulus and response.
Yet the formalised, pre-scripted assessment framework may also be problem-
atic. For example, one may liken the pre-scripted examination to that of an inter-
view by autocue, with the examiner’s contributions limited to reading the words
of a non-present speaker rather than being a full conversational interactant. In
turn, the candidate is assigned a passive/reactive role in the interaction, the con-
tent and quantity of their input being measured and scaled by the pre-written
‘autocue’. The assessment may be standardised in terms of examiner input and
response, but this rigid standardisation may also limit the scope of what the can-
didate is exposed to linguistically and what he/she is permitted to produce in the
target language, with inevitable consequences for what linguistic data is avail-
able for assessment. Additionally, one might reasonably expect the purpose of an
oral proficiency examination to include assessment of a speaker’s communicative
competences. If we take Canale and Swain’s (1980) helpful four-part model of
communicative competences – linguistic, sociolinguistic, discourse and strategic
competence – we can reasonably expect a candidate’s linguistic competence to
be assessed in a pre-scripted exam, as a candidate should demonstrate control
of lexis and grammatical structures. However, it is not immediately easy to see
how a candidate’s sociolinguistic, discourse and strategic competences can be
viewed or assessed if an autocue examination does not build in locally-assembled
unplannedness or unpredictability of sequence and outcome (Goffman 1981),
or if the candidates do not receive feedback tailored to their contributions, or
have limited or no opportunity to shape the development of the discourse. This
again raises questions regarding what exactly the assessment process is telling us
about the candidate’s conversation skills. Morrow (1979) suggested that reliabil-
ity should be ‘subordinate to face validity’ and given the weaknesses identified
above, one may ask whether the pre-scripted examination leaves itself open to
accusations of prioritising reliability over validity and whether pre-scripted oral
assessments are valid assessments of real-world conversation skills.
Some examinations attempt to avoid potential threats to construct validity by
including paired-speaking tasks in assessments, in which the candidates may have
some scope for generating a short exchange. However, their candidate interlocu-
tor is a fellow test-taker, usually of the same approximate learner level and often
of the same L1, meaning that where the range of communicative competences
might be tested, it is tested in relation to speakers who may have similar
learner and error profiles rather than with a native speaker or non-native speakers
from a different language or culture, again raising doubts regarding what the candidates’
English output is evidencing. Also, the exchanges in paired speaker tasks
are usually limited in turns available to the candidates, with the content, aims and
timing of the task being controlled by the examiner. This has implications for the
candidates’ opportunity to display discourse and strategic competence.
Exactly what a pre-scripted assessment tells teachers, parents, employers and
the candidates and whether this represents an assessment of real-world linguistic
skills is not a matter for this chapter. Likewise it is not the concern of this chapter
to offer a detailed critique of the strengths and weaknesses of formal, pre-scripted
tests. We can only surmise that oral proficiency assessments do not all have the
same interview structure or test purpose, which sets the scene for exploring an
alternative exam format. This alternative is the focus of the following section and
of the research in this chapter.

2. Focus of the study

This section and the rest of the chapter focus on the semi-structured interview.
The purpose of this examination format remains the same as the formal, struc-
tured interview, aiming as it does at eliciting language and communication
skills of a specific syllabus. However, the conversation format allows for the
interviewer to discuss groupings of topics and questions with candidates in a
range of ways. The example semi-structured examination under investigation
is Trinity College London’s Graded Examinations in Spoken English (GESE).
The examination format is a 1–1 unscripted conversation with an examiner,
within which the candidate demonstrates specific language items and his/her
mastery of the target language and skills of the chosen level as set out in the
examination syllabus. The GESE examinations are set out in 12 levels, which
Trinity refers to as ‘Grades’ (e.g. Grade 1 to Grade 12), all of which have been
formally mapped to the Common European Framework of Reference (Council
of Europe, 2001; Papageorgiou 2007). Grade 1 is calibrated at CEFR pre-A1,
with Grade 2 at CEFR A1 and the scale continuing to Grade 12, which is calibrated
to C2. The Grade 1 level of the examination is 5 minutes long with the
examination increasing in length to 25 minutes at Grades 10–12 (C1–C2). To
briefly summarise what is expected of the candidate: a Conversation Phase is
included throughout the 12 levels of the examination. In lower levels it
is centred on the exchange of basic personal facts and information; it extends
into a discussion of 6 listed conversation subject areas by A2.2 on the CEFR and
to stating preferences and opinions by B1; and from B2 upwards the candidate
discusses their relationship to, experience of and opinions on the conversation
subject areas listed in the syllabus. These conversation subject areas are intended
at lower levels to allow candidates to talk about concrete every-day themes,
and as candidates progress towards C2 on the CEFR, the subject areas generate
more discursive conversations, with candidates performing a range of language
functions, including reporting and contrasting ideas and opinions and developing,
justifying and defending points of view. Additional tasks are introduced
at various levels, with the introduction of the Topic Phase at A2.2 in which can-
didates are expected to prepare a topic to discuss. At B2, an Interactive Phase
is introduced, requiring the candidate to take more responsibility for initiating,
maintaining and driving the interaction, with the candidate finding out further
information and making comments in response to an examiner prompt. At C1,
the candidate’s chosen topic becomes the focus of both a Topic Presentation
Phase (a formal monologue presentation on a discursive theme with appropriate
counter-points and conclusions) and a subsequent Topic Discussion Phase, in
which the examiner and candidate discuss the topic and the candidate responds
to questions and challenges from the examiner.
With regard to interaction, the GESE examination suite is based on enabling
the candidate to ‘participate in a genuine two-way exchange within the lin-
guistic limits set by the syllabus’ (Trinity College London 2009, p.11). The
unscripted nature of the examination means that either participant – the can-
didate or the examiner – is able to contribute at any point, rather than turns
being fixed by a pre-scripted examination format. The bidirectionality of the
interaction is underlined by the expectation that from A1 upwards, candidates
will ask the examiner questions. This starts at A1 with a request for basic infor-
mation using ‘Where, how, have you got, etc.’ and an increasing emphasis is
placed on the pro-active role of the candidate in the examination throughout
the suite. In these respects, the examination is markedly different in structure
from the formal, pre-scripted assessments, with the semi-structured interview
believed to provide ample opportunity to generate and promote natural, multi-
turn conversations. (‘Conversation’ is a word which is frequently used through-
out the GESE syllabus.)

Turning to the role of the examiner, the semi-structured nature of the GESE
examination means that the examiner is expected to act as a facilitator rather
than an interrogator, aiming to enable a genuine two-way exchange, with either
the candidate or the examiner able to contribute at any point. Given this expec-
tation, restricting examiner contributions to the reading aloud of a pre-written
script regardless of candidate contribution is seen as contrary to the aims of the
examination. Instead, the examiner is expected to use a test plan – a selection
of communication strategies and prompts specifically designed to elicit the
range of target (grammatical, lexical, functional) language items and commu-
nicative skills of a particular level. Examiners have delegated responsibility to
utilise the elicitation techniques in their test plans in flexible sequence as the
examination progresses, with the conversation being locally assembled. Which
items examiners use, and in which sequence, depends on the candidate’s
contributions and on which target language and skills the candidate has not
yet demonstrated.
An additional aim of the semi-structured format utilised in this exami-
nation is that of giving candidates the opportunity to not only demonstrate a
range of language items specific to the level being assessed but also to demon-
strate communicative competences and language skills that reflect real-world
exchanges outside the examination room. All four of the Canale and Swain
(1980) communicative competences have been mapped onto the GESE syl-
labus, with grammatical competence being required from Grade 1; discourse
and strategic competences evident from Grade 3 (A2); and sociolinguistic
competence evident from Grade 4 (B1) (Wall & Taylor 2011; O’Sullivan,
Taylor & Wall 2011).
The absence of examination pre-scripting may lead some to question the reli-
ability of the examination. Trinity College London provides the examiners with
specific training on the design of prompts to elicit specific language items/key
skills. The guidelines given by Trinity on the creation of prompts (confidential
internal documents) strictly direct examiners to locate their prompts within the
language of the level being examined, the full range of which must be the focus
of the examination. No language from levels above the level being examined may
be used. Trinity College London also states publicly that it recruits only experienced,
well-qualified teachers, with high standards of initial and on-going examiner
training. It also carries out live monitoring of examining and uses samples of
recorded examinations for monitoring and feedback in its programme of stan-
dardisation and training.

3. Research aims

To date, little has been published about the range of examiner prompts and elic-
itation strategies that arise from this semi-structured interviewing format and
there are many lines of enquiry that one could pursue relating to examiner and
candidate behaviour. The focus of this study is the use of examiner prompts
within the semi-structured interview. Given that examiners have been delegated
responsibility for the design and implementation of their test plans, the research
described here explores the language-eliciting prompts that examiners have
devised and how these prompts are introduced into the examination.
How do the examiners use the space and flexibility provided by the semi-struc-
tured format to best elicit the target language and how does examiner talk influ-
ence candidate responses in oral proficiency interview tests (Kasper, 2004)? The
study is an investigation into real examples of semi-structured interviewing from
all levels of the GESE exam suite and addresses the following research question:
– Which prompts do examiners develop and draw on to elicit appropriate candi-
date contributions in the forms of level-specific language items and communi-
cative skills?
Given the potential for investigation in this field, it is acknowledged that other
aspects of the semi-structured interview such as inter- or intra-examiner variation,
inter-rater reliability, or the possible impact of interactional features on final
examination scores could all be valuable lines of enquiry, along with questions
regarding reliability or validity of the test format. However, they are beyond the
scope of this study. There is also no concern here with Trinity’s choice of test
specifications, purpose, or format.

4. Methodology

The analysis was carried out on a sample of 32 video-recorded oral examinations
taken from the Trinity exam archive. The examinations varied in length according
to the Trinity level chosen. Only whole examinations were used, rather than
excerpts. All the recordings were filmed between October 2009 and February
2010 and were conducted using the 2010 Trinity GESE Syllabus (which at the
time of writing is still current). The number of examinations in each grade, their
relationship to the Common European Framework of Reference (CEFR), the
length of the examinations, number of samples used and total sample size in minutes
per Trinity Grade (i.e. level) are given in Table 1:

Table 1. Details of video samples used

Grade    CEFR    Length       Number of       Total number
                 (minutes)    samples used    of minutes

12       C2      25           2               50
11       C1      25           1               25
10       C1      25           2               50
9        B2.3    15           1               15
8        B2.2    15           2               30
7        B2.1    15           1               15
6        B1.2    10           5               50
5        B1.1    10           4               40
4        A2.2    10           4               40
3        A2.1    7            3               21
2        A1      6            3               18
1        A0      5            4               20

TOTAL                         32              374
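
The totals reported in Table 1 follow from multiplying each grade's examination length by the number of samples used for that grade. As a minimal sketch only (Python is used purely for illustration, and the names TABLE_1 and table_totals are hypothetical and not part of the study), the arithmetic can be re-derived as follows:

```python
# Rows copied from Table 1: (grade, CEFR level, length in minutes, samples used).
# The variable and function names are illustrative assumptions, not the author's.
TABLE_1 = [
    (12, "C2", 25, 2),
    (11, "C1", 25, 1),
    (10, "C1", 25, 2),
    (9, "B2.3", 15, 1),
    (8, "B2.2", 15, 2),
    (7, "B2.1", 15, 1),
    (6, "B1.2", 10, 5),
    (5, "B1.1", 10, 4),
    (4, "A2.2", 10, 4),
    (3, "A2.1", 7, 3),
    (2, "A1", 6, 3),
    (1, "A0", 5, 4),
]


def table_totals(rows):
    """Return (total samples, total minutes) for the given table rows."""
    total_samples = sum(samples for _, _, _, samples in rows)
    total_minutes = sum(length * samples for _, _, length, samples in rows)
    return total_samples, total_minutes


print(table_totals(TABLE_1))  # (32, 374)
```

Run as written, the sketch reproduces the 32 examinations and 374 minutes given in the TOTAL row.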

A total of 7 examiners appeared across the 32 examinations. It was not possible
to use an equal number of samples from each examiner due to the nature of filming
and other practicalities. Rather than round the number of samples for each
examiner down to the lowest equivalent number, it was decided that the greater
the range of samples analysed, the greater the likelihood of capturing a range
of examiner prompts. All examiners in the sample data have been working for
Trinity for between 5 and 10 years in the role of International ESOL examin-
ers and can be regarded as highly experienced in their roles. The candidates
varied in age from 7 years old for the Grade 1 exam, up to 33 for one of the
candidates taking Grade 12. The numbers of samples from each examiner are
given in Table 2:

Table 2. Numbers of samples from each examiner

Examiner    Number of       Total time
            examinations    in minutes

1           9               110
2           9               92
3           6               83
4           4               26
5           3               23
6           1               25
7           1               10

The focus of the analysis was the examiners’ contributions and the candidates’
responses as part of the duologue. All of the recordings were listened to by
the researcher and examiner prompts were noted. These included prompts that
appeared to come directly from the examiner test plans (which were analysed
separately) and prompts that seemed to have been adapted from the test plan
as the examination was proceeding, to extend the candidate’s contributions or
elicit specific language items or communicative skills. It was anticipated that as
the focus of each exam was different, depending on the level being taken, there
would be variation in the types of prompts elicited.
Phonetic and other features such as timing and pausing were not noted
here as it was not felt that they would add anything further to our understanding
of the choice of examiner prompts.
The examples of examiner prompts were analysed manually and then catego-
rised according to prompt type. The labels chosen for each category reflected
the strategy apparently being employed in the use of the prompt. Finally, sample
tokens of each examiner strategy were tabulated to illustrate the range and types
of examiner prompt used. These results are presented in the next section.

5. Results and discussion

To begin with, the reader’s attention is drawn to the fact that the data should only
be described and evaluated in the context of a language examination, in which a
live examiner (rather than a computer) has the aim of eliciting specific language
items and communication skills from a candidate in a one-to-one context, and in
a limited time frame. Comparisons should not be drawn with classroom teacher
talk, which serves a wide range of other functions (explanatory,
checking understanding, setting up activities, giving feedback, etc.). Examiner
talk should be seen as a distinctive type of discourse.
Returning to the data, the 207 examples of examiner prompts emerging from
the data were reduced to 14 types distinguished by their structure and apparent
examiner aim. The 14 types did not seem to fall into any kind of level-related
pattern. Because the study was not able to access identical sample lengths from
each of the 12 levels, and because the ratio of example prompts to prompt types
was relatively low, frequencies of occurrence are not provided here: little reliable
statistical inference could be drawn from such data. However, the 14 different
example types drawn from across the available data are listed below, each with
the candidate’s next response; it is the range of example types and their
responses that is the focus here.
Example prompt types in the data:
(1) Examiner asks a direct question using incorrect information
E: Are they red socks?
C: No, they’re blue socks (Grade 1 – A0)
(2) Examiner models the answer and then prompts for full sentence response
E: My birthday’s in July. And yours?
C: My birthday’s in July too (Grade 3 – A1)
(3) Examiner offers a basic statement
E: I finish work at 5.30.
C: I finish work at 3.30.
(4) Examiner uses a direct question re: candidate ability
E: What jobs can you see?
C: He’s a policeman. She’s a nurse. She’s a doctor. (Grade 3 – A2.1)
(5) Examiner uses a direct question to encourage comparisons
E: What’s the difference between a friend and a best friend?
C: A best friend is more important than a friend simple. (Grade 4 – A2.2)
(6) Examiner presents an open instruction
E: Tell me about your interest in cars
C: I’ve been interested in cars maybe since I was 14–15 years ago. Yes, my
father used to teach me to drive. Yes I used to sit in him and used to hold the
steering wheel. (Grade 5 – B1.1)

(7) Examiner elicits a description, avoiding a closed question
E: How do you describe your style?
C: I don’t have a style. I clothes, I like it and I buy it. But I’m not looking for it
so much. (Grade 6 – B1.2)
(8) Examiner uses a direct question to elicit an evaluation, avoiding a closed question
E: How important is money for you?
C: Not very. I need money to live, but I don’t need to be a rich. (Grade 6 – B1.2)
(9) Examiner presents a simple statement of fact
E: Some people do shopping on the Internet.
C: I surf the net, use MSN, Facebook. But I don’t buy anything. And you?
(Grade 6 – B1.2)
(10) Examiner questions the previous statement, encouraging expansion
E: Really?
C: Yes, and in the future, they will all be touch screen phones. (Grade 7 – B2.1)
(11) Examiner expresses an opinion as a statement, but without a question
E: I sometimes wish I’d been born later.
C: Me too. I think it would be great to be born a few years ago. For example I
would be able to grow up with modern technology. (Grade 9 – B2.3)
(12) Examiner appears to feign lack of knowledge, encouraging candidate to
bridge the knowledge gap
E: I don’t know much about racism.
C: Well, it’s an issue that has been viewed quite negative by people in the UK
and Greece with immigrants in both countries experiencing many prob-
lems. (Grade 12 – C2)
(13) Examiner appears to deliberately mis-paraphrase candidate’s statement
E: So you’re basically saying that immigrants work in conditions and for sala-
ries that are damaging to the economy.
C: No, that’s not what I’m saying. I’m saying that people might think that
immigrants are not healthy for a country’s economy. (Grade 12 – C2)
(14) Examiner stays silent. The candidate must continue the conversation.
C: Nevertheless, it’s not their decision to make. People should make their own
choice about voting. What do you think? (Grade 12 – C2)
A number of issues emerge from this categorisation exercise. The first has to do
with a notion that many would take for granted: that candidate language in an
oral examination is always a response to examiner questions. It is clear that this
is not the case in this examination, where trained and experienced examiners are
allowed to vary their elicitation strategies to elicit target language and communi-
cative skills. For example, whilst questions appear in Examples 1, 4, 5, 7 and 8,
non-question form elicitation strategies exist in Examples 2, 3, 6, 9, 10, 11, 12, 13
and 14. While the data do not represent statistical tendencies or frequency of use,
it is interesting to note that professional elicitation strategies in a semi-structured
interview consist of more than simply a list of questions or instructions. This
range of elicitation strategies is discussed next.
The first elicitation strategy to discuss is that where questions were used, they
were always direct questions. Indirect questions were absent from the whole data
set – not just from the 14 examples listed above. We know that in conversations
between native speakers of English, indirect questions are used commonly as a
pragmalinguistic device to attend to listeners’ face needs and convey appropri-
ate levels of politeness. Yet there were no instances of indirect question requests
such as ‘Can you tell me…’ or ‘Could you explain…’, even at the higher levels. The
reason for this is unclear, but one possible explanation may be that the examiner
in effect modifies his/her expected level of candidate sociopragmatic comprehen-
sion (Ross 1995). That is, the examiner may make no assumptions that candidates
will understand the sociopragmatic meaning of the indirect question form and
they may even suspect that it might lead to candidate confusion. The examiner
performs downward linguistic accommodation, avoiding complex sentences and
politeness forms to facilitate listener comprehension of form over pragmatic
intent. Whether this is an overt strategy by the examination board or overspill of
the examiners’ own teaching experience into the examination room would be a
matter for future investigation.
Alternatively, the absence of indirect question forms may be a function of the
examiner using direct questions for particular effect: Example 6, ‘Tell me about…’
rather than ‘Can you tell me…’, may be seen as a more powerful and efficient method for
eliciting the required target language and maximising candidate talking time in
a limited time-frame assessment. The use of such grammatical forms to com-
municate force in an L2 setting has also been observed by Kasper (2004). What
is interesting about the use of direct questions is the range of communicative
purposes: Example 1 requires the candidate to correct the examiner; Example
4 requires a list of vocabulary; Example 5 encourages comparisons; Example 7
elicits a description and Example 8 elicits an evaluation. The examiners’ prompts
also require a response from the candidate that specifically utilises the language
of the level. For example, in Grade 1, candidates are expected to know colours,
items of clothing, and adjective/noun word order in English. Likewise, Example 5
is taken from Grade 4, in which the candidates are expected to make comparisons
and use comparative adjectives and adverbs. We see from such examples that direct
questions are used by examiners in combination with tasks related to specific
target language items, structures and functions. We see also that although the
questions themselves may appear relatively closed, the candidate is free to choose
how they wish to answer: they can often personalise the response and they can
choose which aspect of the information they wish to communicate.
A second elicitation strategy present in the data is that of encouraging elabo-
rations through direct requests for expansion. In Example 6, with the candidate
having already mentioned their interest in cars, the examiner responds with a
‘Tell me about…’ prompt to encourage expansion. Example 10 sees the examiner
use the exclamation ‘Really?’ to express interest or surprise. Of note is that the
examiner’s exclamation is not followed by a further examiner remark, indicating
that the examiner has given the floor to the candidate to elaborate.
A third elicitation strategy, seen in Examples 3, 9, 11 and 12, is for the examiner
to make various statements to encourage elaboration. Nakatsuhara (2007) has
also observed how statements are effectively used by interviewers as implicit
demands for more information, often resulting in a candidate elaborating their
response. This strategy appears in various forms. Example 3 is a basic short
statement, as is Example 9, although there the examiner invites contrast by making
a statement about only some people. Example 11 embeds a personal opinion or
sentiment into the statement. In Example 12, the examiner appears to feign lack of
knowledge, encouraging the candidate to bridge the knowledge gap. It appears
that examiner statements are pragmatically understood as an invitation to elabo-
rate or expand and take the floor, in particular when the examiner displays a lack
of knowledge or incorrect knowledge of a subject.
A fourth example of an elicitation strategy, seen in Example 13, shows the
invitation to elaborate taking a sociopragmatically different form, with the exam-
iner using reformulation (the re-phrasing of what has been said). Heritage (1985,
p. 115) compares L2 reformulations with those used in news interviews, point-
ing out that reformulations are understood as alternatives to going on to a next
question and are routinely treated as invitations to elaborate further. In Example
13, the examiner follows up a previous statement with a reformulation of the
candidate’s words, ‘So you’re basically saying that immigrants work in conditions
and for salaries that are damaging to the economy.’ This seems to be a deliberate
misrepresentation of the candidate’s words or opinions. This is immediately corrected
by the candidate, who then elaborates, which was clearly the examiner’s
intention.
A fifth and final elicitation strategy and perhaps the most sociopragmatically
extreme version of elaboration is the use in Example 14 of silence. Jaworski
(1997) points out that we need to go beyond the understanding of silence as an
‘absence of sound’. LoCastro (1995) describes silence as a linguistic resource to
signal pragmatic inferences in interactional contexts. In the context of language
examining, it is clear that the examiner is choosing silence as an intentional strat-
egy for cueing expansion from the candidate, the pragmatic inference being that
the candidate has not satisfied what Grice (1975) would identify as the maxim of
quantity. Interestingly, it was noticeable across all the data that silence was used
by examiners in examinations at B2 and above. This may be related to the CEFR
B2 descriptors which characterise a B2 candidate as being able to ‘initiate dis-
course, take his/her turn when appropriate.’ In Example 14, we see that this was
effective in encouraging the candidate to continue the turn and expand on his/her
point, inviting comment from the examiner.

6. Concluding remarks and future research

It seems appropriate at this point to do some redefining. Throughout this study I
have referred to examiner prompts, which have been shown to be broadly divided
into direct questions, statements and indirect cues for elaboration. However,
given that silence is also in evidence in the data it would seem more accurate
to summarise these as elicitation techniques, of which prompts are one part and
silence is another73. The examples presented illustrate a range of examiner tech-
niques that were found in the data. We see examples of direct questions, direct
instructions, examiner statements, examiner false statements for correction,
encouraging comparisons, eliciting descriptions, eliciting evaluations, feigning
lack of knowledge, encouraging contradiction, encouraging expansion, deliber-
ate mis-paraphrasing in reformulations and silence.
Based on this evidence, there would appear to be a case for further investigat-
ing how semi-structured examinations may be employed to generate candidate
language and whether they afford the candidate an enhanced opportunity to dis-
play a range of communicative skills and core linguistic competences. It would
appear from the type of interaction present in the data that a semi-structured
exam format that utilises a range of elicitation techniques may widen the aperture
through which the examiner views the range of the candidates’ communication
skills, enhancing the validity of an assessment of communication skills.
It is clear from the data presented here that the semi-structured interview
used in this suite of examinations has encouraged a range of examiner elicitation
techniques to be introduced into the assessment room. The data do not give us
insight into any relationship between examiner experience and range of prompt
type and we cannot give any account of the training given to the examiners by the
examinations board. Yet it is hard to resist the conclusion that this range of
examiner techniques represents a broader and arguably more authentic range of oral
examiner input than structured interviews and this could be the subject of further
research. A further potential research strand could be to investigate candidates’
and examiners’ perceptions and attitudes towards semi-structured examinations.

73 In other examination contexts, we might choose to include body language as a further
elicitation technique.

It would also be useful to explore whether semi-structured examinations
encourage candidates to maximise their contributions. Rather than restricting the
candidate to a passive, reactive role, following a set of instructions, answering
questions only or making statements which are not responded to, does the semi-
structured format encourage the candidate to participate more fully in exchanges,
demonstrate more conversational control and even take more responsibility for
driving the conversation? In addition, it would be interesting to observe whether
certain prompt types are more commonly used in particular levels and whether
there is any variation in the frequency of use of certain elicitation techniques
within or between levels. We also do not have any clear data regarding how effective
different elicitation techniques may be at eliciting specific language items or
communicative skills. If their effectiveness does vary, what are the implications
for providing examiners with an improved vantage point from which to view
candidates’ language skills?
Future research could also compare the effectiveness of the examiner techniques
listed above with structured, pre-scripted interviews where the examiner
is unable to deviate from the list of questions and instructions. Are examination
formats that prescribe test items beforehand, and do not allow examiners flexibility,
testing with the same purpose as examinations in a semi-structured format?
Can the same range of language and communicative skills be elicited using a
prescribed question/instruction exam format as can be by using a semi-struc-
tured format? Are the elicitation techniques of structured and semi-structured
examinations natural or realistic? Do the elicitation techniques in both formats
ever occur in natural conversation? Do these elicitation techniques represent the
types of interactions that candidates will encounter in the real world and are these
qualities in an examination that are worth striving for? If you introduce the data
to a group of researchers, rather than the single researcher in this study, will
they identify a wider range of elicitation techniques than those identified in this
research? It would be interesting to hear how other examination boards respond
to the data here regarding the effective use of test prompts and what this tells us
about styles of oral language examining.
It would also be useful to investigate the pragmatics that underpin semi-structured
interviews when the format of the scripted interview is removed and conversation
is more fluid. What strategies do candidates use? What happens when
communication breaks down? Do candidates rely on a range of strategies that
differ from those used in a scripted and structured language assessment? Which
examiner prompts and elicitation techniques are favoured by examiners using a
semi-structured format or seen as more effective at each grade? It would be useful
also to collect qualitative data from examiners regarding when and why they use
the range of strategies, techniques and prompts at their disposal, and what the
process is in finalising the prompts they include in their test plans. Do they have
preferred techniques that they see as particularly effective? How can we ensure
reliability in a non-prescribed question format? Does a semi-structured interview
support or hinder a teacher’s communicative classroom aims?
In a small-scale study such as this it is clear that the data set is not sufficient
to answer many of these questions. However, as the first study of its type in this
area, one can hope that enough light has appeared through the cracks to illumi-
nate examiner practice and the types of techniques used in such semi-structured
interviews. It would be a positive step if this study stimulated further research
interest into an area which has to date been under-investigated. It appears that
there are multiple elicitation routes and techniques that can be utilised during
an oral interview to elicit target language and communicative skills, and there
is much potential for developing new techniques and informing new practices.


References

Canale, M. & Swain, M. (1980). Theoretical bases of communicative approaches
to second language teaching and testing. Applied Linguistics, 1(1), 1–47.
Council of Europe (2001). Common European Framework of Reference for Languages:
Learning, Teaching, Assessment. Cambridge: Cambridge University Press.
Goffman, E. (1981). Forms of talk. Oxford: Basil Blackwell.
Grice, H. P. (1975). Logic and conversation. In P. Cole and J. L. Morgan (Eds).
Speech Acts. (pp. 41–58). New York: Academic Press.
Heritage, J. (1985). Analyzing news interviews: Aspects of the production of talk
for an overhearing audience. In T. van Dijk (Ed.) Handbook of Discourse
Analysis (Vol. 3, Discourse and Dialogue, pp. 95–117). London: Academic Press.
Jaworski, A. (1997). Introduction: an overview. In A. Jaworski (Ed.) Silence:
interdisciplinary perspectives (pp. 3–14). Berlin: Mouton de Gruyter.
Kasper, G. (2004). Speech acts in (inter)action: repeated questions. Intercultural
Pragmatics 1(1), 125–133.
LoCastro, V. (1995). Second language pragmatics. In E. Hinkel (Ed.) Handbook of
Research in Second Language Teaching and Learning (Vol. 2, pp. 319–344).
New York: Routledge.
Morrow, K. (1979). Communicative language testing: Revolution or evolution? In
C. J. Brumfit, & K. Johnson (Eds.), The communicative approach to language
teaching (pp. 143–159). Oxford: Oxford University Press.
Nakatsuhara, F. (2007). Inter-interviewer variation in oral interview tests. ELT
Journal 62(3), 266–275.
O’Sullivan, B., Taylor, C. & Wall, D. (2011). Establishing evidence of construct:
A case study. Presentation delivered at the European Association of Language
Testing and Assessment (EALTA) conference, Siena, May. Retrieved 15 Feb-
ruary 2012 from
Ross, S. (1995). Aspects of communicative accommodation in oral proficiency
interview discourse. Unpublished PhD dissertation. University of Hawaii at
Papageorgiou, S. (2007). Relating the Trinity College London GESE and ISE
examinations to the Common European Framework of Reference. Piloting
of the Council of Europe Draft Manual, final project report. London: Trinity
College London, retrieved on February 24th 2012 from www.trinitycollege.
Trinity College London (2009). Graded Examinations in Spoken English (GESE)
Syllabus from 1 February 2010. London: Trinity College London.
van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils:
oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Wall, D. & Taylor, C. (forthcoming). Communicative Language Testing (CLT):
Reflections on the ‘issues revisited’ from the perspective of an examinations

Part V
Language Testing Practices
Assessing Spoken Language: Scoring Validity

Barry O’Sullivan74
British Council

This chapter focuses on one aspect of the testing of spoken language, that of scoring validity.
Scoring validity refers to those elements of the testing process that are associated with the
entire process of score and grade awarding, from rater selection, training and monitoring to data analysis and
grade awarding. The main thesis of this chapter is that there is a disconnect between the underlying
construct being tested and the scoring system of the test. The chapter highlights the main problems
with current approaches to the assessment of speaking from the perspective of validation and sug-
gests ways in which these approaches might be improved.

Key words: speaking assessment, validation, construct, scoring system.

1. Introduction

In this chapter, I will focus on one aspect of the testing of spoken language, that of
scoring validity. Scoring validity refers to those elements of the testing process
that are associated with the entire process of score and grade awarding, from
rater selection, training and monitoring to data analysis and grade awarding. The
main thesis of this chapter is that there is a disconnect between the underlying
construct being tested and the scoring system of the test. With regard to the testing
of spoken language, three elements are involved: the test taker (of whom the
underlying ability is clearly a characteristic), the test system (which represents
the task or tasks through which the test developer plans to access the ability) and
the scoring system (which refers to the process through which an accurate and
consistent score or grade is awarded).
The model shown in Figure 1 represents a conceptualisation of the test system,
in which the three main elements combine to identify the test construct. This
conceptualisation, initially outlined by Weir (2005) and more recently updated by
O’Sullivan (2011) and O’Sullivan & Weir (2011), highlights the links between the
elements and the need to ensure that these links are clearly stated and supported
in the development process. The details of the model are described briefly in the
remainder of this section of the chapter.

74 barry.o’

Figure 1. Model of Test Validation (O’Sullivan 2011, p. 261)

1.1. The test taker

Test taker characteristics can be seen from two broad perspectives. The first of
these refers to characteristics that are likely to be unique to the individual and
includes physical, psychological and experiential aspects.
The first of these aspects, physical characteristics, includes age and gender as well
as both short-term ailments (such as cold, toothache, etc.) and longer-term
disabilities (e.g. dyslexia, limited hearing or sight, etc.). All of these characteristics
should be considered by the test developer to determine what influence they might have
on test taker performance (e.g. Is a reading text appropriate for a group of young
learners? Is it likely to be biased towards or against boys or girls?). How the
developer intends to deal with short-term ailments can have a significant impact
on the resources needed in the development and administration process (e.g. if an
alternative test is to be offered to people who miss a test due to illness, then this
second paper must be equivalent to the first paper in all ways).
Psychological characteristics can refer to a candidate’s memory, personality,
cognitive style, affective schemata, concentration, motivation and emotional
state. All of these things are likely to have a significant impact on test taker per-
formance, if not considered by the test developer. For example, in a test of speak-
ing, we might reconsider asking young learners to participate in tasks that last
more than a very short time in order not to disadvantage those test takers who
lack the maturity to concentrate for long periods.
The final set of characteristics, experiential, relates to the experiences gained
by the test taker both before and during the test event. The aspects of experience
likely to affect test performance include education (formal and informal) as well
as experience of the examination and factors such as the amount or length (if any)
of target language country residence.
The second perspective is cognitive. Here we can consider the mental processing
of which test takers are capable (both cognitive and metacognitive) and the
resources that they bring to the event. These resources can include
linguistic and general or background knowledge.
It should be clear to the reader that the distinction I made here between the
three aspects is purely for the purpose of explaining the concept. In fact there are
clear links across the different test taker characteristics.

1.2. The test system

The test developer will take the characteristics of the test taking population into
account when designing a series of tasks, which together elicit a sample of appro-
priate language. When designing the tasks the developer will consider a range of
performance parameters, which will include such things as timing, preparation,
score weighting, test taker knowledge of how performance will be scored, etc.
This last point is very important as knowledge of what criteria will be used in task
scoring makes it more likely that the candidates will focus their language more
efficiently when responding to a task prompt. If they do not have advance notice
they will either not know what the examiner is looking for or will try to
second-guess the situation.
The other aspects of the test system to be taken into account are the linguistic
demands of the tasks and the administration of the test itself. The former refers
to the language of the input and the expected language of the output and can also
include reference to the audience or interlocutor, as it has been shown that this is
appropriate when testing spoken language (see O’Sullivan, 2007). Bygate (1999)
has demonstrated that it is not feasible to predict task performance at the micro
linguistic level. This has also been shown by O’Sullivan, Weir & Saville (2002),
who go on to demonstrate a methodology for predicting and then reviewing task
output in terms of language functions. It is likely, therefore, that the test developer
will focus on the micro linguistic aspects of the task only with regard to input and
will be more concerned with the functional aspects of the output.
In terms of test administration, we should consider security (ensuring that systems
are put in place to strengthen the security of the entire administrative process), the
physical organisation of the test event (e.g. room setup, etc.) and finally uniformity
of test delivery (e.g. ensuring that systems are in place to make certain that all
administrations of the test are the same so that no test takers are advantaged over others).

1.3. The scoring system

Perhaps the most important aspect of the scoring system is that there should be a
sound theoretical fit between this and the trait or ability being tested. This goes
beyond an appropriate key or rating scale to inform the philosophical under-
pinning of rater and examiner recruitment, training and monitoring. The most
important point here is that the developer considers all aspects of the system at
the development phase, and not as an afterthought, though of course without a
theoretically and practically viable rating scale the system is unlikely to work
effectively. I will return to this point a little later in the chapter.
Another important prerequisite of the scoring system is that the elements
should combine to result in accurate and meaningful decisions. While this notion
includes the traditional idea of reliability, it is actually far broader in that it should
be seen to include all aspects of the psychometric functioning of a test.
Finally, we consider the value of any decisions that might be based on test
scores. This relates to criterion-related evidence such as comparison with
measures such as teacher estimates, comparison with other test scores and any
claimed link to performance standards.

2. The early testing of speaking

Modern testing of speaking started with the first general proficiency examination,
published by the University of Cambridge Local Examinations Syndicate in 1913.
The Certificate of Proficiency in English (CPE) was a twelve-hour marathon, at
least by today’s standards, of which one hour was devoted to testing oral skills
(see Figure 2).

Figure 2. Text Version of CPE

1913 Examination
(i) Written: (a) Translation from English into French or German 2 hours
(b) Translation from French or German into English, and questions on English
Grammar 2½ hours
(c) English Essay 2 hours
(d) English Literature (The paper on English Language and Literature [Group
A, Subject 1] in the Higher Local Examination) 3 hours
(e) English Phonetics 1½ hours
(ii) Oral: Dictation ½ hour
Reading aloud and Conversation ½ hour

Source: Weir (2003, p. 2)

The one hour oral segment was divided into two parts, one a dictation task and the
other a reading aloud and follow-up discussion or conversation. The dictation task
consisted of a series of passages, which were delivered by the examiner. Sadly,
while elements of the original tests survive, and can be found in Weir & Milanovic
(2003, pp. 479–484), we do not have copies of the original dictation test or of the
read aloud or conversation sections and have no idea of how these were marked.
We can assume that they were awarded a grade by the examiner at the test event,
though it is unlikely that anything that resembled a formal rating scale might have
been used. This is despite the fact that, across the Atlantic in the USA, a number
of educationalists at Teachers College, Columbia, in New York had already devised
workable systems for assessing student writing (Thorndike, 1911; Hillegas, 1912).
Weir (2003) traces the changes to the CPE speaking since its introduction. The
changes to the test format are summarised in Table 1.

Table 1. The Certificate of Proficiency in English Oral Paper Changes

| CPE | Task 1 | Task 2 | Task 3 | Task 4 | Assessment |
|---|---|---|---|---|---|
| 1913 | Dictation | Reading Aloud | Conversation | | Unclear |
| 1934 | Same | Same | Same | | Unclear |
| 1938 | Same | Same | Same | Discretionary task (story | Unclear |
| 1945 | Same | Same | Same | | Unclear |
| 1953 | Same | Same | Same | | Unclear |
| 1966 | Same | Same | Same | | Unclear |
| 1975 | Interview. Questions based on photograph | Long turn (topic given 15 minutes before test – notes allowed) | Reading aloud – extract from play given 10 minutes before test. Jointly read by candidate and examiner | Listen to passage read by examiner and give appropriate response | Rating Scale: Vocabulary, Grammar & Structure, Intonation, Rhythm, Stress & Pronunciation, Overall communication |
| 1984 * | Same | Similar but this task combined with old task 4 | Same but candidate only reads short passage | Roleplay | Fluency, Grammatical Accuracy, Pronunciation & Stress, Communicative Ability, |
| 2003 † | 1-to-1 interview | Paired/small group discussion based on photographs | Individual long turn | Group Discussion | Grammatical & Lexical Resource, Discourse Management, Pronun- |

* provision for groups and individual tasks marks using selected criteria
† Paired format (groups of three possible)

When the original format is analysed it becomes clear that the limitations of the
approach are considerable. For example, the dictation task required that the can-
didate engage a combination of listening and writing skills rather than speaking,
while the reading aloud task engaged low level reading skills together with pro-
nunciation and intonation control. The conversation task has been criticised, most
effectively by van Lier (1989), for failing to reflect the true characteristics of a
communicative interaction as the format is unlikely to allow a candidate an equal
opportunity to engage fully, with regard to control of the topic, for example.
As can be seen from Table 1, the modernisation of the Cambridge CPE speak-
ing paper (which has until recent years tended to lead the way in terms of inno-
vation in speaking test design and delivery) only really began in earnest with
the 1975 revision. The paired format was not formally introduced until 2003 –
though it should be noted that the paired format had been introduced in the First
Certificate in English (FCE) by the same developer in 1996. Despite criticism of
the format from the perspective of potential imbalance and bias due to the use
of a paired format (Foot, 1999) and criticism of the decision by the developers to
award a single score based on performance on all tasks (O’Sullivan, 2002; 2007),
this approach to assessing spoken language has come to dominate the area over
the past decade. In its defence, the rating scale used in the scoring of test taker
performances was the first in a major international test to be driven by a clearly
stated model of language ability (Saville & Hargreaves, 1999).
The approach to the assessment of speaking adopted in the USA was quite dif-
ferent. Here, the focus from early on in the process was on both the content of the
test and on the measurement qualities of the test, though with an increasing con-
centration on the latter over the years. The development of the first formal rating
scale in 1955 for the Foreign Service Institute’s (FSI) language tests marked a sig-
nificant advancement in the standardisation of the rating process (Wilds, 1975).
In fact, the history of the rating scale developed for the FSI tests can itself be
traced to the work of Thorndike and Hillegas at Teachers College, Columbia
(New York) almost half a century earlier. The efforts of Thorndike (1911) to
standardise the assessment of learner handwriting and Hillegas (1912) to do the same
thing for composition may appear at first glance to be somewhat simplistic in
today’s technology-driven world, but in fact their impact can still be felt in the
comparative-judgement approaches to assessment that drive most
performance-based tests today.
Looking back over the historical development of the assessment of spoken language,
it is possible to trace two very different approaches. In the case of the UK,
the early dictation and read-aloud tasks were designed to allow the examiner to
make judgements based on an individual candidate’s ability to perform basic
productive tasks (though not tasks in which the test taker has had to create a language
output), while the more recent variations tended to recognise the importance of
communication as opposed to basic language production. This can be seen as a
movement from a focus on language display to a focus on language use. In
the USA (where the development of language tests tended to be closely associated
with military necessities and, therefore, there was always a recognition of
the need to actually perform) the tendency has been different. Here, the move has
been, interestingly, in the opposite direction, as witnessed by the Versant
technology-based assessments introduced by Pearson in its range of automatically
scored tests (discussed briefly below).
With the exception of the FSI scale, little is known of the scoring systems
devised for use with these early tests. It appears that the tendency in the UK to
focus primarily on test construct as manifested in content led to a situation where
the actual scoring was left to the devices of the early examiners, who would have
been expected to simply ‘know’ what a pass or fail sounded like. On the other
hand, in the USA, the lack of a clear focus on test construct meant that it was rare
to find any systematic link between the construct and the criteria included in the
rating scale. This lack of focus is characterized by an emphasis on the underlying
measurement model (the psychometric qualities of the test) with limited attention
to the test content (as a reflection of the construct).

3. Current practice in test scoring

Tests of spoken language are scored in any of a number of ways (see Figure 3).
Basically, these can be seen as being one of two general approaches, human and
machine scoring. Since my main point in this chapter is related to issues around
human rating, I will not focus on the latter approach.
Scoring systems typically involve the use of multiple ratings by trained examiners
of each test event. As can be seen in Figure 3, these events can be scored ‘live’
(i.e. as the candidate is speaking) or recorded for later scoring. There is some issue
here with the scoring of recorded samples as it has been argued that live events
attract higher scores than those that have been recorded (see McNamara & Lumley,
1997). This may be due to the fact that the live rater is often a participant in the
speaking event, which results in a very different experience from that of a rater who
is not fully engaged in the interaction and so may be missing vital cues which are
contributing to the success of that interaction (from the perspective of the
participants, that is).

Figure 3. Approaches to Scoring Tests of Spoken Language

The types of rating scale typically used in these tests are either holistic or
analytic, though other variants are regularly suggested, see for example the
boundary definition scale (Upshur & Turner, 1995, 1999) and the indigenous
scale (Abdul Raof, 2011). Holistic scales offer a simple and quick solution to
rating, with the rater matching a test taker performance with one of a series of
descriptions on the scale. The analytic scale is similar in structure, except that
instead of a single score, the rater is expected to award multiple scores since the
scale combines a series of sub-scales each of which looks at a specific criterion
(or aspect of language, e.g. grammar, pronunciation, etc.). It is not uncommon
for analytic and holistic rating scales to share a common underlying
understanding of language. For example, the Cambridge ESOL examiners in their
Main Suite speaking papers use two scales: the observer uses a multi-criteria
analytic scale when scoring performances while the interlocutor uses a holistic
version of the same scale to reduce rating time during the event. If we examine
the two sets of descriptors, however, we find that at each level (e.g. level 3)
the holistic descriptor contains a summary of all of the criteria contained in
the analytic scale. We can therefore say that the two scales share a common
underlying theory of spoken language. This phenomenon was demonstrated by
O’Sullivan (2007, p. 182) when comparing the analytic version of the First
Certificate in English (FCE) rating scale (used by the observer in the test
event) with the holistic version of the same scale (used by the interlocutor in the
same event). As can be seen in Figure 4, reproduced from O’Sullivan (2007),
the links between the two scales are obvious.

Figure 4. Comparing Analytic and Holistic versions of a rating scale (from O’Sullivan 2007, p.182)

The merits and demerits of holistic and analytic scales have been discussed elsewhere
(Hamp-Lyons, 1991; Carr, 2000; Iwashita & Grove, 2003) so will not be
focused on here. Instead, I would like to take an alternative view, or criticism, of
the contents of both these scale types.

3.1. What language are we assessing?

When we consider how language changes, and the considerable impact of technology
on this change, not only in lexis but also in usage (e.g. see Crystal, 2010;
Tagg, forthcoming), it should be obvious that the traditional view of language
which is reflected in existing rating scales is not likely to reflect the actual lan-
guage of daily communication. This disconnect will surely have a significant
effect on the way in which raters view test performance. So, whether we are
assessing writing or speaking, it is necessary to move beyond current practice to
take on board the findings of research into the discourse of a domain of usage. In
the assessment of writing, this is perhaps best exemplified by the trend towards
including tasks that simulate email correspondence. While this is certainly more
realistic than asking learners to write a personal letter (still a favourite task with
many test writers), to use a traditional rating scale when assessing learner perfor-
mance in new types of tasks is to miss the whole point of the task itself. What is
acceptable in email discourse may not be acceptable in more traditional writing
and vice versa. The fact is that different discourse modes will result in different
(and systematic) language output from test takers. The same can be said of spoken
Despite the considerable contribution that research undertaken at the University of
Nottingham has made to our understanding of spoken discourse
(e.g. Carter & McCarthy, 1997, 2006; Carter, Hughes & McCarthy, 2006) there
has been little or no impact on how examination boards treat different discourses
generated in tests of spoken language. It is likely that task formats (e.g. a monologue
or a dialogue) which generate these different discourses will not only result
in different linguistic responses to task prompts but will also result in different
listener/rater expectations. The language of the prepared monologue is likely to
reflect that of a more formal written discourse, so current rating scales (which are
based on a formal written grammar) are likely to offer a valid assessment of the
task performance. However, since the language of interaction-based dialogue is
co-constructed in a dynamic and less controlled fashion, it is probable that formal
grammatical rules (based on a written grammar) will not always apply. Instead,
the test developer who wishes to include such a task in a test should look at the
work on spoken grammar referred to here.
In the same way that there will be differences in the grammar of task output
depending on the task format, we should also be looking at the area of fluency to
identify possible issues there. McCarthy (2005) suggested the concept of conflu-
ence, where fluency in an interaction is seen to be co-constructed by the par-
ticipants and where success is measured by the degree to which an individual
interlocutor contributes to the smooth flowing of the interaction. McCarthy sup-
ports the more traditional view of fluency (which is focused on the individual’s
ability to adequately maintain his/her linguistic output), though he argues that
interaction-based tasks should be viewed in a different way to monologic tasks.
For the test developer, the implication is, again, obvious. We should more fully
explore McCarthy’s ideas in order to devise fluency/confluency criteria that are
more likely to offer a systematically coherent perspective on this aspect of communication.

4. Conclusions

There is a clear disconnect between test tasks currently used in tests of spoken
language and the criteria that have been traditionally used to make judgements
on the grammar and fluency of a candidate’s performance across different task
types. This is a real concern as the lack of a link between the scoring system and
the other elements of the construct (Figure 5) will compromise the validity of the
So, taking the various points I have made in this chapter and summarising
them as a list of steps that should be taken by test developers in order to improve
the way in which we test spoken language, the following emerges:
1. Ensure that all tasks are appropriate to the test taker population in terms of the
characteristics described in this chapter.
2. If paired or group activities are to be used, consider how to limit the impact on
individual test taker performance of their affective reactions to their partner or
partners – e. g. through pre-test training.
3. Create tasks that are likely to result in performances that can be predicted at
the functional level and later checked (see O’Sullivan, Weir & Saville, 2002).
4. Ensure that the rating scale criteria used to assess task performances are
appropriate in terms of the type of discourse generated.
5. Provide adequate and continuous training to raters so that they are fully
acquainted with the system and with the expectations regarding level.
6. Award scores to each task separately so that the contribution of each task to the
overall score is clear.
7. Do not rely on a single score. If double (or more) rating is not possible, then
consider using statistical procedures such as many-facet Rasch to estimate a
final score or grade.
8. Always look back at a test in terms of analysing the language of the responses
(did test takers respond in terms of language functions in the way you pre-
dicted?), the scoring (were ratings consistent and accurate?) and the level (did
the responses reflect the expected level?).
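Steps 6 and 7 in the list above can be illustrated with a short sketch. The task names, the 0–5 band scale, the two-rater design and the one-band tolerance are all assumptions made for illustration only; the sketch shows simple averaging of double ratings with a discrepancy flag and is not an implementation of many-facet Rasch analysis.

```python
from statistics import mean

# Hypothetical ratings: each task is scored separately by two raters
# on an assumed 0-5 band scale.
ratings = {
    "interview": {"rater_a": 4, "rater_b": 4},
    "paired_discussion": {"rater_a": 3, "rater_b": 5},
    "monologue": {"rater_a": 4, "rater_b": 3},
}

MAX_GAP = 1  # assumed tolerance (in bands) before a further rating is requested


def summarise(task_ratings, max_gap=MAX_GAP):
    """Return per-task mean scores, tasks needing a further rating, and the overall mean."""
    task_scores = {}
    flagged = []
    for task, by_rater in task_ratings.items():
        scores = list(by_rater.values())
        # Keep each task's score visible so its contribution to the total is clear.
        task_scores[task] = mean(scores)
        # Flag the task when the raters disagree by more than the tolerance.
        if max(scores) - min(scores) > max_gap:
            flagged.append(task)
    overall = mean(list(task_scores.values()))
    return task_scores, flagged, overall


task_scores, flagged, overall = summarise(ratings)
print(task_scores)
print(flagged)
print(round(overall, 2))
```

Reporting the per-task means alongside the overall figure keeps the contribution of each task transparent, and the flagged list indicates where a third rating (or a statistical adjustment) might be considered.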

Figure 5. Disconnect between the Scoring System and the Other Elements of the Construct

The ideas I have proposed here are not new. I have referred to this issue in my
own work (O’Sullivan 2007, p. 241) though Pavlou (1997, p. 200) was probably
the first to publish his concerns when he argued that “In order for the linguistic
differences … to be detected the speech samples should be assessed by means
of scales that are sensitive to the individual characteristics of each speech inter-
action”. The challenge now is for test developers to look beyond the world of
language testing to the broader literature on discourse to inform a more theory-
driven approach to task performance assessment.


Abdul Raof, A. H. (2011). An alternative approach to rating scale development.
In B. O’Sullivan (ed.) Language Testing: theories and practices (pp.151–163).
Oxford: Palgrave.
Bygate, M. (1999). Quality of language and purpose of task: patterns of learners’
language on two oral communication tasks. Language Teaching Research,
3(3): 185–214.
Carr, N. (2000). A comparison of the effects of analytic and holistic rating scale
types in the context of composition texts. Issues in Applied Linguistics, 11(2),
Carter, R. and McCarthy, M. (1997). Exploring English Discourse. Cambridge:
Cambridge University Press.
Carter, R. and McCarthy, M. (2006). Cambridge Grammar of English. Cam-
bridge: Cambridge University Press.
Carter, R., Hughes, R. and McCarthy, M. (2006). Exploring Grammar in Context.
Cambridge: Cambridge University Press.
Crystal, D. (2010). Language developments in British English. In M. Higgins,
C. Smith and J. Storey (eds.), The Cambridge Companion to Modern British
Culture (pp.26–41). Cambridge: Cambridge University Press.
Foot, M. C. (1999). Relaxing in pairs. ELT Journal, 53(1): 36–41.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons
(Ed.), Assessing second-language writing in academic contexts (pp. 241–276).
Norwood, NJ: Ablex.
Hillegas, M. B. (1912). A scale for the measurement of quality in English compo-
sition by young people. New York: Teachers College.
Iwashita, N. and Grove, E. (2003). A comparison of analytic and holistic scales in
the context of a specific purpose speaking test. Prospect, 18(3): 25–35.
McCarthy, M. J. (2005). Fluency and confluence: What fluent speakers do. The
Language Teacher, 29(6): 26–28.
McNamara, T. F. and Lumley, T. (1997). The effect of interlocutor and assessment
mode variables in overseas assessments of speaking skills in occupational
settings. Language Testing, 14(2): 140–156.
O’Sullivan, B. (2011). Language Testing. In J. Simpson (ed.) Routledge Handbook
of Applied Linguistics (pp. 259–273). Oxford: Routledge.
O’Sullivan, B. (2007). Modelling Performance in Tests of Spoken Language.
Frankfurt: Peter Lang.
O’Sullivan, B. (2002). Learner Acquaintanceship and Oral Proficiency Test Pair-
Task Performance. Language Testing, 19(3): 277–295.
O’Sullivan, B. & Weir, C. (2011). Language Testing and Validation. In Barry
O’Sullivan (ed). Language Testing: Theory & Practice (pp.13–32). Oxford:
O’Sullivan, B., Weir, C. & Saville, N. (2002). Using observation checklists to
validate speaking-test tasks. Language Testing, 19(1): 33–56.
Pavlou, P. (1997). Do different speech interactions in an oral proficiency test yield
different kinds of language? In A. Huhta & S. Luoma. (eds.). Selected Papers
from the 18th Language Testing Research Colloquium (pp. 185 – 201). Jyväs-
kylä, Finland: University of Jyväskylä and University of Tampere.
Saville, N. and Hargreaves, P. (1999). Assessing speaking in the revised FCE.
ELT Journal, 53(1): 42–51.
Tagg, C. (forthcoming). The Discourse of Text Messaging: Analysis of SMS Com-
munication. London: Continuum.
Thorndike, E. L. (1911). A scale for measuring the merit of English writing. Sci-
ence, 33: 935–938.
Upshur, J. A. and Turner, C. (1999). Systematic effects in the rating of second-
language speaking ability: test method and learner discourse. Language Test-
ing, 16(1): 82–111.
Upshur, J. and Turner, C. (1995). Constructing rating scales for second language
tests. ELT Journal, 49(1): 3–12.
van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils:
oral proficiency interviews as conversation. TESOL Quarterly, 23(3): 489–
Weir, C. and Milanovic, M. (eds.) (2003). Continuity and Innovation: Revising
the Cambridge Proficiency in English Examination 1913–2002. Cambridge:
Cambridge University Press.
Weir, C. J. (2003). A Survey of the History of the Certificate of Proficiency in
English (CPE) in the 20th Century. In C. J. Weir & M. Milanovic (eds.) Continuity
and Innovation: Revising the Cambridge Proficiency in English Examination
1913–2002 (pp. 1–56). Cambridge: Cambridge University Press.
Weir, C. J. (2005). Language Testing and Validation: an evidenced-based
approach. Oxford: Palgrave.
Wilds, C. P. (1975). The oral interview test. In R. L. Jones and B. Spolsky (eds.)
Testing Language Proficiency (pp.29–44). Arlington: Center for Applied Linguistics.

The Use of Assessment Rubrics and
Feedback Forms in Learning

Roxanne Wong75
ELC City University of Hong Kong

In September of 2012 the educational system in Hong Kong will be undergoing a major reform.
Students will go from having to complete three years of university studies to four. With this change,
City University of Hong Kong’s English Language Centre decided to not only make changes to the
curriculum, but also to change the assessment system. Prior to the newly implemented changes, the
courses offered were on a pass-fail basis. They are now added to the students’ cumulative grade
point average. As such, it was imperative that the assessments not only rank the students but also that
they aid the students in their learning. This is a report on the preliminary findings of a pilot study
run during the 2010–2011 academic year. The study set out to investigate student and teacher ease
in understanding the new rubrics and exemplar booklets. In addition the use of the marking scheme
as a tool for learning was also investigated. Initial results of using assessment for learning appear

Key words: Assessment, rubrics, learning, exemplars, pilot.

1. Background

From the early 1960s, the education system in Hong Kong was based upon the
British system. By 1971, six years of primary education had been made compulsory
for all students. In 1978, the Hong Kong government made three years of lower
secondary education (forms one to three) mandatory. If students chose to further
their education, they could then attend four years of upper secondary (forms
four to seven) and three years of university, provided they were academically
qualified. As of 2012, the education system is undergoing yet another reform. Under
this reform, there will be three years of compulsory lower secondary and three years
of additional optional secondary, followed by four years of optional university study.
This upcoming change is based on a new government initiative to align the education
system in Hong Kong more closely with those of Europe and North America. In
addition, these changes are community-wide and have ramifications for the entire
education system in Hong Kong.
There is an imminent need to redesign secondary and university curricula
because of the aforementioned educational reform. In preparation for these
changes, no help or advice from the local government has been given to the
universities, nor have any timelines been set; as such, each university in Hong
Kong is making the changes that best fit its needs. City University of Hong Kong
has embraced the need for change and decided to rewrite all courses being offered
in its English Language Centre (ELC). In doing so, the department has also
looked at the need to change the assessments of its courses.
Previously, all courses offered by the ELC were on a pass-fail basis.
With the changes in the new curriculum come changes in the grading system.
Classes that were previously taken for supplementary language improvement are
now included in the students’ overall grade point average (a cumulative
grade average of all courses being taken at the university during that semester).
These classes have thus become more important for our students, as it is very
common for prospective employers in Hong Kong to look at a student’s grade point
average when recruiting for positions within a company. Because past courses were
on a pass-fail basis, students reported being less concerned about overall
improvement in their English language ability than about simply finishing the
course so they could spend more time focusing on their major subjects. When
the previous assessment grading criteria were reviewed, it was discovered that there
was a lack of continuity in grading rubrics. In some courses, language (grammar
and vocabulary) accounted for as much as 30 % of the grade, while in others it was
as low as 12 %. For this reason it was decided that new assessment materials and
rubrics should be developed.
Prior to the change in the grading system, the ELC had a series of rubrics
that were designed by the assessor, who was usually the classroom teacher, and
that were unhelpful to users not teaching within the department. When these
rubrics were reviewed, it was decided that new materials and assessments would be
developed simultaneously. As such, there was a close link and a high degree of
cooperation between the course development team and the assessment team. In
addition, it was decided to change the focus of assessment from assessment for
testing purposes only to assessment for learning. In order to help in this shift of
assessment purpose, a comprehensive package was developed, including rubrics and
feedback forms for both instructor feedback and peer feedback, along with a detailed
set of exemplar scripts. This package was given to the instructors for use in their
pilot classes. While the decision to use the materials in the classroom setting
was left to the instructor, the majority of instructors used the materials provided
with their students. This study looks at the usefulness of the new rubrics for both
teachers and students. A total of five instructors and 77 students were interviewed
in the initial pilot with the aim of determining whether the rubrics could help
students understand what was needed to improve not only their grades, but also
their ability to write more academic English.

A second major change occurring at the same time is the change in the
end-of-school exams that Hong Kong students take. Previously, students did
two years of junior secondary at forms four and five before sitting for their HKCEE
(Hong Kong Certificate of Education Examination), which is equivalent to the British
O-level exams. If they did well in that exam they could go on to do a further two
years of senior secondary at form six and form seven. Students could then take
A-level exams and once again, if they did well, they could be enrolled via the Joint
University Programmes Admissions System (JUPAS) for entrance to the various
Hong Kong universities. Beginning in 2012, the Hong Kong Examinations and
Assessment Authority will have students taking the new Hong Kong Diploma of
Secondary Education (HKDSE) exam. As it is a new exam, no one knows exactly
what the results will be. In the past, the majority of students entering City
University of Hong Kong received a grade D or E on their A-Level exams. The results
of this test range from A to F, A being the highest. The new HKDSE is broken into
number grades and it is expected that the students entering City University will
be at level 3 on a 5-point scale. This is reported to be roughly equivalent to the
previous band D and E students. Currently, no correlation studies have been done;
therefore, it is not possible to determine the relationship between the previous
A-Level system and the new HKDSE.
In addition to not knowing the linguistic level of the students, we now have
the added need to apply the university-wide grading system (A through F), rather
than the previous pass-fail system, to all these students. It has been decided that
all ELC classes should have 13 fine grade levels: A+, A, A- etc. through
F. It is now necessary to subdivide almost the entire student body into the
various grade levels. The classes now become high-stakes for the students: they
are no longer on a pass-fail basis, and grades will now feed into their overall
grade point average. This was the major change that led the assessment team
to develop the package of materials being presented.
This chapter will report the findings on the usefulness of the package as a
teaching tool, as well as some practical issues that have arisen in implementing
the package to date. It will also discuss some of the shortcomings of the project
and the materials developed. Further study is now underway to improve upon what has
been developed already. It is hoped that the lessons learned from this initial pilot
can be used to further enhance the learning experience of our students as well as
lead to more detailed marking schemes in other courses within our university
and the broader academic community.

2. Literature review

Traditional assessment within the department takes the form of students reading
information from a given text and then using that information on a formal test
paper. This type of assessment is found throughout the Hong Kong educational
system, from kindergarten all the way through university (Carless, 2005). It has
been argued that this examination-driven culture is firmly entrenched in the Hong
Kong educational system (Pong & Chow, 2002). This has been said to stem from
the imperial examination system of the late Tang, early Qing dynasties (Morrison
& Tang, 2002).
An attempt was made in the Hong Kong educational system to implement
‘assessment for learning’ under the Target Oriented Curriculum introduced in
the primary and secondary educational system in 1989. This was later rescinded
and replaced by School Based Assessment. Carless found that even when teach-
ers wanted to implement the new practices of assessment for learning they did
not have the support or time (Carless, 2005). These findings are important in the
university setting as the students come to university with preconceived ideas on
what education and, by extension, assessment should involve. If they carry with
them these preconceived notions about the nature of assessment, it is a struggle at
best to implement change into the system. Moreover, instructors carry these same
preconceived notions and the cycle of change is slow (Katz, 1996).
Brown, Iwashita and McNamara (2005, p. 3) wrote that:
It has been pointed out by a number of writers (e. g., Cumming, Kantor, & Powers, 2001,
2002; Fulcher, 1987; Matthews, 1990) that rating scales used in the assessment of second-
language proficiency often have no basis in actual performance. North and Schneider (1998)
comment that “[m]ost scales of language proficiency appear in fact to have been produced
pragmatically by appeals to intuition, the local pedagogic culture and those scales to which
the author had access.”

This was found to be true of the marking rubrics for the classes in the ELC. The
new rubrics were developed using a variety of methodologies that were based on
analysis of student writings.
Alderson (2005) stated that diagnostic assessments should allow for detailed
analysis and feedback that students can act upon. To be able to give the stu-
dents the feedback needed to assist learning, analytic scales should be devel-
oped. Weigle (2005) wrote that these result in higher reliability and offer better
diagnostic information, which by extension would aid students in their learning.
Knoch (2009) reported that when a rating scale with descriptors based on dis-
course-analytic measures is used the results are more valid and useful for giving
students feedback on diagnostic assessment.

Teachers’ beliefs have an impact on the way they view and implement assessment
for learning (Katz, 1996; Marshall & Drummond, 2006). A study conducted by
Maclellan ascertained that teachers believed the primary purpose of assessment
was to rank achievement but at the same time they used it for formative purposes.
Students agreed with this but also claimed that they seldom received formative
feedback that was useful to their learning (Maclellan, 2001). It is for reasons like
those found in Maclellan’s study, as well as supporting data from surveys done with
students in ELC courses, that the ELC decided to implement an assessment for
learning plan.

3. The role of the ELC

The ELC at City University has the mission of improving students’ general level
of English. The students generally come to the department at CEFR level B1, as
determined by their previous Hong Kong A-Level scores. It is our aim to get them
to a strong level B2 (which has been roughly correlated to an IELTS band 6 by the
Hong Kong Examinations and Assessment Authority, 2008) at the end of one and a half
years of study within the department. With the implementation of
the new curriculum and the new set of classes being offered, we also have the added
burden of giving students fine grades when this has never been done in the
department before. The development and implementation of new courses, the change in
the grading system, and an increase in the number of students per class mean that a
great number of changes are taking place. At the ELC, we are making all those
changes in one year. As such, the assessment team decided to make a comprehensive
assessment rubric and exemplar booklet to help both instructors and students ease
into at least one aspect of these changes.
The University has declared that the new EAP course will be the ‘gatekeeper’
to the students’ university studies. Students who enter the university with a grade
D or below on the HKALE must pass the EAP course. If they do not, they will not
be able to move to a higher level of English study (ELC, 2011). It is a graduation
requirement that students reach a certain level of competency in English and the
new EAP course is to ensure that.
The ELC has a long-standing reputation for quality assurance in
assessments. All exams are double-marked, with around twenty percent being
triple-marked. In addition, a moderation team meets after all assessments to verify
the marks given. The previous marking rubrics required this extra marking, as they
were very vague in defining the standard. Nevertheless, even with extensive
standardization sessions, consistency across markers could not be ensured. It is
hoped that the new marking scheme, along with the detailed exemplar booklets, will
better ensure that consistency in grading is maintained across instructors and classes.

4. The study

4.1 Purpose of the study

The aim of the present study was twofold. First, it aimed to investigate the overall
effectiveness, as a teaching/learning tool, of a new assessment package that had
been developed at City University of Hong Kong; second, it looked at the
possibility of linguistic improvement among students. In particular, the study
sought to answer the following questions:
1) Was the marking scheme useful for helping students learn how to learn?
2) Were the marking scheme and exemplar booklet easy to use, for both students
and instructors?
3) Did the teachers believe the marking scheme enabled students to learn better?
Or was it just used as a tool for assessment?
4) Did the students show growth in their writing from the first diagnostic test to
the final course exam?

4.2 Method of study

To determine the usefulness of the new rubrics in the classes, data was gathered
from three sources: 1) questionnaires given to all students in the pilot study, 2)
small group interviews of all students who completed the questionnaire, and 3)
individual interviews with all teachers of the course.
To determine growth in students’ writing, data was collected from three tests
taken during the term: a diagnostic test in week 2 used to determine entry-level
linguistic abilities, an in-class assessment in week 11, and the final exam in week
14 of the term. The focus of this study was on the students’ development in the
categories of grammatical complexity and grammatical accuracy. The newly
developed rubrics have divided the traditional category of grammar into two separate
categories. The first, complexity, includes the students’ ability to use a variety of
sentence types as well as the students’ mastery of areas which are usually ‘major
errors’ commonly found in the writings of Hong Kong Cantonese L1 learners
of English. Some examples typically found include verbless sentences, comma
splices and run-on sentences. Grammatical accuracy includes basic mechanics,
subject-verb agreement problems, and correct use of tenses.

4.3 Data collection

The 77 students involved in this study were all year one EAP students from the
ELC of City University. They were part of a pilot study being conducted
university-wide. All entered the university with the understanding that they would be
piloting an entirely new degree programme. The students were in classes of 16 to 20
students. The data was collected from the five ELC classes in the pilot study. The
researcher was allotted one hour of class time during week eleven to conduct
interviews and surveys, and all students were interviewed in groups of five to six.
During the same class period, the students also completed the questionnaires.
The small group interviews were audio recorded and transcribed manually.
Interviews were translated when necessary. The majority of the students answered the
interviewer’s questions in English, but Cantonese was allowed when the student
was unable to articulate a point. If the interviewer was unable to understand the
L1, the aid of a Cantonese-speaking instructor was enlisted for translation.
The teachers in the course all work in the English Language Centre of City
University. They have been with the department for between 6 and 12 years. All are
subject trained in English, with qualifications ranging from Cambridge Diploma in
Education to MATEFL76 and MAAL77. All five teachers in the course were
interviewed after the course was completed and examinations were marked. This was
done to enable the researcher to get a complete picture of progress.
Three hundred student texts were analyzed (three from each student). The
researcher looked for growth or lack of growth in the five different aspects of
language included in the rubrics. The data was taken from three writing tasks
given to the students, namely, a diagnostic exam, an in-class assessment (ICA),
and final examination writings.

4.4 Data analysis

Data analysis mainly involved 1) transcription of interviews with both students
and teachers, 2) analysis and interpretation of data collected through the survey,
and 3) detailed discourse analysis of the students’ written work. This entailed
investigating vocabulary usage, grammatical accuracy, syntactical accuracy,
overall organization and task completion. Interview data was collected and later
transcribed with the intention of finding recurring themes. First the interviews
were analysed in two separate groups (teachers and students), and later the data
from both groups was combined so as to determine any similarities of opinion or
any striking differences. The same was done for the survey questionnaires.

76 Master of Arts in Teaching English as a Foreign Language

77 Master of Arts in Applied Linguistics

When analyzing the students’ work, each of the five elements of the rubric was
examined against the requirements set out under each category. The first, task
completion, focuses mainly on the students’ ability to read the task instructions
and answer the question accordingly, using appropriate academic register. The
next element is organization, which measures the flow of ideas within the text.
Vocabulary grades the students’ ability to use vocabulary appropriate to the
task. This includes vocabulary from the academic word list as well as
task-specific vocabulary. As previously stated, grammatical accuracy and syntactical
accuracy measure a student’s error density and sentence type variety.

5. The materials

The course package was designed based on past marking schemes within the
department. By analyzing the shortcomings of past marking schemes and the
intended learning outcomes of the new courses, the assessment team was able
to design a comprehensive package that includes a sample test paper, a feedback
form (Appendix 1), a marking rubric (Appendix 2), explanatory notes for teach-
ers, and an exemplar section with detailed justifications for each of the grades A
through F.
The feedback form was developed for each domain on the rubric. The assess-
ment rubric was developed after careful analysis of 130 student sample papers.
The papers were graded first using a split of high, medium and low. This was then
followed by marking by IELTS examiners. Finally, discourse analysis was
undertaken to create each of the domain descriptors. The explanatory notes and
exemplar section were then created using one script from each of the five grade levels.

6. Results and discussion

Initial results indicate that the combination of feedback forms and marking
rubrics is helpful in informing learning. The can-do statements of the marking
rubrics (Appendix 2), along with the webpages for independent learning which
have been included in both the exemplar booklets and the student feedback forms,
have garnered positive comments from teachers and students alike. The inclusion
of “caps” or “over-arching statements” (as can be seen in Appendix 2 in bold) has
been noted to provide students with positive benchmarks on their ability as well
as the knowledge of what needs to be done to further improve themselves.

6.1 Student survey and interview analysis

Students were asked 18 questions: four general language questions about their
previous feedback experiences, eight related to the marking scheme, and six
related to the learning process.
The students who enter the ELC have a low level of English. Many of them
have been receiving poor scores in English classes for years. One of the main
problems instructors in the department face is that of motivation. As such,
some of the interview questions and the questions from the survey were geared
towards student motivation. Of great importance to the instructors in the ELC
were the students’ answers regarding beliefs about their writing performance. Of
the 58 respondents to the questionnaire, 63 % believed their overall writing had
improved, and a further 65 % believed their accuracy had improved. When they
were asked in the interviews to explain their answers, they unanimously stated
they were more motivated to learn English now than when they were in secondary
school. Typical comments included:
“It is because often the secondary teachers cannot provide adequate suggestion on the
enhancement of the grammar. However, the university instructor can analyze the grammar
errors with each student, which is useful for the students to have individual practice in differ-
ent areas. I want to learn how to be better now” (Student 1)

“The reason is that the writing part in the secondary school is too examination-oriented, so
it is not practical for me. Now I have a reason to learn better. I must work soon” (Student 2)

Students were asked about their overall improvement in a variety of aspects of
sentence writing. The majority believed they could write better in a variety of
aspects after taking the EAP course than before (see Table 1). Motivational
factors play an important role in the courses taught to students at this level so it was
very important to see whether the students’ beliefs in their areas of improvement
during the course were consistent with their actual improvement.
When asked about the most useful part of the marking scheme, the researcher
was unsurprised that 42 % of students believed task fulfillment was most
important, followed by grammatical accuracy (24 %) and grammatical complexity
(20 %). Part of task achievement was being able to write ‘academically’. This
aspect of the grading criteria is not taught in most local secondary schools;
therefore this feature of the course was new to many of the students. The majority of
students (77 %) believed, however, that the marking scheme was incomprehensible
without instructor explanation. Seventy-seven percent of students also noted that
they received a great deal of teacher feedback on how to improve their language
The newly designed marking rubrics used ‘can-do’ statements. These were
chosen because of the imminent alignment with the CEFR in 2012, as well as the
fact that the newly designed Hong Kong Diploma of Secondary Education uses
them. As the students in the pilot study had not been exposed to these before,
the researcher also asked about their usefulness in enhancing learning. Sixty-
four percent believed the can-do statements enabled them to know exactly what
aspects of the language they needed to improve in order to get a better grade as
well as to improve their overall language abilities.

Table 1. Student opinions on writing improvement

When asked about the usefulness of the can-do statements, students replied with:

“It can act as a guide for me to know what aspect of my writing can be improved” (Student 2)

“This can make us focus on what we should pay attention to and improve that particular
aspect” (Student 3)

“Can do statement tell us how to do a better job in task fulfillment and other areas” (Student

78 Students’ own words as given to interviewer in English.

6.2 Instructor interview analysis

Instructors were asked about the usefulness of the marking scheme in both
understanding of the grading criteria for themselves, and for their students. The
researcher was particularly interested in the use of metalanguage throughout the
marking scheme, and the use of overarching statements as both these aspects are
new to the ELC. All five instructors of the course were interviewed after the final
exams were given, in order to get a better overall impression of their usage of the
rubric. The results are very promising.
All five instructors believed it would take time to learn to use the new rubrics.
They agreed it was much more detailed than the current marking schemes and
expected to have difficulty when they first began using it. Four of the five
instructors asked the researcher for clarification and further instruction on
interpretation of the new rubric.
The instructors all believed that the addition of the exemplar samples and the
booklet with complete teacher’s notes was helpful. They believed
that the language used was accessible to instructors and students alike,
although many did say that the limited amount of metalanguage would need to be
explained to students. With that explanation, they added, it was possible to
use all the information in the booklets as teaching materials. This was the one
aspect of the rubrics that came to fruition.
One of the instructors mentioned that she felt better able to tell her students
what needed to be done so that they would improve. She asserted that this
was not because she did not know how to teach before, but because she now had
better details of the requirements for each of the different domains being tested. She
stated that having detailed exemplars along with the overarching statements
and caps on levels helped her feel better about the grades she gave.

7. Growth study

When the course was finished, all student scripts were analyzed for growth in
the domains of grammatical complexity and grammatical accuracy. The sources
for the growth study were the EAP1 diagnostic tests (administered in week one of
the course), the EAP1 ICA2 scripts (from week 11 of the course), and the EAP1 Exam
(taken two weeks after the course finished in week 14). To determine growth, the
researcher evaluated all scripts for the 100 students in the pilot course. The results
were then tabulated using Pearson’s Product Moment Correlation.
Growth can be seen in 62 % of the grammatical complexity category and 55 %
of the grammatical accuracy category. A strong relationship between task fulfillment
and grammatical complexity was found (Pre: r = 0.72, During: r = 0.62, Post:
r = 0.83). A moderate relationship was discovered between task fulfillment and
grammatical accuracy (Pre: r = 0.47, During: r = 0.57, Post: r = 0.57).
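As a rough illustration of how a Pearson product-moment coefficient of this kind is computed, the sketch below correlates two lists of domain scores. The function name pearson_r and the six pairs of scores are invented for illustration; they are not the study's data and do not reproduce its coefficients.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two domains, and the spread of each domain on its own.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical rubric scores for six scripts (assumption, not study data).
task_fulfillment = [2, 3, 3, 4, 4, 5]
grammatical_complexity = [1, 3, 2, 4, 5, 5]
r = pearson_r(task_fulfillment, grammatical_complexity)
print(f"r = {r:.2f}")
```

Running the same function on the diagnostic, ICA2 and exam scores separately would give one coefficient per time point, in the manner of the Pre, During and Post values reported above.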
There was also a group of students who showed no growth or negative growth
in the areas of grammatical accuracy or grammatical complexity. Negative
growth was seen in 15 % of grammatical complexity and 10 % of grammatical
accuracy, while 42 % and 36 %, respectively, showed no growth in the same two
categories. However,
when there was a lack of growth in the two grammatical aspects of the student
writing, there was substantial growth, 78 %, in task achievement. This domain is
an area that was very new to the students. The feedback from the students and
teachers indicates that more focus was dedicated to these aspects of the course. It
was discovered that students who had negative or no growth showed less variety
in sentence types when they took the diagnostic test. They mostly used basic
sentence structures that could be transferred from L1 (subject, verb, object), which
were highly grammatically accurate. They used mainly present simple or past
simple tenses in their writing. As the course went on, in the ICA2 and the final
exam, they used a greater variety of sentence types and made many more errors,
often caused by L1 interference. They also began to attempt to use a variety of
different tenses.
When looking at the work of individual students, a number of them began the
term (as per the diagnostic test) with very simple sentences and limited variety
in tenses. These sentence structures are very similar to what they would write
in Chinese. The majority of students also wrote in the simple present tense only.
Through translation by a Chinese-speaking coworker, it was discovered that the
students’ writing could be directly translated from their L1. By the
end of the term (as can be seen in the final exam), the students were able to move
beyond simple translation and began to show use of a variety of more complex
structures. The two tables below show segments of writing from the best (Table
2) and worst (Table 3) students in the course. As shown by the italicized print, the
students still made mistakes in their writing at the end of the course. The underlined
print shows changes in the students’ grammatical structures.

Table 2. Highest graded final exam

Diagnostic: The naked truth is that there is no medicine to cure it.
In Class Assessment: The naked truth is that teenagers will become desensitised to violence, while youngster always play violent games, they will opt for solving problem by acting aggressively and ultimately they are emotionally numb to violence.
Final Exam: Maybe in crimes of stealing which only led to the loss of money or some personal things, the victim is still alive and the things may find back while the stealers are caught.

Diagnostic: With a view to reducing human suffering and saving lifes, animals should be used in research.
In Class Assessment: From day to day, these youngsters internalise the message in which treating others rudely is correct in the hostile world, the consequence of which is serious to them.
Final Exam: As human beings, we should cherish the life and the behaviour of the murders disrespect of life; hence, they must be punished by the courts.

Table 3. Lowest graded final exam

Diagnostic: We know that nowadays there are many new virus threat our health.
In Class Assessment: Although they may not want to going out to beat somebody up immediately after playing violent games, the problem is lead them to desensitizn the violent; feel violent is not a problem.
Final Exam: If death penalty is still working as a penalty, it is not only harms the victims, but also harms the victim’s parents.

Diagnostic: If we get more these like of data and generous knowledge, we can study more and create more and more new drugs to attack the virus.
In Class Assessment: Teenagers are not mature enough to think about using violent is correct or not, researches shown that teenagers who played violent games become more violence than the peer who have not do so.
Final Exam: Victims need to get penalty by their wrong being, but their parent do not getting wrong, facing their family members, causing death is one of the harmful punishmant for victims’ parents.

8. Conclusions and implications for further study

This study is a work in progress; thus the results should be viewed with that
in mind. Students believed the marking scheme was helpful if aspects of it
were explained to them by their instructors. They had problems with some
of the vocabulary, but overall believed it helped them to understand what was
required for improvement in all language domains. This aspect of it they found
Unfortunately, at the time of the interviews, the majority of the students had
not seen the exemplar booklet that was provided, so the researcher was unable to
investigate the ease of understanding and use by the students. The instructors all
believed the booklet to be highly informative but at the same time very time-consuming
to go through. None read the entire booklet, but rather chose to “dip into
it” when questions arose in the marking of student papers. They believed the rubric
allowed for student learning, but all noted that changes needed to be made for it
to fulfill the goal of assessment for learning. Currently they believe it to be mainly
an assessment tool.
The preliminary results have shown that there are still areas for improvement
in the grading rubric. The rubric does not capture the shift in growth between
task fulfillment and the two grammatical aspects. It also does not account for
errors made when students move away from direct translation. As it currently
stands, students will be awarded a higher grade if they continue using
simple language. Because of this, the lower levels of the marking scheme may
penalize the students rather than help them move into a higher grading category.
This may cause negative motivation.
At the time of writing, the researcher has been given a grant to further investi-
gate the usefulness of the rubrics on a larger scale pilot involving 1600 students.
Refinement of the rubrics is ongoing. It is hoped that this new study will bring the
department closer to its goal of assessment for learning.
There is potential for a more refined grading scale, and it would be of great
interest to further investigate the usefulness of the rubric with a larger sample
size. It would also be of interest to research the growth differences, if any, in the
assessment domains of vocabulary and cohesion.


Alderson, C. (2005). Diagnosing foreingn language proficiency. The interface

between learning and assessment. Lonson: Conrinuum.
Authority, H. K. (2008). Benchmarking Studies on INternational Examinations.
Retrieved February 18, 2011, from Hong Kong Examinations and Assessment
Brown, A., Iwashita, N., & McNamara, T. (2005). An Examination of Rater
Orientations and Test-Taker Performance on English-for-Academic Purposes
Speaking Tasks. ETS, MS-29.

Carless, D. (2005). Prospects for the implementation of assessment for learning.
Assessment in Education, 12(1), 39–54.
ELC (English Language Center, City University of Hong Kong) (2011). Who
Takes What? Retrieved January 23, 2011, from English Language Centre
Katz, A. (1996). Teaching style: a way to understand instruction in language
classrooms. In K. M. Bailey, & D. Nunan (Eds.), Voices From the Classroom
(pp. 81–83). Cambridge: Cambridge University Press.
Knoch, U. (2009). Diagnostic assessment of writing: A comparison of two rating
scales. Language Testing, 26(2), 275–304.
Maclellan, E. (2001). Assessment for learning: the differing perceptions of tutors
and students. Assessment and Evaluation in Higher Education, 26(4), 307–
Marshall, B., & Drummond, M. J. (2006). How teachers engage with Assessment
for Learning: lessons from the classroom. Research Papers in Education,
21(2), 133–149.
Morrison, K., & Tang, F. (2002). Testing to destruction: a problem in a small state.
Assessment in Education, 9(3), 289–312.
Pong, W., & Chow, J. (2002). On the pedagogy of examinations in Hong Kong.
Retrieved June 2011, from The HKU Scholars Hub:
Weigle, S. (2005). Assessing Writing. Cambridge: Cambridge University Press.

Appendix 1 – ICA2 Paragraph Writing Feedback Form

Grammatical Complexity 20%

Strengths
☐ Different types of phrases/ clauses are constructed correctly
☐ Compare and concessive/ conditional and result/ contrast/ embedded/ gerund/ noun/ place/ prepositional/ reason/ relative – defining/ relative – non-defining/
☐ Constructed different types of complex sentences correctly
☐ Mostly complex sentences -- ☐ correctly constructed
☐ Some compound sentences -- ☐ correctly constructed
☐ Not many simple sentences -- ☐ correctly constructed (≤2 if many simple
☐ Constructed ‘both…and’/ ‘either…or’/ ‘neither…nor’ correctly
☐ Correct subordination structures
☐ Correct parallel structures (IL tip- )
Weaknesses
☐ Run-on sentences (≤3)
☐ Sentence fragments (≤3) (IL tip-,
☐ Unclear/ missing subject or object complements
☐ Missing objects of transitive verbs

Grammatical Accuracy 20%

Strengths
☐ High degree of grammatical accuracy (more than half of the paragraph is error
☐ Quite accurate
Weaknesses
☐ Full of errors (1/0)
☐ Errors are often distracting (may require re-reading/ obvious errors that give the impression of weak language accuracy) (≤2)
☐ Very common errors (≤3) (IL tip-, [errors among Cantonese speakers])
☐ Errors identified:
  ☐ Article
  ☐ Capitalisation
  ☐ Pronoun usage
  ☐ Singular-plural
  ☐ Spelling
  ☐ Subject-verb agreement
  ☐ Tense
  ☐ Verb form
  ☐ Passive/ Active Voice
  ☐ Word form
  ☐ Word order


Appendix 2 – Marking Scheme (ICA2 & Exam 2b)
20% 20% 20%

Has shown evidence of articles read & has answered the question.

Level 5
1. Can usually show logical development of ideas and can clearly state opinions. 2. Can usually use academic writing style and impersonal tone.
1. Can usually use a sufficient range of cohesive methods correctly. 2. Can usually elaborate the main idea well to give completeness to the argument. May use different elaboration methods.
Can often construct different types of phrases, clauses and complex sentences correctly.
1. Can maintain a high degree of grammatical accuracy. 2. Most sentences are error free.
1. Can usually and correctly use vocabulary relevant to the topic. 2. Can usually show skillful use of academic/ high-level/ less frequent vocabulary and/or synonyms.

Level 4
1. Can often [show logical development] of ideas and can [clearly state opinions]. 2. Can often [use academic writing] style and [impersonal tone].
1. Can often use a sufficient range of cohesive methods correctly. 2. Can often elaborate the main idea relevantly.
Can often construct different types of phrases and clauses and a few types of complex sentences correctly.
1. Can write quite accurately though with some distracting errors. 2. Many sentences are error free.
1. Can use [vocabulary] relevant to the topic though some instances of repetition. 2. Can often show skillful use of academic/ high-level/ less frequent vocabulary [and/or synonyms].

Level 3
1. Can sometimes show logical development of ideas and can clearly state opinions. 2. Can often use academic writing style and impersonal tone.
1. Can use different cohesive methods though may use cohesive devices mechanically. 2. Can often elaborate the main idea relevantly though with some unclear details (e.g. issues in […]
Can sometimes construct different phrases, clauses and complex sentences. There may be run-on sentences and/or sentence fragments.
1. Can write quite accurately though with some distracting errors that are very common among students. 2. Some sentences may be error free.
1. Can use vocabulary relevant to the topic though some instances of repetition. 2. Can sometimes show skillful use of academic/ high-level/ less frequent vocabulary and/or synonyms.

Level 2
1. Can […] of supporting ideas though they may be disjointed. 2. Can sometimes use academic writing style and impersonal tone.
1. Can use […] though may […] problems in […] 2. Can elaborate the main idea though often relying on […]
Can construct some phrases and clauses but many simple sentences. There may be run-on sentences and/or sentence fragments.
Can write quite accurately though with errors that are very common among students and are often distracting.
1. Can use vocabulary relevant to the topic though many instances of repetition. 2. Can sometimes show the use of academic/ high-level/ less frequent vocabulary and/or synonyms.

Has no evidence of articles read OR has not answered the question.

Level 1
1. Can show some evidence of idea flow though ideas are often disjointed. 2. Can show very limited awareness of academic writing style.
1. Can use different cohesive methods though may have problems in referencing. 2. Writes without topic sentence or can seldom elaborate the main idea relevantly.
Can construct a few types of phrases and clauses though mistakes are predominant and/or throughout. Writing may be characterized by the lack of variety of sentence structures and mistakes.
Can show very limited control of grammar though errors are predominant and always distracting.
1. Can use basic vocabulary though repeatedly. 2. Can hardly show the use of academic/ high-level/ less frequent vocabulary and/or synonyms.

1. Anything worse than 1.
2. No attempt at the task.

Identifying Important Factors in Essay Grading
Using Machine Learning

Victor D. O. Santos
Iowa State University

Marjolijn Verspoor
University of Groningen
University of the Free State

John Nerbonne
University of Groningen

Writing fluently and accurately is a goal for many learners of English. Grading students’ writing
is part of the educational process, with some grades even serving as a qualification for entering
universities. This chapter identifies factors that correlate strongly with essay grades and English
proficiency level, with a view to contributing to Automated Essay Scoring (AES). The study is
based on a database containing 81 variables coded in short, spontaneously written texts by 481 Dutch high
school learners of English. All texts were holistically scored on a proficiency level from 0 to 5 by
several experts. We investigate which machine learning algorithm (from a subset of those present in
the WEKA82 software) provides the best classification accuracy in terms of predicting the English
proficiency level of the essays. Logistic Model Tree (LMT), which uses logistic regression, achieves
the best accuracy rates (both in terms of precise accuracy and adjacent accuracy) when compared to
human judges. The aim of this chapter is to build a bridge between applied linguistics and language
technology in order to find features that determine essay grades, with a view to future implemen-
tation in an AES system.

Key words: machine learning, English proficiency level, essay scoring, Logistic Model Tree, lan-
guage testing.

1. Introduction

Writing fluently and accurately is important in numerous jobs and professions,
and essay writing is an important exercise in learning to write fluently and accurately.
Written proficiency is important in that it is often tested and graded as part
of admissions procedures for higher education and professional education. The
TOEFL exam and others such as the Cambridge exams and IELTS are examples
of high-stakes tests that might have a large impact on test takers, such as allowing
or denying them entrance into an academic program at a university, depending
on their score. Other essay grades might be used for purposes that do not impact
test takers’ lives so strongly, such as helping teachers at a language center decide
in which English class a certain student should be placed.

82 Waikato Environment for Knowledge Analysis

Identifying the factors that determine essay grades is difficult because
many potential factors have been proposed, many of which may in fact not
be relevant to the construct at hand. One might decide to focus on just a few factors,
but typical real-world data includes various attributes, only a few of which
are actually relevant to the true target concept (Landwehr, Hall, & Frank, 2005).
In our case, the target concept, or construct at hand, is English proficiency.
In this chapter, we would like to use real-world data – texts written by L2
learners of English, which are coded for 81 linguistic variables – and identify
which set of factors is actually relevant to the true target concept: the proficiency
level of the learners. To do so we first investigate to what extent machine learn-
ing algorithms and techniques, such as those implemented in the widely used
WEKA package (University of Waikato), can help us with our task at hand: clas-
sifying/scoring essays according to their level of English proficiency. Given that
machine learning is quite appropriate for dealing with a large number of features
and optimal at finding hidden patterns in data, we want to explore how suit-
able these algorithms are for dealing with the delicate and multivariate reality of
second language proficiency. We are also interested in knowing if and how the
outputs (results) of some classifiers (algorithms) might reflect common practice in
Applied Linguistics. Finally, we would like to know whether human raters differ
significantly, in terms of classification accuracy, from the classifiers we used and
the factors we identified.
Once we have identified the factors that determine essay grades, we shall be
in a position to focus work on AES on automating the determination of those
factors. We envision our work as contributing to this line of research. In the
following section we review work in AES to suggest how the present chapter might
contribute to it.

2. Literature review

Automated Essay Scoring has been making substantial progress since its inception,
dated to the 1960s and the work of Page and his Project Essay Grade
(PEG) system (Page, 1966). Contemporary systems make use of different techniques
and frameworks in order to arrive at a classification for a given writing
sample. The number of essays used to train the various systems also varies. We
discuss here some of the main AES systems currently in use.

Page (1966), the developer of the PEG system, defines what he calls trins
and proxes. Trins are intrinsic variables such as punctuation, fluency, grammar,
vocabulary range, etc. As Page explains, these intrinsic variables cannot, how-
ever, be directly measured in an essay and must therefore be approximated by
means of other measures, which he calls proxes. Fluency, for example, is mea-
sured through the prox “number of words” (Page, 1994). The main idea of the
PEG system is quite similar to the one we have employed in our research. The
system obtains as input a training set containing a large number of essays each
with the values for the chosen proxes already assigned and a score for the overall
writing quality. The system, by using regression analysis, arrives at the optimal
weight for each of the proxes. For future ungraded essays, the system extracts
from them the values for the same proxes used in the training phase and reaches
a decision with regard to the level of the essay. The score generated is essentially
a prediction of the grade that a teacher would give for that specific essay (Rudner
& Gagne, 2001).
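The regression step described above can be sketched as an ordinary least-squares fit. This is a minimal illustration under stated assumptions, not PEG's actual implementation: the function names, the second prox (a comma count) and all of the training values are invented, and only the "number of words" prox comes from Page's description.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so that row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("proxes are collinear; weights are not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_prox_weights(prox_rows, scores):
    """Least-squares weights (intercept first) for predicting scores from proxes."""
    rows = [[1.0] + [float(v) for v in r] for r in prox_rows]
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * s for r, s in zip(rows, scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict_score(weights, proxes):
    """Apply the fitted weights to the proxes of an ungraded essay."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], proxes))

# Hypothetical training essays: (number of words, number of commas) and human score.
training_proxes = [(80, 2), (120, 5), (150, 4), (200, 9), (240, 8), (300, 14)]
human_scores = [1, 2, 2, 3, 4, 5]
weights = fit_prox_weights(training_proxes, human_scores)
print(round(predict_score(weights, (180, 7)), 2))
```

The predicted value is continuous, so a system of this kind would still need a rule for mapping it onto the discrete grade scale, for example rounding to the nearest level.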
Intelligent Essay Assessor™ (IEA) is an essay scoring system developed by
Pearson Knowledge Analysis Technologies (PKT), which can provide analytic
and trait scores, as well as holistic scores indicating the overall quality of the
essay (PKT, 2011). This means that IEA can be used not only as an essay scoring
system, but also in informative feedback, where students/test takers can identify
the areas in which they are stronger and those that they need to work on (Figure
1). In addition to being able to analyze the more formal aspects of language, IEA
also examines the quality of the content of essays, since it uses Latent Semantic
Analysis (LSA), which according to Landauer et al. (1997) is “a theory and
method for extracting and representing the contextual-usage meaning of words
by statistical computations applied to a large corpus of text” (p. 259).
IEA is trained on a domain-specific set of essays (100–200 essays) that have
been previously scored by expert human raters. When fed new essays whose score
is unknown, IEA basically compares through LSA how similar the new essay is
to the ones it has been trained on. If the new essay shows the highest similarity
(both in terms of content and formal aspects, since these are not separate in LSA)
to those essays in the corpus that have been scored a level 3, for example, that will
be the score that IEA will output for the new essay in question. Through LSA,
IEA is able to take into account not only formal linguistic features of the essays
but also their semantics, by representing each essay as a multidimensional
vector. According to the PKT website, the correlation between IEA and human
graders is “as high or higher than that between two independent human raters”
(PKT, 2011).
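The similarity-based scoring idea described above can be sketched in a few lines. The sketch is deliberately simplified and is not IEA's implementation: plain bag-of-words count vectors stand in for the reduced vector space that LSA derives from a large corpus, and the function names and the tiny training set are invented for illustration.

```python
import math
from collections import Counter

def to_vector(text):
    """Represent an essay as a bag-of-words count vector (a stand-in for an LSA vector)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def score_by_similarity(new_essay, scored_essays):
    """Return the level whose training essays are, on average, most similar to new_essay."""
    new_vec = to_vector(new_essay)
    sims_by_level = {}
    for text, level in scored_essays:
        sims_by_level.setdefault(level, []).append(cosine(new_vec, to_vector(text)))
    return max(sims_by_level, key=lambda lv: sum(sims_by_level[lv]) / len(sims_by_level[lv]))

# Hypothetical human-scored training essays (text, level).
training = [
    ("my school is big and i like my school", 1),
    ("i like my new school it is big", 1),
    ("although the school is large the teachers know every student", 3),
    ("the teachers explain every subject although classes are large", 3),
]
print(score_by_similarity("the teachers know every subject although the school is large", training))
```

In a genuine LSA pipeline the count vectors would first be projected into a lower-dimensional space by singular value decomposition, so that essays using different but related words could still come out as similar.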


Figure 1. IEA™ sample feedback screen

Another quite well known AES system is e-rater®, developed by Educational Testing
Service (ETS) and employed in the TOEFL (Test of English as a Foreign
Language) exam. The current version of e-rater® is based on over 15 years of
research on Natural Language Processing at ETS and takes several features into
account when holistically scoring a writing piece (ETS, 2011). According to ETS,
some of the features used by e-rater® are: content analysis based on vocabulary
measures, lexical complexity, proportion of grammar errors, proportion of usage
errors, proportion of mechanical errors, organization and development scores,
idiomatic phraseology and others (ETS, 2011). From these features, we see that
e-rater® goes beyond looking only at surface features and also examines organizational
and developmental features, which makes it more suitable for use in a
higher-stakes test such as the TOEFL than a system like PEG.
Many other systems have been developed, such as ETS1, Criterion, IntelliMetric
and Betsy, to mention a few. These systems vary considerably in their
approaches and methods for essay scoring. In 1996, Page made a distinction
between automated essay scoring systems that focus primarily on content (related
to what is actually said) and those focusing primarily on style (surface features,
related to how things are said) (as cited in Valenti, Neri & Cucchiarelli, 2003).
Intelligent Essay Assessor, ETS1 and e-rater® are examples of the former
type, while PEG and Betsy (a Bayesian system) are examples of the latter.
Rather than approach the problem of classifying texts according to grades
directly, our strategy is to first attempt to isolate factors that correlate highly with
test grades, in order to focus on automating these in a second step. We attempt
thus to “divide and conquer”, focusing here on dividing the problem into more
manageable subproblems.
In our study we will primarily look at surface features, which include non-linguistic
(such as total number of words) and linguistic features (such as total
number of grammatical errors), mainly because the texts are very short (about
150 words) and written by non-advanced learners of English (Dutch high school
students).

3. Research context

In order to train a machine learning system, a corpus of holistically scored essays
needs to be collected, so that it can be used as training data for the system. Since
we are attempting to detect factors which are influential with respect to the holistically
assigned grade, the training essays need to be annotated, meaning that we
need to have a specific number of features that we look at in each essay and then
record the value for each of those features (such as grammar mistakes, number of
words, percentage of verbs in the present tense, etc.).
The corpus we have used in our research comes from the OTTO project, which
was meant to measure the effect of bilingual education in the Netherlands (Verspoor
et al., 2010) and is the same as used by Verspoor et al. (to appear). To
control for scholastic aptitude and L1 background, only Dutch students from
VWO schools (Voorbereidend Wetenschappelijk Onderwijs, or Preparatory Scientific
Education, a high academic Middle School program in the Netherlands)
were chosen as subjects. In total, there were 481 students from 6 different VWO
schools in their 1st (12 to 13 years old) or 3rd year (14 to 15 years old) of secondary
education. To allow for a range of proficiency levels, the students were
enrolled in either a regular program with 2 or 3 hours of English instruction per
week or in a semi-immersion program with 15 hours of instruction in English per
week. The 1st year students were asked to write about their new school and the
3rd year students were asked to write about their previous vacation. The word
limit for the essays was approximately 200 words. The writing samples were
assessed on general language proficiency and human raters gave each essay a
holistic proficiency score between 0 and 5, with 0 indicating the level assigned to
those essays with the most basic language complexity and 5 indicating those with
the most complex language, out of the essays analyzed. As Burstein & Chodorow
(2010) put it, “for holistic scoring, a reader (human or computer) assigns a single
numerical score to the quality of writing in an essay” (p.529). In order to ensure
a high level of inter-rater reliability, the entire scoring procedure was carefully
controlled. There were 8 scorers, all of whom were experienced ESL teachers
(with 3 of them being native speakers of English). After long and detailed discus-
sions, followed by tentative scoring of a subset containing 100 essays, assessment
criteria were established for the subsequent scoring of essays. Two groups of 4
ESL raters were formed and each essay was scored by one of the groups, with the
score of the majority (3 out of 4) being taken to be the final score of the essay.
If a majority vote could not be reached and subsequent discussion between the
members of that group did not solve the issue, then the members of the other
group were consulted in order to settle on the final holistic score for the essay. In
all, 481 essays were scored. As we will see further ahead, the size of this set is
good enough for training a scoring system and some of the more established essay
scoring systems available actually use a smaller set than we do in our work. In
Figure 2, we can find the distribution of the levels among the essays we have used.
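The majority rule used to settle each essay's holistic score, as described above, can be sketched as follows. The function name final_score is hypothetical, and returning None to signal "no majority" is an assumption made for the illustration; in the study, such cases were resolved by discussion and, if necessary, by consulting the other group of raters.

```python
from collections import Counter

def final_score(group_scores, majority=3):
    """Return the score given by at least `majority` raters, or None if there is none."""
    score, votes = Counter(group_scores).most_common(1)[0]
    return score if votes >= majority else None

# Hypothetical scores from one group of four raters.
print(final_score([3, 3, 3, 4]))  # three raters agree, so their score is final
print(final_score([2, 3, 3, 4]))  # no majority: the essay goes back for discussion
```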
Verspoor et al. (to appear) coded each writing sample for features (variables)
drawn both from the Applied Linguistics literature and from their own obser-
vations during the scoring of the essays. The features cover several levels of
linguistic analysis, such as syntactic, lexical, mechanical, and others. Some of
the features, such as range of vocabulary, sentence length, accuracy (no errors),
type-token ratio (TTR), chunks, and number of dependent clauses, for example,
are established features in the literature and have been used in several studies
to measure the complexity of a written sample. Other features, such as specific
types of errors and frequency bands for the word types were chosen in order to
obtain a more fine-grained analysis of language.
In our study, we first investigated which machine learning algorithms found
in WEKA, when trained on this corpus of variables, can achieve results which
would allow them to be used in a future automated essay scoring system for a
low-stakes test. In order to arrive at the optimal system, we experimented with
decreasing the number of features available to the algorithms (through feature
selection) and also discretizing the values (that is, using interval ranges instead of
raw values for the features).

Figure 2. Distribution of the levels (0–5) in our data

Finally, we wanted to know how the results of the best trained system out of those
analyzed might compare to the results observed when human raters score the
essays and how our results might reflect common practices in second language

4. Methodology

In machine learning, a common method of assessing the classification performance
of a system is called ten-fold cross validation. This method basically
involves dividing the available data set into ten parts, with nine tenths serving as
the training set and the remaining tenth serving as the test set. This process is done 10
times, each time with a different tenth held out as the test set (Jurafsky & Martin,
2009). Throughout our study, we have made use of the
ten-fold cross validation method in order to assess the quality of the classification
models. We have experimented with 2 different scenarios:
Scenario 1: All 81 features and their respective values are made available to
the algorithms. We analyze their results in terms of absolute accuracy (assigning
to an essay exactly the level it has been assigned by the human raters) and adja-
cent agreement (giving some credit for adjacent classifications as well).
Adjacent agreement is a looser measure of success when predicting classifi-
cations such as grades, where the classes are ordered. If we are dealing with a
scale of 0–5 in terms of proficiency level, classifying a level 4 essay as level 5 or
level 3 is certainly more desirable than classifying this same level 4 essay as level
1, for example. Therefore, we do not take only absolute accuracy into account.
Human raters themselves also quite often disagree on the exact level of a given
essay, but tend to assign adjacent levels to the same essay. This is the desirable
situation when raters have been as well trained as ours.
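The two measures just described can be sketched as follows. The function names and the five example essays are invented, and counting a prediction as a hit whenever it lies within one level of the human score is an assumption about how "some credit" for adjacent classifications is given.

```python
def exact_accuracy(true_levels, predicted_levels):
    """Proportion of essays assigned exactly the level given by the human raters."""
    hits = sum(t == p for t, p in zip(true_levels, predicted_levels))
    return hits / len(true_levels)

def adjacent_accuracy(true_levels, predicted_levels, tolerance=1):
    """Proportion of essays whose predicted level is within `tolerance` of the human level."""
    hits = sum(abs(t - p) <= tolerance for t, p in zip(true_levels, predicted_levels))
    return hits / len(true_levels)

# Hypothetical human levels and classifier predictions on the 0-5 scale.
human = [4, 4, 1, 0, 5]
predicted = [4, 5, 3, 0, 3]
print(exact_accuracy(human, predicted))
print(adjacent_accuracy(human, predicted))
```

By construction the adjacent measure can never be lower than the exact one, which is why the two are best reported side by side.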
Scenario 2: We perform feature selection in order to find a subset of features
that correlate highly with proficiency level and also submit the values of those
features to discretization, before training three of our main systems. It is a known
fact that obtaining comparable results by using fewer features is a gain in knowledge,
given that it makes the model simpler, more elegant and easier to implement.
Using every feature in order to build a classifier might also be seen as
overkill. The question is simple: if we can achieve the same (or possibly even
higher) accuracy in a system by using fewer features, why should we use all of
them? It takes processing power and engineering/programming work in order for
an automatic system to extract the values for each feature and if many of the fea-
tures do not lead to an improvement in classification accuracy, it does not make
much sense to insist on using them if our sole task is classification. In addition,
by using too many features we might be missing some interesting patterns in our
data. By discretizing numerical data we are able to build models faster, since
numerical values do not have to be sorted over and over again, thus improving
performance time of the system. On the other hand, discretizing values leads to a
less fine-grained and transparent analysis, since we group together a continuum
of values that might have individual significance for classification.
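
As a rough illustration of the two pre-processing steps in Scenario 2, the Python sketch below discretizes a numerical feature and then scores it by information gain, the quantity on which an information-gain ranking is based. Equal-width binning is an assumption made here for simplicity (the discretization method actually used is not specified above), and the feature values and levels are hypothetical toy data.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int = 3) -> list[int]:
    """Map each value to a bin index 0..n_bins-1 (equal-width, an assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:          # all values identical: everything falls in bin 0
        return [0 for _ in values]
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_bins: Sequence[int], labels: Sequence[Hashable]) -> float:
    """H(class) minus H(class | discretized feature)."""
    n = len(labels)
    by_bin: dict[int, list[Hashable]] = {}
    for b, y in zip(feature_bins, labels):
        by_bin.setdefault(b, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in by_bin.values())
    return entropy(labels) - conditional


# Hypothetical toy data: six essays, their levels and two candidate features.
levels = [0, 0, 2, 2, 4, 4]
n_types = [20, 25, 60, 65, 110, 120]   # separates the levels cleanly
noise = [1, 9, 2, 8, 1, 9]             # unrelated to the levels

for name, feature in [("n_types", n_types), ("noise", noise)]:
    gain = information_gain(equal_width_bins(feature), levels)
    print(name, round(gain, 3))
```

Ranking all features by this score and keeping the top of the list corresponds to the idea of feature selection described above; the sketch also shows the cost of discretization, since values such as 20 and 25 become indistinguishable once they share a bin.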
Finally, once we had analyzed our 2 scenarios and arrived at the best
classification model among those we experimented with, we looked at how far the
model might be said to meet the gold standard, that is, to show results which
are similar to those recorded when human raters grade the essays. For this, we
needed not only classification accuracies and adjacent agreement, but also a
new experiment in which we compare the results of the system and of the
original human raters with those of a second and independent group of trained
raters. In the next section, we report on the results of our experiments.
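
The ten-fold cross validation procedure used throughout this methodology can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name is hypothetical, the folds are formed by simple shuffling, and no attempt is made to keep the distribution of levels equal across folds.

```python
import random
from typing import Iterator


def k_fold_indices(n_items: int, k: int = 10,
                   seed: int = 0) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 481 is the number of essays in the corpus; each essay is tested exactly once.
for fold_number, (train, test) in enumerate(k_fold_indices(481), start=1):
    print(fold_number, len(train), len(test))
```

A classifier would be trained on the `train` indices and evaluated on the `test` indices of each fold, and the ten accuracies would then be averaged.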

5. Results

5.1 Scenario 1

The accuracy of the 11 classifiers used in Scenario 1 (before feature selection
and value discretization) is shown in Table 1 below. We would like to draw the
reader’s attention to the fact that the baseline classification accuracy for
our data would be 27 %, which is the result of dividing the number of essays
belonging to the most common level (level 1 = 131 essays) by the total number
of essays in our corpus (481 essays).
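
The baseline figure can be checked with a short Python sketch. The helper function is hypothetical and only illustrates the idea of always predicting the most common level; the two counts (131 and 481) are the ones given in the text.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_baseline(labels: Sequence[Hashable]) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Counts reported in the text: 131 level-1 essays out of 481 essays in total.
print(f"{131 / 481:.1%}")  # 27.2%

# Hypothetical toy labels, only to show how the helper is called.
print(majority_baseline([1, 1, 1, 0, 2]))  # 0.6
```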

Table 1. Accuracies (percentage of correct classification) of the 11 different classifiers, before feature selection and discretization

Classifier         Ten-fold cross validation results    Weighted scores
                   (absolute accuracy in percentage)    (Correct = 3, Adjacent = 1, Incorrect = 0)
LMT                58.09                                1013
Functional Tree    56.07                                980
Random Forest      53.97                                1001
LAD Tree           53.49                                973
Naïve Bayes        52.50                                962
Simple Cart        52.10                                949
Rep Tree           51.36                                948
C4.5 (J48)         50.53                                843
BF Tree            49.90                                908
NB Tree            45.70                                892
Decision Stump     40.73                                762

As shown in Table 1, the Logistic Model Tree (LMT) is the machine learning
algorithm that builds the best classification model in Scenario 1, both when
only absolute accuracy is taken into account and when adjacent classifications
are credited as well.

5.2 Scenario 2

As we have just seen, LMT is the classifier that performs the best for our task
when all 81 features are made available to the classifiers, both in terms of absolute
accuracy and adjacent agreement. We now need to know whether doing feature
selection and data discretization increases the accuracy of our systems and, if so,
which of the classifiers performs the best.

By performing feature selection on our data, we arrive at a subset of 8
features (to be discussed in more detail in Section 6 of this chapter), which
together provide the optimal classification accuracy for the systems. The
removal of any of these features from the subset leads to a decrease in
classification accuracy. Table 2 below lists those 8 features in descending
order of how much they correlate with proficiency level, with feature 1 being
the feature that correlates the highest with proficiency level.

Table 2. Feature selection based on the Infogain + Ranker method in WEKA

Rank of feature    Feature
1                  Number of lexical types
2                  Number of correct chunks
3                  Number of correct + incorrect chunks
4                  Percentage of no dependent clauses
5                  Percentage of verbs in Present Tense
6                  Percentage of errors in verb form
7                  Percentage of lexical errors
8                  Total number of errors

We have selected three of our classifiers in order to see the effect of feature
selection and data discretization on their accuracy: LMT (our best classifier
so far), Naïve Bayes (the only Bayesian classifier we have experimented with;
see footnote 91) and C4.5 (arguably the most common benchmark in machine
learning). Figure 3 below shows the results of feature selection and data
discretization on these 3 classifiers.

Figure 3. C4.5, LMT and NB accuracies after pre-processing of data

Classifier    Previous accuracy      Accuracy           Accuracy           Accuracy            Accuracy
              (no pre-processing)    (discretization    (attribute         (attr. sel. +       (discr. +
                                     only)              selection only)    discr.)             attr. sel.)
C4.5          50.53 %                55.23 %            52.93 %            58.70 %             59.53 %
LMT           58.09 %                62.29 %            60.67 %            62.58 %             62.27 %
Naïve B.      52.50 %                60.73 %            55.16 %            59.09 %             60.82 %

91 All other 10 classifiers are Decision Tree classifiers.

As we can see in Figure 3, all 3 classifiers benefit from feature selection and
discretization. However, the best result is achieved by LMT when attribute
selection is performed first, followed by the discretization of the values of
the 8 features/attributes in the subset. Therefore, the best absolute accuracy
we have managed to achieve for our essay classification task is 62.58 %.
In terms of adjacent classification, the optimal version of LMT (just discussed)
achieves good results. The adjacent agreement of LMT (classifying an essay as
either its original level or an adjacent one) is presented in Figure 4:

Figure 4. Adjacent agreement of LMT per level

Level 0    Level 1    Level 2    Level 3    Level 4    Level 5
100 %      98 %       96 %       94 %       98 %       94 %

As can be seen, whenever LMT does not assign the exact correct level to an
essay, it assigns an adjacent level in the great majority of cases, which is
exactly what one wants for our low-stakes AES task.
Finally, to see how LMT compares to human raters, we randomly selected 30
essays out of our 481 essays and asked a second and independent group of
trained graders to score them. From Figure 5 we may conclude that LMT achieves
a high correlation (see footnote 92) with the second group, which is quite
similar to the one observed for the original group of scorers.

Figure 5. Correlation coefficients in 2 conditions

                             Human Raters group 2
Human Raters group 1         0.84
Logistic Model Tree (LMT)    0.87

Considering the results shown in this section, we can say that LMT has the poten-
tial to be a high-performing essay scoring model once the 8 features used can be
automatically extracted from essays.
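
For readers who wish to reproduce this kind of comparison, the Pearson correlation between two sets of scores can be computed as in the Python sketch below. The function is a plain implementation of the standard formula, and the two score lists are hypothetical toy data rather than scores from the study.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long score lists.

    Raises ZeroDivisionError if either list has no variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical toy scores on the 0-5 scale (not data from the study).
second_rater_group = [0, 1, 2, 3, 4, 5]
system_levels = [0, 1, 2, 4, 3, 5]

print(round(pearson(second_rater_group, system_levels), 2))
```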

92 We have used Pearson correlation in our calculation.

6. Discussion

Logistic Model Tree (LMT), when trained on 8 discretized features extracted
from hundreds of essays, achieves results similar to those seen in human
scorers. However, we cannot call LMT an AES system, since at this point it is
not embedded in a system that extracts the 8 features and their corresponding
values.
The features identified show overlap to some extent with what the previous
study by Verspoor et al. (to appear) found using traditional statistical analyses,
but some are surprising because they have thus far not been commonly associated
with proficiency levels and were, therefore, not even considered in the previous
study. For example, the number of lexical types per text was not considered rel-
evant in the previous study as the field of Applied Linguistics usually works with
type-token ratios. Types refers to the raw number of types found in the text, not
adjusted for text length. Chunks (also known as formulaic sequences) are target-
like combinations of two or more words such as compounds, phrasal verbs, prep-
ositional phrases and so on. The field of applied linguistics has long recognized a
link between chunks and proficiency, but the previous study by Verspoor et al. (to
appear) and this one are the first to show such a direct connection between chunks
and proficiency level. The fact that incorrect chunks, which represent non-target
attempts at formulaic language, are a strong predictor is unexpected and was,
therefore, not considered to be relevant in the previous study. The remaining
features were also identified in the previous study, but not in as much detail
as in the present one. The percentage of “no dependent clauses” refers to the
relative number of simple sentences in a text, a commonly found indicator of
proficiency level. The percentage of present tense refers to the use of a
simple present tense as opposed to the use of a past tense or of a modal,
passive, progressive or perfect. The present tense is commonly recognized as
the tense that beginners will use, but the fact that it is such a strong
predictor is interesting. The last three features are not completely
surprising, as the number of errors has often been linked to proficiency level.
What is unexpected is that two specific types of errors (verb form errors and
lexical errors) as well as the total number of errors play such a strong role.
Now the question is to what extent these 8 features that correlate the most
with proficiency level lend themselves to automation. Four of the features
should pose no major problem and can be automated fairly easily: number of
types, percentage of no dependent clauses, percentage of verbs in the present
tense and percentage of errors in verb form. The other four features are much
more difficult to implement, given their intrinsic complexity: number of
correct chunks, number of correct and incorrect chunks, percentage of lexical
errors and total number of errors.
A few lines of code in any of the major programming languages for text analysis
(such as Python) can extract the number of types in an essay. The amount of
subordination has for a long time been used in the SLA literature to represent
the syntactic complexity of texts (Michel et al., 2007). There are already
systems available that are able to identify the number of clauses and dependent
clauses in a sentence. One such system is the L2 Syntactic Complexity Analyzer
developed by Lu (2010). The percentage of verbs in the present tense can be
extracted by running each essay through a parser and a morphological analyzer
(both are computational linguistics tools). Finally, the percentage of errors
in verb forms can be identified in part by running the essays through a parser
and checking, for each verb, whether the verb form can be found in a
pre-determined database of existing verb forms in English, for example. Errors
due to inappropriate usage, however, are more difficult to detect
automatically.
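
To illustrate how little code the first of these features requires, the Python sketch below counts the number of types in a text and, for contrast, computes the type-token ratio. It assumes that a type is simply a distinct lower-cased word form found by a simple regular expression; whether the study lemmatized words or excluded function words is not specified, and the example sentence is hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lower-case the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def number_of_types(text: str) -> int:
    """Raw number of distinct word forms, not adjusted for text length."""
    return len(set(tokenize(text)))


def type_token_ratio(text: str) -> float:
    """Number of types divided by number of tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens)


# Hypothetical example sentence pair (not taken from the corpus).
essay = "The cat sat on the mat. The dog sat too."
print(number_of_types(essay))   # 7
print(type_token_ratio(essay))  # 0.7
```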
Automatically extracting the other four features is a much more difficult task,
especially because each of them contains various subtypes. In the number of
correct chunks feature, for example, we find collocations, phrasal verbs,
verb-preposition combinations (such as “depend on”), etc. Evert (2009), in an
article entitled “Corpora and Collocations”, summarizes a number of statistical
methods that can be used for extracting collocations (see footnote 93).
However, the author does not put them to the test, so we cannot be sure how
accurate and appropriate they would be. We believe, however, that an n-gram
based approach, in which we calculate the probability of the unigrams, bigrams,
trigrams, etc. found in the essay based on a corpus of native English, might be
an alternative approach to determining unique word combinations.
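
A minimal Python sketch of the bigram case of such an n-gram approach is given below. It is an illustration under stated assumptions only: the three reference sentences stand in for a large corpus of native English, whitespace tokenization is used, and the probability is the simple maximum-likelihood estimate count(w1 w2) / count(w1) without smoothing. All names are hypothetical.

```python
from collections import Counter


def train_bigram_model(reference_sentences: list[str]) -> tuple[Counter, Counter]:
    """Count unigrams and bigrams in a reference corpus of sentences."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in reference_sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(w1: str, w2: str, unigrams: Counter, bigrams: Counter) -> float:
    """Estimate P(w2 | w1); returns 0.0 when w1 never occurs in the corpus."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]


def unattested_bigrams(essay: str, unigrams: Counter,
                       bigrams: Counter) -> list[tuple[str, str]]:
    """Return the essay's bigrams that have zero probability in the corpus."""
    tokens = essay.lower().split()
    return [(a, b) for a, b in zip(tokens, tokens[1:])
            if bigram_probability(a, b, unigrams, bigrams) == 0.0]


# Toy stand-in for a native English corpus (hypothetical sentences).
reference = ["it depends on the weather", "they depend on him", "we depend on you"]
unigrams, bigrams = train_bigram_model(reference)

print(bigram_probability("depend", "on", unigrams, bigrams))
print(bigram_probability("depend", "of", unigrams, bigrams))
print(unattested_bigrams("we depend of you", unigrams, bigrams))
```

With a realistically large reference corpus, word combinations with very low probability would be candidates for incorrect chunks, although deciding on a suitable threshold and handling unseen but correct combinations would require smoothing and further testing.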
The LMT model we have trained does have limitations, naturally. Firstly, since
it only deals with surface features (no analysis of meaning/semantics is
carried out), it is not appropriate for higher-stakes testing. Secondly, since
the model has been trained on various features which might be typical of Dutch
learners of English (such as some types of lexical and syntactic errors), the
system might perform differently on essays written by learners with other L1s.
Only future research will be able to show how the system would fare in such
cases. Finally, the essay prompts (the essay topics) might have had an effect
on the features that were seen to correlate the most with proficiency level.

93 Some of the methods include chi-squared, mutual information and Z scores.

7. Conclusion

In the previous sections of this chapter, we have seen that machine learning
techniques and algorithms can be of great use in identifying features that may
lead to automated essay scoring. Machine learning can not only help to identify
a subset of features that correlate the most with the construct at hand (that
is, the target class or proficiency level in our case), and therefore enhance
the power and simplicity of the system by using only those features in
classification, but it can also find patterns in the data, which can
subsequently be used to classify new samples. Different machine learning
algorithms make use of different strategies in order to arrive at their optimal
classification, and they show different classification accuracies. The Logistic
Model Tree (LMT) algorithm, which employs logistic regression to arrive at the
optimal classification function for each possible class (each of the
proficiency levels the essays could belong to), manages to meet the gold
standard by achieving a classification accuracy similar to that observed in
human scorers, both in terms of exact and adjacent classification. In addition,
the correlation coefficient observed between LMT and a group of human scorers
(0.87) is comparable to the one recorded between 2 groups of trained human
scorers (0.84). The 8 features found are interesting in themselves, as they are
not necessarily the ones that are commonly recognized in the Applied
Linguistics literature, and they will contribute to insights into second
language development. However, the 8 features found do not all easily lend
themselves to automatic scoring. We hope, though, that once the subset of 8
features that have been used to build the LMT classification model can be
extracted automatically, or alternative features have been found, LMT will have
the potential to be used for the task of essay scoring, optimizing the scoring
time, increasing the fairness of the scoring process and decreasing the need
for human labor.


References

Burstein, J. & Chodorow, M. (2010). Progress and New Directions in Technology
for Automated Essay Evaluation. In R. B. Kaplan (Ed.), Oxford Handbook of
Applied Linguistics (pp. 529–538). Oxford: Oxford University Press.
Educational Testing Service (ETS). (2011). How the e-rater Engine Works.
Retrieved September 28, 2011, from
Evert, S. (2009). Corpora and Collocations. In A. Lüdeling & M. Kytö (Eds.),
Corpus Linguistics. An International Handbook, vol. 2 (pp. 1212–1248).
Berlin: Mouton de Gruyter.
Jurafsky, D. & Martin, J. H. (2009). Speech and Language Processing: An
Introduction to Natural Language Processing, Speech Recognition, and
Computational Linguistics (2nd ed.). Prentice-Hall.
Landauer, T. K. & Dumais, S. T. (1997). A solution to Plato’s problem: The Latent
Semantic Analysis theory of the acquisition, induction, and representation of
knowledge. Psychological Review, 104, 211–240.
Landwehr, N., Hall, M. & Frank, E. (2005). Logistic model trees. Machine
Learning, 59(1–2), 161–205.
Lu, X. (2010). Automatic analysis of syntactic complexity in second language
writing. International Journal of Corpus Linguistics, 15(4), 474–496.
Michel, M. C., Kuiken, F. & Vedder, I. (2007). The influence of complexity in
monologic versus dialogic tasks in Dutch L2. International Review of Applied
Linguistics in Language Teaching, 45, 241–259.
Page, E. B. (1966). Grading essays by computer: Progress report. Notes from the
1966 Invitational Conference on Testing Problems, 87–100.
Page, E. B. (1994). Computer Grading of Student Prose, Using Modern Concepts
and Software. Journal of Experimental Education, 62, 127–142.
Pearson Knowledge Technologies (PKT). (2011). Retrieved September 28, 2011,
Rudner, L. & Gagne, P. (2001). An overview of three approaches to scoring
written essays by computer (ERIC Digest number ED 458 290). Retrieved
September 28, 2011, from
Valenti, S., Neri, F. & Cucchiarelli, A. (2003). An overview of current research
on automated essay grading. Journal of Information Technology Education,
2, 319–330.
Verspoor, M. H., Schuitemaker-King, J., Van Rein, E. M. J., De Bot, K. &
Edelenbos, P. (2010). Tweetalig onderwijs: vormgeving en prestaties.
Onderzoeksrapportage. Retrieved September 28, 2011, from
Verspoor, M., Schmid, M. S. & Xu, X. (to appear). A dynamic usage based
perspective on L2 writing development. Journal of Second Language Writing.
