
Educational Linguistics

Martin East

Assessing Foreign Language Students’ Spoken Proficiency
Stakeholder Perspectives on Assessment Innovation
Educational Linguistics

Volume 26

Series Editor
Francis M. Hult, Lund University, Sweden

Editorial Board
Marilda C. Cavalcanti, Universidade Estadual de Campinas, Brazil
Jasone Cenoz, University of the Basque Country, Spain
Angela Creese, University of Birmingham, United Kingdom
Ingrid Gogolin, Universität Hamburg, Germany
Christine Hélot, Université de Strasbourg, France
Hilary Janks, University of Witwatersrand, South Africa
Claire Kramsch, University of California, Berkeley, U.S.A
Constant Leung, King’s College London, United Kingdom
Angel Lin, University of Hong Kong, Hong Kong
Alastair Pennycook, University of Technology, Sydney, Australia
Educational Linguistics is dedicated to innovative studies of language use and
language learning. The series is based on the idea that there is a need for studies that
break barriers. Accordingly, it provides a space for research that crosses traditional
disciplinary, theoretical, and/or methodological boundaries in ways that advance
knowledge about language (in) education. The series focuses on critical and
contextualized work that offers alternatives to current approaches as well as
practical, substantive ways forward. Contributions explore the dynamic and multi-
layered nature of theory-practice relationships, creative applications of linguistic
and symbolic resources, individual and societal considerations, and diverse social
spaces related to language learning.
The series publishes in-depth studies of educational innovation in contexts
throughout the world: issues of linguistic equity and diversity; educational language
policy; revalorization of indigenous languages; socially responsible (additional)
language teaching; language assessment; first- and additional language literacy;
language teacher education; language development and socialization in
nontraditional settings; the integration of language across academic subjects;
language and technology; and other relevant topics.
The Educational Linguistics series invites authors to contact the general editor
with suggestions and/or proposals for new monographs or edited volumes. For more
information, please contact the publishing editor: Jolanda Voogd, Associate
Publishing Editor, Springer, Van Godewijckstraat 30, 3300 AA Dordrecht, The
Netherlands.

More information about this series at http://www.springer.com/series/5894


Martin East

Assessing Foreign Language Students’ Spoken Proficiency
Stakeholder Perspectives on Assessment Innovation
Martin East
Faculty of Education and Social Work
The University of Auckland
Auckland, New Zealand

ISSN 1572-0292 ISSN 2215-1656 (electronic)


Educational Linguistics
ISBN 978-981-10-0301-1 ISBN 978-981-10-0303-5 (eBook)
DOI 10.1007/978-981-10-0303-5

Library of Congress Control Number: 2015960962

Springer Singapore Heidelberg New York Dordrecht London


© Springer Science+Business Media Singapore 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, express or implied, with respect to the material contained herein or for any errors
or omissions that may have been made.

Printed on acid-free paper

Springer Science+Business Media Singapore Pte Ltd. is part of Springer Science+Business Media
(www.springer.com)
Foreword

The ability to use a language for spoken communication is one of the main reasons
many people study a foreign language, and learners often evaluate their success in
language learning, as well as the effectiveness of their language course, on the basis
of how well they feel they have improved in their spoken-language proficiency.
When a foreign language is an assessed school subject, the procedures used to arrive
at a valid account of learners’ ability to speak are of crucial importance to schools,
teachers and learners. However, the assessment of spoken-language proficiency, or
‘speaking skills’, has been somewhat problematic in the history of language teach-
ing. On the one hand, the construct of spoken-language proficiency itself has some-
times been inadequately theorised. Whereas ‘speaking skills’ covers a wide range of
different modes of discourse, including small talk, casual conversations, telephone
conversations, transactions, discussions, interviews, meetings, presentations and
debates, conventional approaches to oral proficiency assessment have often assumed that performance on an oral interview represents general spoken-language ability rather than simply ability related to one mode of discourse. In real-
ity, each genre has distinct features and characteristics and each poses quite different
issues for teaching, learning and assessment.
In addition, the practical logistics of conducting oral assessments make assess-
ment of authentic language use difficult. Unlike the assessment of other language
skills, speaking ability can only be observed in face-to-face interaction. However,
the kind of interaction that results may be influenced by many factors, such as those
related to the context, the interlocutors, the setting and the task. A proficiency test
that is part of a high-stakes school-based assessment is very different from language
used outside of the classroom in a naturalistic context for communication. It is chal-
lenging for proficiency tests not only to replicate but also to capture authentic lan-
guage use.
A further issue results from the fact that changes in assessment practices – such
as in the New Zealand example articulated in this book, that is, a move from assess-
ment of learning towards assessment for learning – may not align with learners’


expectations about the nature and function of assessment. If teachers move away
from summative tests and adopt continuous assessment based on an assessment for
learning approach, there may be resistance from learners, since what counts for
them may be their end-of-course grades rather than the teacher’s well-intentioned
philosophy of assessment.
Against this background, teachers, curriculum developers and language teaching
professionals will find the current book a unique and invaluable account of how
issues such as those mentioned above were addressed in the context of a large-scale
curriculum innovation in language teaching in New Zealand. The context is the
teaching of foreign languages in New Zealand secondary schools and the introduc-
tion of a new curriculum that seeks to achieve a learner-based rather than teacher-
dominated approach to learning. In applying this philosophy to foreign language
teaching, an attempt was made to develop a new approach to oral proficiency assess-
ment. A one-off summative assessment based on students’ performances on an end-
of-year interview has been replaced with a series of peer-to-peer interactions that
seek to provide learners with opportunities to show what they can do with a lan-
guage (rather than what they know about it) and how effectively they can use it for
authentic communication. This book results from a 2-year study of how the new
assessment approach worked in practice, as experienced by both teachers and
students.
In this book Martin East provides a fascinating account of how the new assess-
ment approach was introduced; how it differs from previous approaches to assess-
ment; the impact of the approach on teachers, teaching and learners; and the issues
it created for teachers, learners and schools. The importance of this book derives
from a number of features:
• It is a model case study of curriculum innovation in foreign language
education.
• It provides an account of an approach to validity that contrasts a standard psy-
chometric approach derived from performance scores with one that reflects
stakeholders’ views of the innovation.
• It reflects an approach in which assessment is designed to be an integral part of
the teaching and learning programme and that makes use of activities that are
typical in foreign language classrooms.
• Rather than employing a one-off interview, it makes use of a series of tasks to
capture the dynamic and interactive nature of spoken interaction.
• It makes use of qualitative methods to capture the subjective experiences of
teachers and students.
• It provides a detailed account of both the perceived benefits and the perceived
disadvantages of the innovation.
• It raises issues that are of much wider significance than the specific context
(New Zealand students learning foreign languages) in which they were studied.

Assessing Foreign Language Students’ Spoken Proficiency makes an outstanding and original contribution to the field of second and foreign language teaching, providing a theory- and research-based account of the development of a learner-centred
approach to oral proficiency assessment. It is an important resource for teachers and
teacher educators as well as assessment and curriculum specialists worldwide. It
deserves to be widely read.

January 2016 Jack C. Richards


Acknowledgments

This book represents the culmination of several years of research work which I
would not have been able to achieve without a range of help and support. First and
foremost, I would like to acknowledge the funding I received from the University of
Auckland in the form of a Research Excellence Award (ECREA 3701329). This
funding provided the opportunity both to administer the national teacher survey and
to travel across the country to carry out interviews with teachers. The award also
funded transcription and data analysis costs, as well as opportunities to present
emerging findings from the research in a range of international fora.
Additionally, I was able to engage the services of two summer scholars who
worked with me at different stages of the research process. I acknowledge Joseph
Poole, who undertook some of the transcribing and data entry, alongside initial cod-
ing of the teacher and student surveys, and Abby Metson, who undertook initial
coding of themes emerging from the interviews.
I thank my wife, Julie, whose initial conversations with me helped to shape the
scoping of the project and whose statistical expertise contributed to the quantitative
analyses I report.
I am grateful to Professor Jack C. Richards, an internationally recognised author-
ity in second and foreign language teaching and learning, for his interest in my work
and his willingness to write the foreword.
My thanks to Nick Melchior, Senior Editor (Education) for Australia/New
Zealand at Springer, for his enthusiasm to see this research published as part of
Springer’s Educational Linguistics series.
Bearing in mind the duration of this project, it is inevitable that emerging find-
ings have been published in other fora. Two articles have published aspects of the
data I report in Chaps. 5 and 6:
East, M. (2014). Working for positive outcomes? The standards-curriculum align-
ment for Learning Languages and its reception by teachers. Assessment Matters,
6, 65–85.


East, M. (2015). Coming to terms with innovative high-stakes assessment practice: Teachers’ viewpoints on assessment reform. Language Testing, 32(1), 101–120.
http://dx.doi.org/10.1177/0265532214544393
Additionally, two articles that informed the background material I report in
Chap. 3 are:
East, M., & Scott, A. (2011a). Assessing the foreign language proficiency of high
school students in New Zealand: From the traditional to the innovative. Language
Assessment Quarterly, 8(2), 179–189. http://dx.doi.org/10.1080/15434303.2010.538779
East, M., & Scott, A. (2011b). Working for positive washback: The standards-
curriculum alignment project for Learning Languages. Assessment Matters, 3,
93–115.
I would like to thank Adèle Scott for her support in verifying the accuracy of the
historical account of events I present in Chap. 3.
Last, and certainly by no means least, I thank sincerely all the participants in this
project who completed a survey, or who asked their students to complete a survey,
or who participated in an interview with me. Without the help and support of those
who are willing to give their time when invited to do so, research projects such as
the one reported here cannot come to fruition. The willingness of participants
enables their voices to be represented.
Contents

1 Mediating Assessment Innovation: Why Stakeholder
Perspectives Matter................................................................................... 1
1.1 Introduction ........................................................................................ 1
1.2 Background: The Importance of Interaction
in Foreign Languages Pedagogy ........................................................ 3
1.2.1 Communicative Language Teaching ...................................... 3
1.2.2 Communicative Language Testing ......................................... 5
1.3 Curriculum and Assessment Reforms in New Zealand...................... 6
1.3.1 Overview................................................................................. 6
1.3.2 Implementing Assessment Reform: A Risky Business........... 7
1.4 Assessment Validation........................................................................ 8
1.4.1 Fundamental Considerations .................................................. 8
1.4.2 The Contribution of Assessment Score Evidence
to a Validity Argument ............................................................ 10
1.4.3 The Limitations of Assessment Score
Evidence to a Validity Argument ............................................ 12
1.4.4 Towards a Broader Understanding
of Assessment Validation ........................................................ 13
1.4.5 A Qualitative Perspective on Assessment Validation ............. 16
1.5 The Structure of This Book ................................................................ 18
1.6 Conclusion.......................................................................................... 19
References ................................................................................................... 21
2 Assessing Spoken Proficiency: What Are the Issues? ............................ 25
2.1 Introduction ........................................................................................ 25
2.2 What Does It Mean to Communicate Proficiently? ........................... 26
2.2.1 Communicative Competence as the Underlying
Theoretical Framework........................................................... 26
2.2.2 Developing the Framework of Communicative
Competence ............................................................................ 27


2.3 Static or Dynamic............................................................................... 31


2.3.1 The Static Assessment Paradigm ............................................ 32
2.3.2 The Dynamic Assessment Paradigm ...................................... 33
2.3.3 Static or Dynamic – A Complex Relationship........................ 35
2.4 Task-Based or Construct Based.......................................................... 37
2.4.1 The Centrality of the Task ...................................................... 37
2.4.2 The Importance of the Construct ............................................ 39
2.5 Single or Paired Performances ........................................................... 40
2.5.1 Single Performance Assessments ........................................... 41
2.5.2 Paired/Group Performance Assessments ................................ 42
2.6 Conclusion.......................................................................................... 45
References ................................................................................................... 46
3 Introducing a New Assessment of Spoken Proficiency: Interact ........... 51
3.1 Introduction ........................................................................................ 51
3.2 The New Zealand Landscape for
Assessment – A Shifting Environment............................................... 52
3.2.1 The 1990s: A Mismatch Between Curricular
Aims and High-Stakes Assessment ........................................ 53
3.2.2 The NCEA System: The Beginnings of Reform .................... 56
3.2.3 The Impact of Assessment Mismatch on FL Programmes ..... 57
3.2.4 The NCEA for Languages – 2002–2010 ................................ 59
3.3 Towards a Learner-Centred Model for High-Stakes Assessment....... 61
3.3.1 2007: The Advent of a New Curriculum ................................ 61
3.3.2 NCEA Mark II ........................................................................ 63
3.4 Revising the Assessments for Languages........................................... 64
3.4.1 2008: The First SCALEs Meeting .......................................... 64
3.4.2 2009: The Second SCALEs Meeting...................................... 66
3.4.3 2010: A Further Opportunity to Confirm
the New Assessments.............................................................. 67
3.4.4 2011 Onwards: Support for the Implementation
of Interact ............................................................................... 69
3.5 Conclusion.......................................................................................... 72
References ................................................................................................... 73
4 Investigating Stakeholder Perspectives on Interact................................ 77
4.1 Introduction ........................................................................................ 77
4.2 Bachman and Palmer’s Test Usefulness Framework.......................... 78
4.2.1 Construct Validity and Reliability .......................................... 79
4.2.2 Interactiveness, Impact, Practicality and Authenticity ........... 81
4.3 2011 Onwards: Interact in Practice.................................................... 84
4.4 The Theoretical Usefulness of Interact .............................................. 87
4.5 A Study into Teachers’ and Students’ Views ...................................... 88
4.6 Study Stage I ...................................................................................... 90
4.6.1 Nationwide Teacher Survey .................................................... 90
4.6.2 Piloting the Teacher Survey .................................................... 91

4.6.3 Administering the Main Survey ............................................. 92


4.6.4 Teacher Interviews .................................................................. 94
4.7 Stage II ............................................................................................... 95
4.7.1 Teacher Interviews .................................................................. 95
4.7.2 Student Surveys ...................................................................... 95
4.8 Conclusion.......................................................................................... 97
References ................................................................................................... 98
5 The Advantages of Interact ....................................................................... 101
5.1 Introduction ........................................................................................ 101
5.2 The Nationwide Teacher Survey – Section I ...................................... 102
5.2.1 Overview................................................................................. 102
5.2.2 Perceived Relative Usefulness of Converse
and Interact ............................................................................. 103
5.2.3 Variations in Teacher Responses ............................................ 106
5.2.4 Differences in Perception According
to Principal Language Taught ................................................. 108
5.2.5 Differences in Perception According
to Whether or Not Using Interact ........................................... 110
5.3 Advantages of Interact – Survey Data ............................................... 111
5.3.1 Authenticity and Interactiveness............................................. 112
5.3.2 Positive Impact ....................................................................... 114
5.3.3 Validity, Reliability and Potential for Washback .................... 116
5.4 Advantages of Interact – Interviews .................................................. 117
5.4.1 Authenticity and Interactiveness............................................. 118
5.4.2 Positive Impact ....................................................................... 120
5.4.3 Validity, Reliability and Potential for Washback .................... 122
5.5 Conclusion.......................................................................................... 123
References ................................................................................................... 124
6 The Disadvantages of Interact and Suggested Improvements............... 125
6.1 Introduction ........................................................................................ 125
6.2 Disadvantages of Interact – Survey Data ........................................... 126
6.2.1 Impracticality .......................................................................... 126
6.2.2 Negative Impact – Unrealistic Expectations........................... 129
6.2.3 Negative Impact – Interlocutor Variables ............................... 130
6.3 Suggestions for Improvement – Survey Data..................................... 130
6.3.1 Reduce the Number of Interactions Required ........................ 131
6.3.2 Allow Provision for Scaffolding/Rehearsal ............................ 132
6.3.3 Provide More Examples and More Flexible Options ............. 134
6.4 Disadvantages of Interact – Interviews .............................................. 135
6.4.1 Impracticality .......................................................................... 135
6.4.2 Negative Impact – Too Much Work for What It Is Worth ...... 137
6.4.3 Negative Impact – Interlocutor Variables ............................... 138
6.4.4 The Challenges of ‘Spontaneous and Unrehearsed’ ............... 139

6.5 Suggestions for Improvement – Interviews........................................ 140


6.5.1 Clarifying ‘Spontaneous and Unrehearsed’............................ 141
6.5.2 The Task is Everything ........................................................... 143
6.6 Conclusion.......................................................................................... 144
References ................................................................................................... 145
7 Interact and Higher Proficiency Students:
Addressing the Challenges ....................................................................... 147
7.1 Introduction ........................................................................................ 147
7.2 Examples of Task Types ..................................................................... 149
7.2.1 Talking About the Environment.............................................. 150
7.2.2 Mariage Pour Tous.................................................................. 153
7.2.3 Cat Café .................................................................................. 153
7.2.4 Getting Students to Take the Lead .......................................... 155
7.3 Problems Emerging ............................................................................ 156
7.3.1 Spontaneous and Unrehearsed ................................................ 156
7.3.2 Moving Away from Grammar................................................. 159
7.4 Back to the Task ................................................................................. 162
7.5 Conclusion.......................................................................................... 165
References ................................................................................................... 166
8 Interact and Higher Proficiency Students:
Concluding Perspectives ........................................................................... 167
8.1 Introduction ........................................................................................ 167
8.2 Working for Washback ....................................................................... 168
8.3 The Student Surveys........................................................................... 171
8.3.1 Section I .................................................................................. 172
8.3.2 Taking a Closer Look at the Numbers .................................... 173
8.4 Student Survey Responses – Converse............................................... 175
8.5 Student Survey Responses – Interact ................................................. 177
8.5.1 Spontaneity Versus Grammar ................................................. 178
8.5.2 Types of Task .......................................................................... 182
8.5.3 Peer-to-Peer Interactions ........................................................ 183
8.5.4 Working for Washback ........................................................... 184
8.6 Conclusion.......................................................................................... 185
References ................................................................................................... 187
9 Coming to Terms with Assessment Innovation: Conclusions
and Recommendations.............................................................................. 189
9.1 Introduction ........................................................................................ 189
9.2 Theoretical Underpinnings of Interact ............................................... 190
9.3 Summary of Findings ......................................................................... 192
9.3.1 Overview................................................................................. 192
9.3.2 Positive Dimensions of Assessments Such as Interact ........... 193
9.3.3 Negative Dimensions of Assessments Such as Interact ......... 195

9.4 Static or Dynamic: A Fundamental Problem ..................................... 196


9.4.1 Is Interact a Test?.................................................................... 196
9.4.2 What Do We Want to Measure?.............................................. 199
9.5 Where to from Here?.......................................................................... 201
9.5.1 Scenario 1 ............................................................................... 202
9.5.2 Scenario 2 ............................................................................... 202
9.6 Recommendations .............................................................................. 204
9.7 Limitations and Conclusion ............................................................... 206
References ................................................................................................... 209

Bibliography .................................................................................................... 213

Index ................................................................................................................. 225


List of Figures

Fig. 3.1 The original NCEA assessment matrix ............................................. 60


Fig. 3.2 The revised NCEA assessment matrix .............................................. 70
Fig. 3.3 Key changes between converse and interact..................................... 71
Fig. 4.1 Outcome requirements of interactions .............................................. 85
Fig. 4.2 Procedure for eliciting strength of perception .................................. 90
Fig. 5.1 Numbers of survey respondents (left) compared to numbers
of NCEA (senior secondary) students (2012) (right) ....................... 102
Fig. 5.2 Numbers of survey respondents using/not using interact ................. 103
Fig. 5.3 Percentage histogram of difference scores
(converse – interact) by measure ...................................................... 107
Fig. 5.4 Difference scores averaged across constructs ................................... 108
Fig. 5.5 Sub-construct differences in mean (converse v. interact)
by language taught ............................................................................ 109
Fig. 5.6 Sub-construct differences in mean (converse v. interact)
by whether or not using interact ....................................................... 110
Fig. 8.1 Student survey mean responses by measure
(converse v. interact) ......................................................................... 173
Fig. 8.2 Converse – range of responses by measure....................................... 174
Fig. 8.3 Interact – range of responses by measure ......................................... 174

List of Tables

Table 3.1 Grades and percentage equivalents (School C and Bursary) ......... 55
Table 4.1 Stages of the study ......................................................................... 89
Table 4.2 Taxonomy of emerging themes from the survey, Section II .......... 93
Table 5.1 Overall means and differences in means (teachers):
converse and interact...................................................................... 104
Table 5.2 Differences in standardised means between
converse and interact...................................................................... 105
Table 5.3 Analyses of variance of difference scores
for each sub-construct by use of interact ....................................... 111
Table 5.4 Frequencies of mentioning advantages of interact ........................ 112
Table 5.5 Interview participants (Stage I) ...................................................... 117
Table 6.1 Frequencies of mentioning disadvantages of interact .................... 126
Table 6.2 Frequencies of mentioning improvements to interact.................... 131
Table 7.1 Interview participants (Stage II)..................................................... 149
Table 8.1 Overall means and differences in means (students):
converse and interact...................................................................... 172
Table 8.2 Student survey participants (Stage II) ............................................ 178

Chapter 1
Mediating Assessment Innovation: Why Stakeholder Perspectives Matter

1.1 Introduction

This book recounts a story of assessment innovation. Situated within a context of recent across-the-board school curriculum and high-stakes assessment reforms in
New Zealand, the book focuses on one assessment in particular – the assessment of
senior high school students’ spoken communicative proficiency in a modern foreign
language (hereafter FL). Until recently, spoken proficiency was measured by a one-
time end-of-year interview test between teacher and student (known as converse).
The intention of the new assessment (called interact) is that spoken proficiency will
be principally measured by capturing a range of genuine student-initiated peer-to-
peer interactions as they take place in the context of regular classroom work
throughout the year.1
Gardner, Harlen, Hayward, and Stobart (2008) argue that modifications to assess-
ment “must begin with some form of innovation, which might be quite different
from existing practices in any particular situation” (p. 3). Those who played a part
in conceptualising and designing the new assessment (of whom this author was one)
had the best of intentions. We wanted to enhance the opportunity to encourage (and
measure) genuine instances of students’ participation in spontaneous authentic FL
interactions with their peers, in contrast to the somewhat rehearsed, contrived and
controlled ‘conversations’ that were often characteristic of the former assessment.
We built our assessment blueprints on a range of theoretical arguments, including: that FL students learn to use the target language most effectively when they are engaged in real language use in the classroom (Willis & Willis, 2007); that they learn how to communicate through interaction in the target language (Nunan, 2004); and that engagement in meaningful language communication should be an important focal point for assessments (Norris, 2002).

1 In this book I use the terms ‘assessment’ and ‘testing/test’ somewhat interchangeably. A test is a discrete instance of assessment, whereas assessment is a broader concept. That is, a test is an assessment. Not all assessments are tests. In this book, the assessment in question includes recording a short instance (a few minutes) of interaction between two or more interlocutors, which may be part of a longer interaction, and using that instance for assessment purposes. This instance is not designed to be a test (although in some circumstances it may be operationalised as such), and several instances, collected together, lead to a holistic grading of performance.
Nevertheless, the new high-stakes assessment signalled a radical departure from
established practices. Despite our laudable intentions, its introduction has not
occurred without controversy. One notable example of backlash was a teacher com-
plaint that landed on the desks of the Deputy Chief Executive for the New Zealand
Qualifications Authority (NZQA), the body responsible for overseeing national
qualifications, and New Zealand’s Minister for Education. The essence of this
teacher’s complaint, lodged at the end of 2013, was that the new assessment was so
flawed and based on spurious foundations that it should immediately be abandoned.
To enlist support for his cause, the teacher also initiated debate via several New
Zealand listservs for languages teachers. A whole range of opinions, both support-
ive and critical, ensued. Although ultimately the debate fizzled out, it was at times
intense and passionate, and revealed not only the depth of many teachers’ feelings
about the reforms but also the diversity of opinion.2
Bearing in mind that the reforms, despite their theoretical justification, imply
significant changes to practice, strong teacher reaction to the new assessment is not
necessarily surprising. However, stakeholder perspectives must be taken seriously if
we are to conclude that a new assessment, when put into operation, is valid or ‘fit
for purpose’.
This book tells the story of the early years of the reform with particular focus on
two key groups of stakeholders – teachers and students in schools – and their per-
spectives on the new assessment as derived from a range of surveys and interviews.
The book thereby takes a fresh approach to building a case for validity. That is, the
main source of evidence for validity claims has conventionally been performance
scores (Chapelle, 1999). The perspectives of teachers and students are often not
sought as contributions to validity arguments, even though it seems logical to
assume that teachers and students would have something worthwhile to say (Elder,
Iwashita, & McNamara, 2002; Winke, 2011). This is particularly so in a context
where assessment innovation is being imposed by virtue of educational policy and
practice, and where the assessment is effectively managed by teachers in schools (as
was the case with converse, and is the case with interact).
This book thus crosses the traditional theoretical and methodological boundaries
associated with applied linguistics and language assessment research, with their
central interest in student performance. It presents an alternative approach where
stakeholders become the centre of interest. This cross-over, where dimensions of
applied linguistics and language assessment research interface with aspects of educational policy, provision and practice, makes the work an important contribution to the field of educational linguistics.

2 The languages listservs provide a forum for subscribed New Zealand languages teachers to engage in debate about topical issues. The debates about interact were part of a broader campaign, launched by one individual, to see interact rescinded, and reached their peak around the beginning of 2014. The debates and campaign documents essentially constitute ‘personal communications’ to which this author was party.
The purpose of this opening chapter is to set the scene for, and provide the theo-
retical rationale for, a study that focuses on stakeholder views. The chapter sum-
marises a number of issues which I explore in greater depth in subsequent chapters.
It interweaves the New Zealand case with more global arguments about teaching,
learning and assessment in order to situate the case in question within on-going
international debates. The chapter begins by outlining the essence of New Zealand’s
curriculum and assessment reforms against the backdrop of current understandings
of FL teaching, learning and assessment, and acknowledges the complexities
involved in such reforms. It articulates the centrality of assessment to effective
teaching and learning practice and describes the evidence that assessment develop-
ers would normatively draw on to ensure that assessments are adequate to the task.
The chapter goes on to explain the necessity for broader approaches to validation
and, in particular, the use of stakeholder perspectives. The chapter concludes with
an overview of the study that is the focus of this book.

1.2 Background: The Importance of Interaction in Foreign Languages Pedagogy

1.2.1 Communicative Language Teaching

For almost half a century, the ability to communicate effectively in a foreign lan-
guage has been fundamental to the aims and goals of many languages programmes
across the globe (Brown, 2007; Richards, 2001; Richards & Rodgers, 2014) – tradi-
tionally operationalised through helping students to acquire proficiency in several
skills – listening, reading, writing and speaking – and built on a theoretical construct
of ‘communicative competence’.
In the UK, for example, the birth, in the early 1970s, of the approach that came
to be known as Communicative Language Teaching or CLT heralded an emphasis
on language in actual use for the purpose of fulfilling learners’ needs in concrete
situations. The introduction of CLT marked a significant shift in pedagogy away
from a linguistic/grammatical emphasis as represented through such approaches as
grammar-translation and audio-lingualism. In its place, the emphasis became “what
it means to know a language and to be able to put that knowledge to use in com-
municating with people in a variety of settings and situations” (Hedge, 2000, p. 45).
A parallel development in the US witnessed the birth, at the start of the 1980s, of
what Kramsch (1986, 1987) refers to as the ‘proficiency movement’ and the
‘proficiency-oriented curriculum’. This development was built on the argument
that language is “primarily a functional tool, one for communication” (Kramsch,
1986, p. 366). This view carried with it an implicit assumption that “the final justi-
fication for developing students’ proficiency in a foreign language is to make them

interactionally competent on the international scene” (p. 367). Such competence would be acquired by fostering “the ability to function effectively in the language in
real-life contexts” (Higgs, 1984, p. 12).
As Richards and Rodgers (2014) put it, from an historical perspective both
British and American advocates came to view CLT as an approach that “aimed to
(a) make communicative competence the goal of language teaching and (b) develop
procedures for the teaching of the four language skills that acknowledge the inter-
dependence of language and communication” (p. 85). The 1970s and 1980s repre-
sented a foundational period in the movement – what we might refer to as the
beginnings of the ‘classic’ CLT phase (Richards, 2006) – and the establishment of
several key principles that have not only had influence in many contexts across the
globe, but that also retain currency and relevance well into the twenty-first century
(in this regard see Hunter, 2009; Leung, 2005; Savignon, 2005; Spada, 2007;
Tomlinson, 2011).
Richards (2006) goes on to note a developmental CLT phase (1990s onwards)
that has broadened our understanding of the effective operationalisation of
CLT. Brown (2007) speaks of a “new wave of interest” that has moved the emphasis
away from the structural and cognitive aspects of communication and towards its
social, cultural and pragmatic dimensions. This development (which is essentially
an expansion of principles that were already there in essence since the early days)
has drawn attention to “language as interactive communication among individuals,
each with a sociocultural identity.” Brown asserts that, as a consequence, teachers
are “treating the language classroom as a locus of meaningful, authentic exchanges
among users of language,” with FL learning viewed as “the creation of meaning
through interpersonal negotiation among learners” (p. 218).
As Philp, Adams, and Iwashita (2014) make clear, the shift from teacher-led to
student-centred has precipitated increased understanding and appreciation of the
valuable learning potential of peer-to-peer interactions. This learning potential is
underpinned and supported by both a cognitive perspective (e.g., Long’s [1983,
1996] interaction hypothesis) and a sociocultural perspective whereby learning is “a
jointly developed process and inherent in participating in interaction” (p. 8). Philp
et al. describe peer interaction as “any communicative activity carried out between
learners, where there is minimal or no participation from the teacher” (p. 3). It is
“collaborative in the sense of participants working together toward a common goal”
(p. 3), and it increases opportunities for students to speak, practise communication
patterns, engage in negotiation of meaning, and adopt new conversational roles.
Implicit in CLT approaches, the proficiency movement and a focus on interaction
is the end-goal of automaticity in language use (DeKeyser, 2001; Segalowitz, 2005).
Although operationally defined and theoretically achieved in a variety of ways, in
essence automaticity refers to the ability of language users to draw on their knowl-
edge of the FL automatically and spontaneously. Automaticity can be demonstrated
at a range of proficiency levels. Ultimately, however, automatic language users will
be able to “perform a complex series of tasks very quickly and efficiently, without
having to think about the various components and subcomponents of action
involved” (DeKeyser, 2001, p. 125). The Proficiency Guidelines of the American

Council on the Teaching of Foreign Languages (ACTFL, 2012) and the Common
European Framework of Reference for languages or CEFR (Council of Europe,
1998, 2001) represent significant and influential steps towards articulating different
levels of FL learners’ communicative proficiency across a range of skills. The
frameworks recognise and attempt to articulate several levels of automaticity and
proficiency from the most basic users of an additional language (L2) to those who
have achieved a virtually first language (L1) proficiency level.
Pedagogically, the fundamental place and value of spoken communicative inter-
action have been supported by specific realisations of CLT such as task-based lan-
guage teaching (TBLT). TBLT is based upon the learner-centred and experiential
argument that learners’ participation in authentic communicative language use tasks
will foster effective language acquisition (East, 2015; Nunan, 2004; Willis & Willis,
2007). Arguably a strength of TBLT is that it does not neglect what Brown (2007)
refers to as the ‘structural and cognitive aspects of communication’, even though it
aims primarily to foster its ‘social, cultural and pragmatic dimensions’. That is,
TBLT “aims to reconcile, on the one hand, the primary importance of fluency (with
its implications for … communication) with due attention, on the other hand, to
accuracy (with its implications for proficiency)” (East, 2012, p. 23).3 If automaticity
is the end-goal, De Ridder, Vangehuchten, and Seseña Gómez (2007) propose that
TBLT “leads to a higher level of automaticity than the traditional communicative
approach” (p. 310) because it “stimulates the process of automatization to a larger
extent than a purely communicative course with a strong systematic component”
(p. 314).

1.2.2 Communicative Language Testing

The emphasis on communication heralded by the advent of CLT and the proficiency-
oriented curriculum has had significant implications for assessment, including high-
stakes measurements of students’ proficiency. Bachman (2000) acknowledges that
the 1980s marked the start of a movement away from the “narrow conception of
language ability as an isolated ‘trait’.” There was instead a move towards an under-
standing of language use as “the creation of discourse, or the situated negotiation of
meaning, and of language ability as multicomponential and dynamic.” Bachman
goes on to argue that this move would require those who wished to assess language
proficiency to “take into consideration the discoursal and sociolinguistic aspects of
language use, as well as the context in which it takes place” (p. 3). In other words,
the kinds of linguistic knowledge that could arguably be established (and measured)
via the tests and examinations associated with grammar-translation, or the mimick-
ing of words and phrases that had been common to audio-lingualism, were no

3
As East (2012) makes clear, although TBLT has often been interpreted as focusing primarily on
spoken interaction, the approach is designed to foster second language acquisition across the full
range of skills.
6 1 Mediating Assessment Innovation: Why Stakeholder Perspectives Matter

longer sufficient. Rather, it was necessary to view proficiency more holistically in


terms of carrying out genuine communication in a range of contexts.
On the basis of arguments concerning the real-world communicative outcomes
of the CLT classroom, the principle of authenticity became fundamental to debates
around language tests (Morrow, 1991; Wood, 1993). In this regard, Bachman and
Palmer (1996) maintain that it is necessary to establish an association between per-
formance on a language test and language in actual use. Performances on assess-
ments need to demonstrate clear resemblance to the target language use (TLU)
domains being targeted in the assessments – the actual real-world situations that the
assessments aim to reflect. For Bachman and Palmer, authenticity is “a critical qual-
ity of language tests” (p. 23). This is because authenticity “relates the test task to the
domain of generalization to which we want our score interpretations to generalize”
(pp. 23–24). In other words, “if we want to find out how our students are likely to
perform in real world language use tasks beyond the classroom, we need to create
assessment opportunities that allow them to use the type of language they are likely
to encounter beyond the classroom” (East, 2008a, p. 24).
In the New Zealand context which is the focus of this book, the advent of CLT
and its subsequent developments, including an emphasis on interaction, the emer-
gence of TBLT, and the need for authentic assessment opportunities, have had con-
siderable influence. The New Zealand case therefore provides a window through
which to examine the outworkings, in one local context, of the educational develop-
ments described above that have been taking place on a global scale. In what fol-
lows I provide a brief overview of the New Zealand reforms (which I discuss in
considerably more detail in Chap. 3), before going on to articulate the challenges
associated with assessment reform.

1.3 Curriculum and Assessment Reforms in New Zealand

1.3.1 Overview

The start of the twenty-first century has been a period of considerable educational reform
for New Zealand’s secondary education sector. In 2002 a new high-stakes assess-
ment system was launched – the National Certificate of Educational Achievement
or NCEA. The ‘skills’ or ‘standards’ based system, which relies on a mix of exter-
nal (examination) and internal (school-based) assessments, with achievements
benchmarked against stated standards, replaced a traditional, summative knowledge-
based examination structure. This new assessment model marked the practical
beginning of a move in thinking away from a teacher-led pedagogical paradigm to
a more learner-focused hands-on approach. For FLs, it also marked the introduction
of internal assessments that aimed to reflect a communicative orientation to teach-
ing. This included a school-based test of spoken communicative proficiency called
converse. The NCEA, now well established, operates at three levels designed to

measure increasing proficiency: level 1 for students in Year 11 (15+ years of age and
final year of compulsory schooling); level 2 (Year 12); and level 3 (Year 13, final
year of voluntary schooling).
The continuation of a shift in pedagogical emphasis away from a top-down
didactic model to one that was more learner-centred and experiential was seen in the
launch of a revised national curriculum for schools, published in 2007 and fully
implemented from 2010 (Ministry of Education, 2007). The revised curriculum also
saw the establishment of a new learning area – Learning Languages. This learning
area, which caters for all languages additional to the language of instruction, includ-
ing FLs, “puts students’ ability to communicate at the centre.” It encourages teach-
ing and learning programmes in which “students learn to use the language to make
meaning” and “become more effective communicators” (p. 25). The revised cur-
riculum and the new learning area were to have significant implications and conse-
quences for FL programmes. East (2012) provides a detailed and thorough account
of some of these implications and consequences with regard to TBLT as a specific
realisation of curricular aims in the FL context.
As East (2012) indicates, the advent of a revised curriculum and new learning
area was also to have significant implications for assessment. Between 2008 and
2010, and parallel to the introduction of the revised curriculum, a subject-wide
review of the NCEA was conducted. Its end-goal was to create new NCEA assess-
ments, aligned with the aims and intentions of the revised curriculum. For FLs, the
introduction of interact in place of converse has been one outcome of this process,
based essentially on the argument that interact would promote more opportunities
for authentic spoken interaction than converse had achieved.

1.3.2 Implementing Assessment Reform: A Risky Business

Implementing assessment innovation is, however, a process fraught with challenges, and the New Zealand case is no exception. Bachman and Palmer (2010) argue that
“people generally use language assessments in order to bring about some beneficial
outcomes or consequences for stakeholders as a result of using the assessment and
making decisions based on the assessment” (p. 86). Certainly, those of us charged
with drawing up the new assessment guidelines for interact (what Bachman and
Palmer refer to as the blueprints) proposed the assessment with this beneficent aim
in mind. Nevertheless, even the best intentioned assessment blueprints are often
created from a theoretical perspective that may turn out to be challenging to imple-
ment within the real worlds inhabited by teachers and students. There always
remains “the possibility that using the assessment will not lead to the intended con-
sequences, or that the assessment use will lead to unintended consequences that
may be detrimental to stakeholders” (p. 87).
Bachman and Palmer (2010) explain the dilemma like this:
… language assessment development does not take place in a predictable and unchanging
world where everyone else knows and understands what we language testers think we know
and understand about language assessment. The real world of language assessment devel-
opment is one in which the other players may not know the “rules of the game” that we’ve
been taught in language assessment courses. Or, they may choose not to play by those rules.
(p. 249)

It may be that we, the assessment developers, despite our good intentions and our adherence to ‘good language teaching and assessment theories’, have got it wrong. We may not have understood or taken into account the range of contexts in
which our proposed assessment would be used. In this case the assessment would
not always be as fit for purpose as we had anticipated. Alternatively, or additionally,
some teachers may not necessarily be aware of, or may choose not to subscribe to,
the principles of good language teaching and assessment that informed our delibera-
tions, and may not fully appreciate the intentions of the proposed assessment. In this
case they may choose not to embrace the new assessment as fully as they could, or
may introduce the assessment in ways that are not fully in accord with its intentions.
Or teachers may choose not to accept the ‘wisdom’ of the assessment developers,
and may perhaps reject the assessment altogether. Whatever the scenario, there has
been clear evidence from a range of teacher reactions to suggest that we need to
keep a careful eye on what might be happening in practice.
In other words, the theory and the practice may not necessarily gel together as
neatly as we would like. Or, as Bachman and Palmer (2010) would put it, assess-
ment developers may provide warrants to the beneficence of the assessment in order
to support an argument about the use of the assessment, but rebuttals to those war-
rants might bring that argument into question. This being the reality, it is important
to consider the kinds of evidence that are required to help all of us (teachers, stu-
dents, assessment developers, and so on) to come to an appropriate conclusion
about the validity and usefulness of new assessments. To draw on Bachman and
Palmer’s words, we need to ask which rules of the game should apply. In what fol-
lows, I outline the essential place of assessment within the educational endeavour
and articulate the ways in which we can attempt to acquire evidence that our assess-
ments are fit for purpose.

1.4 Assessment Validation

1.4.1 Fundamental Considerations

Assessment is a matter of central concern to all those involved in the educational endeavour. In all contexts where teaching and learning are taking place, assessment,
in one form or another, is the means through which educators evaluate the effective-
ness of the teaching and learning process, and through which judgments are made
that have implications for stakeholders.
Some judgments made as a consequence of the assessment of teaching and learn-
ing are relatively minor. They may relate to next steps in the teaching and learning
process and modifications that may be required to make that process more effective.

They may lead to a tweak to a programme here, an alteration to an approach there, in the name of making improvements to teaching and learning, but the stakes are
low, and the consequences not far-reaching.
Other judgments are more significant, especially when they lead to the grading
of individual performances. The level of significance of the judgment will depend
on how the gradings are used. If used for diagnostic purposes with a view to enhanc-
ing subsequent teaching and learning (i.e., what is going to happen next in the teach-
ing and learning process), the consequences of decisions made on the basis of
grades are not life-changing. If used for accountability purposes with a view to
measuring prior teaching and learning, the consequences of decisions may not
always be positive (e.g., performance outcomes may reveal that an individual has
not ‘made the grade’). Depending on how much is riding on the performance and
what decisions may be made in the light of performance indicators (that is, how
high the stakes are), the consequences, good or bad, may be substantial (Kane,
2002).
When it comes to high-stakes assessments, Shohamy (2001b) makes it clear that
performance outcomes play a very powerful role in modern societies. Doing as well
as possible on such assessments is seen as important, and the results have a wide
range of consequences for those who are required to take the assessments.
Performance outcomes, however they are communicated (that is, whether in the
form of grades, marks, percentages or comments), are frequently the only indicators
used to make decisions, for example, about placement into particular courses of
study, or the award of prizes, or initial and continuing access to higher education, or
certain job opportunities. Many important and far-reaching decisions may be made
on the basis of the grades, leading to the creation of “winners and losers, successes
and failures, rejections and acceptances” (p. 113).
It is thus not surprising that high-stakes assessments can often be negatively
evocative. Shohamy (2007) recalls her own early experiences with testing at school,
where tests were seen as “a hurdle, an unpleasant experience.” Tests were not only
“responsible for turning the enjoyment and fun of learning into pain, tension, and a
feeling of unfairness,” but also “often the source of anger, frustration, pressure,
competition, and even humiliation.” For Shohamy, there was a sense in which her
‘real knowledge’ was not being tapped into, and she was often left with a sense of
not understanding why tests were even necessary when there was so much else in
the learning process that was gratifying and satisfying. She concludes that being
required to complete a test “often felt like betrayal. If learning is so meaningful,
rewarding, and personal, why is it that it needs to be accompanied by the unpleasant
events of being tested?” (p. 142).
A central issue for assessment, therefore, especially when used for making key
decisions about individuals, is how we can effectively collect meaningful and useful
information about teaching and learning from which worthwhile and trustworthy
decisions can be made. All those involved in the assessment process owe it to the
students, as the central stakeholders, to make sure that the marks or scores they
receive are (as far as possible) fair representations of their true abilities, and that these
marks or scores are collected in relevant and (as far as possible) ‘pain free’ ways.
When it comes to assessing FL students’ language proficiency, we need, in
Bachman’s (1990) words, on the one hand to ensure that assessments provide “the
greatest opportunity for test takers to exhibit their ‘best’ performance” so that they
are “better and fairer measures of the language abilities of interest” (p. 156).
Assessments thus need to be constructed in such a way that they facilitate, and do
not hinder, all candidates’ opportunities to demonstrate what they know and can do
with the language. On the other hand, we cannot escape the reality that “the primary
function performed by tests is that of a request for information about the test takers’
language ability” (p. 321). Language assessments also need to be constructed in
such a way that they can discriminate accurately between different levels of perfor-
mance. In other words, and to use the terms that have now become established in the
testing and assessment literature, we need to be concerned about the validity of the
assessment procedure, and the reliability of the performance outcome information.

1.4.2 The Contribution of Assessment Score Evidence to a Validity Argument

Bearing in mind the central importance of performance outcomes (i.e., grades,
marks and percentages), one fundamental way in which we can demonstrate that
proposed new assessments are valid and reliable is to examine and analyse these
outcomes. To undertake this analysis would be to follow the traditional and well-
established psychometric model which has become, and remains, the basic founda-
tion of the field of language assessment. The psychometric model is fundamental
because it brings with it “the appeal of its rigorous methods and anchorage in psy-
chology, a field much older than applied linguistics, with a correspondingly wider
research base” (McNamara & Roever, 2006, p. 1).
Viewed from within the psychometric tradition, validity and reliability are
fundamentally the two basic measurement characteristics of assessments (East, 2008a).
They are primarily concerned with the meaningfulness and precision of assessment
scores in relation to the measurement of an underlying ability or construct – a theo-
retical quality or trait in which individuals differ (Messick, 1995). Since, from this
tradition, performance scores represent the most visible and tangible evidence of the
outcomes of an appropriate assessment, it is not surprising that performance out-
comes have been relied upon for many years to help determine the validity and
reliability of a given assessment. Validity is therefore a measurement concept, with
construct validity historically coming to be regarded as the overarching validity of
importance (Newton & Shaw, 2014).
From a measurement perspective, construct validity may be defined as “the
agreement between a test score or measure and the quality [or construct] it is
believed to measure” (Kaplan & Saccuzzo, 2012, p. 135). Construct validity relates
to whether the assessment task adequately and fairly reflects the construct that the
assessment is trying to tap into (Cohen & Swerdlik, 2005; Kline, 2000) and there-
fore the extent to which the scores are meaningful interpretations of the abilities of
those who complete the assessment. In other words, validity “concerns what the test
measures and how well it does so” in the sense of “what can be inferred from test
scores” (Anastasi & Urbina, 1997, p. 113). With regard to the measurement of spo-
ken communicative proficiency, for example, pertinent issues are: do the scores tell
us something meaningful in relation to students’ abilities relative to a spoken com-
municative proficiency construct? Can we determine, from the scores, how well
students are able to perform relative to the construct in contexts outside the domain
of the assessment?
Reliability is concerned with how scores are awarded and whether the process of
awarding scores is adequate (what are the assessment criteria? How are they
applied? Who applies them?). Reliability is also concerned with the consistency
with which a given assessment can measure the construct in comparison with a dif-
ferent assessment that aims to measure the same construct (parallel forms) or with
the same assessment completed at a different time (test-retest). Again, with regard
to measuring a spoken communicative proficiency construct, the relevant questions
are: can we be satisfied that the process of awarding the scores is adequate to tell us
how well students are able to perform relative to the construct? To what extent do
the scores, when compared with other scores that purport to measure the same con-
struct, tell us the same thing?
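In psychometric terms, the second of these questions is conventionally answered with a correlation coefficient between the two sets of scores. As a minimal sketch of the standard classical test theory formulation (a textbook convention, not anything specific to the assessments discussed in this book), the reliability of a set of scores X can be estimated from their agreement with scores X′ obtained from a parallel form or a repeat administration:

\[
r_{XX'} = \frac{\mathrm{Cov}(X, X')}{\sigma_X \, \sigma_{X'}}
\]

where \(\sigma_X\) and \(\sigma_{X'}\) are the standard deviations of the two sets of scores. The closer \(r_{XX'}\) is to 1, the more consistently the two administrations rank the same candidates; values well below 1 signal that something other than the ability of interest is influencing the scores.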
If, based on assessment score evidence, we can determine that the scores do tell
us something meaningful in relation to the quality we aim to assess, and are consis-
tent with other measures of the same construct (or the same measure taken at a dif-
ferent time), we can draw a conclusion that the assessment is construct valid, and,
ipso facto, a fair, unbiased and reasonable assessment.
Validity and reliability therefore have to do with fairness (Kunnan, 2000). That
is, from a theoretical perspective, a fair assessment may be described as one where
the construct has been clearly defined, where this construct has been meaningfully
operationalised in the assessment, and where performance scores can be shown to
measure the construct reliably. If we conclude that the assessment is fair, we can be
satisfied that we have obtained useful, meaningful and reasonable information – the
assessment is fit for purpose. We can therefore be satisfied that, especially when key
decisions are being made, the assessment is doing no harm, and may even be benefi-
cial in terms of the performance evidence it provides. This approach, with its use of
“models, formulas, and statistics to examine the degree to which an assessment
argument possesses the salutary characteristics of validity, reliability, comparability,
and fairness” (Mislevy, Wilson, Ercikan, & Chudowsky, 2003, p. 490) has carried,
and continues to carry, a great deal of weight in the field of educational measure-
ment (as Mislevy et al. make clear).
The question then becomes whether performance outcomes alone provide suffi-
cient evidence on which to base claims of validity and suitability. Or are there other
‘rules of the game’ that may need to be applied?
1.4.3 The Limitations of Assessment Score Evidence to a Validity Argument

It may be useful to return to Bachman and Palmer’s (2010) concern about the dis-
tinction between what assessment developers would like to happen, and what may
actually happen in practice. Bachman and Palmer underscore the fact that the real
world in which assessments must be enacted is “often unpredictable” and “includes
many uncertainties and conflicts, and is constantly changing” (p. 2). Teachers,
whether they are willing or unwilling implementers and enactors of assessments,
can, in Bachman and Palmer’s words, “often become frustrated with the uncertain-
ties, changeability, and limitations of the real world settings in which they work”
(pp. 1–2).
If we are to address concerns about new and innovative assessments, arguably a
broader understanding of validity is required than one that relies on the ‘narrow
vein’ of statistical evidence (Norris, 2008). As McNamara and Roever (2006)
observe, “through marrying itself to psychometrics, language testing has obscured,
perhaps deliberately, its social dimension” (p. 1), and the reality that a whole range
of assessments may have “far-reaching and unanticipated social consequences”
(p. 2). McNamara and Roever point out that, in fact, a concern for the social dimen-
sion of assessment has been around for many years, both generally within education
(e.g., Madaus & Kellaghan, 1992) and specifically within language testing (e.g.,
Spolsky, 1995). What, in McNamara and Roever’s view, has been lacking has been
specific attention to broader issues pertaining to language assessment due to the
ascendancy of the psychometric tradition from the 1950s.
Mirroring Shohamy (2001b), McNamara and Roever (2006) argue that some
language assessments may have serious and life-changing implications for those
who take them. Other language assessments reflect educational policies and direc-
tives in different jurisdictions. In these cases, they suggest that “testing in educa-
tional systems continues to raise issues of the consequential and social effects of
testing, most of which remain the subject of only scant contemporary reflection”
(p. 4). Newton and Shaw (2014) also problematise a primary focus on outcome
evidence as the means of establishing validity, acknowledging that part of develop-
ments in thinking about what makes a given assessment valid pertains to the impact
of the assessment on those who have a stake in its use.
The importance of performance indicators in helping to determine validity and
reliability is not in question here. Especially when it comes to high-stakes and/or
accountability purposes, there is no escaping the use of ranking and grading as
means of communicating assessment outcomes. It is also not in question that people
do have different levels of proficiency in different domains, and grades are one
means of capturing and reporting those differences. Asserting the limitations of
psychometrics should not be taken to mean that assessment score evidence does not
have an important role to play in assessment validation (and certainly my own work
thus far on assessment has included psychometric considerations – e.g., East, 2007,
2008a, 2009). What is in doubt is whether performance scores alone provide
sufficient evidence on which to base conclusions about usefulness and validity.
Arguments proposed, for example, by Shohamy (2001b, 2007), McNamara and
Roever (2006), and Newton and Shaw (2014), suggest that other sources of evi-
dence have an important role to play.
Essentially, validity claims based on assessment score evidence alone cannot
adequately take into account the practical and contextual realities and complexities
of implementing a new assessment. Put simply, assessments have consequences. A
singular reliance on performance outcomes as the means of validating an assess-
ment overlooks the complexities involved in the assessment process – the range of
‘noise factors’ other than the ability we wish to measure that may interfere with
candidates’ performances and that need to be recognised, understood and controlled
for so that we capture students’ best performances (Elder, 1997).
The psychometrician’s response may be to argue that we do not need to be con-
cerned about ‘noise’ if we can demonstrate from the performance score evidence
that the assessment is construct valid, reliable, consistent and fair. The educational-
ist’s retort, coming from the perspective that the assessment process itself has a
clear and tangible impact on students, may present the counter-argument that valid
and reliable scores are only part of a considerably bigger picture. Other ways to
examine the claims to validity of proposed new assessments might provide a win-
dow into some of the ‘real world’ issues faced by stakeholders.

1.4.4 Towards a Broader Understanding of Assessment Validation

There are several reasons why observed scores (the scores candidates actually
achieve at a particular administration of the assessment) might differ from true
scores (the scores candidates would get if there were no measurement errors), or
why the scores may not necessarily provide us with a true representation of a par-
ticular candidate’s abilities. Messick (1989) articulates this by defining what he
considers as two ‘threats’ to construct validity.
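The observed/true score distinction itself can be put more formally. In classical test theory (again a standard formulation, offered here only as an illustrative sketch), an observed score is modelled as

\[
X = T + E
\]

where \(X\) is the score a candidate actually receives, \(T\) is the unobservable true score, and \(E\) is measurement error arising from anything other than the ability of interest; reliability can then be read as the proportion of observed-score variance attributable to true-score variance, \(\sigma_T^2 / \sigma_X^2\). Messick’s two threats, taken up next, are systematic reasons why \(X\) may misrepresent a candidate’s standing on the construct, over and above the random error captured by \(E\).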
First there is construct under-representation, the situation that occurs when the
assessment task does not include important dimensions or aspects of the construct
under consideration. The threat here is that performance outcomes are unlikely to
reveal the candidate’s true abilities related to the construct that was supposed to
have been measured by the assessment.
Second, there is construct irrelevant variance, the situation that occurs when the
assessment includes variables that are not relevant to the construct. If the assess-
ment task is presented in such a way that some candidates are able to respond cor-
rectly or appropriately in ways that are not relevant to the construct being assessed,
we may get an overly-inflated measurement of ability: some students may perform
better than they normally would (i.e., secure grades or other outcome measures that
are higher than they might have done if the extra ‘clues’ had not been there), and the
assessment is subject to construct irrelevant easiness. Construct irrelevant difficulty
by contrast occurs when extraneous aspects of the assessment task make the task
irrelevantly difficult for some candidates, potentially leading them to perform more
poorly than they might otherwise have done.
In Chaps. 2 and 3 I explore issues that confront New Zealand’s converse versus
interact scenario that have relevance to Messick’s (1989) two potential threats to
construct validity. I summarise the essential issues below.
With regard to construct under-representation, it may be proposed that the old
converse test did not adequately represent the construct of spoken communicative
proficiency that the test would ideally have measured. If this is the case, perfor-
mances on tests such as converse do not reveal candidates’ true abilities to converse
in the target language. The performance outcomes are therefore unreliable indica-
tors of what candidates know and can do.
With regard to construct irrelevant variance, it may be asserted that, because the
old converse test created some situations where candidates could effectively rote
learn their responses, candidates’ performances may have been enhanced, but in
ways that were not relevant to a spoken communicative proficiency construct. In
these cases performances on the test once more do not reveal candidates’ true abili-
ties to converse in the target language. On the other hand, a counter-assertion with
regard to interact may be that peer interaction presents a problem if used as the basis
for assessment. Two candidates of different proficiencies may be paired, or the two
candidates may not wish to interact with each other. In these circumstances, extra-
neous to the task, the task is made irrelevantly more difficult (unless it is argued that
ability to sustain an interaction in difficult circumstances is part of what we want to
measure). This may lead one or both of the candidates to underperform. Again, the
performances on the assessment do not reveal the candidates’ true interactional
abilities.
Kaplan and Saccuzzo (2012) take the argument further when they point out that
a whole range of situational factors may interfere with performance (the room in
which the assessment is taken may be too hot or too cold, or the candidate may be
feeling ill or unhappy on the day of the assessment). If, as I have already argued,
assessment scores have consequences for the candidates, scores affected by factors
(both construct-related and situational) other than the ability being assessed may
potentially compromise the fairness of the assessment as a measure of the underly-
ing construct, leading to unfair consequences.
What is needed is therefore a broader understanding of assessment validation.
That is, in addition to an evidential (i.e., score or outcome related) basis for validity
decisions, we need a consequential basis – what taking this kind of assessment
means for the candidates beyond the scores they might receive. We need to consider
“not only the intended outcome but also the unintended side effects” of the assess-
ments (Messick, 1989, p. 16). A validity argument must also consider the value
implications and social consequences of the assessment. This may include a range
of variables that may have an influence on performance outcomes.
In other words, as Messick (1989) argues, “construct validity binds social conse-
quences of testing to the evidential basis of test interpretation and use” (p. 21).
Bachman (2000) similarly suggests that “investigating the construct validity of
[score] interpretations without also considering values and consequences is a barren
exercise inside the psychometric test-tube, isolated from the real-world decisions
that need to be made and the societal, political and educational mandates that impel
them” (p. 23). Newton and Shaw (2014) assert that an ethical concern about impact
is but one dimension that has muddied the once clear waters of a ‘simple’ focus on
test scores, to the extent that validity theorists differ considerably in the weight they
place on anything other than the scores. Nevertheless, whatever stance to validity is
taken, the approach “needs to be capable of accommodating the fact that conse-
quential evidence has the potential to inform the evaluation of measurement and
decision-making and impacts” (p. 185).
The above arguments take us beyond an understanding of construct validity that
pays sole attention to assessment scores and their interpretation. Debates concern-
ing what does and does not constitute a fair assessment must take into account that
assessments have consequences for the stakeholders, whether positive or negative,
and that these consequences have implications for validity. We must therefore move
beyond the scores to look at the social consequences – what it means to the candi-
dates to receive particular scores and how their performances were affected by dif-
ferent dimensions of the assessment process. We must aim to eliminate the two
threats to construct validity as far as we are able. We must consider assessment bias,
or whether any aspects of the assessment procedure potentially have negative influ-
ence on some students in ways that are irrelevant to the construct under consider-
ation. It is not just what is going on within the assessment (i.e., the task itself); it is
what is going on around the assessment (i.e., the teaching and learning process; the
assessment process; the assessment conditions; the types of task).
As I argue in East (2008a), taking into consideration Bachman’s (1990) com-
ments about communicative language testing – that tests should provide sufficient
opportunity for all test takers to demonstrate their best performance so that they are
better and fairer measures of the construct of interest – one way of looking at fair-
ness is this: did the candidates have the greatest opportunity to demonstrate in the
assessment what they know and can do? Was there part of the assessment procedure
that may have hindered this? What may be the consequences, for the candidates, of
this?
These are critical issues which need to be considered if we are to reduce or avoid
adverse consequences for the students who take the assessment. The implication if
we consider them is that we shall have a fairer and more construct valid assessment.
The implication if we ignore them is that the assessment may be flawed and biased,
and consequently unfair.
When it comes to considering the merits of interact (New Zealand’s new assess-
ment) over converse (the more traditional test that interact has replaced), it appears,
at least when considered from a theoretical perspective, that there are arguments and
counter-arguments, justifications and refutations. Performance outcome evidence
alone cannot settle these discrepancies. Other kinds of evidence need to be collected
if we wish to understand more thoroughly what is happening with the introduction
of interact.

1.4.5 A Qualitative Perspective on Assessment Validation

Writing in the context of the assessment of speaking, and paralleling the perspective
of McNamara and Roever (2006), Lazaraton (2002) argues:
Language testers have generally come to recognize the limitations of traditional statistical
methods for validating oral language tests and have begun to consider more innovative
approaches to test validation, approaches that promise to illuminate the assessment process
itself, rather than just assessment outcomes. (p. xi)

Lazaraton (2002) goes on to explain that, in her view, the most significant devel-
opment in language testing since the 1990s has been “the introduction of qualitative
research methodologies to design, describe, and, most importantly, to validate lan-
guage tests” (p. 2). Indeed, for Lazaraton, the use of qualitative research methods,
although relatively new to the field of applied linguistics (see, e.g., Edge & Richards,
1998), is now very well established in the field of education, leading, in Lazaraton’s
words, to “less skeptical reception” (p. 3) of the findings of qualitative studies in the
educational arena.
Lazaraton’s particular stance in her 2002 study was the use of conversation anal-
ysis (CA) as a methodological approach to investigating oral language performance
and interaction. As Lazaraton (1995) acknowledges, however, there is no single
qualitative approach but rather a range of different approaches. Lazaraton also
acknowledges the reality that qualitative data may be subject to quantitative inter-
rogation involving frequency calculations and inferential statistical analyses. She
notes, nevertheless, that from a qualitative perspective these analyses serve to help
make sense of the phenomena and themes emerging from the qualitative data, and
speculates, “we might ask why qualitative research is not more prevalent than it is
in applied linguistics, given our interest in the social and/or sociocultural context of
language learning and use” (p. 466).
A move towards greater acceptance of the qualitative paradigm in the context of
recognising the limitations of performance score outcomes alone has implications
for the study to be reported in this book. Certainly, in the New Zealand context,
those of us entrusted with developing the blueprints for interact were committed to
principles of validity and reliability. That is, we were concerned with designing an
assessment specification that would enhance genuine opportunities for FL students
to speak in the target language in meaningful ways and that would also discriminate
between different levels of performance. However, an assessment like interact,
designed to be embedded within the normal day-to-day operations of the classroom
and on-going programmes of teaching and learning, brings with it a contextual
social dimension which suggests that performance outcome evidence for validity
and reliability may be insufficient.
Bearing in mind that, with regard to the reform in question in this book, teachers
and students are the primary recipients of, as well as the primary stakeholders in, the
assessment process, it makes sense to ask them, now that the reform is under way,
what they are making of it. Their opinions, perspectives and experiences arguably
have something important to contribute to on-going validation of the new assess-
ment. The study that this book reports is therefore essentially a contribution to a
validity argument, but from the perspective of the stakeholders.
It may be argued, as adherents to the psychometric tradition might assert, and as
I suggest elsewhere (East, 2005), that a study such as the one reported here may
simply result in face validity – that is, the assessment is perceived to be reasonable,
fair, and appropriate, but the perception may be of little value to a robust validation
exercise. It may further be argued that although panels of teachers may be called
upon to add their voice during the process of assessment development (as they were
in the case of the development of interact – see Chap. 3), the opportunity for them
to express their opinions after an assessment has become fully operational does not
form part of the validation process (Haertel, 2002).
The counter-argument against failing to take account of stakeholder views is found in
the perspectives I have shared earlier in this chapter – that assessments have conse-
quences, and that a rigorous examination of those consequences (i.e., a consider-
ation of consequential validity and its contribution to an emerging validity argument)
is arguably necessary.
Stakeholder judgments about an assessment are important means of determining
its consequential validity (Chapelle, 1999; Crocker, 2002; Haertel, 2002; Kane,
2002; Shohamy, 2000, 2001a, 2006; Winke, 2011). Teachers in particular provide a
unique vantage point from which to gauge the effects of assessment on their stu-
dents (Norris, 2008). Including the perspectives of teachers in the assessment vali-
dation process can improve the validity of high-stakes assessments (Ryan, 2002).
This book focuses principally on the teachers as the key implementers of assess-
ment reform. Teachers, argues Winke (2011), provide “unique insight into the col-
lateral effects of tests. They administer tests, know their students and can see how
the testing affects them, and they recognize – sometimes even decide – how the tests
affect what is taught” (p. 633). As a consequence, teachers’ perspectives offer us
crucial information about intended or unintended impacts on the curriculum. Their
perspectives can therefore “shed light on the validity of the tests, that is, whether the
tests measure what they are supposed to and are justified in terms of their outcomes,
uses, and consequences” (p. 633).
Also, as I argued in East (2008a), we need to find out what the students them-
selves think. Rea-Dickins (1997) suggests that those who take the assessment are
the most important stakeholders who might be consulted about that assessment’s
utility. This view is in accord with Bachman and Palmer’s (1996) assertion that “one
way to promote the potential for positive impact is through involving test takers in
the design and development of the test, as well as collecting information from them
about their perceptions of the test and test tasks” (p. 32). It also lines up with
Messick’s (1989) recommendation that candidates’ perceptions should be included
as a crucial source of evidence for construct validity.
Taking the above arguments into account, this book reports a largely qualitative
study into the viewpoints of teachers and students on the outcomes, uses and conse-
quences of a new assessment of FL spoken communicative proficiency (interact) in
comparison with the more traditional test that it has replaced (converse). Its purpose
is to uncover stakeholder perspectives on the usefulness of the assessments that can
be used to inform validity arguments around different kinds of assessment such as
those anticipated by interact in comparison with converse. The fundamental ques-
tions addressed are these: What are teachers and students making of the innovation?
What is working, what is not working, what could work better? What are the impli-
cations, both for on-going classroom practice and for on-going evaluation of the
assessment? This book seeks to answer these questions by drawing on data col-
lected from a substantial 2-year research project which sought teachers’ and stu-
dents’ perspectives at two crucial stages in the assessment implementation
process – the end of 2012, when NCEA levels 1 and 2 had come on stream, and the
end of 2013, when the level 3 assessments were brought into play. The data focus
on these stakeholders’ perceptions of the comparative utility of interact and con-
verse. Conclusions are drawn that not only offer evidence to support or question the
validity of different kinds of assessment but that also illuminate the benefits and
challenges of assessment innovation.

1.5 The Structure of This Book

This chapter has provided an introduction to and rationale for the study that will be
the focus of this book. Situating the study within current conceptualisations of the
goals of language teaching, learning and assessment, the chapter has presented and
explained the traditional psychometric approach to assessment validation. It has
demonstrated the limitations of this approach and has presented an alternative that
focuses on the stakeholders.
Chapter 2 explores several key dimensions of assessing FL students’ spoken
communicative proficiency. It highlights the complexity of the issues and brings out
the reality that the New Zealand context for assessment reform is influenced in par-
ticular by three different areas of debate:
1. Which assessment model better serves the interests of students when it comes to
assessing their proficiency? Static or dynamic?4
2. Which theoretical framework should influence the kinds of tasks we expect FL
students to perform in order to demonstrate their spoken communicative profi-
ciency? Task-based or construct-based?
3. Which assessment condition is likely to yield better (i.e., more useful, valid and
reliable) evidence of FL students’ spoken communicative proficiency? The sin-
gle (interview) test or a paired/group assessment?
Building on the three theoretical domains explored in Chap. 2, Chap. 3 expands
on the brief introduction to New Zealand’s curriculum and assessment reforms pre-
sented in this chapter and explains them in more detail. In particular Chap. 3
addresses what the reforms, and the developers of the assessment blueprints, were
trying to achieve and the initial stakeholder perspectives that were received as part
of this process.

4 I use the terms ‘static’ and ‘dynamic’ to differentiate broadly between one-time tests that measure
performances at a particular point in time and on-going assessments that build in opportunities for
feedback and feedforward. Alternative differentiating terms include ‘summative’ and ‘formative’,
and assessments of and for learning.
Chapter 4 provides a detailed account of the methodology for the two-stage
study into stakeholders’ perspectives that is the focus of the remainder of the book.
Bachman and Palmer’s (1996) test usefulness framework is presented as the theo-
retical construct that underpins the study. The chapter articulates the expectations of
interact and evaluates the assessment theoretically against the six qualities in the
framework. The chapter concludes by explaining how the framework was opera-
tionalised in the study.
Chapters 5 and 6 present findings from Stage I of the study (2012) – responses to
a nationwide survey sent to teachers in charge of FL programmes in New Zealand’s
schools (n = 152), and interviews with teachers who had successfully introduced
interact at levels 1 and/or 2 (n = 14). The findings are presented in comparative
terms, that is, how participants perceived interact in practice in comparison with
converse at a crucial intermediate stage in the reform process when both assessment
types (converse and interact) would have been familiar to teachers. Chapter 5 focuses
on the perceived advantages of interact in comparison with converse. Chapter 6
focuses on the perceived disadvantages of interact in comparison with converse,
alongside suggestions for improvements to the new assessment.
Chapters 7 and 8 report on Stage II of the study which focuses on NCEA level 3,
the highest level of the examination. Findings are presented from interviews with
teachers using interact at NCEA level 3 (n = 13), and surveys administered to Year 13
students taking converse at level 3 (2012, n = 30) or interact at level 3 (2013, n = 119).
Chapter 7 explores data derived from the teacher interviews and presents teachers’
views on the operationalisation of interact at this highest level of examination in
comparison with converse. Chapter 8 concludes the presentation of findings from the teacher data and
provides the opportunity for the students, as primary recipients of the innovation, to
have the final word. The chapter, in common with the earlier chapters, explores per-
spectives in comparative terms, drawing on data from the student surveys.
Chapter 9 provides a summary of the key themes and issues emerging from the
data from both stages of the study. The chapter discusses the data in light of the
background material presented in Chaps. 1, 2, and 3. Findings are then related to
broader issues for the assessment of spoken communicative proficiency as opera-
tionalised in a variety of contexts. Recommendations for practice are presented,
based on the themes and issues discussed. The chapter concludes with the limita-
tions of the present study and directions for future research.

1.6 Conclusion

Kane (2002) argues that a traditional perspective on measurement as “an essentially
noninteractive monitoring device” has latterly turned into a recognition not only
that assessments have consequences but also that assessments can operate as
“the engines of reform and accountability in education.” He concludes that “[f]or
good or ill, these developments are likely to push the policy inferences and assump-
tions to center stage” (p. 33). If we transfer this argument to the introduction of
interact and the replacement of converse, it may be argued that an ideological theo-
retical perspective on effective language pedagogy has influenced the assessment
developers to introduce the new assessment as an ‘engine of reform’, not all of the
consequences of which have been beneficial. This does not necessarily debunk the
theory or bring into question the validity of the new assessment in the light of that
theory. But its claims must be open to scrutiny.
Pushing policy inferences and assumptions to the forefront also raises another
issue of concern. Shohamy (2001b) asserts that, among the consequences of high-
stakes assessment, those being assessed modify their behaviours so as to do as well
as they can on the assessments. She suggests that their willingness to do this “drives
decision makers and those in authority to introduce tests in order to cause test takers
to change their behavior along their lines,” leading, in her view, to “negative
effects on the quality of knowledge” (p. 113). As I have argued elsewhere (East,
2008b, p. 250), whether or not we accept Shohamy’s claim that knowledge quality
is diminished by the practice of centralised control, it is certainly evident that, when
it comes to high-stakes language assessments, centrally-based policy makers dictate
the types of assessment they are willing to sanction, and their decisions have an
influence, whether beneficial or not, on those taking the assessments.
Additionally, as Bachman and Palmer (2010) assert:
In any [assessment] situation, there will be a number of alternatives, each with advantages
and disadvantages. … If we assume that a single “best” test exists, and we attempt either to
use this test itself, or to use it as a model for developing a test of our own, we are likely to
end up with a test that will be inappropriate for at least some of our test takers. (p. 6)

Differential impacts from different kinds of assessments raise important issues
for stakeholders. In this light, those who advocate for the use of one particular
assessment over another need, in Bachman and Palmer’s (2010) words, “to be able
to demonstrate to stakeholders that the intended uses of their assessment are justi-
fied. This is particularly crucial in situations where high-stakes decisions will be
made at least in part on the basis of a language assessment” (p. 2).
It may be argued that a scrutiny of assessment scores is all we need to provide a
convincing justification for the use of one assessment over another. McNamara
(1997) contends, however, that “research in language testing cannot consist only of
a further burnishing of the already shiny chrome-plated quantitative armour of the
language tester with his (too often his) sophisticated statistical tools and impressive
n-size.” There is rather the need for “the inclusion of another kind of research on
language testing of a more fundamental kind, whose aim is to make us fully aware
of the nature and significance of assessment as a social act” (p. 460). This is espe-
cially the case when the stakes are high and when there are arguments for and
against particular assessment types. For Lazaraton (2002), whose interests were
specifically in assessments of speaking, language assessment as a discipline is
“in the midst of exciting changes in perspective” on the basis of an acceptance that
“the established psychometric methods for validating oral language tests are effec-
tive, but limited, and other validation methods are required” (p. 25).
This study is offered as a contribution to a ‘more fundamental’ kind of research
than that offered from a purely psychometric perspective. The result is a novel and
comprehensive study into educational innovation, language use and language learn-
ing that will be of interest to many involved in FL teaching and learning at a range
of levels, including practitioners, policy makers, researchers and assessment
specialists.

References

ACTFL. (2012). ACTFL proficiency guidelines 2012. Retrieved from http://www.actfl.org/publications/guidelines-and-manuals/actfl-proficiency-guidelines-2012
Anastasi, A., & Urbina, S. (1997). Psychological testing. Upper Saddle River, NJ: Prentice Hall.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, England: Oxford
University Press.
Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what we
count counts. Language Testing, 17(1), 1–42. http://dx.doi.org/10.1177/026553220001700101
Bachman, L. F., & Palmer, A. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford, England: Oxford University Press.
Bachman, L. F., & Palmer, A. (2010). Language assessment in practice: Developing language
assessments and justifying their use in the real world. Oxford, England: Oxford University
Press.
Brown, H. D. (2007). Principles of language learning and teaching (5th ed.). New York, NY:
Pearson.
Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19,
254–272. http://dx.doi.org/10.1017/s0267190599190135
Cohen, R. J., & Swerdlik, M. E. (2005). Psychological testing and assessment: An introduction to
tests and measurement (6th ed.). New York, NY: McGraw Hill.
Council of Europe. (1998). Modern languages: Teaching, assessment. A common European frame-
work of reference. Strasbourg, France: Council of Europe.
Council of Europe. (2001). Common European framework of reference for languages. Cambridge,
England: Cambridge University Press.
Crocker, L. (2002). Stakeholders in comprehensive validation of standards-based assessments: A
commentary. Educational Measurement: Issues and Practice, 22, 5–6. http://dx.doi.
org/10.1111/j.1745-3992.2002.tb00079.x
De Ridder, I., Vangehuchten, L., & Seseña Gómez, M. (2007). Enhancing automaticity through
task-based language learning. Applied Linguistics, 28(2), 309–315. http://dx.doi.org/10.1093/
applin/aml057
DeKeyser, R. M. (2001). Automaticity and automatization. In P. Robinson (Ed.), Cognition and
second language instruction (pp. 125–151). Cambridge, England: Cambridge University Press.
http://dx.doi.org/10.1017/cbo9781139524780.007
East, M. (2005). Using support resources in writing assessments: Test taker perceptions. New
Zealand Studies in Applied Linguistics, 11(1), 21–36.
East, M. (2007). Bilingual dictionaries in tests of L2 writing proficiency: Do they make a differ-
ence? Language Testing, 24(3), 331–353. http://dx.doi.org/10.1177/0265532207077203
East, M. (2008a). Dictionary use in foreign language writing exams: Impact and implications.
Amsterdam, Netherlands/Philadelphia, PA: John Benjamins. http://dx.doi.org/10.1075/lllt.22
East, M. (2008b). Language evaluation policies and the use of support resources in assessments of
language proficiency. Current Issues in Language Planning, 9(3), 249–261. http://dx.doi.
org/10.1080/14664200802139539
East, M. (2009). Evaluating the reliability of a detailed analytic scoring rubric for foreign language
writing. Assessing Writing, 14(2), 88–115. http://dx.doi.org/10.1016/j.asw.2009.04.001
East, M. (2012). Task-based language teaching from the teachers’ perspective: Insights from New
Zealand. Amsterdam, Netherlands / Philadelphia, PA: John Benjamins. http://dx.doi.
org/10.1075/tblt.3
East, M. (2015). Taking communication to task – again: What difference does a decade make? The
Language Learning Journal, 43(1), 6–19. http://dx.doi.org/10.1080/09571736.2012.723729
Edge, J., & Richards, K. (1998). May I see your warrant please?: Justifying outcomes in qualitative
research. Applied Linguistics, 19, 334–356. http://dx.doi.org/10.1093/applin/19.3.334
Elder, C. (1997). What does test bias have to do with fairness? Language Testing, 14(3), 261–277.
http://dx.doi.org/10.1177/026553229701400304
Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks:
what does the test-taker have to offer? Language Testing, 19(4), 347–368. http://dx.doi.org/10.
1191/0265532202lt235oa
Gardner, J., Harlen, W., Hayward, L., & Stobart, G. (2008). Changing assessment practice:
Process, principles and standards. Belfast, Northern Ireland: Assessment Reform Group.
Haertel, E. H. (2002). Standard setting as a participatory process: Implications for validation of
standards-based accountability programs. Educational Measurement: Issues and Practice, 22,
16–22. http://dx.doi.org/10.1111/j.1745-3992.2002.tb00081.x
Hedge, T. (2000). Teaching and learning in the language classroom. Oxford, England: Oxford
University Press.
Higgs, T. V. (Ed.). (1984). Teaching for proficiency: The organizing principle. Lincolnwood, IL:
National Textbook Company.
Hunter, D. (2009). Communicative language teaching and the ELT Journal: a corpus-based
approach to the history of a discourse. Unpublished doctoral thesis. University of Warwick,
Warwick, England.
Kane, M. J. (2002). Validating high-stakes testing programs. Educational Measurement: Issues
and Practice, 21(1), 31–42. http://dx.doi.org/10.1111/j.1745-3992.2002.tb00083.x
Kaplan, R. M., & Saccuzzo, D. P. (2012). Psychological testing: Principles, applications, and
issues (8th ed.). Belmont, CA: Wadsworth, Cengage Learning.
Kline, P. (2000). Handbook of psychological testing (2nd ed.). London, England: Routledge. http://
dx.doi.org/10.4324/9781315812274
Kramsch, C. (1986). From language proficiency to interactional competence. The Modern
Language Journal, 70(4), 366–372. http://dx.doi.org/10.1111/j.1540-4781.1986.tb05291.x
Kramsch, C. (1987). The proficiency movement: Second language acquisition perspectives.
Studies in Second Language Acquisition, 9(3), 355–362. http://dx.doi.org/10.1017/
s0272263100006732
Kunnan, A. J. (2000). Fairness and justice for all. In A. J. Kunnan (Ed.), Fairness and validation in
language assessment (pp. 1–14). Cambridge, England: Cambridge University Press.
Lazaraton, A. (1995). Qualitative research in applied linguistics: A progress report. TESOL
Quarterly, 29(3), 455–472. http://dx.doi.org/10.2307/3588071
Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge,
England: Cambridge University Press.
Leung, C. (2005). Convivial communication: Recontextualizing communicative competence.
International Journal of Applied Linguistics, 15(2), 119–144. http://dx.doi.
org/10.1111/j.1473-4192.2005.00084.x
Long, M. (1983). Native speaker/non-native speaker conversation and the negotiation of compre-
hensible input. Applied Linguistics, 4(2), 126–141. http://dx.doi.org/10.1093/applin/4.2.126
Long, M. (1996). The role of the linguistic environment in second language acquisition. In
W. Ritchie & T. Bhatia (Eds.), Handbook of second language acquisition (pp. 413–468).
New York, NY: Academic.
Madaus, G. F., & Kellaghan, T. (1992). Curriculum evaluation and assessment. In P. W. Jackson
(Ed.), Handbook on research on curriculum (pp. 119–154). New York, NY: Macmillan.
McNamara, T. (1997). ‘Interaction’ in second language performance assessment: Whose perfor-
mance? Applied Linguistics, 18(4), 446–466. http://dx.doi.org/10.1093/applin/18.4.446
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Malden, MA:
Blackwell.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103).
New York, NY: Macmillan.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’
responses and performances as scientific inquiry into score meaning. American Psychologist,
50, 741–749. http://dx.doi.org/10.1037//0003-066x.50.9.741
Ministry of Education. (2007). The New Zealand Curriculum. Wellington, NZ: Learning Media.
Mislevy, R., Wilson, M. R., Ercikan, K., & Chudowsky, N. (2003). Psychometric principles in
student assessment. In T. Kellaghan, & D. L. Stufflebeam (Eds.), International handbook of
educational evaluation (Vol. 9, pp. 489–531). Dordrecht, Netherlands: Kluwer Academic
Publishers. http://dx.doi.org/10.1007/978-94-010-0309-4_31
Morrow, K. (1991). Evaluating communicative tests. In S. Anivan (Ed.), Current developments in
language testing (pp. 111–118). Singapore, Singapore: SEAMEO Regional Language Centre.
Newton, P., & Shaw, S. (2014). Validity in educational and psychological assessment. London,
England: Sage.
Norris, J. (2002). Interpretations, intended uses and designs in task-based language assessment.
Language Testing, 19(4), 337–346. http://dx.doi.org/10.1191/0265532202lt234ed
Norris, J. (2008). Validity evaluation in language assessment. Frankfurt am Main, Germany: Peter
Lang.
Nunan, D. (2004). Task-based language teaching. Cambridge, England: Cambridge University
Press. http://dx.doi.org/10.1017/cbo9780511667336
Philp, J., Adams, R., & Iwashita, N. (2014). Peer interaction and second language learning.
New York, NY: Routledge. http://dx.doi.org/10.4324/9780203551349
Rea-Dickins, P. (1997). So, why do we need relationships with stakeholders in language testing?
A view from the UK. Language Testing, 14(3), 304–314. http://dx.doi.
org/10.1177/026553229701400307
Richards, J. C. (2001). Curriculum development in language teaching. Cambridge, England:
Cambridge University Press. http://dx.doi.org/10.1017/cbo9780511667220
Richards, J. C. (2006). Communicative language teaching today. Cambridge, England: Cambridge
University Press.
Richards, J. C., & Rodgers, T. S. (2014). Approaches and methods in language teaching (3rd ed.).
Cambridge, England: Cambridge University Press.
Ryan, K. (2002). Assessment validation in the context of high-stakes assessment. Educational
Measurement: Issues and Practice, 22, 7–15. http://dx.doi.org/10.1111/j.1745-3992.2002.
tb00080.x
Savignon, S. (2005). Communicative language teaching: Strategies and goals. In E. Hinkel (Ed.),
Handbook of research in second language teaching and learning (pp. 635–651). Mahwah, NJ:
Lawrence Erlbaum.
Segalowitz, N. (2005). Automaticity and second languages. In C. J. Doughty, & M. H. Long (Eds.),
The handbook of second language acquisition (pp. 381–408). Oxford, England: Blackwell.
http://dx.doi.org/10.1002/9780470756492.ch13
Shohamy, E. (2000). Fairness in language testing. In A. J. Kunnan (Ed.), Fairness and validation
in language assessment (pp. 15–19). Cambridge, England: Cambridge University Press.
Shohamy, E. (2001a). The power of tests: A critical perspective on the uses of language tests.
Harlow, England: Longman/Pearson. http://dx.doi.org/10.4324/9781315837970
Shohamy, E. (2001b). The social responsibility of the language testers. In R. L. Cooper (Ed.), New
perspectives and issues in educational language policy (pp. 113–130). Amsterdam,
Netherlands/Philadelphia, PA: John Benjamins Publishing Company. http://dx.doi.
org/10.1075/z.104.09sho
Shohamy, E. (2006). Language policy: Hidden agendas and new approaches. New York, NY:
Routledge. http://dx.doi.org/10.4324/9780203387962
Shohamy, E. (2007). Tests as power tools: Looking back, looking forward. In J. Fox, M. Wesche,
D. Bayliss, L. Cheng, C. E. Turner, & C. Doe (Eds.), Language testing reconsidered (pp. 141–
152). Ottawa, Canada: University of Ottawa Press.
Spada, N. (2007). Communicative language teaching: Current status and future prospects. In
J. Cummins, & C. Davison (Eds.), International handbook of English language teaching
(pp. 271–288). New York, NY: Springer. http://dx.doi.org/10.1007/978-0-387-46301-8_20
Spolsky, B. (1995). Measured words. Oxford, England: Oxford University Press.
Tomlinson, B. (Ed.). (2011). Materials development in language teaching (2nd ed.). Cambridge,
England: Cambridge University Press.
Willis, D., & Willis, J. (2007). Doing task-based teaching. Oxford, England: Oxford University
Press.
Winke, P. (2011). Evaluating the validity of a high-stakes ESL test: Why teachers’ perceptions
matter. TESOL Quarterly, 45(4), 628–660. http://onlinelibrary.wiley.com/doi/10.5054/
tq.2011.268063/abstract
Wood, R. (1993). Assessment and testing. Cambridge, England: Cambridge University Press.
Chapter 2
Assessing Spoken Proficiency: What Are the Issues?

2.1 Introduction

In Chap. 1 I argued that communication for real world purposes has now become
well established as the goal of current approaches to language teaching and learn-
ing. In essence, the CLT framework is seen by many as “the most influential
approach in the history of second/foreign language instruction” (Spada, 2007,
p. 283) and “persists in the present period as the dominant model for language
teaching and learning” (Hunter, 2009, p. 1).
In recent developments to the overarching framework, such as TBLT, the four
skills model of listening, reading, writing and speaking has become more integrated.
Speaking is, however, arguably at the core, with the objective of “developing learn-
ers’ fluency and accuracy, as well as their sociocultural communicative competence
requiring adapting the language from context to context and from genre to genre”
(Hinkel, 2010, p. 123). It makes sense, therefore, that developing FL students’ spo-
ken communicative proficiency will be a significant component of FL teaching and
learning programmes that operate within current CLT frameworks, particularly
where an emphasis is placed on ‘meaningful, authentic exchanges’ and ‘interper-
sonal negotiation among learners’ (Brown, 2007). It also makes sense that spoken
communicative proficiency will be an important focus for assessment, and that valid
assessments of this proficiency will aim to measure instances of language use as
authentically as possible.
The centrality of speaking skills to the contemporary FL classroom raises two
essential questions for assessment which I aim to answer in this chapter: what does it
mean to speak proficiently in the FL? What modes of assessment might best capture
authentic instances of spoken proficiency for measurement purposes? A fundamental
task is therefore to define a spoken communicative proficiency construct that informs,
from a theoretical perspective, what current communicative approaches to language
teaching and learning aim to achieve. Following on from that emerge considerations
of how to tap into facets of that construct for purposes of assessment. In this chapter
I situate assessments of spoken communicative proficiency within a broader under-
standing of communicative competence. I take one early influential model of com-
municative competence (Canale, 1983; Canale & Swain, 1980) as my starting point
for articulating what it means to speak proficiently in the target language. I go on to
consider a range of issues that concern how we might most effectively measure FL
students’ spoken communicative proficiency in high-stakes contexts.

2.2 What Does It Mean to Communicate Proficiently?

2.2.1 Communicative Competence as the Underlying Theoretical Framework

The well-established Canale and Swain framework provides a useful starting-point
for articulating what a construct of spoken communicative proficiency might look
like.
Canale and Swain’s (1980) aim was to “establish a clear statement of the content
and boundaries of communicative competence … that will lead to more useful and
effective second language teaching, and allow more valid and reliable measurement
of second language communication skills” (p. 1). According to the Canale and
Swain framework, proficiency in using a language for communicative purposes
could be seen as involving four essential dimensions of competence, neatly
expressed by Canale (1983) in this way:
1. Grammatical competence: ability to use the ‘language code’ accurately, includ-
ing correct lexis and spelling, accurate formation of words and sentences, and
pronunciation.
2. Sociolinguistic competence: ability to use and understand language appropriate
to different sociolinguistic contexts, and to choose suitable meanings and forms
for the context.
3. Discourse competence: ability to create unified texts in different modes (such as
spontaneous conversation or discursive essay-writing) by combining and inter-
preting appropriate meanings, and applying cohesion and coherence rules
appropriately.
4. Strategic competence: ability to draw on verbal and nonverbal strategies both to
“compensate for breakdowns in communication due to insufficient competence
or to performance limitations” and to “enhance the rhetorical effect of utter-
ances” (Canale, 1983, p. 339).
If these four principles or dimensions of competence are applied to speaking, the
measurement of a spoken communicative proficiency construct may be related to
performances that would demonstrate the following:
1. Grammatical proficiency: the FL speaker would be able to demonstrate profi-
ciency in applying the grammatical rules that underpin the language, i.e., speak
using accurate language, including adequate pronunciation.
2. Sociolinguistic proficiency: the FL speaker would be able to demonstrate the use
of appropriate language that is fit for the context.
3. Discourse proficiency: the FL speaker would be able to demonstrate an ability to
use extended discourse cohesively and coherently, and therefore fluently.
4. Strategic proficiency: the FL speaker would be able to demonstrate “how to cope
in an authentic communicative situation and how to keep the communicative
channel open” (Canale & Swain, 1980, p. 25) by using appropriate compensa-
tory strategies (questioning, hesitation, etc.).
At its simplest, and based on these criteria, a valid assessment of spoken com-
municative proficiency would be one that measures students’ proficiency across the
four dimensions of communicative competence in accordance with students’ abili-
ties and stage in learning. (That is, as I noted in Chap. 1, automaticity and profi-
ciency are relative and not absolute, and frameworks such as the ACTFL Guidelines
[ACTFL, 2012] or the CEFR [Council of Europe, 2001] provide means of articulat-
ing this relativity.)

2.2.2 Developing the Framework of Communicative Competence

Applying the Canale and Swain model to assessments in practice, the first three
components could arguably be demonstrated (and assessed) by looking at students’
individual spoken performances, for example in the delivery of a monologue (e.g.,
a phone message, a short presentation, a speech or a lecture). In turn, the first three
components might be demonstrated via pre-planned and pre-rehearsed scripting.
This does not necessarily disauthenticate the assessment (after all, writing out what
you want to say prior to recording a short telephone message may be the appropriate
thing to do in some circumstances). However, the fourth component (strategic com-
petence) implies a dimension of reciprocity – that there are at least two interlocutors
involved. Implicit here is that, at least with regard to speaking, monologues argu-
ably do not provide a complete account of what FL students should be able to do
with spoken language communicatively. That is, FL students’ spoken communica-
tive proficiency cannot be fully determined without reference to some kind of inter-
actional ability, and this interactional ability presupposes the ability to deal with the
unexpected.
In this regard, Kramsch (1986) poses the question of whether proficiency is synony-
mous with what she terms interactional competence. In Kramsch's view, a profi-
ciency-oriented assessment that stresses "behavioural functions and the lexical and
grammatical forms of the language” – characteristics that may arguably be mea-
sured against the first three components of the Canale and Swain model and via a
pre-planned monologic assessment – overlooks the “dynamic process of communi-
cation” (p. 368, my emphasis).
The interactional dimension of communicative proficiency is in fact not lost on
Canale and Swain (1980). Indeed, they assert that FL students "must have the oppor-
tunity to take part in meaningful communicative interaction with highly competent
speakers of the language, i.e. to respond to genuine communicative needs in realis-
tic second language situations” (p. 27, my emphasis). In their view, this exposure to
authentic communicative situations is crucial if communicative competence is to
become communicative confidence (or we might say automaticity). An insistence
on providing adequate interactional opportunities is therefore “motivated strongly
by the theoretical distinction between communicative competence and communica-
tive performance,” and is therefore “significant not only with respect to classroom
activities but to testing as well” (p. 27).
Interactional competence may be theorised as incorporating all four components
of Canale and Swain’s model. However, subsequent developments call into question
whether the model is sufficiently complete to take into account all necessary facets
of successful communicative interaction.
In the context of recognising the importance of interaction to the assessment of
spoken communicative proficiency, Martínez-Flor, Usó-Juan, and Alcón Soler
(2006), for example, offer a more elaborate notion of discourse competence than
Canale and Swain. They include, in addition to cohesion and coherence, “knowl-
edge of discourse markers (e.g., well, oh, I see, okay)” and “the management of
various conversational rules (e.g., turn-taking mechanisms, how to open and close a
conversation)” (p. 147). Additionally, they argue for the inclusion of pragmatic
competence. They see this as interlocutors’ knowledge of “the function or illocu-
tionary force implied in the utterance they intend to produce as well as the contex-
tual factors that affect the appropriacy of such an utterance,” along with “how to
vary their spoken utterances appropriately with respect to register, that is, when to
use formal or informal styles” (p. 149). Roever (2011) notes, however, that prag-
matic competence already forms part of all major models of communicative compe-
tence, including Canale and Swain’s. It is therefore arguably included in Canale and
Swain’s articulation of discourse and sociolinguistic competence, and Martínez-
Flor et al. offer us an elaboration that clarifies dimensions of the Canale and Swain
model.
Martínez-Flor et al. (2006) also suggest the inclusion of intercultural compe-
tence, or knowledge of “how to produce an appropriate spoken text within a particu-
lar sociocultural context.” This, in their view, includes, in addition to language
choice, being aware of “the rules of behavior that exist in a particular community in
order to avoid possible miscommunication” as well as “non-verbal means of com-
munication (i.e., body language, facial expressions, eye contact, etc.)” (p. 150). In
this connection, Kramsch (1986) presents an early understanding of what appears to
be missing from the Canale and Swain model. Drawing on the example of a cus-
tomer ordering “the legendary cup of coffee in a French restaurant after 3 years of
French,” she argues that the challenges that may be encountered are most likely not
due to not knowing the appropriate grammar or lexis, or not knowing basic
behavioural rules. More probably the challenges come down to differences in per-
ception and understanding of “the different social relationships existing in France
between waiters and customers, of the different affective, social, and cultural values
attached to cups of coffee, of the different perception French waiters might have of
American citizens” (p. 368). Fundamentally, the challenges reside in the different
expectations, assumptions and perspectives that can exist between two interlocutors
from essentially different worlds.
Certainly at the linguistic level, Canale and Swain’s model of sociolinguistic
competence (or, in the words of Martínez-Flor et al., 2006, how to produce an
appropriate spoken text within a particular sociocultural context) is arguably suffi-
cient to ensure successful interaction. Kramsch’s (2005) argument is, however, that
interactional competence is “more than just learning to get one’s message across,
clearly, accurately, and appropriately, or even to interact successfully with native
speakers” (p. 551, my emphasis). Interactional competence must contain a dimen-
sion that goes beyond language use. This additional dimension has prompted an
augmentation of the communicative competence construct to include intercul-
tural communicative competence. This development has given rise to a rich and
varied literature spanning several decades (e.g., Byram, 1997, 2008, 2009; Byram,
Gribkova, & Starkey, 2002; Byram, Holmes, & Savvides, 2013; Liddicoat, 2005,
2008; Liddicoat & Crozet, 2000; Lo Bianco, Liddicoat, & Crozet, 1999).
Taking perspectives concerning intercultural communicative competence into
account, for spoken interactions to be effective they arguably require some level of
understanding of, and competence in, appropriate interactional behaviour (when,
for example, it is appropriate, in France, to shake someone’s hand or kiss them on
the cheek – faire la bise). Inappropriate behaviour may lead to a breakdown in com-
munication that is not related to linguistic proficiency but is nonetheless related to
intercultural proficiency (or lack thereof). From this stance, intercultural profi-
ciency – what Hinkel (2010) refers to as the sociocultural – becomes part of the
interaction and therefore arguably part of a spoken communicative proficiency con-
struct that we need to measure.
Intercultural communicative competence is an important theoretical construct
that should inform contemporary FL teaching and learning programmes (see, e.g.,
East, 2012, in this regard). However, its role as an underlying competence to be
assessed is complex (see Sercu, 2010, for a discussion of the challenges involved in
assessing intercultural competence, including the multicomponential nature of the
construct, lack of an holistic measure of intercultural communicative competence,
and problems with articulating levels of intercultural proficiency and objectivising
or scoring this proficiency). It is consequently a matter of debate where, and how,
such competence fits within the assessment of spoken communicative proficiency.
The debate raises several questions: can intercultural communicative competence
be measured adequately or straightforwardly as part of a spoken communicative
proficiency construct? Can (or should) intercultural communicative competence be
made a discrete facet of a spoken communicative proficiency construct (alongside
grammatical, sociolinguistic, discourse and strategic competence)?
Certainly, awareness of appropriate rules of behaviour and of non-verbal means of
communication (Martínez-Flor et al., 2006) is likely to influence the effectiveness
of FL students' spoken interactions. It must therefore somehow be taken into
account in determining FL users' spoken communicative proficiency.
A counter-argument is that, even though intercultural communicative compe-
tence may form part of a broader 'assessment package' (Byram, 2000; Dervin,
2010), formulating this competence as a discrete facet of FL students’ spoken com-
municative proficiency is arguably not necessary. Intercultural competence may
(directly or indirectly) inform the sociolinguistic, discourse and strategic choices
that an FL speaker might make. In terms of measuring students’ FL spoken com-
municative proficiency, it may be that the problems Kramsch (1986) identifies in
ordering (or failing to order) a cup of coffee, can be resolved (and assessed) at the
linguistic level. That is, an FL speaker who is proficient in French grammatically,
sociolinguistically, discoursally and strategically at a relatively basic functional
level would most likely succeed in ordering and acquiring a cup of coffee in a
French restaurant. From this perspective, intercultural awareness is implicit in FL
users’ sociolinguistic, discourse and strategic choices (although the awareness may
benefit from more transparent articulation in assessment).
In summary, the Canale and Swain model (Canale, 1983; Canale & Swain,
1980), although open to critique due to its potential incompleteness, provides a use-
ful and foundational means of defining communicative competence. Indeed, Brown
(2007) argues that this model, although presented in the 1980s and subsequently
developed by others such as Bachman (1990), remains a fundamental reference
point for describing what is meant by communicative competence in relation to
second language teaching. In turn, the model provides a straightforward, useful and
relevant way of conceptualising, for purposes of teaching, learning and assessment,
what it means to speak proficiently in an FL. (Indeed, the facets of the Canale and
Swain framework are discernible in more detailed frameworks such as the ACTFL
Guidelines.)
The above discussion, intended to lay a foundation for considering the assess-
ment of spoken communicative proficiency, may lead to the conclusion that assess-
ing speaking is a simple matter: test setters design a test with two aims in mind.
They wish to represent adequately what Bachman and Palmer (1996) refer to as a
target language use or TLU domain (ordering a cup of coffee in a French restaurant
is one very simple example). They also wish to measure adequately the different
facets of a spoken communicative proficiency construct as demonstrated in the per-
formance (i.e., to what extent the display of a range of competencies is sufficient to
fulfil the task and achieve the outcome). The test is administered, perhaps with the
examiner playing the role of the waiter, and the scoring provides an indication of
relative performances across the facets of interest.
However, deciding on the most effective ways of assessing FL students’ spoken
communicative proficiency, especially in high-stakes/accountability contexts,
means not only taking into consideration a spoken communicative proficiency con-
struct and a task that represents a TLU domain. A range of factors that influence
contemporary conceptualisations of effective assessment practices need to be taken
into account. In what follows, I focus on three independent but intersecting dimen-
sions of assessment practice and relate these to the assessment of FL speaking:
static or dynamic; task-based or construct based; single or paired/group
performances.

2.3 Static or Dynamic

Over 20 years ago, Gipps (1994) wrote of a ‘paradigm shift’ that was moving edu-
cational thinking away from a testing and examination culture towards a broader
model of educational assessment that would include, in addition to standardised
tests, a range of assessment instruments (such as classroom assessments, practical
and oral assessments, coursework and portfolios) and a variety of approaches
(norm-referenced, criterion-referenced, formative and performance-based). The
shift, she argued, was precipitated by a requirement for assessment to fulfil a wide
range of purposes. This reality has meant that “the major traditional model under-
pinning assessment theory, the psychometric model, is no longer adequate, hence
the paradigm shift” (p. 1). With these words, Gipps (1994) acknowledged a process
that has been occurring over the past 50 years, raising several key issues that still
resonate well into the twenty-first century. Indeed, Gipps’ argument is reproduced
in Torrance (2013b), who notes (2013a) that Gipps’ contribution, alongside others,
represents perspectives that “in many respects summarise where the field of assess-
ment is now” (p. 16).
Gipps and Murphy (1994) argue that assessment, broadly speaking, is required
to fulfil one of two goals: a ‘managerial and accountability’ goal and a ‘professional
and learning’ goal. The first kind of assessment may be called ‘summative’ in that
its purpose is to measure, at the end of a course or a series of lessons, the capability
of students in relation to the goals of the programme. The second kind of assessment
may be called ‘formative’ in that it sits within the teaching and learning process and
builds within it opportunities for feedback and feedforward. Another way of con-
ceptualising the difference between the two foci is to use the descriptors the ‘assess-
ment of learning’ and ‘assessment for learning’ (ARG, 1999, 2002a, 2002b).
The focus of summative assessment is on outcomes. Its most traditional realisa-
tion is the static timed examination, designed to test subject-matter acquisition and
retention at a particular point in time (Gipps, 1994). By contrast, the focus of forma-
tive assessment is on enhancing learning. The directly interventionist Dynamic
Assessment (DA) model that incorporates “modifying learner performance during
the assessment itself” (Poehner & Lantolf, 2005, p. 235, my emphases) represents
the starkest contrast to summative tests. Although Poehner and Lantolf maintain
that it is not possible to make a simple comparison between DA and formative
assessment for learning, Leung (2007) asserts that the two approaches do share a
similar pedagogic or pro-learning orientation, and assessment for learning can be
seen as “a fellow-traveller” with DA with regard to “conceptual leanings, assess-
ment sensibilities and educational orientation” (p. 257).
There is a sense in which the two foci for assessment identified by Gipps and
Murphy (1994) operate on a continuum of practice, with timed examinations at one
end, and DA at the other, and different modes and types of assessment sitting at dif-
ferent points between the two. In what follows, I use the terms ‘static’ and ‘dynamic’
as labels to differentiate between the two broad conceptualisations. The labels dis-
tinguish between a static or unchanging model in the sense that summative tests
take place at a single point in time and measure performances at that time, and a
dynamic or changing model in the sense that, when assessments take place on more
than one occasion, or when re-assessment opportunities are possible, learners’ per-
formances will likely change on subsequent occasions by virtue of some kind of
intervention.

2.3.1 The Static Assessment Paradigm

In high-stakes contexts, the summative end-of-course examination has been the tra-
ditional and well-established means of assessing the outcomes of teaching and
learning for many years. The summative examination has its basis in a behaviourist
product-oriented and knowledge-based approach to learning. This approach empha-
sises the discriminatory nature of tests, that is, their ability both to identify different
levels of test taker proficiency and to predict future academic performance.
Performance outcomes, presented as a mark or grade of some kind, are used to rank
students relative to one another. The primary concern of test developers is to ensure
standardisation, that is, that all candidates are measured in a uniform way, and that
the grades they receive are meaningful indicators of relative ability. As Wajda
(2011) observes, “[t]he basic pragmatic and ethical premises of this orientation are
accountability and fairness understood as objectivity and equal treatment of test-
takers” (p. 278).
Tests and examinations are therefore designed to evaluate an aspect or aspects of
students’ learning in a formalised way. Furthermore, tests and examinations, par-
ticularly for high-stakes measurement purposes, seem “as unavoidable as tomor-
row’s sunrise in virtually all educational settings around the world” (Brown &
Abeywickrama, 2010, p. 1). They have become “conventional methods of measure-
ment,” their gate-keeping function regarded as “an acceptable norm” (p. 1), and
their primary concerns seen as validity and reliability, the “meat and potatoes of the
measurement game” (Popham, 2006, p. 100).
In the arena of languages assessment, end-of-course examinations remain as
standard practice in many FL assessment contexts, including those that purport to
measure communicative proficiency (see, e.g., University of Cambridge, 2014).
Static one-time speaking tests are therefore normative in a range of contexts, with
the implication that they are sufficient and ‘fit for purpose’ with regard to measuring
communicative proficiency (Luoma, 2004). By this argument, a test in which candi-
dates are required (for example) to order a cup of coffee and to interact with and
respond to the waiter would represent a valid means of assessing this TLU domain.
Brown and Abeywickrama (2010) argue, however, that language tests as opera-
tionalised within a more traditional behaviourist model would frequently examine
sentence-level grammatical proficiency, knowledge of vocabulary items and ability
to translate from one language to the other. Such tests incorporate minimal, if any,
focus on authentic communication. The discrete-point language test that fitted
within this paradigm arguably purported to examine proficiency in the four skills –
listening, reading, writing and speaking – but would emphasise the discrete nature
of these skills, alongside testing discrete aspects of them, such as the limited reper-
toire of interactional skills required to order and acquire a cup of coffee. Static tests
arguably offer us incomplete evidence of performance.
In Chap. 1 I raised a more fundamental problem with summative examinations
built on the psychometric model. I presented Shohamy’s (2007) perspective that
tests were “a hurdle, an unpleasant experience” which turned “the enjoyment and
fun of learning into pain, tension, and a feeling of unfairness” (p. 142). Brown and
Abeywickrama (2010) offer a similar perspective: tests, they argue, have “a way of
scaring students,” and of engendering feelings of anxiety and self-questioning
“along with a fervent hope that you would come out on the other end with at least a
sense of worthiness.” Brown and Abeywickrama conclude that “[t]he fear of failure
is perhaps one of the strongest negative emotions a student can experience, and the
most common instrument inflicting such fear is the test” (p. 1).
In high-stakes contexts, where there may be serious consequences for test takers
on the basis of the test scores, the affective impact of the test taking process may
have negative implications for the accurate measurement of candidates’ actual abili-
ties (see Chap. 1 for a discussion of this and its implications for validity and
reliability).
The drive towards what Gipps (1994) called a ‘paradigm shift’ has arisen partly to
address the negative connotations and negative impact of tests. So-called ‘alternative
assessments’ are now commonly in use alongside written examinations and stan-
dardised tests. This has led in practice to a rejection of the sufficiency of the psycho-
metric model with its exclusive focus on validity and reliability as the fundamental
measurement characteristics of tests. These alternative kinds of assessment fit more
comfortably within a formative, dynamic assessment for learning paradigm.

2.3.2 The Dynamic Assessment Paradigm

Dynamic assessment for learning, in contrast to the static assessment of learning,
sits within a constructivist process-oriented approach to teaching and learning
which favours on-going assessment and opportunities for feedback. This type of
assessment is concerned with bringing out the best performances of those who are
being assessed by using procedures that ‘bias for best’ and ‘work for washback’
(Swain, 1984).
In establishing the impetus for alternative modes of assessment in the UK, the
fundamental argument of the Assessment Reform Group (ARG, 1999) was that
there is “no evidence that increasing the amount of testing will enhance learning.
Instead the focus needs to be on helping teachers use assessment, as part of teaching
and learning, in ways that will raise pupils’ achievement.” In other words, “assess-
ment as a regular element in classroom work holds the key to better learning” (p. 2).
Assessment for learning therefore essentially has a feedback-feedforward goal:
assessment becomes “the process of seeking and interpreting evidence for use by
learners and their teachers to decide where the learners are in their learning, where
they need to go and how best to get there” (ARG, 2002a).
Assessment for learning provides scope for assessment no longer to be seen as
“an activity that is distinct from, and perhaps even at odds with, the goals of teach-
ing” (Poehner, 2008, p. 4). Poehner goes on to explore in depth the phenomenon of
Dynamic Assessment (DA) as a model in which teaching, learning and assessment
are seamlessly interwoven. DA is therefore more than using assessment activities
for formative purposes whereby feedback is offered on the assessment activity. In
DA, the feedback becomes part of the assessment, and there is no distinction
between assessment and teaching/learning.
The Dynamic Assessment model allows those who are being assessed, and those
doing the assessing, to move away from “observing individuals’ independent per-
formance [which] reveals, at best, the results of past development” (Poehner, 2008,
p. 1). Rather, DA is built on the Vygotskian premise of enabling students to work
within their zones of proximal development (ZPD), thereby mediating “the distance
between the actual developmental level as determined by independent problem
solving and the level of potential development as determined through problem solv-
ing under adult guidance, or in collaboration with more capable peers” (Vygotsky,
1978, p. 86). This active collaboration “simultaneously reveals the full range of
[learners’] abilities and promotes their development. In educational contexts, this
means that assessment – understanding learners’ abilities – and instruction – sup-
porting learner development – are a dialectically integrated activity” (Poehner,
2008, pp. 1–2). This integration “occurs as intervention is embedded within the
assessment procedure in order to interpret individuals’ abilities and lead them to
higher levels of functioning” (p. 6).
Dynamic Assessment represents a directly interventionist model. If, however,
assessment is to be used both for learning and for measurement purposes, more
indirect intervention practices may be required so that feedback and feedforward
support, but do not detract from, evidence of learners’ own work. One means of
operationalising more indirect intervention may be through the use of on-going
coursework that contributes to an assessed portfolio. Assessment portfolios provide
the opportunity for students to collect samples of work in the context of their teach-
ing and learning programme, on which they may receive feedback with a view to
enhancing the quality of the submissions. The collection of evidence may then be
submitted at the end of the course and graded summatively. In portfolio assessment,
the teacher thus operates as both instructor and assessor (Rea-Dickins, 2004).
Portfolio assessment invites students to provide the best samples of their work
which demonstrate what they know and can do (Sunstein & Lovell, 2000). In terms
of assessing languages, it may be argued that the coursework/portfolio option is
“well suited to an understanding of assessing communicative competence in a way
that provides opportunity for those being assessed to demonstrate the full extent of
their proficiency” (East, 2008a, p. 27).

2.3.3 Static or Dynamic – A Complex Relationship

The use of portfolios for assessment purposes brings out the genuine tension
between static and dynamic, and the two contrasting assessment goals of manage-
ment and accountability versus skills-development and learning. From a psycho-
metric perspective, portfolios are challenging when used for measurement and
accountability purposes. The work submitted may be highly individualised, making
comparisons between students problematic. This individualisation may be exacer-
bated by the fact that students may receive different levels of feedback, depending
on the context, the teacher and the student, and it may be difficult to separate out the
influence of feedback from the students’ ‘pure’ or own work. There is also the risk,
depending on the circumstances, that the work is not the student’s own. In essence,
from these perspectives validity and reliability are called into question, and it
becomes challenging to know how to interpret the grades from such an exercise.
A more formative or dynamic socio-constructivist assessment response would be
to argue that static one-time summative tests cannot take account of active collabo-
ration, intervention and feedback. These are part of the process that enables students
to move from one level of proficiency to a higher level of proficiency by virtue of
the collaborative interaction. The result of this process is that the end product (the
portfolio) is a more accurate reflection of the student’s real proficiency, and thereby
arguably a more valid and reliable reflection of what that student knows and can do.
Furthermore, static tests can also be used formatively when they occur at summative
stages within an on-going programme (e.g., end of a teaching unit), and when the
data available from students’ performances are used to feed back into the on-going
teaching and learning process.
Despite the tension between two assessment paradigms – static and dynamic/
summative and formative/assessment of and for learning – the paradigms are in
reality not mutually exclusive. Neither can it be argued that either paradigm is
‘right’ or ‘wrong’, ‘better’ or ‘worse’. They are “just different, and based on differ-
ent assumptions about what we want to measure” (East, 2008a, p. 9). The tension
between them means that, in practice, there is “often an attempt to ‘mix and match’,
with assessment for learning sometimes taking the dominant position in the argu-
ments, and with the assessment of learning staking its claim when there is a feeling
that its influence is being watered down” (p. 9).
East (2008a) draws on the example of the ways in which the UK’s high-stakes
assessment system for the final year of compulsory schooling – the General
Certificate of Secondary Education or GCSE – has, since its introduction in 1986,
effectively been subject to the conflict that arises from the apparent incompatibility
of two different but equally important assessment paradigms. This results in a
“bouncing back and forth between more traditional testing practices and skills
development,” a tension that is “driven by conflicting beliefs among those who
devise or advise on the assessment policies about what assessment should be about –
the assessment of learning or assessment for learning” (p. 10). As Gipps (1994) puts
it, on the one hand, political intervention in the UK has “sometimes initiated, some-
times reinforced the move towards a more practical and vocationally oriented cur-
riculum and thus the move towards more practical, school-based assessment.” In
contrasting moves, the UK Government “has also been concerned with issues of
accountability and with what it sees as the maintenance of traditional academic
standards through the use of externally set tests” (p. viii). Gipps concludes that these
divergent stances have created a complex and confusing environment. Indeed, the
latest iteration of the GCSE, proposed for first examination in schools in a rolling
3-year programme of implementation beginning in 2017, sees a return to a tradi-
tional summative examination format and the removal of modular and coursework
options (Gov.UK, 2015).
Translated to the context of the measurement of FL students’ spoken communi-
cative proficiency, the static-dynamic tension leads to the following dichotomy: the
demand for speaking assessments that provide accurate, just and appropriate perfor-
mance outcomes (Luoma, 2004) suggests that a static assessment model, with its
central (psychometric) concerns for validity and reliability, might be the appropriate
medium for assessing speaking, as operationalised through a formal ‘speaking test’.
Certainly current practice in many jurisdictions, for example the tests of English of
the US-based Educational Testing Service (TOEFL, TOEIC) or the UK’s new FL
GCSEs (see above), would suggest that this is the case. However, when considering
the negative implications of summative testing, and when seen in the light of
Brown’s (2007) assertion concerning social constructivism and the language class-
room as a “locus of meaningful, authentic exchanges among users of language”
(p. 218), a more dynamic or formative assessment model appears to have much to
commend it. This is especially so when the portfolio model appears to hold out the
possibility that a range of evidence might be available for summative measurement
use.
One potential means of reconciling the conflict between two contrasting para-
digms is the use of so-called ‘performance-based assessments’ which have been
precipitated by the advent of such initiatives as the proficiency movement. A
performance-based assessment model stands in contrast to more traditional types of
language testing in that it “typically involves oral production, written production,
open-ended responses, integrated performance (across skill areas), group perfor-
mance, and other interactive tasks” (Brown & Abeywickrama, 2010, p. 16). In the-
ory this leads to “more direct and more accurate testing because students are
assessed as they perform actual or simulated real-world tasks” and “learners are
measured in the process of performing the targeted linguistic acts” (p. 16). However,
the testing is occurring in the context of on-going classroom work to the extent that
Brown and Abeywickrama suggest that, when seeing performance-based assess-
ment in action in the languages classroom, “you may have a difficult time distin-
guishing between formal and informal assessment” (p. 16).
The blurring of the boundaries between formal testing and informal assessment
leads Clapham (2000) to argue that performance testing and alternative assessment
have a good deal in common. Both forms of assessment are “concerned with asking
students to create or produce something, and both focus on eliciting samples of
language which are as close to real life as possible.” The essential difference
between the two, in Clapham’s view, is that “performance testers agonize about the
validity and reliability of their instruments while alternative assessors do not”
(p. 152). However, the same essential format could be used whether the assessment
is ‘performance’ (with all this implies about measurement and accountability) or
‘alternative’ (with all this implies about formative assessment and feedback). When
transferred to the high-stakes arena, issues of validity and reliability must be taken
into consideration, but there is arguably a conceptual framework in which these
considerations can be addressed.

2.4 Task-Based or Construct Based

Whether operationalised within a static or dynamic model of assessment, or the
'hybrid' that performance-based testing might offer, a second consideration for the
effective measurement of spoken communicative proficiency is the nature of the
task that candidates are asked to complete and thereby the nature of the evidence we
need to seek of candidates’ abilities.

2.4.1 The Centrality of the Task

Brown and Abeywickrama (2010) argue that, because a characteristic of many
performance-based language assessments is the use of an interactive task,
performance-based assessment may alternatively be called ‘task-based language
assessment’ or TBLA.
TBLA represents the enactment, through assessment, of the task-based language
teaching (TBLT) approach that, as I noted in Chap. 1, has become a specific realisa-
tion of CLT. The essence of a task-based approach is to engage learners in “real
language use in the classroom” by the use of “discussions, problems, games and so
on” that “require learners to use language for themselves” (Willis & Willis, 2007,
p. 1). A range of definitions of ‘task’ for the purposes of TBLT have been proposed
(see, e.g., Samuda & Bygate, 2008, for a useful overview). Essential features of
tasks are, however, that they engage learners in real language use and negotiation of
meaning; that they have an outcome beyond the use of language; and that language
users have to rely on their own resources to reach the outcome. In TBLA, as in
TBLT, task is defined in specific ways which differentiate it from a communicative
activity (Nunan, 2004), the purpose of which may simply be to practise a particular
grammatical form, albeit in a communicative context.
The classic ‘cup of coffee’ scenario lends itself to useful examples of the differ-
ences between task and activity. A simple communicative activity for assessment
purposes may be as follows: ‘work with a partner. One of you is a waiter in a French
restaurant. One of you is a customer. Order a cup of coffee and something to eat
from the menu, and then ask for the bill. When you have finished, swap roles.’ The
simple transactional role-play activity is arguably authentic and promotes a level of
personal interaction. An authentic menu card may be available as a resource. The
primary goal becomes the practice and use of the appropriate language, which may
well be specified in advance, and using a list of pre-defined phrases and a pre-
determined structure. The outcomes are determined by reference to that language.
A task scenario would promote interaction using a range of language and would
require an outcome beyond the use of language: ‘work with a partner. You and your
partner are in a French café on the last day of your school exchange trip and you
wish to order a drink and something to eat. Between you, you are down to your last
20 euros. Goal: come to a consensus on the items you can afford to buy.’ The authen-
tic menu card is available as a resource, and possibly conditions that delimit indi-
vidual choices (e.g., lactose intolerant; must have gluten free; does not want
caffeine). The task requires the partners to express an opinion about what they
would like to eat and drink, but they also have to solve a problem. The partners are
therefore required to go beyond their own opinions to reach an outcome (i.e., con-
sensus on the order, given the opinions expressed). The primary goal is the outcome
(rather than the language used to get there). Participants make their own choices
about the language they wish to use to achieve the outcome (i.e., suitable language
and grammatical structures are not pre-determined or imposed – even though par-
ticular language and grammatical structures may be anticipated in the responses). In
this case, the role-play is moving towards becoming what Richards (2006) refers to
as a ‘fluency task’ where, despite being “heavily constrained by the specified situa-
tion and characters” (p. 15), the language may be “entirely improvised by the stu-
dents" (p. 15), the focus is on "getting meanings across using any available
communicative resources” (p. 16).
A TBLT/TBLA understanding of task moves beyond interpreting authenticity in
purely situational terms (‘you are ordering a cup of coffee in a French restaurant
…’), which is actually difficult to operationalise authentically and may not be rele-
vant to the needs and aspirations of those taking the assessment. Authenticity is
interpreted interactionally – that is, “the task may be authentic in the sense that it
requires the learners to utilise the types of skills that they might use in any real-life
interactional situation beyond the task (such as co-operating and collaborating,
expressing points of view, or negotiating meaning)” (East, 2012, pp. 80–81). This
interpretation of authenticity broadens our understanding of what constitutes a TLU
domain. Also, this understanding of task moves beyond simple “Q/A exchanges
clustered around topics such as ‘the family’, ‘hobbies’, or ‘likes and dislikes’”
(Mitchell & Martin, 1997, p. 23) – or the ubiquitous cup of coffee – that can be
largely rote-learnt. It provides broader opportunity to “elicit the kinds of communi-
cative behaviour (such as the negotiation of meaning) that naturally arises from
performing real-life language tasks” (Van den Branden, 2006, p. 9).
The use of more open-ended tasks than very prescribed situational role-plays
should not be taken to suggest that task completion does not utilise pre-learnt and
pre-fabricated ‘chunks’ of language (see, e.g., East, 2012, Chap. 3, in this regard).
Nor does it suggest that there will have been no prior preparation or rehearsal.
Indeed, in the language learning context, a good deal of the literature that informs
the TBLT approach speaks of task preparation and task repetition as valid means of
enhancing ultimate task completion (Bygate, 1996, 2001, 2005; Ellis, 2005;
Mochizuki & Ortega, 2008; Pinter, 2005, 2007; Skehan, 2009). Nitta and
Nakatsuhara (2014) transfer these arguments to the assessment context in a recent
useful study into the effects of prior planning time on candidate performances when
completing a paired oral assessment task. Rather, the move beyond the ‘simple Q/A
exchange’ facilitates opportunities for FL students to make their own choices about
appropriate language in their attempts to negotiate meanings spontaneously in the
process of achieving the task outcome (although they may well draw on pre-learnt
formulaic sequences as part of completing the task).
A central element of TBLA, then, is “the notion that tasks that require examinees
to engage in meaningful language communication are an important focal point for
the development and use of particular tests” (Norris, 2002, p. 337). This notion of
assessment “does not simply utilise the real-world task as a means of eliciting par-
ticular components of the language system which are then measured or evaluated;
instead, the construct of interest is performance of the task itself” (Long & Norris,
2000, p. 600). In this sense, the successful completion of the task becomes the
essential criterion against which students’ performances (and proficiency) are deter-
mined. If, however, TBLA is being used for high-stakes assessment purposes, the
question arises whether task completion becomes a sufficient criterion on which to
judge proficiency.

2.4.2 The Importance of the Construct

Approaching assessment from a task-based perspective, the outcomes of task com-
pletion (the assessment scores) are there to tell us something about the candidate's
ability to deal with the requirements and challenges of the situations that the tasks
replicate in the assessment. As Luoma (2004) argues, when it is straightforward to
define the TLU domain (as it may be in the scenario of coming to agreement on a
food and drink order in a café), the main information we wish to glean from the
assessment is ‘how well can the candidates fulfil the task?’ Provided that the criteria
for different levels of candidate performances are sufficiently explicit and task-
outcome related, there is arguably no problem in interpreting different levels of
proficiency.
If, however, we wish to measure candidates’ spoken communicative proficiency
in broader or more general terms, or across a range of different task types or genres,
or a range of different interactional opportunities, it may be that the construct should
be the primary design criterion that will inform the tasks to be completed. In these
cases, the main information we wish to glean from the assessment is ‘how well can
the candidates communicate?’ The construct definition, and the facets of the con-
struct that are considered to be important, become the primary means of discrimi-
nating between different levels of candidate performance. Fulfilling the task
outcome is important, but becomes secondary to a more general interpretation of
proficiency. As Bachman and Palmer (2010) note, however, “we can define the con-
struct from a number of perspectives.” Although this may include an overarching
“theoretical model of language proficiency” (p. 43) such as Canale and Swain
(1980), it may equally include a definition of what is required for a particular TLU
domain, for which the task becomes the operationalisation of the construct. Students’
performances are then measured against the facets of the particular defined con-
struct that are deemed to be important.
In practice, when assessing students’ spoken communicative proficiency in rela-
tion to the performance of communicative language use tasks, there is arguably a
need for evidence both that the candidate is able to complete the task successfully
and that the candidate is able to demonstrate different facets of a defined theoretical
construct. When two partners are demonstrating their proficiency to interact suc-
cessfully when negotiating, for example, what to buy, we are just as interested in
measuring whether the candidates can complete the task successfully as in measur-
ing whether the candidates can do so in a way that demonstrates proficiency in the
different facets of the construct under consideration. Indeed, the ability to demon-
strate proficiency across these different facets of the defined communicative profi-
ciency construct is arguably implicit in candidates’ ability to perform the task
successfully. Successful task performance is likely to be hindered by inadequate
proficiency in any one of the defined facets. (Successful task performance is also
potentially hindered by the nature of the task that students are asked to complete, an
issue I consider in Chap. 4.)
Luoma (2004) concludes that “[u]ltimately, the test developers need to include
both construct and task considerations in the design and development of speaking
tests” (p. 42). In other words, there is a conceptual equivalence between task and
construct. This is in accord with Bachman (2002), who argues that “sound proce-
dures for the design, development and use of language tests must incorporate both a
specification of the assessment tasks to be included and definitions of the abilities to
be assessed” (p. 457). An alternative way of viewing this both-and requirement is as
a ‘constructive alignment’ (Biggs & Tang, 2011) between the general (linguistic
proficiency) outcomes which we expect of learners following a particular course of
instruction in the FL (the constructs), and which have been made transparent to the
learners, and the measurement of these outcomes through specific criterion-
referenced assessment opportunities (the tasks).

2.5 Single or Paired Performances

So far I have considered the arguments around whether the measurement of spoken
communicative proficiency is more effectively operationalised through a static (one-
time summative) or dynamic (on-going formative) model, or through a task-based
(outcome focused) or construct based (proficiency focused) model. The third con-
sideration for measuring spoken communicative proficiency that I will discuss is
whether the assessment should be single (focusing on the candidate as an individual)
or paired/grouped (focusing on the candidate in interaction with at least one peer).

2.5.1 Single Performance Assessments

According to Luoma (2004), the most common way of organising speaking assess-
ments is “to assess examinees one at a time, often in an interview format” (p. 35).
Luoma asserts that, until relatively recently, this well-established format for speak-
ing assessments has not really been brought into question even though assessment
methods for other communicative skills have been critiqued and revised. In Luoma’s
view, although the interview test procedure may be costly in terms of the time and
resources involved, it is also flexible in that the questions posed by the examiner can
be adapted to suit the performance of individual candidates. Also, single interview
tests, in Luoma’s view, “do provide the examinees with an opportunity to show a
range of how well they can speak the language, so they do work as tests” (p. 35).
The Oral Proficiency Interview test of the American Council on the Teaching of
Foreign Languages (ACTFL-OPI) arguably represents one of the most influential
examples of this kind of test for assessing FL students. The ACTFL-OPI has been
designed to measure candidates’ functional use of the FL in a way that supports a
communicative approach to language teaching, with its emphasis on meaningful
interaction (Turner, 1998). The OPI is operationalised as a face-to-face or telephone
interview test between a certified ACTFL assessor and the candidate. A version that
can be administered solely by computer is also available (Language Testing
International, 2014). Performances are graded on a ten-point scale relative to a
range of detailed proficiency descriptors from ‘novice low’ (lowest) to ‘distin-
guished’ (highest) (ACTFL, 2012). The OPI has had significant washback into FL
classrooms, at least in the US, with the ACTFL Proficiency Guidelines that inform
the test having “a strong effect on the content and the teaching methodology of
many foreign language courses” (Yoffe, 1997, p. 2).
There is widespread and on-going acceptance of tests like the ACTFL-OPI as valid
and reliable measures of candidates’ spoken communicative proficiency. Nevertheless,
the validity of these kinds of assessment has been called into question in a number of
ways that highlight the limitations of the single interview test format.
Ostensibly the ACTFL-OPI, along with other single interview tests, aims to cap-
ture interaction between the test candidate and the examiner. In practice, however,
the test is one-sided. In Luoma’s (2004) words, “the interlocutor [examiner] initi-
ates all phases of the interaction and asks the questions, whereas the role of the
examinee is to comply and answer” (p. 35). This can potentially lead to two dimen-
sions of artificiality.
First, as Yoffe (1997) argues, although the ACTFL-OPI “purports to assess func-
tional speaking ability,” there is strong encouragement for raters to “pay careful
attention to the form of the language produced rather than to the message conveyed”
(p. 5). This can consequently lead the examiner to attempt to ‘force’ particular
grammatical structures into use. It might also encourage test candidates to do the
same. This makes the test potentially more a measure of grammatical knowledge
than interactional proficiency and also possibly leads to artificiality of language –
using particular grammatical constructions for the sake of demonstrating knowl-
edge that may not occur in ‘normal’ conversation.
Second (and this is arguably not a concern for externally examined candidates,
but more a concern for assessments that are operationalised within or at the end of
on-going programmes of teaching and learning where teachers are asked to examine
their own candidates), the crucial role of the teacher/examiner in guiding the inter-
action leaves open the possibility that, with the best of intentions, candidates will
know in advance what they are likely to be asked, and can therefore prepare their
responses.1
Fundamentally, then, a weakness of the single interview test is that it does not
represent normal conversation (van Lier, 1989), with its spontaneity and openness
to pursue a range of directions without being governed by having to account for
specific grammatical constructions. Single interview tests therefore “focus too
much on the individual rather than the individual in interaction” (McNamara, 1996,
p. 85, my emphasis). Also, “clearly, if we want to test spoken interaction, a valid test
must include reciprocity conditions” (Weir, 2005, p. 72) and the unpredictability of
these, or, to use Kramsch’s (1986) terminology, the opportunity to measure ‘inter-
actional competence’. In turn, single candidate interview tests run the risk of con-
struct under-representation (Messick, 1989) (see Chap. 1).

2.5.2 Paired/Group Performance Assessments

Luoma (2004) suggests that, given the limitations of individual interview tests, an
alternative is to assess candidates in pairs. The fundamental operational difference
between the single interview test and the paired assessment format is that “the
examinees are asked to interact with each other, with the examiner observing rather
than taking part in the interaction directly” (p. 36). Paired speaking assessments
may therefore offer advantages that individual interview tests do not, and the use of
paired or group spoken interactions for assessment purposes has been growing since
the 1980s (Ducasse & Brown, 2009). Paired speaking assessments are now fre-
quently used in both classroom and high-stakes assessment contexts (May, 2011),
such as the speaking assessments that make up the suite of international Cambridge
examinations (Cambridge English language assessment, 2015).

1 This argument is not proposed to discredit the role of teachers as examiners. It is, rather, to high-
light the potential danger of having teachers as examiners. These teachers inevitably wish their
candidates to perform at their best. Teachers may therefore prepare their students for the assess-
ment in ways that ultimately diminish the opportunity for candidates to demonstrate their own
proficiency.

Several advantages to the paired format may be advanced. In contrast to examiner-
candidate interview tests, paired assessments that allow two peers to interact have
been found to lead to greater balance (more equal interlocution) between the part-
ners (Együd & Glover, 2001; Luoma, 2004). Also, the paired assessment format can
elicit a broader spectrum of functional competence (Galaczi, 2010) and a wider
range of interactional patterns (Saville & Hargreaves, 1999; Swain, 2001).
Negotiation of meaning and co-construction of discourse allow candidates to dis-
play dimensions of interactional competence such as collaboration, cooperation and
coordination (Jacoby & Ochs, 1995), prompting, elaboration, finishing sentences,
referring to a partner’s ideas and paraphrasing (Brooks, 2009), turn taking, initiating
topics and engaging in extended discourse with a peer rather than a teacher/exam-
iner (Ducasse & Brown, 2009; May, 2011).
The paired speaking assessment format therefore provides greater opportunities
to capture a range of examples of speaking that reflect how interactions usually take
place (Skehan, 2001). We are thus able to measure a more comprehensive spoken
communicative proficiency construct in the paired assessment than the construct
that is tapped into in the single interview test. This arguably allows for better or
more useful inferences to be made about the candidate’s proficiency in wider real-
life contexts (Galaczi, 2010), or for score interpretations to be relatable to the
broader real-world scenarios created in the assessment (Bachman & Palmer, 1996).
There are also arguably a number of consequential advantages to paired spoken
assessments. One key advantage, according to Ducasse and Brown (2009), is the
claim to positive washback on classroom practices (Messick, 1996). That is, the
paired format will either mirror what is already happening in the regular functioning
of a CLT-oriented classroom in terms of pair/group work, or it will encourage more
paired interaction in class (Galaczi, 2010; Swain, 2001). This creates “a conscious
feedback loop between teaching and testing, in terms of not only content but of
approach” (Morrow, 1991, p. 111). Paired assessments are therefore arguably more
representative of ‘best practice’ in CLT classrooms (Együd & Glover, 2001; Taylor,
2001) in that they encourage an approach that is “likely to have a real effect on the
actual teaching styles used in the classroom regarding the encouragement of oral
production by the students in a wide variety of contexts” (Smallwood, 1994, p. 70).
There is also evidence to suggest that students view paired or group interactions
positively (Együd & Glover, 2001; Fulcher, 1996; Nakatsuhara, 2009), and that
paired assessments provoke less anxiety in candidates (Fulcher, 1996; Ockey, 2001).
Paired or group assessments may also be more time and cost efficient because can-
didates can be assessed together, and raters can assess two or more candidates at the
same time (Ducasse & Brown, 2009; Galaczi, 2010; Ockey, 2001; Swain, 2001).
The paired spoken assessment format therefore appears to have several advan-
tages in comparison with single interview tests, not least of which relate to the abil-
ity of the paired format to assess a more broadly defined construct of spoken
communicative proficiency that includes dimensions of interactional competence.
Taylor and Wigglesworth (2009) sum up the advantages like this: in the learning
context, more classroom opportunities are being provided for students to use lan-
guage actively across a range of skills, and to offer and obtain feedback on their
language use. In the assessment context, paired or group assessments provide the
opportunity for students to demonstrate their interactive skills in ways that the
single-candidate interview test simply cannot do.
Nevertheless, several concerns about the paired assessment format have been
identified. One major concern regarding paired assessments has been the issue of
the impact that one candidate can have on another and therefore whether it is impor-
tant to take into consideration how pairs are put together (Davis, 2009; Foot, 1999;
Fulcher, 2003; Galaczi & ffrench, 2011). So-called ‘interlocutor effects’ (O’Sullivan,
2002) such as age, gender, cultural or first language background, personality, or how
well the two candidates know and get on with each other can influence the amount
and quality of the language produced in the interaction. Interlocutor effects there-
fore have implications for construct irrelevant variance (Messick, 1989). Interlocutor
variability thus “holds fundamental implications and challenges for oral perfor-
mance assessment, since certain interlocutor variables could become a potential
threat to a test’s validity and fairness” (Galaczi & ffrench, 2011, p. 166). This prob-
lem can of course also impact on single interview tests. As Leung and Lewkowicz
(2006) point out, all oral performances are “essentially co-constructed through
social interaction” such that “all participants in the interaction are likely to affect
individual performances” (p. 217). The situation is arguably exacerbated in paired/
group oral scenarios.
Studies into the impact of pairings on performances in speaking assessments
have in fact led to contrasting findings. Csépes’ (2002) findings about scores given
by raters suggest that raters’ perceptions of students’ proficiency were not influ-
enced, either positively or negatively, by considerable variations in the proficiency
level of partners. This finding is supported by Nakatsuhara (2009), who concluded
that, regardless of whether students are paired with partners of similar or different
proficiency levels, they are likely to be given comparable opportunities to display
their communicative proficiency such that pairing students with different levels of
proficiency may not be problematic. Davis (2009) similarly notes that differences in
proficiency level between interlocutors appear to have little impact on raw scores,
with neither higher nor lower proficiency candidates being disadvantaged. Davis
does not preclude an impact, however, but argues that that impact may be “indirect
and unpredictable, rather than simple and consistent” (p. 388). Norton (2005), by
contrast, suggests that “being paired with a candidate who has higher linguistic abil-
ity may be beneficial for lower level candidates who are able to incorporate some of
their partner’s expressions into their own speech” (p. 291). This finding of benefit
for lower proficiency learners is consistent with that of an earlier study by Iwashita
(1996). These findings have implications for pairings of students with comparable
proficiency, particularly for lower proficiency pairings.
A further problem for paired assessments concerns performance measurement,
and how performances in paired interactions can be measured and scored reliably.
Because the interaction in the paired assessment context is co-constructed, co-
participants’ performances become interdependent, and this presents scoring chal-
lenges (Brooks, 2009). The question becomes whether the scores can be considered
to be true and accurate measures of each candidate’s real proficiency since the
scores may differ if a candidate is assessed on a similar task, but with a different
interlocutor. We often require reliable measurements of individual performances
(McNamara & Roever, 2006). However, especially given the potential and complex
impact of interlocutor variables, “we have to ask … how scores can be given to an
individual test taker rather than pairs of test takers in a paired test format” (Fulcher,
2003, p. 46). That is, if, as Weir (2005) argues, “an individual’s performance is
clearly affected by the way the discourse is co-constructed by the person they are
interacting with,” this becomes a problem for the reliable measurement of an indi-
vidual candidate’s proficiency, and yet “[h]ow to factor this into or out of assess-
ment criteria is yet to be established in a satisfactory manner” (p. 153).
The paired assessment format clearly presents both benefits and challenges. May
(2009) concludes, on the one hand, that “[i]t is clear that paired speaking tests have
the potential to elicit features of interactional competence, including a range of
conversation management skills, that are generally not featured in traditional lan-
guage testing interviews” (p. 415). She acknowledges, on the other hand, the scor-
ing challenge of “the separability of the individual candidate’s contribution”
(p. 419). May also questions the ethicality and impartiality of exposing candidates
to an assessment format that may disadvantage them due to interactional variables
that do not relate to the candidates’ overall speaking proficiency. In her view, “[t]he
[potentially negative] consequences for a candidate involved in an asymmetric
interaction are very real, not simply a matter of rater perception” (p. 416). That is,
although scoring may indeed be impacted by interlocutor variables in ways that are
hard to determine or isolate, the impact on the candidates is broader than the scores
and relates to how they may feel about the assessment altogether, and how these
affective factors may influence performance, whether for good or ill. Although it
may be argued that “[c]oping successfully with such real-life interaction demands
… becomes part of the construct of interactional competence” (Galaczi, 2010, p. 8),
the impact of these ‘real-life demands’ requires critical examination.
Taking this range of evidence about single candidate and paired/group speaking
assessments into account, East (2015) concludes that the jury is still out with regard
to the usefulness of the paired or group speaking assessment format in comparison
with the single interview test.

2.6 Conclusion

The above review has raised a range of issues that need to be accounted for when
considering the most appropriate ways of assessing FL students’ spoken communi-
cative proficiency. These include: what it means to speak proficiently (how a spoken
communicative proficiency construct is to be defined); the paradigm in which the
assessment will sit (static or dynamic/summative or formative); whether the out-
come of interest is performance of the task or evidence of contextually appropriate
language proficiency (task-based or construct-based); and whether single or paired/
group performances enable us to measure proficiency adequately. The issues reveal
that it is not a straightforward matter to construct and execute valid and reliable
assessments of spoken communicative proficiency.
Writing in the context of the constraints of paired assessments, Galaczi (2010)
argues that assessment providers have an “ethical responsibility to construct tests
which would be fair and would not provide (intentionally or unintentionally) dif-
ferential and unequal treatment of candidates based on background variables” (p. 8).
This concern applies more broadly to the range of considerations, as rehearsed in
this chapter, that must be taken into account when designing and operationalising
speaking assessments. Building on arguments that I have presented in this and the
preceding chapter, the study that is the substance of this book is one step towards
fulfilling this ethical obligation.

References

ACTFL. (2012). ACTFL proficiency guidelines 2012. Retrieved from http://www.actfl.org/publications/guidelines-and-manuals/actfl-proficiency-guidelines-2012
ARG. (1999). Assessment for learning: Beyond the black box. Cambridge, England: University of
Cambridge Faculty of Education.
ARG. (2002a). Assessment for learning: 10 principles. Retrieved from http://webarchive.
nationalarchives.gov.uk/20101021152907/http:/www.ttrb.ac.uk/ViewArticle2.aspx?ContentId=
15313
ARG. (2002b). Testing, motivation and learning. Cambridge, England: University of Cambridge
Faculty of Education.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, England: Oxford
University Press.
Bachman, L. F. (2002). Some reflections on task-based language performance assessment.
Language Testing, 19(4), 453–476. http://dx.doi.org/10.1191/0265532202lt240oa
Bachman, L. F., & Palmer, A. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford, England: Oxford University Press.
Bachman, L. F., & Palmer, A. (2010). Language assessment in practice: Developing language
assessments and justifying their use in the real world. Oxford, England: Oxford University
Press.
Biggs, J., & Tang, C. (2011). Teaching for quality learning at university: What the student does
(4th ed.). Maidenhead, England: McGraw-Hill/Open University Press.
Brooks, L. (2009). Interacting in pairs in a test of oral proficiency: Co-constructing a better perfor-
mance. Language Testing, 26(3), 341–366. http://dx.doi.org/10.1177/0265532209104666
Brown, H. D. (2007). Principles of language learning and teaching (5th ed.). New York, NY:
Pearson.
Brown, H. D., & Abeywickrama, P. (2010). Language assessment: Principles and classroom prac-
tices (2nd ed.). New York, NY: Pearson.
Bygate, M. (1996). Effects of task repetition: Appraising the developing language of learners. In
J. Willis & D. Willis (Eds.), Challenge and change in language teaching (pp. 136–146).
Oxford, England: Macmillan.
Bygate, M. (2001). Effects of task repetition on the structure and control of oral language. In
M. Bygate, P. Skehan, & M. Swain (Eds.), Researching pedagogic tasks (pp. 23–48). Harlow,
England: Longman.
Bygate, M. (2005). Oral second language abilities as expertise. In K. Johnson (Ed.), Expertise in
second language learning and teaching (pp. 104–127). New York, NY: Palgrave Macmillan.
Byram, M. (1997). Teaching and assessing intercultural communicative competence. Clevedon,
England: Multilingual Matters.
Byram, M. (2000). Assessing intercultural competence in language teaching. Sprogforum, 18(6),
8–13.
Byram, M. (2008). From foreign language education to education for intercultural citizenship:
Essays and reflections. Clevedon, England: Multilingual Matters.
Byram, M. (2009). Intercultural competence in foreign languages: The intercultural speaker and
the pedagogy of foreign language education. In D. K. Deardorff (Ed.), The Sage handbook of
intercultural competence (pp. 321–332). Thousand Oaks, CA: Sage.
Byram, M., Gribkova, B., & Starkey, H. (2002). Developing the intercultural dimension in lan-
guage teaching: A practical introduction for teachers. Strasbourg, France: Council of Europe.
Byram, M., Holmes, P., & Savvides, N. (2013). Intercultural communicative competence in for-
eign language education: Questions of theory, practice and research. The Language Learning
Journal, 41(3), 251–253. http://dx.doi.org/10.1080/09571736.2013.836343
Cambridge English language assessment. (2015). Retrieved from http://www.cambridgeenglish.
org/exams/
Canale, M. (1983). On some dimensions of language proficiency. In J. W. J. Oller (Ed.), Issues in
language testing research (pp. 333–342). Rowley, MA: Newbury House.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second lan-
guage teaching and testing. Applied Linguistics, 1(1), 1–47. http://dx.doi.org/10.1093/
applin/i.1.1
Clapham, C. (2000). Assessment and testing. Annual Review of Applied Linguistics, 20, 147–161.
http://dx.doi.org/10.1017/s0267190500200093
Council of Europe. (2001). Common European framework of reference for languages. Cambridge,
England: Cambridge University Press.
Csépes, I. (2002). Is testing speaking in pairs disadvantageous for students? Effects on oral test
scores. novELTy, 9(1), 22–45.
Davis, L. (2009). The influence of interlocutor proficiency in a paired oral assessment. Language
Testing, 26(3), 367–396. http://dx.doi.org/10.1177/0265532209104667
Dervin, F. (2010). Assessing intercultural competence in Language Learning and Teaching: A criti-
cal review of current efforts. In F. Dervin & E. Suomela-Salmi (Eds.), New approaches to
assessment in higher education (pp. 157–173). Bern, Switzerland: Peter Lang.
Ducasse, A., & Brown, A. (2009). Assessing paired orals: Raters’ orientation to interaction.
Language Testing, 26(3), 423–443. http://dx.doi.org/10.1177/0265532209104669
East, M. (2008a). Dictionary use in foreign language writing exams: Impact and implications.
Amsterdam, Netherlands/Philadelphia, PA: John Benjamins. http://dx.doi.org/10.1075/lllt.22
East, M. (2012). Task-based language teaching from the teachers’ perspective: Insights from New
Zealand. Amsterdam, Netherlands / Philadelphia, PA: John Benjamins. http://dx.doi.
org/10.1075/tblt.3
East, M. (2015). Coming to terms with innovative high-stakes assessment practice: Teachers’
viewpoints on assessment reform. Language Testing, 32(1), 101–120. http://dx.doi.
org/10.1177/0265532214544393
Együd, G., & Glover, P. (2001). Readers respond. Oral testing in pairs - secondary school perspec-
tive. ELT Journal, 55(1), 70–76. http://dx.doi.org/10.1093/elt/55.1.70
Ellis, R. (Ed.). (2005). Planning and task performance in a second language. Amsterdam,
Netherlands/Philadelphia, PA: John Benjamins. http://dx.doi.org/10.1075/lllt.11
Foot, M. C. (1999). Relaxing in pairs. ELT Journal, 53(1), 36–41. http://dx.doi.org/10.1093/
elt/53.1.36
Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing,
13(1), 23–51. http://dx.doi.org/10.1177/026553229601300103
Fulcher, G. (2003). Testing second language speaking. Harlow, England: Pearson. http://dx.doi.
org/10.4324/9781315837376
Galaczi, E. D. (2010). Paired speaking tests: An approach grounded in theory and practice. In
J. Mader, & Z. Ürkün (Eds.), Recent approaches to teaching and assessing speaking. IATEFL
TEA SIG conference proceedings. Canterbury, England: IATEFL Publications.
Galaczi, E. D., & ffrench, A. (2011). Context validity. In L. Taylor (Ed.), Examining speaking:
Research and practice in assessing second language speaking (pp. 112–170). Cambridge,
England: Cambridge University Press.
Gipps, C. (1994). Beyond testing: Towards a theory of educational assessment. London, England:
The Falmer Press. http://dx.doi.org/10.4324/9780203486009
Gipps, C., & Murphy, P. (1994). A fair test? Assessment, achievement and equity. Buckingham,
England: Open University Press.
Gov.UK. (2015). Get the facts: GCSE reform. Retrieved from https://www.gov.uk/government/
publications/get-the-facts-gcse-and-a-level-reform/get-the-facts-gcse-reform
Hinkel, E. (2010). Integrating the four skills: Current and historical perspectives. In R. Kaplan
(Ed.), The Oxford Handbook of Applied Linguistics (2nd ed., pp. 110–123). Oxford, England:
Oxford University Press. http://dx.doi.org/10.1093/oxfordhb/9780195384253.013.0008
Hunter, D. (2009). Communicative language teaching and the ELT Journal: a corpus-based
approach to the history of a discourse. Unpublished doctoral thesis. University of Warwick,
Warwick, England.
Iwashita, N. (1996). The validity of the paired interview in oral performance assessment. Melbourne
Papers in Language Testing, 5(2), 51–65.
Jacoby, S., & Ochs, E. (1995). Co-construction: An introduction. Research on Language and
Social Interaction, 28(3), 171–183.
Kramsch, C. (1986). From language proficiency to interactional competence. The Modern
Language Journal, 70(4), 366–372. http://dx.doi.org/10.1111/j.1540-4781.1986.tb05291.x
Kramsch, C. (2005). Post 9/11: Foreign languages between knowledge and power. Applied
Linguistics, 26(4), 545–567. http://dx.doi.org/10.1093/applin/ami026
Language Testing International. (2014). ACTFL Oral Proficiency Interview by Computer (OPIc).
Retrieved from http://www.languagetesting.com/oral-proficiency-interview-by-computer-opic
Leung, C. (2007). Dynamic Assessment: Assessment for and as Teaching? Language Assessment
Quarterly, 4(3), 257–278. http://dx.doi.org/10.1080/15434300701481127
Leung, C., & Lewkowicz, J. (2006). Expanding horizons and unresolved conundrums: Language
testing and assessment. TESOL Quarterly, 40(1), 211–234. http://dx.doi.org/10.2307/40264517
Liddicoat, A. (2005). Teaching languages for intercultural communication. In D. Cunningham, &
A. Hatoss (Eds.), An international perspective on language policies, practices and proficiencies
(pp. 201–214). Belgrave, Australia: Fédération Internationale des Professeurs de Langues
Vivantes (FIPLV).
Liddicoat, A. (2008). Pedagogical practice for integrating the intercultural in language teaching
and learning. Japanese Studies, 28(3), 277–290. http://dx.doi.org/10.1080/10371390802446844
Liddicoat, A., & Crozet, C. (Eds.). (2000). Teaching languages, teaching cultures. Melbourne,
Australia: Language Australia.
Lo Bianco, J., Liddicoat, A., & Crozet, C. (Eds.). (1999). Striving for the third place: Intercultural
competence through language education. Melbourne, Australia: Language Australia.
Long, M., & Norris, J. (2000). Task-based teaching and assessment. In M. Byram (Ed.), Routledge
encyclopedia of language teaching and learning (pp. 597–603). London, England: Routledge.
Luoma, S. (2004). Assessing speaking. Cambridge, England: Cambridge University Press. http://
dx.doi.org/10.1017/cbo9780511733017
Martinez-Flor, A., Usó-Juan, E., & Alcón Soler, E. (2006). Towards acquiring communicative
competence through speaking. In E. Usó-Juan, & A. Martínez-Flor (Eds.), Studies on language
acquisition: Current trends in the development and teaching of the four language skills
(pp. 139–157). Berlin, Germany/New York, NY: Walter de Gruyter. http://dx.doi.
org/10.1515/9783110197778.3.139
May, L. (2009). Co-constructed interaction in a paired speaking test: The rater’s perspective.
Language Testing, 26(3), 397–422. http://dx.doi.org/10.1177/0265532209104668
May, L. (2011). Interactional competence in a paired speaking test: Features salient to raters.
Language Assessment Quarterly, 8(2), 127–145. http://dx.doi.org/10.1080/15434303.2011.56
5845
McNamara, T. (1996). Measuring second language performance. London, England: Longman.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Malden, MA:
Blackwell.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103).
New York, NY: Macmillan.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241–256.
http://dx.doi.org/10.1177/026553229601300302
Mitchell, R., & Martin, C. (1997). Rote learning, creativity and ‘understanding’ in classroom for-
eign language teaching. Language Teaching Research, 1(1), 1–27. http://dx.doi.
org/10.1177/136216889700100102
Mochizuki, N., & Ortega, L. (2008). Balancing communication and grammar in beginning-level
foreign language classrooms: A study of guided planning and relativization. Language
Teaching Research, 12(1), 11–37. http://dx.doi.org/10.1177/1362168807084492
Morrow, K. (1991). Evaluating communicative tests. In S. Anivan (Ed.), Current developments in
language testing (pp. 111–118). Singapore, Singapore: SEAMEO Regional Language Centre.
Nakatsuhara, F. (2009). Conversational styles in group oral tests: How is the conversation con-
structed? Unpublished doctoral thesis. University of Essex, Essex, England.
Nitta, R., & Nakatsuhara, F. (2014). A multifaceted approach to investigating pre-task planning
effects on paired oral test performance. Language Testing, 31(2), 147–175. http://dx.doi.
org/10.1177/0265532213514401
Norris, J. (2002). Interpretations, intended uses and designs in task-based language assessment.
Language Testing, 19(4), 337–346. http://dx.doi.org/10.1191/0265532202lt234ed
Norton, J. (2005). The paired format in the Cambridge Speaking Tests. ELT Journal, 59(4), 287–
297. http://dx.doi.org/10.1093/elt/cci057
Nunan, D. (2004). Task-based language teaching. Cambridge, England: Cambridge University
Press. http://dx.doi.org/10.1017/cbo9780511667336
O’Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair-task performance.
Language Testing, 19(3), 277–295. http://dx.doi.org/10.1191/0265532202lt205oa
Ockey, G. J. (2001). Is the oral interview superior to the group oral? Working Papers, International
University of Japan, 11, 22–40.
Pinter, A. (2005). Task repetition with 10-year old children. In C. Edwards & J. Willis (Eds.),
Teachers exploring tasks in English language teaching (pp. 113–126). New York, NY: Palgrave
Macmillan.
Pinter, A. (2007). What children say: Benefits of task repetition. In K. Van den Branden, K. Van
Gorp, & M. Verhelst (Eds.), Tasks in action: Task-based language education from a classroom-
based perspective (pp. 131–158). Newcastle, England: Cambridge Scholars Publishing.
Poehner, M. (2008). Dynamic assessment: A Vygotskian approach to understanding and promot-
ing L2 development. New York, NY: Springer.
Poehner, M., & Lantolf, J. P. (2005). Dynamic assessment in the language classroom. Language
Teaching Research, 9(3), 233–265. http://dx.doi.org/10.1191/1362168805lr166oa
Popham, W. J. (2006). Assessment for educational leaders. Boston, MA: Pearson.
Rea-Dickins, P. (2004). Understanding teachers as agents of assessment. Language Testing, 21(3),
249–258. http://dx.doi.org/10.1191/0265532204lt283ed
Richards, J. C. (2006). Communicative language teaching today. Cambridge, England: Cambridge
University Press.
Roever, C. (2011). Testing of second language pragmatics: Past and future. Language Testing,
28(4), 463–481. http://dx.doi.org/10.1177/0265532210394633
Samuda, V., & Bygate, M. (2008). Tasks in second language learning. Basingstoke, England:
Palgrave Macmillan. http://dx.doi.org/10.1057/9780230596429
Saville, N., & Hargreaves, P. (1999). Assessing speaking in the revised FCE. ELT Journal, 53(1),
42–51. http://dx.doi.org/10.1093/elt/53.1.42
Sercu, L. (2010). Assessing intercultural competence: More questions than answers. In A. Paran &
L. Sercu (Eds.), Testing the untestable in language education (pp. 17–34). Clevedon, England:
Multilingual Matters.
Shohamy, E. (2007). Tests as power tools: Looking back, looking forward. In J. Fox, M. Wesche,
D. Bayliss, L. Cheng, C. E. Turner, & C. Doe (Eds.), Language testing reconsidered (pp. 141–
152). Ottawa, Ontario: University of Ottawa Press.
Skehan, P. (2001). Tasks and language performance assessment. In M. Bygate, P. Skehan, &
M. Swain (Eds.), Researching pedagogic tasks: Second language learning, teaching and test-
ing (pp. 167–185). London, England: Longman.
Skehan, P. (2009). Modelling second language performance: Integrating complexity, accuracy,
fluency, and lexis. Applied Linguistics, 30(4), 510–532. http://dx.doi.org/10.1093/applin/
amp047
Smallwood, I. M. (1994). Oral assessment: A case for continuous assessment at HKCEE level.
New Horizons: Journal of Education, Hong Kong Teachers’ Association, 35, 68–73.
Spada, N. (2007). Communicative language teaching: Current status and future prospects. In
J. Cummins & C. Davison (Eds.), International Handbook of English Language Teaching
(pp. 271-288). New York, NY: Springer. http://dx.doi.org/10.1007/978-0-387-46301-8_20
Sunstein, B. S., & Lovell, J. H. (Eds.). (2000). The portfolio standard: How students can show us
what they know and are able to do. Portsmouth, NH: Heinemann.
Swain, M. (1984). Large-scale communicative language testing: A case study. In S. Savignon &
M. Burns (Eds.), Initiatives in communicative language teaching: A book of readings (pp. 185–
201). Reading, MA: Addison-Wesley.
Swain, M. (2001). Examining dialogue: Another approach to content specification and to validat-
ing inferences drawn from test scores. Language Testing, 18(3), 275–302. http://dx.doi.
org/10.1177/026553220101800302
Taylor, L. (2001). The paired speaking test format: Recent studies. Research Notes, 6, 15–17.
Taylor, L., & Wigglesworth, G. (2009). Are two heads better than one? Pair work in L2 assessment
contexts. Language Testing, 26(3), 325–339. http://dx.doi.org/10.1177/0265532209104665
Torrance, H. (Ed.). (2013a). Educational assessment and evaluation: Major themes in education
(Purposes, functions and technical issues, Vol. 1). London, England/New York, NY: Routledge.
Torrance, H. (Ed.). (2013b). Educational assessment and evaluation: Major themes in education
(Current issues in formative assessment, teaching and learning, Vol. 4). London, England/New
York, NY: Routledge.
Turner, J. (1998). Assessing speaking. Annual Review of Applied Linguistics, 18, 192–207. http://
dx.doi.org/10.1017/s0267190500003548
University of Cambridge. (2014). IGCSE syllabus for Dutch, French, German and Spanish.
Cambridge, England: University of Cambridge International Examinations.
Van den Branden, K. (2006). Introduction: Task-based language teaching in a nutshell. In K. Van
den Branden (Ed.), Task-based language education: From theory to practice (pp. 1–16).
Cambridge, England: Cambridge University Press. http://dx.doi.org/10.1017/
cbo9780511667282.002
van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils: Oral proficiency
interviews as conversation. TESOL Quarterly, 23, 489–508. http://dx.doi.org/10.2307/3586922
Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes.
Cambridge, MA: Harvard University Press.
Wajda, E. (2011). New perspectives in language assessment: The interpretivist revolution. In
M. Pawlak (Ed.), Extending the boundaries of research on second language learning and
teaching (pp. 275–285). Berlin, Germany: Springer. http://dx.doi.org/10.1007/978-3-642-20141-7_21
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Basingstoke,
England: Palgrave Macmillan.
Willis, D., & Willis, J. (2007). Doing task-based teaching. Oxford, England: Oxford University
Press.
Yoffe, L. (1997). An overview of the ACTFL proficiency interview: A test of speaking ability.
Shiken: JALT Testing & Evaluation SIG Newsletter, 1(2), 2–13.
Chapter 3
Introducing a New Assessment of Spoken
Proficiency: Interact

3.1 Introduction

In Chaps. 1 and 2 I laid the foundation for the study that is the focus of this book. In
these foundational chapters I argued that assessment reform is a risky business. This
is largely because the understandings and intentions of assessment developers may
differ from, or even be at odds with, the beliefs and perspectives of the primary users
of the assessment – teachers and students. This may be so even when there is
agreement about the goals and intentions of educational programmes. As a conse-
quence of perceptual or actual mismatches, strong feelings about a particular assess-
ment can be evoked. Seen more broadly, arguments also rage about which assessment
paradigm, static or dynamic, is more useful. There is likewise debate about whether
assessment performances should be measured in terms of task outcome or construct
alignment, or both. There has been much discussion about the relative merits of
assessing FL students’ spoken communicative proficiency through single interview
or paired/group assessments. Ultimately, there is no one ‘right’ way to assess a par-
ticular skill. In any assessment situation there will be a range of alternatives, with
advantages and disadvantages to each (Bachman & Palmer, 2010). In this light, it
may be argued that disagreements over particular forms of assessment are inevita-
ble, or at least should not take us by surprise.
The purpose of this chapter is to begin the story of New Zealand’s most recent
assessment reform with regard to assessing FL students’ spoken communicative
proficiency (the move from static single interview to on-going paired assessments).1
The chapter opens with a brief account of the events that precipitated the introduc-
tion of New Zealand’s high-stakes assessment system, the National Certificate of
Educational Achievement (NCEA), and what the new assessment system was
designed to accomplish. It goes on to present a detailed account of changes to
assessment practices and how these changes have influenced FL assessments. The
chapter then describes in some detail the processes involved in the most recent
reforms, and the implications of those reforms for the assessment of FL students’
spoken communicative proficiency. The chapter provides a thorough contextual
background for the study into teachers’ and students’ perspectives on assessment
reform reported in this book.

1
This presentation is derived, in part, from articles in Language Assessment Quarterly (East &
Scott, 2011a), published 25th May 2011, available online: http://dx.doi.org/10.1080/15434303.
2010.538779, and Assessment Matters (East & Scott, 2011b).

3.2 The New Zealand Landscape for Assessment – A Shifting Environment

The establishment, in the UK, of the body that came to be known as the Assessment
Reform Group (ARG) has had significant implications for how assessment in
schools may be conceptualised. The ARG was a voluntary group of researchers who
worked on researching educational achievement from 1989, originally under the
auspices of the Policy Task Group on Assessment of the British Educational
Research Association (BERA), until the dissolution of the ARG in 2010. Back in
2002, the ARG highlighted its thinking about the appropriate assessment of stu-
dents in schools with these words: “assessment is one of the most powerful educa-
tional tools for promoting effective learning. But it must be used in the right way”
(ARG, 1999, p. 2). During its just over twenty-year history, the ARG set out to
enhance the ‘power and right use’ of assessment by arguing for the educational
benefits of embedding assessment within on-going teaching and learning pro-
grammes, facilitating opportunities for feedback that would enhance learning out-
comes. This ‘assessment for learning’ principle challenged the somewhat entrenched
assumption that a summative examination-dominated system (the assessment of
learning) was the more effective model for measuring educational achievement.
In the New Zealand context, the arguments of the ARG were to have significant
impact. Hattie (2009) describes this impact as a “revolution of assessment” that has
“swept through many of our primary and secondary schools in New Zealand”
(p. 259). He goes on to explain:
This revolution relates to Assessment for Learning and it can be witnessed in innovations
such as the National Certificate of Educational Achievement (NCEA) and its standards-
based approach, the emphasis on reporting more than on scoring, constructive alignment of
learning and outcomes, peer collaborative assessment, learning intentions and success cri-
teria, and the realisation of the power of feedback. (p. 259)

Hattie’s (2009) stance challenges the once accepted norm that high-stakes assess-
ments should be operationalised primarily through a summative examination model.
Instead, he lauds the educational benefits of a high-stakes system that encompasses
‘alternative assessment’ models, including the value of feedback and feedforward
(Hattie & Timperley, 2007).
This assessment for learning stance to high-stakes measurements of learning is
reflected in a New Zealand Ministry of Education Position Paper on assessment
(Ministry of Education, 2011a) that signals a move “beyond a narrow summative
(“end point” testing) focus to a broader focus on assessment as a means of improv-
ing teaching and learning” (p. 4). The document goes on to assert that “[t]his
approach to assessment has strongly influenced the way in which we have imple-
mented standards-based assessment” (p. 9).
Part of the implementation of assessments aligned to specific criteria or stan-
dards has been through the strengthened and increased role of internal (teacher cre-
ated and context specific) assessments. This move has given teachers considerable
ownership of how best to operationalise assessments with due regard to their own
contexts, and considerable authority to grade students’ performances against the
published achievement criteria. Through internal assessments, teachers are expected
to make professional judgments about their students’ progress and achievements in
relation to the published expectations of the relevant assessment opportunity. The
Ministry document (Ministry of Education, 2011a) asserts that this approach, with
its “deliberate focus on the use of professional teacher judgment underpinned by
assessment for learning principles rather than a narrow testing regime,” is “very dif-
ferent from that in other countries” (p. 9).
There is, therefore, a sense of uniqueness in the way in which New Zealand has
operationalised its high-stakes assessment system for schools. Part of that unique-
ness is an ostensibly high trust model that ‘puts faith in teachers’ (ARG, 2006) by
placing teachers at the centre and relying strongly on their ability both to set mean-
ingful internal assessments and to make professional judgments on students’ perfor-
mances. In practice, the high-stakes nature of the system means that issues of
accountability, validity and reliability are important. As a consequence, external
examinations still have a role to play (NZQA, 2014b), and teachers’ internal assess-
ments are subject to scrutiny and moderation (NZQA, 2014c, 2014e). (There is a
tension here that I explore in Chap. 9.) However, internal assessments are key com-
ponents of the system, and teachers’ professional judgments are integral compo-
nents of these assessments. This scenario has significant implications for how the
assessment of FL students’ spoken communicative proficiency is conceptualised
and enacted under the NCEA system. In what follows I describe the range of reforms
that have impacted on assessment practices in the New Zealand context, culminat-
ing, for languages, in the introduction of interact.

3.2.1 The 1990s: A Mismatch Between Curricular Aims and High-Stakes Assessment

The early 1990s represented a significant time in New Zealand’s education system.
The year 1993 marked the publication of the New Zealand Curriculum Framework or NZCF
(Ministry of Education, 1993), acknowledged as the first attempt since the 1940s to
provide a government sanctioned “foundation policy” and “coherent framework”
for teaching, learning and assessment in New Zealand’s schools (p. 1). The NZCF
became the guiding document that would inform what would be taught in all school
years from Year 1 to Year 13. Shearer (n.d.) remarks that the NZCF represented a
“paradigm shift” in curricular thinking (p. 11) in that, for the first time, it would
provide “a clear set of national guidelines within which teachers could develop
programmes for their students” (p. 20).
To facilitate the introduction of what Shearer (n.d.) describes as an “outcomes
based” curriculum where “achievement objectives and assessment dominate”
(p. 10), a range of subject-specific curriculum support documents, or ‘national cur-
riculum statements’, were produced. These were designed to help teachers to plan
coherent programmes of study. Their purpose was clear:
The statements define in more detail the knowledge, understanding, skills, attitudes, and
values which are described in The New Zealand Curriculum Framework. They specify the
learning outcomes for all students. In each statement, several strands of learning are identi-
fied, each with one or more achievement aims. For each of these strands, sets of specific
objectives, referred to as the achievement objectives, are defined. These objectives are set
out in a number of levels, usually eight, to indicate progression and continuity of learning
throughout schooling from year 1 to year 13. (Ministry of Education, 1993, p. 22)

Additionally, the national curriculum statements “also suggest assessment proce-
dures, and provide assessment examples. Furthermore, they contain guidelines on
appropriate teaching and learning approaches” (Ministry of Education, 1993, p. 23).
The assessments referred to were essentially opportunities for learning. However,
alongside what were regarded as significant developments in how teaching, learning
and formative assessment would be organised in schools there existed a senior
school high-stakes assessment system that was in many respects at odds with where
the NZCF was aiming to go.
Up to the turn of the century, high-stakes assessments for New Zealand’s schools
were operationalised within a classic behaviourist and knowledge-based summative
examination culture. The two principal examinations were School Certificate
(known as School C), taken by students of 15+ years of age at the end of Year 11
(the final year of compulsory schooling), and the University Entrance, Bursaries and
Scholarships examination (known as Bursary), usually taken by students of 17+
years of age at the end of Year 13 (the final year of voluntary schooling). There was
also, in Year 12, an internal assessment system (Sixth Form Certificate) that did not
carry with it the high-stakes nature of the two main certification programmes.
For both examinations, School C and Bursary, students could sit up to six sub-
jects, and for each examination performances were measured and reported in terms
of five grades of achievement (Table 3.1).
The Bursary examination in particular carried significant weight. Bursary grades
were used to determine which students would qualify for entrance to university,
who would gain a monetary award (bursary), and who would be awarded a scholar-
ship grade for exceptional achievement (NZQA, 2014h). The grades therefore had
significant gate-keeping functions.
New Zealand’s high-stakes examinations represented a norm-referenced system
whereby test takers’ scores were to be interpreted in relation to all other test takers
and candidates were ranked against each other. Marks were also statistically scaled.
That is, raw scores were adjusted so that the mean mark became 50 % and higher
and lower performances were determined relative to that mean in a way that ensured
an adequate score distribution (Crooks, 2010). Final grades did not therefore reflect
candidates’ achievements relative to a series of criteria (which appeared to be where
the NZCF wished to go), in which case candidates’ grades would stay consistent
regardless of the year in which the assessment was taken. Rather, theoretically a
student’s grades in one year could be different to what that student may have
achieved in a different year depending on that student’s performance relative to
other candidates.

Table 3.1 Grades and percentage equivalents (School C and Bursary)

Grade   School C    Bursary
A       80–100 %    66–100 %
B       65–79 %     56–65 %
C       50–64 %     46–55 %
D       30–49 %     30–45 %
E       1–29 %      1–29 %
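The exact formula used for this statistical scaling is not reproduced here, but as a purely illustrative sketch of the procedure just described (raw marks adjusted so that the cohort mean becomes 50 % while each candidate keeps their rank order), one might imagine something like the following. The function name, the target spread of 15, and the clipping to 0–100 are assumptions made for illustration only, not the documented procedure:

def scale_marks(raw_marks, target_mean=50.0, target_sd=15.0):
    """Illustrative norm-referenced scaling: shift and stretch a cohort's raw
    marks so that the cohort mean becomes target_mean, preserving rank order.
    target_sd is a hypothetical spread, not a documented value."""
    n = len(raw_marks)
    mean = sum(raw_marks) / n
    sd = (sum((m - mean) ** 2 for m in raw_marks) / n) ** 0.5 or 1.0
    return [max(0.0, min(100.0, target_mean + (m - mean) * target_sd / sd))
            for m in raw_marks]

# The same raw mark can map to a different scaled mark (and hence grade) in a
# different year, because the result depends on the rest of the cohort.
print(scale_marks([42, 55, 61, 70, 78]))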
School C and Bursary thus presented significant challenges in terms of align-
ment with a series of curriculum documents that articulated levels of achievement
against specific criteria. The system was also coming under fire as an appropriate
assessment regime in the light of the assessment “paradigm shift” (Gipps, 1994)
that was having influence in a range of jurisdictions (see Chap. 2). In the early
1990s, at the time of the publication of the NZCF, the bodies responsible for over-
seeing New Zealand’s high-stakes assessment system, the Ministry of Education
and the New Zealand Qualifications Authority (NZQA), began to signal an inten-
tion to introduce an overarching national certificate of educational achievement
using a standards-based assessment model.
Criticism of the School C and Bursary system nevertheless brought to the fore
contrasting views about appropriate assessment, and the proposal to introduce a
standards-based criterion-referenced system led to intense debates that focused on
“differences in opinion over the suitability of national, external examinations … as
the main mode of assessment in the country’s system for certification” (Dobric,
2006, p. 85). In Chap. 2 I made it clear why differences in opinion were inevitable.
In that chapter I presented the apparent conflict and incompatibility between two
assessment paradigms: the ‘assessment of learning’ which focuses (as did School C
and Bursary) on the measurement of candidates’ knowledge and understanding
relative to others, and ‘assessment for learning’ which focuses on embedding
assessment within on-going teaching and learning programmes and opportunities
for feedback and feedforward. Given the conflict, it is not surprising that, in New
Zealand, “[n]early ten years was spent contesting the adoption of [a new assess-
ment] policy and it went through a number of changes due to the range and intensity
of issues and debates that occurred” (p. 88). Dobric goes on to conclude that this
was a considerable period of time, but that this time commitment reflected the real-
ity that, with the policy being so crucial to young New Zealanders’ educational
opportunities, there was a necessary requirement for intensive public and govern-
ment agency debate.

Out of this intense debate, occurring within the broader contexts of global
paradigmatic shifts and the introduction of the NZCF, came a new high-stakes
assessment system for schools – the National Certificate of Educational Achievement
or NCEA.

3.2.2 The NCEA System: The Beginnings of Reform

The educational and assessment philosophies underpinning the proposed new
NCEA were radically different to those operationalised through School C and
Bursary. Most important was the shift from norm-referencing to criterion-
referencing. Criteria were designed to clarify and articulate what students should
know and be able to do in a given area of learning, and at different levels of perfor-
mance. Criteria were made transparent in assessment blueprints (Bachman &
Palmer, 2010) known as ‘achievement standards’.
A further reform was that there was no longer a requirement for students to
be assessed across all aspects of a given subject, thereby only receiving one grade
per subject. Individual subjects were broken down into separate components, each
with their own standard and criteria, and students, in consultation with their teach-
ers, had the opportunity to select which standards they wished to complete. There
was also a balance of external (examination) and internal (teacher-constructed and
teacher-assessed) standards. For each standard, students could be awarded one of
four ‘grades’:
1. Achieved (A) for a satisfactory performance
2. Merit (M) for very good performance
3. Excellence (E) for outstanding performance
4. Not achieved (N) if students do not meet the criteria of the standard (NZQA,
2014i).
The intention of the new system (NZQA, 2014d) was to provide “a more accu-
rate picture of a student’s achievement [than the former system] because a student
who has gained credits for a particular standard has demonstrated the required skills
and knowledge for that standard” (¶ 3). Furthermore, each student would receive a
“School Results Summary that presents all standards taken throughout their school
years, and the results for each, and can order a Record of Achievement listing all
standards achieved at school and beyond” (¶ 3).
The new system was rolled out in a 3-year programme of implementation
whereby, as each level of the NCEA was phased in, the parallel level of the old
system was replaced. NCEA level 1 replaced School C, and was introduced from
2002. NCEA level 2 (replacing Sixth Form Certificate) was introduced from 2003,
and from 2004 NCEA level 3 replaced Bursary.
The published benefit of the NCEA system is that “[s]ince NCEA was intro-
duced, more students are leaving school with qualifications” (NZQA, 2014d, ¶ 4).
The NCEA therefore represents a positive shift in “framing the purpose of senior
secondary education from the selection of high-achieving students to the progres-
sion of all students” (Dobric, 2006, p. 86). Since its introduction, however, the
NCEA has been subject to on-going review and refinement (a summary of key dates
and milestones in that process can be viewed at the NZQA website, NZQA, 2014d).
Reflecting on what has been achieved since the introduction of NCEA, Hipkins
(2013) notes that “[a] decade after its inception, support for NCEA has further con-
solidated. Ninety-five percent of principals, 74 % of trustees, 69 % of teachers and
54 % of parents support it” (p. iii). Hipkins’ comment reveals widespread, although
not universal, acceptance of the NCEA system.
The most significant overhaul of the NCEA occurred during 2008–2010, prompted
by a complete revision of the NZCF which produced a revised national curriculum,
published in 2007 and mandated from 2010 (Ministry of Education, 2007).
Although the changes to the assessment system were not as radical as the move
away from School C and Bursary, they were nonetheless substantial. As a result of
the most recent review process that became known as the standards-curriculum
alignment, a revised NCEA system has been rolled out in another 3-year programme
of implementation: the revised NCEA level 1 was introduced from 2011; the revised
level 2 came on stream in 2012; the revised level 3 was operationalised from 2013.
The past two decades have therefore witnessed considerable changes to teaching,
learning and assessment in New Zealand’s schools. Furthermore, the broader issues
concerning modes of assessment and alignment or otherwise with the goals of the
school curriculum – both the original NZCF (Ministry of Education, 1993) and the
revised curriculum (Ministry of Education, 2007) – are clearly illustrated in what
has been happening to FL teaching, learning and assessment in the two decades
from the early 1990s.

3.2.3 The Impact of Assessment Mismatch on FL Programmes

The introduction of the NZCF in 1993 was in several respects a helpful document
for FL programmes in schools. It was intended that, within the essential learning
area Language and Languages, schools would provide opportunities for all students
from Year 7 (11+ years of age) to study an additional language. This intention fitted
within an argument, stated at the start of the NZCF, that “we need a work-force
which … has an international and multicultural perspective” (Ministry of Education,
1993, p. 1) and, at the end, that “[m]ore trade is occurring with the non-English
speaking world,” with the consequence that “[t]he different languages and cultures
of these new markets pose a challenge for education” (Ministry of Education, 1993,
p. 28). The Language and Languages curriculum ‘essence statement’ (a short
description of what was intended in this learning area) supported a view that, in the
words of Sakuragi (2006), language learning would help students to appreciate both
the “practical and tangible benefits of being able to communicate in a language” and
the “broader and intangible benefits of expanding one’s intellectual experience”
(p. 20). That is, although the academic benefits of language learning were noted and
appreciated, the communicative benefits were presented as being of central
importance.
A communicative and utilitarian emphasis became further strengthened through
the subsequent publication of a range of curriculum statements for the main interna-
tional languages taught in New Zealand (Ministry of Education, 1995a, 1995b,
1998, 2002a, 2002b). It became increasingly clear through these documents that the
favoured pedagogical approach for FL teaching and learning in New Zealand was
Communicative Language Teaching (CLT) which was seen as an approach that
“encourages learners to engage in meaningful communication in the target lan-
guage – communication that has a function over and above that of language learning
itself” (Ministry of Education, 2002a, 2002b, p. 16). The documents were not man-
datory. Rather, they were presented as guidelines. However, they were soon
adopted as syllabi by many teachers.
The documents essentially presented a communicative notional-functional syl-
labus framework, organising language learning around discrete topics (family,
school subjects, daily life, leisure time activities, etc.). Guided by the different lev-
els of the Common European Framework of Reference for languages (Council of
Europe, 2001; Koefoed, 2012; Scott & East, 2012), the guidelines included a range
of proficiency statements that were in the form of a description of what students
should be able to do with the language at four broad levels commensurate with the
eight required levels of the curriculum: emergent communication (levels 1 and 2),
survival skills (levels 3 and 4), social competence (levels 5 and 6), and personal
independence (levels 7 and 8). Additionally, achievement objectives for both recep-
tive skills (listening and reading) and productive skills (writing and speaking) were
noted, along with lists of structures and vocabulary typically associated with these
objectives at the appropriate level. There were also examples of suggested learning
and assessment activities. The curriculum guidelines were therefore quite extensive
in their prescription and arguably helpful in enabling teachers to organise their
programmes in ways that supported CLT.
However, in contrast to the aims and direction of the language-specific curriculum
statements, School C and Bursary relied on a terminal examination which included
tasks that were more aligned to a grammar-translation approach to language teaching
such as translation, reading comprehension and essay writing. Some subjects did
have internally assessed components, organised, carried out and marked by teachers,
and for languages speaking was assessed internally via a short summative interview
test, worth 20 % of the final mark. However, in common with the overall emphases
of the written examination, speaking proficiency was not regarded as central:
Oral assessment marks were scaled to fit with marks on the written paper. That is, the aver-
age oral assessment mark for a student cohort in a single school could not be higher than
that cohort’s average mark for the written paper. In cases where it was higher, the average
was scaled down and individual marks adjusted accordingly. No account was taken of indi-
vidual differences in performance across the oral and written components, and no meaning-
ful evidence of oral proficiency was available. (East & Scott, 2011a, p. 181)
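Read at face value, the scaling described in this extract amounts to a simple proportional adjustment of the oral marks. The sketch below is offered only as an illustration of that plain reading; the function, variable names and example figures are mine, not drawn from the original assessment documentation:

def scale_oral_marks(oral_marks, written_marks):
    """Illustrative sketch of the capping described above: if the cohort's
    average oral mark exceeds its average written mark, all oral marks are
    scaled down proportionally so that the two cohort averages coincide."""
    oral_avg = sum(oral_marks) / len(oral_marks)
    written_avg = sum(written_marks) / len(written_marks)
    if oral_avg <= written_avg:
        return list(oral_marks)  # no adjustment when the oral average is not higher
    factor = written_avg / oral_avg
    return [m * factor for m in oral_marks]

# e.g. a cohort oral average of 80 against a written average of 60 would scale
# every oral mark by 0.75, regardless of individual oral performance.
print(scale_oral_marks([90, 80, 70], [70, 60, 50]))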

Seen in the context of a CLT approach to FL teaching and learning, the assess-
ment of languages as operationalised through School C and Bursary seemed hardly
to be compatible. Certainly, the School C and Bursary system could not be described
as ‘fit for purpose’ with regard to measuring communicative proficiency constructs.

3.2.4 The NCEA for Languages – 2002–2010

The introduction of NCEA would have significant implications for languages. It
provided a valuable opportunity for languages specialists to introduce senior school
assessment options that reflected a CLT approach, and the communicative language
abilities that were increasingly seen to be of value. Two significant assessment
changes were brought about through the NCEA.
First, the creation of individual standards meant that it became possible to mea-
sure the four skills of listening, reading, writing and speaking as discrete items, and
to place equal weight on each skill. An emphasis was placed on assessing candi-
dates’ ability both to understand and to use meaningful language in authentic com-
municative contexts, and with increasing complexity at each NCEA level, thereby
providing measures of candidates’ developing communicative proficiency across
the four skills. Second, a greater emphasis was placed on internal components. Over
a third of the assessments became teacher-led and classroom-based. The original
NCEA assessment framework (or matrix) for languages is reproduced in Fig. 3.1.
Additionally, and in common with the philosophy underpinning the NCEA, stu-
dents were not required to take all the achievement standards, and were able to make
choices that represented their areas of strength.
The introduction of the NCEA from 2002 therefore provided a significant oppor-
tunity to promote positive washback in line with the expectations of a CLT approach
(Cheng, 1997; Scott & East, 2009). That is, if washback “refers to the extent to
which the introduction and use of a test influences language teachers and learners to
do things they would not otherwise do that promote or inhibit language learning”
(Messick, 1996, p. 241), it was anticipated that the NCEA would have a positive
effect. For those teachers who had begun to embrace the CLT philosophy as articu-
lated in the language-specific curriculum documents, the introduction of NCEA
provided the opportunity to ensure greater parity between what they wanted to
achieve in the classroom and what was expected, in terms of measurement, in the
high-stakes assessments. For those teachers who may have been more reluctant to
embrace a CLT approach, and for whom the former examination dominated system
provided a justification not to, introducing the new assessments became a powerful
vehicle for “forcing the reluctant and conservative to shift to more communicative
teaching methods” (Buck, 1992, p. 140).
The skill of speaking took on a higher profile in the assessment regime than had
been the case under School C and Bursary. Spoken communicative proficiency was
to be measured via two internal assessments. In one of these, prepared talk, candi-
dates would be assessed on their ability to make a short presentation in the target
language. This was effectively a monologic assessment of candidates’ ability to
communicate. In the second assessment, converse, candidates would be assessed on
their ability to carry out a short conversation, at the end of the year, with their
teacher as the interlocutor.
In both assessments, and in common with the language-specific curriculum doc-
uments that were then in place, assessment criteria were in line with the speaking
skills achievement objectives of the documents, consistent with the appropriate
level of the assessment (NCEA level 1, 2 or 3).

Fig. 3.1 The original NCEA assessment matrix (Note: 1 credit represents a notional 10 h of
learning)
For example, when speaking in French at curriculum level 6 (NCEA level 1), it
was expected that students would be able to “interact with French speakers in famil-
iar and social situations and cope with some less familiar ones” and “use basic lan-
guage patterns spontaneously” (Ministry of Education, 2002a, p. 57). They should be
able to (1) initiate and sustain short conversations in both formal and informal con-
texts; (2) give short talks on familiar topics in a range of contexts, past and present;
and (3) use appropriate pronunciation, stress, rhythm, and intonation. Achievement
objectives included (1) giving and following instructions; (2) communicating about
problems and solutions; (3) communicating about immediate plans, hopes, wishes
and intentions; and (4) communicating in formal situations. A range of suggested
grammar, vocabulary and formulaic expressions was provided for these purposes. It
was anticipated that the prepared talk and converse assessments at NCEA level 1
would draw on this range of objectives, including the requisite grammar, vocabulary
and expressions.
Thus, the converse standard represented an adaptation of the one-time summa-
tive teacher-led conversation that had been common to School C and Bursary. The
assessment arguably fitted with a spoken communicative proficiency construct as
reflected in Canale and Swain’s (1980) model of communicative competence. This
model was operationalised within clearly defined boundaries that established the
range of language that would be appropriate for each level of the assessment.

3.3 Towards a Learner-Centred Model for High-Stakes Assessment

3.3.1 2007: The Advent of a New Curriculum

At the start of the twenty-first century, the very structure of what was happening in
school teaching and learning programmes once more came under scrutiny. The
result of this scrutiny was the publication of a revised curriculum for New Zealand’s
schools (Ministry of Education, 2007) which reflected a growing momentum within
education to embrace a Vygotskian sociocultural view of teaching and learning,
with its emphasis on student-focused experiential learning in contrast to a teacher-
led and teacher-dominated didactic model (Weimer, 2002).
Underpinning the pedagogical reorientation was the articulation of five key com-
petencies that would inform the direction of teaching and learning programmes:
thinking; using language, symbols and texts; managing self; relating to others; and
participating and contributing (Ministry of Education, 2007, pp. 12–13). Each of
these competencies would contribute to a more learner-focused and autonomous
approach to pedagogy. Thinking requires students to “actively seek, use, and create
knowledge” and to “reflect on their own learning.” Using language, symbols and
texts includes “working with and making meaning of the codes in which knowledge
is expressed” and “communicating information, experiences, and ideas.” Managing
self requires students to “establish personal goals, make plans, manage projects, and
set high standards.” Relating to others necessitates “interacting effectively with a
diverse range of people in a variety of contexts,” and comprises “the ability to listen
actively, recognise different points of view, negotiate, and share ideas.” Participating
and contributing includes “a capacity to contribute appropriately as a group mem-
ber, to make connections with others, and to create opportunities for others in the
group.” The key competencies would encourage opportunities for collaborative
learning, and each could be developed through social interaction.
For languages, the introduction of a revised curriculum provided the opportunity to address two problems that had become apparent with the ways in which language
learning was being presented in the NZCF and the language-specific support docu-
ments. There was first the essential learning area Language and Languages. This
broad learning area was designed to cater for all language learning that might take
place in schools. Subsuming the teaching and learning of additional languages
within a broader curriculum area that included English (and/or te reo Māori) as first
languages had the effect of marginalising FL programmes (East, 2012). As a conse-
quence of international critique of the NZCF, in which two external reviews highlighted the lack of priority given to additional languages (Australian Council for
Educational Research, 2002; National Foundation for Educational Research, 2002),
a new learning area was created – Learning Languages. This new learning area
effectively separated out the teaching of an additional language from the teaching of
a first language and gave additional languages their own dedicated curriculum
space.
The second essential problem with languages programmes was pedagogical. The
move towards a learner-centred experiential pedagogical approach brought with it
an incentive to move away from conventional hierarchical curricular models which
“divide the language into lexis, structures, notions or functions, which are then
selected and sequenced for students to learn in a uniform and incremental way”
(Klapper, 2003, p. 35). In their place was the encouragement to consider more open-
ended approaches such as task-based language teaching which were built on an
educational philosophy that sees “important roles for holism, experiential learning,
and learner-centered pedagogy” alongside a sociocultural theory of learning that
supports “the interactive roles of the social and linguistic environment in providing
learning opportunities, and scaffolding learners into them” (Norris, Bygate, & Van
den Branden, 2009, p. 15).
The new essence statement for Learning Languages did not specify a task-based
approach, thereby leaving teachers free to make up their own minds about effective
pedagogy. However, TBLT was implicit in the statement, articulated in a core com-
munication strand, that this curriculum area “puts students’ ability to communicate
at the centre,” with the requirement that “students learn to use the language to make
meaning” (Ministry of Education, 2007, p. 24, my emphases).
To facilitate the shift from top-down and teacher-led to bottom-up and learner-
centred, the language-specific curriculum documents were withdrawn, with the
instruction that they should no longer form the basis for arranging languages pro-
grammes. In their place came a range of web-based support resources that would
support a task-based approach and that gave examples of appropriate communica-
tive activities in a range of languages at a variety of levels (Ministry of Education,
2010, 2011b). The web-based resources were not prescriptive in the way the former
guidelines (by default) had become. Teachers were free to make their own choices
about appropriate themes, vocabulary and grammatical structures, and when and
how to introduce them to students, within an overall eight-level curriculum frame-
work that continued to be aligned to the CEFR and that indicated the increasing
levels of proficiency that might be anticipated.
For languages, the shifts in emphasis from teacher-led to student-centred and
from a more prescriptive to a more open-ended task-based approach have required
significant adjustments for teachers which have not always been easy (see East,
2012, Chap. 8, for a thorough examination of the consequences of these shifts in the
early years of implementation, which include teacher uncertainty about what, how
and when to teach particular topics and structures, and how to reconcile a learner-
centred and experiential philosophy with a perceived requirement for the teacher to
remain ‘in control’). This new orientation to pedagogy also required a reconsidera-
tion of the NCEA. In what follows, I describe in detail the stages through which the NCEA for languages was reconsidered as a result of the revised curriculum.

3.3.2 NCEA Mark II

Across all subject areas, the introduction of a revised curriculum created the neces-
sity for the Ministry of Education and NZQA to conduct a review of the NCEA
system to ensure its alignment with curricular expectations. Both government agen-
cies took a radical and unprecedented direction for the review: teachers were
recruited for the review task via subject-specific national teachers’ associations.
This was the first time that subject associations had been invited to take part in this
kind of work. However, it was acknowledged that subject teachers were crucial
stakeholders not only in implementing the revised curriculum but also in mediating
effectively any assessments that might be linked to it, and it made sense that their
voice and perspectives should be central to the assessment review process.
The writing panels that were created became the blueprint writers for the new
assessments (that is, their brief was to write the standards; the role of creating new
assessments would subsequently be delegated to other panels).
It was recognised that teachers were not assessment specialists. The teachers
therefore became panels of ‘subject experts’ who would be guided by principles
developed by an ‘expert assessment group’. Bearing in mind the high-stakes nature
of the subsequent assessments, several guiding principles were crucial to the assess-
ment review. First, panels were left free to decide on the relative balance they would
like to see between on-going in-class assessments and external summative examina-
tions, although there was an expectation that a balanced approach would be main-
tained. Second, panels were required to ensure that the standards they designed
would subsequently be helpful in guiding the assessment writers in the development
of valid and reliable assessment tasks. Furthermore the blueprints (and therefore the
subsequent assessments) were to be aligned with the achievement objectives of cur-
riculum learning areas as published in the revised curriculum.
All writing groups were therefore required to ensure that proposed NCEA stan-
dards would:
• be derived from a curriculum or established body of knowledge;
• have a clear purpose;
• lead to assessment tasks that were both ‘valid and reliable’ and ‘possible and
manageable’;
• not duplicate one another;
• lead to qualitative differences in achievement (East & Scott, 2011b, p. 100).
Additionally, the blueprint writers were required to incorporate the five key com-
petencies which were to be used to demonstrate the qualitative differences in
achievement between the three grade levels: achieved, achieved with merit, and
achieved with excellence.

3.4 Revising the Assessments for Languages

For languages, the New Zealand Association of Language Teachers (NZALT) was
the subject association invited to undertake the assessment review, and its work
became known as the Standards-Curriculum Alignment Languages Experts
(SCALEs) project. The work was headed up by two project directors, one of whom
was, at the time of the review, the President of NZALT. Theirs was in fact a chal-
lenging brief in that it originally included not only the five international languages
traditionally taught in New Zealand’s schools (Chinese, French, German, Japanese
and Spanish) but also the Pasifika languages of Samoan and Cook Islands Māori,
and the classical language Latin (although the unique requirements of Latin led to
its subsequently taking a different path to assessment). The project directors
appointed a team of teacher representatives for this range of languages.
An initial scoping meeting between NZQA, the Ministry of Education and rep-
resentatives for all subjects was held in Wellington, New Zealand’s capital, on 19th
and 20th May 2008. Its purpose was to provide the opportunity for NZQA and the
Ministry to outline their intentions for the alignment exercise. The co-leaders of the
SCALEs project subsequently convened the full languages group for two weekend
meetings, one in 2008 and one in 2009, in order to respond to the brief. The leaders
also had frequent meetings together, and with the Ministry of Education, throughout
the 2-year process.

3.4.1 2008: The First SCALEs Meeting

The first weekend meeting of the SCALEs team was called for 7th and 8th June
2008. An initial task was to scrutinise all standards for all languages across all
NCEA levels to determine their current alignment with the core communication
strand of the revised curriculum. The group also had to consider two new skills
which had made their way into the Learning Languages essence statement (viewing
and presenting, and performing) as well as the role of the key competencies in dis-
tinguishing between different levels of performance.
The panel developed a revised assessment matrix. The matrix continued to incor-
porate the skills-based approach that had informed the first iteration of NCEA stan-
dards. However, the revisions to the curriculum, and in particular the move towards
a more holistic and task-based approach to language use, led the writers to think less
in terms of ‘skills’ as separate and discrete, and more in terms of tasks in which
language is used for specific purposes, with language skills viewed as context-
specific realisations of the ability to use language in particular tasks (Bachman &
Palmer, 1996, 2010). This led to several proposed changes.
First, it was proposed that two external (summative examination based) stan-
dards should be maintained from the previous system but that the multimodal and
integrated nature of language use should be more apparent. The titles to the stan-
dards were modified to reflect this (listening was reframed as listen and respond,
and reading was reframed as view and respond). Second, it was proposed that the
external writing standard should be removed, and that the internal prepared talk
standard should be maintained. Most significantly, and in line with a task-oriented
understanding of the curriculum expectation that students’ ability to communicate
was central, converse and write (internally assessed) were reconceptualised along
lines that would promote the on-going collection of evidence throughout the year.
In light of the communicative expectation of the revised curriculum, the converse
standard was a particular focus of attention. As I have previously stated, converse
was essentially a one-time summative examination whereby teachers would pose
questions that were guided and framed by the expectations of the language-specific
curriculum documents. Problematic here was that the documents, although pre-
sented as guidelines, had in practice become highly prescriptive, and this level of
prescription became mirrored in the assessments. Even though students were
expected to be able to converse in increasingly less familiar contexts as they pro-
gressed through the levels, teachers inevitably tailored questions to the suggested
topic areas, and inevitably expected functional language, vocabulary and grammati-
cal structures to be used commensurate with the level of learning (NCEA 1, 2 or 3).
The wording of the standards themselves, including the explanatory notes that inter-
preted the criteria, reinforced this requirement. In turn, students became subject to the expectation
that, no matter how contrived or artificial, they had to account, in their performances,
for specific grammatical structures. Rote learning of at least some questions and
responses was predictable, and ‘test wiseness’ exerted a negative influence on the
measurement of the underlying construct. In turn, authenticity was diminished and
concepts such as negotiation of meaning were effectively redundant.
A new standard for speaking was proposed – interact – which would focus on
evidence of genuine peer-to-peer spoken interactions. Fundamental here was the
notion that students could be encouraged to record, visually or aurally, spontaneous
and unrehearsed interactions with other people in the target language either as they
occurred within the context of the teaching and learning programme (e.g., during
task completion) or in genuine contexts such as trips overseas. Students could col-
lect samples of interaction throughout the year, selecting, in consultation with their
teachers, their best three for summative grading purposes. In order to underscore
how central the SCALEs team viewed the interact assessment, it was proposed that
the assessment would attract a high credit weighting in comparison with the other
new standards (six credits).2
By the end of the first meeting, a first draft of the proposed assessment matrix
was available, including titles of individual standards, proposed credit values and
modes of assessment (external or internal). NZALT subsequently initiated an on-
line consultation on the matrix with teachers and other key stakeholders via its web-
site. Unfortunately, the consultation took place at the end of the school year in 2008
and there was a tight time-frame for giving feedback. As a consequence, a good deal
of feedback was negative and reactionary, and revealed that many teachers had not
yet begun to engage with the intentions of the revised curriculum (which at that time
had only been available for a year and had not yet been mandated). Feedback
included alarm at the high credit value being attached to interact. More work was
needed. The matrix and standards were taken back to the drawing board.

3.4.2 2009: The Second SCALEs Meeting

A second meeting of the SCALEs project team was held in 2009. A good deal of
time was devoted to discussing the feedback that had been submitted. With regard
to speaking, analysis of teacher feedback revealed that many teachers were appre-
hensive about any changes that might lead to extra work or alterations to practice.
Some expressed anxiety about exactly what was going to be required for the new
interact standard and for on-going collections of evidence. However, the consulta-
tion had provided a genuine opportunity for stakeholders to engage with the reforms
being proposed, and stakeholder feedback was influential in effecting some changes.
The matrix was redrafted. For interact, one change was a reduction in credit value
from six to five credits.
The two published outcomes of the second SCALEs project team meeting were
a new draft assessment matrix for NCEA levels 1 to 3 and draft standards for level
1. (Draft standards for levels 2 and 3 were also written, but these were not required
at this stage and were therefore not published.)
A final consultation, which included the proposed standards for NCEA level 1,
was initiated on 9th June 2009. This was independent of NZALT and run on behalf
of the Ministry of Education through Research New Zealand. NZALT continued to
play a crucial role by encouraging teachers to express their views. Feedback received
through the second consultation was more balanced, with some teachers showing
support for changes which would lead to the recognition of more authentic and
genuine FL use. For interact, there was therefore a level of support for changes
which would lead to assessment evidence that was more representative of authentic attempts to interact and make meaning, in contrast to the one-off, contrived and inauthentic conversations which were part of the current NCEA.

2 The assessment in practice does not preclude evidence derived from a teacher-student interaction, but emphasises the peer-to-peer nature of the assessment and the expectation that most evidence will be derived from student-student interactions.
The achievements of the SCALEs project may be summed up in these words:
The SCALEs project has proposed a system of assessment for Learning Languages, which
makes communication central and provides opportunities for the completion of a range of
tasks. … This compares favourably to the current matrix … where equal weight is placed
on the four “traditional” skills. It contrasts significantly with School C and Bursary, with
their emphases on written examinations and marginalisation of oral production. There is a
strong emphasis on “assessment for learning,” exemplified in the collection of evidence
over time and the opportunity for students to select their best work for final assessment.
(East & Scott, 2011a, pp. 186–187)

Trialling of the new assessments at NCEA level 1 was initiated soon after the
Ministry consultation, continuing into 2010. However, the work that came out of the
SCALEs project represented a proposal that ultimately did not become actioned in
its proposed form. The final shaping of the proposal into confirmed standards was
not to occur until the Ministry of Education and NZQA initiated a different approach,
early in 2010, effectively setting up a new drawing board.

3.4.3 2010: A Further Opportunity to Confirm the New Assessments

On 9th March 2010 this author received an invitation to join a new standards writing
group for languages, initially with the brief of finalising the NCEA level 3 achieve-
ment standards. This writing group would be one of a number of groups across the
full range of national curriculum subjects, and we came together at a centralised
meeting in Wellington from 12th to 16th April 2010.
A move towards convening new writing groups was not instigated because the
approach involving subject associations had failed. On the contrary, at the meeting
the valuable work of the associations was acknowledged. However, the exercise,
which had involved individual meetings of subject associations, and individual con-
tracts, had proved to be expensive, both financially and in terms of the time involved.
The approach had also not provided the opportunity for cross-subject discussion.
Bringing all subject areas together in one central 5-day meeting was designed to
make this final phase of standards writing more manageable and economical, and
also to ensure comparability across the particularly high-stakes NCEA level 3 stan-
dards. Although individual subject groups worked separately for most of the 5 days,
there was the opportunity for all groups to receive generic information, and the
opportunity for a member of an overview group to ensure comparability across all
standards in all subject areas, signing off on a standard proposal when this had been
done.
On this occasion the composition of the languages writing group extended beyond practising teachers. It comprised two teachers, one of whom was also involved with trialling
NCEA level 1 assessment tasks, one NCEA languages assessment moderator, and
one member of New Zealand’s school support services for languages. I was invited
to be a member of the writing group in order to provide an academic perspective.
The group was therefore arguably representative of several key stakeholders. Its
work was facilitated by a member of the Ministry of Education with a special inter-
est in FL teaching and learning.
Once the work of the languages writing group got underway, it became clear that
our brief was going to extend beyond working on the level 3 standards. All groups
were requested initially to review the matrices and level 2 standards that subject
association groups had produced. This was because, theoretically, level 1 standards
were now complete (having been consulted on) and level 1 assessments were now
being trialled in schools. In the case of the languages group, we had the advantage
of draft standards up to NCEA level 3 because these had already been produced by
the SCALEs group (although, as previously stated, level 2 and 3 standards had not
been consulted on). We therefore had the advantage of access to blueprints across
all three levels, and our task became one of reviewing, discussing and refining all
standards across all levels. Having a second writing group, with different represen-
tation to the SCALEs group, provided a fresh opportunity to look at the challenges
and limitations of the current NCEA and to revisit some of the proposals from the
first writing group’s work alongside stakeholder feedback.
Although we began, as instructed, with NCEA level 2, it soon became apparent
that any changes we might propose to the level 2 standards would have implications
for level 1. This was initially of concern because we did not wish to make any
changes that might nullify or invalidate the trialling work that was already taking
place in the pilot schools. However, we were able to seek the input of the teaching
member of the writing group who had also been involved in the trialling process,
and were satisfied that the modifications we proposed to level 1 standards did not in
fact alter the types of assessment that were being piloted.
The new group paid particular attention to the proposals around the assessment
of speaking, which had generated a considerable amount of anxious feedback from teachers. We
recognised the limitations of the current converse standard, particularly alongside
the emphases of the Learning Languages learning area on the centrality of interactive
communication and a more open-ended task-based pedagogical approach. At the
same time, we acknowledged the feedback from several languages teachers in New
Zealand who appeared to be struggling to come to terms with the shifts in emphasis
within the revised curriculum and new learning area, and who had raised specific
concerns around interact. It was apparent that teachers were uncertain what interact
was intended to mean in practice and were alarmed about the workload implications
of collecting on-going evidence. In turn, teachers feared that they might lose stu-
dents who did not feel confident with spoken interaction and who might be put off by
such a strong focus on the oral component.
Our discussion about interact was both extensive and intensive. It occupied much
of the final 2 days of the 1-week meeting as we attempted to balance the value of the
comparatively more authentic model of assessment presented in interact with the
genuine concerns of teachers who were very used to the one-time summative model
of converse.
It was finally resolved that the credit value for interact should be moved back to
six credits. This was intended to signal the central place that several of us in the
group believed the assessment should have and the crucial washback implications in
terms of allowing the assessments to reflect, and therefore to support, curriculum
change (Cheng, 1997). However, it was decided that the higher credit weighting
would occur only at NCEA level 3. At levels 1 and 2 the credit value would remain
at five. This compromise position took into account teachers’ hesitation around
interact and also acknowledged that the revised NCEA would be progressively
introduced. It was hoped that, by the time level 3 was introduced in 2013, teachers
would be more certain about, and experienced with, how interact was designed to
work, such that raising the credit value only at level 3 would not be so problematic.
The compromise also acknowledged that students at level 3 have opted to persevere
with a language to the highest level available in schools and should therefore be in
a position to deal with the demands of authentic interaction without the high credit
weighting having a negative impact on student retention. It was noted that teachers
would require specific guidance about the types of evidence that might be drawn on
to fulfil the interact standard.
Subsequent to our meeting, draft level 2 standards were released for consultation
in 2010, followed by level 3 in 2011. Assessment resources were drafted, and trial-
ling of the new assessments was completed.

3.4.4 2011 Onwards: Support for the Implementation of Interact

The final confirmed assessment matrix for the revised NCEA for languages is repro-
duced in Fig. 3.2.
Once confirmed and approved for introduction, the assessment matrix and each
of the assessment blueprints were made publicly accessible via the NZQA website
(NZQA, 2014g). The individual standards documents provide greater articulation of
each standard within the matrix, including how performances are to be differenti-
ated between the three levels of achieved, achieved with merit, and achieved with
excellence across each of NCEA levels 1 to 3.
In common with all other revised assessments, the revised assessments for lan-
guages were progressively introduced, beginning with level 1 in 2011 and culminat-
ing in level 3 in 2013. At each point there was a cross-over year in which teachers
could opt to use the former standards if they wished to do so.
In addition to the matrix and standards, a range of support resources has been put
in place to help and guide teachers with the introduction of the new assessments.
One key resource is the series of senior secondary school guides dedicated to each
learning area of the revised curriculum (Ministry of Education, 2011b). The guides
are designed to help teachers to plan programmes and assessments aligned to the
expectations of the revised curriculum. For languages, these guides effectively
replace the considerably more prescriptive language-specific curriculum documents
(Ministry of Education, 1995a, 1995b, 1998, 2002a, 2002b).

Fig. 3.2 The revised NCEA assessment matrix (Copyright © NZQA)

A useful inclusion in the senior secondary guides for languages is a page which
articulates the changes that have occurred since the introduction of the revised cur-
riculum and the new assessments (Ministry of Education, 2012b). The key changes
as they relate to interact are presented in Fig. 3.3.
Further support for teachers has been made available through a link within the
guides to a range of language-specific exemplar resources for the internal achieve-
ment standards (Ministry of Education, 2014b). More informal support has included
the sharing of resources and exemplars among teachers through various channels
such as NZALT. Additionally, national moderators, who have overall responsibility
for the consistent application of the required internal standards, produce periodic updates (clarification documents) that engage with issues as they arise and help with clarifying the published expectations of the standards (NZQA, 2014f).

Fig. 3.3 Key changes between converse and interact (Ministry of Education, 2012b) (Note: See East, 2012, p. 36 for an overview of the four proficiency descriptors and their relationship to the CEFR, and the Ministry of Education (2014a) for proficiency descriptors across all eight curriculum levels)
In addition to web-based and printable resources, face-to-face support for
the implementation of the internal standards including interact has been provided
through a series of Best Practice Workshops (NZQA, 2014a). Two kinds of
workshop have been made available. The ‘making judgements’ workshops have run
for several years since the introduction of the revised NCEA. Their published aim is
to “increase assessor confidence when making assessment judgements for internally
assessed standards” (¶ 2). Teachers work with “real samples of student work” and
“engage in professional discussion with colleagues and the facilitator about inter-
preting the standards” (¶ 2). The published aims of the more recent ‘connecting with
contexts’ workshops are to “modify existing assessment resources to better meet the
needs of students” and to “increase assessor confidence in modifying assessment
resources without affecting the ‘NZQA Approved’ status” (¶ 11). Teachers “engage
in professional discussion with colleagues and the facilitator about assessment
resource design” (¶ 11). Additional dedicated support has been available through
the Learning Languages facilitators employed as part of a Ministry of Education
secondary student achievement contract which was introduced in 2012 (Ministry of
Education, 2012a).
Each of these resources represents a commitment to help teachers with the intro-
duction of interact in the broader context of a wholesale standards-curriculum
alignment exercise. Thus, considerable investment has been made to provide vari-
ous channels of support to teachers as they come to terms with implementing assess-
ment reform. As Hipkins (2013) notes, these most recent changes to the NCEA and the range of support available for teachers have generally been positively received.

3.5 Conclusion

Taking into account the New Zealand context for assessment that I outlined at the
start of this chapter, a number of observations concerning interact can be made. The
assessments linked to interact appear to provide a realisation of Hattie’s (2009)
observation that the NCEA is encouraging a ‘revolution of assessment’ with its
standards-based approach, and opportunities for peer collaboration, feedback and
feedforward. Furthermore, the assessments linked to interact reflect a model that
recognises the centrality of teachers to assessment and that places strong reliance on
their ability to create meaningful internal assessment opportunities and to make
professional judgments about students’ performances (ARG, 2006; Ministry of
Education, 2011a). In this light it is not surprising that the teacher voice was so
crucial to the standards-curriculum alignment exercise, in terms of involvement in
creating the new assessment blueprints, providing feedback on the development of
aligned assessments, and subsequent trialling.
Interact arguably has strong theoretical justification, not only in terms of its fit
with the direction in which assessment practices are going in New Zealand, but also
in terms of arguments around effective language pedagogy (see Chap. 1) and assess-
ment procedures (see Chap. 2). There are also clear implications for positive wash-
back in line with curriculum expectations. Teachers, however, are central to the
successful implementation of interact. In light of the radical departure from estab-
lished assessment practices, and concerns about interact that were raised during the
various consultation stages, it is crucial, now that the new assessment is being put
into operation, to find out from teachers how the roll-out is going. In Chap. 4 I con-
sider in more detail exactly what interact is requiring of teachers and students, and
outline the 2-year study that has sought stakeholder views on interact in practice.

References

ARG. (1999). Assessment for learning: Beyond the black box. Cambridge, England: University of
Cambridge Faculty of Education.
ARG. (2006). The role of teachers in the assessment of learning. London, England: University of
London Institute of Education.
Australian Council for Educational Research. (2002). Report on the New Zealand national cur-
riculum. Melbourne, Australia: ACER.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford, England: Oxford University Press.
Bachman, L. F., & Palmer, A. (2010). Language assessment in practice: Developing language
assessments and justifying their use in the real world. Oxford, England: Oxford University
Press.
Buck, G. (1992). Translation as a language testing procedure: Does it work? Language Testing,
9(2), 123–148. http://dx.doi.org/10.1177/026553229200900202
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second lan-
guage teaching and testing. Applied Linguistics, 1(1), 1–47. http://dx.doi.org/10.1093/
applin/i.1.1
Cheng, L. (1997). How does washback influence teaching? Implications for Hong Kong. Language
and Education, 11(1), 38–54. http://dx.doi.org/10.1080/09500789708666717
Council of Europe. (2001). Common European framework of reference for languages. Cambridge,
England: Cambridge University Press.
Crooks, T. (2010). New Zealand: Empowering teachers and children. In I. C. Rotberg (Ed.),
Balancing change and tradition in global education reform (2nd ed., pp. 281–310). Lanham,
MD: Rowman and Littlefield Education.
Dobric, K. (2006). Drawing on discourses: Policy actors in the debates over the National Certificate
of Educational Achievement 1996–2000. New Zealand Annual Review of Education, 15,
85–109.
East, M. (2012). Task-based language teaching from the teachers’ perspective: Insights from New
Zealand. Amsterdam / Philadelphia, PA: John Benjamins. http://dx.doi.org/10.1075/tblt.3
East, M., & Scott, A. (2011a). Assessing the foreign language proficiency of high school students
in New Zealand: From the traditional to the innovative. Language Assessment Quarterly, 8(2),
179–189. http://dx.doi.org/10.1080/15434303.2010.538779
East, M., & Scott, A. (2011b). Working for positive washback: The standards-curriculum align-
ment project for Learning Languages. Assessment Matters, 3, 93–115.
Gipps, C. (1994). Beyond testing: Towards a theory of educational assessment. London, England:
The Falmer Press. http://dx.doi.org/10.4324/9780203486009
Hattie, J. (2009). The black box of tertiary assessment: An impending revolution. In L. H. Meyer,
S. Davidson, H. Anderson, R. Fletcher, P. M. Johnston, & M. Rees (Eds.), Tertiary assessment
and higher education student outcomes: Policy, practice and research (pp. 259–275).
Wellington, NZ: Ako Aotearoa.
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1),
81–112. http://dx.doi.org/10.3102/003465430298487
Hipkins, R. (2013). NCEA one decade on: Views and experiences from the 2012 NZCER National
Survey of Secondary Schools. Wellington, NZ: New Zealand Council for Educational Research.
Klapper, J. (2003). Taking communication to task? A critical review of recent trends in language
teaching. Language Learning Journal, 27, 33–42. http://dx.doi.org/10.1080/09571730385200061
Koefoed, G. (2012). Policy perspectives from New Zealand. In M. Byram & L. Parmenter (Eds.),
The Common European framework of reference: The globalisation of language education pol-
icy (pp. 233–247). Clevedon, England: Multilingual Matters.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241–256.
http://dx.doi.org/10.1177/026553229601300302
Ministry of Education. (1993). The New Zealand curriculum framework. Wellington, NZ: Learning
Media.
Ministry of Education. (1995a). Chinese in the New Zealand curriculum. Wellington, NZ: Learning
Media.
Ministry of Education. (1995b). Spanish in the New Zealand curriculum. Wellington, NZ: Learning
Media.
Ministry of Education. (1998). Japanese in the New Zealand curriculum. Wellington, NZ: Learning
Media.
Ministry of Education. (2002a). French in the New Zealand curriculum. Wellington, NZ: Learning
Media.
Ministry of Education. (2002b). German in the New Zealand curriculum. Wellington, NZ:
Learning Media.
Ministry of Education. (2007). The New Zealand Curriculum. Wellington, NZ: Learning Media.
Ministry of Education. (2010). Learning Languages – Curriculum guides. Retrieved from http://
learning-languages-guides.tki.org.nz/
Ministry of Education. (2011a). Ministry of Education Position Paper: Assessment (schooling sec-
tor). Wellington, NZ: Ministry of Education.
Ministry of Education. (2011b). New Zealand curriculum guides senior secondary: Learning lan-
guages. Retrieved from http://seniorsecondary.tki.org.nz/Learning-languages
Ministry of Education. (2012b). What’s new or different? Retrieved from http://seniorsecondary.
tki.org.nz/Learning-languages/What-s-new-or-different
Ministry of Education. (2014a). Learning languages – Achievement objectives. Retrieved from
http://nzcurriculum.tki.org.nz/The-New-Zealand-Curriculum/Learning-areas/Learning-
languages/Achievement-objectives
Ministry of Education. (2014b). Resources for internally assessed achievement standards. Retrieved
from http://ncea.tki.org.nz/Resources-for-Internally-Assessed-Achievement-Standards
National Foundation for Educational Research. (2002). New Zealand stocktake: An international
critique. Retrieved from http://www.educationcounts.govt.nz/publications/curriculum/9137
Norris, J., Bygate, M., & Van den Branden, K. (2009). Introducing task-based language teaching.
In K. Van den Branden, M. Bygate, & J. Norris (Eds.), Task-based language teaching: A reader
(pp. 15–19). Amsterdam / Philadelphia, PA: John Benjamins.
NZQA. (2014a). Assessment and moderation best practice workshops. Retrieved from http://www.
nzqa.govt.nz/about-us/events/best-practice-workshops/
NZQA. (2014b). External examinations. Retrieved from http://www.nzqa.govt.nz/qualifications-
standards/qualifications/ncea/ncea-exams-and-portfolios/external/
NZQA. (2014c). External moderation. Retrieved from http://www.nzqa.govt.nz/providers-
partners/assessment-and-moderation/managing-national-assessment-in-schools/secondary-
moderation/external-moderation/
NZQA. (2014d). History of NCEA. Retrieved from http://www.nzqa.govt.nz/qualifications-
standards/qualifications/ncea/understanding-ncea/history-of-ncea/
NZQA. (2014e). Internal moderation. Retrieved from http://www.nzqa.govt.nz/providers-partners/
assessment-and-moderation/managing-national-assessment-in-schools/secondary-moderation/
external-moderation/internal-moderation/
NZQA. (2014f). Languages – Moderator's newsletter. Retrieved from http://www.nzqa.govt.nz/qualifications-standards/qualifications/ncea/subjects/languages/moderator-newsletters/October-2014/
NZQA. (2014g). Search framework. Retrieved from http://www.nzqa.govt.nz/framework/search/
index.do
NZQA. (2014h). Secondary school qualifications prior to 2002. Retrieved from http://www.nzqa.
govt.nz/qualifications-standards/results-2/secondary-school-qualifications-prior-to-2002
NZQA. (2014i). Standards. Retrieved from http://www.nzqa.govt.nz/qualifications-standards/
qualifications/ncea/understanding-ncea/how-ncea-works/standards/
Sakuragi, T. (2006). The relationship between attitudes toward language study and cross-cultural
attitudes. International Journal of Intercultural Relations, 30, 19–31. http://dx.doi.
org/10.1016/j.ijintrel.2005.05.017
Scott, A., & East, M. (2009). The Standards review for learning languages: How come and where
to? The New Zealand Language Teacher, 39, 28–33.
Scott, A., & East, M. (2012). Academic perspectives from New Zealand. In M. Byram &
L. Parmenter (Eds.), The common European framework of reference: The globalisation of lan-
guage education policy (pp. 248–257). Clevedon, England: Multilingual Matters.
Shearer, R. (n.d.). The New Zealand curriculum framework: A new paradigm in curriculum policy
development. ACE Papers, Issue 7 (Politics of curriculum, pp. 10–25). Retrieved from https://
researchspace.auckland.ac.nz/handle/2292/25073
Weimer, M. (2002). Learner-centered teaching: Five key changes to practice. San Francisco, CA:
Jossey-Bass/Cambridge University Press.
Chapter 4
Investigating Stakeholder Perspectives on Interact

4.1 Introduction

The recent introduction of interact as the primary assessment of high school stu-
dents’ FL spoken communicative proficiency within New Zealand’s high-stakes
assessment system, NCEA, marks a significant shift in the way in which such pro-
ficiency is to be assessed. Gone is the one-time summative teacher-student inter-
view test that was operationalised within a prescriptive framework that recommended
topic areas, vocabulary and grammatical structures for each level, and that required
judgments to be made against these. In its place has come a considerably more
open-ended assessment that places emphasis on spontaneous and unrehearsed peer-
to-peer interactions as they take place in the context of the teaching and learning
programme. These interactions can be about any topic, and do not require the use of
particular vocabulary and grammar to fulfil them. As I argued in Chap. 3, the assess-
ment opportunities presented through interact stand in stark contrast to those found
in the former summative School C and Bursary system. They also represent a sig-
nificant attempt to address several of the shortcomings of the first iteration of the
NCEA in line with a revised school curriculum.
Despite concessions made to teacher concerns at the time of its conceptualisa-
tion, interact arguably represents the most radical change in practice for FL teachers
and their students arising from the entire review of the NCEA for languages. The
discussions in the previous chapters have revealed, however, that the arguments sur-
rounding how best to measure FL students’ spoken communicative proficiency are
multifaceted. Furthermore, teachers’ reactions to interact at the early stages of plan-
ning for the reform revealed divided opinion on its efficacy. The realisa-
tion of interact is taking place within a complex environment.
Taking into account the arguments I presented in Chap. 2, effective measurement
of FL students’ spoken communicative proficiency cannot be perceived as being
straightforwardly a matter of relating an assessment to a particular theoretical
articulation of what it means to communicate proficiently in speaking, and then organising a test that will measure that theoretical definition. As I outlined in Chap.
2, arguments surrounding what makes a fair, valid and reliable assessment of spo-
ken communicative proficiency must take into account several dimensions: should
the assessment fit within a static or dynamic model of assessment (the assessment
of learning, or assessment for learning)? Should the assessment be defined in terms
of fulfilling the outcomes of a given task, or in terms of a construct of spoken com-
municative proficiency? Should the assessment be of an individual, or conducted in
a pair or group context?
From a theoretical perspective, the assessments associated with the interact
achievement standard appear to fit more comfortably within an assessment for
learning context, embedded within the process of teaching and learning, even
though they are essentially used for summative purposes. These assessments attempt
to marry both task-based and construct-based considerations in their design and
operationalisation. They focus on the measurement of individuals, but that measure-
ment takes into account the individual in interaction (McNamara, 1997).
Nevertheless, when wishing to draw conclusions about the usefulness or fitness
for purpose of interact as an assessment, a range of counter-arguments must be
taken into account: the validity and reliability of one-time tests; the facets of speak-
ing that we wish to measure; and whether it is better for the assessment to be carried
out by the teacher in a single-interview format.
What makes a speaking assessment valid or useful or fit for purpose depends on
a range of considerations which must impact on discussions. In this chapter I outline
a theoretical framework that will support the evaluation of the relative usefulness or
fitness for purpose of different assessment types. I go on to articulate the fundamen-
tal principles informing interact in practice in terms of the information teachers
have received, evaluating these against the theoretical framework for usefulness.
Finally, I present the methodology for the 2-year study that has sought stakeholder
views (both teachers and students) during the initial phases of the implementation
of interact (2012–2013).

4.2 Bachman and Palmer’s Test Usefulness Framework

Bachman and Palmer’s (1996) six qualities of test usefulness provide a traditional
and well-established framework against which to evaluate a given assessment.
Although Bachman and Palmer have subsequently substantially developed their test
usefulness argument (2010), their later work should not be taken to suggest that the
six qualities framework should be abandoned as a viable means of appraising
assessments. (For example, Pardo-Ballester, 2010, presents a relatively recent study
into assessment validation, and Hu, 2013, presents a recent discussion of estab-
lished ways of assessing English as an International Language. Both accounts draw
substantially on the test usefulness argument.)
Bachman and Palmer (1996) assert that an essential consideration in designing
and developing a language assessment is “the use for which it is intended,” such that
“the most important quality of a test is its usefulness” (p. 17). They go on to contend
that taking usefulness into consideration is an overriding component of quality con-
trol throughout the entire process of designing, developing and putting into action a
particular language assessment.
The dimensions of assessment I presented in Chap. 2 ultimately come down to
whether the assessment provides us with meaningful information about students’
spoken communicative proficiency. Those dimensions will inform how assessments
will be conceptualised and enacted. Bachman and Palmer’s (1996) theoretical con-
struct of test usefulness, together with its six sub-constructs, provides a lens through
which the arguments from Chap. 2 can be viewed and considered. The six qualities
represent a kind of ‘checklist’ (East, 2008) which supports the appraisal of a given
assessment. Bachman and Palmer argue that evaluating an assessment against the
six qualities will help to establish the relative usefulness of the assessment. The
items on the checklist are:
• Construct validity
• Reliability
• Interactiveness
• Impact
• Practicality
• Authenticity.
In what follows I provide an overview of each of the six components of the
Bachman and Palmer framework, with a particular focus on speaking. This over-
view revisits and expands on arguments I have presented in the previous chapters,
demonstrating the integration of facets of test usefulness with a range of arguments
pertinent to language assessments.

4.2.1 Construct Validity and Reliability

As discussed in Chap. 1, construct validity and reliability are “the two fundamental
measurement characteristics of tests. They are concerned with the meaningfulness
and accuracy of the scores awarded, in relation to the measurement of an underly-
ing construct, where the scores are indicators of the ability or construct” (East,
2008, p. 25).
When it comes to assessing speaking meaningfully, an issue of importance is to
define a spoken communicative proficiency construct which would inform the fac-
ets of speaking that require measurement. As I argued in Chap. 2, Bachman (2002)
asserts that, even when viewing assessments from a task-based perspective (that is,
when the outcomes of the task are considered the important criteria for making
judgments about proficiency), assessments should incorporate both the specification
of the task that is to be fulfilled and the abilities to be assessed. Luoma’s (2004)
conclusion is that ultimately assessment developers should take into account both
construct and task when designing and developing speaking assessments. Construct
validity therefore arguably becomes an important criterion for usefulness, whether
assessments are viewed from a task-based or construct-based perspective.
In other words, task considerations (what students are required to do in the
assessment, including the task-related outcomes they are to achieve) are important.
Equally important are construct considerations, that is, the underlying proficiency
that the task aspires to rate. As Bachman and Palmer (2010) argue, “[i]f we are to
make interpretations about language ability on the basis of performance on lan-
guage assessments, we need to define this ability in sufficiently precise terms to
distinguish it from other individual attributes that can affect assessment perfor-
mance” (p. 43, my emphases). The construct is therefore “the specific definition of
an ability that provides the basis for a given assessment or assessment task and for
interpreting scores derived from this task” (p. 43). Construct validity may be deter-
mined by articulating the construct that the assessment intends to measure – for
example, relating speaking proficiency to a general theoretical model (e.g., Canale,
1983; Canale & Swain, 1980), and/or by defining the facets of the specific construct
that the task aims to operationalise (e.g., apologising, negotiating, complaining) –
and subsequently by ensuring that assessment tasks will adequately measure the
construct. That is, we need to be satisfied that the scores will give us meaningful
construct-related information about different levels of candidate proficiency.
Following on from Bachman and Palmer’s (1996, 2010) argument concerning
the relationship between scores and a demonstration of construct validity, the pro-
cesses that lead to scores that give us accurate (i.e., reliable) differential information
on candidate performances are also important. Reliability is, therefore, “an essential
requisite of validation” (Bachman, 1990, p. 239).
Reliability relates to the procedure of awarding the scores (and therefore the
extent to which this process is adequate to justify the scores as measures of different
levels of candidate proficiency). This kind of reliability may be determined by hav-
ing clearly articulated criteria for assessment and using more than one rater of can-
didates’ performances. Reliability is also concerned with whether the assessment
(or a different version of the assessment) leads to comparable scores across different
administrations – so-called test-retest and parallel-forms reliabilities. If a speaking
assessment is working reliably, we would anticipate similar performances across
different administrations of the same kind of assessment (that is, if performances
across two administrations of the same or a parallel test yield widely different
scores, we would be concerned about why this is, and which performance gives us
the more accurate information about the candidate’s ability). Whether the focus is
on measuring one performance or on comparing several performances, the extent of
reliability in each case will usually be determined by subjecting the scores to some
form of correlational reliability analysis.
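As a purely illustrative sketch (with invented rater scores that are not drawn from the NCEA context), a simple correlational reliability analysis of the kind referred to above might involve computing a Pearson correlation between the scores that two raters award to the same set of spoken performances. The following Python sketch shows the basic calculation:

import math

def pearson_r(x, y):
    # Pearson correlation between two equal-length lists of scores
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical scores awarded by two raters to the same ten spoken performances
rater_1 = [3, 5, 4, 2, 6, 5, 3, 4, 6, 2]
rater_2 = [4, 5, 4, 2, 5, 5, 3, 3, 6, 2]

print(f"Inter-rater correlation: {pearson_r(rater_1, rater_2):.2f}")

A coefficient close to 1 would suggest that the two raters rank and score the performances very similarly; a markedly lower value would prompt closer scrutiny of the rating criteria and procedures.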
Thus, in terms of measurement, “validity focuses on the tasks themselves and
reliability is concerned with how consistently performances on those tasks are mea-
sured” (East, 2008, p. 25). A useful speaking assessment task will be one that is both
construct valid and reliable.
4.2.2 Interactiveness, Impact, Practicality and Authenticity

The four qualities of interactiveness, impact, practicality and authenticity are dis-
tinct from the two measurement qualities.
Interactiveness as a quality of usefulness relates to “the extent and type of
involvement of the test taker’s individual characteristics in accomplishing a test
task” (Bachman & Palmer, 1996, p. 25). A useful assessment task may be defined
as one that promotes positive interaction for most candidates. Bachman and Palmer
go on to articulate three individual characteristics that will influence the interface
between the candidates and what they are asked to do in the assessment (the task),
and how effectively the candidates are able to interact with the task. Positive interac-
tion is promoted when candidates are able to engage with the task in ways that allow
them to demonstrate the full potential of their proficiency (language ability). Related
issues here are whether the candidate has sufficient topical knowledge to engage
meaningfully with the task (e.g., knows enough to answer the questions being
asked), and “the affective or emotional correlates of topical knowledge” (p. 65), that
is, whether the task promotes a positive affective response (or at least does not
assess the candidate in an area that may provoke negative feelings, thereby hinder-
ing interaction with the task). As Leaper and Riazi (2014) make clear, task prompts
can be crucial because they influence the discourse in oral tests, promoting or hin-
dering the type and quality of language elicited. In the context of spoken communi-
cative proficiency, interactiveness as a construct and interactional competence as
defined in Chaps. 1 and 2 essentially relate to dimensions of the same issue: does the
task, and do the ways in which the task is operationalised, give candidates the great-
est opportunity to demonstrate context-relevant language ability?
Impact relates to the influence of a given assessment task on the candidates.
Impact at the macrolevel is concerned with the wider implications of the assessment
for a range of stakeholders, including the candidates themselves, their teachers, par-
ents, educational programmes, future employers and gatekeepers. There are impli-
cations here for performance outcomes (scores). Macrolevel impact must concern
itself with accountability and therefore must take into account the range of argu-
ments I have thus far presented about assessment conceptualisation and operation-
alisation. For example, in contexts where teachers are accountable to school
authorities and administrators, they will inevitably feel an obligation (even pres-
sure) to ensure that their students perform as well as possible on assessments. Where
scores have significant implications for candidates’ futures, students will inevitably
feel pressure to perform at their best. This has implications for microlevel impact.
Impact at the microlevel is concerned with the effects that taking a particular
assessment has on individual candidates. These effects may be related to the assess-
ment environment or assessment conditions (e.g., how hot or cold the room is; how
the assessment is set up). These effects also include score impact (the consequences
for the candidates of receiving particular scores). Related to this is the extent to
which a given assessment heightens or lowers candidates’ anxiety and stress. That
is, impact is concerned with “the positive or negative consequences for the test
takers as they actually carry out a given test” (East, 2008, p. 176). Bearing in mind
the arguments I have presented earlier around the potential anxiety associated in
particular with high-stakes assessments (Brown & Abeywickrama, 2010; Shohamy,
2001, 2007), a useful speaking assessment task may be defined as one that promotes
positive impact for most candidates. Candidates are able to engage with the task in
suitable conditions and without the assessment invoking undue anxiety or stress,
either or both of which might affect candidates’ opportunities to demonstrate the
full potential of their proficiency.
Practicality is a quality of how the assessment is administered. Practicality is
concerned with questions such as ‘are there sufficient resources – people, time,
materials – to enable the assessment to be carried out successfully and efficiently?’
and ‘will this assessment be more or less efficient than another kind of assessment?’
(East, 2008, p. 176). In this regard, a useful speaking assessment task may be
defined as one that can be administered sufficiently efficiently in all respects (e.g.,
task-setting; task execution; task scoring and reporting) that it can be considered
acceptably practical, and not unduly burdensome or costly, both in time and in
finances.
Authenticity, as I noted in Chap. 1, is a quality of the relationship of the assess-
ment to what Bachman and Palmer (1996) refer to as a target language use or TLU
domain. TLU domains are the actual real-world situations that the assessments aim
to mirror. For Bachman and Palmer authenticity is a critical quality of language
assessments (at least language assessments that replicate a CLT-focused learning
environment). This is because, if we want to establish how well assessment candi-
dates are likely to perform in subsequent real-world TLU domains beyond the class-
room, we need to generate assessment tasks that will allow them to use the range of
language they are likely to encounter outside the classroom (East, 2008).
As indicated in Chap. 2, an important distinction to make is between situational
and interactional authenticity (Bachman, 1990). Essentially, situational authenticity
refers to the attempted replication of a TLU domain in an assessment context (the
‘cup-of-coffee-in-a-restaurant’ scenario). These replications are effectively only
limited simulations of the real world. They cannot evoke genuine communicative
interaction and can only be made to look like real-life interactions (Lewkowicz,
2000; Spolsky, 1985). An issue of concern therefore becomes what exactly is being
measured, and how reliable the assessment data are in terms of reflecting and mea-
suring future TLU domain performance.
Interactional authenticity fundamentally concerns “the interaction between the
language user, the context, and the discourse” (Bachman, 1990, p. 302). Interactional
authenticity is promoted in assessments that “involve the test taker in the appropri-
ate expression and interpretation of illocutionary acts” (p. 302) and provide “poten-
tial for generating an authentic interaction with the abilities of a given group of test
takers” (p. 317). Authenticity thus “becomes essentially synonymous with what we
consider communicative language use, or the negotiation of meaning” (p. 317), and
authentic tasks are those that “engage test takers in authentic illocutionary acts”
(p. 321). An interactionally authentic assessment will require learners to draw on,
and display, the kinds of language knowledge and interactional skills that they might
use in any real-life interactional situation beyond the assessment (East, 2012).
Defining authenticity in interactional terms creates the opportunity to use speak-
ing assessment tasks that might not necessarily reflect a future TLU domain
(although they may aim to do so), but that will invoke genuine interactional abilities
from candidates. However, as Lewkowicz (2000) argues, neither situational nor
interactional authenticity is absolute – an assessment task might be “situationally
highly authentic, but interactionally low on authenticity, or vice versa” (p. 48).
Role-playing the purchase of a cup of coffee in a restaurant is one possible example
of this. Depending on what is regarded as important as part of the assessment pro-
cedure, tasks might be developed that aim to enhance either situational or interac-
tional authenticity, or both.
Bachman and Palmer (1996) also suggest a further link between authenticity and
interactiveness in that a positive affective response to an assessment task may be
enhanced when candidates perceive the task to be authentic. That is, authenticity
may contribute to candidates’ perceptions of the relevance of the assessment, and
this perceived relevance may promote more positive interaction with the assessment
task, thereby helping candidates to demonstrate their proficiency. In this light, it
may be suggested that a useful speaking assessment task may be defined as one that
is at least interactionally authentic (i.e., promoting opportunities for candidates to
interact with and in the task using the target language in ways that they might also
do outside of the assessment situation) even if it is not necessarily situationally
authentic (i.e., not trying to replicate a future real-world TLU situation).
Bachman and Palmer (1996, p. 18) do not suggest that all six qualities must be
equally distinguishable if a conclusion is to be reached that a test or assessment is
useful. In their view, three principles inform the application of the six qualities to a
given assessment: (1) it is important to maximise overall usefulness; (2) as a conse-
quence, the individual usefulness qualities need to be evaluated in terms of their
combined effect on overall usefulness; (3) however, the relative importance of each
of the six components needs to be determined in the context of the specific assess-
ment in question.
Thus, in some situations (for example, a formative in-class communicative task,
the outcomes of which are not required for accountability purposes), interactional
authenticity may be considered of paramount importance, whereas reliability is not
a consideration. In others (such as a high-stakes discrete point grammar test), reli-
ability may be at the top of the list, but authenticity and interactiveness are not. Both
assessments may be considered fit for purpose (albeit different purposes) and, there-
fore, useful.
In the context of the high-stakes assessment of FL students’ spoken communica-
tive proficiency, however, it is arguably important for all six qualities to be traceable
and discernible, although not necessarily in fully equal measure. That is, if we are
to tap into candidates’ spoken communicative proficiency as meaningfully and use-
fully as possible, we arguably require assessment tasks that:
• are construct valid (that is, leading to outcomes that adequately represent
different levels of performance against a defensible theoretical model of spoken
communicative proficiency and/or against the facets of the construct we wish to
measure);
• are reliable (that is, leading to outcomes that are defensibly consistent, fair and
equitable);
• promote positive impact and positive interaction, at least for the majority of can-
didates, so that candidates have the opportunity to display their full proficiency
(that is, no matter how much assessment task setters strive to promote positive
impact and interaction, there will always be some candidates who will fail to
interact positively with the task and who may be subject to negative impact);
• are as practical as possible within the requirement to collect and measure mean-
ingful data;
• are as authentic as possible, possibly situationally, but at least interactionally
(thereby contributing to positive interaction and perceived relevance).

4.3 2011 Onwards: Interact in Practice

Taking the above usefulness arguments into account, a key issue for the evaluation
of interact is the extent to which, when considered against Bachman and Palmer’s
(1996) six qualities, the assessment measures up. Teachers were not left to their own
devices to move from the one-time summative interview test to a series of peer-to-
peer spoken interactions. As I outlined in Chap. 3, a whole range of support
resources has been put in place to scaffold teachers and to provide avenues for on-
going support. In what follows, I summarise the key information available to teach-
ers about interact and its operationalisation as presented in a range of internet-based
guidelines and resources, including periodic clarifications from the chief moderator
(moderator’s newsletters, NZQA, 2014d) that contain information about how inter-
act is to be enacted. I draw from this material some initial conclusions about the
usefulness of the assessment as determined against Bachman and Palmer’s
framework.
With regard to interact in comparison with the converse assessment that it has
replaced, two significant changes for teachers and students stand out. There is, first,
the move away from conversations embedded within a clearly defined structure and
the move towards open-ended communicative interactions that do not require the
use of specific topics, vocabulary and grammar for their successful completion.
There is, second, the move away from the summative one-time teacher-led inter-
view test and the move towards a more dynamic assessment model that relies on the
collection of on-going evidence of peer-to-peer interaction and that builds in oppor-
tunities for feedback.
Regarding the first significant shift (greater linguistic open-endedness; no longer
measuring proficiency in terms of prescribed topic, language and grammatical
structures), communicative outcomes become the basis of judgment. Successful
outcomes are judged in terms of students’ ability to do certain things with language at different levels of proficiency (in this regard, outcomes have been heavily influenced by the Common European Framework of Reference for languages or CEFR [Council of Europe, 2001; Koefoed, 2012; Scott & East, 2012] – see Chap. 3). As a reflection of the requirements of the revised curriculum, the individual interactions (three at levels 2 and 3, and up to five at level 1) must show evidence, when considered holistically, of students’ proficiency as presented in Fig. 4.1.

Fig. 4.1 Outcome requirements of interactions (NZQA, 2014c)
Students are free to use all and any language at their disposal. The assessment
task cannot expect students to use language beyond the target level in order to
achieve the standard, but there is also no expectation that particular language/struc-
tures belong to and/or must be used at a particular level. The purpose of the task will
dictate the appropriate language. Essentially, “[i]n all situations the students should
be showing their ability to use the language they have acquired in as natural a way
as possible i.e. not artificially using long sentences and complex structures where
native speakers would not naturally do so” (NZQA, 2014c, p. 1).1 Furthermore, a
range of interactional contexts/genres is anticipated in the assessment. Different
contexts and genres will elicit different kinds of language, and this has implications
for defining the construct of relevance, and thus for construct validity. However, the
requirement to assess the collection of interactions holistically means that, taken
together, there is opportunity to collect evidence of performance across the different
facets of a defined spoken communicative proficiency construct that are considered
important.

1. Open-endedness of language is obscured by one clarification document which suggests that the former language-specific curriculum documents and the former vocabulary and structures lists (which are supposed to have been withdrawn, see Chap. 3) may be used for guidance when determining whether the appropriate level of language has been achieved (NZQA, 2014d, moderator’s newsletter, December 2012). However, the overall tenor of the guidelines signals openness of language and structures.

In terms of fluency, some level of spontaneous interactional ability, commensurate with the level and the student’s proficiency (achieved, merit, excellence), must be in evidence across all levels of the assessment (NCEA 1, 2 and 3). At NCEA levels 1 and 2, ‘spontaneous’ and ‘unrehearsed’ are implicit criteria in the performances that may be expected, but are influenced (particularly at level 2) by the expectation of the corresponding level of the CEFR (level B1): “entering unpre-
pared into conversation.” At level 3, spontaneity becomes an explicit criterion which
“refers to the ability to maintain and sustain an interaction without previous
rehearsal” (NZQA, 2014d, moderator’s newsletter, March 2013). This explicitness
reflects the expectation of CEFR level B1 and, at the higher levels of performance,
level B2: “interacting with a degree of fluency and spontaneity.” Students also need
to demonstrate a “repertoire of language features and strategies” (NZQA, 2014c,
p. 1). That is, the display of interactional proficiency that embraces the dynamic
process of communication (Kramsch, 1986) is primary.
Although spontaneity and authenticity are essential hallmarks of appropriate
interactions, ‘spontaneous and unrehearsed’ does not suggest that task preparation
and task repetition are invalid (as I noted in Chap. 2, in the teaching and learning
context both task preparation and task repetition are valid components that contrib-
ute to successful task completion). In this regard, it is acknowledged that students
will “complete the assessment once they have used the language in class and have
sufficient mastery of the language to complete the task” (NZQA, 2014d, March
2013).
In terms of accuracy, the guidelines stress that grammatical errors are not criteria
of this standard. That is, “[i]ncorrect language/inconsistencies will only affect a
grade if they hinder communication” because “[i]n a realistic conversation by learn-
ers of a second language errors are natural and should not be overly penalised”
(NZQA, 2014c, p. 1). Grammatical proficiency is important, but only in terms of its
contribution to effective communication.
With regard to the second significant change, from a one-time summative test
model to an on-going formative assessment model, students are to be given oppor-
tunities to interact in a range of contexts (NZQA, 2014c). This will enable the
assessment to tap into discourse and sociolinguistic competence (Canale & Swain,
1980), or sociocultural communicative competence which requires language users
to adapt their language choices to fit different contexts and genres appropriately
(Hinkel, 2010). Furthermore, feedback and feedforward are encouraged, although
teachers must ensure that ultimately “the final product is still a reflection of the
student’s own ability” (NZQA, 2014d, October 2014).
In addition to the different avenues of support that I noted towards the end of
Chap. 3, a range of written annotated exemplars for several languages and across
several levels was also introduced (Ministry of Education, 2014b). Written com-
mentary is provided on the sample performances and how they are to be judged
against the achievement criteria for each standard (level 1, 2 or 3). Other annotated
exemplars are in the form of downloadable mp3 files of actual students undertaking
interactions in a variety of contexts and at a range of levels (NZQA, 2014e). These
were recorded in the process of trialling the assessment. Alongside the audio files
are notes to explain the judgments about performance levels.
In terms of the place of interact within the broader assessment system, the over-
all tenor of the NZQA guidelines is that, at the end of the day, interact is part of a
high-stakes assessment system. For example, the requirement to give students ade-
quate notification of assessment tasks and what those tasks will involve (NZQA,
2014d, June 2012) potentially focuses attention on the task as an assessment, and
negates the validity of the recording of impromptu evidence (the language that stu-
dents use in the process of completing an in-class task when their attention is
focused on the task, and not on assessment). This appears to have moved the assess-
ment somewhat away from its original intent (as noted by East & Scott, 2011a,
2011b), and brings into question the realisation, through interact, of teaching and
assessment as a dialectically integrated activity (Poehner, 2008). As I said at the
start of Chap. 3, however, the high-stakes nature of NCEA means that issues of
accountability, validity and reliability are important. There is a requirement to work
these issues out meaningfully through internal assessments and teachers’ profes-
sional judgments as integral components of the system.

4.4 The Theoretical Usefulness of Interact

Evaluated against the six qualities of test usefulness, it would appear that interact
measures up considerably well as a valid and useful measure of spoken communica-
tive proficiency.
In terms of construct validity, the guidelines appear to encourage assessments
that reflect a more comprehensive range of facets of a spoken communicative profi-
ciency construct than those measured by a single-candidate interview test (see
Chap. 2). With regard to reliability, a range of evidence should enable assessors to
determine whether a measure of proficiency has been provided that reliably (over
time) assesses individual candidates’ levels. There are clear instructions about pre-
preparation – for example, no pre-scripting and rote-learning (NZQA, 2014d) with
a view to ensuring that what is presented is the candidates’ own work, providing
evidence of interactional competence. Both internal and external moderation pro-
cesses are built into procedures (NZQA, 2014a, 2014b). There is a requirement for
clear identification of candidates for scoring and external moderation purposes
(NZQA, 2014d, June 2013), thereby facilitating both individual grading and mod-
eration of that grading.
As far as the other four qualities are concerned, the guidelines suggest the devel-
opment of assessment tasks that will promote positive interaction, reflecting both
interactionally and situationally authentic scenarios in which candidates should be
able to interact meaningfully. Furthermore, the requirements for fluency and spon-
taneity contribute to likely enhanced authenticity in comparison with converse.
Opportunities for feedback and the move away from the requirement to account for
different grammatical structures at different levels (and thereby to force these into
use, no matter how artificial) will potentially enhance both interaction with the task
and positive impact on the candidates. With regard to practicality, it is clear that
evidence of interaction for assessment purposes is reasonably short (NZQA, 2014c),
and the evidence is to be assessed holistically on one occasion (NZQA, 2014d,
December 2012).
Counter-arguments to a claim to usefulness include the collection of evidence over time. If the evidence is spaced throughout the year, it is possible (indeed likely)
that performances towards the start of the year will not be at the same level of pro-
ficiency as performances towards the end of the year. This raises the question of
what evidence may be included to demonstrate proficiency, and when that should be
collected. Feedback and feedforward also potentially present challenges: when is
the work clearly the candidate’s own, and when is that work unfairly influenced by
feedback? The requirement to inform candidates when an assessment is to take
place potentially focuses attention on the interaction as an assessment, with impli-
cations for impact and interactiveness. This, and the apparent blurring of the stance
towards the use of spontaneous interactions recorded beyond the classroom, raises
issues around authenticity (i.e., what evidence does constitute a genuinely authentic
interaction?).
Theoretically, therefore, interact does appear, on the one hand, to measure up
well against the theoretical construct of test usefulness and each of the sub-
constructs. On the other hand, interact can be challenged at several points.
Additionally, a theoretical evaluation does not take into account teachers’ early
reactionary feedback to the proposal to introduce interact which also raised initial
concerns about practicality (East & Scott, 2011a, 2011b; Scott & East, 2009). Nor
does it take into account on-going arguments and debates about the operationalisa-
tion of interact in practice (as evidenced, for example, through the occasionally
passionate, fiery and heated listserv conversations throughout late 2013 and early
2014 that I referred to at the start of Chap. 1). In other words, several warrants in
support of interact can be advanced; several rebuttals to those warrants can be made
(Bachman & Palmer, 2010).
In turn, introducing an assessment such as interact on the basis of arguably sound
and defensible theoretical principles of both effective second language pedagogy
and effective assessment, together with counter-arguments as to its suitability, raises
the need to investigate, after its introduction, what is happening with interact in
practice. As has already been made clear, for all its theoretical justification, interact
represents a radical departure from previous and established procedures such that
investigating its impact and usefulness in the real worlds of classrooms is crucial. In
what follows, I outline the research project reported in this book that sought to do
this.

4.5 A Study into Teachers’ and Students’ Views

The study reported in this book was implemented to investigate stakeholders’ per-
spectives on the assessment reform during the period of its initial roll-out (2012–
2013). The study sought to answer the following questions: What are teachers and
students making of the innovation? What is working, what is not working, what
could work better? What are the implications, both for on-going classroom practice
and for on-going evaluation of the assessment?
The choice to investigate perspectives at an early stage in the implementation process was deliberate. Bearing in mind the considerable changes to practice antici-
pated in the introduction of interact, and the reactionary feedback that had been
received from teachers at the early stages of planning the reform, a key issue of
interest for me was stakeholder perspectives on interact in comparison with con-
verse. Gathering data at the earlier stages of the reform would enable teachers to
reflect on the relative merits and demerits of interact whilst converse was still fresh
in their minds.
Indeed, as I explained in Chap. 3, the new assessments were introduced in a stag-
gered process, beginning with NCEA level 1 in 2011 and culminating in level 3 in
2013. In each year of introduction there was also a cross-over phase which allowed
teachers to select either the new standard (interact) or the original standard (con-
verse). (From 2014 all former standards were withdrawn, and only the new [aligned]
standards became available.) The two stages of the study (Stage I and Stage II)
coincided with different junctures in the reform implementation process (Table 4.1).
The data corpus for the study consisted of several data sets. The documentary
evidence available through the range of New Zealand based resources, accessible
on-line and produced to support teachers with their implementation of interact (see
Chap. 3 and earlier this chapter) provided foundational data upon which the two-
stage empirical investigation was built.
Data collected during the two stages of the empirical study comprised:
1. an anonymous teacher survey in 2012 (n = 152);
2. interviews with teachers in 2012 (n = 14) and 2013 (n = 13). These were drawn on
to broaden and deepen the understandings gleaned from the survey;
3. anonymous student surveys in 2012 (n = 30) and 2013 (n = 119). These provided
an additional vantage point from which to view the terrain.

Table 4.1 Stages of the study

• 2011: Level 1 interact introduced (Year 11, 15+ years of age); level 1 converse still available (transition year)
• 2012 (Stage I – end of 2012, implementation mid-point): Level 2 interact introduced (Year 12, 16+ years of age); level 2 converse still available (transition year); level 1 converse withdrawn
• 2013 (Stage II – end of 2013, implementation completed): Level 3 interact introduced (Year 13, 17+ years of age); level 3 converse still available (transition year); level 2 converse withdrawn
• 2014: Only new aligned assessments available; level 3 converse withdrawn

4.6 Study Stage I

4.6.1 Nationwide Teacher Survey

The primary data collection instrument utilised in Stage I was a large-scale nation-
wide anonymous paper-based survey. The survey was targeted at teachers of the five
principal FLs taught in New Zealand (Chinese, French, German, Japanese and
Spanish), and levels 1 and 2 of NCEA. The survey sought teachers’ perceptions of
interact in comparison with converse, whether or not teachers had chosen to use the
new assessment since its introduction.
The overarching construct measured by the survey was ‘perceived usefulness’.
That is, the survey sought to elicit teachers’ perceptions of the usefulness or fitness
for purpose of both interact and converse. Usefulness was interpreted as incorporat-
ing the six qualities described earlier in this chapter as articulated by Bachman and
Palmer (1996).
In Section I of the survey respondents were asked to respond to ten paired state-
ments. One statement referred to converse and the other to interact. The statements
were written to reflect and measure perceptions across different facets of the useful-
ness construct. There were four sub-constructs:
1. Perceived validity and reliability (Statements 1, 2 and 3)
2. Perceived authenticity and interactiveness (Statements 4, 6 and 7)
3. Perceived impact (Statements 5 and 8)
4. Perceived practicality (Statements 9 and 10).
Bearing in mind that each statement was paired to elicit comparative attitudinal
data, it was important to prompt more precise and nuanced responses than those that
might have been collected from the more commonly used, but somewhat blunt, five-
point Likert scale (strongly disagree/disagree/neutral/agree/strongly agree). That
is, using a five-point scale would not have given the opportunity for respondents to
demonstrate comparative differences in strength of perception between interact and
converse with regard to a particular statement. For each paired statement, respon-
dents were presented with a 5cm line, with strongly disagree at the left-hand end of
the line and strongly agree at the right-hand end (Fig. 4.2). Respondents were
required to indicate the strength of their responses to each statement by drawing a
vertical line at the appropriate point. Strengths of responses were determined by measuring the distance, in mm, from the left-hand end of the line to the point of intersection. This measurement was then converted into a score out of 10.2

Fig. 4.2 Procedure for eliciting strength of perception (‘Please mark a clear vertical line to indicate the level of your response’, on a line running from strongly disagree at the left, marked 1, to strongly agree at the right, marked 10)
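As a purely illustrative sketch of the arithmetic involved (the helper function below is hypothetical and not part of the survey’s actual processing), a measurement in mm along the 5 cm line maps onto the 0–10 scale as follows:

# Hypothetical helper: convert a mark measured in mm along a 50 mm response line
# into a score out of 10 (0 = extreme left / strongly disagree, 10 = extreme right).
def mark_to_score(distance_mm: float, line_length_mm: float = 50.0) -> float:
    if not 0 <= distance_mm <= line_length_mm:
        raise ValueError("the mark must lie on the response line")
    return round(distance_mm / line_length_mm * 10, 1)

print(mark_to_score(27.0))  # a mark 27 mm from the left-hand end -> 5.4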
In Section II of the survey, there were four open-ended questions. Respondents
were first asked to comment on the perceived advantages and disadvantages of
interact in comparison with converse (Questions 1 and 2). Question 3 asked those
who were using interact at the time of the survey to describe briefly their experi-
ences with introducing the new assessment at levels 1 and/or 2. Question 3 alterna-
tively asked those who were not using interact at the time of the survey to explain
briefly why they had decided not to use it. Question 4 solicited advice on how
interact might be improved.

4.6.2 Piloting the Teacher Survey

The survey was piloted before being distributed nationally. Ten teachers were inde-
pendently invited to complete the survey and then to comment on their understand-
ing of the statements in both sections, and how long it took them to respond to both
sections of the survey. Cronbach’s alpha was subsequently used to measure internal
consistency across the statements and the sub-constructs and thereby to determine
the reliability of the statements as measures of the construct and sub-constructs.
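For readers unfamiliar with the statistic, the following sketch shows how an internal-consistency coefficient of this kind can be computed; the ratings matrix is invented, and the function is a generic implementation rather than the study’s own analysis script:

# Illustrative sketch: Cronbach's alpha over a respondents-by-items ratings matrix.
# Rows are respondents, columns are survey statements; the values are hypothetical.
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    k = ratings.shape[1]                               # number of items (statements)
    item_variances = ratings.var(axis=0, ddof=1)       # variance of each item across respondents
    total_variance = ratings.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

ratings = np.array([[7, 6, 8, 7],
                    [4, 5, 4, 3],
                    [9, 8, 9, 9],
                    [5, 6, 5, 6],
                    [3, 4, 2, 3]], dtype=float)
print(f"alpha = {cronbach_alpha(ratings):.2f}")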
As an overall measure of perceived usefulness, responses to the ten statements in
Section I revealed acceptably high levels of internal consistency, whether relating to
converse (α = .86) or to interact (α = .73). Overall, therefore, the survey could be
regarded as a reliable measure of teachers’ perceptions of the usefulness or fitness
for purpose of converse/interact as assessments.
With regard to the four sub-scales, high levels of internal consistency were found
for perceived validity and reliability (α = .78 and .90) and perceived authenticity and
interactiveness (α = .86 and .89).
Lower internal consistency was found for perceived impact (α = .47 and .40).
However, there were only two items in the scale (which has a tendency to lower α
values), and the average correlation between the responses on the two statements
(the extent to which the students enjoyed the opportunities to speak versus the extent
to which they found the experiences stressful – a reversed polarity statement) was
r = .45. Closer inspection of the pilot surveys suggested that teachers varied in the
extent to which the two statements were perceived to correlate. That is, from one
perspective, the fact that students might enjoy the opportunities to speak did not
necessarily make the experience, as an assessment, any less stressful. From another
perspective, enjoyment of the opportunities correlated more closely with feeling
less stressed. Although Statement 5 had reversed the polarity of response, there was no evidence to suggest that this was impacting adversely on responses. The lower Cronbach’s α scores were considered acceptable, and no amendments to statements were made.

2. The scale as presented in the surveys (Fig. 4.2) suggests a measure from 1 to 10. This was done to indicate that strongly disagree was considered a viable response, with the mid-point (neutral) set at 5. In terms of measuring the response with a ruler, however, measurement began at 0mm and the extreme left of the scale was regarded as 0.
Considerably lower internal consistency was found for perceived practicality
(α = −.42 and .23). The two statements were clearly not measuring the sub-construct
in an internally consistent way. (Statement 10 also reversed the polarity of the
response.) Closer inspection of the pilot surveys, alongside comments recorded
from the teachers, suggested that Statement 10 (the extent to which the administra-
tion of interact detracted too much from available classroom teaching time) was not
being consistently understood. A modification to the wording of this statement was
made for the final survey.
Additional feedback from the piloting indicated that the survey could be com-
pleted relatively quickly, an important consideration for surveys that would eventu-
ally find their way into the hands of busy teachers.

4.6.3 Administering the Main Survey

For the main study, surveys were sent by mail to teachers across the country. School
names and addresses were acquired from a publicly available database published by
New Zealand’s Ministry of Education and schools were cross-checked against pub-
licly accessible Ministry data on the languages taught. It was only possible to deter-
mine from the databases which languages were taught in which school, not how
many teachers of a given language taught in each school. It was therefore acknowl-
edged that not all teachers teaching a language would receive the survey, and that
some teachers who would receive the survey might either not be teaching the lan-
guage in Years 11 and 12 at the time of the survey, or, if they were, may have been
preparing students for an alternative examination.3 Only one survey per language
per school would be distributed, with correspondence addressed to the teacher in
charge of a particular language. To ensure anonymity, respondents were asked not
to provide their names. Respondents were, however, asked to indicate the principal
language which they taught, and whether or not they had used interact at NCEA
levels 1 and/or 2.

3. See East (2012) for a brief discussion of the different kinds of assessment that schools in the New Zealand context can opt into. Alternatives include Cambridge International Examinations and the International Baccalaureate.

At the start of September 2012 an initial invitation letter was sent to all teachers whose schools had been extracted from the database. The letter outlined the project and pre-empted the arrival of the survey. Surveys were sent out in the following week, with a response deadline of mid-October. To facilitate the response rate a postage paid envelope was included. In mid-November a follow-up reminder letter was sent to all schools, with an extended response deadline 2 weeks later (i.e., the end of November). In total, 579 surveys were distributed. As a consequence of the first mail-out and the reminder, 152 surveys were returned.
Once surveys had been returned, the closed-ended section (Section I) was anal-
ysed descriptively and then inferentially using one-way analyses of variance to
determine patterns in the data and areas of statistically significant difference. Open-
ended comments from Section II were coded using a thematic analysis approach
(Braun & Clarke, 2006) which identified themes in three broad categories: advan-
tages of interact; disadvantages of interact; suggestions for improvements to inter-
act. Several months after my initial coding, I invited an independent coder to code
a subset of the data. The coder was provided with a sample of ten coded responses
in each of the three categories, along with the range of themes that had emerged
from the initial coding, and was asked independently to code a further sub-set of 30
samples (representing 20 % of the total data set). Inter-coder reliability analyses
using Cohen’s kappa were performed. A minimum threshold level of κ = .61 was set,
being the minimum level to indicate substantial, and therefore acceptable, inter-
coder agreement (Landis & Koch, 1977).
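As an illustrative sketch only (the theme labels below are invented, not the study’s codings), an agreement check of this kind can be computed as follows:

# Illustrative sketch: inter-coder agreement (Cohen's kappa) between two coders'
# theme assignments for the same survey comments. Labels are hypothetical.
from sklearn.metrics import cohen_kappa_score

coder_a = ["authenticity", "workload", "authenticity", "peer_interaction", "stress", "workload"]
coder_b = ["authenticity", "workload", "peer_interaction", "peer_interaction", "stress", "workload"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"kappa = {kappa:.2f}")  # compared against the .61 threshold for 'substantial' agreement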
Thirteen themes emerged from the original codings. When the two independent
sets of codings of the subset of 20 % of the surveys were compared, for eleven
themes there was substantial (and therefore acceptable) inter-coder agreement
(Table 4.2).
Several emerging themes as recorded in Table 4.2 were subsequently conflated
for the purpose of focusing on the six sub-constructs of test usefulness. Additionally,
themes were considered in relation to perceived washback implications, particularly
with regard to interact.

Table 4.2 Taxonomy of emerging themes from the survey, Section II


Themes Cohen’s κ
Advantages
1. Promotes authentic and spontaneous spoken interactions (a focus on fluency) .638
2. Encourages peer-to-peer interactions .684
3. Makes the assessment less ‘test-like’, consequently less stressful .684
4. A genuine reflection of what learners know and can do .734
Disadvantages
5. Takes too long to administer and increases workload .689
6. Accessing/using the technology required can be challenging .792
7. Multiple peer-to-peer interactions have negative impact on students .63
Suggestions for improvement
8. Reduce the number of interactions required .83
9. Provide provision for scaffolding/rehearsal .634
10. Provide more examples of appropriate tasks .87
11. Provide more flexible assessment options .634

4.6.4 Teacher Interviews

Subsequent to the distribution of the survey, interviews were conducted during November and early December 2012 with teachers across the country who had
opted to introduce the new assessment, recruited using a snowballing sampling pro-
cedure. Initial recruitment of teachers was carried out through personal contact.
Those initially recruited were asked to pass an invitation to participate on to others
who, to their knowledge, were successfully using interact at levels 1 and/or 2.
Fourteen teachers across New Zealand consented to be interviewed. Among
these, three teachers had been involved in the trialling and/or external moderation of
interact and could therefore bring a broader perspective to the investigation than
those who had introduced interact in their local contexts. Interviews, which lasted
between 30 and 50 min, were semi-structured. They followed a prescribed schedule
which included some questions that paralleled the survey, and were designed to
cover similar ground with each participant and facilitate a flexible open-ended con-
versation (Mangubhai, Marland, Dashwood, & Son, 2004). Issues explored with
interviewees included:
• Interviewees’ understandings of the main purposes of interact, and opinions
about interact
• Comparisons and contrasts between interact and converse
• Advantages and disadvantages of interact in comparison with converse
• Experiences and challenges with the implementation of interact
• Types of assessment tasks used and perceived student reception of these
• Advice for others about how to implement interact successfully.
Interviews were digitally recorded and transcribed. To ensure the credibility of
the interview data, the data were subject to respondent validation (Bryman, 2004a;
Miles & Huberman, 1994). All interview participants were invited to review the
transcripts of their interviews, to comment on the transcripts and to make deletions
and amendments. After member checking had occurred, individual transcript data
were principally organised around discrete aspects of the interviews framed by
questions (such as those questions that focused on advantages or disadvantages of
interact). These units were then explored inductively and comparatively (Merriam,
2009).
Interviews were used to add a supportive and complementary data source to the
teacher survey findings, with a view to elaborating on, illuminating and substantiat-
ing the key themes emerging from the analyses of Section II of the teacher survey.
Three key domains of concern also arose from the survey and interview data, and
these became the principal focus of data analysis in Stage II: the importance of the
task; the concept of ‘spontaneous and unrehearsed’ (a focus on fluency); a de-
emphasis on grammar (the place of accuracy).

4.7 Stage II

In August 2013 I was invited to share aspects of the findings from Stage I of the
study at a regional seminar of the New Zealand Association of Language Teachers
(East, 2013). The one-hour presentation attracted approximately 100 teachers and
provided the opportunity to pass on findings of interest to teachers (a number of
whom would have participated in the study, either as interviewees or respondents to
the anonymous survey). The presentation therefore acted as an additional opportu-
nity for member checking and feedback on findings. It also created an occasion to
introduce Stage II of the project, and to invite participation in Stage II. This stage
was aimed at investigating interact at NCEA level 3 (the highest level of examina-
tion). In Stage II interviews with teachers were supported by surveys with Year 13
(NCEA level 3) students. Stage II took place towards the end of 2013.

4.7.1 Teacher Interviews

Interviews were conducted during November and early December 2013. Recruitment
was accomplished primarily through the invitation to participate after the presenta-
tion of findings (East, 2013). Ten teachers who had opted to introduce the new
assessment were recruited through this means. These teachers had not taken part in
the Stage I interviews. Additionally, the three teachers who had been interviewed in
Stage I the previous year, and who had been involved with the trialling and/or mod-
eration of interact and could therefore offer a broader perspective, were re-invited
to participate, and each consented to do so.
As with the Stage I interviews, interviews lasted between 30 and 50 min and
were semi-structured. Issues explored with interviewees paralleled those that had
been explored during the Stage I interviews. However, particular focus was given to
interact at level 3, and issues pertaining to its successful implementation. Once
more, interviews were digitally recorded and transcribed. On this occasion, member
checking was not employed. Instead, the interview data were drawn on for data tri-
angulation purposes, to illuminate the three key issues of concern that had emerged
from the analyses of Stage I surveys and interviews (see above), and, finally, in
terms of implications for the classroom (i.e., washback).4

4. A subsequent opportunity for informant feedback was possible when data were re-presented in a one-hour forum in 2014 which attracted approximately 180 attendees (East, 2014).

4.7.2 Student Surveys

Additional data to support Stage II included two student surveys: a survey for Year 13 students (final year of schooling) who had taken level 3 converse in its final year of operation in 2012 and were therefore among the last to take the converse assessment (n = 30); and a survey for Year 13 students who had taken level 3 interact
in its first year of operation in 2013 (n = 119).
Given that, unlike the teachers, the students were not in a position to provide
comparative data (seeing as they would only be familiar with one assessment type),
the main interest in collecting data from the students was to investigate perceptions
about interact. The 2012 survey therefore acted principally as a small-scale pilot for
the main student survey that would be used in Stage II in 2013, although it was con-
sidered that it would yield some information that could be analysed comparatively.
Both student surveys were designed to parallel the surveys that had been sent to
teachers in Stage I of the project, and contained both closed- and open-ended items.
The wording of the statements in the teacher survey was modified and simplified in
the student survey to make the survey as accessible to students as possible. To take
account of the two independent groups, the statements in the closed-ended section
of the student survey were differentiated and referred only to converse or interact.
As with the teacher survey, the overarching construct measured by the statements
was perceived usefulness as understood in terms of Bachman and Palmer’s (1996)
six qualities. However, the two final statements referring to the sub-construct of
practicality were removed from the student surveys. This was because these two
statements in the teacher survey referred to teachers’ perceptions of comparable
management challenges between the two assessments, and the issue of interest,
from the students’ perspective, was their perception of the assessment with regard
to its measurement of their spoken communicative proficiency. As with the teacher
survey, respondents were asked to indicate the strength of their response to each
statement by drawing a vertical line at the appropriate point, with strongly disagree
at one end and strongly agree at the other (see Fig. 4.2).
In Section II of the student survey, students were asked to describe their experi-
ences of taking converse or interact. Open-ended questions asked students to
describe briefly what they had had to do for the converse or interact assessment
(Question 1), and then what they thought about the converse or interact assessment
(Question 2). Question 3 paralleled Question 4 for teachers and solicited views
about how converse or interact might be improved.
The converse survey was distributed by mail in September 2012 to coincide with
the final few weeks of Year 13 students’ schooling before going on study leave in
preparation for their forthcoming external examinations. A reply paid envelope was
included to facilitate return. With one exception, surveys were sent to schools where
teachers had volunteered to take part in the Stage I interview and where teachers had
indicated that they had Year 13 students who were available to complete the survey.
Thirty surveys were returned, representing the full range of international languages.
As with the pilot of the teachers’ survey used in Stage I, Cronbach’s alpha was
subsequently used to measure internal consistency across the statements. (Statement
5 reversed the polarity of response.) As an overall measure of perceived usefulness,
responses to the eight statements in Section I revealed acceptably high levels of
internal consistency (α = .79). Overall, the survey could be regarded as a reliable
measure of students’ perceptions of the usefulness or fitness for purpose of the
assessment they had taken.
The interact survey was distributed by mail in September 2013, once more
designed to coincide with the final few weeks of Year 13 students’ schooling, when
all interact assessments would have been completed. Surveys were sent to 12
schools across the country whose teachers had consented for questionnaires to be
administered (in eight cases teachers were also interviewed). Surveys were returned
from 11 schools. In nine cases, only one class (language) was represented. One
school returned surveys from two different language classes, and another from
three. Of 119 surveys returned, Section I of one survey was unusable because the
respondent had not responded to any statement. Section I of this survey was there-
fore removed from the dataset. A range of school types and all languages apart from
Chinese were represented in the returns.
In common with the teacher surveys, the closed-ended responses from both sets
of student surveys were analysed descriptively and then inferentially using one-way
analyses of variance. The open-ended comments from Section II were drawn on for
illustrative purposes to exemplify student perceptions of the two assessments.
Subsequent analyses focused on comments related to the three key issues identified
from Stage I and illuminated through the Stage II interviews – the importance of the
task; the concept of ‘spontaneous and unrehearsed’; a de-emphasis on grammar –
alongside perspectives regarding washback. To enhance readability, both survey and
interview comments from both stages of the study were cleaned, for example, spell-
ing mistakes corrected; redundant words omitted.
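By way of a hedged illustration of this inferential step (the group scores below are invented; this is not the study’s own analysis), a one-way analysis of variance over two independent groups might look like this:

# Illustrative sketch: one-way ANOVA comparing perceived-usefulness scores (out of 10)
# from students who took converse with those from students who took interact.
# The values are hypothetical.
from scipy.stats import f_oneway

converse_scores = [5.2, 6.1, 4.8, 5.5, 6.0, 4.9, 5.7]
interact_scores = [6.8, 7.2, 6.5, 7.0, 6.1, 7.4, 6.9]

f_statistic, p_value = f_oneway(converse_scores, interact_scores)
print(f"F = {f_statistic:.2f}, p = {p_value:.3f}")  # p < .05 would indicate a statistically significant difference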

4.8 Conclusion

The intention of this book is to tell the story of assessment innovation – the move
from one form of assessment to a substantially different form of assessment, and its
reception by teachers and students as two key stakeholders. Two contrasting means
of assessing spoken communicative proficiency are under the spotlight.
Notwithstanding Bachman and Palmer’s (2010) argument that in any assessment
situation there will be a number of alternative approaches, each offering advantages
and disadvantages, the issue at stake is this: which of the two assessment formats
realised in converse and interact better reflects assessments of spoken communica-
tive proficiency that are valid, useful and fit for purpose? Attempting to address this
issue by taking account of stakeholder perspectives is the essence of this book.
Following the arguments proposed by Lazaraton (1995, 2002 – see Chap. 1),
the study reported in this book is largely qualitative, drawing on several indepen-
dent and complementary data sources (surveys and interviews) that solicited
teacher and student perceptions. However, the data also enabled a level of quanti-
fication in terms of frequency counts and tests of significance. Laying these data
alongside published documentary material (i.e., NZQA and Ministry of Education
documentation) enabled comparison and contrast between a range of different data
sets. In turn, this facilitated both data source and methodological triangulation
(Bryman, 2004b; Denzin, 1970). Each aspect contributed to a robust study into
stakeholder perspectives, the findings of which are presented and discussed in the
remaining chapters.

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, England: Oxford University Press.
Bachman, L. F. (2002). Some reflections on task-based language performance assessment.
Language Testing, 19(4), 453–476. http://dx.doi.org/10.1191/0265532202lt240oa
Bachman, L. F., & Palmer, A. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford, England: Oxford University Press.
Bachman, L. F., & Palmer, A. (2010). Language assessment in practice: Developing language
assessments and justifying their use in the real world. Oxford, England: Oxford University
Press.
Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in
Psychology, 3(2), 77–101. http://dx.doi.org/10.1191/1478088706qp063oa
Brown, H. D., & Abeywickrama, P. (2010). Language assessment: Principles and classroom prac-
tices (2nd ed.). New York, NY: Pearson.
Bryman, A. (2004a). Member validation and check. In M. Lewis-Beck, A. Bryman, & T. Liao
(Eds.), Encyclopedia of social science research methods (p. 634). Thousand Oaks, CA: Sage.
http://dx.doi.org/10.4135/9781412950589.n548
Bryman, A. (2004b). Triangulation. In M. B. Lewis-Beck, A. Bryman, & T. Liao (Eds.),
Encyclopedia of social science research methods (pp. 1143–1144). Thousand Oaks, CA: Sage.
http://dx.doi.org/10.4135/9781412950589.n1031
Canale, M. (1983). On some dimensions of language proficiency. In J. W. J. Oller (Ed.), Issues in
language testing research (pp. 333–342). Rowley, MA: Newbury House.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second lan-
guage teaching and testing. Applied Linguistics, 1(1), 1–47. http://dx.doi.org/10.1093/
applin/i.1.1
Council of Europe, (2001). Common European Framework of Reference for languages. Cambridge,
England: Cambridge University Press.
Denzin, N. K. (1970). The research act in sociology. Chicago, IL: Aldine.
East, M. (2008). Dictionary use in foreign language writing exams: Impact and implications.
Amsterdam, Netherlands/Philadelphia, PA: John Benjamins. http://dx.doi.org/10.1075/lllt.22
East, M. (2012). Task-based language teaching from the teachers’ perspective: Insights from New
Zealand. Amsterdam, Netherlands / Philadelphia, PA: John Benjamins. http://dx.doi.
org/10.1075/tblt.3
East, M. (2013, August 24). The new NCEA ‘interact’ standard: Teachers’ thinking about assess-
ment reform. Paper presented at the New Zealand Association of Language Teachers (NZALT)
Auckland/Northland Region language seminar, Auckland.
East, M. (2014, July, 6–9). To interact or not to interact? That is the question. Keynote address at
the New Zealand Association of Language Teachers (NZALT) Biennial National Conference,
Languages Give You Wings, Palmerston North, NZ.
East, M., & Scott, A. (2011a). Assessing the foreign language proficiency of high school students
in New Zealand: From the traditional to the innovative. Language Assessment Quarterly, 8(2),
179–189. http://dx.doi.org/10.1080/15434303.2010.538779
East, M., & Scott, A. (2011b). Working for positive washback: The standards-curriculum align-
ment project for Learning Languages. Assessment Matters, 3, 93–115.
Hinkel, E. (2010). Integrating the four skills: Current and historical perspectives. In R. Kaplan
(Ed.), The Oxford handbook of applied linguistics (2nd ed., pp. 110–123). Oxford, England:
Oxford University Press. http://dx.doi.org/10.1093/oxfordhb/9780195384253.013.0008
Hu, G. (2013). Assessing English as an international language. In L. Alsagoff, S. L. McKay, G. Hu, & W. A. Renandya (Eds.), Principles and practices for teaching English as an international
language (pp. 123–143). New York, NY: Routledge.
Koefoed, G. (2012). Policy perspectives from New Zealand. In M. Byram & L. Parmenter (Eds.),
The Common European Framework of Reference: The Globalisation of Language Education
Policy (pp. 233–247). Clevedon, England: Multilingual Matters.
Kramsch, C. (1986). From language proficiency to interactional competence. The Modern
Language Journal, 70(4), 366–372. http://dx.doi.org/10.1111/j.1540-4781.1986.tb05291.x
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data.
Biometrics, 33(1), 159–174. http://dx.doi.org/10.2307/2529310
Lazaraton, A. (1995). Qualitative research in applied linguistics: A progress report. TESOL
Quarterly, 29(3), 455–472. http://dx.doi.org/10.2307/3588071
Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge,
England: Cambridge University Press.
Leaper, D. A., & Riazi, M. (2014). The influence of prompt on group oral tests. Language Testing,
31(2), 177–204. http://dx.doi.org/10.1177/0265532213498237
Lewkowicz, J. (2000). Authenticity in language testing: Some outstanding questions. Language
Testing, 17(1), 43–64. http://dx.doi.org/10.1177/026553220001700102
Luoma, S. (2004). Assessing speaking. Cambridge, England: Cambridge University Press. http://
dx.doi.org/10.1017/cbo9780511733017
Mangubhai, F., Marland, P., Dashwood, A., & Son, J. B. (2004). Teaching a foreign language: One
teacher’s practical theory. Teaching and Teacher Education, 20, 291–311. http://dx.doi.
org/10.1016/j.tate.2004.02.001
McNamara, T. (1997). ‘Interaction’ in second language performance assessment: Whose perfor-
mance? Applied Linguistics, 18(4), 446–466. http://dx.doi.org/10.1093/applin/18.4.446
Merriam, S. B. (2009). Qualitative research: A guide to design and implementation. San Fransisco,
CA: Jossey-Bass.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook
(2nd ed.). Thousand Oaks, CA.: Sage.
Ministry of Education. (2014b). Resources for internally assessed achievement standards. Retrieved
from http://ncea.tki.org.nz/Resources-for-Internally-Assessed-Achievement-Standards
NZQA. (2014a). External moderation. Retrieved from http://www.nzqa.govt.nz/providers-
partners/assessment-and-moderation/managing-national-assessment-in-schools/secondary-
moderation/external-moderation/
NZQA. (2014b). Internal moderation. Retrieved from http://www.nzqa.govt.nz/providers-
partners/assessment-and-moderation/managing-national-assessment-in-schools/secondary-
moderation/external-moderation/internal-moderation/
NZQA. (2014c). Languages – Clarifications. Retrieved from http://www.nzqa.govt.nz/
qualifications-standards/qualifications/ncea/subjects/languages/clarifications/
NZQA. (2014d). Languages – Moderator’s newsletter. Retrieved from http://www.nzqa.govt.nz/
qualifications-standards/qualifications/ncea/subjects/languages/moderator-newsletters/
October-2014/
NZQA. (2014e). NCEA subject resources. Retrieved from http://www.nzqa.govt.nz/qualifications-
standards/qualifications/ncea/subjects/
Pardo-Ballester, C. (2010). The validity argument of a web-based Spanish listening exam: Test
usefulness evaluation. Language Assessment Quarterly, 7(2), 137–159. http://dx.doi.
org/10.1080/15434301003664188
Poehner, M. (2008). Dynamic assessment: A Vygotskian approach to understanding and promot-
ing L2 development. New York, NY: Springer.
Scott, A., & East, M. (2009). The standards review for learning languages: How come and where
to? The New Zealand Language Teacher, 39, 28–33.
Scott, A., & East, M. (2012). Academic perspectives from New Zealand. In M. Byram &
L. Parmenter (Eds.), The Common European framework of reference: The globalisation of
language education policy (pp. 248–257). Clevedon, England: Multilingual Matters.
100 4 Investigating Stakeholder Perspectives on Interact

Shohamy, E. (2001). The social responsibility of the language testers. In R. L. Cooper (Ed.), New
perspectives and issues in educational language policy (pp. 113–130). Amsterdam, Netherlands/
Philadelphia, PA: John Benjamins Publishing Company. http://dx.doi.org/10.1075/z.104.09sho
Shohamy, E. (2007). Tests as power tools: Looking back, looking forward. In J. Fox, M. Wesche,
D. Bayliss, L. Cheng, C. E. Turner, & C. Doe (Eds.), Language testing reconsidered (pp. 141–
152). Ottawa, Canada: University of Ottawa Press.
Spolsky, B. (1985). The limits of authenticity in language testing. Language Testing, 2(1), 31–40.
http://dx.doi.org/10.1177/026553228500200104
Chapter 5
The Advantages of Interact

5.1 Introduction

In Chap. 4 I argued that, when evaluated in theory against the construct of test use-
fulness (Bachman & Palmer, 1996) and its six sub-constructs – construct validity,
reliability, interactiveness, impact, practicality and authenticity – interact measures up well as a useful assessment of spoken communicative proficiency. I also indicated potential challenges to these claims to usefulness. In addition, particularly given
the central role of teachers in enacting internal assessments in the New Zealand
context, and the somewhat reactionary early feedback that had been received from
some quarters regarding interact, I suggested that evaluating the claims to useful-
ness of interact from a purely theoretical basis was insufficient. I argued that it is
important to find out from teachers as principal stakeholders what they think about
the usefulness of interact now that they have had the opportunity to try it out. Their
perspectives contribute to making more robust (empirically-based) decisions about
relative usefulness, and therefore to validity arguments (Winke, 2011).
Stage I of the two-stage study reported in this book took place towards the end of
2012, two years after the introduction of interact in schools. It included an anony-
mous paper-based nationwide teacher survey (n = 152) targeted at the principal
international languages taught in New Zealand (Chinese, French, German, Japanese
and Spanish), and interviews with teachers who had been using interact since its
introduction (n = 14).
This chapter and Chap. 6 report on Stage I. Findings are reported from the
nationwide survey and compared to those gleaned from the interviews. In this chap-
ter I begin by presenting the results and analyses from Section I of the survey,1
which was designed to tap into the different facets of Bachman and Palmer’s (1996)

1 This presentation is an expansion of data reported in an article in Language Testing (East, 2015), first published 14th August, 2014, and available on-line: doi:10.1177/0265532214544393

test usefulness construct. I go on to present findings from Section II of the survey2
that pertain to teachers’ perceived advantages to interact, and compare these to find-
ings elicited from the teacher interviews.

5.2 The Nationwide Teacher Survey – Section I


5.2.1 Overview

As I noted in Chap. 4, in total 579 surveys were distributed, and 152 responses
received. This was considered a very positive response rate for a mail survey
(Resnick, 2012) of just over one in four targeted FL teachers in New Zealand.
Respondents were asked to identify the main language which they taught. Response
rates across the five targeted languages were subsequently compared to the numbers
of senior secondary students (NCEA levels 1 to 3) taking each FL in 2012 (Education
Counts, 2012). The response numbers correlated virtually perfectly (r = .996,
p < .001), suggesting that the larger populations of teachers of these languages at
senior secondary (NCEA) level were adequately represented in the sample (Fig. 5.1).
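For readers who wish to see how such a comparison can be computed, the sketch below runs a Pearson correlation between per-language respondent counts and per-language student numbers. The figures used are illustrative placeholders (apart from the totals, only the six Chinese and eleven German respondents are reported later in this chapter), so the printed coefficient will not reproduce r = .996 exactly.

```python
# Illustrative sketch only: Pearson correlation between survey returns and
# NCEA student numbers per language. All counts below are placeholders,
# not the study's data (except that 6 Chinese and 11 German respondents
# are mentioned later in the chapter); respondent counts sum to 152.
from scipy.stats import pearsonr

languages     = ["Chinese", "French", "German", "Japanese", "Spanish"]
respondents   = [6, 48, 11, 52, 35]              # hypothetical apart from 6 and 11
ncea_students = [850, 7400, 1600, 7900, 5200]    # hypothetical student numbers

r, p = pearsonr(respondents, ncea_students)
print(f"r = {r:.3f}, p = {p:.3f}")
```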
Respondents were also asked to indicate whether or not they were using interact
at the time of the survey, and, if so, at which level (NCEA level 1 only; NCEA level
2 only; NCEA levels 1 and 2). By far the majority of respondents (123/152 or 81 %)
had decided to use interact at either or both levels. The breakdown of responses is
summarised in Fig. 5.2. Some respondents gave reasons for the usage patterns.

Fig. 5.1 Numbers of survey respondents (left) compared to numbers of NCEA (senior secondary)
students (2012) (right)

2 Findings presented here and elsewhere incorporate some data reported in an article in Assessment Matters (East, 2014).
Fig. 5.2 Numbers of survey respondents using/not using interact (Note: the ‘not used’ category includes one respondent who did not specify)

It seemed that the decision whether or not to use interact was often influenced by a
variety of factors. For example, some teachers were not using interact at a particular
level because they had no students at that level at the time of the survey. Decisions
were not necessarily indicative of their views about interact.

5.2.2 Perceived Relative Usefulness of Converse and Interact

Section I of the survey sought teachers’ perceptions of the usefulness or fitness for
purpose of both interact and converse. Ten paired statements (one referring to con-
verse and the other to interact) measured four sub-constructs:
1. Perceived validity and reliability (Statements 1, 2 and 3)
2. Perceived authenticity and interactiveness (Statements 4, 6 and 7)
3. Perceived impact (Statements 5 and 8)
4. Perceived practicality (Statements 9 and 10).
The paired statements therefore aimed to elicit a comparison between the two
assessments. Strength of response was indicated by drawing a vertical line at the
appropriate point (see Chap. 4, Fig. 4.2). The distance between the responses was of
particular interest because it revealed relative levels of difference in perception
between the two assessments on each measure.
An initial observation of the data revealed several statements where no responses
were recorded. In total, missing responses accounted for 91 occasions out of a total
of 3040 (2 x 10 statements across 152 surveys). This represented 3 % of the total
data set. Missing data are a perennial and ubiquitous problem for social science
research, and a variety of means for dealing with them have been proposed (see,
e.g., Graham, 2012). Graham notes that “we cannot know all the causes of missing-
ness,” but we can at least make “plausible guesses about the effects that these
unknown variables have on our statistical estimation” (p. 9). Missing responses
were scrutinised to determine if there were any patterns in the missingness.
The missing responses appeared to indicate either complete randomness
(i.e., respondents failing, for unidentifiable reasons, to complete a particular
response) or, in some instances, respondents perceiving that they were unable to
respond (e.g., respondents who did not complete statements referring to the students’
perceptions because they may have felt unable to comment on these). In five cases, all
statements pertaining to one of the assessments were responded to, but all statements
about the other were ignored. This scenario was presumably because the respondents
had no experience of using the assessment about which they chose not to respond.
There was no evidence to suggest any missing responses that would have biased
the data (e.g., teachers with strong feelings about a particular issue who, on that
basis, chose not to respond to a statement about that issue). It was estimated, there-
fore, that the missing responses would not impact adversely on conclusions drawn
from statistical analyses, and the data remained intact without any modification
(such as listwise or pairwise deletion, or imputation of missing values).
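The sort of missing-data audit described above can be sketched as follows; the data frame is synthetic (152 surveys, ten statements per assessment, 91 planted gaps) and the column names are my own, not those used in the study.

```python
# Sketch with synthetic data: audit missing responses before deciding to leave
# the data unmodified (no listwise/pairwise deletion, no imputation).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = [f"{a}_{i}" for a in ("converse", "interact") for i in range(1, 11)]
values = rng.integers(0, 101, size=(152, 20)).astype(float)
values.ravel()[rng.choice(values.size, 91, replace=False)] = np.nan  # plant 91 gaps
df = pd.DataFrame(values, columns=cols)

n_missing = int(df.isna().sum().sum())
print(f"missing responses: {n_missing} of {df.size} ({100 * n_missing / df.size:.0f} %)")

# respondents who answered about interact but skipped every converse statement
one_sided = df.filter(like="converse_").isna().all(axis=1) & df.filter(like="interact_").notna().any(axis=1)
print("skipped all statements about one assessment:", int(one_sided.sum()))
```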
Table 5.1 provides the overall means, on a scale from 0 (strongly disagree) to 10
(strongly agree), for both assessments and each individual measurement statement

Table 5.1 Overall means and differences in means (teachers): converse and interact
Measure   Converse   Interact   Difference in mean
Perceived validity and reliability
1. enables students to demonstrate clearly what they M 4.80 6.20 +1.4
know and can do when speaking the target language SD 2.44 2.14
2. provides an accurate measure of students’ spoken M 3.97 6.22 +2.25
communicative proficiency SD 2.38 2.18
3. provides good opportunities to measure students’ M 4.24 6.30 +2.06
fluency in the target language SD 2.50 2.26
Perceived authenticity and interactiveness
4. provides a meaningful measure of target language M 2.99 5.90 +2.91
use in the real world beyond the classroom SD 2.34 2.53
5. promotes the opportunity for students to engage M 3.65 6.33 +2.68
in genuine social interactions in the target language SD 2.43 2.27
6. promotes the opportunity for students to use M 2.93 5.67 +2.74
authentic and unrehearsed language SD 2.32 2.42
Perceived impact on the students
7. completing this assessment makes students feel M 6.76 6.11 −0.65
anxious and stressed SD 2.42 2.40
8. students generally feel that the assessment M 4.56 5.78 +1.22
is a good test of their spoken ability SD 2.27 2.12
Perceived practicality
9. easy to manage and administer M 7.33 2.39 −4.94
SD 1.98 2.16
10. takes up a lot of class time at the expense M 3.20 7.26 +4.06
of the available teaching time SD 2.22 2.43

Table 5.2 Differences in standardised means between converse and interact


Measure   Difference^a   n^b   t^c   d^d   r^e   p^f
1 14.15 144 4.79 0.801 0.372 0.000
2 22.24 146 7.53 1.251 0.530 0.000
3 20.88 147 6.88 1.139 0.495 0.000
4 29.08 146 8.99 1.493 0.598 0.000
5 26.99 146 8.9 1.478 0.594 0.000
6 27.81 146 9.26 1.538 0.610 0.000
7 5.96 139^g 1.92 0.327 0.161 0.056
8 12.17 136^g 4.17 0.718 0.338 0.000
9 −49.32 145 −17.19 2.865 0.820 0.000
10 −40.79 145 −12.81 2.135 0.730 0.000
Adapted from East (2015, p. 109)
Notes
^a Difference between converse and interact measures on standardised (out of 100) data and to two decimal places (polarity reversed for Measures 7 [Statement 5] and 10 [Statement 10])
^b n = no. of responses
^c t = test for no difference between converse and interact
^d d = Cohen’s d (no. of SDs in the difference)
^e r = effect size (all very strong except Measure 7)
^f p = probability of getting these numbers if there were no difference
^g The lower response rate may have been attributable to teachers not feeling in a position to comment on impact on students. The result is clear, however

in the survey (measures are presented here and elsewhere in sub-construct order, not
in the order as presented in the survey).
For subsequent analyses, means were standardised to 100. As stated above, the
primary phenomenon of interest was the differences between the means. Table 5.2
presents the differences in standardised means between converse and interact and
indicates where these differences are significant (α = .05).
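A minimal sketch of the kind of paired comparison summarised in Table 5.2, using synthetic ratings already on a 0-100 scale: a paired t-test on complete pairs, with t converted to Cohen's d and effect-size r via d = 2t/√df and r = √(t²/(t² + df)). These conversions reproduce the d and r values in the table from its t and n, though the study's own computational routines are not reported in the text.

```python
# Sketch (synthetic 0-100 ratings): paired comparison for one measure.
import numpy as np
from scipy import stats

def compare(converse, interact):
    converse, interact = np.asarray(converse, float), np.asarray(interact, float)
    keep = ~np.isnan(converse) & ~np.isnan(interact)     # use complete pairs only
    t, p = stats.ttest_rel(interact[keep], converse[keep])
    dof = int(keep.sum()) - 1
    d = 2 * t / np.sqrt(dof)                             # Cohen's d recovered from t
    r = np.sqrt(t**2 / (t**2 + dof))                     # effect-size r recovered from t
    return {"n": int(keep.sum()),
            "mean diff": round(float(np.mean(interact[keep] - converse[keep])), 2),
            "t": round(float(t), 2), "d": round(float(d), 3),
            "r": round(float(r), 3), "p": float(p)}

rng = np.random.default_rng(1)
converse_scores = rng.normal(40, 24, 152).clip(0, 100)
interact_scores = (converse_scores + rng.normal(20, 30, 152)).clip(0, 100)
print(compare(converse_scores, interact_scores))
```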
The descriptive statistics revealed different strengths of response across the mea-
sures. Taking the midpoint of the scale (a neutral or neither agree/disagree response)
as 50, the mean response for converse was below this for Measures 1 to 8 and above
this for Measures 9 and 10 (with polarity for Measures 7 and 10 reversed). On aver-
age, it appeared that respondents perceived converse to underperform on Measures
1 to 8, but to perform well on Measures 9 and 10. By contrast, interact was per-
ceived, on average, to perform well on all measures except 7, 9 and 10, on which it
under-performed.
Considered as a whole (that is, without taking into account language taught and
whether or not the teacher was using interact at the time of the survey), the descrip-
tive statistics indicated several differences in perception between the two assess-
ments. These differences were statistically significant. Respondents:
• considered interact to be a more valid and reliable assessment than converse;
• somewhat more strongly considered interact to be more authentic and interactive
than converse;
• considered interact to be considerably less practical to administer than converse.

It was also apparent that respondents did not see a great deal of difference
between the two assessments in terms of impact on the students. That is, the sub-
construct of impact was not as clearly defined because respondents perceived no
difference between the assessments on Measure 7 (student stress). It appeared that,
even though teachers perceived that their students would regard interact as a (sig-
nificantly) better assessment of their spoken communicative proficiency than con-
verse, both assessment types were regarded as equally stressful for students. In the
teachers’ perception, it seemed that whether students felt anxious and stressed did
not depend on the design of the assessment but on other (unmeasured) variables.
A principal components analysis indicated that Measures 7 and 8 were in fact not
well represented by the one sub-construct. Indeed, Measure 7 did not correlate well
with any of the other measures, even though the remaining Measures 1 to 8 (exclud-
ing 7) showed high positive correlations, and Measures 9 and 10 correlated well
with each other (although they had more moderate positive correlations with the
other measures). In essence, where the respondents perceived an assessment to be
valid and reliable, they also saw it as authentic and interactive, and vice versa.
Practicality was an entirely different issue for teachers, with clear polarisations
between the two assessment types. The measurement of impact as one sub-construct
was somewhat clouded by two statements that measured different (and arguably
unrelated) aspects of impact.
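A hedged illustration of the kind of check reported here: with a synthetic respondents x 10 matrix in which nine measures share a common factor and the Measure 7 column does not, the correlation of Measure 7 with the remaining measures and the first principal component can be inspected as below. The study's own data and software are not described in the text.

```python
# Sketch (synthetic scores): does Measure 7 pattern with the other measures?
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
shared = rng.normal(size=(152, 1))                   # a common 'usefulness' factor
scores = shared + 0.6 * rng.normal(size=(152, 10))   # Measures 1-10 load on it...
scores[:, 6] = rng.normal(size=152)                  # ...except Measure 7 (index 6)

corr = np.corrcoef(scores, rowvar=False)
print("Measure 7 vs the rest:", np.round(np.delete(corr[6], 6), 2))

pca = PCA(n_components=2).fit(scores)
print("loadings on first component:", np.round(pca.components_[0], 2))
print("variance explained:", np.round(pca.explained_variance_ratio_, 2))
```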
Mean scores provide a convenient way of summarising the average picture, and
analysis of the differences between the means provides useful information on the
measures where differences in perception between the two assessments exist, par-
ticularly when (as is the case with all but Measure 7) those differences are signifi-
cant. Three other analyses provide additional dimensions of understanding what the
quantitative data tell us about teachers’ perceptions. These are:
1. the strengths of differences of response across each measure;
2. whether principal language taught influences teachers’ perceptions of difference;
3. whether using or not using interact influences teachers’ perceptions of difference.

5.2.3 Variations in Teacher Responses

The mean score differences tell us nothing about individual response patterns. For
example, on measures of authenticity and interactiveness, one teacher may have
perceived only a small difference between the two assessments, whereas another
may have seen the two assessments as vastly different in these respects. Looking at
the strengths of differences of response between each statement on a more individu-
alised basis than the mean differences provides a window into how strongly differ-
ent teachers perceived the differences between the two assessment types, and
therefore how many teachers held views about the two assessment types that were
relatively comparable, or, by contrast, vastly polarised.
Figure 5.3 presents a percentage histogram for the differences in scores for
each of the ten pairs of statements which reveal the percentages of respondents who
differed in their responses by various amounts.

Fig. 5.3 Percentage histogram of difference scores (converse – interact) by measure [ten panels, one per measure (Measures 1–10); x-axis: difference score from −90 to 90; y-axis: percentage of respondents, 0–30 %]

These scores were calculated by
taking the response of each individual to interact and subtracting their response to
converse. Where the difference score falls around the centre line of zero this repre-
sents a response of perceiving no or negligible difference between the two assess-
ment types. Those responses that fall to the left of centre indicate a perception that
converse outperforms interact. Those to the right of centre indicate that interact is
perceived to be better than converse on that measure. The further out from the centre
the difference score lies, the greater the perceived difference between the two.
Accordingly, a difference score of ±90 represents a very extreme polarisation of
views.
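In principle, the difference scores plotted in Fig. 5.3 can be tabulated as below; the ratings are synthetic stand-ins and the 20-point bands are my own choice rather than the binning used in the published figure.

```python
# Sketch (synthetic ratings): per-respondent difference scores for one measure,
# tabulated in bands as in Fig. 5.3.
import numpy as np

rng = np.random.default_rng(3)
converse = rng.normal(40, 24, 152).clip(0, 100)
interact = (converse + rng.normal(20, 30, 152)).clip(0, 100)
diff = interact - converse                 # > 0 favours interact, < 0 favours converse

counts, edges = np.histogram(diff, bins=np.arange(-100, 101, 20))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:>6.0f} to {hi:>4.0f}: {100 * c / diff.size:5.1f} %")
```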
The peak around zero for Measures 1 to 8 indicates that around one quarter to
one third of respondents recorded similar levels of response for both assessments. A
sizeable minority effectively did not perceive the assessments as different in terms
of usefulness.
However, the majority of the rest of the responses for Measures 1 to 8 fall to the
right of the midline. This indicates that these respondents considered that interact
outperformed converse on those measures. (Measure 7 was more evenly distributed
around the midline, but still showed marginal support for interact.) There were, on
average, twice as many respondents to the right as there were to the left. Generally,
there were far fewer respondents who rated converse as outperforming interact. By
far the majority of teachers considered interact to be an improvement over converse
on all measures of usefulness apart from practicality.
Fig. 5.4 Difference scores averaged across constructs [four panels: Validity, Authenticity, Impact, Practicality; x-axis: difference score from −90 to 90; y-axis: percentage of respondents, 0–25 %]

In terms of practicality, Measures 9 and 10 reveal a substantially different picture. Respondents generally returned a larger difference between the assessments in
favour of converse. Most respondents perceived that converse outperformed interact
on these measures. The magnitude of the perceived difference was much larger than
those recorded for the other measures. That is, there were fewer responses in the
middle and more to the outer edge with very few respondents (5–6 %) considering
that interact surpassed converse on the two measures of practicality. When it came
to practicality, interact was failing miserably in the perception of the vast majority.
The substantially different opinion on practicality in comparison with the other
sub-constructs becomes more apparent when the difference scores across each of
the sub-constructs are averaged for each respondent. Figure 5.4 depicts this and
confirms the above conclusions. (The figure also reveals that the variables were not
normally distributed and this should be kept in mind when considering the inferen-
tial analyses below. However, most of the results reported are sufficiently statisti-
cally significant for this not to be of concern.)

5.2.4 Differences in Perception According to Principal Language Taught

Figure 5.5 shows the mean difference score for each sub-construct displayed as a
horizontal line together with the mean difference score according to principal lan-
guage taught.

Fig. 5.5 Sub-construct differences in mean (converse v. interact) by language taught (Reproduced
from East, 2015, p. 110). Note: panels do not have the same y-scale

Figure 5.5 suggests that, at first glance, there may be some pattern of responses
according to the principal language taught. That is, on average, teachers of Chinese
apparently perceived less difference between the two assessments than teachers of
other languages, and as a group they appeared to perceive only a small improvement
in interact in terms of validity and reliability, and authenticity and interactiveness.
By contrast, teachers of German on average perceived the improvement of interact
in these two measures, and the deterioration in terms of practicality, to be greater
than teachers of other languages. However, these results need to be interpreted with
some caution because of the small numbers of teachers of Chinese (6) and German
(11) within the sample. The variation between the teachers of Chinese in particular
was very large. For example, a negligible average difference for authenticity and
interactiveness belies a range of −79 to 49. Examination of the raw data revealed
that in each case the mean for Chinese teachers was pulled down by one respondent
who clearly perceived converse considerably more positively in all respects and
scored converse highly (above 80) and interact very low (below 10) on all measures.
Analyses of variance revealed that, when the variable of interest was principal lan-
guage taught, the differences depicted in Fig. 5.5 were not statistically significant.
Language taught made no significant difference to teachers’ perceptions of the relative useful-
ness of interact compared to converse.
[Four panels (Validity, Authenticity, Impact, Practicality) showing mean difference scores by usage group: None, Level-1, Level-2, Both]

Fig. 5.6 Sub-construct differences in mean (converse v. interact) by whether or not using interact
(Reproduced from East, 2015, p. 110). Note: panels do not have the same y-scale

5.2.5 Differences in Perception According to Whether or Not Using Interact

Figure 5.6 shows the mean difference score for each construct displayed as a hori-
zontal line together with the mean difference score according to whether or not the
respondent was using interact at the time of the survey, whether at level 1 or 2 only,
or at both levels.
When compared with those who stated that they were not using the new assess-
ment, respondents who were using interact considered it more useful than converse
in terms of validity, reliability, authenticity, interactiveness and impact. They also
rated it more highly (or, more precisely, not as severely) with regard to practicality.
In other words, in comparative terms, respondents who reported that they were
using interact perceived its benefits over converse more favourably and judged its
cost in terms of practicality less harshly. (This more favourable view may be either a consequence of using interact or part of the reason why these teachers had chosen to use it.)
Analyses of variance were conducted to determine whether differences in per-
ception between users and non-users of interact were statistically significant
(Table 5.3). It was found that whether or not respondents were using interact at the

Table 5.3 Analyses of variance of difference scores for each sub-construct by use of interact
Sub-construct Source DF SS MS F p
Validity Using 1 24505 24505 30.28 0.000
Error 144 116551 809
Total 145 141057
Authenticity Using 1 24727 24727 28.28 0.000
Error 144 125901 874
Total 145 150628
Impact Using 1 10203 10203 15.51 0.000
Error 137 90155 658
Total 138 100358
Practicality^a Using 1 5493 5493 5.58 0.020
Error 143 140789 985
Total 144 146283
Reproduced from East (2015, p. 112)
Note
^a The analysis for the sub-construct practicality was repeated using data transformed for normality (p = 0.78) and a comparable result was obtained

time of the survey made a significant difference to teachers’ perceptions of the rela-
tive usefulness of interact compared to converse.3
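The analysis reported in Table 5.3 can be sketched as a one-way ANOVA of sub-construct difference scores by use of interact, evaluated against the Bonferroni-adjusted alpha mentioned in the accompanying footnote. The group sizes follow the survey's reported split (123 users versus the remaining 29 returns), but the scores themselves are synthetic.

```python
# Sketch (synthetic difference scores): one-way ANOVA by use of interact.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
using     = rng.normal(25, 29, 123)   # difference scores, respondents using interact
not_using = rng.normal(0, 29, 29)     # difference scores, respondents not using it

f, p = stats.f_oneway(using, not_using)
alpha = 0.05 / 4                      # Bonferroni correction across the four sub-constructs
print(f"F = {f:.2f}, p = {p:.4f}; significant at alpha = {alpha}: {p < alpha}")
```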
The evidence from Section I of the survey might lead us to the arguably defen-
sible conclusion that, in the perception of the surveyed teachers, interact was, in
most respects, more useful and fit for purpose as a measure of spoken communica-
tive proficiency than the traditional summative teacher-led interview test that it had
replaced. There were important advantages to interact in the eyes of the majority of
respondents. However, there was also a noteworthy disadvantage (impracticality),
and a level of ambivalence with regard to student impact. These findings suggest
that there are areas where interact might be made more useful. The qualitative data
from Section II of the survey and the teacher interviews provided complementary
opportunities to probe more deeply into teacher perceptions. In the remainder of this
chapter I present the perceived advantages of interact, as reported in Section II of
the survey and explored in the Stage I interviews. (In Chap. 6 I report perceived
disadvantages and suggestions for improvement to interact.)

5.3 Advantages of Interact – Survey Data

I explained in Chap. 4 how specific dimensions of advantage, disadvantage and
improvements were reliably identified in the themes emerging from the data coding
(see Table 4.2). Perceived advantages of interact were subsequently grouped

3 When a Bonferroni correction was applied because of the use of four ANOVAs (resulting in an alpha level of .0125), the differences between the two groups were highly significant for all but the sub-construct of practicality.

Table 5.4 Frequencies of mentioning advantages of interact


Advantage | Frequency of comment (using / not using interact) | Total no. of respondents^a (using / not using) | Percentage of respondents (using / not using)
1. authenticity/interactiveness | 82 / 8 | 120 / 25 | 68 % / 32 %
2. positive impact | 39 / 1 | 120 / 25 | 33 % / 4 %
3. validity/reliability | 14 / 1 | 120 / 25 | 12 % / 4 %
Note
^a From 152 returns, 145 respondents made comments relating to advantages of interact

according to the relevant qualities of test usefulness. The frequencies with which each
quality was discernible, starting with the most frequent, are recorded in Table 5.4.
It was evident that, whether or not respondents were using interact at the time of
the survey, the most commonly commented on advantage of interact in comparison
with converse was its perceived authenticity and interactiveness. Furthermore, those
using interact differed significantly from those who were not using interact with
regard to mentioning this advantage, χ2 (1) = 11.601, p < .001. For the other compari-
sons χ2 tests were not performed because in each case one observed cell count was
equal to or less than five. It was apparent from the frequency counts, however, that,
proportionally, those using interact made reference to each of the positive attributes
of the assessment considerably more frequently than those who were not using
interact. It appeared that actually using the assessment was a factor in the frequency
with which respondents commented on positive characteristics of the new assess-
ment. This corroborates the claim from the closed-ended section of the survey that
actually using the assessment made a positive difference to perceptions.
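Because Table 5.4 gives the underlying counts, the reported statistic can be checked directly: the sketch below assumes a Pearson chi-squared test without continuity correction, which reproduces χ²(1) = 11.601 for the authenticity/interactiveness comparison.

```python
# 2 x 2 comparison from Table 5.4: mentioning authenticity/interactiveness
# among teachers using interact (82 of 120) vs not using it (8 of 25).
from scipy.stats import chi2_contingency

table = [[82, 120 - 82],    # using interact: mentioned / did not mention
         [8, 25 - 8]]       # not using interact: mentioned / did not mention
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2({dof}) = {chi2:.3f}, p = {p:.5f}")
print("smallest expected cell count:", round(float(expected.min()), 2))
```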
In what follows, I present excerpts from the open-ended survey comments that
illustrate the different dimensions of advantage. I also consider washback implica-
tions. In each case I record the principal language taught and the survey number as
received (e.g., French 007 refers to the seventh survey received, completed by a
teacher whose principal language taught was French).

5.3.1 Authenticity and Interactiveness

Authenticity and interactiveness were the dimensions of usefulness commented on
overall by two out of every three respondents. That is, interact represented a “push
towards more authentic, real users of the language” (Spanish 075). This made the
assessment “refreshingly authentic” and certainly prepared students well for “real-
world use of the language” (French 008). In other words, the new assessment
encouraged “authentic use of the target language between the students rather than
answering the teacher’s prepared questions” and was thereby “giving the message
to the students that speaking the target language is important in learning a second
language” (Japanese 101).
Interact thus promoted a future-focused authenticity, interpreted in both situa-
tional and (more importantly) interactional terms. In the words of one French
teacher, the assessment “isn’t just role-play.” Rather, it “prepares [students] for
going to France” and “makes them think what it would really be like” (French 042).
These notions were reiterated by a second French teacher (French 002) who argued
that, contrary to a prepared role-play, interact “helps the students to really interact
as they would if they find themselves in France.” For this teacher, the reality was
that, when in the target country, students “would never have a conversation that they
got to learn off by heart as it was the case with the old version.” Ultimately “we want
students to be able to interact with French people, that’s the main aim of learning a
language.” Interact, it seemed, was well suited to this aim, and through its use stu-
dents could “start to see that the target language is not a school subject but a living
language spoken by millions of people around the world” (French 145).
An important distinction for several teachers was therefore that the tasks that
students could engage in for interact could be “more ‘real’ life” (German 057). The
tasks required a level of spontaneity and naturalness that was clearly lacking, not
only in the kinds of speaking opportunities that appeared to have been elicited from
converse, but also in the more ‘traditional’ transactional role-play scenarios that
may once have dominated classrooms.
Another dimension of enhanced authenticity and interactiveness, seen in com-
parative terms, was the move away from having to account for specific grammatical
structures at different levels of the examination. With performances “judged on flu-
ency, ability to sustain conversation, and not on structures and accuracy” (German
098), students did not need to “cram in unnatural grammar in a conversation”
(Japanese/French 100) and were “not tied to particular language structures which
can hinder spontaneity” (French 081). They could therefore “converse with each
other more naturally and with less anxiety” (French 081). Having fewer structures
“rehearsed and chucked in” made the assessment “more real” (Japanese 035). The
benefit of students “having real conversations, not rehearsed ones,” was that they
were able to “speak naturally” and did not need to “‘develop’ their work the way
they used to need to to gain ‘excellence’ which was not natural in a conversation”
(Japanese 012). Consequently there was “way less ‘pressure to perform’ in the new
standard” (Japanese 019). With “no pressure put upon the student to operate a cer-
tain/required level for each interaction,” students could “interact more freely”
(Japanese 062).
One of several consequential advantages relating to the move away from gram-
mar as a central or decisive factor in the assessment and the shift towards commu-
nicative fluency was therefore “more natural language” (German 114) or “a more
‘natural’ conversation compared to the old standard” (Spanish 073). Also, tolerance
of error meant that making errors could be seen as “a normal process of learning a
language” (French 145).
Additionally, the development of students’ strategic competence was recognised
as a key advantage of the new assessment. With “less emphasis on ‘correctness’ and
more on communicating the message,” students were “forced to use and develop the
ability to respond, react, interact, and engage with the person/people they’re talking
to” (French 034). Freeing students from ‘correctness’, as another teacher put it,
facilitated “communication in the sense of conveying and receiving messages” and
enabled “a more genuine exchange of information” (French 025). Attention was
therefore placed “on interaction - that is, fillers, questions, comments, agreements
etc. rather than correct structures” which “promotes communication and fluency”
(French 091, my emphasis).
Indeed, the development of strategic language use was seen by several respon-
dents as an important contribution to the ‘naturalness’ of the new assessment.
Interact enabled students to “focus on the fillers, exclamations and sociable responses
that ‘oil’ good conversations” (Spanish 075); it promoted “more genuine interaction,
conversation fillers, a range of different situations (not just one)” (Japanese 076);
students learnt to “use ‘natural language’ - pausing, recovery, learning how to deal
with unexpected and not understanding.” In all these ways, the interactions became
“better preparation for ‘real-life’ interaction in the language” (Spanish 123).
Three respondents made direct reference to strategic competence as a theoretical
construct. That is, interact enabled a “shift from language focus in moving to com-
municative strategy focus (i.e. question, response, seeking clarification etc.)”
(Spanish 078). Interact facilitated “moving away from grammatical competence as
the determining factor to strategic competence” in a context where “errors aren’t
considered bad” (Unstated, 096). Thus, students developed “communicative and
strategic competency” and “end [the] year with ability to interact in a number of
different situations” (German 117).

5.3.2 Positive Impact

Another perceived benefit of what one Japanese teacher labelled “freedom from
accuracy” was that it “encourages risk-taking” (Japanese 029). In other words, in
contexts where students were “encouraged to work more independently,” they were
“usually more willing to take risks” and could thereby “have much more fun and be
more creative” (French 013). It seemed that students “enjoy being creative and com-
ing up with ideas that they are going to talk about” (French 041), and “visibly relax
and enjoy communicating” when there is greater focus on “communication and less
on inserting [prescribed] language” (French 147). An associated positive dimension
of interact was that “students are in control of what they want to talk about” (French
147, my emphasis) and “student to student interactions empower students,” making
the assessment “less like a test and more like real-life performance assessment”
(Unstated, 077, my emphasis). Positive interaction was therefore promoted, and
positive impact ensued.
Several other dimensions of positive impact for students emerged from the sur-
vey comments. These were: being able to interact with peers over a range of occa-
sions, making the experience less stressful; receiving feedback and feedforward;
greater ownership given to students to manage the evidence of their spoken
interactions.
Moving away from the teacher-led and teacher-dominated ‘interview’ towards
peer-to-peer interactions was seen to contribute to enhanced positive reception by
students. The students were now “speaking for real purposes, with their peers”
(French 147). Students were perceived to be “more natural working with peers
rather than teachers” (Unstated, 096). Peer-to-peer interaction appeared to be some-
thing that students “really enjoy” (French 008) and “find … motivating” (French
093). It also created the opportunity for “more flexible grouping” (Japanese/French
028) whereby students could “interact with a range of people” (Spanish 024) and
“mix with a variety of other students” (French 137).
It seemed that the peer-to-peer nature of the assessment contributed, in the think-
ing of several respondents, to a clear reduction in candidate stress. That is, the
opportunity to “converse with each other rather than the teacher” meant that stu-
dents were “not as nervous,” “more at ease,” and “more willing to try things out and
ask questions of each other” (Spanish 151). Because students were “less anxious”
(Spanish 036; German 057) and the experience was “less nerve wracking” (Spanish
073), students could “enjoy a more relaxed assessment environment” (Japanese
040) and there was “definitely more enthusiasm noted” (French 082).
Opportunities for students to “have their speaking abilities assessed over several
interactions” also contributed to enhanced positive impact because it was “more
realistic than being assessed on one very staged and rehearsed ‘conversation’ which
didn’t manage to live up to the title at all” (French 007). This made the assessment
“much fairer” (French 013), because decisions on performance were “not all hinged
on one piece of work” (Spanish 078). Students were able to provide “a range of
evidence over several contexts,” and there could be “a huge variety of choice in task
design” (Spanish 011).
The relationship between multiple interactions and a reduction in stress for can-
didates was noted by several respondents. Interact facilitated for students “a huge
choice of situations to practise their speaking skills in” (French 137). Removing
“the ‘one chance assessment’” (Spanish 024) meant that there was “less temptation
to produce a one-off, contrived performance” (Spanish 021). With students being
“assessed over multiple occasions” (German 116) each interaction could be per-
ceived as “low risk” (Japanese/French 028) and there were “more chances for stu-
dents to succeed.” Consequently, there was “less anxiety for students” (Japanese/
French 028) and the assessment was “less stressful” (Japanese/French 100).
Positive impact was also enhanced by the opportunity for feedback leading to
improved future performances. That is, because spoken interactions were “carried
out throughout the year not just on one-off activity” (Chinese 080), and because the
assessment was “ongoing, not one-off snapshot of ability,” opportunities for stu-
dents to “apply learning and correct errors in subsequent interactions” were created
(Unstated, 077). Students could therefore “show their progress over time” (Japanese
088) and, “if Interaction 1 did not work so well, students could do better in the sec-
ond one” (Spanish 112).
Ownership of portfolio management was also perceived by some respondents as
empowering. Students were able to “take their own responsibility to collect and
record their interactions” (Chinese 080). The facility to complete several interac-
tions and subsequently to select the pieces to be presented for summative grading
enabled “greater autonomy,” thereby improving the likelihood that “outcomes will
better reflect students’ best work” (Japanese 040), and helping to “eliminate ‘having
a bad day’ reason for non-achievement” (Japanese 062). This also allowed the stu-
dents “freedom to speak when ready. Students don’t have to talk on the topic –
[they] can choose their own topic. … Not end of the world if they get it wrong”
(French 120). Consequently, students were given “the opportunity to manage inter-
action assessments in less contrived and less stressful situations (more authentic or
less artificial).”

5.3.3 Validity, Reliability and Potential for Washback

An important consequence of the perceived advantages I have presented thus far
was that, in the perception of a number of respondents, the validity and reliability of
the assessment were enhanced. That is, “more interactions/opportunities to interact
give a better picture” of proficiency (Spanish 112) and “allow students to demon-
strate what they have learned” (French 120). Because “the sample of demonstra-
tions of speaking ability is greater,” this sample was “therefore theoretically more
representative of the student output” (Spanish 021). Also, since students “cannot
rote learn contribution,” the assessment enabled “a genuine reflection of what they
can do” (Japanese 048). There was therefore “better assessment data by collecting
more evidence than conversation” (Japanese 101). This provided, in comparative
terms, “a more accurate measure of the student’s ability to respond to an interaction
in a real-life situation” (Spanish 059).
Seen in the broader context of teaching and learning, positive washback benefits
were also noted. Interact stimulated the creation of “programs with more emphasis
on speaking” (Japanese 101) and attempts to “make communication central”
(German 116). Consequently, interact was compelling teachers to “teach in a man-
ner which encourages communication in authentic situations” (German/French
049), leading to “more … unofficial conversations” (my emphasis); “less fear of
using the spoken word” (French 060); “use of the target language in the classroom”
(French 093) “on a more regular basis and in a natural way” (French 082). As one
teacher noted, as a consequence of interact “I am doing way more speaking in the
class. Interactions happen all the time whether recorded or not” (Japanese 019).
Several respondents neatly expressed what they perceived as the end results and
consequential advantages of interact for their students. Two spoke of a ‘real sense
of achievement’ that came “after completing these totally unscripted interactions,”
because students “realise this is something they could now do in real life” (Spanish
039), and they were now “able to interact more freely, confidently and accurately”
(German 129). Another commented that “students’ fluency has definitely improved
and they feel at ease speaking” (French 032). One comment captured the essence of
several perceived advantages to interact: “I like the real-life situations, the student
to student nature of the tasks and the fact that it is ‘error tolerant’ and focuses on
communication” (French 127).

5.4 Advantages of Interact – Interviews

The interviews enabled the opportunity to elicit parallel qualitative data concerning
teachers’ perceptions of interact in comparison with converse. The teachers (n = 14) represented the full range of languages across a range of different school types, and included three colleagues who were or had been involved with the trialling and/or mod-
eration of interact in schools and who could bring a broader perspective to the
issues (Table 5.5). (These three colleagues are subsequently referred to as ‘lead
teachers’.)
In what follows, I draw on comments from interview participants in ways that
throw light on the issues raised by teachers in the open-ended section of the national
survey.

Table 5.5 Interview participants (Stage I)


Pseudonym   Principal language taught   Type of school^a
Dongmei Chinese Boys’ state school
Jane French^b Co-educational state school
George French Girls’ integrated school
Françoise French Boys’ state school
Monika German^b Boys’ state school
Carol German Boys’ state school
Peter German Girls’ state school
Mary German Boys’ integrated school
Celia Japanese^b Co-educational state school
Sandra Japanese Co-educational state school
Yuko Japanese Girls’ integrated school
Sally Japanese Co-educational state school
Janine Japanese Girls’ state school
Georgina Spanish Girls’ state school
Notes
^a A state school is a government-funded school; an integrated school was once a private (often church-founded) school, but is now part of the state system whilst retaining its ‘special character’
^b At the time of the interviews these teachers were currently or had previously been involved in the trialling and/or moderation of interact

5.4.1 Authenticity and Interactiveness

Interview participants were initially asked what they understood to be the main
purposes of interact. Answers focused on a clear recognition of its intended authen-
ticity and interactiveness, particularly when seen in comparison with converse.
Lead teacher Celia noted that, when considering language learning from the stu-
dents’ point of view, “part of the reasons why students learn a language is to be able
to use it.” This effectively meant that “the main skill they want is to be able to go
and have a conversation with a French person, a Japanese person, a Chinese person,
whoever.” Several teachers concurred with this communicative and interactive view.
Georgina noted that the goal of interact was “to showcase students’ ability to com-
municate relatively fluently in Spanish or whatever language it happens to be.” In
other words, interact was there “to provide … ongoing and authentic situations for
kids to use a language in different situations” (Dongmei) or “to enable students to
be able to interact in Japanese about relevant topics and relevant situations” (Sally).
The main purpose of interact was therefore, in Monika’s words, “the idea of moving
real life interactions, spoken interaction, into an assessment situation.” After all, as
Mary put it, “your ultimate goal as a language teacher is to allow [students] to com-
municate with anybody in the target language, not just their teachers.”
Several teachers elaborated on the notion that the assessment was designed to
reflect the central goal of communicative language teaching programmes – the
ability to communicate, particularly in future real-world contexts. Janine argued
that interact would:
allow the students the opportunity to have [and] practise the skill of having a conversation
in the target language. You know, ultimately they want to be able to speak when they go to
the country, so I think the standard was based on the fact that communication is the most
important thing for a language learner and ‘how can we help the students to gain that skill?’

George put it this way:


The main purpose is really to engage in an authentic and purposeful piece of communica-
tion. That’s what it’s all about, it’s about not learning the language per se but learning the
language to put into practice, into use and to be able to converse with someone else effec-
tively - that’s the whole point.

Seen in comparative terms, therefore, the former converse standard, which, it
seemed, did not facilitate the goal of authenticity, was not useful or fit for purpose.
In Mary’s view, the assessments aligned with converse were “a bit artificial and …
often became little scripted speeches in response to questions rather than free flow-
ing language.” Dongmei concurred that, in the days of the one-off conversation with
the teacher, “everything [was] artificial, very manageable because you tell the kids
‘okay this is the topic we’re going to talk about.’ They’ll sort of brainstorm possible
questions, go away, prepare answers, pre-learn, so it’s very much like very
rehearsed.” Jane reiterated the same point:

I think that the conversation wasn’t ever a conversation. Even if people didn’t have their
pre-prepared list of questions that they were going to ask the student … it was always very
clear what was going to be asked, there wasn’t anything natural about it.

Converse was therefore effectively a “once a year fake conversation … not a real
conversation” (Sally). It was “more scripted and controlled” (Yuko), or “so artifi-
cial” that “even as a language teacher, you probably wouldn’t answer questions like
that in the target language” (Mary). Interact, by contrast, was “making it more natu-
ral” (Mary). With interact, students “have to have three different scenarios with
different partners … [and] they have to actually negotiate meanings” (Dongmei).
Allied to perspectives about the enhanced authenticity of interact was the recog-
nition of the benefit of moving away from the requirement to force particular gram-
matical structures into use. Sally suggested, “I guess [with] the old conversation
standard the focus was on the kids producing as many [appropriate] level structures,
grammatical structures as they could to get excellence or merit in a really artificial
way.” By contrast, interact was “way more natural.” Georgina thought that the de-
emphasis on grammatical structures was “great.” She went on to explain, “as lan-
guage teachers we know that when you speak you don’t automatically use all those
structures … you can be perfectly fluent, speak and understand at a really high level
without using specific structures.”
As lead teacher Monika put it, in terms of assessment “you thankfully don’t have
to fail a student any longer if there isn’t this one magical phrase or this one magical
tense.” Nevertheless, a de-emphasis on grammar did not necessarily negate the
place of grammatical accuracy. Rather, it seemed that grammatical accuracy was
relegated to an important support role in terms of the extent to which it facilitated
effective communication. For Monika, “clearly the onus is on communication and
you really only fail if you can’t communicate at a particular level” (my emphasis).
However, the “quality step up,” that is, achieving the higher grades, “is still influ-
enced, not determined but influenced by the student who is more capable of manipu-
lating the language accurately or appropriately.” Monika went on to explain, “I
think to me that is where the accurate use of a grammatical structure or the accurate
use of a word comes in as a quality marker.” In Monika’s view, “the step up, I think,
is still determined by things like the classical ‘how many words do you know and
can you construct a proper past tense?’ and stuff like that, to a lesser degree than it
was before, but I don’t think that that has completely gone” (my emphasis).
The grammatical issue was therefore, from Mary’s perspective, to move away
from “thinking ‘oh I’ve got to pack in past tense, future tense, present tense and I’ve
got to pack in all of these conjunctions and all these other things into one conversa-
tion’, which wouldn’t naturally happen.” Rather, interact meant that accuracy was
required, but its purpose was to contribute meaningfully to the communication –
students could now “show you breadth in their assessment and their answers and
show you a lot of different language, but naturally occurring language, I guess” (my
emphases).
A consequence of moving away from artificiality therefore became the develop-
ment of students’ skills in using language more naturally and more flexibly.

As Françoise explained, students were “not stuck on talking only about one thing,
about having one interaction only … they are able to touch different aspects, differ-
ent topics.” Additionally, Sandra argued that, under the old one-off assessment, “if
somebody mucked up on something that they’d sort of rehearsed, it threw the whole
thing.” With interact, students were now “not so focused on everything being per-
fect and they know how to get something restarted if it doesn’t follow what they
expect it to do.”
In addition to appropriate (rather than forced) grammar and lexis, therefore, was
the development of strategic competence. Sandra went on to provide the following
illustration of what she saw as a key skill required to “be spontaneous”:
being able to give and take information … and they work a lot more on the natural side of
things, like being able to say ‘um’ in Japanese, which is really simple, or being able to say,
‘pardon, I didn’t quite understand’, or ‘can you say that again?’ or some of those formulaic
things that I think they use a little bit more naturally now than what they used to.

In George’s view, interact thus prepared students for “the unexpected,” and stra-
tegic competence became an imperative. That is, students had to come to the realisa-
tion that “learning a language is an organic process and it’s an evolving and lively
thing.” As a result, “the language never goes where you want it to go.” With interact,
and by getting students to work with others, “you just force them to come across
unexpected situations and then they’ve got to get by linguistically.” This made inter-
act “so purposeful, preparing for the unexpected … the whole genuineness, the
whole genuine aspect of a conversation.”
There was therefore a sense that, in Carol’s words, in terms of language “the
minimum level, the bar, I think, has been raised … the quality of interaction has
been raised and the breadth of what needs to be covered has been raised.” Quality
and breadth were not, however, to be determined by grammatical structures, unless
relevant to the interaction. That is, with interact, “there is more of a need to show
natural interaction rather than … just ask the one question or whatever” (my empha-
sis). Dongmei put it like this: “in three different interactions they have different
scenarios and different partners so they need to learn to use different language fea-
tures, cultural knowledge as well … three situations that use different language,
formal, informal and so on.”

5.4.2 Positive Impact

Positive impact on students was seen in the facility to record several interactions
over time. As Georgina explained, “they’ve got the whole year to kind of hone their
speaking” alongside “a whole range of topics that they can touch on.” Students
would inevitably “have strengths and weaknesses and likes and dislikes on certain
topics.” Balancing out across several interactions therefore mitigated the effect of
not performing so well on one particular interaction. This made the experience, in
Janine’s words, “less stressful because they can do it as many times as they want and
it does allow more creativity and more freedom for the students as well, which is
nice.” Sandra and Monika provided parallel viewpoints. For Sandra, interact was an
improvement on converse because “it’s not just one assessment. If the students
really botch something up terribly, they know it’s not the be all and end all. They
have other opportunities.” Monika expressed the same perspective in these words:
“you have more than one chance … if one task is a bit of a dud … you just make up
another one.”
Multiple opportunities for interaction also contributed positively in terms of tap-
ping into cumulative development in proficiency. Françoise argued that, in addition
to making the assessment “more manageable” because it was made up of “little
pieces,” interact enabled students to “see some progress.” As a result, “I feel that we
are doing better, they are less stressed by the whole thing, by all their work, relying
on only one final exam.” Dongmei similarly suggested that, by spacing several inter-
actions throughout the year, additional to the benefit that this gave students “more
different varieties … to use different types of language,” the students also had “more
practice.” This meant that they were “more familiar with the expectations and they
get better and better … more competent … so I think the mere fact that they get
more opportunity to practise, that’s good.”
Multiple interactions also provided the opportunity, in Dongmei’s view, for
interact to be “quite empowering for teachers because we can be quite creative with
our task design” (my emphasis). Creativity of task also facilitated opportunities for
students to talk about things they would like to talk about. Assessment tasks could
therefore tap into students’ interests, enhancing both interactiveness and impact. As
Yuko argued, interact not only gave students “more opportunity to show what they
can do in different topics or different situations.” It also meant that students “can be
more natural, close to who they are.” Yuko went on to explain what she meant:
“they’re teenagers, they want to talk about the music … or sports or shopping … the
more natural topics that they want to talk about as [part of] who they are.” Monika
thereby asserted that “there’s a huge engagement by the students and it’s very flex-
ible, I find, and the students succeed. … They are great at it, they love it, so that’s
the advantage.”
An additional dimension of positive impact related to the peer-to-peer nature of
the interactions. Georgina argued that, under the old system, “even though I’m
fairly friendly and I feel confident in my students’ ability, they all felt anxious about
having a conversation with me.” She reflected on a recent experience where she
thought that her students would have got over nervousness with her, having spent
four weeks with her on a school trip to Chile, but this was not the case. Georgina
noticed that her students “still came back and felt anxious about it.” Referring to the
“comfort” factor that, in her view, should make the interactions “a bit easier,”
Georgina noted that “I think the fact that they are doing it with a friend is a strength.”
Peter likewise observed that “the kids are comfortable doing the interactions now”
because “they do them with one another.” Carol similarly argued, “although it’s still
high-stakes, having a friend, having people that they get along with, has taken the
edge off it, I think.” Indeed, ‘taking the edge off’ was a means of diminishing the
perception of ‘high-stakes’ in students’ eyes. Mary asserted that peer interaction
“actually allows some students to achieve better results because they don’t have that
anxiety towards assessment around it.” Mary went on to explain that there were “a
lot of kids that suffer from exam anxiety or assessment anxiety and that doesn’t
necessarily come through with the interact.”
Ownership of portfolio management was also touched on in the interviews as a
contributing factor to positive impact. Françoise argued that an advantage of inter-
act was that it would make students “responsible for their learning, looking after
their portfolios.” Françoise acknowledged that this was challenging. That is, “hav-
ing them taking responsibility” and “being more flexible and spontaneous” was
“quite new” and “a completely different mindset.” This meant that it was “hard for
them to take responsibility just by asking them,” and “in the very beginning … quite
upsetting for them” because “they don’t know what to expect.” Nevertheless, as her
own students began to get used to greater ownership, they “told me ‘well, now we
know what to expect and we think we have done okay’, so they are more comfort-
able with it.” As Georgina asserted, “I think it’s a strength that they have to manage
their own internal.” In other words, “it’s ‘you have to do it and give it in and you
have to listen and you have to act upon this feedforward, and if you don’t, tough.’”
It was therefore empowering that, as Monika put it, students “have so much more
control over what they feel is their best work.”

5.4.3 Validity, Reliability and Potential for Washback

The perspectives I have so far presented from the interviews contribute to the per-
ception from the survey comments that interact would promote assessment oppor-
tunities that were valid and reliable, in other words, assessments that aimed to
replicate the ability to communicate naturally and proficiently with a range of
speakers in a range of contexts. In Sandra’s view, interact enabled teachers to “see
that the students can carry on a conversation with spontaneity and unrehearsed, to
have some give and take rather than a conversation that is absolutely perfect but
doesn’t really reflect what they’re able to do in real life.” As a consequence of focus-
ing the assessment on communication, Carol argued that what she thought the stu-
dents achieved were “much, much, much better, much more competent
communicative skills. … I think they end up far more competent as language learn-
ers and I think they are much better able to go and live in the country.” In Sally’s
thinking, “the students, when they leave you, they’ve got confidence in speaking
and in real life situations; they are learning a language and they are actually coming
away with a skill.”
Additionally, several interviewees noted that the move away from the one-time
snapshot conversation to a series of interactions throughout the year was promoting
positive washback. Sally, for example, reflected that “it’s made me look at what I’m
teaching and how relevant.” Carol explained that, with interact, the scenario was no
longer “just get ready quickly for a conversation at the end of the year.” Interact
would “encourage students and teachers to use the language more” because it would
“force students and teachers to really integrate speaking into everything they do.”
The consequence would be “to intensify the teaching, especially to intensify spoken
language within the classroom, to encourage teachers to move to an immersion
model if possible and to make it also more relevant to students.” Certainly, in Celia’s
view, a move away from “just the teacher asking questions” would have the conse-
quence of “making teachers actually teach the kids how to interact more.” Yuko
summed up the washback implications neatly: from the teachers’ perspective the
assessment would “make the teacher think why we are teaching the languages.”
From the students’ point of view, they would no doubt find it “a lot more useful
when they’ve finished the course” because “they can communicate a lot more than
before.”
Comments such as ‘made me look’ (Sally), ‘force students and teachers’ (Carol),
‘make teachers teach kids how to interact’ (Celia) and ‘make the teacher think’
(Yuko) suggest an element of compulsion in terms of washback. Indeed, several
interviewees homed in on what they perceived as a deliberate attempt on the part of
those ‘in authority’ to drive a particular communicative agenda. This was not seen
in negative terms, however. Dongmei argued, “well, we’re supposed to teach to the
curriculum, but obviously we don’t, we teach to assessment, so if you want to
change people’s pedagogy the only way to change it is through assessment.”
Interact, in Dongmei’s view, would achieve this “because we have to assess three to
five times throughout the year.” She concluded, “I think that’s really good.” George
put it similarly: “I think the people in the Ministry were really quite clever … if you
want to change the teaching force and the way they teach you’ve actually got to
manipulate the assessment format.” Through interact and its ongoing assessment
“they actually force the teachers to change the way they teach.” Sally concluded:
It’s not that I’m teaching to assessment but it’s definitely impacted on what I am teaching,
so I’m thinking to myself ‘why would I teach this if it’s not going to lead to a natural sce-
nario … a useful scenario?’ And so I have changed, and am still in the process of changing,
my teaching programme to be relevant and realistic.

5.5 Conclusion

The findings of the national teacher survey indicated that teachers perceived several
advantages to interact in comparison with converse, together with several chal-
lenges to its implementation. The open-ended data from both surveys and inter-
views illustrated the perception that, in comparative terms, interact was considered
to promote more natural, spontaneous, authentic interactions. Indeed, authenticity
and interactiveness were identified in comments by just over two-thirds of survey
respondents, with the de-emphasis on having to force particular grammatical struc-
tures into use seen as a key component of this. Dimensions of positive impact were
noted by one in three respondents. These included the opportunities for multiple
interactions among peers across a range of topics, and a final selection process that
enabled students to showcase their best efforts. Additionally, enhanced validity and
positive washback, in terms of a greater emphasis on genuine communication in the
target language, were thought to ensue.
Lead teacher Jane summed up several of the perceived benefits of a focus on
genuine communication in these words:
I think it’s really great to get students talking to each other. Because you know that across
the world people are sitting in foreign language classrooms conjugating verbs, and that’s
not healthy, and that’s not really what you are hoping for. [Interacting] is the thing that you
will have to do the most in a foreign country, so I think that’s fantastic.

This chapter has highlighted several perceived benefits of interact in
practice. However, despite these benefits, several challenges to the successful imple-
mentation of interact were raised in the data. In Chap. 6 I consider perceived disad-
vantages to interact and suggestions for its improvement.

Chapter 6
The Disadvantages of Interact and Suggested
Improvements

6.1 Introduction

Chapter 5 drew on data generated from the national survey and the teacher inter-
views from Stage I of this two-stage study. Analysis of the closed-ended section of
the survey (Section I) revealed several significant advantages to interact in compari-
son with converse when interpreted from the perspective of different dimensions of
test usefulness. Essentially, interact was perceived to be a more valid, reliable,
authentic and interactive assessment than converse. Several
advantages to interact also emerged from the coding of the open-ended comments
(Section II). These comments, substantiated by the interviews, supported the posi-
tive perspectives from the closed-ended data and also threw light on those dimen-
sions of student impact that a number of respondents considered to be positive.
There were also positive implications for washback.
Several limitations to interact were also identified in the closed-ended data.
There was a level of ambivalence around impact. Although it was perceived by the
teachers that, in the students’ eyes, interact was a better assessment of their profi-
ciency than converse, one measure where interact was perceived to be no different
was in terms of student stress – that is, students would feel equally stressed, what-
ever assessment they took. The closed-ended data also revealed one significant
comparative disadvantage to interact – impracticality.
In this chapter I consider, from the open-ended data (surveys and interviews),
perceived disadvantages of interact and, as a consequence, suggested improvements
to interact. Once more, these issues are explored with reference to different dimen-
sions of the test usefulness construct.

6.2 Disadvantages of Interact – Survey Data

From the open-ended data, perceived disadvantages of interact (see Chap. 4, Table
4.2), subsequently grouped according to the relevant qualities of test usefulness,
were impracticality and negative impact. The frequencies with which these two
themes were identified in the data are recorded in Table 6.1.
The frequency counts showed that impracticality (i.e., the fact that interact,
compared with converse, was seen as considerably more impractical to administer)
stood out as a clear disadvantage, with at least four out of every five respondents
mentioning this. With regard to this disadvantage, and taking use or non-use into
consideration, there was no significant difference in frequency between the groups,
χ²(1) = 0.7156, p = 0.4. In other words, whether teachers were using interact or not,
issues of impracticality loomed large in teachers’ thinking. (As with the frequency
data on perceived advantages recorded in Chap. 5, χ² tests were not performed for
the second comparison because in this case one observed cell count was less than
five.)
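For readers who want to reproduce a comparison of this kind, the following is a
minimal illustrative sketch only, not the analysis used in the study: it runs a
chi-square test of independence on a 2 × 2 table of mention/no-mention counts
reconstructed from Table 6.1, using Python and the scipy library. The exact figure
reported above may differ depending on how the cells were defined and whether a
continuity correction was applied.

# Illustrative sketch only: cell counts are assumed, reconstructed from Table 6.1;
# this is not the computation reported in the study.
from scipy.stats import chi2_contingency

observed = [
    [101, 122 - 101],  # using interact: mentioned impracticality / did not mention
    [26, 28 - 26],     # not using interact: mentioned impracticality / did not mention
]

# chi2_contingency applies Yates' continuity correction to 2 x 2 tables by default.
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.4f}, p = {p:.3f}")
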
In what follows, I present excerpts from the open-ended survey comments that
illustrate the different dimensions of disadvantage.

6.2.1 Impracticality

In the open-ended comments it was very clear that words such as ‘time-consuming’
and ‘unrealistic’ dominated the discourse around disadvantages. Certainly among
those who reported that they had chosen, at the time of the survey, not to use inter-
act, there was some evidence to suggest that time may have played a factor in this
decision. For example, interact was found to be “time-consuming to administer and
gather evidence” (Japanese 118), particularly with large classes. This teacher’s
school had “tried it at level 1 last year.” The teacher reported that, unfortunately, “it
took us one week to gather evidence of one task,” meaning that there was “precious
little time for anything else!” The fact that interact took “a lot of administration
time” thus effectively made it “a torture to teachers” (Chinese 119).

Table 6.1 Frequencies of mentioning disadvantages of interact

Disadvantage          Frequency of comment    Total no. of respondents (a)   Percentage of respondents
                      Using      Not using    Using      Not using           Using      Not using
1. Impracticality     101        26           122        28                  83 %       93 %
2. Negative impact    16         2            122        28                  13 %       7 %

Note: (a) From 152 returns, 150 respondents made comments relating to disadvantages of interact

Technology also played a part in negative perceptions about impracticality. Not only was inter-
act “far too time consuming” such that there was “no way I’d have time to teach
anything else if I decided to do the interact standard,” there were also “limited
facilities (for recording etc.)” (French 001). There was therefore “too much time
spent organising resources in the classroom, and finding adequate opportunities to
assess, as well as finding data storage space” (Japanese 004).
For those reporting not using interact at the time of the survey, interact was
therefore seen as “three times the work, including (1) preparation of all the pupils
(2) recording times three at least (3) organisation of assessments” (French 010),
making it “far too heavy a workload for both students and teachers” (French 023).
Consequently, it was “unrealistic to expect the busy teachers to do this!!” (Japanese
037).
A large number of those who reported using interact at the time of the survey
also expressed strong concerns about its impracticality, mirroring several of the
arguments put forward by those who were not using the assessment. In comparison
with converse, interact had effectively “tripled our workload” (French 017). The
increase in workload was perceived to impact not only on teachers but also on stu-
dents. In other words, there was “workload for the teacher – administering it, pre-
paring students for it, assessing it and marking each interaction, as well as managing
the final portfolio of tasks.” There was also “workload for students to manage all
these tasks with their other subjects” (Spanish 011). There was therefore a sense in
which, in a context where “learning the target language is not the only area of stu-
dents’ learning” (Japanese 085), the workload implications were a distinct disincen-
tive, for teacher and student alike. In essence, the assessment was “totally stressful
logistically” (Spanish 075) and “way too much work for everyone” (French 065).
Interact thus represented, in comparative terms, a “massive increase in work-
load,” not only to implement but also to assess, and “much more class time lost in
shaping students’ performance” (Japanese 040, my emphases). Indeed, the notion
that workload factors had “cut into my teaching” (French 032), and that other per-
ceived important dimensions of languages programmes were being compromised,
was expressed by several respondents. Interact “takes up a lot of class and prepara-
tion time, to the detriment of the overall teaching programme” (German 098); “takes
up far, far too much time. Cuts into class time. Lots of work for everyone” (Japanese
125); there was “just so little time in the year to cover the curriculum as well as
encourage, support, and cajole students through three interactions” (French 070).
One respondent (German 057) noted being already “about 6 to 8 weeks behind with
my normal teaching plan” at the time of the survey. There was therefore a sense that
the “ongoing nature of portfolio is all consuming. We have no time to teach new
topics and are so busy collating evidence” (French 086). There was also concern
that the “huge focus on spoken ability has resulted in a decrease in written ability
and higher-level language” (French 144).
In summary, dimensions of impracticality focused on both workload and the
logistics of administration. That is:
It is extremely time-consuming to devise suitable tasks which are going to elicit spontane-
ous language from both partners, to administer, to assess. It is extremely complicated to
administer, unless you have access to one flip camera for each pair of students and a sepa-
rate room (or preferably several separate rooms) to put them in while they’re recording.
(French 093)

In other words, “the inordinate amount of time it takes to prepare students for
each task” was problematic enough, without the added burden of “the technological
side (a nightmare for me, as I cannot manage it without help)” (French 064).
In turn, and in contrast to the perceived advantage of multiple assessment oppor-
tunities (see Chap. 5), there was a perception that the languages classroom had now
become assessment-dominated. This was especially challenging when considering
that interact was not the only new assessment, and that writing was also to be
assessed through an evidence portfolio approach. In a context where “I thought the
new curriculum was not driven by assessment!” (Japanese 134), for several respon-
dents interact was clearly seen as “assessment driven learning” whereby teachers
and students “always seem to be working on/preparing for an assessment” (Spanish
108). Students could therefore feel as if they were “being constantly assessed”
(French 007). In turn, and with “the time it takes to set up the whole series of sce-
narios,” it became problematic “to motivate the students each time in a system
where internal assessments must combat the attitude of ‘oh no, not another assess-
ment’” (Spanish 136).
An additional practical burden, once more in contrast to the perceived advantage
of learner autonomy (see Chap. 5), was the expectation that students should be self-
managers, taking ownership of their own portfolios of evidence. Where students
were often “too used to being given instructions and led by the hand” (French 147),
a key challenge was therefore “folio management” because “students don’t cope
well with this as self-managers” (French 055). Consequently the management of
students’ work became “a big burden on [the] teacher’s time, if teacher takes respon-
sibility” (French 055). As one teacher (French 034), reflecting on work with a Year
12 class, expressed it, impractical dimensions were not only “getting students ready
and collecting evidence” but also “managing the collection of evidence.” This was
taking up “heaps of class time,” essentially because “getting students to take respon-
sibility for their own work” was “a battle.” As a consequence, the class was “a
whole unit behind usual.” Another teacher (Japanese 019) put it like this:
When students aren’t able to ‘manage self’, there is a lot of pressure on the teacher to spend
extra time in class, lunch, after school to make sure the students have enough interactions
and evidence to have a chance of passing, reaching the standard.

That is, “students without self-management skills don’t do well” (Spanish 104).
This contributed to the sense that the assessment was “completely unmanageable in
relation to workload – particularly if you have large classes” (French 056). As a
consequence, this teacher was seriously considering “either not offering the stan-
dard in 2013” or moving over to a different assessment system, such as the IGCSE
(see, e.g., University of Cambridge, 2014), or even “limiting the number of students
who continue with languages.”

6.2.2 Negative Impact – Unrealistic Expectations

There was clearly a sense in which the logistical challenges of managing interact
were creating an impression of negative impact, both for teachers and students.
Additionally, a number of respondents raised other concerns regarding impact.
Comments focused on two dimensions: the perceived unrealistic demands of the
assessment when taking into account the students’ proficiency levels, and the poten-
tial unfairness of interlocutor variables. Allied to these dimensions was a perception
that interact was, after all, a high-stakes assessment, with all that this implied for
students’ performances and students’ anxiety.
Unrealistic demands for students focused on the issue of ‘spontaneous and unre-
hearsed’, particularly given that the students for whom this was a requirement when
interacting were perceived, after only a few years of instruction, to be operating, at
very best, at a proficiency level equivalent to CEFR B1, and, more likely, at level A2
(Council of Europe, 2001). That is, “the ‘unrehearsed’ requirement is ridiculous”
(Unstated 067) and “the emphasis on being ‘spontaneous’ is too big an ask of our
students. They find it almost impossible to do this in unrehearsed situations” (French
008) – the expectation was therefore “depressing, as it demands command of the
language and confidence most students don’t have.” To “talk off the cuff on a topic
is extremely difficult” (Japanese 110), hence it was idealistic to expect performances
that were “authentic, spontaneous, no rehearsal et cetera” (Chinese 027), especially
“if you want an interaction that demonstrates their ability to communicate in more
than monosyllabic language” (French 132).
In other words, the students were “nowhere near fluent” (Unstated 077). They
needed a “considerable amount of language immersion to be able to cope with hav-
ing a conversation off-the-cuff. The school year does not provide the time necessary
for this immersion to happen and therefore students feel intimidated” (Japanese
124). With interact it appeared that “we expect conversations from students who
have only done the language for a short amount of time” (German 114). As a con-
sequence, students “find it very stressful to be put on the spot and go into a conver-
sation unprepared, that is, without anything they can hold onto.” In other words,
students “hated [it] as they are not sure at all about what to say and feel unprepared”
(French 107). The whole exercise had therefore become “hugely stressful to stu-
dents” (Japanese 110).
The negative impact of student stress was also acknowledged in the recognised
high-stakes nature of interact. That is, perceiving the assessment as high-stakes
meant that students “want to prepare” (German 057); “like to learn things to say”
(German 116); “can’t do without some preparation” (Japanese 152); “want to prac-
tise beforehand,” immediately making the performance “no longer spontaneous”
(French 093). Since this was “still an examination after all and they want to do
well,” it was “hard not to have students scripting speaking tasks” (French 013). The
interactions were therefore “still contrived. Students cannot interact effectively
without preparation. They don’t like being made to use conversation strategies etc.
which seem false” (French 041).
The end-result of a perceived unrealistic expectation to be ‘spontaneous and
unrehearsed’ was that, in some contexts, “students’ peer-to-peer interactions have
ended up planned when they shouldn’t have been” (French 144). As a result, “both
old [converse] and new [interact] do not reflect the ability of students to communi-
cate. To have students at the level the standard indicates is not possible in a school
setting!” (French 069).

6.2.3 Negative Impact – Interlocutor Variables

The circumstances I have so far described could lead to “often unsatisfactory record-
ings.” This was not only because performances were “over-rehearsed and not spon-
taneous.” Performances were also adversely affected by “students not working well
together, being too simplistic in language used” (Japanese 106). In turn, this raises
another dimension of negative impact for students – the influence of interlocutor
variables. That is, not only was it sometimes “difficult to stop students rehearsing if
they know they will be assessed together,” occasionally it was “difficult to pair off
the students, then when paired up, absences foul up the recording plans” (French
135). It could therefore be “hard to arrange the recording … when there is a high
level of absenteeism” (French 127). Beyond absences, pairing students could also
be problematic because “sometimes the partner is not as cooperative or diligent”
(Spanish 136). Additionally, students’ interactions were “often dependent on their
partner’s ability which can make it harder for them” (French 131). In situations
where “students of different abilities work together sometimes” inevitably this
could “affect performance” (Japanese 074).
One teacher summed up the dimensions of negative impact with these words:
If administered as suggested by the Ministry, then it is far more stressful: an off-the-cuff
interaction, no practice or preparation, only good for the very best students, makes it unfair.
Consequently, most students, with the blessing of their teachers, do prepare, practise and
rehearse with their interaction partner. Therefore it is invalid as a ‘spontaneous’ dialogue. It
doesn’t tell us what the student could really do in a real-life situation. A weaker student is
going to make it very difficult for a good one to show what they can really do (in fact this is
why guidelines for the old conversation suggested that these should not take place between
two students!! We have come a long way – but in which direction??). (French/Spanish 141)

6.3 Suggestions for Improvement – Survey Data

The above recounting of perceived noteworthy disadvantages to interact raises sev-
eral important issues which, in the perception of a number of respondents, bring into
question the validity, reliability and fairness of interact. In turn, this raises the ques-
tion of whether interact is as useful or fit for purpose as it might be. The perceptions
outlined above must, of course, be laid alongside the perceived advantages to
interact which I presented in Chap. 5. Nevertheless, perceived challenges in prac-
tice led several respondents to consider ways in which interact might be improved.
In what follows, and building on survey respondents’ perceptions of the weaknesses
of interact, I present what respondents suggested were possible ways to improve the
assessment.
As I noted in Chap. 4, four areas for improvement to interact were identified in
Section II of the survey:
1. Reduce the number of interactions required
2. Allow provision for scaffolding/rehearsal
3. Provide more examples of appropriate tasks
4. Provide more flexible assessment options.
The frequencies with which each of these was identified are noted in Table 6.2.
Each of these noted improvements is presented below with reference to the open-
ended survey comments.

6.3.1 Reduce the Number of Interactions Required

Bearing in mind that impracticality clearly loomed very large in teachers’ thinking
about interact, it was not surprising that several respondents focused on reducing
the number of interactions as a possible solution to the practicality dilemma. One
teacher who reported not using interact at the time of the survey noted, “I like the
spirit of making language more natural and that holistic communication is seen as
more important than stuffing in ‘structures’” (French 050). Nevertheless, for this
teacher, “the only way I would consider doing the standard would be to have one
interaction to mark.” This sentiment was expressed by several others who were not
using interact – that is, “keep the option open to do only one interaction (like the old
standard)” (Spanish 109); “only one piece of evidence … should be sufficient”
(Japanese 118); “change it into one assessment … make it easy for teachers to
administer” (Chinese 119). An alternative was to “record only two interactions” and
then “choose the best one to submit” (Japanese 113).

Table 6.2 Frequencies of mentioning improvements to interact

Improvement           Frequency of comment    Total no. of respondents (a)   Percentage of respondents
                      Using      Not using    Using      Not using           Using      Not using
1                     35         5            95         17                  37 %       29 %
2                     16         0            95         17                  17 %       0 %
3                     12         1            95         17                  13 %       6 %
4                     9          0            95         17                  9 %        0 %

Note: (a) From 152 returns, 112 respondents made comments relating to improvements to interact
Among those who reported that they were using interact at the time of the sur-
vey, opinion varied as to whether one or two final pieces of evidence might be
required to assess students’ spoken proficiency, even if students completed further
interactions in class for non-assessed purposes. For example, “allow just one inter-
action to be submitted if it demonstrates students’ best work” (Spanish 021); “maybe
do several interactions but then choose just one for the final submission/assessment”
(French 070); “students do three interactions but we submit their best one as evi-
dence” (Spanish 078). In the view of this Spanish teacher, this would facilitate, by
virtue of three pieces of evidence, “more concrete feedback to parents and students
throughout the year,” but would also mean, by virtue of only one assessed piece,
“less marking on completion of the portfolios.”
One teacher argued that “I really don’t see why three, four, five pieces of work
shows you anything more that can’t be seen in the one-hit approach” (French/
German 086). This teacher went on to assert, “I really like the IGCSE oral exam,
students present a short speech/teacher asks them some questions about their pre-
sentation, then evolves into a more general conversation about a fixed/prescribed
range of topics (the topics that have been studied all year).” In this teacher’s view,
“three, four, five pieces of evidence especially from the start of the year do not pro-
duce same quality of performance as the ‘one and only’ chance in an oral exam
setting or a fixed date.”
Other respondents were happy to consider requiring “only two pieces of evi-
dence” (Chinese 088). This was because surely “two pieces of interactions would be
enough to measure a student’s ability to interact,” apart from also being “more man-
ageable for the teacher” (Spanish 112). In several cases, it was suggested that one of
the two submissions might be derived from a teacher-student interaction, on the
basis, as one teacher put it, that “successful ‘interactions’ rely on plentiful model-
ling by a competent speaker of the language” (Spanish 036). In the view of another
teacher, and bearing in mind the argument that “it’s very difficult for the students to
carry out conversations in another language at these levels,” it may be that the
requirement should be “two rehearsed conversations + one spontaneous,” which, in
this teacher’s view, would “work for the majority of the students” (Chinese 149).

6.3.2 Allow Provision for Scaffolding/Rehearsal

Another limitation to interact apparent from the open-ended survey responses was
the perception that expecting ‘spontaneous and unrehearsed’ interactions, particu-
larly at NCEA levels 1 and 2, was idealistic and unreasonable. In this connection, a
second consideration for improvement was to recognise the apparently unworkable
and ridiculous nature of the requirement, and thereby to soften it. Comments relat-
ing to scaffolding or rehearsal were made only by those who reported that they were
using interact at the time of the survey. For teachers who commented in this regard,
there was a perceived need to “make the requirements realistic. ‘Spontaneous’ will
drive all students [to] give it up” (Chinese 027) or “remove unrealistic expectations
that students at A2/B1 levels … are able to have spontaneous conversations which
include a wide variety of more complex language, much of which they have only
just encountered and not fully mastered” (Unstated 077). As a consequence, “stu-
dents in the ‘real’ world still find themselves linguistically limited when speaking
with native speakers (apart from the basics)” (Spanish 136).
There therefore needed to be provision to “allow a certain rehearse/practice
before recording” (French 069) or “a judicious mix of authentic, learned and unre-
hearsed” (French 132), because surely even “a near fluent speaker” might still
“rehearse phrases for certain situations” (Unstated 077).
One teacher suggested:
I do think they should be allowed to have a minor level of ‘rehearsal’ – practising together
(without a script), trying out different questions and responses, experimenting with the
conversation going in various directions on successive run-throughs, before they actually
do the assessment. (French 008)

This teacher went on to explain, “a lot of the success of this [assessment] stan-
dard will depend on how well we can impart conversation techniques and scaffold
the skills required to do it well.” It was necessary, in the words of another, to “be
more realistic about the fact that only our best students are going to be able to cope
‘blind’. The weaker students need time to work out what they are going to say and
don’t cope well with surprises” (French 091). Another teacher (French 042) sug-
gested “being allowed some leeway for using what could be available in real life.
Even the debate/frank discussion usually requires some prior knowledge.” This
teacher went on to argue, “it’s getting balance between genuine interaction and
whether prior knowledge could enhance the interaction.”
The perception of a need to balance prior preparation with spontaneity, alongside
comments regarding the suggested improvement of scaffolding/rehearsal, in fact
revealed that the entire issue was fraught with misunderstanding. As one teacher put
it, the perception of the assessment was that, on the one hand, “it is unrealistic to
expect students to be absolutely spontaneous,” and, on the other, “it is invalid to
judge them on something they have rehearsed” (French/German 141). There was
therefore a perceived need for “clear direction as to what ‘rehearsed’ and ‘not
rehearsed’ means” (French/German 141), because there was “currently a great deal
of confusion between the two” (Spanish 011). To facilitate this it was perhaps nec-
essary to “set guidelines … on how much preparation can occur (not rote learning
it/memorising but setting the students up to a level where they can do it)” (Japanese
088). This teacher went on to argue that there also needed to be “more information
about, for example, can students restart an interaction if they mucked it up early
on?” The teacher explained, “I write this because I have been given different advice
from teachers which has conflicted at times.”
Conflict and confusion were also apparent in a comment that had come through
from the disadvantages section of the survey but is apposite here. One Japanese
teacher (Japanese 113) asserted:
When it was first introduced, the idea was to capture students’ conversation in class so the
interaction was authentic – this idea changed. In the Best Practice Workshops we were told
students needed to practise first and that the level of language (that is, structures) mattered.
This has made it just like the old conversation, only having to do more.

With regard to spontaneous interaction, interact in practice was highlighting sig-
nificant problems for teachers. With regard to task type and task suitability, interact
in practice was also bringing important issues to the fore. In other words, the essen-
tial question that it seemed teachers were grappling with was “what is suitable as a
task?” (French 120).

6.3.3 Provide More Examples and More Flexible Options

Final considerations for improvements as recorded in the national survey focused
on having access to more examples of tasks, together with greater tolerance of
acceptable ‘task types’. There was a need for “more resources – sound files as
exemplars – tasks, properly set and moderated” (French 111). In this connection,
there was also a perceived need to encourage “sharing of tasks, strategies, systems
for collecting and managing evidence.” After all, “there must be some really good
ideas out there that I would love to hear about” (French 034). Bearing in mind that
“the tasks themselves are critical,” it was considered an advantage to encourage
“more sharing with other teachers about which tasks work” (German 117), or to
create “a set pool of tasks we can choose from” which would “make for better con-
sistency between results from different schools” (Japanese 102).
A further consideration was for tolerance of task type. Teachers were very mind-
ful of the guidelines that stipulated that there needed to be a range of task types that
would elicit different kinds of language, together with an embargo on ‘rehearsed
role-plays’. As one teacher explained, “some of the requirements (e.g. variety of
text types) need to be dropped. We are still trying to make students take on topics
that are too hard. If it was just three average classroom conversations that would be
easier and less contrived” (French 041).
Opinion was divided over the use of role-plays, suggesting that respondents held
different perspectives about just how open-ended and spontaneous a role-play could
be. It was suggested, for example, that the assessment should “allow one role-play
situation maybe for ‘achieved’” (Unstated 016). This was thought to be particularly
relevant for students at NCEA level 1. That is, “level 1 should encourage transac-
tional conversations (shopping, directions, restaurant)” (Japanese 084). The per-
ceived limitation for this teacher was, however, that “the standard requires exchange
of opinions etc.,” something that role-plays apparently did not lend themselves to.
Another teacher, by contrast, argued that role-plays were “a major part of everyday
life and do offer plenty of opportunities for personal opinion exchange.” On this
basis, it was surely appropriate to “accept transactional role-plays” at this level,
even though “at present they seem to be disapproved of” (French 148). Thus, those
who favoured transactional role-plays saw potential in them, particularly in terms of
eliciting a personal point of view.
A contrasting view (Spanish 089) brought out the limitations of the more tradi-
tional transactional role-play, in particular its limited focus on pre-learnt vocabulary
and challenges with regard to authentic replication of the target domain. This teacher
argued that “I think the most important thing is for students to be able to converse
in a natural way about themselves, their experiences, their opinions – about different
topics …” The goal was therefore that “I just want them to be able to converse with
Spanish speakers and assess their ability to do that.” In this context the teacher
asserted, “I don’t like them having to pretend to be a shopkeeper (for example)
because it just becomes a learnt situation. Even for the customer (in a shopping role-
play) it’s very unnatural because it’s hard to realistically simulate.” Thus, although,
in the view of this teacher, “I don’t think there should be a requirement for different
‘types’ of interaction,” role-play was ruled out on the basis of an argument that role-
plays promoted rehearsed and inauthentic interactions.

6.4 Disadvantages of Interact – Interviews

By way of follow-up and comparison to the national survey, interviewees were
asked what they considered to be the disadvantages of interact, and areas where
interact might be improved.

6.4.1 Impracticality

Not surprisingly, the time-consuming nature of interact was identified by several
interviewees as a distinct disadvantage. Carol, for example, explained that “the
complaint that I am hearing from teachers is that the teaching / learning time has
decreased dramatically this year due to the portfolios, the interaction and the writing
portfolio, the time that is taken up with that.” In Carol’s view, the logistics of man-
aging interact appeared to impact negatively in several ways:
Tracking the portfolio, making sure all the criteria have been met, and then at the end of the
year there is the whole process of listening to them all again, doing a holistic judgement and
then everyone has to listen to it all again and have it moderated. So there’s a massive time
investment.

As Janine put it:
The problem is time constraints, both with the student and for the teacher – it just takes such
a long time, and you don’t have that time, and so it gets to the end of the year and … they
are just trying to produce something to get a mark, which is really sad.

Jane concurred: “the internals should all be done by now, but they aren’t. Having
said that, you can’t do all the three pieces of interaction at the beginning of the year
because they probably won’t have the language level.” This meant that “organising
your year is really hard, time taken is really hard.” Jane shared an experience that
had been reported to her of another teacher during the first year of introducing inter-
act: “last year with his Year 11s there were a substantial number that simply didn’t
complete. They got maybe one, maybe two at best [completed], and [these were]
your able kids and your keen kids of course.” Thus, for the interact portfolio to be
successful, it “really brings on [the] key competency of ‘managing self’,” an issue
that several survey comments had suggested was a distinct challenge.
Technology also presented logistical challenges. Janine suggested, “they [the
students] need to be in control of it, not me.” She explained that this “really means
that each student has to have their own laptop, really, in a perfect world.” Then the
students, having taken ownership of the process, “could record it, they could upload
it, they could share it with me on Google Docs.” According to Monika, the ‘ideal’,
or “the easiest thing to do,” would be to allow students to use their own portable
electronic devices. Nevertheless, Monika acknowledged that some schools out-
lawed these. Also, as noted by Georgina, the assertion that all students have devices
for recording is illusory:
The reality is, no they don’t. They have phones and sometimes they work or they forget to
record. … when I said ‘right take out your phones’, there were choruses of ‘I don’t have a
lead,’ ‘I don’t know where it is,’ ‘my phone can’t take a recording,’ ‘it’s really bad quality’
… all kinds of things …

At least initially, therefore, there would probably need to be a financial outlay for
recording devices. Sally explained her process:
We have the flip cams which are like little cell phones and they just have a USB that flicks
out of them. So they [the students] record themselves and then they just put the USB video
camera, which is a USB, straight into their classroom computer and save it onto their own
USB drive. Then at the end of each lesson, if we are doing recordings, I’ll upload them onto
a secure media drive which students can’t access, so back it up on the school network, and
then I just do a file dump across into the student drive, so they can access it and then there’s
a secure backup, so no one else accidentally deletes somebody else’s work.

For Sally, therefore, the technological process, with a strong emphasis on student
ownership, seemed to work, albeit involving different steps, each of which had prac-
ticality implications. Monika’s view, by contrast, served as a reminder that, even
with school-owned equipment, there was the perceived need for the teacher to
remain in control of the processes, once more eliciting practicality considerations:
We have these Sony digital recorders and they have folders, so if we have one per language
and lots of teachers [whom] you need to train up – Year 12 uses Folder A and Folder B, and
then if absolutely all the time somebody puts something in Folder C and then says ‘I did it
and I don’t know where it is’ … you have to listen to about 150 recordings till we find it.

Additionally, as Françoise argued, “making [the students] responsible for it,”
although perceived by her as a potential benefit (see Chap. 5), was “really, really
hard.” She went on to explain:
I don’t know how to do it, I’ve tried, but they really refuse to take that in their hands. …
when it comes to the paperwork and keeping their drafts and keeping the videos as well,
they are really scared that they are going to do something wrong and maybe lose their
credits.

Françoise conceded that “it was a small number this year so I agreed to keep their
work.” However, “when we have bigger numbers I won’t have a choice.” As a con-
sequence, “I really want them to take responsibility for it.”
For Mary, an added technical complication was a systemic issue regarding final
submission of evidence that, in her view, actually made it harder for teachers:
I think the largest bugbear of mine, I guess … is managing the portfolios and storing them
and not being able to submit them electronically for moderation. I mean, not being able to
just say ‘this is My Portfolio page, here you go,’ but having to print them all off or burn
them all onto DVD or something.

Realistically, therefore, and at least initially, Monika argued that the portfolio
needed to be “managed by the teacher.” This was essentially because, for teachers
and students alike, this was “a new system, and you need to develop your own way
of managing that.”
Jane brought out an additional dimension of impracticality. In terms of the pub-
lished criteria, the entire evidence of interaction “only needs to be five minutes
across the three interactions.” However, in her experience, “each [individual] inter-
action ends up being five minutes just by virtue of what it is. The pauses, the laugh-
ter, the waiting eternally for someone to say something. It takes a long time.” Jane
concluded, “it is a lot of recording. A lot of evidence is being gathered.” Although
it would be appropriate for teachers to extract shorter excerpts for assessment pur-
poses, Jane went on to say, “I feel that three [interactions] is too much. I think the
time taken is extraordinary and it does impact on the other areas that you are trying
to develop.”

6.4.2 Negative Impact – Too Much Work for What It Is Worth

Among several interview participants was a recognition that practicality issues such
as workload and management could potentially have a negative impact on students.
Lead teachers Celia and Jane both noted that, in Celia’s words, “for the number of
credits that it is, it’s a lot of work.” Celia argued:
How much work do students have to do for five credits in their other subjects? And if you
are a smart kid, you will be looking at the workload that is happening in languages, and you
look at the workload in your other subjects, what would you choose?

Jane commented that, under the old converse system, students “got three credits
[for] one conversation, and now they have to do triple the amount of work with less,
less certainty.” She noted that “some students who I was with last week said, ‘well,
you know, this is a lot of extra work for two extra credits.’” Implicit in these per-
spectives was the danger of losing students who, in different circumstances, may
have persevered with a language.

6.4.3 Negative Impact – Interlocutor Variables

Several interviewees raised the issue of potential negative impact on students due to
interlocutor variables. A potential weakness was when a student with a higher level
of proficiency was paired with a weaker student. However, when Mary and Janine
commented on this, both saw disadvantages and advantages.
Mary acknowledged that a successful interaction “would depend on who the stu-
dents speak with.” She went on to explain, “sometimes you watch some of the inter-
acts and see a great student with a perhaps quite poor student and so then the great
student isn’t necessarily able to showcase all of the language that they’ve got.” In this
case, the stronger one is potentially limited and potentially penalised. On the other
hand, this can work to the advantage of the less proficient student: the stronger one
acts as a scaffold and “you can see them helping the weaker student, and that’s great.”
Janine, who noted that “I worry about … the student pairings,” went on to explain,
“in the conversation standard … the teacher actually was quite skilful in a way to
help the student to develop their answer.” By contrast, in interact, occasionally “the
dominant person can take all the things to say and the weaker person doesn’t.” In
this scenario, the weaker one would be potentially disadvantaged. There could also
be “some trouble with people whispering what to say and that kind of thing.” In
other words, the more proficient student might try to ‘bail out’ the weaker one inap-
propriately. Janine had “really tried hard” to help students to develop appropriate
strategies, that is, “to say ‘it doesn’t matter if you tell them what to say if you do it
in a conversational way’, like, we wouldn’t whisper to the other person what they
need to say back again.” However, “I haven’t got through to them about that yet.”
Yuko and Janine, both teachers of an Asian language, brought out a dilemma that
had not been raised by the survey. This was related to differences between European
and Asian languages, bearing in mind the generic nature of the achievement stan-
dard (the assessment blueprint) and the requirement to elicit comparable levels of
performance across languages.
Yuko argued that the Asian languages are “far different from European lan-
guages – it takes longer to be able to learn to that stage. So assessing them with the
same standards as European languages – I do feel a gap.” Comparing her language
(Japanese) with that of a colleague (French), she mused, “what they can do at
[NCEA] level 1 and what my students can do at level 1 is quite different in terms of
speaking.” When adding to this the further complication of writing in a different
script, “we do have to spend more time on that one. We can’t spend that solely for
interaction.” There was therefore an issue of equity.
For Janine the issue with regard to Asian versus European languages related to
comparable discourse expectations. With marking criteria that focused on justifica-
tion and expressing opinion, this “doesn’t work in an Asian language very well …
In Japanese it’s not really culturally correct to give your opinion particularly, and
certainly if the other person has the opposite [view], well, you just don’t have that
language.” She concluded, “I don’t think it’s harder to interact, but I think it’s more
difficult to meet the criteria at the upper levels.”

6.4.4 The Challenges of ‘Spontaneous and Unrehearsed’

As with the open-ended comments from the survey, a significant drawback to inter-
act in practice related to the matter of spontaneity. The issue, as Janine explained,
was this:
I know with our girls it’s very hard to get them to be spontaneous because they are nervous
and they want to script it, they want to write it down and … just keep doing it until it’s
perfect, and it’s a pity – but I can understand because at the end of the day you do have
criteria to meet for excellence.

For several interviewees the challenge of spontaneity did not lie with the assess-
ment itself, but with how teachers were interpreting (or misinterpreting) its require-
ments. There was also a sense that, whether intentionally or accidentally, the
assessment in practice appeared to be turning into something other than what was
originally intended. This perception mirrored one survey comment (Japanese 113),
recorded earlier, that suggested a shift in intention away from ‘authentic communi-
cations in class’ and towards ‘students needing to practise’.
According to Celia, who recognised the open-ended nature of the interactions,
“the whole idea of the standard is that you should in theory be able to chuck the kids
a recorder and they go record something.” Tension was generated by the NZQA
requirement “that you have to give them notification of the assessment and the kids
have to be regularly [informed] about what their expectations are.” Therefore, “you
give them notice of assessment and ‘yes we are talking about this’, brainstorm what
kinds of ideas.” There was thus a sense in which the requirement to inform students
about the assessment had the effect of diminishing genuine spontaneity.
Peter reiterated Celia’s view. In his perspective, the essence of the assessment,
what the assessment was originally designed to achieve, was being compromised in
practice. That is, the original idea was “you’re going to record students talking to
one another rather than to their teacher, off the cuff and unrehearsed, and like you
would actually do if you were in the country.” Peter went on to assert that, in prac-
tice, the assessment was “not turning into that.”
The central tension, for Peter, was that “students want to know, ‘am I being
assessed on this? Is this an important one?’ … I mean the first thing they ask is, ‘is
this an assessment, is it going to count for something?’” When they know that the
interaction is for assessment purposes “they prepare everything and they basically
learn scripted things, and then they add in a few minor phrases or hesitation phrases
or whatever to make it sound more authentic.”
Peter went on to explain the tension:
The whole idea is that they have to have fair conditions of assessment, and if you don’t tell
them this is an assessment then it’s not fair because you’re not giving them the warning and
you’re not indicating to them that they need to do their very, very best in this.

Peter concluded, “as teachers we want them to be perfect, and as students they
also want to be perfect.” The consequence, however, was the risk that “it does just
turn into three pieces exactly the same way we did it before, but you’ve got three
conversations instead of one … it turns into an assessment circus.”
A related downside, for Peter, was that “you can tell if something is read or if
something is super, super over-prepared, it doesn’t sound natural anymore and then
that impacts on their mark because it’s not spontaneous, it’s clearly not.” The chal-
lenge of not meeting the spontaneity requirement was therefore that students might
risk under-achieving, or not actually achieving anything, in terms of the require-
ments of the assessment. That is, an interaction that was clearly staged was not
going to meet the criteria.
Two lead teachers, Celia and Jane, were able to draw on their experiences with
supporting the national introduction of interact to provide reflection on the tension
for teachers. From Celia’s perspective, many teachers were, with regard to sponta-
neity, “not ready for it.” However, if students “come in and they have pre-learnt
stuff, then that’s a role-play and they get penalised.” Jane explained the dilemma,
based on the samples of interact she had encountered:
You have these students that have done beautiful work, it’s original, they have created it
themselves, but it’s pre-scripted work. They’ve then recorded their pre-scripted work and
they can’t get the credits for it, and that doesn’t seem fair. And that hinges down to the
spontaneity issue because these students have spent God knows how many hours
[preparing].

Jane went on to describe two instances where, it appeared, the students were still
operating on the basis of the old converse assessment. One group of students had
produced “really great, fantastic French, not perfect, but original.” Nevertheless, the
scenario was pre-rehearsed, and “there was no spontaneity, not a single instance.”
As a consequence, “they all had to get ‘not achieved’ for this fantastic work, and
that sat badly with me, but you can’t change what is in the standard.”
In the second example relayed by Jane, the students had “produced these really
scripted conversations.” Jane explained:
It ticked all the boxes of the old criteria, you could see exactly what they were doing in their
head. They just divided the role of teacher and student amongst themselves. They had
inserted their subjunctive where appropriate. … But there was no way it could be consid-
ered an interaction under the new rules.

As with the survey data, the interviews revealed concerns around ‘spontaneous
and unrehearsed’ that would require attention. I take up some suggestions for
improvement, from the perspective of several interviewed teachers, below.

6.5 Suggestions for Improvement – Interviews

The interviews illustrated aspects of suggested improvements that, although not directly mirroring those from the open-ended survey, served to complement them.

6.5.1 Clarifying ‘Spontaneous and Unrehearsed’

Seen in the light of the arguments I have presented above regarding spontaneity and
lack of rehearsal, several interviewees reiterated the concern of a number of survey
respondents that there was a need for clarity around expectations. Sally provided an
interesting perspective that, if taken on board, would, in her view, address both mis-
understandings about the requirements and expressed concerns about workload.
Indeed, for Sally, the disadvantage of workload was something that “I don’t under-
stand.” She went on to reflect:
If you were running it the way you used to run a conversation in that you were prepping the
kids to the nth degree and telling them what they needed to cram in, this vocab and that, I
could imagine it would be [extra work], because you can’t take a week out of learning every
month to do that sort of thing.

However:
If you just treat it like a natural sample of the kids’ conversations based on the topics we
have just been doing, it’s no extra work at all, it’s really not. For example, you teach direc-
tions. If you are lost in a town – how do you ask for directions? How do you understand
where to go? You’ve taught that; they have practised in class; they have been playing with
each other in class about giving directions around the school, or you maybe had a little trip
somewhere and you’ve had to do that. And then what they do is record the stuff that they’ve
just learnt. It’s no extra work.

In Sally’s understanding, therefore, the issue was not a polarised ‘completely staged’ versus ‘completely off-the-cuff’, or ‘preparation and planning’ versus
‘spontaneous and unrehearsed’, as if the former were the situation with converse,
and the latter were the expectation of interact. After all, the target language “is
not their first language, so it’s not going to be completely spontaneous. They are
practising or repeating back what you have just taught them. But they are not par-
roting you.”
Sally went on to explain:
There are 20 things that they could say and they might choose five of those things to talk
about depending on their own personal experiences or somebody talking about what you
did on the weekend, they all do different things. … And they pick and choose. Some people
go to the movies, some people go to the beach, some people go by car and some people go
by train. They are not going to just regurgitate the same thing as each other. They are
individuals.

A necessary improvement to interact was therefore to provide greater clarity to teachers around how spontaneity may be worked out in the context of on-going
work. In other words, there was a need for teachers to recognise Sally’s perspective
that assessed interactions could be ‘treated like a natural sample of conversation’
that arises in the process of teaching and learning, with all that entails for scaffold-
ing, feedback and support. This would also, in Sally’s view, address perceptions of
high workload demands.

Nevertheless, Sally admitted, “when they [students] role play with a friend they
do tend to write out what they are going to say and then write in things to make
themselves sound spontaneous.” Although this might be “funny because they bring
in their drama skills and their acting skills and things like that, so they make them-
selves look like they are just having a conversation,” the danger was the semblance
of natural interaction, not the interaction itself. As Jane put it, “in the middle of a
fairly normal plodding conversation, you will get something like tu blagues là for
‘you are kidding’.” It seemed that students were “inserting these terrible false state-
ments just to be sure that they have responded with surprise, you know.”
The issue for Jane, as with Sally, was with lack of understanding and confusion
around the meaning of spontaneity in the context of interact. Jane understood that
‘spontaneous’ was “an awful word … a hard word when you are talking about an
assessment, an assessment for a qualification.” Nevertheless, “students are coming
into that assessment knowing they are having an interaction on a certain theme, and
anyone who is assessing for credits is going to do some kind of preparation.”
Although it would seem that this “detracts from the spontaneity immediately,” this
did not, however, mean that ‘spontaneity’ was a meaningless or valueless concept
for interact, and that prior preparation was anathema.
Jane recognised the source of potential confusion. That is, “if you were to ask me
what ‘spontaneous’ is,” it would be speaking freely and naturally without prior
rehearsal. She asserted, however, “that’s not what it is in regard to the standard, I
don’t think” (my emphasis). Jane went on to explain that, in terms of the require-
ments of interact, “I think … [that] spontaneity becomes more a sense of being open
to the conversation/interaction going somewhere else and being able to handle that
in a spontaneous manner” (my emphasis). She used for illustrative purposes her
participation in the interview for this study: “a spontaneous conversation – that’s
what we are having now – but in my mind on the way here I was thinking about the
things you might ask me. Does that take away from the spontaneity?” As Peter put
it, “you always do a sort of a mock conversation in your head when you’re in another
country: what could happen? What would somebody say to me?” That, in a sense,
was prior preparation, “and the skill is being able to say ‘look I didn’t understand
that, could you explain it to me in another way?’ in the target language.”
Interpreting spontaneity in strategic ways provided a different dimension to
understanding that would not necessarily have to preclude elements of prior prepa-
ration. Nevertheless, as Carol argued, teaching the required skills for successful
spontaneous interactions – “management and inspiring the students and structuring
and scaffolding” – was “very, very difficult.” In other words, “you can get students
quite easily to learn a little script or, I don’t know, be able to practise pronunciation.”
However “to get [to] the higher level skills,” that is, “to be able to think flexibly, to
listen to other people, to respond, to be resilient enough to carry on if they make a
mistake, to take the risk of communicating verbally,” was more challenging. This
was because it “can be quite emotional for them, whereas doing writing doesn’t
seem to be emotionally challenging. They don’t feel as vulnerable when they write.”

6.5.2 The Task is Everything

In light of the real challenges to interact, such as encouraging genuineness and spontaneity, and accounting for interlocutor variables, a fundamental conclusion
expressed by lead teacher Jane, and something about which she was “really clear,”
was that “the task is everything.” That is, “if you don’t set up a good task then you
are never going to get the evidence.” In Jane’s view, the task “has to be one where
the students can take control.” In other words, interaction “doesn’t work” with trans-
actional role-plays such as “at the train station.” Rather, “they have to be really open
topics.” For example, “the better ones are ‘how do you spend your free time?’ and
see where that takes you because that can take you onto overseas trips, it can take
you onto weekends away.”
If the task is critical, as Jane asserted (and as German 117 [see earlier this chap-
ter] noted), a challenge for teachers expressed by Georgina, especially for teachers
working alone, was “the actual working out what the tasks are.” The issue was “try-
ing to figure out, okay what is going to make them speak? How do I get them to get
the best Spanish out of them for this particular topic or whatever?” On this basis, a
suggestion for improvement noted by Georgina was “I think it would be great for us
to share what we are doing.” This would not only generate ideas for assessment
tasks but would also provide guidance, reassurance and support for teachers.
Georgina reflected on the recent school trip to Chile that she had undertaken with
a colleague from another school. In the course of their conversation around interact,
Georgina explained, “‘oh look, I’ve done all this’.” The other teacher “hadn’t quite
got there yet. And I went ‘look, this is what I do for feedforward’ and she said ‘what
a good idea.’” Georgina concluded that, in her view, there was value in “that kind of
sharing with other teachers, because I don’t know what the other Spanish teachers
have done. Somebody might have a wonderful time-saving plan or idea that I have
no idea about.” She concluded, “I think that kind of thing would be really useful.”
In order to elicit feedback on assessment tasks, Celia drew on her own students
as a resource. She explained that, last year, “I was really worried that the tasks
wouldn’t work,” even though she had spent a long time creating them. Last year she
had produced six interactions but was concerned about the sustainability of that:
So at the end of the year I sat down with the kids, I took a group of kids aside and went
‘right, which ones did you like and why? What ones did you select and why?’ and then went
‘okay, okay so you like this one, you like the features of that one,’ and so then I jigged it so
that the interaction came back to four. I combined some of the features they liked in some,
and got rid of some that didn’t work as well.

Janine argued that there was a need to consider carefully what tasks were appro-
priate at the lower levels of proficiency. Reflecting several perspectives expressed in
the survey, this consideration was on the basis that “a part of me really thinks that
we shouldn’t have interact at level 1. Their language just isn’t barely up to it yet.
It’s not ready for it.” She suggested, “I think maybe at level 1 we could be looking
[at] something different … maybe more of an interview or something like that. Not
expecting them to be able to do this, you know, go off the topic.” At the lower levels
of proficiency, therefore, the task was crucial to elicit the required evidence, and the
assessment perhaps required some re-thinking at these levels. Janine went on to
argue that, by the time her students reached NCEA level 3, “I’ve always found they
really enjoy doing the conversation with me and they can say lots and they can
express themselves.” It was therefore perhaps at level 3 that the interact assessment
would become the most useful.

6.6 Conclusion

The findings of the open-ended survey comments and subsequent interviews revealed several problems for interact in practice. Foremost of these was the issue of impracticality, which appeared to have an impact in a range of ways, from task conceptualisation to task execution to task assessment. There was also a need for greater clarity around what would make an assessment task useful and fit for purpose. Additionally, negative impact ensued from a perception that, as currently operationalised, interact was unworkable and made unreasonable demands on students.
In terms of unreasonableness, ‘spontaneous and unrehearsed’, whilst recognised
as a positive aspect of interact (see Chap. 5), also appeared in practice to give rise
to virtually insurmountable challenges. These challenges appeared to be exacer-
bated by teachers transferring their understandings of the requirements of converse
to the new assessment format for which these requirements were no longer appro-
priate. The challenges were also magnified by students for whom the high-stakes
nature of the assessment meant that inevitably they wanted to practise and prepare.
The perceived disadvantage of ‘spontaneous and unrehearsed’, and the suggestion
for improvement of more scaffolding and rehearsal, therefore appeared to reveal
substantial conceptual and operational challenges that would require resolution.
It must be acknowledged that the perspectives I have presented in Chaps. 5 and
6 represent teachers’ understandings of interact at an early stage in its implementa-
tion. Data were collected two years into the assessment reform process when, as
Peter made clear, “I think we’re still at that really, really early stage of portfolios” –
teachers were finding their feet, and “it will take some time for it to embed.” Also,
the perspectives presented represent teachers’ views only for the two lower levels of
examination (NCEA levels 1 and 2). As several comments exemplified, and as
Janine made explicit, perhaps interact may be more appropriate for NCEA level 3,
when students’ proficiency will have reached a higher level.
Furthermore, and taking the evidence from Chaps. 5 and 6 into account, three
key issues of concern stand out. First, if the task is crucial to interactional success,
what makes a successful task for purposes of interaction? Second, if ‘spontaneous/
unrehearsed’ (or a focus on fluency) is a key criterion for success, how is this to be
realised in ways that make sense to stakeholders? Third, and following on from
concerns about a focus on fluency, if grammatical accuracy is no longer a central
criterion for success, how is this to be understood in a context where interact is
perceived as high-stakes such that, in Peter’s words, both teachers and students may
want the interactions to be ‘perfect’? In the next chapter I present teachers’ perspec-
tives about how interact was seen to be working three years into the reform, and at
the highest NCEA level 3, with particular focus on these three issues.

References

Council of Europe. (2001). Common European framework of reference for languages. Cambridge,
England: Cambridge University Press.
University of Cambridge. (2014). IGCSE syllabus for Dutch, French, German and Spanish.
Cambridge, England: University of Cambridge International Examinations.
Chapter 7
Interact and Higher Proficiency Students:
Addressing the Challenges

7.1 Introduction

In Chaps. 5 and 6 I presented data from the nationwide survey and the interviews
that were completed in 2012 as Stage I of this two-stage study. Findings were pre-
sented in terms of teachers’ perceptions of the relative usefulness of the two assess-
ment types, converse and interact, interpreted according to Bachman and Palmer’s
(1996) six qualities.
With regard to interact in practice in comparison with converse, teachers liked
the move away from the requirement to have to account for particular grammar
structures at different levels, believing that this added to the authenticity of the inter-
action. They were uncertain, however, how to interpret ‘spontaneous and unre-
hearsed’. Indeed, some teachers held the view that, at NCEA levels 1 and 2, the
spontaneity demand of interact was ‘ridiculous’ and ‘unrealistic’. Teachers also
wanted more guidance about, and examples of, appropriate assessment tasks.
As I reported at the end of Chap. 6, one interviewed teacher (Janine) argued that
perhaps interact was not ideal for students working at the lower levels of proficiency
(NCEA levels 1 and 2) because, in her thinking, they had not yet achieved a suffi-
cient level of spoken communicative proficiency to exploit its expectations. In her
view, interact was perhaps more appropriate at the highest school level (NCEA
level 3), because, in her experience, at that highest level students really enjoyed the
conversation with her and were able to draw on a broader repertoire of vocabulary
and grammar. Level of proficiency was therefore potentially a factor in contributing
to perceptions about the successful implementation of interact.
Taking Janine’s perspective into account alongside survey comments, in particu-
lar about spontaneity, it may be that the full potential of interact is brought into play
(and will be most in evidence) at the highest level of NCEA level 3. That is, at this
level students should have more independent command of the FL, commensurate
with Common European Framework levels B1 and B2. Theoretically, FL users
operating at these ‘independent’ levels should be able to demonstrate interactional proficiency in terms of “entering unprepared into conversation” (CEFR level B1) and “interacting with a degree of fluency and spontaneity” (CEFR level B2). That
is, theoretically, students at these levels should be able to engage in (or at least
attempt to engage in) spontaneous and unrehearsed interactions. It is therefore per-
haps not surprising that, as I stated in Chap. 4, spontaneity becomes an explicit
criterion at NCEA level 3, whereas the requirement to be spontaneous is more
implicit at levels 1 and 2.
It was also evident from Stage I of the study that, at levels 1 and 2, considerable
emphasis appeared to be placed on assessment tasks that promoted situational
authenticity: replications, in the assessment context, of the kinds of straightforward
transactional interactions that students might potentially have in TLU domains
beyond the classroom. In these kinds of interactions it is possible to see how notions
such as ‘spontaneous and unrehearsed’ and ‘focus on fluency’ would be challenging
to implement, exacerbating teachers’ concerns about spontaneity. At level 3 there is
a requirement to interact proficiently around wider societal issues, to state, negoti-
ate, justify and explore one’s own and others’ opinions and worldviews. The evi-
dence of interactional competence which interact aims to tap into will arguably be
more apparent at NCEA level 3 than at NCEA levels 1 and 2. This has implications
for the kinds of tasks that teachers should be drawing on at this level.
There is also an expectation that the language used at NCEA level 3 will be more
sophisticated than the language that might be needed to complete a simple transac-
tion, requiring a higher level of command of the grammar of the target language and
consequently a renewed pressure to account for particular grammatical structures in
the interaction. This has implications, both for the task and for a focus on fluency,
especially in view of the particularly high-stakes nature of level 3.
NCEA level 3 becomes an interesting test case of how interact in theory might
be operationalised in practice, in particular with regard to three domains:
1. Task types
2. Spontaneous and unrehearsed (a focus on fluency)
3. Accommodating grammar (the place of accuracy).
Also, in light of the evidence emerging from Stage I, it is important to consider
these three domains more broadly with regard to interact at all levels.
Stage II of the study (2013) was designed both specifically to investigate NCEA
level 3 and to explore the three domains that had emerged as important issues from
Stage I. That is, in light of the advantages and disadvantages of interact and poten-
tial improvements to interact emerging from Stage I of the study, an issue of pri-
mary interest was how things differed for teachers at this highest level of proficiency,
level 3. Also of interest was greater exploration of the three key domains of task
type, spontaneous and unrehearsed, and a de-emphasis on grammatical accuracy as
a key criterion for success.
In Stage II the students, as primary stakeholders, also became a focus of
interest. Stage II therefore drew on two data sets: interviews with teachers
(n = 13) and a survey of students who were the first to take interact at level 3 (n = 119).
A small-scale survey of those students who were the last to take converse at level
3 in 2012 (n = 30) was drawn on for comparative purposes.
In this chapter I explore aspects of the teacher interviews. The 13 teachers who
agreed to be interviewed for Stage II included the three teachers from Stage I who
were or had been involved in the trialling and/or moderation of interact at the time
of their initial interviews, and who continued to act as ‘lead teachers’ in these
respects. The remaining 10 participants had not taken part in Stage I. Table 7.1
records the pseudonym for each interviewed teacher, the principal language taught
and the type of school.

Table 7.1 Interview participants (Stage II)

Pseudonym   Principal language taught   Type of school (a)
Jane        French (b)                  Co-educational state school
Sharon      French                      Co-educational integrated school
Margaret    French                      Co-educational state school
Marion      French                      Girls’ state school
Naomi       French                      Co-educational state school
James       French                      Co-educational state school
Monika      German (b)                  Boys’ state school
Anna        German                      Co-educational state school
Celia       Japanese (b)                Co-educational state school
Alison      Japanese                    Co-educational state school
Suzanne     Japanese                    Co-educational integrated school
Sara        Spanish                     Co-educational state school
Linda       Spanish                     Co-educational state school

Notes
(a) A state school is a government-funded school; an integrated school was once a private (often church-founded) school, but is now part of the state system whilst retaining its ‘special character’
(b) At the time of the interviews these teachers were currently or had previously been involved in the trialling and/or moderation of interact (they were also interviewed for Stage I of the project)

7.2 Examples of Task Types

In Chap. 6 I noted Jane’s assertions, in her initial interview, that “the task is everything,” such that, when wishing to assess interactional proficiency, if teachers do not set up an appropriate task they will not elicit appropriate evidence of interaction. (As Suzanne put it, “it’s no good learning a language and being able to make speeches. We can all do that. You’ve got to be able to interact.”) Jane revisited her argument in her second interview. She commented that, in her experience, the archetypal “Year 11 restaurant conversation” can often be, in terms of meeting the requirements of interaction, “a complete write off, you know, one person being the waiter, one person ordering.” Jane went on to explain that these more traditional kinds of transactional role-play were “just automatically going to close it down,”
and the interaction often became “kind of dead evidence.” Problematic here was that
“the only useable evidence meeting the standard really” was found in examples
where the student was “actually authentically flummoxed enough not to blurt out
the rote learnt passages.” A transactional interaction arguably “ticks all the boxes
from a language point of view … but it’s not useful from the spontaneity of lan-
guage feature point of view.” The “best evidence” comes when “spontaneity, authen-
ticity, questioning, pausing, all the rest of it” are in evidence.
At all levels, therefore, and particularly at level 3, the assessment task, in Jane’s
view, “needs to be that quite open context of ‘what are you doing when you leave
school next year?’ ‘what are you doing for these holidays?’ ‘what’s your opinion
on part-time work?’ ‘how engaged in the environment are you?’” These last two
examples in particular lend themselves to the ‘social issues’ demands of interact at
level 3.
In what follows I consider several examples of level 3 tasks as presented by the
teachers. In light of Jane’s argument about ‘open contexts’, there was evidence from
the interviews that several teachers had given considerable thought to the kinds of
interactional tasks that would generate language that fulfilled the requirements of
NCEA level 3 (CEFR levels B1/B2). In each case, it was apparent that these tasks
were the culmination of a range of scaffolding strategies, a fact that has implications for teachers’ understandings of ‘spontaneous and unrehearsed’.

7.2.1 Talking About the Environment

Several interviewees drew on the stock-in-trade of the environment. James and Monika provided complementary examples of how they attempted to make the
interactions student-focused and how they scaffolded students towards independent
interactions on something that, in James’ words, was “not a very exciting topic
sometimes.”
James explained how he set his students up to interact successfully, in French, on
an environmentally-focused topic. Having “done a unit of work on the environ-
ment,” the whole class “talked about it a bit, and I gave the students suggestions on
two or three questions to get them into discussing things.” For example, “people say
New Zealand is a very green country, what do you think about it?”
It was evident that James’ procedure included allowing the students to have a
considerable amount of prior preparation with the partner with whom each would be
interacting. That is, James gave the students “a reasonable amount of time in class
… half a period or quarter of an hour to 20 min on a regular basis to work on it.”
Alternatively, “if they wanted to go somewhere else in the school to work on it with
the person they were going to speak to, that was fine.” James’ goal was to give his
students “quite a lot of leeway … I trusted them to do the work.” In his view the goal
was achieved in that “they worked on it really well. I found them quite
independent.”

In terms of outcome, James asserted:


When it came time to mark it I was really impressed… I found everyone really did a great
job, put in the effort. The range of grades was from ‘achieved’ to ‘excellence’, but I was
expecting some ‘not achieved’, but it’s really great when teachers are wrong and students
surprise you, and I was surprised.

James noted the hard work his students had put in to the interaction, “and I
thought, ‘wow, I really can trust the students to buy into it. … I can trust them to do
the work, they’re really putting in the effort’.”
Monika presented a similar environmentally-focused interactive task for students of
German. The task was called ‘Why should I care?’ The following context was provided:
New Zealand is a land that produces part of its own energy and has the benefit of being an
island far away from pollutants, sparsely populated and windy. Why should you even care
about environmental challenges, how do they affect you and your generation? Discuss with
a partner aspects of environmental threats and opportunities in the context of New Zealand
and globally. You could consider the following: Explain the challenge or opportunity to the
environment, why you consider it significant, discussing the impact of inaction, the historic
reasons for the situation, negotiating possible solutions.

Working towards the interactions, Monika, similarly to James, encouraged students’ autonomy:
They had preparation sessions as in they decided on a topic and then they created talking
points around a topic, you know, they said ‘ok, we could talk about this and this’ and
‘what’s the language we can use for that?’

Monika reflected that her students “really worked out a strategy on how to …
actually come up to the expected [language] level.” The societal dimension was the
area that her students “were collaborating most on, I think.” Their collaborations
were “to figure out ‘ok, you know, if we talk about environment, are we going to
start from something that we’ve read? Or are we going to start from a practice that
we do in our house and then move on?’”
With regard to her own support, Monika “was floating around the room and I was
commenting on things.” She explained, “I think most feedback was concerning con-
tent … So when I overheard them practising their interactions and they stopped too
early, I said ‘you know … I would now ask ‘why?’ So you need to actually work on
this aspect [of justifying a perspective]’.” Occasionally “they came with questions,
and I said ‘yes, you can say that’ or ‘you should maybe think about this and that’.”
However, overall, they “really were incredibly independent,” and “literally managed
without [anything] other than the guidance that they had from the task sheet and
their experience of having done [interact] for two years before.”
Monika concluded:
What I saw at the outcome was true collaborative effort between students to tease out details
and depth … that goes even beyond what they normally do in English, you know. They really
had an interest in supporting each other so that they could … show off what they can say.

Monika reflected that, as a consequence of the interactions, her students “felt that
they really could communicate with people, being able to understand and respond
and support and all these things more than they were thinking [was] possible.”

Her students acknowledged, nevertheless, that this was “an artificial situation.”
Monika went on to articulate what she meant by this:
The artificial situation is, 17 year-old boys, when they are together with their mates, don’t
talk about the environment … and they do not really want to know deep underlying reasons
and details of why people do something and what would be a consequence of that kind of
action.

In other words, the kinds of topics that NCEA level 3 appeared to expect would
not necessarily promote the most effective (i.e., interactionally authentic) interac-
tion with the task. In Monika’s case, there was an attempt to promote interactional
authenticity:
I made it quite clear to the students that at that level of NCEA, in each and every subject,
the societal aspect has to come in, and that is one of the thinking skill applications of that
level, and they will just have to swallow it and do it to show that they can actually think at
[that] level.

It did appear that both Monika and James had managed to energise the topic of
the environment sufficiently to promote what they perceived as positive and appro-
priate interaction with the task. Nevertheless, their experiences highlight a need, in
Naomi’s words, to “be more mindful that even though these are 17 year-old sophis-
ticates they have very little experience in life.” When designing appropriate tasks, it
was arguably necessary to “keep it simpler, I think.” In Naomi’s view, there was a
case for “not making the tasks unrealistic for the 17 year-old students in their fifth
year of French.” This, however, would make it “quite difficult,” not only “because
of the topics that we cover” but also because of the expectation of level 3 that “this
is ‘world matters’, this is ‘outside of us’,” and “to find roles that are outside of us
that can be spontaneous and appropriate to the language that they know is actually
quite difficult.”
Part of Naomi’s answer to the dilemma was to take a more novel approach to the
environment. She created a task that arguably related environmental issues more
closely to the students’ lived experiences in a large city heavily reliant on private
transport (cars) and where additionally the main source of public transportation was
buses. Naomi’s students were required, in groups of three, to design a city-wide
metro system “because we looked at different sources of energy and we looked at
the problem of congestion and pollution coming from cars and that sort of thing.”
For the interaction students had to get together with another group to share what
they had planned, asking questions such as ‘ok, what sort of energy did you use?’
and ‘why did you choose this?’
Even with such developments to the topic, however, the environment would not
necessarily promote the most positive instances of interaction, or, as James put it,
“the environment, oh man, it’s not one of the best ones for interaction.” If, therefore,
the task is crucial to the success of the interaction, at NCEA level 3 it would seem
that tasks were required that were sufficiently outwardly and societally focused and,
at the same time, likely to be sufficiently relevant as to promote the required level of
positive interaction. Below I provide several examples of tasks designed to recon-
cile social focus with perceived relevance.

7.2.2 Mariage Pour Tous

In order to enhance perceived relevance and promote positive interaction, Naomi gave a level of ownership of the task over to the students by asking them what they
wanted to talk about. She provided one example of the outcome of this dialogue.
Noting that “I wanted to do art,” but that the students responded by saying “no, let’s
do gay marriage, miss,” Naomi said to them “all right, I’ll see what I can find.” As
a consequence, “our first conversation was mariage pour tous [marriage for all], so,
the whole gay marriage thing.” At the time of this interaction, same sex marriage
and its legalisation had become a topical issue in both France and New Zealand and
had received considerable media attention in both countries. The interaction there-
fore lent itself to the societal dimension, comparison across two different contexts,
and the opportunity to explore and justify opposing views.
Naomi explained, “for this task I let them choose their own groups because I
didn’t want anyone to feel as if they were being judged for their opinion. So they
tended to be in groups where they shared an opinion.” Even in this grouping con-
figuration, however, contrary views could be explored. She went on to explain, “one
of the boys, he decided he was going to take the ‘no, being gay is disgusting’
approach just so he could improve the conversation, and there was one girl in
another group, she was ‘je suis désolée, je ne suis pas d’accord’ [I’m sorry, I don’t
agree], she was absolutely adamant they were wrong.” Naomi concluded, “they
were the best conversations actually.”
To help the students to prepare, students were given a collection of fifteen “con-
troversial statements” about the issue, in French, to act as initial prompts to the
discussion. The statements included some arguments that would support gay mar-
riage, and others that were, in Naomi’s words, “quite homophobic.” The students
received the prompt cards as ideas to get them started on a discussion and then
engaged in a discussion in pairs or threes, drawing on whichever statements they
wanted.
Despite what was reported as a generally successful interactional opportunity,
Naomi acknowledged that a limitation of the task in practice, and one that may have
hindered positive interaction for some students, was that “the ones that are anti
[were] very quiet, you know.” Naomi reflected, “what are you going to do? ‘I think
that it is wrong and you are different and you shouldn’t have the same rights’, you’re
not going to come out with that.”

7.2.3 Cat Café

Alison, working in the context of Japanese, described an arguably less thematically controversial but nonetheless interactionally provocative task – ‘Does New Zealand
need a cat café?’ Alison explained that “Japanese people can’t have pets in their
apartments.” As a consequence, a cat café is “a café [that] might keep, like, fifteen
different cats.” Customers “can pay a fee and come in and sit and stroke the cats.”
Visitors “can buy drinks, but not food, so you order your coffee … and the cat café
is decked out in different styles, so it is comfortable for the cats and it’s comfortable
for the person coming in to pet the cat.” Alison concluded, “it’s very weird, but it’s
typical Japanese style.” The task would lend itself to a range of interactions that
could explore different perspectives and enable reflection on cultural differences.
For Alison one interaction in particular stood out to her. She observed:
One of my students did a wonderful piece – she was skyping a friend of hers in Japan, and
so she submitted this skyped conversation asking ‘what was a cat café like in Japan?’ and
‘did this person ever go?’ [and] ‘who used it?’ It was a gorgeous conversation. And it was
exactly in the spirit of the thing, because it was skyped, it was a real thing … I was very
proud of that particular kid’s work, it was great.

Indeed, for Alison this interaction “was completely spontaneous. That was
lovely. That was the most authentic conversation.”
Alison went on to describe the novel way in which she scaffolded her students
into the tasks. She used what she described as a ‘flipped classroom’ model. This
approach reversed the traditional teacher-led teaching model, and students gained
their first exposure to new material outside the classroom (that is, by working on it
at home). Subsequent time in class was used to build on this preparatory work, with
the teacher operating as a facilitator; the work was therefore shifted “from passive
to active learning” and towards a “focus on the higher order thinking skills such as
analysis, synthesis and evaluation” (The University of Queensland, 2012, ¶ 1).
Having “flipped the classroom so that they did the preparation at home,” Alison
explained the process:
I had an entry and exit ticket … [students] had to prove that they had done the work at home
for whatever it was that we were going to do that day by answering a question. Then, every-
thing we did every day was around speaking. So I did ‘speed-dating’, mixed up different
pairs, group work, all kinds of different things to build their conversation confidence.

Alison explained how the ‘entry and exit ticket’ worked. That is, “only those who
had truly done the preparation could participate in the task.” Those who had not
were excluded and “had to do the preparation in class time.” Alison concluded, “so
that’s how I managed it. I didn’t want for half the class to be prepared and the other
half not. Then you’re held up and you are wasting time so only those who were actu-
ally prepared could participate.”
Alison reflected that the flipped classroom experience “was fantastic and we all
loved it. We all got better at speaking off-the-cuff and not having notes.” On the
basis of this positive experience Alison decided:
Next year that will be my entire classroom practice. So they will do preparation at home,
whatever it is that they will be doing the next day, so virtually every day we will be doing
speaking, so I am going to say to them at the beginning of the year, that means everything
we do, it could be potential evidence you’re going to gather.

In terms of gathering the evidence, “what we’ve experimented with this year [is]
with the students having their own phones … so when we’re doing speed dating,
for instance, they just go from conversation to conversation with their phones.”

Alison went on to suggest how she might use this in a way that once more placed
ownership on the students:
I’m going to get the students to select the best one and just send that one in to me. Instead
of me gathering everything and deciding which is the best one, I’m going to let them
choose. And that means every day I could be gathering evidence, or they could be gathering
evidence, and then they can just send in whichever ones that they think are really good.

7.2.4 Getting Students to Take the Lead

Anna, reflecting Naomi’s and Alison’s stance of allowing students to take the lead,
outlined four different tasks designed to enhance perceived relevance and positive
interaction with the task and to facilitate student interaction in German. She explained,
“we had a massive discussion at the end of last year, the Year 12s at that point, and
basically they came up with the topic areas they were interested in and then I went
and created tasks around those.” Anna concluded, “that’s how we’ve been working
since around 2010. I haven’t set topics for them, they set them for themselves.”
The first interaction arose from individual research projects in which the students
(there were nine in the class, including three exchange students from Germany) had
each taken an era of German history and had created a web-page on it, in German.
Pairs of students would look at each other’s pages, and comment on and discuss
them. A second task focused on the role of film and TV in learning German: what
students found useful or not useful, the place for dubbing or whether it was better to
watch something that was originally made in the language. A third task was about
learning German. On the basis of “you’ve been learning German for 5 years now,”
the issue in focus was ‘so what?’ – “we did that as a large table discussion which is
quite an interesting thing, with everyone asking questions and contributing and so
on.” Anna concluded that this interaction was “not the easiest one to try and assess
afterwards, but still a really interesting conversation to do with them.” A final task
was about identity and what it was like living in New Zealand:
Were they from here or not? If they are from elsewhere, how do they find living in New
Zealand? What is interaction with New Zealanders like? What is done to integrate people
into New Zealand society? In comparison with Germany, if they had something to
compare.

Anna went on to explain, “a lot of them had been on exchange in Germany so they would know what it is like in Germany.” This interaction “was one that they did
with me and with the exchange students we had in class, to get two different per-
spectives.” That is, “I was playing the role of a South African who ended up in
Germany … and of course the exchange students played themselves.”
Anna explained the process leading to the interactions: “we have our learning
organised into TBLAs [task-based language assessments], so I set the task right at
the beginning.” Students were then working towards an assessment opportunity
such as an interaction:
We basically develop the vocab and structures needed through various things, quite a bit of
reading, of course, listening to texts, brainstorming, doing smaller texts that build up
towards it, little interactions that again build up towards it, playing games, all sorts of activi-
ties, culminating in an assessment opportunity – but not necessarily culminating in a day
and date and having to do it right there and then.

In other words, key to the successful interactions were students “recording it when they are ready, going off and recording it with somebody and coming back
and perhaps recording it with somebody else as well.” Anna noted, “I usually put a
period aside for that purpose, but by the time we get to the end of the year and feel
that recording wasn’t great they can always do that recording in their own time.”
Reflecting on her students’ responses to the interactions, Anna noted positive
impact. That is:
I think they generally quite enjoyed them. I think they found them quite relaxed … Well,
that’s definitely what they told me. They found it relaxed. They could just talk to each other
when they wanted to do that. So I think they liked that.

In summary, the examples of tasks I have presented above indicate a range of different operationalisations in different contexts. They also suggest that, rather
than being completely spontaneous and unrehearsed, successful interactions at level
3 were embedded within, and arose from, quite structured scaffolding and prepara-
tory phases. Furthermore, no teachers appeared to regard this as being an invalid
interpretation of the intent of the standard. This stance has implications for inter-
preting ‘spontaneous and unrehearsed’ at all levels of interaction.

7.3 Problems Emerging

7.3.1 Spontaneous and Unrehearsed

Drawing on her experiences as a lead teacher for interact, Monika noted that,
applied broadly to interact assessment tasks, ‘rehearsed’ in the sense of ‘scripted
and rote learnt’ “absolutely contravenes the spirit of the standard, the wording of the
standard. Scripted is a complete no-no.” She noted nonetheless that “students can
have aides memoires. So they can have lists, you know, the odd word or visual aids
to help them remember.” Monika argued that, in fact, access to such resources was
authentic and what “any adult would do naturally if you want to have a comprehen-
sive interaction with somebody and you don’t want to forget something.”
Noting that “the standard does not mean that the language is ‘spontaneous’ as in ‘not rehearsed’,” Monika went on to argue that the notion of ‘rehearsed’ was
“something that can be open to interpretation.” In her view, the following scenario
represented legitimate rehearsal:
‘Rehearsed’ as in you learn the language around a type of interaction, around a topic, and
then you practise it, and you practise it a number of times until you feel, ‘yes, I can confi-
dently converse, interact, about this’, whether it is, you know, ‘what is your opinion about
this movie?’ or whether it’s about environmental problems or … level 1, ‘talk about … what
you want to do in the weekend.’

As Celia pithily stated, “it’s about learning structures, learning sentences, learn-
ing key phrases.”
For Monika and Celia, therefore, the arena in which prior preparation was seen
as an important component was about initiating the interaction, and being able to do
so comfortably. Nevertheless, Monika’s comment about ‘openness to interpretation’
about what ‘rehearsed’ meant in practice created an uneasy terrain for teachers to
navigate.
James, for example, whose environmentally-focused task had clearly arisen from
a good deal of prior preparation, argued, “I can understand why students feel a bit
safer having time to prepare certain things because they want to have time to express
their ideas and to feel a bit confident challenging ideas.” The notion of absolute
spontaneity, of interacting “‘just like that’ on a topic,” led James to conclude, “we’re
expecting too much of our students in that respect.”
Nevertheless, for James, once students had initiated the interaction on the basis
of prior preparation, the emphasis was on maintaining that interaction authentically.
He explained to his students:
It’s a conversation, enjoy your conversation, just relax. If there were things you thought you
were going to say that you forget, it doesn’t matter, you can come back to it if you remem-
ber it later on, there’s no big deal. You know, if you agree with someone, respond, if you
don’t agree, you know, respond, and all that type of thing.

James admitted, however, that, as a consequence of his process, “I found some pairs more or less knew exactly what they were going to ask each other, and prob-
ably the other half of the class sort of knew the questions,” although they “didn’t
write everything down.” For James, “I was quite happy with that because I found
they were working really well … in class time it was great, there was French going
on all the time.” He concluded, “so while there was preparation going on I tried to
encourage them to just relax, enjoy it.” James’ reflections nonetheless revealed a
genuine dilemma: how much prior preparation is legitimate prior preparation? How
much prior preparation leads to interactions that are effectively pre-learnt, thereby
providing potentially inadequate evidence of interactional proficiency? This
dilemma, and the actual challenges that arose, underpinned the reflections of several
interviewees.
Anna argued, “of course you want them to rehearse the kind of language they are
going to use. You don’t just chuck them into it and [say] ‘off you go’ because that
will be a disaster.” Nevertheless for Anna it was important that “they haven’t scripted
it, they haven’t done the same conversation with the same person twenty times over
before they record it.” Linda similarly asserted that, in her perception, the purpose
of interact was “to assess whether they are capable of actually coping in the lan-
guage without having to learn a prewritten script.” She went on to suggest
nonetheless:
I don’t know how you can actually make it unrehearsed – unless you say, ‘okay we’ve been
studying family relationships, we’ve been studying the environment, now you’re going to
have a conversation with me on eating disorders.’ I’m sorry, that is ‘spontaneous and unre-
hearsed’, but you’re stuffed if you don’t know the words for ‘anorexia’ or ‘eating
disorders’.

Linda went on to argue that, in terms of genuinely spontaneous interaction, “while that [scenario] is more realistic [i.e., authentic], we are dealing with 16, 17 year-olds
here, a lot is hanging on these credits.” Therefore, the extent to which the interactions
could be “truly unrehearsed” was negligible, even at NCEA level 3. In other words,
“you’d have to say ‘okay we are studying the topic and here’s your task, things you
could include.’ But it’s not going to be ‘you say this and I’ll say that’.” A crucial issue
raised by Linda was therefore the high-stakes nature of the interaction, a reality that,
it would appear, undermined the feasibility of true or absolute spontaneity.
The tension between high-stakes and truly spontaneous was reiterated by
Margaret. Margaret argued that, when pairing students for the interactions, “I
wouldn’t want the two of them practising their thing over and over again until they
knew it off by heart and did it like robots … it’s not the purpose of the thing at all.”
It was therefore important that, in the interaction itself, “they don’t know what the
other person is going to say exactly.” Nevertheless, in her perception, there was “the
pressure to be spontaneous” coupled with “the difficulty [of] trying to do it without
any kind of rehearsal” (my emphases). The reality was that, with no prior prepara-
tion, the assessment became “too hard” and “too big an ask,” and “the kids stress a
lot with it.” There were therefore inevitably certain components which students
“would want to learn off by heart.” This would include “the formulaic expressions,
and then perhaps maybe a little bit of the meatier content, because that might be
more complex and they might really want to get it across, so they might learn off
two or three sentences.”
With regard to the tension, in an assessment context, between genuine spontane-
ity and pressure to perform, Marion acknowledged that, on the one hand, the ulti-
mate goal of interact was that “we want them to be able to converse naturally with
a French person.” On the other hand, “what I found was, even where I thought it
would have to be spontaneous, the really hard-working students just prepared every
option, and it still came across rehearsed. Because it’s so high-stakes they are just
so prepared.”
What was appropriate with regard to spontaneity was also a genuine issue from
the perspective of those moderating the samples of performances. Jane explained
that questioning of grades and performances “comes down always to the spontane-
ity.” Even though, among moderators, the decision had been made that, in its first
year of operation, “this was the year for the leniency,” nevertheless “there’s a lot of
dialogue that goes on between moderators … ‘what do you think of this?’ you
know.” It was important to uphold “the spirit of the standard.” Spontaneity was at
issue because there appeared to be “an awful lot of scripting that is going on.” The
reality was that “you really, really notice it when you do get a school or even one out
of the three interactions that isn’t scripted, and it really is lovely to hear.”
In light of the tensions as explained above, for example, by Margaret and Marion,
evidence presented to moderators was that, when interacting, students often, in
Jane’s words, “have the stuttering and they have the stammering as they’re thinking
up things,” that is, the evidence of spontaneous interaction. In the midst of that,
however, “they’ll have a big nugget of language that comes out at the time.” Jane
conceded:
You understand why that happens because they need their credits. They’ve been told ‘make
sure you get some complex language in’ and so they probably learn a certain amount of
phrases and are determined to get them in come hell or high water.

Nevertheless, squeezing in more complex language led, in Jane’s words, to “an awkward juxtaposition” or “an incongruity” between a natural interaction and an
interaction that incorporated clearly prefabricated material. The awkwardness, how-
ever, was that occasionally the pre-learnt material was used inappropriately and
thereby disauthenticated the interaction. Jane explained:
When we first did the conversations in the old standard [converse] we had a PD [profes-
sional development] day where we all created a list of conversational fillers. That sheet now
gets trundled around the nation, and you hear these students inserting really false fillers like
bah, dis donc, bah, dis donc, bah, dis donc [goodness, wow], all the way through conversa-
tions. It just sounds ridiculous. Or things like c’est dingue, which is okay, you know, it
means ‘it’s crazy’, but you just wouldn’t say it willy nilly.

In response, for example, to a comment such as ‘I went to the movies at the weekend’, Jane argued, “you wouldn’t say ‘that’s crazy’, you know.” She concluded,
“they don’t know how to do it, they don’t actually know how to have a conversation
[in the target language].”
With regard to spontaneity, Jane acknowledged that eventually the moderators
were “going to have to be firmer on it.” Nevertheless, and despite occasions of inac-
curate or inappropriate language use, the issue was complex, even for the modera-
tors. Interpreting ‘rehearsed’ as ‘pre-learnt’ or ‘pre-scripted’, Jane argued, “what’s
the difference between ‘rehearsed’ and ‘girlie swot’? You know, the kids who have
actually done all that work,” and therefore relied heavily on pre-learnt formulaic
expressions – “‘I went here, I did this with my family, in a car, it was a blue sky,’
you know.” This has left several moderators in the position of being “really uncom-
fortable saying ‘this is rehearsed’.”
In summary, Jane noted that understanding and enacting the ‘spontaneous and
unrehearsed’ intentions of the assessment “very much depends on the school and
how it is being presented to the students and what learning is probably going on in
the classroom.” As Margaret asserted, “I think everyone is interpreting it their own
way, the best they can, so there must be a huge variety of practice out there.”

7.3.2 Moving Away from Grammar

The perspective presented by Jane raises a second important issue for interactions at
NCEA level 3 – the tension between a focus on fluency, and therefore the use of
‘natural’ language, and the requirement to account for language that is at a
sufficiently high level of sophistication to justify a level 3 performance, particularly to secure higher grades. I noted in Chap. 5 that teachers applauded the greater free-
dom afforded by not having to account for specific grammatical structures, albeit
recognising that grammatical accuracy was not negated. That is, it seemed that
grammatical accuracy was important, but relegated to an essentially supportive role
in terms of the extent to which it facilitated effective communication. Nevertheless,
the blueprint for the assessment for NCEA level 3 makes reference to language at
curriculum level 8 (see Chap. 3), and one clarification document (NZQA, 2014,
moderator’s newsletter, December 2012) suggests that the now redundant language-
specific curriculum documents and vocabulary and structures lists may be consulted
for guidance to determine whether the appropriate language level has been reached.
Teachers at NCEA level 3 (curriculum level 8) are therefore left with an ambiguous
scenario within which to try to interpret the requirements of the assessment.
Drawing, as had Jane, on the argument of “the spirit of the standard,” Monika
provided her own lead teacher perspective. She asserted, “in interaction nobody
really cares that much if you can use all the fancy structures … the important thing
is [that] you can continue to talk by hook or crook …” She added, “I think that is
also how NZQA interprets it, you know, that it is the interactive capability that is
being assessed first and foremost … as a student you should have autonomy to pro-
duce language all by yourself” (my emphasis). Interact therefore gave teachers “the
freedom to say ‘ok, they are doing something like in real life where it doesn’t matter
so much whether you make a mistake’.” To reinforce this, “NZQA have done away
with the grammar structures.” Monika went on to argue nevertheless:
Just because you have a new curriculum doesn’t mean that you have no content any more,
and I think most teachers think that the [former] language specific curricula give really
good guidance as to what is appropriate for topics at a particular level.

Margaret reflected that at one time she did use to focus very much on the gram-
mar requirements and “tell them ‘you have got to use this one and that one’.” She
recognised, however, that “you can’t do that when speaking spontaneously.” She
argued:
My reading of the standard is, we’re not focusing on the structures any more, we’re focus-
ing on how much they’re understanding each other and responding to each other. If they can
correct themselves or help their partner with a word, or keep that flow going, or negotiating
meaning, all that kind of stuff, that’s got the upper-hand on the structures … and now I’m
not listening out for a flash sort of chunk. Sure, if one pops in, then it’s ‘wow’, but I’m not
forcing the kids to think up structures to say. Now I’m saying ‘if someone said something
and you didn’t know what they are saying, can you learn how to say “oh you mean this?”
and rephrase it?’ One or two kids are doing this beautifully to me.

Margaret concluded that, in terms of achieving the highest levels of performance,
it was demonstrations of negotiation of meaning that counted – "that's excellence. That's how it
should work” (my emphases). She went on to assert, “it could be construed that
their level of language drops … the structures definitely go down, but then they are
replaced by the interactive strategies” (my emphasis). Furthermore:
You listen to native speakers when they speak … most of the time people speak to the low-
est common denominator, they cut words, it’s human nature, so why should a French con-
versation in second language be even more formal, more complex than a native speaker
would be?

Naomi reiterated the same point: “nobody speaks perfectly all of the time and
doesn’t make mistakes. … you just use what fits the purpose at the time, otherwise
you’re going to come off sounding really pompous.” As a consequence, “accuracy,
I told my kids, was not the number one thing. ‘Your use of high level structures is
not your number one priority, you need to communicate’.” Naomi was therefore
“not going to expect the excellent student” to have to demonstrate proficiency in
using ‘higher level’ structures, “you know, they must use a subjunctive, they must
have a passive, they must use the past conditional.” She argued, “it’s not all about
that. I think there’s other ways they can show off their language than structures.”
Nevertheless, Naomi conceded that she did try to “encourage them to try to use a
subjunctive [or a passive]” with the recommendation that “you should have at least
one in your repertoire somewhere.”
For Naomi, therefore, there appeared to be something of a tension between
stressing the use of ‘natural’ language that was ‘fit for purpose’ and encouraging
students still to have at their disposal examples of more complex structures. This
tension was also brought out by James and Anna. James asserted, “it’s a normal
conversation, and if you start to say things that are too fancy it’s not normal.” That
is, “appropriate language in interaction is not necessarily [that] you bring out all
your fancy grammatical structures.” He argued, “sometimes a subjunctive does
sound false, and you don’t have to use it, and it doesn’t have to be subjunctive, it
could be something else.” Nevertheless, “I know teachers just love subjunctives,
you know” and “it is very, very easy to have at least one or two of the traditional
structures in your conversation in French Year 13.” James concluded that “the main
thing that I was looking for … were actually the ideas and what people said, that
was the most important, the actual content.” Nevertheless, “I found all the students
were able to put in one or two traditional grammar things.”
The above arguments suggest an understanding that strategic competence has
now become a key criterion for success, both taking a greater role than grammatical
accuracy and contributing to spontaneity. In contrast to Peter's assertion about
perfection, in terms of how students perceived the requirements (Chap. 6), Anna
argued that something that she had “been working on” with her students was the
notion that interact was “not about reaching perfection” – although she conceded “I
don’t know if I have managed to get it through yet.” That is, “I’m definitely trying
to instil … in them, ‘just interact with each other, see where you are at, see what you
can do’.” Nevertheless, Anna did prepare the students with pre-fabricated formulaic
expressions for purposes of strategic interaction. That is, “I have really started
focusing on … giving them chunks of language to use in interactions … you have
to give them the ways of apologising, and seeking clarification and those things.”
The goal would become:
I guess you are looking at how they mediate the process of communication. How do they
look for clarification? It is about using a variety of structures, showing that you can do all
kinds of things in the process of doing this, asking different questions, clarifying, reacting,
and that’s where I really see it sitting.

Relating this to an example of a particular topic of focus, Anna explained:


They know that that’s what we are working towards. We are going to sit down and talk
about friendships and relationships. And what we are in the process of learning is ways of
reacting and asking questions, all that kind of stuff. Then seeing what an exemplar looks
like of actually doing that as well.

As to the requirement that, for example, “you must use a subjunctive”:


Well, I teach them the subjunctive, of course, and I say to them ‘it’s a really good way of
suggesting stuff – when you are in the kind of conversation where you are suggesting stuff,
that’s a good way of doing it, a sophisticated way of doing it.’

Perspectives on spontaneity and the place of grammar reveal that several teachers
were making genuine attempts to reconcile the focus on fluency that they saw as
being central to interact with an acknowledgment that a demonstration of ‘sophisti-
cated’ language might serve to strengthen students’ performances. The issue became
how students might be encouraged to use higher level language in ways that natu-
rally supported the open-ended and non-prescriptive nature of the assessment, and
that naturally contributed to the interaction. Several interviewees addressed this
issue by focusing on the nature of the tasks that students might be asked to engage
in. Particularly at level 3, the requirement to balance complexity, accuracy and flu-
ency was clearly keenly felt by teachers.
In summary, perspectives on grammar reveal a tension between freedom to use
any language appropriate to the task and a requirement to make sure the language
was at the appropriate curriculum level. The tension for teachers was neatly sum-
marised by Linda. On the one hand “I found [interact] hard at first because I always
said ‘you’ve got to get a subjunctive in’.” On the other, students “could get away
without bringing those in.” On the one hand, “if you were having a formal conversa-
tion with somebody, if your interaction were a formal one, then you may well bring
in [the complex grammar].” On the other, “I think in an interaction there’s less
emphasis on bringing the fancy bits in and more on communicating what you
actually want to say.” In terms of expectation, her conclusion and message to the
‘powers that be’ was, “I wish they’d make that clear, absolutely crystal clear.”

7.4 Back to the Task

With regard to eliciting appropriate high-level language in the context of authentic
interaction, there was evidence to suggest that the task itself was crucial, such that,
if teachers focused on the task, the grammar (in terms of demonstrating a suitable
level of language) would take care of itself. Lead teachers Monika and Jane pro-
vided an overview perspective that would arguably link the perceived grammatical
requirements with the nature of the task. Monika was of the view that the task would
automatically lead to language at the apposite level, explaining that, at the highest
curriculum level (level 8), the language expected of students was “language that is
reflecting the societal aspects” of the interaction. What was required was “language
of problematic situations, of solutions … [or] to deal with social cohesion, to deal
with environmental problems, to deal with social stereotypes.” This requirement
inevitably lent itself to grammar such as subjunctive or conditional that were “just
forced [into use] by the themes and the topics that you deal with.” In other words, in
contrast to forcing grammatical structures unnaturally into use, “the question is
really … the task or the tasks that I set, do they force a particular way of approaching
it?” (my emphasis).
Monika went on to explain, “I mean, conditional is just one thing, you know,
there are so many equivalents of language use that could tell me this person is defi-
nitely using language level that is sophisticated enough to qualify for level 8.”
Nevertheless, “if you don’t show that you can talk about possibilities or threats or,
you know, something like this, I wonder if I would award something like an excel-
lence.” Monika concluded, “it is my firm conviction that the task drives the lan-
guage, and if you set the task well the language will follow.”
Jane concurred with Monika’s perspective that at NCEA level 3 “the nature of
the tasks that the students are given is automatically so waffly and complex that you
are kind of already in that upper zone.” This meant that, by virtue of the task,
“they’re already speaking in such a high level way.” Even though “to be honest,
most of them are still sticking in, you know, a subjunctive here and a conditional
there,” the use of these grammatical constructions was not necessarily artificial
because “the task lends itself to that.”
In actual classroom practice, however, Alison and Sara brought out an interesting
juxtaposition between two contrasting student aspirations: wanting to have an
explicit focus on the formal aspects of language (‘so that we know something’), and
enjoying (and visibly relaxing in) a context that has a specific focus on fluency (‘so
that we can do something’).
Alison recognised from her own work with students both the importance of
allowing the language used to be appropriate to the task and the notion that the task
itself would likely promote the appropriate level of language:
I just said that ‘there’s nothing [grammatically] that you have to use, but you need to explore
opinions’. We talked about the quality of the language used, and how are you going to find
out what somebody else thinks? How are you going to disagree with them? How are you
going to express your opinion if it’s contrary? That sort of stuff. Then, what kind of lan-
guage will you use to make this successful?

Nevertheless, Alison reflected:


I still think I need to teach grammar of some sort. That’s how the students have security.
When you get feedback sheets from them and they go, ‘we haven’t learnt anything’, it’s
because you haven’t put a label on it, on what they’ve learnt, they need the label.

As a consequence, in Alison's classes, there was "not a focus on structures," but,
rather, an exploration of grammar in order to "give the students the security they
need to feel … they need the explicitness of ‘here’s what we’ve learnt and this is the
language structures, the language concepts that we have learnt …'". In Alison's
experience, therefore, her students looked for a focus on explicit grammar so that
they could have a sense of ‘learning something’.
Sara provided a perspective that suggested that, for students, more spontaneous
and effective interactions could occur when there was not a direct focus on gram-
mar. Beginning from the principle that “I don’t want to drop the task and say, ‘nah
don’t worry about the language’, I do want them to worry. I want them to care and
use good language,” Sara described how she drew on three quite different interac-
tional task scenarios:
For the first one they interacted [as] two students, they chose their partner and they inter-
acted. The second conversation was with me, and it was with more preparation, more use of
language. The third one was completely unexpected [i.e., spontaneous]. I invited native
speakers to the class and we did speed-dating.

For the teacher-led task “I did give them a list of grammar, and also some expres-
sions.” Reflecting on its success, however, Sara conceded, “that’s the one that, if you
hear it, is not natural.” She concluded:
The one that they did with me, the one that they actually prepared a lot, it was very similar
to the old type of conversation that I did up to last year. It was like the students speaking a
lot, using good language, speaking a lot – but it was memorised.

By contrast, the task that, in Sara’s view, was “by far the best one,” was the final
task with the L1 speakers. A primary focus of the interactions was Christmas
“because the native speakers, they were from Chile, they were talking about how
Christmas is in Chile, comparing it with Christmas in New Zealand.” The speed-
dating ensured a variety of interactions. Students recorded them on their mobile
phones and were able to select the best ones as evidence for assessment. Sara
explained:
I kind of like extended the time from just like one minute, it was two or three minutes. So
after one minute they were like, ‘okay, what else can we say?’ and they were trying to get
more and more of their own language. That was awesome, to see them think at the same
time and [be] spontaneous, with unexpected questions.

The students themselves reported to Sara, “it’s good that you actually brought the
native speakers here and you made us talk, because otherwise we wouldn’t have
done it.” She concluded, “maybe for some of them it’s their first experience talking
to a native speaker.” In contrast to the prepared nature of the teacher-led task, Sara
noted that “when we did the last interaction with the speed-dating, I told them not
to worry about the language, and I noticed they were a lot more comfortable.”
7.5 Conclusion

The perspectives I have presented above reveal that, 3 years into the assessment
reform process (2013), and in the first year of availability of interact at NCEA level
3, several issues remain unclear and the operationalisation of interact is subject to a
range of local interpretations.
The exploration of perspectives from the Stage II interviews began with the
assertion that ‘the task is crucial’. Teacher perspectives suggest that, when students
are presented with an appropriate and relevant task, there is the potential for stu-
dents to interact positively with the task in ways that will help them to demonstrate
spontaneity alongside appropriate use of grammar. On the other hand, there is evi-
dence to suggest that not all tasks are appropriate, or that students are not interacting
appropriately with the task, relying quite heavily on pre-learnt material and believ-
ing that they must account for particular grammar. Their performances, or at least
aspects of them, are thereby potentially or actually disauthenticated.
The tensions raise the issue of whether the concept of ‘task’, at least as presented
in a formalised way, should be abandoned. Seen in the light of the various chal-
lenges to interact presented in the teachers’ perspectives explored in this chapter,
Marion and Alison actually challenged NZQA’s ruling that students are required to
receive written notification about the assessment tasks, a ruling that, in Peter’s
words (see Chap. 6) promotes “the whole idea … that they have to have fair condi-
tions of assessment.” Problematic in the ruling is the arguably undue attention that
it gives to each interaction as an assessment event, with consequent implications for
lack of spontaneity and artificiality of language.
Marion speculated, “if that task sheet wasn’t there we could be a lot more spon-
taneous.” She argued that “next year I thought I might write a task at the beginning
of the year and relate it to current events so I could just use that task throughout the
year whenever I want on any given event.” If, for example, something were to hap-
pen overnight, the students could have a spontaneous conversation at the beginning
of class which they could record. With regard to artificiality, this would be “a way
to get around it.”
Alison argued that, in her view, if interact were really to work most effectively,
it would be necessary to “make it so that you don’t have to specify the tasks.” That
is, currently “the fact that, you know, you have to have these tasks that you’ve
decided beforehand” means that inevitably “the students manufacture a conversa-
tion around those tasks, so it’s not true spontaneity,” and “there are certain features
that you have to look for which may not happen in a normal conversation.” As an
improvement, the interactions should be “free and open, no tasks, just any three
conversations where the students have shown they have used language authentically
in a way that’s natural and allows them to showcase what they can do.”
Following on from her experimentation with getting her students to record a
whole range of evidence of interactions, on a regular basis, using their mobile
phones (see earlier this chapter), Alison suggested:
I think what I’ll end up doing is every week they choose the best one and they send that to
me, and they keep it as well. So they just delete everything else and then I’ve got a copy of
what they think is their best one for the week. Then between us we come up with what they
will submit. So effectively what I’ll be doing is collecting heaps and heaps of evidence.

Realising that this did not necessarily fit with the requirement of the assessment
(to let the students know beforehand when an assessment was taking place), Alison
mused, “I thought, well maybe I’ll just retro-fit the tasks around what they send in,”
or alternatively “make [the tasks] so generic” that the evidence could fit. She
acknowledged, “that’s probably not politic to do that sort of thing, but I want them
to feel comfortable with what they are doing.” Alison concluded, “if you have to
specify the tasks, that makes it more artificial. If you really want true authenticity,
then take away the task. Don’t specify what the task is. Keep it open.”
In the next chapter I explore what, from the teachers’ perspective, have been the
positive washback benefits of interact at level 3. I conclude with the reflections of a
key group of stakeholders – the students.

References

Bachman, L. F., & Palmer, A. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford, England: Oxford University Press.
Brown, H. D. (2007). Principles of language learning and teaching (5th ed.). New York, NY:
Pearson.
NZQA. (2014). Languages – moderator’s newsletter. Retrieved from http://www.nzqa.govt.nz/
qualifications-standards/qualifications/ncea/subjects/languages/moderator-newsletters/
October-2014/
The University of Queensland. (2012). About flipped classrooms. Retrieved from http://www.
uq.edu.au/tediteach/flipped-classroom/what-is-fc.html
Chapter 8
Interact and Higher Proficiency Students:
Concluding Perspectives

8.1 Introduction

In Chap. 7 I presented the perspectives of interviewed teachers about interact at
NCEA level 3 (CEFR levels B1 to B2). The interviews with teachers address several
issues for the successful operationalisation of interact at this highest level of exami-
nation, with implications for the lower levels 1 and 2. In particular, in Chap. 7, and
in light of the tensions apparent from Stage I of the project (Chaps. 5 and 6), I pre-
sented teachers’ reflections around three key domains: the importance of task type;
spontaneity and lack of rehearsal; and moving away from grammar (i.e., not having
to account for particular grammar constructions in performance). It was evident
that, in each of these arenas, there was differential interpretation, and consequently
differences in practice, across several schools.
As regards task types, it was recognised that the task was crucial to enhancing
opportunities for students to interact successfully. There was evidence that, although
teachers might keep to established topics such as the environment, there was also
willingness to experiment, both to develop the more pedestrian topics and to try out
other topics which arose largely from the students’ suggestions. There was also, from
some teachers, a proposal that the notion of ‘task’ should be abandoned in favour of
open-ended conversations focused on whatever the students wanted to talk about,
particularly if spontaneity was considered of central importance. With regard to the
thorny issues of lack of spontaneity and prior rehearsal there was evidence of a range
of understandings and practices. With reference to grammar, there was a juxtaposi-
tion (and tension for teachers) between accounting for language that was at curricu-
lum level 8 and encouraging language that was naturally appropriate to the task.
Indications from the teacher interviews are that, inevitably, teachers at NCEA
level 3 are finding their way with interact in its first year of operation. There is evi-
dence of innovation and genuine attempts to interpret the requirements of the
assessment in ways that make sense to individual teachers. There are also signs that

teachers are at times falling back on, or giving influence to, the ‘tried and tested’
pathways established in converse.
This chapter concludes the presentation of data from this 2-year study into inter-
act. I begin with an issue that is fundamental to realising the full potential of inter-
act, whether at NCEA level 3 or at the lower levels, that is, the issue of washback,
as seen from the perspective of the teachers. I turn then to the students as key stake-
holders and central recipients of the reform, and allow them to have the final word.
In particular, I present their perspectives, both on converse and, more particularly,
on interact, in the light of the central challenges and issues raised by their
teachers.

8.2 Working for Washback

I made it clear in Chap. 2 that the fundamental driver for New Zealand’s assessment
reform, not only for languages but, in fact, for all areas of the curriculum, was to
align New Zealand’s high-stakes assessment system as closely as possible with the
learner-centred and experiential aims of a revised national curriculum. This align-
ment process inevitably implies a two-way reciprocal process that I noted in Chap.
3: that the aims and intentions of the curriculum should be reflected in assessments
(what we might call feedforward), and that assessments will influence the introduc-
tion of the curriculum – washback (Cheng, 1997; East & Scott, 2011b). Fundamental
to the successful introduction of the revised assessments is therefore the extent to
which the aims and intentions of the curriculum do indeed influence the assess-
ments, and the extent to which the assessments influence what happens in class-
rooms. As Jane put it:
It all hinges on a) the task, b) the teaching that does go on and what your classroom is set
up like, because if you’re teaching them verbs and vocab lists and introducing vocab and
then just using a bingo game to reinforce it and then expecting the kids to produce this [in
interaction] it’s not going to happen.

The positive feedforward/washback implications of interact were in fact well
documented in data from both stages of the study.
Jane, for example, explained that, by aiming to embed the assessment more
seamlessly into the on-going work of the classroom, its nature as an assessment
would thereby not loom so large. There was a need to “really try and get the interac-
tion going all year.” That is, interact “doesn’t need to be this big thing, you know,
the big bear in the cupboard.” To achieve this “you just need to really make sure that
it’s happening in the whole culture of your classroom rather than expecting it to
come out of nowhere.” In other words, feedforward from the curriculum, and wash-
back from the assessment, were necessary components of success.
With regard to the expectations of the curriculum (i.e., the teaching and learning)
feeding forward into the assessment, it was apparent that teachers saw their
assessment tasks as naturally arising out of their teaching programmes, and that
they did not see this as contradicting the requirement for tasks to be ‘spontaneous
and unrehearsed’. As Margaret explained:
I think [the task] really needs to evolve out of a unit of teaching, that you have been giving
them the input with the language and the structures, and you have been practising in a more
controlled setting. Then, as you move through that unit … they’ve been gathering informa-
tion and having many practices, then ideally by the end of that they should be able to string
that together.

The phenomenon of washback, or “the extent to which the introduction and use
of a test influences language teachers and learners to do things they would not oth-
erwise do that promote or inhibit language learning” (Messick, 1996, p. 241), argu-
ably works in the other direction, from the assessment back to the enactment of the curriculum. That is, the requirements
of the assessment become the driver that influences classroom practices. In the case
of interact, positive washback would result in classroom environments that would
encourage spontaneous and unrehearsed interactions among peers beyond those
that might be recorded for assessment purposes. In turn, genuine interactional pro-
ficiency would arguably be enhanced. In Chap. 5 I presented some evidence from
the national teacher survey that washback of this sort was beginning to happen. The
teacher interviews completed as part of Stage II of the project provided some addi-
tional evidence that positive washback was a reality, both potentially and actually.
James and Alison, who both admitted that introducing interact at NCEA level 3
was in fact their first sojourn into the assessment (they had not used it at levels 1 and
2), exemplified in their reflections the washback potential of interact. James com-
mented, “initially I thought … ‘oh no, we’ve got all this to do’, but looking at the
reaction of students after they had done it, and the feeling of accomplishment, I
thought ‘wow, [interaction] is really worth stressing …’” He conceded that, even
though he was still coming to terms with the requirements of the assessment, he
noted that his students “were on their own, you know, speaking to each other, per-
forming in terms of an interaction.” He concluded that this made the interaction
“really worth doing” – “the students buy into it, I buy into it … and I saw weak
students really perform, and I was impressed and surprised, and I thought ‘wow,
that’s the way to go’.”
As I noted in Chap. 7, Alison, in her description of one of the outcomes of her
‘cat café’ task, described one interaction, a skyped conversation recorded between
one of her students and a friend in Japan, as ‘completely spontaneous, lovely, the
most authentic conversation.’ There was a sense in which this was an eye-opening
experience for her, which made the interaction “exactly in the spirit” of the assess-
ment. Furthermore, Alison’s report of her own classroom practices signalled that
these practices were moving in the direction of embedding spontaneous interactions
seamlessly into the on-going work of the classroom. This would be achieved in two
ways that were new to her at the time of the interview: her use of the ‘flipped class-
room’, and her encouragement to students to use their mobile phones to record a
range of spontaneous interactions from which evidence of interactions for
assessment purposes might be derived. Her implementation of interact was leading,
in her words, to a “way more conversation-focused classroom.”
Several other interviewed teachers commented that interact at the higher levels
was now washing back into their work lower down the school, and into the more
junior classes. Sara observed that, as a result of the new assessment, “I am doing
more speaking, I am giving more speaking time. For example, it used to be a minute
to talk, now we are spending a decent part of the lesson in conversation, just talking.”
She went on to assert, “I am taking it down to junior level, I’m doing more speaking
with my junior level because of this.” Linda similarly noted, “I try to start in Year 9.
I try to start by using the target language as much as possible … I do reward things
like, ‘you used Spanish there without being asked to, fantastic, here’s a reward’.”
With regard to washback into junior classrooms, Sharon and Anna reiterated the
same point as Sara and Linda. Sharon argued that interaction “really is starting from
Year 9, doing a lot more.” As a consequence interaction had become “actually more
my selling point now at the junior level too – ‘by Year 13, [with] the interactions
you do [now], you should be getting near to fluency’.” She went on to assert, “I am
very positive about it. The kids do enjoy it and, you know, Year 9, I’ll pull names out
of a hat and they have to go off and work together and they’ve got [devices] now that
record each other … it’s good, it’s really good.” Anna likewise noted that spontane-
ous interacting was now becoming “something that we start developing even in Year
9.” To contextualise the focus on interaction for her students, Anna explained:
We start with the very little conversations that we do with our Year 9s. We talk about the
rationale [for spontaneous interaction] quite a bit, why you do activities the way that you do
them, the fact that it isn’t about getting them perfect, so please don’t write it down because
then it isn’t a conversation.

As a consequence, “usually when they get to Year 11 they are at a point when
they are quite happy to talk to each other.” Anna concluded, “it has to be part of
every day’s lesson. I think it’s important for the teacher to trust that they [students]
will actually be speaking German with each other, that they will actually be on task.”
The above perspectives illustrate the washback potential of interact, neatly sum-
marised in the words of lead teacher Celia – “often by changing the assessment is
the way that we change the teaching practice.”
However, a risk inherent in an approach in which the assessment arises as a conse-
quence of units of teaching within a more structured and controlled teaching and learn-
ing environment (as exemplified, for example, by Margaret above) must be
acknowledged. The risk is that the assessment will become a point of focus. It might
thereby become a replication of converse, with all that this implies for lack of spon-
taneity, pre-rehearsal, memorisation, limited opportunities to capture instances of
interactional competence and, ultimately, potentially inadequate automaticity.
Indeed, several interview comments, as presented in Chap. 7, suggested that there
was a real danger of these consequences.
The perspectives of lead teachers Monika and Jane revealed that, understood in
terms of what was happening in classrooms, moving into the future with interact
was a double-edged sword – on the one hand, increasing confidence in encouraging
spontaneity, and, on the other, still a propensity towards significant prior rehearsal.
Jane’s experiences, as noted in Chap. 7, indicated that, although “the evidence
that I’ve seen at all levels really would suggest that there is a more natural conversa-
tion going on between the participants when it works” (my emphasis), nevertheless,
“there’s still all that scripting that is going on, or the preparation that’s going on.” In
turn, the scripting and preparation hindered and obscured evidence of genuine and
spontaneous interaction.
Encouragingly, however, Monika’s experiences implied, in contrast to Jane’s, evi-
dence of a gradual movement towards interactions that were more aligned with
expectations for what interactions should be. She speculated that, 3 years ago, when
interact began, a good deal of the evidence was of interactions that were “very clearly
role plays that were scripted.” That is, listening to the evidence provided, “you could
hear paper rustling, you had these classical things that a student was speaking and the
other student was already answering even though the question wasn’t finished
because they were so eager to put their part in.” Monika noted, however, that over the
past 3 years the practice of pre-scripting had “become less and less prevalent.”
Another development observed by Monika was that, in the early stage of imple-
mentation, “there were still a lot of interactions with teachers and students because
I think teachers just didn’t know how to prepare their students for student/student
interactions.” Three years on, and with “increasingly more evidence that teachers
get good guidance from NZQA, from the Best Practice Workshops, from … teacher/
teacher coaching or discussions,” Monika asserted both that “the quantity that I’ve
seen this year of student/student interaction” had risen, and that there had been
“clearly very good preparation of the students.” As a consequence, “the very best
interactions sounded like they were truly spontaneous,” even if they were “really
well prepared.”
On balance, the experiences and outcomes that I have explored in the preceding
chapters underscore the reality, noted by Jane, that for interact to be successful in
the longer term, “it probably does come down to education and changing a whole
culture of how we teach and present stuff to kids.” Jane argued, “I’m really, really
convinced that, no matter how able the student, if they have just been trudging
through vocab lists and some happy, happy vocab games” they were not ultimately
going to be able to use the language effectively in interaction. That is, “they need
more than that … if they’re going to have any hope of actually communicating.”
Washback, in terms of an increased focus on spontaneous interactive communica-
tion in FL classrooms, arguably becomes crucial to increasing success with interact
as an assessment.
It is important, finally, to address the students’ points of view. In what follows I
present data that reveal the range of experiences and outcomes from the perspective
of the students.

8.3 The Student Surveys

Year 13 students, who had reached the highest level of examination at NCEA level
3, were surveyed on two occasions. A pilot survey (n = 30) took place towards the
end of 2012 and was targeted at those who were among the last to take converse.
The main survey (n = 119) took place towards the end of 2013 and focused on those
who were among the first to take interact. The two surveys, although partially used
to elicit a level of comparative data, were targeted at two independent groups, with
the responses relating to interact being a particular focus of interest.

8.3.1 Section I

As with the teachers who completed the national teacher survey in 2012, Section I
of the student survey focused on facets of the test usefulness construct (Bachman &
Palmer, 1996), although the statements were re-written to reflect a different target
audience and practicality was not included as a sub-construct. In common with the
Stage I teacher surveys, responses were elicited in terms of marking the appropriate
point on a 5 cm line (see Fig. 4.2). Table 8.1 provides the overall means, on a scale
from 0 (strongly disagree) to 10 (strongly agree), from the responses gathered from
both student surveys (2012 and 2013), and for each individual measurement state-
ment in the survey (the measures are presented here and elsewhere in sub-construct
order, not in the order as presented in the survey).

Table 8.1 Overall means and differences in means (students): converse and interact

Measure                                                Converse (n = 30)    Interact (n = 118)(a)   Difference in mean

Perceived validity and reliability
1. Helped me to show clearly what I know and can       M 6.66, SD 2.00      M 6.58, SD 1.9          −0.08
   do when speaking the language
2. Helped to provide an accurate measure of my         M 6.28, SD 2.54      M 6.42, SD 2.14          0.14
   speaking ability
3. Gave me the opportunity to show my fluency in       M 6.22, SD 2.44      M 6.52, SD 1.88          0.3
   speaking the language

Perceived authenticity and interactiveness
4. Gave me the opportunity to have a genuine           M 6.5, SD 1.62       M 6.06, SD 2.14         −0.44
   conversation with another person
5. Gave me the opportunity to use real and             M 6.14, SD 2.2       M 6.22, SD 1.98          0.08
   unrehearsed language
6. Provided a good measure of the language I may       M 6.1, SD 2.4        M 6.04, SD 2.24         −0.06
   need to use when talking with native speakers
   in the future

Perceived impact
7. Completing the assessment made me feel anxious      M 5.6, SD 3.12       M 5.62, SD 2.88          0.02
   and stressed
8. I enjoyed the opportunity to speak that the         M 5.98, SD 2.62      M 5.8, SD 2.36          −0.18
   assessment gave me

Note: (a) Out of the sample of 119 students, one student failed to complete Section I of the survey and was omitted from analyses of Section I
Bearing in mind that the two groups were completely independent, the groups
showed remarkable symmetry across all responses. That is, the differences in the
means ranged from −0.44 to +0.3, and none of these differences was significant. On
average, students ranked both assessments relatively highly (the mean was greater
than 6/10) on all measures pertaining to validity and reliability, authenticity and
interactiveness. The central tendency suggests that, in the students’ view, both assess-
ments replicated a spoken communicative proficiency construct relatively well, and
both appeared to provide relatively good opportunity for students to display what
they knew and could do. Students were also of the same mind regarding the level of
stress that the assessment generated: both groups rated stress (i.e., negative impact)
at a similarly high level and perceived the two assessments to be equally stressful.
On average, then, both independent groups (who were asked to provide their
responses based solely on the assessment they had taken) judged each assessment
comparably, and essentially identically, on all measures. It seemed that, in the stu-
dents’ perspective, neither assessment was better or worse in terms of perceived
usefulness or fitness for purpose.
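To illustrate the kind of calculation that lies behind Table 8.1, the sketch below (in Python, using NumPy and SciPy) converts marks on a 5 cm response line to the 0–10 agreement scale, computes group means and standard deviations for a single measure, and tests the difference in means between two independent groups. The marks, group sizes and the particular significance test shown are illustrative assumptions only; they are not the study's data or its analysis procedure.

# Illustrative sketch only: invented marks, not the study's data.
# A mark on the 5 cm response line is converted to the 0-10 scale,
# per-group means/SDs are computed for one measure, and the
# difference in means is checked with an independent-samples t-test.
import numpy as np
from scipy import stats

def line_mark_to_score(mark_cm):
    """Convert a mark on the 5 cm line to the 0-10 agreement scale."""
    return (mark_cm / 5.0) * 10.0

converse_marks_cm = [3.1, 4.2, 2.5, 3.8, 3.0]       # hypothetical group 1
interact_marks_cm = [3.3, 2.9, 3.6, 3.2, 3.5, 2.8]  # hypothetical group 2

converse = np.array([line_mark_to_score(m) for m in converse_marks_cm])
interact = np.array([line_mark_to_score(m) for m in interact_marks_cm])

print(f"converse: M = {converse.mean():.2f}, SD = {converse.std(ddof=1):.2f}")
print(f"interact: M = {interact.mean():.2f}, SD = {interact.std(ddof=1):.2f}")
print(f"difference in means = {interact.mean() - converse.mean():.2f}")

# Welch's t-test for two independent groups (unequal variances assumed)
t_stat, p_value = stats.ttest_ind(converse, interact, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")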

8.3.2 Taking a Closer Look at the Numbers

A closer inspection of the descriptive statistics (standardised to scores out of 100)
reveals a noteworthy phenomenon. Although the mean scores, displayed in Fig. 8.1,
represent visually the uncanny similarity in responses, Figs. 8.2 and 8.3 depict both
the means and the variances within the responses for each measure and for each
group. It is important to take note of the wide variances in response.
Fig. 8.1 Student survey mean responses by measure (converse v. interact) [mean
standardised scores, 0–100, for measures 1–8; series: Converse, Interact]

Fig. 8.2 Converse – range of responses by measure [spread of standardised scores,
0–100, for measures C1–C8]

Fig. 8.3 Interact – range of responses by measure [spread of standardised scores,
0–100, for measures I1–I8]

Figures 8.2 and 8.3 reveal considerable variability across all measures in the
survey. That is, for one student who, for example, found interact to promote a high
level of opportunity to talk in the target language, there would be another who
thought just the opposite. This variability was most pronounced when students com-
mented on the level of stress generated by the assessment (Measure 7). Some stu-
dents reported finding the assessment to be minimally stressful; others reported that
the assessment was highly stressful. This was true for both converse and interact –
although the variation appears to be less pronounced for interact, that is, it seemed
that interact generated marginally less stress overall (the difference was not signifi-
cant, however).
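The per-measure spread depicted in Figs. 8.2 and 8.3 can also be summarised numerically. The sketch below, again using invented ratings rather than the study's data, standardises 0–10 ratings to scores out of 100 and reports the mean, standard deviation and range for each measure – the information the two figures convey graphically.

# Illustrative sketch only: invented ratings, not the study's data.
# 0-10 ratings are standardised to scores out of 100 and the spread
# of responses for each measure is summarised (mean, SD, min-max).
import numpy as np

ratings_0_to_10 = {
    "Measure 3 (fluency)": [6.2, 7.5, 4.0, 8.1, 6.6, 5.9],
    "Measure 7 (stress)": [2.0, 9.5, 5.5, 8.8, 3.1, 7.0],  # widest spread
}

for measure, values in ratings_0_to_10.items():
    scores = np.array(values) * 10.0  # standardise to a 0-100 scale
    print(f"{measure}: M = {scores.mean():.1f}, SD = {scores.std(ddof=1):.1f}, "
          f"range = {scores.min():.0f}-{scores.max():.0f}")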
Taking these levels of variability into account, we might conclude that, in fact,
neither assessment is being fair – either fair to all, or fair at all. That is, based on
these responses, it would appear that neither assessment was viewed as being con-
sistently adequate as a measure of communicative proficiency. This challenges the
ability to make inferences about the usefulness or fitness for purpose of each assess-
ment, at least as judged from the students' perspective: some candidates benefit; others
may be disadvantaged.
The open-ended data provided further opportunity to probe students’ thinking and
experiences with regard to both converse and interact, and enabled an exploration of
strengths and weaknesses through students’ reported experiences and perceptions.

8.4 Student Survey Responses – Converse

It was evident that a number of students who took converse were quite satisfied with
what they had been asked to do. Responses included “I did enjoy the standard”
(German 01); “it was fun to do with the teacher” (French 04); “it’s a good opportu-
nity to explore how to use the language in an everyday conversation” (French 08);
“I really enjoyed the converse standard, as it was fun, relaxed and the least stressful
of the standards” (Japanese 22).
One student of French (French 10) who commented that “I really enjoyed the con-
versation assessment, it was fun coming up with a scenario and acting out my point of
view,” went on to describe how she had prepared for the assessment. Her response
highlights two essential limitations of the converse assessment format – the propen-
sity to rote-learn responses and the tendency to force artificial language into use:
I think that … writing it out beforehand ensures you can include good vocabulary and gram-
mar structures one already knows, as it can be difficult coming up with the proper use of
celui [‘this one’ – demonstrative pronoun], dont [‘of which/whose’ – relative pronoun] or
subjunctive on the spot.

Indeed, for the group of French students of whom this student was a member, the
typical converse scenario was described in these words:
We prepared a 2–3 minute conversation on a topic of our choice. We wrote the part of both
the teacher and student, had the teacher check our writing and made any changes as neces-
sary. We then had some time to practise our conversation with others in the class before
having it recorded with the teacher for assessment. (French 08)
The conversations were therefore effectively “rehearsed and learned off by heart”
and performed “as it was written” (French 10).
The notions of prior planning and rehearsal, although not commented on as
transparently as in this French context, appeared to be fairly typical. One student
(Spanish 13) reported having to “write a draft of things I could say beforehand.”
Another (German 29) “had to prepare and learn information about the environ-
ment.” Afterwards, students in this German context were required to “practise con-
versations with fellow students, and finally were tested by the teacher.”
Building on these operational requirements which effectively limited the ability
of the assessment to elicit samples of interactional proficiency, several students
commented that they believed an improvement to converse could be to include a
level of interactional reciprocity. That is, converse would work better if it were “less
structured in its delivery so it flows like a conversation, rather than like an inter-
view” (German 01), or “more freely spoken, not the question and answer type struc-
ture. That way the conversation would be like a proper conversation with a person
and not having to stop and think of an answer to the question asked” (German 03).
Indeed, the French student (French 08) who had admitted to the totally rehearsed
nature of assessments in her school argued that there should be “more emphasis on
the ‘conversation’ element, so that it isn’t just learning big paragraphs, but the ques-
tions go both ways, both participants dominate the conversation equally, etc.”
Allied to the view that reciprocity conditions should be part of the assessment
was the perspective that there should be a greater focus on fluency than on accuracy.
In particular, the requirement to account for particular grammatical structures was
often commented on as unnecessary or a hindrance. Two different groups of stu-
dents, one Spanish and the other Japanese, clearly brought out this perceived
disadvantage.
Among the Spanish group it was clear that converse required having to put in
“the appropriate structures” (Spanish 13). An improvement would therefore be “not
having to include certain structures as it makes it more difficult to have a fluent,
flowing conversation” (Spanish 11). As another student explained, “I thought that
the necessity to include structures increased pressure and anxiety as well as reduced
fluency as conversation was on including certain things and not on making the sen-
tence make sense. The conversation does not flow” (Spanish 14). This student con-
cluded, “make the focus of the standard on fluency and keeping to the context and
not on which and how many structures you can include.”
The same issues were reflected by several in the Japanese group. To succeed in
converse students had to “use a variety of language and good grammar structures”
(Japanese 26). However, the conversation should be “not marked on structures” but
“on how conversational it is and how you perform through that conversation”
(Japanese 26), or “not marked on the level of structure use, but on how you keep the
conversation going and confidence” (Japanese 22).
Several students in these Spanish and Japanese groups also admitted that the
somewhat unnatural contexts in which converse was being enacted made them feel
nervous in the assessment. The ‘test-like’ nature of a summative conversation with
the teacher was a contributing factor to its ‘unnaturalness’. As one of the Japanese
students put it:
You feel nervous because it feels under test constraints. Maybe the conversation standard
should focus on [being] more natural – it should not be about how much you speak – it
should be more about if you are able to adapt to a typical conversation. (Japanese 28)

In other words, “the conversation loses its casual atmosphere because it is based
on assessment conditions” (Japanese 21), meaning that “I do not speak as well as I
would in an everyday situation because it makes me nervous because it’s an assess-
ment” (Spanish 13).
One of the Japanese students (Japanese 25) admitted that, although “I did enjoy
the standard,” the student “was really really nervous about it!” The student went on
to explain, “I find it difficult to come up with ideas fast while trying to include lots
of level 8 structures and complex language, so I didn’t enjoy this aspect of it.” The
student concluded, “I don’t feel this form of conversation shows my ability at speak-
ing.” This idea was reiterated by one of the Spanish students (Spanish 11) who
admitted to feeling “nervous” because it was “hard to get the structures in.” An
improvement would therefore be if the assessment were to “have less of a test
emphasis” (Japanese 26).
The students in the Japanese and Spanish groups whose perspectives I noted
above were joined by several others who commented on similar issues. One French
student (French 06), for example, admitted to being “very nervous because I wanted
to include everything I prepared and show the variety in language used and the way
I spoke.” An improvement suggested by a number of students might therefore be to
include “more opportunities to do conversations as opposed to marks being decided
from one conversation” (Japanese 23) or to “be given more than one attempt”
(French 06). Teachers could “provide more opportunities to practise during the
year” (Spanish 14). For this student “one standard including conversing means it is
not a focus in the classroom as writing and reading are and conversing is possibly
the most important part of learning a language.” Additionally, the assessment could
“be done with another student to limit the stress factor” (German 29), or could
“allow students to converse with other students as they might feel more comfortable
and would perform better” (Japanese 25).

8.5 Student Survey Responses – Interact

It may be argued that the assertions made regarding converse provide some level of
justification, from the students’ perspective, for the changes that, in theory at least,
should be wrought through interact. The open-ended section of the 2013 survey
gave students who were the first to complete interact the opportunity to comment on
their experiences with the new assessment. Responses were received from 119 students work-
ing in a range of languages in different classes (Table 8.2). (As noted above, one
student did not complete Section I of the survey and was therefore excluded from
analyses of Section I.)
Table 8.2 Student survey participants (Stage II)

Class   Language   n    Type of school
A       French     4    Co-educational integrated school (a, b)
B       French     4    Co-educational state school (a, b)
C       French     15   Co-educational state school (b)
D       French     9    Girls' state school (b)
E       French     14   Co-educational state school (b)
F       French     13   Girls' state school
G       German     5    Co-educational state school (b)
H       German     7    Boys' state school (b)
I       Japanese   11   Co-educational state school (b)
J       Japanese   18   Co-educational state school (a, b)
K       Japanese   2    Co-educational integrated school (a)
L       Japanese   3    Girls' integrated school
M       Spanish    10   Co-educational state school (a, b)
N       Spanish    4    Co-educational state school (b)

Notes: (a) Respondents were drawn from separate classes of the same school; (b) Teachers from this school were also interviewed

The open-ended comments revealed, in common with the teacher interview data,
a wide range of perspectives on interact along with a wide range of operational
practices in different classrooms. In what follows, students’ comments are related
principally to the issues raised through the Stage II teacher interviews: the fluency/
accuracy tension – that is, differential interpretations of ‘spontaneous and unre-
hearsed’ and concern about the use of ‘appropriate’ (i.e., curriculum level 8) gram-
mar – and task variables. Additionally, interlocutor variables are explored. Finally,
washback is considered. For each of these issues, student comments reveal the
impact of the assessment on them.

8.5.1 Spontaneity Versus Grammar

Students in a range of classrooms reported an emphasis on interactions that "had to
be unrehearsed and spontaneous based on our knowledge" (A03). That is, students
“had to choose a topic then make up a conversation on the spot with another person”
(L103); they “weren’t allowed to script it” (F58) and therefore had to “speak spon-
taneously” (F57) and “easily bounce off each other during the conversation” (F62).
It was evident, however, that, although students may have “performed three sponta-
neous conversations over the year on separate issues/topics” (B05) or have had
“three conversations (unrehearsed with a partner)” (F52), these arose from material
“that we have studied throughout the year” (B05) or “topics we had covered in
class” (F52). Class F provided a useful example. In this class it was apparent that
“we had a lot of time in class (and out of class) to have spontaneous conversations
between ourselves and with the teacher” (F64). It appeared that these impromptu
conversations provided support for those interactions that became the assessments –
“we recorded most of these [impromptu conversations] and picked the best ones to
redo (e.g. topics we found interesting) and submit” (F64). Thus, in Class F it
appeared that spontaneous (non-assessed) interactions in class had become norma-
tive, and that these interactions formed the preparatory basis for the ‘real thing’ – the
assessed interaction.
In contexts where ‘spontaneous and unrehearsed’ appeared to be interpreted
quite literally, several student comments suggested that this aspect of the assess-
ment, coupled with specific grammar requirements, contributed to negative impact.
One student reflected:
I find the idea of a spontaneous conversation, but still having to include high level grammar,
very contradicting. It makes the standard very stressful and challenging, because I feel as
though I am not able to include high level grammar without planning in advance before-
hand. I did not find the standard enjoyable or rewarding. (F59)

As another put it, it was "difficult to have a spontaneous conversation as well as
using the level 8 vocabulary and phrases." As a consequence, "it made me very
nervous” (A01). In other words, “it was very nerve-wracking as you are unsure of
what is going to happen during the conversation or having to worry about using
specific tenses that may not work in the conversation” (A03).
The sense of having to ‘make up a conversation on the spot’ was thus sometimes
perceived as “so hard!” (F58) or “extremely hard” (L103) or “too hard” (A04), not
only because, in Student A04’s words, “we weren’t allowed to prepare with a fellow
class member, in all three we were with different people,” but also because “we
weren’t allowed to prepare conversations” even though students had to “use all
required structures that we were meant to.” Consequently, this “made it nerve-
wracking when we did it.” Student L103 put it like this: “I hated it. … I felt uncom-
fortable and very nervous. I was not prepared for it.”
In order to enhance positive impact and interaction in contexts where it appeared
that teachers were interpreting the spontaneity and lack of rehearsal requirements
quite literally, several students had formed the view that they would like to have a
level of prior preparation. It would, for example, be better if you could “make a
rough draft of what will be said so it is not so daunting” (A01). Alternatively or
additionally:
I think that the standard should allow us to have a partner from the class that we can work
with so that we are able to have conversations that flow and seem to be spontaneous [and]
that also we have had the time to practise with them so we aren’t stuck with nothing to say
when being assessed. (A04)

In other words, as Student L103 explained, performances might be enhanced if, first, "we are given time to write our conversations beforehand and memorise them
beforehand,” and, second, “if we are allowed to have preparation time with our
partner instead of finding out on the day who our partner is.”
In other classrooms, however, it was evident that more structured assessment-
related preparation work had preceded the interactions. That is, “we were given a
topic that we’d worked on for a couple of weeks in advance. The teacher gave us
prompts for the interaction in the form of questions or pictures to form the basis of
our conversation” (C13). Alternatively, “we were given a topic a few days before the
interactions so that we could become familiar with what we would be speaking
about” (D32). In the actual assessment “there was a spontaneous part to each of the
interactions.” For example:
The task was to be an expert on a panel about a topic related to the environment, however
we didn’t know what questions the teacher would ask us about our topic and we also had to
interact with the others on the panel without having known what they were talking about
beforehand. (D25)

A common scenario appeared to be emerging: students would be assessed "on a predetermined topic using language we had learned leading up to it," even though "the interaction itself was improvised" (G68); the interaction being performed "in an unrehearsed manner and without a script or cue cards to refer to" (G71); with students having been "given ample time to think about what we would say through the
weeks leading up to it” (E43). In these situations students were not so ‘put on the
spot’.
At the other end of classroom experience were reported instances of what
appeared to be significant prior preparation leading to the interaction. In Class N,
for each interaction “the teacher would tell us what we were to talk about and the
different language techniques [that] would be required. We would then have at least
a few days to research, study, practise what we could say.” Spontaneity was
accounted for by students being “able to apply this to a conversation with a ran-
domly selected partner” (N116). In Class I, students “were given a topic to discuss.
We were given time to research this topic and practise speaking to other people
before doing the assessment” (I80). Within the interaction students “had to use
grammar and vocab that we learnt in class this Year 13” (I81). Students could also
“practise the day before and write out possible questions and answers” (I74).
The scenario of significant pre-planning, arguably more a problem for converse
than interact, appeared to be repeated in other interact-focused classes, with stu-
dents being allowed to "plan a script" (K101) or to prepare by "writing out the conversation"
(J83). According to Student J89, however, although “when we recorded our conver-
sations we were not allowed to have scripts in front of us,” this level of prior prepa-
ration was thought to be justified:
I think it was good that we were given time to prepare because if we had to do it on the spot
I think it would have been difficult and the conversation would not have flowed nicely. Also,
if we had to do it on the spot, I believe that many people would have panicked and we would
not be able to converse to the best of our abilities.

Prior preparation was therefore seen, at least by Student J89, as a necessary pre-
requisite to a fuller and less anxiety-provoking demonstration of proficiency. In
other contexts, however, concern was expressed that prior preparation, alongside the
grammar requirement, diminished the perceived authenticity of the interaction.
Even though prior preparation “helped in forming ideas of what to say,” neverthe-
less “the amount of planning and rehearsing made it feel almost unnatural” (K101).
Additionally, “it wasn’t natural” that “I had to fit in certain grammar structures into
a conversation spontaneously” – this “really slowed it down and made it awkward”
(F55). Indeed, for the student who had commented on the ‘contradictory’ nature of
spontaneity versus grammar (F59), a significant improvement would be “removing
the grammar requirements.” In other words, it would be “better if higher grades
were not based on if you used all the Year 13 grammar or not” (F60). As Student
D29 put it, “I think it would be better if the interaction standard focused on more
colloquial language.”
Perceived lack of authenticity was apparent in the comments of several students
across a range of languages. For some, there was a perception that interact, by
virtue of being an assessment (with subsequent negative impact in terms of assess-
ment anxiety) became ‘artificial’ and failed to tap into natural and genuine conver-
sational ability that might have been in evidence outside of the assessment context.
Student M107, for example, asserted, “it makes me feel really nervous when I have
a forced conversation compared to if I was just casually speaking in or outside of
class because I know I am getting marked.” Another (G71) noted, “it made me feel
stressed and I never felt satisfied after the interact as my conversational German is
a lot different to the spoken German that came through in the interact standard.”
Another (C19) commented that, in the assessment context, “overall, I was nervous
and I felt like it didn’t show how I actually interact in French” because “I said what
I thought was wanted to be heard.” Outside of the assessment context, however,
“when I try to speak French with my friends I am more confident and I actually
express a point, like a developed thought.”
Two students of Spanish in Class N reiterated the point about how the assessment
context impacted on authenticity. Student N116 explained that, outside of the
assessment, “I often enjoy having the opportunity to speak Spanish with my peers
as I have some Latin American background which I am proud of, and therefore like
learning and practising the language.” However, the assessment context “does make
me feel a bit nervous having to try [to] prepare a relevant vocab for the conversa-
tions and then trying to get the required language use into the conversation.” Student
N117 observed that “under non-assessment conditions speaking with others in
Spanish was good practice and I felt more confident when speaking with native
speakers because of this practice” (my emphasis). By contrast, when transferring to
the assessment context “I felt this assessment was the hardest … [and] I found it
difficult to be accurate and include all the language requirements.”
Several noteworthy tensions therefore emerged around what the interact assess-
ment at level 3 appeared to demand – spontaneous, unrehearsed interactions that
nevertheless had to account in some way for curriculum level 8 grammar; the per-
ceived unreasonableness of the tension between these two apparently conflicting
requirements which led to a range of different classroom practices (some of which
replicated what would have happened for the converse assessment); and the concern
about the diminished authenticity of the evidence, particularly when gathered in
assessment conditions.
Tensions also emerged around the kinds of tasks students were asked to
perform.

8.5.2 Types of Task

The teacher interview evidence had underscored the crucial importance of the task
in encouraging not only the appropriate language but also positive interaction with
the task. The nature of the task and what students were being asked to talk about,
and the implications for positive interaction, also came through as important issues
for a number of students.
Several students across a range of languages and classes commented that they found that the tasks they were asked to engage in contributed to the sense of unnaturalness or lack of authenticity in the interaction. On this basis, Student D26 suggested
that interact could be improved by “involving more natural topics,” and “talking
about things more relatable to youth that have more interest for young people. …
maybe use current events from the news.” This student concluded that a greater
emphasis on content relevance “would have encouraged me more into the topic.”
Three others from different contexts provided similar suggestions: students should
“have more realistic topics to talk about. Things we would actually talk about, not
necessarily recycling etc.” (E43); or “topics more interesting for us – then it is easier
to talk freely – teenage related” (M111); or a “topic [that] could be applied to every-
day conversations such as school” (J85).
In turn, the lack of perceived authenticity in some topics led to less positive interaction with the task and made the task challenging to complete. Student D32
put it like this: “some things such as ‘current events’ were easier to talk about as you
could have a solid opinion rather than things such as talking about poems which
were more abstract and harder to form a conversation about.” Consequently, “mak-
ing the topics relevant and interesting would help with the ease of speaking.” As
Student G68 noted, “I enjoyed the chance to interact on topics we were interested in
and speak as if in a real situation.”
Allied to the limited interactional effectiveness of some tasks was the perception
that, in these less effective tasks, the opportunities for students to display their inter-
actional proficiency were hindered. Student C11 commented that “I found it diffi-
cult” when “the subjects did not interest me and I didn’t have a lot to comment about
the subject.” In this student’s view this “reduced opportunities to show what I could
say because I wasn’t excited to talk about that issue, nor did I know about it.” Several
students concurred. For Student D29, “the hardest part was thinking of things to
say.” This was because “some of the topics weren’t interesting to me at all and I had
never really thought about them,” or the topics “aren’t things I would usually talk
about in English which made it harder to think of things to say.” Student F60 simi-
larly argued, “I didn’t enjoy it because some of the topics we had to discuss … I
would struggle to have an in-depth conversation in English, let alone French.”
In contexts where “some of the specific topics were hard as we weren’t interested
in them,” an improvement in the assessment would therefore be “more opportunities
to talk about whatever we want” since “it was easier/better/made natural talking
about what we wanted” (C17). Additionally, in simulated situations designed to
provoke dialogue and difference of opinion, Student D24 “found it hard to talk
spontaneously in roles such as a board member when I had no real idea of what the
role would entail.” It would be “better if the roles were more genuine rather than
random contrived situations like an environmental panel,” and “easier if it was a
discussion on a topic and what our opinions of it were or us responding to a current
event.” Students should therefore “be given topics which we had a wider back-
ground knowledge of and which were more relevant to us” (D28).
Task relevance and task authenticity were therefore important issues for a range
of respondents. There was a perceived need for “making the topics relevant and
interesting” which “would help with the ease of speaking” (D32) and make the
interactions “more genuine” (D30). This student went on to suggest, “I think maybe
just talking casually about an article, a movie review etc… it’s more fluent and has
more spontaneity in the conversation.”

8.5.3 Peer-to-Peer Interactions

For the Stage II interviewed teachers, the perceived value and importance of peer-
to-peer interactions was largely implicit, and evidenced through teacher comments
regarding students’ ‘enjoyment’ of particular tasks (although Stage I data indicated
more transparent positive acceptance of peer-to-peer interactions). For several stu-
dents, peer-to-peer interaction was seen as a distinct advantage of the assessment,
with several reporting that interacting with a fellow student rather than the teacher
was “less intimidating” (E38, E40), leading to feeling “less nervous” (I74) or “more
confident” (I77). In turn this promoted better opportunities “to perform at my best”
(E33, E44) and “show my fluency in the language” (E44). Not only was the inter-
locutor “on the same wavelength as your partner” (E43), but also “my peer followed
and backed me up if I had forgotten an expression” (E42). For Student E42, this
scaffolding meant that, although feeling “nervous at the beginning,” this “greatly
eased off once I did the interaction.”
Student E35, however, raised a potential limitation to the effectiveness of peer-
to-peer interaction. Commenting favourably that “there was definitely less pressure
preparing for this standard than a speaking assessment with the teacher,” Student
E35 argued that this was “because we could control what would come up whereas
we cannot tell what a teacher might say.” In this student’s view, when peers could
prepare beforehand there was less need to be concerned about dealing with the
unexpected. However:
If the aim is to get students to speak in a real conversation … not in a rehearsal speech or
play, then I think that holding the conversation with the teacher would achieve that better
because the student cannot prepare word for word what to say next as they cannot predict
what the teacher will say. (E35)

For other students, the limitation identified by E35 was not an issue. By contrast,
when partners had not practised beforehand, the unpredictability of peer-to-peer
interactions was problematic (an issue also apparent from Stage I of the study).
Student N118, for example, “didn’t know what my partner was going to say, so I
had to pay close attention. It also made it harder.” Student B08 similarly asserted, “I
didn’t enjoy the feeling of not knowing what my partner might say and having
enough content to respond.”
It may be argued that, when the interlocutors do not know exactly what each one
will contribute to the interaction, this forces strategic competence into use, and is
therefore a valuable component of the spontaneous and unrehearsed nature of the
assessment. As Student E41 commented “I don’t think it gave an opportunity to do
your best all the time, because sometimes you are unprepared. But that’s the point
of it right? To see what you can do in an unfamiliar setting.” Nevertheless, there was
also a risk that interlocutor variables might become unfairly influential. As Student
N119 explained, in practice “a lot of how the conversation goes is reliant on the
other person.” This might make the interaction “enjoyable when it was with some-
one who spoke competently and confidently.” However, “when it was with someone
who doesn’t know much Spanish it was hard because they froze a lot and didn’t
always understand what you were saying.” The potentially negative influence of
interlocutor variables was brought out by Student D27 who asserted:
I found that you were also hindered if your partner was not up to the same level of French
as you or you weren’t up to your partner’s standard, therefore the interactions depended
greatly on your partners and felt like less of an individual assessment. (D27)

Interlocutor variables therefore “made one nervous as some people, through their
accents or attitude, make it hard to communicate with them” (M113).

8.5.4 Working for Washback

The student data presented above uncover a range of issues for the successful opera-
tionalisation of interact at NCEA level 3, with implications for the lower levels. At
the start of this chapter I also presented teachers’ perspectives on how successful
realisation of the intentions of interact was dependent on the experiences of stu-
dents throughout their learning programmes. A number of student comments reiter-
ated this concern.
Several students remarked on the benefit of the cumulative effect of completing
interactions for assessment purposes in the course of the year. Student C14, for
example, noted that “when we first did it, it was quite scary because I didn’t know
what to expect, but after a few times, it was a lot better.” Student I74 similarly
asserted, “at the first few interactions, I was very nervous and was frantic [about]
what to say, but later on I got used to it.” Likewise, Student M114 observed that,
with increasing exposure to opportunities to interact, “I became more confident, and
by the later conversations I really enjoyed speaking in Spanish and believe I was
speaking confidently.”
Other comments went beyond the assessment and suggested the benefits of
exposing students to more opportunities to speak and interact in classrooms than
they may currently be receiving, and of integrating interact considerably more within on-going teaching and learning programmes. There was, for example, a need
to “speak more and do more conversations with topic you like” (M108); or to “give
more opportunities to do interactions” (I76); or to have “more practice opportunities
like small speaking groups in class even when there’s no assessments coming up”
(I80).
With regard to greater classroom emphasis on interacting, Student D30 asserted,
“I found the [NCEA level 3 curriculum] it’s mostly writing/reading exercises and
not actually conversing which is what a lot of people need to improve.” Student C21
put it like this: “personally I think it needs to be more integrated (maybe it was just
our class). For us it was like we were doing writing work the whole time, then the
day before we were given a chance to practise [an interaction].” C21 concluded that,
for interact to be successful, “I think it needs to have more regular conversations. I
found it hard because between these long periods of not speaking I’d lose my
fluency.”

8.6 Conclusion

As regards the students’ perspectives on interact in comparison with converse, two key findings arose from the quantitative data. First, it appeared that, on average,
students rated converse and interact essentially the same on all measured aspects of
the test usefulness construct. That is, they perceived no real difference between the
two assessments in terms of usefulness. Second, students’ opinions were very much
divided over the efficacy of both converse and interact. That is, for every student
who rated converse or interact highly in different respects, there would be students who
rated them correspondingly poorly (see Figs. 8.2 and 8.3).
The range of experiences presented through the open-ended comments sheds
some light on the diversity of perceptions. It must of course be acknowledged that
the experiences of these students are not necessarily typical or generalisable – the
sample sizes and the range of schools are too small for that to be the case.
Furthermore, it must be conceded that the survey presents students’ perceptions of
the two assessments. What one student in a given context may have perceived to be
the case may be quite different to what another student in that same context might
have perceived. That is, perceptions were necessarily influenced by a range of vari-
ables, several of which may not have been connected directly to converse or interact
as assessments. Nevertheless, the open-ended comments do help to shed some light
on students’ perceptions of the efficacy or otherwise of converse or interact. As two
snapshots of students’ perceptions (converse in 2012 and interact in 2013), there
was remarkable symmetry around several key issues.
Bearing in mind that the group of students commenting on converse were not
doing so in comparative terms (that is, converse was the assessment of conversa-
tional proficiency for which these students were being prepared), several limitations
to converse as an assessment format emerged from the open-ended comments which
arguably support the theoretical rationale for, and the actual implementation of,
interact. Converse:
• encouraged both rote-learnt responses and artificial language use. These require-
ments made the assessment ‘test-like’ and stressful;
• would be better if it promoted more ‘natural’ reciprocal interactions with a
greater focus on fluency than on accuracy;
• should be less ‘test-like’, with more than one opportunity to complete and the
opportunity to work with peers.
With regard to interact, the open-ended comments revealed a range of perspec-
tives. A number of students across several languages and classrooms commented
that they ‘enjoyed’ the assessment, or that they found it ‘easy’. At one end of the
spectrum of experience, then, were those who, in the words of one student, “did
enjoy the interact standard as it gave me a chance to really use what I was learning
in class and really test myself” (B06), making it, in the words of another, “my favou-
rite internal” (E46). Conversely, there were students who reported that they ‘did not
enjoy’ the assessment, or found it ‘hard’ or ‘difficult’, or were ‘nervous’. At the
other end of the spectrum of experience, therefore, were those who, as one student
put it, “found it a very stressful and nervous experience … hardest standard I have
ever done” (C23).
A largely implicit discourse that informs the students’ perspectives (and that is
more clearly apparent from the teacher interviews) is that interact is a high-stakes
assessment, and students wish to perform well on it. It appeared that, for those who
believed that the assessment gave them the opportunity to display their proficiency,
the assessment was well received. For others there were challenges, exacerbated by
key elements such as ‘spontaneous and unrehearsed’, and the perceived necessity to
account for high level grammar structures (despite their teachers recognising, in
theory at least, that this was of lesser importance). In turn, several comments noted
the ‘more unnatural’ and ‘less proficient’ interactions provoked in the assessment in
comparison with those that might have been occurring naturally and spontaneously
in non-assessed contexts. Additionally, inaccessible task types and interlocutor vari-
ables contributed to negative impact and interaction for some respondents. There
was a perceived need for more opportunities to interact in class, whether or not these
interactions were for assessment purposes.
Comments by Student D25 neatly encapsulate several of the tensions raised by
the students. This student asserted that, on the one hand, “overall I think the interac-
tion was good,” and “sometimes it was enjoyable when you were confident about
the topic or got into the swing of it and just said anything, like you were having a
real conversation.” On the other hand, “I thought some of the tasks were hard … if
the interactions were more everyday conversations I think it would be better as it
was difficult to sometimes speak to the style that was expected.” In terms of the
perceived expectations of the assessment, Student D25 asserted that, with regard to
grammar, “it was difficult to incorporate the right level of French sometimes and
easier to use simple French.” As regards ‘spontaneous and unrehearsed’, the student
noted:
Although there is meant to be an element of spontaneity, I think this needs to be more controlled and that it is important for teachers to remember we aren’t fluent in French and
that expecting us to be able to run with any variable is not very realistic and will not allow
us to show off our best French, particularly when we are also just getting used to that topic.

In Student D25’s context, however, improvements to performance could arguably be made “maybe if more speaking practice was done, e.g., lots of classes right
at the beginning of the year to boost confidence and exposure to speaking as opposed
to having to just do the interactions.” Ultimately, “the fact that it was assessments
made it seem hard and like you were being judged, so you sometimes couldn’t think
or would make mistakes you normally wouldn’t make.”
In essence, the range of data that I have presented in this chapter, considered
alongside the story that has emerged from the preceding three chapters, leads to
several questions that require final consideration: is interact useful and fit for pur-
pose as an assessment of spoken communicative proficiency? In comparative terms,
is interact more or less useful than the assessment it has replaced? Which facets of
usefulness are more strongly in evidence than others? Teacher and student perspec-
tives also lead to the conclusion that there is perhaps, for several reasons, a need to
revisit the fundamental assumptions of interact as stated at the start of Chap. 1 – that
students’ spoken communicative proficiency will be primarily determined through
the collection of a range of genuine student-initiated peer-to-peer interactions as
they occur in the context of on-going classroom work throughout the year.
Considered more broadly, which kind of assessment, single interview or paired/
group, one-time static or on-going dynamic, lends itself better to the valid and reli-
able measurement of FL students’ spoken communicative proficiency? Most impor-
tantly, what recommendations, arising from the data, can be presented for making
assessments of spoken communicative proficiency as useful as possible? These
questions are addressed in the concluding chapter.

References

Bachman, L. F., & Palmer, A. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford, England: Oxford University Press.
Cheng, L. (1997). How does washback influence teaching? Implications for Hong Kong. Language
and Education, 11(1), 38–54. http://dx.doi.org/10.1080/09500789708666717
East, M., & Scott, A. (2011). Working for positive washback: The standards-curriculum alignment
project for Learning Languages. Assessment Matters, 3, 93–115.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241–256.
http://dx.doi.org/10.1177/026553229601300302
Chapter 9
Coming to Terms with Assessment Innovation:
Conclusions and Recommendations

9.1 Introduction

In Chap. 1 I presented a typical process for the development of a new assessment. The process begins with an initial conceptualisation of what the assessment will aim
to measure, that is, a theoretical construct that will guide and influence the assess-
ment development. This theoretical construct represents the ‘ideal’, the founda-
tional understanding of what the assessment developers, whoever they might be,
consider to be the dimensions of competence of which we need evidence. Building
on that theoretical foundation, assessment developers attempt to design assessments
that will adequately and fairly measure the facets of the construct that the assess-
ment developers consider to be crucial. Part of the process will be considerations of
how best to elicit the evidence considered necessary, including the format of the
assessment (e.g., static or dynamic; nature of the task; conditions of assessment). As
Bachman and Palmer (2010) make clear, this initial process is typically carried out
with the best of intentions, and the motivation to create meaningful and useful
assessments that will capture evidence of the construct of interest in valid and reli-
able ways.
Bachman and Palmer (2010) go on to argue, however, that creating an assess-
ment at the theoretical level, even with the highest motivations and the most robust
of arguments, is quite different to enacting that assessment in the real worlds of
teachers and students. Once an assessment begins to be put into use, there is always
the possibility that its use will either not lead to the intended consequences, or will
lead to unintended consequences that may have negative impact on stakeholders.
The interface between the assessment and the real worlds of its users is where the
rubber meets the road. It is an interface fraught with challenges.
The introduction of a new assessment of spoken communicative proficiency in
the New Zealand context – called interact – provides an interesting example of the
tensions inherent in introducing new assessments. Interact was proposed with the
best of intentions. However, relatively early in the process of development, and before interact had begun to be put into use, there were warning signs that teachers
were apprehensive about the proposal (East & Scott, 2011a, 2011b). As the assess-
ment began to be rolled out in classrooms, beginning in 2011, teachers’ anxieties
and concerns about interact continued to surface, albeit intermingled with com-
ments that supported the new assessment. At the anecdotal level, it seemed that, in
practice, interact was generating a range of responses, both critical and favourable.
Ultimately, interact can only be considered useful or fit for purpose to the extent
that it does promote the collection of useful and meaningful evidence of students’
spoken communicative proficiency. What is required is validity-supporting evi-
dence that this is the case. The 2-year study that has been documented in this book
was an attempt to investigate, in a robust and coherent way, both teachers’ and stu-
dents’ perspectives on the effectiveness of the new assessment in the early stages of
its implementation, with a view to contributing to a validity argument. As I stated in
Chap. 1, the fundamental questions I aimed to address were these: What are teach-
ers and students making of the innovation? What is working, what is not working,
what could work better? What are the implications, both for on-going classroom
practice and for on-going evaluation of the assessment? This concluding chapter
draws on the data to address these questions.
The chapter begins with an overview of the essential theoretical drivers for inter-
act. A summary of findings from the study is then presented in relation to percep-
tions of usefulness (Bachman & Palmer, 1996). Findings are then discussed in the
light of the theoretical issues raised in Chaps. 1, 2, and 3. The chapter continues
with a discussion of the tensions between static and dynamic models of assessment,
and the implications for assessments such as interact. Recommendations for
enhancing assessments of spoken communicative proficiency are then made. The
chapter concludes with a discussion of the limitations of the present study and the
kinds of future research that are necessary to move forward our understanding of
effective (i.e., useful and valid) assessments of spoken communicative proficiency.

9.2 Theoretical Underpinnings of Interact

Building on a sociocultural view of learning, the introduction of interact was intended to move the focus of spoken assessment away from the high-stakes nature
of a ‘one-time testing event’ (as operationalised through the former converse assess-
ment) and towards viewing the assessment as fundamentally integrated into on-
going classroom work (Gipps, 1994; Poehner, 2008; Torrance, 2013a). In turn, this
would encourage assessment for learning, a key assessment concept encouraged by
New Zealand’s Ministry of Education (2011). Within that paradigm there would be
opportunities for feedback and feedforward that might enhance students’ future per-
formances (ARG, 1999, 2002a, 2002b; Hattie, 2009; Hattie & Timperley, 2007).
Through interact, a sincere attempt was being made to move students away from
assessments of spoken proficiency that had effectively often become one-sided pre-
learnt and inauthentic ‘conversations’. The move was intended to be towards genu-
ine, authentic and natural interactions that would tap into, and provide evidence of,
students’ proficiency benchmarked against a broader spoken communicative profi-
ciency construct.
The move signalled by interact was also designed to reflect, and therefore to
encourage, a teaching and learning context where the emphasis, on the basis of a
revised national curriculum for New Zealand’s schools (Ministry of Education,
2007), had become communication. In other words, the New Zealand curriculum
for languages was designed to promote real language use in the classroom, built on
the premise that students learn how to communicate through interaction in the target
language (Nunan, 2004; Willis & Willis, 2007). There is a sense in which the whole
modus operandi of interact, and arguably the perspective held by those responsible
for proposing and developing interact, was to encourage assessments that would
move students towards the crucial goal of effective interaction.
Seen more broadly, the introduction of interact was intended to reflect the aims,
intentions and theoretical foundations of communicative or proficiency-oriented
approaches to language teaching and learning that have now become firmly embed-
ded in classroom practices in many contexts worldwide. These approaches focus on
drawing on knowledge of a language with a view to actually using that language in
authentic communication with a wide variety of people in a range of different con-
texts (Hedge, 2000). In other words, where language is viewed as “interactive com-
munication among individuals,” interact, in theory at least, was introduced in order
to encourage “meaningful, authentic exchanges” and “the creation of meaning
through interpersonal negotiation among learners” (Brown, 2007, p. 218).
A further driver for interact may be derived from a theoretical perspective that
peer-to-peer interaction is thought to be beneficial to students’ language acquisition.
As Philp et al. (2014) argue, the theoretical underpinnings of this thinking are both
cognitive (e.g., Long’s [1983, 1996] interaction hypothesis) and sociocultural (e.g.,
Vygotsky’s [1978] zone of proximal development). Seen from a cognitive perspec-
tive, the development of language proficiency is enhanced by face-to-face interac-
tion and communication. Seen from a sociocultural perspective, collaborations and
interactions are designed to help learners to move from one level of proficiency
(what they can only do with help and support) to a higher level of proficiency (what
they can eventually do independently). The end-goal, whether viewed cognitively or
socioculturally, is automaticity in language use (De Ridder, Vangehuchten, & Seseña Gómez, 2007; DeKeyser, 2001; Segalowitz, 2005) – students are ultimately
able to undertake interactions with speakers of the target language, not necessarily
faultlessly, but certainly with a degree of independence and fluency.
The validity and reliability of interact as an assessment are therefore bound up
with the extent to which its introduction and use, both as an assessment and via its
washback, facilitates, encourages and captures genuine evidence of FL students’
spoken communicative proficiency. The above theoretical arguments suggest that
interact is a valid means of doing this.

9.3 Summary of Findings

9.3.1 Overview

The study reported here utilised a mixed-methods design, drawing on largely quali-
tative data that were complemented by a level of quantification (Lazaraton, 1995,
2002). Quantitative data were principally elicited from several surveys: a nation-
wide teacher survey in 2012, and two small-scale student surveys in 2012 and 2013.
Qualitative data were elicited from open-ended sections of the three surveys and
teacher interviews that took place in 2012 and 2013. Published documents were
used as additional and complementary data sources.
The closed-ended sections of the surveys provided the opportunity to measure
teachers’ and students’ perceptions against facets of Bachman and Palmer’s (1996)
six qualities of test usefulness (construct validity, reliability, authenticity, interac-
tiveness, impact and practicality). The open-ended sections provided scope for
respondents to express views on the assessments in operation. In Stage I these opin-
ions were subsequently interpreted through the test usefulness lens. In Stage II
respondent perceptions were drawn on to throw light on issues that had emerged as
crucial in Stage I. In the teacher survey, teachers’ perspectives were sought on inter-
act in comparison with converse. The student surveys focused on either converse or
interact, depending on which assessment the students had been working towards.
The teacher data revealed that, overall, the reception of the new interact assess-
ment was positive: interact was a definite improvement on converse. In terms of
usefulness, there was a perception that interact would promote assessments that
were significantly more valid and reliable, and significantly more authentic and
interactive, than converse. Respondents also appeared to believe that students per-
ceived interact to be a better measure of their spoken communicative proficiency.
These findings lend support to the introduction of interact as a useful (and therefore
valid) assessment.
On the minus side, interact was viewed as significantly more impractical than
converse. Also, on average, respondents did not see any difference between the two
assessments in terms of students’ stress, that is, each assessment was considered to
generate a comparable level of stress for candidates. Two comments are pertinent
here. First, a comparison of strengths of responses (Fig. 5.3) does suggest that inter-
act might marginally outperform converse here (although the differences are not
significant). Second, the open-ended survey data, as noted in Chap. 5, revealed that
a number of respondents made comments to suggest a perceived (and positive)
reduction in stress for students when taking interact in comparison with converse.
It appeared that principal language taught did not make a difference to teachers’
perceptions. There were, however, significant differences in perception depending
on whether the teacher reported using interact at the time of the survey. Using the
assessment was a contributing factor in perceiving it more favourably, although it did not diminish the challenges of its operationalisation.
The student data (two independent groups) revealed a somewhat different pic-
ture: neither assessment was perceived as being better or more useful than the other.
Also, the range of responses indicated that, whether converse or interact was being
considered, students differed considerably in their perceptions. That is, for one stu-
dent who considered interact (or converse) to be highly useful, there would be
another who considered it to be hardly useful at all.
Taking all the data into account, it would seem as if, at first sight, there are two
parallel but contrary discourses at work among the teachers. These contrary dis-
courses may serve to shed some light on the range of views held by the students. On
the one hand, there is a discourse that highly favours interact as an authentic and
valid measure of FL students’ spoken communicative proficiency, leading to posi-
tive impact and interaction for many students. On the other hand, there is a contrary
discourse that questions the usefulness of interact because it is highly impractical,
has negative impact on some students, and cannot really live up to the expectations
for ‘spontaneous and unrehearsed’. In this respect, the assessment not only fails to
capture instances of students’ actual spoken communicative proficiency but also
leads to student stress and anxiety. Indeed, some teachers considered the demand to
be spontaneous and unrehearsed as unrealistic. In other words, evidence of the out-
comes of a focus on fluency is arguably insufficiently captured through the way in
which interact is operationalised in some schools. These findings bring into ques-
tion the arguably superior validity of interact in comparison with converse.
Analyses of the Stage II interviews and surveys focused in particular on three
aspects of the reform: the nature of the task; issues around spontaneity; and the
place of accuracy. The data reveal a variety of understandings and practices which
may also shed light on the range of student perceptions that the student data had
revealed.
In summary, then, and interpreted through the lens of test usefulness, interact is
perceived as more useful than converse in terms of validity, reliability, authenticity
and interactiveness. It is seen as less useful in terms of practicality. Impact may be
seen as positive with regard to the kinds of interactions students may engage in.
However, findings as regards students’ stress are ambivalent. On the one hand, it
seems that interact really makes no difference in relation to student anxiety – a test
is a test, after all. On the other hand, open-ended comments suggest some lowering
of anxiety by virtue of the interact format.
A number of key issues emerge in the light of the arguments I presented in Chaps.
1, 2, and 3 about the effective implementation and operationalisation of valid and
reliable assessments of FL spoken communicative proficiency. These key issues
serve to substantiate perceptions of usefulness as revealed in the data and provide
further contributions to the validity debates.

9.3.2 Positive Dimensions of Assessments Such as Interact

The evidence certainly suggests that, in comparison with one-time summative teacher-led interview tests such as converse, the on-going peer-to-peer interactions
anticipated by assessments such as interact have clear advantages. They appear to
facilitate a more genuine reflection of what students can do with the FL in a range
of TLU domains, whether in the assessments themselves or potentially in future
real-world interactions (Bachman & Palmer, 1996; Galaczi, 2010; van Lier, 1989;
Weir, 2005). Assessments such as interact are also able to measure a broader con-
struct of communicative proficiency than one-time interviews (Brooks, 2009;
Ducasse & Brown, 2009; Jacoby & Ochs, 1995; May, 2011). This includes the
opportunities to tap into all facets of a communicative proficiency construct (Canale,
1983; Canale & Swain, 1980; Martinez-Flor, Usó-Juan, & Alcón Soler, 2006;
Roever, 2011). In this regard, narrower interview type tests arguably under-represent
the construct of interest and are therefore less valid measures of the construct
(Messick, 1989).
With regard to positive impact, the open-ended evidence suggests that, on the
whole, students are more relaxed and less stressed (Együd & Glover, 2001; Fulcher,
1996; Nakatsuhara, 2009; Ockey, 2001). When this occurs, this seems to be related
to the practice effect of multiple assessment opportunities. Impact is also enhanced
by the opportunities to collect a range of evidence over time and to present the best
evidence in a summative portfolio (East, 2008; Sunstein & Lovell, 2000). Positive
interaction is heightened by developing tasks, perhaps in consultation with the stu-
dents, that students perceive as relevant and interesting, something they both want
to talk about and have some knowledge to talk about, thereby promoting “meaning-
ful language communication” (Norris, 2002, my emphasis) that will elicit the most
meaningful instances of proficiency (Leaper & Riazi, 2014).
Additionally, assessments such as interact arguably promote more positive
washback than converse, because the paired and peer-to-peer format either reflects
the extensive use of pair/group work that is already happening in communicatively
oriented classrooms, or will encourage more such work (Együd & Glover, 2001;
Galaczi, 2010; Swain, 2001; Taylor, 2001). As a consequence, the goals of the
teaching programme and the goals of assessment become more integrated (Morrow,
1991; Smallwood, 1994). In the words of Swain (1984), assessments such as inter-
act are able to ‘bias for best’ and ‘work for washback’ in several ways. That is, if
washback is defined as “an active direction and function of intended curriculum
change by means of the change of public examinations” (Cheng, 1997, p. 38), antic-
ipated changes at the curriculum level can arguably be brought about by changing
the assessment. After all, as Buck (1988) argued, when the assessment is seen as
important to the students, and pass rates are perceived as important to the teachers,
teachers will naturally tailor what they do in their classrooms to the demands of the
assessment.
Drawing on the survey data, a positive outlook around the kinds of assessment
promoted through interact is arguably well articulated by the comments of two
French teachers:
• Students enjoy working together, encourage each other to do well. It is in their
interests to try hard and to work co-operatively. (French 047)
• I like the idea of spending more time conversing in authentic type situations. The
students’ skills did improve. (French 070)
Another French teacher (French 120) asserted, “students have totally embraced
the interact standard and have enjoyed the opportunity to be creative … and to talk
with their peers in a stressless environment. Talking with the teacher has almost
disappeared.” One Japanese teacher (Japanese 138) noted, “students responded very
well. … [The] portfolio work where they could submit work they were happiest
with meant they were highly motivated to achieve. They enjoyed interacting with
each other.” This teacher went on to explain, “my lessons are now far more focused
on communication skills … and thus far more enjoyable. … [The] emphasis has
totally changed from [the] old standard which was rote learnt.”
In terms of the student perspective, Student D24 commented, “I thought that it
was a really good opportunity to actually speak in French, and I’m glad I did it oth-
erwise I wouldn’t have known what I could actually do.” Another noted, “it did give
me a sense of accomplishment and increase my confidence in speaking Japanese”
(I76). Student E45 asserted, “I really enjoyed the interact standard. It was not at all
stressful or hard, and did not make me anxious at all. I found it was a good oppor-
tunity to demonstrate my speaking ability.”
All of the above arguments, from both the literature and the data, lend consider-
able support to the introduction of peer-to-peer assessments such as those antici-
pated by interact. However, despite what are arguably significant gains as a
consequence of introducing interact, there are also several drawbacks.

9.3.3 Negative Dimensions of Assessments Such as Interact

The data reveal concern about the differential impacts of several variables. There
was a perceived risk, among both teachers and students, that interlocutor variables
(who was paired with whom) would mean that it was not always possible for the
students to demonstrate their full proficiency (Foot, 1999; Fulcher, 2003; Leung &
Lewkowicz, 2006; O’Sullivan, 2002), with subsequent implications for the validity
and fairness of the assessment (Galaczi & ffrench, 2011). Among the students, task
variables also sometimes hindered positive interaction (Bachman & Palmer, 1996).
That is, some students reported that, in their perception, several tasks they were
asked to complete did not interest them or were perceived to be irrelevant.
Additionally, several student responses indicated that, in their experience, when
interact was operationalised as a series of ‘assessment events’, there was in fact
minimal washback into classrooms, and other skills (such as writing) tended to
dominate classroom practice. There was also a perception that enacting interactions
in this way (i.e., as ‘tests’) disauthenticated the interactions and enhanced a sense of
stress that was not necessarily present when interacting with others outside the
assessment context. It was as if, as Shohamy (2007) put it, real knowledge could not
be expressed in the assessment.
A strong message weaving its way through the data was that interact, just like its
predecessor converse, was, at the end of the day, a high-stakes assessment, and
needed to be treated as such. As a consequence, and taking into account both
interlocutor and task variables, students were going to feel stressed and anxious
regardless of the format because they would inevitably wish to do as well as possi-
ble (Brown & Abeywickrama, 2010; Shohamy, 2007). In that respect, interact could
confer neither advantage nor disadvantage over converse.

9.4 Static or Dynamic: A Fundamental Problem

9.4.1 Is Interact a Test?

Perhaps the largest issue facing interact going forward is that there is a tendency for
stakeholders to approach interact as if it were a test. Teachers do not yet appear fully to
understand, or are somewhat reticent to embrace, the socioculturally-informed per-
spective that students should be allowed to “bring samples of their real language-use
activities for assessment,” even though teachers do support the idea of “offering a
selection of tasks in formal, organised tests” (Luoma, 2004, p. 103). There is a sense
in which the assessment is trying to be innovative, but is currently constrained, at
least in that it may be largely replicating the one-time test format in many
contexts.
When the perceptual (and perhaps actual) emphasis of interact is on a test-like
format, it is not surprising that several negative perceptions ensue. Among them:
• teachers in the national survey saw interact as significantly more impractical
than converse – after all, it requires the collection of at least three times the
amount of evidence, and throughout the year (this was by far the largest single
reported disadvantage);
• there was no perceived difference between the two assessments in terms of stu-
dent stress (at least as measured quantitatively);
• some teachers viewed negative impact in unequivocal terms (‘ridiculous’ and
‘unrealistic’);
• teachers and students were continuing to resort to practices (such as pre-rehearsal
and scripting, and forcing unnecessary grammar into use) that threatened to
undermine the very purposes of interact.
As for the student data that indicate no significant differences in perception
between the two assessments, this may simply be a reflection of Bachman and
Palmer’s (2010) argument that no single best test exists, and that what suits one set
of candidates may not suit another. Of more concern, however, is interpreting this
finding in the light of Gardner, Harlen, Hayward, and Stobart (2008) who suggested
that “[o]ne of the most common reasons for ‘no-difference’ or even negative find-
ings for the impact of innovation in education is that the intended changes are not
properly in place” (p. 12). Seen from a more formative or dynamic assessment per-
spective, it appears that interact has not (yet) gone far enough. There is a ‘culture
clash’ between what interact could be within a more formative assessment for learning
model and what teachers (and students) currently perceive as ‘good assessment
practice’. Thus, the principle of embedding the assessments seamlessly within on-
going work was largely eschewed or not understood by teachers. Margaret, for
example, despite her clear argument that interactions should arise out of the teach-
ing and learning programme (see Chap. 8), illustrates a perspective that an approach
that embeds the assessment seamlessly would not be acceptable. She asserted:
I did hear that ‘oh, you just glide by with the voice recorder, drop in for 30 seconds and
listen to a couple and then move on down to the next’. … I don’t subscribe to that and I do
think it’s a big assessment event in the sense that it’s worth five or six credits of the year,
and three or four assessments … Personally I wouldn’t do it that way.

Peter’s perspective (reported in Chap. 6) clearly brings out two negative corollar-
ies for students of continuing to focus on interact as an assessment event. In his
words, students “want to know, ‘am I being assessed on this? Is this an important
one?’” When they realise that the interaction will contribute to assessment, they
prepare what they are going to say and effectively learn a script. Students’ desire to
perform at their best leads them to over-prepare. As a consequence, the interaction
no longer sounds natural, and students’ marks are affected because the interaction is
not spontaneous. As other teachers in the survey put it, “some still insist on writing
out a script and memorising” (French 022), or “practising, rehearsal and memorisa-
tion (still) dominant” (Japanese 085). In turn, and when the construct of interest
includes the ability to respond spontaneously, over-preparation and over-rehearsal
effectively introduce construct irrelevant variables (Messick, 1989). These variables
compromise the evidence available of interactional proficiency. Ultimately the stu-
dents do not do as well as they might have done if the focus had not been on ‘the
test’.
The above limitations do not necessarily invalidate interact. They do, however,
lead to problems in practice that require mediation. The blame for the on-going
perception of each interaction as an assessment event cannot be laid solely on the
teachers who may be holding onto the way things were when converse was in opera-
tion. It is not just a question of an historical overhang from the bygone days of
norm-referenced summative examinations or the more behaviourist-oriented
teacher-student interview test. It is also a question of how the current NCEA is pre-
sented to teachers. The ways in which the assessment conditions are framed by the
authorities (e.g., notification of the assessment event in advance; clear task brief;
requirements for moderation) underscore accountability and measurement perspec-
tives that sit more comfortably within a static assessment of learning context. It
seems that the authorities, whilst in principle encouraging ‘samples of real language
use’, in practice encourage a testing model, not necessarily deliberately, but conse-
quentially by virtue of the conditions surrounding internal assessment. (In this
regard, Alison’s suggested approach, as noted in Chap. 7 – that she would choose
next year to set generic tasks, not to tell the students when the assessments were,
and to collect a whole range of spontaneous and unrehearsed evidence – is described
by her as “not politic”, that is, not following the rules, even though her goal was
laudable, to help her students to “feel comfortable with what they are doing.”)
Peter, as recorded in Chap. 6, summarises the dilemma well. On the basis that
students need to have “fair conditions of assessment,” it is necessary to tell the stu-
dents when an assessment event is coming up, with the corollary that students “need
to do their very, very best in this.” Consequently, interact becomes an “assessment
circus” – it becomes just like converse used to be, except that there is a requirement
to have at least three pieces of evidence rather than just one.
One student perspective neatly encapsulates several of the tensions. On the posi-
tive side, the student noted:
I believe the interact standard gave people the opportunity to have a legitimate unscripted
conversation … The unscripted nature of it allowed students to demonstrate their ability to
speak normally and develop sentences on the spot, which is a realistic, practical use of the
language. (E35)

Nevertheless, the student observed that the benefit to students of demonstrating truly spontaneous interaction was “only if they wanted that to happen.” That is:
Those confident enough to go unscripted still wrote everything down sentence for sentence
and essentially staged a conversation. … At the same time some of the Year 13 language
that we were required to incorporate at times felt slightly unnatural in a genuine conversa-
tion, as normal conversation usually doesn’t require complicated grammar structure, for
example.

This student concluded that these tensions “defeated the point of the standard.”
The data leave us with a sense that, although much has been achieved by the
introduction of interact, there remains much still to be done in terms of helping
teachers and students (and perhaps the powers that be) to move beyond a conceptu-
alisation of interact as a dedicated assessment event, and to view interact as some-
thing that might become fully integrated into the teaching and learning programme.
The tensions essentially appear to represent a huge trade-off between authenticity
and assessment. As lead teacher Monika put it, “this is where reality meets ideal-
ism.” Monika went on to reflect, “I mean, it’s like anything that you start, you know,
you write it and it’s only in practice that you realise ‘ok this is misinterpreted, this
is not how the people who wrote it meant it’.” In Monika’s view there was a genuine
dichotomy “between high-stakes assessment and the freedom to experiment.” It was
also, it seemed, a genuine challenge to get teachers, and their students, to shift from
a high-stakes testing mindset to an assessment for learning mindset. As Celia put it,
it is still the case that “teachers are looking at the interaction like it’s an exam.” In
Naomi’s words, the teacher therefore “needs to make the shift from the old standard
to the new one.” That is, moving into the future, “I think the biggest issue is going
to be the teachers who are not prepared to shift their thinking.” As Poehner (2008)
notes, teachers often lack familiarity with the theory and principles that inform
assessment practices. It is therefore not surprising that teachers struggle with enact-
ing a more dynamic or formative assessment model.
In turn, and to use Bachman and Palmer’s (2010) argument, despite the ideals
and best intentions of the assessment developers, the reality is that the assessment is
not necessarily leading to the anticipated consequences, and is even leading to unin-
tended consequences that are proving to be detrimental to some stakeholders.
We are left with a fundamental problem for interact, and for assessments like it.
That fundamental problem is the tension between two different and potentially
irreconcilable assessment paradigms: the ‘assessment for/of learning’ or
‘dynamic versus static’ dichotomy. As I explained in East (2008), and noted in
Chap. 2:
The big dilemma is that the two assessment paradigms are not mutually exclusive. We can-
not say that either one is ‘right’ or ‘wrong’, ‘better’ or ‘worse’. They are just different, and
based on different assumptions about what we want to measure. As a result, there is tension
between them and often an attempt to ‘mix and match’, with assessment for learning some-
times taking the dominant position in the arguments, and with the assessment of learning
staking its claim when there is a feeling that its influence is being watered down. (p. 10)

Perceiving the tension in dichotomous terms belies the complexity of the situa-
tion, however. I argued in East (2012):
One way of viewing assessment is as a continuum, with classroom-based activities that
provide opportunities for formative feedback at one end, and formal ‘high-stakes’ testing
and examinations at the other. Conceptualising assessment as a continuum allows for a
range of assessment activities that may, in different ways and at different times, meet the
demands of different stakeholders for performance information that is meaningful and use-
ful. (p. 165)

It is certainly clear from the data collected in this study that teachers view inter-
act somewhat differently depending on where, in their perception, it might sit on the
continuum of assessment (even though the inevitable tendency, in view of the
requirements of NZQA, is to err on the side of a high-stakes accountability under-
standing). There also appears to be genuine confusion about what different concep-
tualisations of assessment mean for actual practice.
Before moving on to make some recommendations that might serve to strengthen
and clarify the expectations of assessments such as interact, perhaps the more
fundamental issues at stake are these: What is the goal of the assessment? What do we
want to know by virtue of the assessment data? The question of how best to achieve
the goal is arguably unanswerable until we are clear about what we want to
measure.

9.4.2 What Do We Want to Measure?

Going back to the fundamental drivers of interact, ultimately we want to know the
extent to which students can perform successfully in independent interaction. In
other words, we want, ideally, to capture instances of automaticity in operation.
Clearly, in the New Zealand context, the one-time summative interview test was not
working in this regard. Its failure stemmed largely from its pre-rehearsed and scripted
nature, but was also influenced by its one-sidedness. As a consequence, we were left
with questionable evidence about FL speakers’ proficiency in interacting.
What, then, gives us the best or most useful evidence of interactional proficiency?
It may be argued that spontaneous and unrehearsed interactions, seamlessly embedded
within the teaching and learning programme, provide the most authentic evidence.
Nevertheless, ‘spontaneous and unrehearsed’ appeared to be notions that provoked
considerable anxiety and uncertainty among both teachers and students.
When seen from a sociocultural perspective, the concept of ‘spontaneous and
unrehearsed’ does not need to be taken to mean that students are expected to speak
‘off the cuff’ without any prior opportunities to practise the language required.
After all, the sociocultural perspective is based on the understanding that achieving
ultimate independence or automaticity requires scaffolding and mediation, that is,
intervention and feedback from a more capable peer (which may be the teacher). As
I noted in Chap. 2, in the task-oriented classroom, for example, and as part of the
cycle of teaching and learning, the scaffolding process may include task preparation
(students working collaboratively to prepare a given task) and task repetition (stu-
dents being given opportunities to repeat a task, drawing on feedback from an ear-
lier performance). These scaffolding techniques will arguably enhance eventual
successful task completion (Bygate, 1996, 2001, 2005; Ellis, 2005; Mochizuki &
Ortega, 2008; Pinter, 2005, 2007; Skehan, 2009) and ultimate task automaticity –
the student can successfully perform the task independently and unaided. Translating
this scenario to the assessment context, the notion of interactions that are spontane-
ous and unrehearsed should not be seen to be in conflict with the notion that auto-
maticity requires a good deal of preceding preparatory work.
Also, independence and automaticity do not suggest perfect understanding and
command of the FL. Rather, independence and automaticity suggest the ability to
communicate in ways commensurate both with the context and with one’s ability, and to
maintain and sustain the interaction in the face of interactional difficulties and lack
of understanding, leading to a more adequate or fuller understanding. In other
words, strategic competence (Canale & Swain, 1980), a competence that is arguably
neglected (or at least not tapped into sufficiently) in one-way interviews or mono-
logic assessments of speaking (see Chap. 2), becomes an important goal.
The fundamental issue at stake, seen in the light of the intentions of interact, the
data generated from this project and the range of perspectives presented regarding
‘spontaneous and unrehearsed’, is this: the need to capture instances of interper-
sonal interaction in the FL that, regardless of the level of preparation that has pre-
ceded them, and regardless of the language used (simple or sophisticated), provide
evidence of automaticity from which conclusions regarding future real-world
appropriate interactive language use can be drawn. This is what we want to mea-
sure. (As Marion, recorded in Chap. 7, put it, the ultimate goal of interact for stu-
dents is that “we want them to be able to converse naturally with a French person.”)
Notwithstanding the challenges inherent in determining when automaticity is
achieved and differentiating between different levels of proficiency, the question
then becomes how we best elicit the evidence required – that is, whether we draw
on a static or dynamic model.
The finding from the student data of no significant differences across any mea-
sures of usefulness lends itself to two different interpretations, depending on
where stakeholders sit in the ‘static/dynamic’ debate.
The finding would be reassuring for those who might favour the more traditional
and static testing format found in converse. It may be argued that we could justifi-
ably maintain the one-time static assessment model (perhaps including a series of
one-time tests) without this being negatively perceived by the students as principal
stakeholders in comparison with a more dynamic model. However, the continued
use of one-time tests does not address issues such as negative impact in terms of test
taker stress, potentially negative interaction with the test task (depending on what
the students are asked to do), and the risk of disauthentication (the test is necessarily
artificial and cannot adequately reflect how instances of spoken interaction norma-
tively occur in real life). Returning to the one-time assessment format (or even oper-
ationalising interact as a series of stand-alone assessments) focuses on the interaction
as a test and also heightens the potential for students to (over-)prepare in advance.
Both these consequences potentially compromise the opportunity to gather evi-
dence of genuine spontaneity.
By contrast, the finding of no significant difference in perception among the
students might be alarming for those who would wish to advocate for the more
innovative and dynamic assessment format that, in theory at least, constitutes inter-
act. It provides no evidence, from the students’ perspective, that interact confers
any benefit or improvement. On the other hand, the finding does nothing to bring
into question the introduction of interact when seen from the students’ perspective.
If there really is no perceptual difference for the students as the candidates, the posi-
tive advantages of interact from the teachers’ perspective (as outlined in Chaps. 5
and 7) might be sufficient to swing the argument in favour of an assessment model
that is more seamlessly integrated into classroom work. (Also, students did note
several key disadvantages to converse in practice, presented in Chap. 8.) This inte-
gration may well be the most effective way to elicit real samples of what students know
and can do, and will arguably lessen student anxiety. Drawing on a collection of
genuinely authentic interactions as evidence of FL spoken proficiency is intuitively
appealing, and there are no grounds for arguing that students see themselves as
disadvantaged by the on-going nature of interact. However, collecting ‘real life’
evidence that emerges from students’ regular work challenges fundamental notions
of standardisation and reliability that traditionally inform high-stakes assessment.

9.5 Where to from Here?

For all the reasons that I have rehearsed earlier in this chapter (and indeed at various
stages throughout this book), it may be proposed that the interactional evidence
required for interact, in theory at least, is best secured through offering students
opportunities to engage in a range of peer-to-peer interactions. These would take
place throughout the school year and in the context of the teaching and learning
programme. Students can then select evidence of their best performances for sum-
mative grading purposes. This ‘performance-based’ theoretical stance to interact
arguably represents a genuine attempt to ‘mix and match’ (East, 2008) between
different paradigms of assessment. Nevertheless, the theoretical stance also leads to
differential understandings because of the sometimes uneasy interplay between two
assessment paradigms.
The preceding arguments present a convincing case for the usefulness and valid-
ity of interact. They do, however, raise a genuine question: how can a model of
assessment that is built seamlessly within a teaching and learning programme be
used for high-stakes purposes where issues of reliability, consistency of measure-
ment and accountability loom large? We are left with a clear tension between the
‘professional and learning’ goals of assessment and its ‘managerial and account-
ability’ goals (Gipps & Murphy, 1994; Torrance, 2013a, 2013b). There are no easy
answers to resolving this tension.

9.5.1 Scenario 1

It may be necessary for those who are responsible for making decisions on the
assessment (bodies such as NZQA and the Ministry of Education) to recognise the
impracticality and perceptual challenges for stakeholders of an assessment model
that could work seamlessly alongside normal classroom work. That is, there is per-
haps the need to acknowledge the stakeholder perception that interact is a high-
stakes assessment that must be separated from normal work. This would not mean
that the paired/group assessment format, with the advantages it offers, would need
to be abandoned. It may mean, however, that New Zealand should perhaps return to
the static one-time test model (or operate the assessment as a series of tests). Indeed,
speaking assessments that focus on interactions between two interlocutors are now
both well-established and normative in a range of contexts (e.g., Cambridge English
language assessment, 2015), but are operationalised as summative one-time tests.
Summative testing is a conventional and widely accepted method of measurement
(Brown & Abeywickrama, 2010). The implication is that paired or group one-time
tests are sufficient, fit for purpose and adequately representative of the construct
with regard to measuring spoken communicative proficiency. Indeed, the increased
practicality of paired assessments in comparison with single-candidate assessments
that was identified in prior studies (Ducasse & Brown, 2009; Galaczi, 2010; Ockey,
2001) presumes a one-time assessment format.

9.5.2 Scenario 2

In theory at least, assessments such as interact support the notion of seamlessness


between teaching, learning and assessment and therefore the inclusion of evidence
drawn from normal day-to-day activities. To ensure the success of interact, a shift
in understanding among stakeholders (perhaps including assessment authorities
such as NZQA) is required. Teachers and students need support in appreciating the
validity of lesson-embedded evidence for assessment purposes.
There are practicality considerations to approaches that promote on-going evi-
dence. For example, teachers might frequently need, in Margaret’s words, to ‘glide
by with the voice recorder’, or effectively record many lessons in the hope of per-
haps catching some instance of natural and spontaneous impromptu interaction (a
danger, of course, is that the interactions may still become contrived by virtue of the
presence of the recording device). However, although impracticality was the most
frequently mentioned comparative disadvantage of interact, there was some evi-
dence to suggest that the benefits of collecting on-going evidence outweighed this
disadvantage. The following three teacher survey comments illustrate this perspec-
tive (my emphases):
• It takes extra time, but I feel this is time well spent. (French 060)
• It is time consuming but it is worth it! (French 140)
• Perhaps it is more time consuming for teachers and students but the outcome is
far better. (Spanish 146)
An alternative for collecting on-going evidence is to follow Alison’s plan of
passing ownership of recording over to the students to record their interactions on a
mobile phone or other portable device. As Anna observed, most students “now have
phones that have such good recording devices inside that there’s no reason they
can’t just record it on their phone.” Teachers would also need to be comfortable with
passing ownership to the students, which might include as evidence instances of
genuine interaction that may occur outside the classroom (for example, a student
recording on a mobile phone an authentic interaction that takes place on a trip over-
seas). As for the concern about the genuineness of the evidence, this is arguably not
as great a risk as teachers may fear. Teachers know their students and, by and large,
can tell the difference between a genuine and a contrived interaction. (Naomi, for
example, noted that it was evident from the first interaction submitted by two of her
students that “they’ve learnt what they want to say, and you can tell by the way their
voices come across, it’s very much ‘I’m reading this from my brain’.”)
Perhaps the greatest advantage to assessments such as interact in terms of on-
going collections of evidence is the enhanced washback potential. Sally put it like
this: interact contributed to focusing on interaction as “a real skill, and so they’ve
got more confidence in speaking. I think that’s the aim, really, as language teach-
ers.” Sally had recently spoken to a teacher who had decided not to use interact
“because she thinks it’s too much work.” This teacher asked why Sally was perse-
vering with it. Sally responded:
Why wouldn’t you? Why would you want to produce students that could go to a country
and never actually practise speaking or have no confidence in speaking. I want my students
to know how to catch a train, get accommodation, order a meal, buy some clothes, act
appropriately in a host family situation, talk to people. It just seems weird not embracing
it – ‘oh it’s too much work for me, therefore I’m not going to do it.’ Well it’s not about you,
it’s about the kids, actually.

Clearly, in the New Zealand context, more thought needs to be given to how
interact as an assessment can be operationalised, particularly with regard to whether
and how the tension between authentic real-world samples of language and high-
stakes accountability can be resolved, that is, how the two potentially conflicting
notions can work alongside each other convincingly and acceptably. There is scope,
and necessity, for continuing debate.

9.6 Recommendations

Looking at the assessment of FL students’ spoken communicative proficiency from
a global perspective, it is important to be mindful of Bachman and Palmer’s (2010)
argument, previously stated, that a number of alternative approaches will be possi-
ble in any assessment situation. Each will offer advantages and disadvantages. That
is, the issue of gathering evidence through a one-time test (or a series of tests), or
through real examples elicited in the context of normal work and activities, is unre-
solved by the data in this project, and it is inevitable that different kinds of assess-
ment scenario will exist in different contexts. Also, as I have already noted,
controlled or summative speaking tests, even when operationalised as paired or
group tests, are normative in many jurisdictions across the world. Regardless of how
speaking assessments may be operationalised in various contexts, several findings
emerge from the data that should inform debates about how we might elicit stu-
dents’ best or most representative performances when speaking in an FL. Bearing in
mind the argument I have presented at several points that the task is crucial, the
following recommendations for on-going classroom practice are advanced:
1. Acknowledge that the ultimate goal is measurement of automaticity with regard
to potential real-world interactions with target language speakers. This measure-
ment should not take place until students, and their teachers, are comfortable that
students can perform the task used for assessment in a way that demonstrates a
sufficient degree of automaticity.
2. Promote maximal opportunities for the development of automaticity. Students
need to be exposed to as many opportunities as possible to interact with their
peers and others in authentic situations, whether these are assessed or not.
3. Provide feedback on students’ interactions that will help them to enhance their
performances across all dimensions of a spoken communicative proficiency
construct.
4. Move towards an understanding of assessment in ‘performance-based’ terms
whereby students are “assessed as they perform actual or simulated real-world
tasks” and “measured in the process of performing the targeted linguistic acts”
(Brown & Abeywickrama, 2010, p. 16).
5. Measure successful performances not only in terms of task completion (Long &
Norris, 2000; Norris, 2002) but also in terms of a clearly articulated construct
(Canale, 1983; Canale & Swain, 1980; Martinez-Flor et al., 2006; Roever, 2011).
In this connection, the language of interest is not, for example, the demonstration
of particular aspects of sophisticated grammar or lexis, but, rather, what is appro-
priate to the task.
6. Interpret ‘real-world’ tasks in terms of both situational and interactional authen-
ticity. That is, tasks that aim to replicate situational authenticity (ordering a cup
of coffee in a restaurant) should require a dimension of reciprocity and interac-
tion that moves beyond the potentially rote-learnt and artificial ‘waiter-customer’
scenario. This is particularly important as students advance in their proficiency,
for example, towards the more independent level anticipated at CEFR B1 and B2
(Council of Europe, 2001). Tasks need to elicit communicative behaviours such
as co-operating, collaborating, expressing opinions or negotiating meaning that
naturally arise in the performance of the task in the real world (East, 2012; Van
den Branden, 2006).
7. Give students more ownership of what they want to talk about. As I argued in
Chap. 3, a useful assessment task is one that promotes positive interaction for
most candidates. They are able to engage with the task in ways that enable them
to demonstrate the full extent of their proficiency. In this connection, tasks must
be seen as relevant by the candidates (Bachman & Palmer, 1996).
8. Be mindful of the potentially negative impact of interlocutor variables.
Mindfulness here must take into account that any real-world encounter beyond
the classroom or assessment will involve a dimension of unknown territory
which will need to be negotiated, and that this is therefore a skill we wish to
measure. However, it may be necessary to collect a range of evidence, pairing
students in different configurations or having students interact either with the
teacher or with a more proficient speaker who is unknown to them. It may also
be beneficial to allow students to select their own partners.
9. Be realistic. Automaticity is not to be equated with perfection. It is relative to
students’ stage in learning and the requirements of the task. Automaticity may be
demonstrated in very basic transactional scenarios (e.g., a simple interaction
about oneself and one’s family, and that of the interlocutor). Such transactions
may well draw on pre-learnt holistic ‘chunks’ of language (Nunan, 2004).
Automaticity will be determined by the appropriate use of language (does the
student demonstrate a real understanding of the language being used and what is
going on in the transaction?) and the ability to sustain the interaction (can the
student cope appropriately with potential unpredictability and breakdowns in
communication?).
The above recommendations for teaching and assessment practice are designed
to be operationalisable regardless of how the assessment is constructed – static,
dynamic, or somewhere in between.

9.7 Limitations and Conclusion

The key limitation of the study I have reported here is its focus on teachers and
students as the principal sources of evidence for test usefulness. Focusing on the
stakeholders arguably enables only a study of perceptions and may take only indirect
account of the actual evidence of test taker performances. A more wide-ranging
examination of comparative usefulness, and subsequent claims to validity, would
need to take into account evidence derived from assessments generated under differ-
ent assessment conditions, and how performances contribute to evidence about the
different facets of a spoken communicative proficiency construct that are deemed to
be important.
A further limitation is that the study took place at an early stage in the assessment
reform, that is, in the first or second year of introduction of interact at a particular
NCEA level. This was intentional with a view to capturing comparative data whilst
the old assessment, converse, was still relatively fresh in teachers’ minds. However,
teachers’ reception of, and operationalisation of, interact may have been influenced
by their recent experiences with a longer-established assessment and/or their
limited experience of working with the new assessment. It is possible that the
study took place too early in the process of reform for both teachers and NZQA to
have fully grasped the implications (hence no difference in perception for the stu-
dents, and some sense of ‘business as usual’ for the teachers, albeit across a range
of assessment points).
With regard to the data collected, a limitation of the teacher survey (Stage I) is
non-response bias. Of the targeted sample, 74 % failed to respond. It is not possible
to account for reasons for this, although these reasons probably include teachers
who were not using the new assessment at the time of the survey, whether by choice
or by not having a senior class at the time. A limitation of the teacher interviews is
that they were drawn from convenience samples and are therefore subject to sam-
pling bias and are not necessarily representative.
A limitation of the student surveys is that the sample sizes were small (n = 30 and
n = 119). It is therefore not possible to generalise findings to the wider populations
of different students across the different languages. It is also possible that some
responses of Year 13 students in the first study were influenced by being in com-
bined (Year 12 and 13) classes, a common phenomenon in New Zealand, and as
such having witnessed their Year 12 classmates being prepared for interact, even if
the respondents were taking converse. Certainly some open-ended comments from
students suggest some familiarity with the revised expectations of interact.
None of the limitations to the study is insurmountable. With regard to the national
teacher survey, I noted in Chap. 5 that 26 % represents a healthy return rate for a
postal survey (Resnick, 2012). Moreover, comparison between response rates across
the five targeted languages and the numbers of senior secondary students (NCEA
levels 1 to 3) taking each FL in 2012 (Education Counts, 2012) leads to a virtually
perfect correlation (r = .996, p < .001) and arguably adequate representation in the
sample. Additionally, I noted in Chap. 4 that the use of a range of data, both
quantitative and qualitative, and collected over 2 years, alongside published docu-
mentary material, facilitated both data source and methodological triangulation
(Bryman, 2004; Denzin, 1970). The end result is a robust study into stakeholder
perspectives.
Furthermore, to return to the justifications for the study which I presented in
Chap. 1, the focus on stakeholder perspectives was deliberate and provides an
important but often marginalised dimension to considerations of usefulness and
validity. As Bachman and Palmer (2010) argue, those responsible for developing a
particular assessment “need to take into consideration the potential consequences of
using an assessment, and of the decisions to be made, for different stakeholders in
the assessment situation” (p. 25). There is therefore a responsibility “to consult with
all relevant stakeholders in order to identify as many unintended consequences as
possible” (p. 26). Seen in comparison with more traditional test validation studies,
the project reported here represents, to borrow again McNamara’s (1996) words,
“another kind of research on language testing of a more fundamental kind, whose
aim is to make us fully aware of the nature and significance of assessment as a social
act” (p. 460, my emphases). Stakeholder perspectives provide valuable insight into
the ways in which assessments have social implications. Messick’s (1989) concep-
tualisation of validity places particular emphasis on the social consequences and
influences of a test or assessment, both on the individuals who take it and on wider
society. The importance of so-called consequential validity has implications for
validity studies, and the recognition of assessment as a social act prevents validation
studies that focus purely on scores from becoming a “barren exercise inside the
psychometric test-tube” (Bachman, 2000, p. 23).
Having said that, future research will ideally consider performance evidence.
Performance evidence would provide a useful additional and complementary
dimension in accordance with more traditional test validation studies. This would
include performance scores. Additional performance analyses might include stu-
dents’ interactional patterns, and the relative influence of different assessment for-
mats on complexity, fluency and accuracy (see, e.g., Taylor & Wigglesworth, 2009).
There is a sense in which studies to elicit these kinds of information, the evidential
bases of test interpretation and use (Messick, 1989), are fundamental to our under-
standing of what makes a particular assessment useful or fit for purpose.
Future research might also consider, alongside performance evidence, a replica-
tion of the teacher and student surveys after, say, 5 years, that is, after the interact
assessment has had full opportunity to become embedded in teachers’ and students’
thinking and experiences, and perhaps after the assessment has been formally
reviewed by the assessment authorities. The evidence from such surveys might be
compared to the data reported here to determine whether, and in what respects, there
have been changes in perceptions over time.
The above avenues for future research provide a platform for on-going evalua-
tion of assessments such as interact, with a view to continuing improvement and
greater clarity around the static-dynamic nexus and relative balances between these
two assessment paradigms.
Limitations notwithstanding, the evidence gathered from this two-stage project
indicates that, at least as far as the participants are concerned, and at this early stage
in the assessment reform process, interact is working relatively well. Teachers per-
ceive interact to be, in most respects, a significantly more useful assessment than
converse. It would appear that most teachers have understood and have come to
accept the learner-centred and experiential nature of New Zealand’s revised school
curriculum and its emphasis, for FL programmes, on communication and interac-
tion. It would also appear that teachers perceive interact to be a valid form of assess-
ment which reflects these curricular aims, albeit constrained by a view of interact as
a test. For the students, there is greater ambivalence and also a wide range of percep-
tions, positive and negative, regardless of the assessment format. Nevertheless,
open-ended comments reveal perspectives that would suggest that, when operation-
alised more in line with intentions, and subject to some modifications, interact
would likely be preferred over converse.
As I argued in Chap. 1, Winke (2011) underscores the importance of gathering
the teacher’s voice as a means to “shed light on the validity of the tests, that is,
whether the tests measure what they are supposed to and are justified in terms of
their outcomes, uses, and consequences” (p. 633). A similar rationale for collecting
the students’ perspective is supported, for example, by Bachman and Palmer (1996),
Messick (1989) and Rea-Dickins (1997).
As for the range of views expressed at this early stage in implementation, Monika
noted that “at first there was this huge brouhaha and ‘everything is different and
dangerous and we can’t do it and the students are too dumb’.” However, “it will cool
down and people [will] learn to see the good things.” Monika argued that, in her
view, interact “is working, you know … it’s just not true that it is not working.” She
conceded, “it is not working smoothly and without its hiccups to start with, you
know, but it’s becoming better and better.” In her view, one contribution to improve-
ment was that there needed to be more sharing and support. Monika explained, “I
don’t think it’s that people are jealously guarding their wonderful thoughts, it’s just
that there’s no impetus or system of sharing.” Also, “it’s a bit scary, too, you know,
to put your foot out and say ‘here, this is what I’m doing’ and, you know, don’t
shoot the messenger.” Monika concluded, “there’s no culture of that kind of thing in
the teaching community for languages, I think, but possibly because it’s never been
fostered, not because there’s not the need or want for it.”
With regard to the fundamental challenge of impracticality, Luoma (2004) argues
that “[s]peaking assessments are time-consuming and they require a fair amount of
work” (p. 191, my emphasis). Considerable time input is therefore arguably a given.
Luoma goes on to say: “[a]lthough reporting on what we are doing means spending
more time still, it is also helpful because it forces us to think about the activities
more carefully.” Additionally, “learning about other speaking testers’ experiences
can help us learn more. This expanding cycle of knowledge helps us develop better
speaking assessments and moves the field of assessing speaking forward” (p. 191).
It is my hope that the study reported and discussed in this book will make a positive
contribution to taking debates about meaningful assessments of spoken communi-
cative proficiency forward.

References

ARG. (1999). Assessment for learning: Beyond the black box. Cambridge, England: University of
Cambridge Faculty of Education.
ARG. (2002a). Assessment for learning: 10 principles. Retrieved from http://webarchive.nation-
alarchives.gov.uk/20101021152907/http:/www.ttrb.ac.uk/ViewArticle2.aspx?ContentId=
15313
ARG. (2002b). Testing, motivation and learning. Cambridge, England: University of Cambridge
Faculty of Education.
Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what we
count counts. Language Testing, 17(1), 1–42. http://dx.doi.org/10.1177/026553220001700101
Bachman, L. F., & Palmer, A. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford, England: Oxford University Press.
Bachman, L. F., & Palmer, A. (2010). Language assessment in practice: Developing language
assessments and justifying their use in the real world. Oxford, England: Oxford University
Press.
Brooks, L. (2009). Interacting in pairs in a test of oral proficiency: Co-constructing a better perfor-
mance. Language Testing, 26(3), 341–366. http://dx.doi.org/10.1177/0265532209104666
Brown, H. D. (2007). Principles of language learning and teaching (5th ed.). New York, NY: Pearson.
Brown, H. D., & Abeywickrama, P. (2010). Language assessment: Principles and classroom prac-
tices (2nd ed.). New York, NY: Pearson.
Bryman, A. (2004). Triangulation. In M. B. Lewis-Beck, A. Bryman, & T. Liao (Eds.), Encyclopedia
of social science research methods (pp. 1143–1144). Thousand Oaks, CA: Sage. http://dx.doi.
org/10.4135/9781412950589.n1031
Buck, G. (1988). Testing listening comprehension in Japanese university entrance examinations.
JALT Journal, 10(1), 15–42.
Bygate, M. (1996). Effects of task repetition: Appraising the developing language of learners. In
J. Willis & D. Willis (Eds.), Challenge and change in language teaching (pp. 136–146).
Oxford, England: Macmillan.
Bygate, M. (2001). Effects of task repetition on the structure and control of oral language. In
M. Bygate, P. Skehan, & M. Swain (Eds.), Researching pedagogic tasks (pp. 23–48). Harlow,
England: Longman.
Bygate, M. (2005). Oral second language abilities as expertise. In K. Johnson (Ed.), Expertise in
second language learning and teaching (pp. 104–127). New York, NY: Palgrave Macmillan.
Cambridge English language assessment. (2015). Retrieved from http://www.cambridgeenglish.
org/exams/
Canale, M. (1983). On some dimensions of language proficiency. In J. W. J. Oller (Ed.), Issues in
language testing research (pp. 333–342). Rowley, MA: Newbury House.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second lan-
guage teaching and testing. Applied Linguistics, 1(1), 1–47. http://dx.doi.org/10.1093/
applin/i.1.1
Cheng, L. (1997). How does washback influence teaching? Implications for Hong Kong. Language
and Education, 11(1), 38–54. http://dx.doi.org/10.1080/09500789708666717
Council of Europe. (2001). Common European Framework of Reference for languages. Cambridge,
England: Cambridge University Press.
De Ridder, I., Vangehuchten, L., & Seseña Gómez, M. (2007). Enhancing automaticity through
task-based language learning. Applied Linguistics, 28(2), 309–315. http://dx.doi.org/10.1093/
applin/aml057
DeKeyser, R. M. (2001). Automaticity and automatization. In P. Robinson (Ed.), Cognition and
second language instruction (pp. 125–151). Cambridge, England: Cambridge University Press.
http://dx.doi.org/10.1017/cbo9781139524780.007
Denzin, N. K. (1970). The research act in sociology. Chicago, IL: Aldine.
Ducasse, A., & Brown, A. (2009). Assessing paired orals: Raters’ orientation to interaction.
Language Testing, 26(3), 423–443. http://dx.doi.org/10.1177/0265532209104669
East, M. (2008). Dictionary use in foreign language writing exams: Impact and implications.
Amsterdam, Netherlands/Philadelphia, PA: John Benjamins. http://dx.doi.org/10.1075/lllt.22
East, M. (2012). Task-based language teaching from the teachers’ perspective: Insights from New
Zealand. Amsterdam, Netherlands/Philadelphia, PA: John Benjamins. http://dx.doi.
org/10.1075/tblt.3
East, M., & Scott, A. (2011a). Assessing the foreign language proficiency of high school students
in New Zealand: From the traditional to the innovative. Language Assessment Quarterly, 8(2),
179–189. http://dx.doi.org/10.1080/15434303.2010.538779
East, M., & Scott, A. (2011b). Working for positive washback: The standards-curriculum align-
ment project for Learning Languages. Assessment Matters, 3, 93–115.
Education Counts. (2012). Subject enrolment. Retrieved from http://www.educationcounts.govt.
nz/statistics/schooling/july_school_roll_returns/6052
Együd, G., & Glover, P. (2001). Readers respond. Oral testing in pairs – secondary school perspec-
tive. ELT Journal, 55(1), 70–76. http://dx.doi.org/10.1093/elt/55.1.70
Ellis, R. (Ed.). (2005). Planning and task performance in a second language. Amsterdam,
Netherlands/Philadelphia, PA: John Benjamins. http://dx.doi.org/10.1075/lllt.11
Foot, M. C. (1999). Relaxing in pairs. ELT Journal, 53(1), 36–41. http://dx.doi.org/10.1093/
elt/53.1.36
Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing,
13(1), 23–51. http://dx.doi.org/10.1177/026553229601300103
Fulcher, G. (2003). Testing second language speaking. Harlow, England: Pearson. http://dx.doi.
org/10.4324/9781315837376
Galaczi, E. D. (2010). Paired speaking tests: An approach grounded in theory and practice. In
J. Mader, & Z. Ürkün (Eds.), Recent approaches to teaching and assessing speaking. IATEFL
TEA SIG conference proceedings. Canterbury, England: IATEFL Publications.
Galaczi, E. D., & ffrench, A. (2011). Context validity. In L. Taylor (Ed.), Examining speaking:
Research and practice in assessing second language speaking (pp. 112–170). Cambridge,
England: Cambridge University Press.
Gardner, J., Harlen, W., Hayward, L., & Stobart, G. (2008). Changing assessment practice:
Process, principles and standards. Belfast, Northern Ireland: Assessment Reform Group.
Gipps, C. (1994). Beyond testing: Towards a theory of educational assessment. London, England:
The Falmer Press. http://dx.doi.org/10.4324/9780203486009
Gipps, C., & Murphy, P. (1994). A fair test? Assessment, achievement and equity. Buckingham,
UK: Open University Press.
Hattie, J. (2009). The black box of tertiary assessment: An impending revolution. In L. H. Meyer,
S. Davidson, H. Anderson, R. Fletcher, P. M. Johnston, & M. Rees (Eds.), Tertiary assessment
and higher education student outcomes: Policy, practice and research (pp. 259–275).
Wellington, NZ: Ako Aotearoa.
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1),
81–112. http://dx.doi.org/10.3102/003465430298487
Hedge, T. (2000). Teaching and learning in the language classroom. Oxford, England: Oxford
University Press.
Jacoby, S., & Ochs, E. (1995). Co-construction: An introduction. Research on Language and
Social Interaction, 28(3), 171–183.
Lazaraton, A. (1995). Qualitative research in applied linguistics: A progress report. TESOL
Quarterly, 29(3), 455–472. http://dx.doi.org/10.2307/3588071
Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge,
England: Cambridge University Press.
Leaper, D. A., & Riazi, M. (2014). The influence of prompt on group oral tests. Language Testing,
31(2), 177–204. http://dx.doi.org/10.1177/0265532213498237
Leung, C., & Lewkowicz, J. (2006). Expanding horizons and unresolved conundrums: Language
testing and assessment. TESOL Quarterly, 40(1), 211–234. http://dx.doi.org/10.2307/40264517
Long, M. (1983). Native speaker/non-native speaker conversation and the negotiation of compre-
hensible input. Applied Linguistics, 4(2), 126–141. http://dx.doi.org/10.1093/applin/4.2.126
Long, M. (1996). The role of the linguistic environment in second language acquisition. In
W. Ritchie & T. Bhatia (Eds.), Handbook of second language acquisition (pp. 413–468).
New York, NY: Academic.
Long, M., & Norris, J. (2000). Task-based teaching and assessment. In M. Byram (Ed.), Routledge
encyclopedia of language teaching and learning (pp. 597–603). London, England: Routledge.
Luoma, S. (2004). Assessing speaking. Cambridge, England: Cambridge University Press. http://
dx.doi.org/10.1017/cbo9780511733017
Martinez-Flor, A., Usó-Juan, E., & Alcón Soler, E. (2006). Towards acquiring communicative
competence through speaking. In E. Usó-Juan & A. Martínez-Flor (Eds.), Studies on language
acquisition: Current trends in the development and teaching of the four language skills
(pp. 139–157). Berlin, Germany/New York, NY: Walter de Gruyter. http://dx.doi.
org/10.1515/9783110197778.3.139
May, L. (2011). Interactional competence in a paired speaking test: Features salient to raters.
Language Assessment Quarterly, 8(2), 127–145. http://dx.doi.org/10.1080/15434303.2011.56
5845
McNamara, T. (1996). Measuring second language performance. London, England: Longman.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103).
New York, NY: Macmillan.
Ministry of Education. (2007). The New Zealand curriculum. Wellington, NZ: Learning Media.
Ministry of Education. (2011). Ministry of education position paper: Assessment (schooling sec-
tor). Wellington, NZ: Ministry of Education.
Mochizuki, N., & Ortega, L. (2008). Balancing communication and grammar in beginning-level
foreign language classrooms: A study of guided planning and relativization. Language
Teaching Research, 12(1), 11–37. http://dx.doi.org/10.1177/1362168807084492
Morrow, K. (1991). Evaluating communicative tests. In S. Anivan (Ed.), Current developments in
language testing (pp. 111–118). Singapore, Singapore: SEAMEO Regional Language Centre.
Nakatsuhara, F. (2009). Conversational styles in group oral tests: How is the conversation con-
structed? Unpublished doctoral thesis. University of Essex, Essex, England.
Norris, J. (2002). Interpretations, intended uses and designs in task-based language assessment.
Language Testing, 19(4), 337–346. http://dx.doi.org/10.1191/0265532202lt234ed
Nunan, D. (2004). Task-based language teaching. Cambridge, England: Cambridge University
Press. http://dx.doi.org/10.1017/cbo9780511667336
O’Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair-task performance.
Language Testing, 19(3), 277–295. http://dx.doi.org/10.1191/0265532202lt205oa
Ockey, G. J. (2001). Is the oral interview superior to the group oral? Working Papers, International
University of Japan, 11, 22–40.
Philp, J., Adams, R., & Iwashita, N. (2014). Peer interaction and second language learning. New
York, NY: Routledge. http://dx.doi.org/10.4324/9780203551349
Pinter, A. (2005). Task repetition with 10-year old children. In C. Edwards & J. Willis (Eds.),
Teachers exploring tasks in English language teaching (pp. 113–126). New York, NY: Palgrave
Macmillan.
Pinter, A. (2007). What children say: Benefits of task repetition. In K. Van den Branden, K. Van
Gorp, & M. Verhelst (Eds.), Tasks in action: Task-based language education from a classroom-
based perspective (pp. 131–158). Newcastle, England: Cambridge Scholars Publishing.
Poehner, M. (2008). Dynamic assessment: A Vygotskian approach to understanding and promot-
ing L2 development. New York, NY: Springer.
Rea-Dickins, P. (1997). So, why do we need relationships with stakeholders in language testing?
A view from the UK. Language Testing, 14(3), 304–314. http://dx.doi.org/10.1177/0265
53229701400307
Resnick, R. (2012). Comparison of postal and online surveys: Cost, speed, response rates and
reliability. Sweet Springs, MO: Education Market Research/MCH Strategic Data.
Roever, C. (2011). Testing of second language pragmatics: Past and future. Language Testing,
28(4), 463–481. http://dx.doi.org/10.1177/0265532210394633
Segalowitz, N. (2005). Automaticity and second languages. In C. J. Doughty & M. H. Long (Eds.),
The handbook of second language acquisition (pp. 381–408). Oxford, England: Blackwell.
http://dx.doi.org/10.1002/9780470756492.ch13
Shohamy, E. (2007). Tests as power tools: Looking back, looking forward. In J. Fox, M. Wesche,
D. Bayliss, L. Cheng, C. E. Turner, & C. Doe (Eds.), Language testing reconsidered (pp. 141–
152). Ottawa, Canada: University of Ottawa Press.
Skehan, P. (2009). Modelling second language performance: Integrating complexity, accuracy,
fluency, and lexis. Applied Linguistics, 30(4), 510–532. http://dx.doi.org/10.1093/applin/
amp047
Smallwood, I. M. (1994). Oral assessment: A case for continuous assessment at HKCEE level.
New Horizons: Journal of Education, Hong Kong Teachers’ Association, 35, 68–73.
Sunstein, B. S., & Lovell, J. H. (Eds.). (2000). The portfolio standard: How students can show us
what they know and are able to do. Portsmouth, NH: Heinemann.
Swain, M. (1984). Large-scale communicative language testing: A case study. In S. Savignon &
M. Burns (Eds.), Initiatives in communicative language teaching: A book of readings (pp. 185–
201). Reading, MA: Addison-Wesley.
Swain, M. (2001). Examining dialogue: Another approach to content specification and to validat-
ing inferences drawn from test scores. Language Testing, 18(3), 275–302. http://dx.doi.
org/10.1177/026553220101800302
Taylor, L. (2001). The paired speaking test format: Recent studies. Research Notes, 6, 15–17.
Taylor, L., & Wigglesworth, G. (2009). Are two heads better than one? Pair work in L2 assessment
contexts. Language Testing, 26(3), 325–339. http://dx.doi.org/10.1177/0265532209104665
Torrance, H. (Ed.). (2013a). Educational assessment and evaluation: Major themes in education
(Purposes, functions and technical issues, Vol. 1). London, England/New York, NY: Routledge.
Torrance, H. (Ed.). (2013b). Educational assessment and evaluation: Major themes in education
(Current issues in formative assessment, teaching and learning, Vol. 4). London, England/New
York, NY: Routledge.
Van den Branden, K. (2006). Introduction: Task-based language teaching in a nutshell. In K. Van
den Branden (Ed.), Task-based language education: From theory to practice (pp. 1–16).
Cambridge, England: Cambridge University Press. http://dx.doi.org/10.1017/
cbo9780511667282.002
van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils: Oral proficiency
interviews as conversation. TESOL Quarterly, 23, 489–508. http://dx.doi.org/10.2307/3586922
Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes.
Cambridge, MA: Harvard University Press.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Basingstoke,
England: Palgrave Macmillan.
Willis, D., & Willis, J. (2007). Doing task-based teaching. Oxford, England: Oxford University
Press.
Winke, P. (2011). Evaluating the validity of a high-stakes ESL test: Why teachers’ perceptions
matter. TESOL Quarterly, 45(4), 628–660. http://onlinelibrary.wiley.com/doi/10.5054/
tq.2011.268063/abstract
Bibliography

ACTFL. (2012). ACTFL proficiency guidelines 2012. Retrieved from http://www.actfl.org/
publications/guidelines-and-manuals/actfl-proficiency-guidelines-2012
Anastasi, A., & Urbina, S. (1997). Psychological testing. Upper Saddle River, NJ: Prentice Hall.
ARG. (1999). Assessment for learning: Beyond the black box. Cambridge, England: University of
Cambridge Faculty of Education.
ARG. (2002a). Assessment for learning: 10 principles. Retrieved from http://webarchive.nationalar-
chives.gov.uk/20101021152907/http:/www.ttrb.ac.uk/ViewArticle2.aspx?ContentId=15313
ARG. (2002b). Testing, motivation and learning. Cambridge, England: University of Cambridge
Faculty of Education.
ARG. (2006). The role of teachers in the assessment of learning. London, England: University of
London Institute of Education.
Australian Council for Educational Research. (2002). Report on the New Zealand national cur-
riculum. Melbourne, Australia: ACER.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, England: Oxford
University Press.
Bachman, L. F. (1991). What does language testing have to offer? TESOL Quarterly, 25, 671–704.
http://dx.doi.org/10.2307/3587082
Bachman, L. F. (2000). Modern language testing at the turn of the century: Assuring that what we
count counts. Language Testing, 17(1), 1–42. http://dx.doi.org/10.1177/026553220001700101
Bachman, L. F. (2002). Some reflections on task-based language performance assessment.
Language Testing, 19(4), 453–476. http://dx.doi.org/10.1191/0265532202lt240oa
Bachman, L. F., & Palmer, A. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford, England: Oxford University Press.
Bachman, L. F., & Palmer, A. (2010). Language assessment in practice: Developing language
assessments and justifying their use in the real world. Oxford, England: Oxford University
Press.
Biggs, J., & Tang, C. (2011). Teaching for quality learning at university: What the student does
(4th ed.). Maidenhead, England: McGraw-Hill/Open University Press.
Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in
Psychology, 3(2), 77–101. http://dx.doi.org/10.1191/1478088706qp063oa
Brooks, L. (2009). Interacting in pairs in a test of oral proficiency: Co-constructing a better perfor-
mance. Language Testing, 26(3), 341–366. http://dx.doi.org/10.1177/0265532209104666
Brown, H. D. (2007). Principles of language learning and teaching (5th ed.). New York, NY:
Pearson.
Brown, H. D., & Abeywickrama, P. (2010). Language assessment: Principles and classroom prac-
tices (2nd ed.). New York, NY: Pearson.
Bryman, A. (2004a). Member validation and check. In M. Lewis-Beck, A. Bryman, & T. Liao
(Eds.), Encyclopedia of social science research methods (p. 634). Thousand Oaks, CA: Sage.
http://dx.doi.org/10.4135/9781412950589.n548
Bryman, A. (2004b). Triangulation. In M. B. Lewis-Beck, A. Bryman, & T. Liao (Eds.),
Encyclopedia of social science research methods (pp. 1143–1144). Thousand Oaks, CA: Sage.
http://dx.doi.org/10.4135/9781412950589.n1031
Buck, G. (1988). Testing listening comprehension in Japanese university entrance examinations.
JALT Journal, 10(1), 15–42.
Buck, G. (1992). Translation as a language testing procedure: Does it work? Language Testing,
9(2), 123–148. http://dx.doi.org/10.1177/026553229200900202
Bygate, M. (1996). Effects of task repetition: Appraising the developing language of learners. In
J. Willis & D. Willis (Eds.), Challenge and change in language teaching (pp. 136–146).
Oxford, England: Macmillan.
Bygate, M. (2001). Effects of task repetition on the structure and control of oral language. In
M. Bygate, P. Skehan, & M. Swain (Eds.), Researching pedagogic tasks (pp. 23–48). Harlow,
England: Longman.
Bygate, M. (2005). Oral second language abilities as expertise. In K. Johnson (Ed.), Expertise in
second language learning and teaching (pp. 104–127). New York, NY: Palgrave Macmillan.
Byram, M. (1997). Teaching and assessing intercultural communicative competence. Clevedon,
England: Multilingual Matters.
Byram, M. (2000). Assessing intercultural competence in language teaching. Sprogforum, 18(6),
8–13.
Byram, M. (2008). From foreign language education to education for intercultural citizenship:
Essays and reflections. Clevedon, England: Multilingual Matters.
Byram, M. (2009). Intercultural competence in foreign languages: The intercultural speaker and
the pedagogy of foreign language education. In D. K. Deardorff (Ed.), The Sage handbook of
intercultural competence (pp. 321–332). Thousand Oaks, CA: Sage.
Byram, M., Gribkova, B., & Starkey, H. (2002). Developing the intercultural dimension in lan-
guage teaching: A practical introduction for teachers. Strasbourg, France: Council of Europe.
Byram, M., Holmes, P., & Savvides, N. (2013). Intercultural communicative competence in for-
eign language education: Questions of theory, practice and research. The Language Learning
Journal, 41(3), 251–253. http://dx.doi.org/10.1080/09571736.2013.836343
Cambridge English language assessment. (2015). Retrieved from http://www.cambridgeenglish.
org/exams/
Canale, M. (1983). On some dimensions of language proficiency. In J. W. J. Oller (Ed.), Issues in
language testing research (pp. 333–342). Rowley, MA: Newbury House.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second lan-
guage teaching and testing. Applied Linguistics, 1(1), 1–47. http://dx.doi.org/10.1093/
applin/i.1.1
Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19,
254–272. http://dx.doi.org/10.1017/s0267190599190135
Cheng, L. (1997). How does washback influence teaching? Implications for Hong Kong. Language
and Education, 11(1), 38–54. http://dx.doi.org/10.1080/09500789708666717
Clapham, C. (2000). Assessment and testing. Annual Review of Applied Linguistics, 20, 147–161.
http://dx.doi.org/10.1017/s0267190500200093
Cohen, R. J., & Swerdlik, M. E. (2005). Psychological testing and assessment: An introduction to
tests and measurement (6th ed.). New York, NY: McGraw Hill.
Council of Europe. (1998). Modern languages: Teaching, assessment. A common European frame-
work of reference. Strasbourg, France: Council of Europe.
Council of Europe. (2001). Common European framework of reference for languages. Cambridge,
England: Cambridge University Press.
Crocker, L. (2002). Stakeholders in comprehensive validation of standards-based assessments: A
commentary. Educational Measurement: Issues and Practice, 22, 5–6. http://dx.doi.
org/10.1111/j.1745-3992.2002.tb00079.x
Crooks, T. (2010). New Zealand: Empowering teachers and children. In I. C. Rotberg (Ed.),
Balancing change and tradition in global education reform (2nd ed., pp. 281–310). Lanham,
MD: Rowman and Littlefield Education.
Csépes, I. (2002). Is testing speaking in pairs disadvantageous for students? Effects on oral test
scores. novELTy, 9(1), 22–45.
Davis, L. (2009). The influence of interlocutor proficiency in a paired oral assessment. Language
Testing, 26(3), 367–396. http://dx.doi.org/10.1177/0265532209104667
De Ridder, I., Vangehuchten, L., & Seseña Gómez, M. (2007). Enhancing automaticity through
task-based language learning. Applied Linguistics, 28(2), 309–315. http://dx.doi.org/10.1093/
applin/aml057
DeKeyser, R. M. (2001). Automaticity and automatization. In P. Robinson (Ed.), Cognition and
second language instruction (pp. 125–151). Cambridge, England: Cambridge University Press.
http://dx.doi.org/10.1017/cbo9781139524780.007
Denzin, N. K. (1970). The research act in sociology. Chicago, IL: Aldine.
Dervin, F. (2010). Assessing intercultural competence in Language Learning and Teaching: A criti-
cal review of current efforts. In F. Dervin & E. Suomela-Salmi (Eds.), New approaches to
assessment in higher education (pp. 157–173). Bern, Switzerland: Peter Lang.
Dobric, K. (2006). Drawing on discourses: Policy actors in the debates over the National Certificate
of Educational Achievement 1996–2000. New Zealand Annual Review of Education, 15,
85–109.
Ducasse, A., & Brown, A. (2009). Assessing paired orals: Raters’ orientation to interaction.
Language Testing, 26(3), 423–443. http://dx.doi.org/10.1177/0265532209104669
East, M. (2005). Using support resources in writing assessments: Test taker perceptions. New
Zealand Studies in Applied Linguistics, 11(1), 21–36.
East, M. (2007). Bilingual dictionaries in tests of L2 writing proficiency: Do they make a differ-
ence? Language Testing, 24(3), 331–353. http://dx.doi.org/10.1177/0265532207077203
East, M. (2008a). Dictionary use in foreign language writing exams: Impact and implications.
Amsterdam, Netherlands/Philadelphia, PA: John Benjamins. http://dx.doi.org/10.1075/lllt.22
East, M. (2008b). Language evaluation policies and the use of support resources in assessments of
language proficiency. Current Issues in Language Planning, 9(3), 249–261. http://dx.doi.
org/10.1080/14664200802139539
East, M. (2009). Evaluating the reliability of a detailed analytic scoring rubric for foreign language
writing. Assessing Writing, 14(2), 88–115. http://dx.doi.org/10.1016/j.asw.2009.04.001
East, M. (2012). Task-based language teaching from the teachers’ perspective: Insights from New
Zealand. Amsterdam, Netherlands/Philadelphia, PA: John Benjamins. http://dx.doi.
org/10.1075/tblt.3
East, M. (2013, August 24). The new NCEA ‘interact’ standard: Teachers’ thinking about assess-
ment reform. Paper presented at the New Zealand Association of Language Teachers (NZALT)
Auckland/Northland Region language seminar, Auckland.
East, M. (2014a, July 6–9). To interact or not to interact? That is the question. Keynote address at
the New Zealand Association of Language Teachers (NZALT) Biennial National Conference,
Languages Give You Wings, Palmerston North, NZ.
East, M. (2014b). Working for positive outcomes? The standards-curriculum alignment for
Learning Languages and its reception by teachers. Assessment Matters, 6, 65–85.
East, M. (2015a). Coming to terms with innovative high-stakes assessment practice: Teachers’
viewpoints on assessment reform. Language Testing, 32(1), 101–120. http://dx.doi.
org/10.1177/0265532214544393
East, M. (2015b). Taking communication to task – again: What difference does a decade make? The
Language Learning Journal, 43(1), 6–19. http://dx.doi.org/10.1080/09571736.2012.723729
East, M., & Scott, A. (2011a). Assessing the foreign language proficiency of high school students
in New Zealand: From the traditional to the innovative. Language Assessment Quarterly, 8(2),
179–189. http://dx.doi.org/10.1080/15434303.2010.538779
East, M., & Scott, A. (2011b). Working for positive washback: The standards-curriculum align-
ment project for Learning Languages. Assessment Matters, 3, 93–115.
Edge, J., & Richards, K. (1998). May I see your warrant please?: Justifying outcomes in qualitative
research. Applied Linguistics, 19, 334–356. http://dx.doi.org/10.1093/applin/19.3.334
Education Counts. (2012). Subject enrolment. Retrieved from http://www.educationcounts.govt.
nz/statistics/schooling/july_school_roll_returns/6052
Együd, G., & Glover, P. (2001). Readers respond. Oral testing in pairs – Secondary school perspec-
tive. ELT Journal, 55(1), 70–76. http://dx.doi.org/10.1093/elt/55.1.70
Elder, C. (1997). What does test bias have to do with fairness? Language Testing, 14(3), 261–277.
http://dx.doi.org/10.1177/026553229701400304
Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks:
What does the test-taker have to offer? Language Testing, 19(4), 347–368. http://dx.doi.org/
10.1191/0265532202lt235oa
Ellis, R. (Ed.). (2005). Planning and task performance in a second language. Amsterdam,
Netherlands/Philadelphia, PA: John Benjamins. http://dx.doi.org/10.1075/lllt.11
Foot, M. C. (1999). Relaxing in pairs. ELT Journal, 53(1), 36–41. http://dx.doi.org/10.1093/
elt/53.1.36
Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing,
13(1), 23–51. http://dx.doi.org/10.1177/026553229601300103
Fulcher, G. (2003). Testing second language speaking. Harlow, England: Pearson. http://dx.doi.
org/10.4324/9781315837376
Galaczi, E. D. (2010). Paired speaking tests: An approach grounded in theory and practice. In
J. Mader & Z. Ürkün (Eds.), Recent approaches to teaching and assessing speaking. IATEFL
TEA SIG conference proceedings. Canterbury, England: IATEFL Publications.
Galaczi, E. D., & ffrench, A. (2011). Context validity. In L. Taylor (Ed.), Examining speaking:
Research and practice in assessing second language speaking (pp. 112–170). Cambridge,
England: Cambridge University Press.
Gardner, J., Harlen, W., Hayward, L., & Stobart, G. (2008). Changing assessment practice:
Process, principles and standards. Belfast, Northern Ireland: Assessment Reform Group.
Gipps, C. (1994). Beyond testing: Towards a theory of educational assessment. London, England:
The Falmer Press. http://dx.doi.org/10.4324/9780203486009
Gipps, C., & Murphy, P. (1994). A fair test? Assessment, achievement and equity. Buckingham,
England: Open University Press.
Gov.UK. (2015). Get the facts: GCSE reform. Retrieved from https://www.gov.uk/government/
publications/get-the-facts-gcse-and-a-level-reform/get-the-facts-gcse-reform
Graham, J. W. (2012). Missing data: Analysis and design. New York, NY: Springer.
Haertel, E. H. (2002). Standard setting as a participatory process: Implications for validation of
standards-based accountability programs. Educational Measurement: Issues and Practice, 22,
16–22. http://dx.doi.org/10.1111/j.1745-3992.2002.tb00081.x
Hattie, J. (2009). The black box of tertiary assessment: An impending revolution. In L. H. Meyer,
S. Davidson, H. Anderson, R. Fletcher, P. M. Johnston, & M. Rees (Eds.), Tertiary assessment
and higher education student outcomes: Policy, practice and research (pp. 259–275).
Wellington, NZ: Ako Aotearoa.
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1),
81–112. http://dx.doi.org/10.3102/003465430298487
Hedge, T. (2000). Teaching and learning in the language classroom. Oxford, England: Oxford
University Press.
Higgs, T. V. (Ed.). (1984). Teaching for proficiency: The organizing principle. Lincolnwood, IL:
National Textbook Company.
Hinkel, E. (2010). Integrating the four skills: Current and historical perspectives. In R. Kaplan
(Ed.), The Oxford handbook of applied linguistics (2nd ed., pp. 110–123). Oxford, England:
Oxford University Press. http://dx.doi.org/10.1093/oxfordhb/9780195384253.013.0008
Hipkins, R. (2013). NCEA one decade on: Views and experiences from the 2012 NZCER National
Survey of Secondary Schools. Wellington, NZ: New Zealand Council for Educational Research.
Hu, G. (2013). Assessing English as an international language. In L. Alsagoff, S. L. McKay, G. Hu,
& W. A. Renandya (Eds.), Principles and practices for teaching English as an international
language (pp. 123–143). New York, NY: Routledge.
Hunter, D. (2009). Communicative language teaching and the ELT Journal: A corpus-based
approach to the history of a discourse. Unpublished doctoral thesis. University of Warwick,
Warwick, England.
Iwashita, N. (1996). The validity of the paired interview in oral performance assessment. Melbourne
Papers in Language Testing, 5(2), 51–65.
Jacoby, S., & Ochs, E. (1995). Co-construction: An introduction. Research on Language and
Social Interaction, 28(3), 171–183.
Kane, M. J. (2002). Validating high-stakes testing programs. Educational Measurement: Issues
and Practice, 21(1), 31–42. http://dx.doi.org/10.1111/j.1745-3992.2002.tb00083.x
Kaplan, R. M., & Saccuzzo, D. P. (2012). Psychological testing: Principles, applications, and
issues (8th ed.). Belmont, CA: Wadsworth, Cengage Learning.
Klapper, J. (2003). Taking communication to task? A critical review of recent trends in language
teaching. Language Learning Journal, 27, 33–42. http://dx.doi.org/10.1080/09571730385200061
Kline, P. (2000). Handbook of psychological testing (2nd ed.). London, England: Routledge. http://
dx.doi.org/10.4324/9781315812274
Koefoed, G. (2012). Policy perspectives from New Zealand. In M. Byram & L. Parmenter (Eds.),
The common European framework of reference: The globalisation of language education pol-
icy (pp. 233–247). Clevedon, England: Multilingual Matters.
Kramsch, C. (1986). From language proficiency to interactional competence. The Modern
Language Journal, 70(4), 366–372. http://dx.doi.org/10.1111/j.1540-4781.1986.tb05291.x
Kramsch, C. (1987). The proficiency movement: Second language acquisition perspectives.
Studies in Second Language Acquisition, 9(3), 355–362. http://dx.doi.org/10.1017/
s0272263100006732
Kramsch, C. (2005). Post 9/11: Foreign languages between knowledge and power. Applied
Linguistics, 26(4), 545–567. http://dx.doi.org/10.1093/applin/ami026
Kunnan, A. J. (2000). Fairness and justice for all. In A. J. Kunnan (Ed.), Fairness and validation in
language assessment (pp. 1–14). Cambridge, England: Cambridge University Press.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data.
Biometrics, 33(1), 159–174. http://dx.doi.org/10.2307/2529310
Language Testing International. (2014). ACTFL Oral Proficiency Interview by Computer (OPIc).
Retrieved from http://www.languagetesting.com/oral-proficiency-interview-by-computer-opic
Lazaraton, A. (1995). Qualitative research in applied linguistics: A progress report. TESOL
Quarterly, 29(3), 455–472. http://dx.doi.org/10.2307/3588071
Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge,
England: Cambridge University Press.
Leaper, D. A., & Riazi, M. (2014). The influence of prompt on group oral tests. Language Testing,
31(2), 177–204. http://dx.doi.org/10.1177/0265532213498237
Leung, C. (2005). Convivial communication: Recontextualizing communicative competence.
International Journal of Applied Linguistics, 15(2), 119–144. http://dx.doi.
org/10.1111/j.1473-4192.2005.00084.x
Leung, C. (2007). Dynamic assessment: Assessment for and as teaching? Language Assessment
Quarterly, 4(3), 257–278. http://dx.doi.org/10.1080/15434300701481127
Leung, C., & Lewkowicz, J. (2006). Expanding horizons and unresolved conundrums: Language
testing and assessment. TESOL Quarterly, 40(1), 211–234. http://dx.doi.org/10.2307/40264517
Lewkowicz, J. (2000). Authenticity in language testing: Some outstanding questions. Language
Testing, 17(1), 43–64. http://dx.doi.org/10.1177/026553220001700102
Liddicoat, A. (2005). Teaching languages for intercultural communication. In D. Cunningham &
A. Hatoss (Eds.), An international perspective on language policies, practices and proficien-
cies (pp. 201–214). Belgrave, Australia: Fédération Internationale des Professeurs de Langues
Vivantes (FIPLV).
Liddicoat, A. (2008). Pedagogical practice for integrating the intercultural in language teaching
and learning. Japanese Studies, 28(3), 277–290. http://dx.doi.org/10.1080/10371390802446844
Liddicoat, A., & Crozet, C. (Eds.). (2000). Teaching languages, teaching cultures. Melbourne,
Australia: Language Australia.
Lo Bianco, J., Liddicoat, A., & Crozet, C. (Eds.). (1999). Striving for the third place: Intercultural
competence through language education. Melbourne, Australia: Language Australia.
Long, M. (1983). Native speaker/non-native speaker conversation and the negotiation of compre-
hensible input. Applied Linguistics, 4(2), 126–141. http://dx.doi.org/10.1093/applin/4.2.126
Long, M. (1996). The role of the linguistic environment in second language acquisition. In
W. Ritchie & T. Bhatia (Eds.), Handbook of second language acquisition (pp. 413–468).
New York, NY: Academic.
Long, M., & Norris, J. (2000). Task-based teaching and assessment. In M. Byram (Ed.), Routledge
encyclopedia of language teaching and learning (pp. 597–603). London, England: Routledge.
Luoma, S. (2004). Assessing speaking. Cambridge, England: Cambridge University Press. http://
dx.doi.org/10.1017/cbo9780511733017
Madaus, G. F., & Kellaghan, T. (1992). Curriculum evaluation and assessment. In P. W. Jackson
(Ed.), Handbook of research on curriculum (pp. 119–154). New York, NY: Macmillan.
Mangubhai, F., Marland, P., Dashwood, A., & Son, J. B. (2004). Teaching a foreign language: One
teacher’s practical theory. Teaching and Teacher Education, 20, 291–311. http://dx.doi.
org/10.1016/j.tate.2004.02.001
Martinez-Flor, A., Usó-Juan, E., & Alcón Soler, E. (2006). Towards acquiring communicative
competence through speaking. In E. Usó-Juan & A. Martínez-Flor (Eds.), Studies on language
acquisition: Current trends in the development and teaching of the four language skills
(pp. 139–157). Berlin, Germany/New York, NY: Walter de Gruyter. http://dx.doi.
org/10.1515/9783110197778.3.139
May, L. (2009). Co-constructed interaction in a paired speaking test: The rater’s perspective.
Language Testing, 26(3), 397–422. http://dx.doi.org/10.1177/0265532209104668
May, L. (2011). Interactional competence in a paired speaking test: Features salient to raters.
Language Assessment Quarterly, 8(2), 127–145. http://dx.doi.org/10.1080/15434303.2011.56
5845
McNamara, T. (1996). Measuring second language performance. London, England: Longman.
McNamara, T. (1997). ‘Interaction’ in second language performance assessment: Whose perfor-
mance? Applied Linguistics, 18(4), 446–466. http://dx.doi.org/10.1093/applin/18.4.446
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Malden, MA:
Blackwell.
Merriam, S. B. (2009). Qualitative research: A guide to design and implementation. San Francisco,
CA: Jossey-Bass.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103).
New York, NY: Macmillan.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’
responses and performances as scientific inquiry into score meaning. American Psychologist,
50, 741–749. http://dx.doi.org/10.1037//0003-066x.50.9.741
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241–256.
http://dx.doi.org/10.1177/026553229601300302
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook
(2nd ed.). Thousand Oaks, CA: Sage.
Ministry of Education. (1993). The New Zealand curriculum framework. Wellington, NZ: Learning
Media.
Ministry of Education. (1995a). Chinese in the New Zealand curriculum. Wellington, NZ: Learning
Media.
Ministry of Education. (1995b). Spanish in the New Zealand curriculum. Wellington, NZ: Learning
Media.
Ministry of Education. (1998). Japanese in the New Zealand curriculum. Wellington, NZ: Learning
Media.
Ministry of Education. (2002a). French in the New Zealand curriculum. Wellington, NZ: Learning
Media.
Ministry of Education. (2002b). German in the New Zealand curriculum. Wellington, NZ:
Learning Media.
Ministry of Education. (2007). The New Zealand curriculum. Wellington, NZ: Learning Media.
Ministry of Education. (2010). Learning languages – Curriculum guides. Retrieved from http://
learning-languages-guides.tki.org.nz/
Ministry of Education. (2011a). Ministry of Education position paper: Assessment (schooling sec-
tor). Wellington, NZ: Ministry of Education.
Ministry of Education. (2011b). New Zealand curriculum guides senior secondary: Learning lan-
guages. Retrieved from http://seniorsecondary.tki.org.nz/Learning-languages
Ministry of Education. (2012a). Secondary student achievement. Retrieved from
http://nzcurriculum.tki.org.nz/System-of-support-incl.-PLD/School-initiated-supports/Professional-learning-and-development/Secondary-achievement
Ministry of Education. (2012b). What’s new or different? Retrieved from http://seniorsecondary.
tki.org.nz/Learning-languages/What-s-new-or-different
Ministry of Education. (2014a). Learning languages – Achievement objectives. Retrieved from
http://nzcurriculum.tki.org.nz/The-New-Zealand-Curriculum/Learning-areas/Learning-languages/Achievement-objectives
Ministry of Education. (2014b). Resources for internally assessed achievement standards.
Retrieved from http://ncea.tki.org.nz/
Resources-for-Internally-Assessed-Achievement-Standards
Mislevy, R., Wilson, M. R., Ercikan, K., & Chudowsky, N. (2003). Psychometric principles in
student assessment. In T. Kellaghan & D. L. Stufflebeam (Eds.), International handbook of
educational evaluation (Vol. 9, pp. 489–531). Dordrecht, Netherlands: Kluwer Academic
Publishers. http://dx.doi.org/10.1007/978-94-010-0309-4_31
Mitchell, R., & Martin, C. (1997). Rote learning, creativity and ‘understanding’ in classroom for-
eign language teaching. Language Teaching Research, 1(1), 1–27. http://dx.doi.
org/10.1177/136216889700100102
Mochizuki, N., & Ortega, L. (2008). Balancing communication and grammar in beginning-level
foreign language classrooms: A study of guided planning and relativization. Language Teaching
Research, 12(1), 11–37. http://dx.doi.org/10.1177/1362168807084492
Morrow, K. (1991). Evaluating communicative tests. In S. Anivan (Ed.), Current developments in
language testing (pp. 111–118). Singapore, Singapore: SEAMEO Regional Language Centre.
Nakatsuhara, F. (2009). Conversational styles in group oral tests: How is the conversation con-
structed? Unpublished doctoral thesis. University of Essex, Essex, England.
National Foundation for Educational Research. (2002). New Zealand stocktake: An international
critique. Retrieved from http://www.educationcounts.govt.nz/publications/curriculum/9137
Newton, P., & Shaw, S. (2014). Validity in educational and psychological assessment. London,
England: Sage.
Nitta, R., & Nakatsuhara, F. (2014). A multifaceted approach to investigating pre-task planning
effects on paired oral test performance. Language Testing, 31(2), 147–175. http://dx.doi.
org/10.1177/0265532213514401
Norris, J. (2002). Interpretations, intended uses and designs in task-based language assessment.
Language Testing, 19(4), 337–346. http://dx.doi.org/10.1191/0265532202lt234ed
Norris, J. (2008). Validity evaluation in language assessment. Frankfurt am Main, Germany: Peter
Lang.
Norris, J., Bygate, M., & Van den Branden, K. (2009). Introducing task-based language teaching.
In K. Van den Branden, M. Bygate, & J. Norris (Eds.), Task-based language teaching: A reader
(pp. 15–19). Amsterdam, Netherlands/Philadelphia, PA: John Benjamins.
Norton, J. (2005). The paired format in the Cambridge Speaking Tests. ELT Journal, 59(4), 287–
297. http://dx.doi.org/10.1093/elt/cci057
Nunan, D. (2004). Task-based language teaching. Cambridge, England: Cambridge University
Press. http://dx.doi.org/10.1017/cbo9780511667336
NZQA. (2014a). Assessment and moderation best practice workshops. Retrieved from http://www.
nzqa.govt.nz/about-us/events/best-practice-workshops/
NZQA. (2014b). External examinations. Retrieved from http://www.nzqa.govt.nz/qualifications-
standards/qualifications/ncea/ncea-exams-and-portfolios/external/
NZQA. (2014c). External moderation. Retrieved from http://www.nzqa.govt.nz/providers-partners/assessment-and-moderation/managing-national-assessment-in-schools/secondary-moderation/external-moderation/
NZQA. (2014d). History of NCEA. Retrieved from http://www.nzqa.govt.nz/qualifications-
standards/qualifications/ncea/understanding-ncea/history-of-ncea/
NZQA. (2014e). Internal moderation. Retrieved from http://www.nzqa.govt.nz/providers-partners/assessment-and-moderation/managing-national-assessment-in-schools/secondary-moderation/external-moderation/internal-moderation/
NZQA. (2014f). Languages – Clarifications. Retrieved from http://www.nzqa.govt.nz/qualifications-standards/qualifications/ncea/subjects/languages/clarifications/
NZQA. (2014g). Languages – Moderator’s newsletter. Retrieved from http://www.nzqa.govt.nz/qualifications-standards/qualifications/ncea/subjects/languages/moderator-newsletters/october-2014/
NZQA. (2014h). NCEA subject resources. Retrieved from http://www.nzqa.govt.nz/qualifications-
standards/qualifications/ncea/subjects
NZQA. (2014i). Search framework. Retrieved from http://www.nzqa.govt.nz/framework/search/
index.do
NZQA. (2014j). Secondary school qualifications prior to 2002. Retrieved from http://www.nzqa.
govt.nz/qualifications-standards/results-2/secondary-school-qualifications-prior-to-2002/
NZQA. (2014k). Standards. Retrieved from http://www.nzqa.govt.nz/qualifications-standards/
qualifications/ncea/understanding-ncea/how-ncea-works/standards/
O’Sullivan, B. (2002). Learner acquaintanceship and oral proficiency test pair-task performance.
Language Testing, 19(3), 277–295. http://dx.doi.org/10.1191/0265532202lt205oa
Ockey, G. J. (2001). Is the oral interview superior to the group oral? Working Papers, International
University of Japan, 11, 22–40.
Pardo-Ballester, C. (2010). The validity argument of a web-based Spanish listening exam: Test
usefulness evaluation. Language Assessment Quarterly, 7(2), 137–159. http://dx.doi.
org/10.1080/15434301003664188
Philp, J., Adams, R., & Iwashita, N. (2014). Peer interaction and second language learning.
New York, NY: Routledge. http://dx.doi.org/10.4324/9780203551349
Pinter, A. (2005). Task repetition with 10-year old children. In C. Edwards & J. Willis (Eds.),
Teachers exploring tasks in English language teaching (pp. 113–126). New York, NY: Palgrave
Macmillan.
Pinter, A. (2007). What children say: Benefits of task repetition. In K. Van den Branden, K. Van
Gorp, & M. Verhelst (Eds.), Tasks in action: Task-based language education from a classroom-
based perspective (pp. 131–158). Newcastle, England: Cambridge Scholars Publishing.
Poehner, M. (2008). Dynamic assessment: A Vygotskian approach to understanding and promot-
ing L2 development. New York, NY: Springer.
Poehner, M., & Lantolf, J. P. (2005). Dynamic assessment in the language classroom. Language
Teaching Research, 9(3), 233–265. http://dx.doi.org/10.1191/1362168805lr166oa
Popham, W. J. (2006). Assessment for educational leaders. Boston, MA: Pearson.
Rea-Dickins, P. (1997). So, why do we need relationships with stakeholders in language testing?
A view from the UK. Language Testing, 14(3), 304–314. http://dx.doi.
org/10.1177/026553229701400307
Rea-Dickins, P. (2004). Understanding teachers as agents of assessment. Language Testing, 21(3),
249–258. http://dx.doi.org/10.1191/0265532204lt283ed
Resnick, R. (2012). Comparison of postal and online surveys: Cost, speed, response rates and
reliability. Sweet Springs, MO: Education Market Research/MCH Strategic Data.
Richards, J. C. (2001). Curriculum development in language teaching. Cambridge, England:
Cambridge University Press. http://dx.doi.org/10.1017/cbo9780511667220
Richards, J. C. (2006). Communicative language teaching today. Cambridge, England: Cambridge
University Press.
Richards, J. C., & Rodgers, T. S. (2014). Approaches and methods in language teaching (3rd ed.).
Cambridge, England: Cambridge University Press.
Roever, C. (2011). Testing of second language pragmatics: Past and future. Language Testing,
28(4), 463–481. http://dx.doi.org/10.1177/0265532210394633
Ryan, K. (2002). Assessment validation in the context of high-stakes assessment. Educational
Measurement: Issues and Practice, 22, 7–15. http://dx.doi.org/10.1111/j.1745-3992.2002.
tb00080.x
Sakuragi, T. (2006). The relationship between attitudes toward language study and cross-cultural
attitudes. International Journal of Intercultural Relations, 30, 19–31. http://dx.doi.
org/10.1016/j.ijintrel.2005.05.017
Samuda, V., & Bygate, M. (2008). Tasks in second language learning. Basingstoke, England:
Palgrave Macmillan. http://dx.doi.org/10.1057/9780230596429
Savignon, S. (2005). Communicative language teaching: Strategies and goals. In E. Hinkel (Ed.),
Handbook of research in second language teaching and learning (pp. 635–651). Mahwah, NJ:
Lawrence Erlbaum.
Saville, N., & Hargreaves, P. (1999). Assessing speaking in the revised FCE. ELT Journal, 53(1),
42–51. http://dx.doi.org/10.1093/elt/53.1.42
Scott, A., & East, M. (2009). The standards review for learning languages: How come and where
to? The New Zealand Language Teacher, 39, 28–33.
Scott, A., & East, M. (2012). Academic perspectives from New Zealand. In M. Byram &
L. Parmenter (Eds.), The common European framework of reference: The globalisation of lan-
guage education policy (pp. 248–257). Clevedon, England: Multilingual Matters.
Segalowitz, N. (2005). Automaticity and second languages. In C. J. Doughty & M. H. Long (Eds.),
The handbook of second language acquisition (pp. 381–408). Oxford, England: Blackwell.
http://dx.doi.org/10.1002/9780470756492.ch13
Sercu, L. (2010). Assessing intercultural competence: More questions than answers. In A. Paran &
L. Sercu (Eds.), Testing the untestable in language education (pp. 17–34). Clevedon, England:
Multilingual Matters.
Shearer, R. (n.d.). The New Zealand curriculum framework: A new paradigm in curriculum policy
development. ACE Papers, Issue 7 (Politics of curriculum, pp. 10–25). Retrieved from https://
researchspace.auckland.ac.nz/handle/2292/25073
Shohamy, E. (2000). Fairness in language testing. In A. J. Kunnan (Ed.), Fairness and validation
in language assessment (pp. 15–19). Cambridge, England: Cambridge University Press.
Shohamy, E. (2001a). The power of tests: A critical perspective on the uses of language tests.
Harlow, England: Longman/Pearson. http://dx.doi.org/10.4324/9781315837970
Shohamy, E. (2001b). The social responsibility of the language testers. In R. L. Cooper (Ed.), New
perspectives and issues in educational language policy (pp. 113–130). Amsterdam, Netherlands/
Philadelphia, PA: John Benjamins Publishing Company. http://dx.doi.org/10.1075/z.104.09sho
Shohamy, E. (2006). Language policy: Hidden agendas and new approaches. New York, NY:
Routledge. http://dx.doi.org/10.4324/9780203387962
Shohamy, E. (2007). Tests as power tools: Looking back, looking forward. In J. Fox, M. Wesche,
D. Bayliss, L. Cheng, C. E. Turner, & C. Doe (Eds.), Language testing reconsidered (pp. 141–
152). Ottawa, Canada: University of Ottawa Press.
Skehan, P. (2001). Tasks and language performance assessment. In M. Bygate, P. Skehan, &
M. Swain (Eds.), Researching pedagogic tasks: Second language learning, teaching and test-
ing (pp. 167–185). London, England: Longman.
Skehan, P. (2009). Modelling second language performance: Integrating complexity, accuracy,
fluency, and lexis. Applied Linguistics, 30(4), 510–532. http://dx.doi.org/10.1093/applin/
amp047
Smallwood, I. M. (1994). Oral assessment: A case for continuous assessment at HKCEE level.
New Horizons: Journal of Education, Hong Kong Teachers’ Association, 35, 68–73.
Spada, N. (2007). Communicative language teaching: Current status and future prospects. In
J. Cummins & C. Davison (Eds.), International handbook of English language teaching
(pp. 271–288). New York, NY: Springer. http://dx.doi.org/10.1007/978-0-387-46301-8_20
Spolsky, B. (1985). The limits of authenticity in language testing. Language Testing, 2(1), 31–40.
http://dx.doi.org/10.1177/026553228500200104
Spolsky, B. (1995). Measured words. Oxford, England: Oxford University Press.
Sunstein, B. S., & Lovell, J. H. (Eds.). (2000). The portfolio standard: How students can show us
what they know and are able to do. Portsmouth, NH: Heinemann.
Swain, M. (1984). Large-scale communicative language testing: A case study. In S. Savignon &
M. Berns (Eds.), Initiatives in communicative language teaching: A book of readings (pp. 185–
201). Reading, MA: Addison-Wesley.
Swain, M. (2001). Examining dialogue: Another approach to content specification and to validat-
ing inferences drawn from test scores. Language Testing, 18(3), 275–302. http://dx.doi.
org/10.1177/026553220101800302
Taylor, L. (2001). The paired speaking test format: Recent studies. Research Notes, 6, 15–17.
Taylor, L., & Wigglesworth, G. (2009). Are two heads better than one? Pair work in L2 assessment
contexts. Language Testing, 26(3), 325–339. http://dx.doi.org/10.1177/0265532209104665
The University of Queensland. (2012). About flipped classrooms. Retrieved from http://www.uq.
edu.au/tediteach/flipped-classroom/what-is-fc.html
Tomlinson, B. (Ed.). (2011). Materials development in language teaching (2nd ed.). Cambridge,
England: Cambridge University Press.
Torrance, H. (Ed.). (2013a). Educational assessment and evaluation: Major themes in education
(Purposes, functions and technical issues, Vol. 1). London, England/New York, NY: Routledge.
Torrance, H. (Ed.). (2013b). Educational assessment and evaluation: Major themes in education
(Current issues in formative assessment, teaching and learning, Vol. 4). London, England/New
York, NY: Routledge.
Turner, J. (1998). Assessing speaking. Annual Review of Applied Linguistics, 18, 192–207. http://
dx.doi.org/10.1017/s0267190500003548
University of Cambridge. (2014). IGCSE syllabus for Dutch, French, German and Spanish.
Cambridge, England: University of Cambridge International Examinations.
Van den Branden, K. (2006). Introduction: Task-based language teaching in a nutshell. In K. Van
den Branden (Ed.), Task-based language education: From theory to practice (pp. 1–16).
Cambridge, England: Cambridge University Press. http://dx.doi.org/10.1017/
cbo9780511667282.002
van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils: Oral proficiency
interviews as conversation. TESOL Quarterly, 23, 489–508. http://dx.doi.org/10.2307/3586922
Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes.
Cambridge, MA: Harvard University Press.
Wajda, E. (2011). New perspectives in language assessment: The interpretivist revolution. In
M. Pawlak (Ed.), Extending the boundaries of research on second language learning and
teaching (pp. 275–285). Berlin, Germany: Springer. http://dx.doi.org/10.1007/978-3-642-20141-7_21
Weimer, M. (2002). Learner-centered teaching: Five key changes to practice. San Francisco, CA:
Jossey-Bass.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Basingstoke,
England: Palgrave Macmillan.
Willis, D., & Willis, J. (2007). Doing task-based teaching. Oxford, England: Oxford University
Press.
Winke, P. (2011). Evaluating the validity of a high-stakes ESL test: Why teachers’ perceptions
matter. TESOL Quarterly, 45(4), 628–660. http://onlinelibrary.wiley.com/doi/10.5054/
tq.2011.268063/abstract
Wood, R. (1993). Assessment and testing. Cambridge, England: Cambridge University Press.
Yoffe, L. (1997). An overview of the ACTFL proficiency interview: A test of speaking ability.
Shiken: JALT Testing & Evaluation SIG Newsletter, 1(2), 2–13.
Index

A
Accountability, 9, 12, 30–32, 35, 53, 81, 87, 197, 199, 202, 204
Accuracy, 5, 25, 86, 113, 114, 119, 148, 160–162, 167, 176, 180, 186, 193, 207
Alternative assessment, 36, 52
American Council on the Teaching of Foreign Languages (ACTFL), 4–5, 27, 30, 41
Anxiety, 33, 43, 81, 113, 115, 122, 129, 176, 180, 181, 193, 200, 201
Assessment blueprints, 1, 7, 56, 69, 72
Assessment Reform Group, 33, 52
Audio-lingualism, 3, 5
Authenticity, 6, 38, 65, 79, 81–83, 86–88, 101, 106, 109, 110, 112, 113, 118, 119, 123, 147, 150, 166, 173, 180–182, 192, 193, 198, 205
Automaticity, 4, 5, 27, 28, 191, 199, 200, 204, 205

B
Best Practice Workshops, 71, 171

C
Common European Framework of Reference for languages (CEFR), 5, 27, 58, 85, 129, 148, 205
Communicative competence, 3, 4, 25–30, 34, 61
Communicative interaction, 28, 82
Communicative Language Teaching (CLT), 3–6, 25, 43, 58, 82
Communicative proficiency, 5, 28, 40, 44, 45, 58, 59, 175, 194
Complexity, 162, 199, 207
Consequences, 7, 9, 12–15, 17, 20, 33, 45, 81, 189, 198, 201, 207, 208
Consequential validity, 207
Construct, 110
  irrelevant variance, 13, 14, 44, 197
  under-representation, 13, 14, 42, 194
  validity, 10, 87, 101, 192
Criterion-referencing, 56
Curriculum, 127, 178

D
Discourse competence, 26, 28
Dynamic, 30, 33, 35–37, 40, 45, 51, 78, 84, 187, 189, 190, 196, 199–201, 205, 207
Dynamic assessment, 31

F
Fairness, 11, 14, 15, 32, 44, 130, 195
Feedback, 33–35, 37, 52, 55, 72, 84, 86–88, 114, 115, 132, 141, 151, 190, 200, 204
Feedforward, 33, 52, 55, 72, 86, 88, 114, 168, 190
Fit for purpose, 2, 8, 11, 32, 58, 78, 83, 111, 118, 130, 144, 161, 187, 190, 202, 207
Fluency, 5, 25, 85, 87, 113, 114, 117, 148, 159, 162, 163, 170, 176, 183, 185, 186, 191, 193, 207
Formative, 31, 33, 34, 40, 54, 86

G
Grammar-translation, 3, 5, 58
Grammatical competence, 26

H
High-stakes assessment, 1, 2, 6, 20, 42, 51, 53–56, 77, 83, 86, 129, 168, 186, 195, 198, 201, 202

I
IGCSE, 132
Impact, 13, 17, 33, 44, 45, 69, 79, 81, 84, 87, 88, 101, 106, 110, 111, 114, 115, 120, 121, 123, 126, 127, 129, 130, 135, 137, 138, 156, 173, 178, 179, 181, 186, 189, 192–194, 196, 201, 205
Impracticality, 125–127, 131, 137, 144, 202, 203, 208
Interaction hypothesis, 4, 191
Interactional authenticity, 82, 83, 152
Interactional competence, 27–29, 42, 43, 45, 81, 87, 148, 170
Interactional proficiency, 42, 86, 147–149, 157, 169, 176, 182, 199
Interactiveness, 79, 81, 83, 88, 101, 106, 109, 110, 112, 113, 118, 121, 123, 173, 182, 192, 193
Intercultural communicative competence, 29
Intercultural competence, 28
Interlocutor effects, 44
Interlocutor variables, 44, 45, 129, 130, 138, 143, 178, 184, 186, 195, 205
Internal assessments, 6, 53, 59, 87, 101, 128
Interview test, 1, 41–43, 45, 58, 77, 84, 87, 111, 193, 197, 199

K
Key competencies, 61, 64

L
Learning languages, 7, 62, 64, 67, 68

M
Ministry of Education, 52, 63, 64, 66, 202

N
National Certificate of Educational Achievement (NCEA), 6, 51, 56, 66, 147, 197
New Zealand Curriculum Framework (NZCF), 53, 55
New Zealand Qualifications Authority (NZQA), 2, 55, 202, 203
Norm-referenced, 54, 197
Norm-referencing, 56

O
Oral Proficiency Interview test, 41

P
Paired/group, 30, 44, 45, 51, 187
Peer-to-peer interactions, 1, 4, 77, 115, 130, 183, 187, 193, 201
Performance outcomes, 9–11, 13, 14, 32, 36, 81
Performance scores, 2, 11, 12, 207
Performance-based assessment, 36, 37, 204
Portfolio, 34–36, 116, 122, 127, 128, 132, 135–137, 144, 194, 195
Practicality, 79, 81, 82, 87, 88, 101, 106–111, 131, 137, 192, 193, 202, 203
Pragmatic competence, 28
Proficiency, 32
Proficiency movement, 3, 4, 36
Psychometric model, 10, 31, 33
Psychometrics, 12

Q
Qualitative research, 16

R
Rebuttals, 88
Reliability, 10–12, 16, 32, 33, 35–37, 53, 78–80, 83, 87, 101, 109, 110, 116, 130, 173, 191–193, 201, 202
Role-play, 38, 113, 134, 135, 140, 149

S
Scaffolding, 62, 93, 131–134, 141, 142, 144, 150, 156, 183, 200
Situational authenticity, 82, 148, 205
Sociolinguistic competence, 26, 28, 29, 86
Spoken communicative proficiency, 1, 6, 11, 14, 17, 25–30, 36, 37, 39–41, 43, 45, 51, 53, 61, 77, 79, 81, 83, 84, 87, 96, 97, 101, 104, 106, 111, 147, 173, 187, 189–193, 202, 204, 206, 208
Spontaneity, 42, 86, 87, 113, 122, 133, 139–143, 147, 148, 150, 156–159, 161, 162, 165, 167, 170, 179–181, 183, 193, 201
Standards, 56, 66
Standards-Curriculum Alignment Languages Experts (SCALEs) project, 64
Static, 30, 32, 33, 35–37, 40, 45, 51, 78, 187, 189, 190, 197, 199–202, 205, 207
Strategic competence, 26, 27, 113, 114, 120, 161, 184, 200
Stress, 81, 106, 115, 129, 158, 173, 175, 177, 192, 193, 195, 196, 201
Summative, 6, 31–33, 35, 40, 53, 54, 61, 63, 65, 77, 86, 116, 193, 197, 199, 201, 202, 204

T
Task-based language assessment (TBLA), 37–39, 155
Task-based language teaching (TBLT), 5–7, 25, 37–39, 62
TLU domain, 30, 32, 38, 39, 82, 83
Triangulation, 95, 97, 207

U
Usefulness, 8, 13, 17, 45, 78, 79, 81, 84, 87, 88, 91, 101, 103, 107, 109, 111, 112, 126, 173, 175, 185, 187, 190, 192, 193, 200, 202, 206, 207

V
Validation, 3, 12, 14, 16, 17, 21, 80, 207
Validity, 2, 8, 10–18, 20, 32, 33, 35–37, 41, 44, 53, 78–80, 87, 109, 110, 116, 123, 130, 173, 191, 193, 195, 202, 203, 206–208

W
Warrants, 88
Washback, 33, 41, 43, 59, 69, 72, 116, 122–124, 168–170, 178, 191, 194, 195, 203

Z
Zone of proximal development (ZPD), 34, 191
