
Assessing Speaking

Fumiyo Nakatsuhara, Nahal Khabbazbashi and Chihiro Inoue

Introduction
Over half a century ago, Lado (1961: 239) stated:
the ability to speak a foreign language is without doubt the most highly prized language skill…
Yet testing the ability to speak a foreign language is perhaps the least developed and the least
practiced in the language testing field.
The field of spoken assessment has come a long way since Lado’s observation. One need
look no further than the wide range of international and local speaking tests available worldwide
as well as the wealth of research studies on the subject. The choices of tests offered to users vary
along several parameters such as task format, delivery mode, and scoring methods, all of which
have important implications in terms of construct coverage and practical demands. Recent
advances in digital technologies and disruptive innovations have been an accelerating force in
challenging current practices, introducing new research methodologies, and contributing to the
diversification of speaking assessments resulting in a field where the state of the art is constantly
in flux.
Before exploring some of the above parameters and assessment practices in more depth, it
is important to first provide a historical overview of the testing of speaking. While any such
account is by its nature partial “and at the mercy” of available evidence (Weir, Vidaković and
Galaczi, 2013: 1), we can only understand where we are now by looking back at where we have
come from.

Historical perspectives
While spoken assessment has a long history in content-based educational settings, where writing
was considered ancillary to speaking (Latham, 1877; Russell, 2002), the assessment of second language (L2) speaking appears to have emerged more recently; see Fulcher (2015a) for a
comprehensive overview of the history of speaking assessment and areas of prominent research.
The Cambridge Certificate of Proficiency in English was launched in 1913 and was
predominantly designed for English teacher trainees. It is one of the earliest L2 proficiency tests
that featured speaking components - dictation, read-aloud and conversation tasks in an examiner-
candidate format as well as a written paper on knowledge of phonetics (Weir et al., 2013). In its
original form (1913-45), the test mirrored prominent approaches to L2 speaking in late 19th-century Europe, such as the Direct Method, the Oral Method with its primary focus on oral fluency (Palmer, 1921), and the Reform Movement, which emphasised the importance of spoken language, in particular phonetics (Sweet, 1899).
The experience of the two World Wars and pressing military needs led to an exponential
growth in L2 oral assessment research and practices in the mid-20th century. In particular, the US
Foreign Service Institute (FSI) Oral Proficiency Interview (OPI), operationalised in 1952, was a
significant initial step towards defining and measuring a multi-trait speaking construct reliably.
The original form of the test was accompanied by a six-band holistic scale with performance
descriptors defining criterial features at each level, and a checklist of five analytic criteria (accent,
comprehension, fluency, grammar and vocabulary) was later added in 1958 to address variation in
scores psychometrically (Sollenberger, 1978). The OPI test was seminal in the history of speaking
assessment, paving the way for the booming of subsequent oral assessment initiatives in the US in
the 1960s to 1980s (Fulcher, 2003). These include the Interagency Language Roundtable (ILR)
speaking test and the American Council on the Teaching of Foreign Languages (ACTFL) OPI. In
the UK, the University of Cambridge Local Examinations Syndicate (UCLES) also started
denoting performance descriptors in the 1970s in an attempt to define the test construct more
explicitly and to standardise speaking examiners’ rating judgements. However, the development of rating scales and performance descriptors in those days was primarily based on expert judgment and intuition rather than empirical evidence (Fulcher, 2003). The fields of
language testing, second language acquisition and applied linguistics were still young, with little
research that could inform the understanding of the nature of L2 speaking ability and performance.
The landscape of the field looked vastly different in the 1990s; theoretical models such as
Bachman’s (1990) model of communicative language ability started to emerge along with
empirical research exploring those models. Several approaches to developing and validating rating
scales – and by extension understanding the measured speaking construct – were examined. They
include Fulcher’s (1996) analysis of learner speech to develop data-driven rating scales, Pollitt and
Murray’s (1996) investigation into examiners’ perceptions of proficiency when rating spoken
performances, and North and Schneider’s (1998) measurement-driven approach to scale development as embodied in the Common European Framework of Reference for Languages (CEFR).
Since the 1990s, the constructs measured in spoken tests have become increasingly
diverse as represented in large-scale international speaking assessments. To illustrate, in addition
to the traditional examiner-candidate interview format as in the IELTS Speaking test since 1989,
a paired speaking format was introduced in the Cambridge Main Suite examinations during the
1990s and early 2000s. This allowed for a broadening of the speaking construct and the assessment
of richer language of an interactional nature (Galaczi, 2014). While a social and interactional trait thus became more prominent on the one hand, automated approaches to scoring spoken responses, as used in the Pearson Test of English Academic launched in 2009, placed the emphasis on a cognitive, psycholinguistic construct on the other (Van Moere, 2012). Semi-direct
speaking tests – as characterised by computer (or online) delivery of tasks, digital response
capturing and human rating – such as the British Council’s Aptis Speaking test introduced in 2012
(O’Sullivan and Dunlea, 2020), target a selected range of spoken language elicited in monologic
and simulated Q&A formats. Advances in computer technologies have facilitated a widening of the construct assessed in semi-direct tests; for instance, TOEFL iBT launched in 2005 (Chapelle,
Enright and Jamieson, 2008) introduced integrated tasks of listening and speaking and of reading,
listening and speaking. In direct speaking tests as well, to embrace the role of listening in effective interaction (Nakatsuhara, 2018), candidates’ listening comprehension skills are now assessed as part of some interactional speaking tests – see, for example, the speaking and listening rating scales for Trinity’s Integrated Skills in English (ISE) examinations (Trinity College London,
2021).

Critical issues

The historical overview above illuminates some of the major influences – theoretical, pedagogical,
social, political, and practical – on the conceptualisation and operationalisation of the speaking
construct and how it has shifted over different periods of time. In the 21st century, it is technology
that is arguably the strongest influence, presenting both challenges and opportunities in the
assessment of speaking. In the next few sections, we draw on the latest research and practice to
examine some of the most important parameters in speaking assessment in the digital age and
highlight critical issues.

Speaking tasks and measured constructs


At the heart of speaking assessment are the constructs that we intend to tap into, and test tasks play
a fundamental role in operationalising these constructs and eliciting performances. While several
task classification systems exist, a useful approach is to distinguish tasks in terms of the conditions
in which candidates’ output language is elicited. On this basis, we have broadly classified the most
common speaking task types into four categories of elicited imitation, independent, integrated and
interactive tasks and will discuss each in more detail highlighting their relative strengths and
limitations.

Elicited Imitation tasks


Elicited imitation (EI) refers to an assessment technique in which candidates are required to listen
to utterances and repeat them verbatim or to read aloud written sentences (Yan, Maeda, Lv and
Ginther, 2016). While the popularity of EI tasks declined in the 1980s with the rise of communicative approaches to language teaching and assessment, they have made a comeback in recent years with the advent of automated speech evaluation (ASE) systems. The use of EI tasks is closely related to the
functioning of the speech recogniser element of ASE systems; predictable speech can be
recognised and scored with greater accuracy and precision compared to spontaneous unpredictable
speech (Xi, Higgins, Zechner and Williamson, 2012).
There has been much debate in the language testing literature regarding EI tasks and their
construct validity: for example, their successful completion has been suggested to reflect parsing
and comprehension of the input (Bley-Vroman and Chaudron, 1994) while others attribute it to
phonological short-term memory capacities (Gathercole and Baddeley, 1993). Van Moere (2012:
325) suggests that these tasks can tap into processing competence defined as “the efficiency with
which learners process language” - a core competence for any L2 spoken production and
interaction. Yan et al.’s (2016) meta-analysis of EI studies generally supported their use as a
reliable measure of L2 spoken proficiency. On the other hand, these tasks have been criticised for
their narrow representation of the speaking construct and their limited authenticity (cf. Galaczi, 2010; Yan et al., 2016); the tasks, in themselves, do not require candidates to generate ideas or
to draw on lexical resources and syntactic knowledge to translate ideas into speech – processes
which are essential for the spontaneous production of speech (Field, 2011). A suggestion put forward by Yan et al. (2016) is to re-evaluate EI tasks in light of this research and to combine them with (rather than replace) other communicative language tasks, in order to build on their strengths while also addressing their limitations.

Independent tasks
Independent speaking tasks require candidates to “draw on their own ideas or knowledge [in order]
to respond to a question or prompt” (Brown, Iwashita and McNamara, 2005: 1) and commonly
include Q&A, picture description and individual long-turn tasks. Among these, (simulated) Q&A
tasks – where questions are posed by an examiner (or a computer) – elicit spontaneous responses
from candidates. The difficulty level of questions can be graded according to the degree of
demands in relation to required cognitive processes such as conceptualisation, lexical and
grammatical encoding (Field, 2011). Questions can be formulated to ask about simple, factual
information immediately relevant to the candidates, or can elicit more extended and coherent
responses of an abstract nature. While computer-delivered tests currently use only scripted
questions, one of the benefits of this format in face-to-face tests is its authenticity, allowing the
interlocutor to employ semi-scripted or unscripted questions to probe the proficiency level of
candidates as the interview sequences unfold. However, interlocutor variability has been
problematised and extensively researched for this format, suggesting that striking a balance
between interlocutor standardisation and authentic interaction is key to the successful use of this
task (e.g. Brown, 2003; Lazaraton, 2002; Seedhouse and Nakatsuhara, 2018).
The individual long-turn, which requires more organised and coherent speech than interview or picture description tasks, is usually regarded as the most challenging independent task
type. These tasks typically provide candidates with verbal and/or visual prompt(s) based on which
extended stretches of speech are elicited. One of the issues in these tasks is the extent to which the
topic of the prompt and candidates’ background knowledge of the topic affect their performance
(Khabbazbashi, 2017). The provision of pre-task planning time is therefore considered valuable to
minimise variance related to the conceptualisation demands caused by topic, though research evidence on the benefits of planning time for enhancing candidates’ performances under testing conditions is often mixed or unclear (e.g. Wigglesworth and Elder, 2010).

Integrated tasks
Independent tasks have been criticised for the absence of input and for not allowing candidates an
“equal footing in terms of background knowledge” (Weigle, 2004: 30). Consequently, integrated
speaking tasks which require candidates “to speak about a topic for which information [has] been
supplied from another source” (Jamieson, Eignor, Grabe and Kunnan, 2008: 76) have attracted
greater attention in recent years. Technology has facilitated the use of such task types in computer-
mediated tests (e.g. TOEFL iBT, Oxford Test of English) owing to the ease of presenting various
modalities in a single test setting.
While recent research has suggested that integrated tasks are “not immune to the influence of prior topical knowledge on scores” (Huang, Hung and Plakans, 2018: 43), contrary to what has been posited in the literature, various other benefits of integrated tasks have been put forward. Integrated tasks are
considered to better reflect real-life communicative acts especially in the educational and
professional target language use (TLU) domains, and to be tapping into a different construct from
independent speaking tasks. Specifically, they not only require reading and/or listening and speaking skills, but also engage other cognitive skills, such as those for selecting, organising, and transforming source information for the production of output (Barkaoui, Brooks, Swain and
Lapkin, 2012; Brown et al., 2005). In the endeavour to better understand the construct of integrated
tasks, critical topics for investigation include the balance between input and output skills in terms
of source difficulty and factor structures. For example, research has offered interesting insights into the extent to which scores derived from integrated test tasks can be attributed to input skill(s) rather than speaking skills (Sawaki, Stricker and Oranje, 2009); whether or not lower and higher proficiency candidates are differentially impacted by the input task (Ockey, 2018); and the extent to which scores from integrated and independent tasks can defensibly be combined for reporting purposes (Lee, 2006).

Interactive tasks
The focus of this task type is the interactional aspects of spoken language where candidates are
typically paired or grouped to interact with one another. Interactive tasks gained in popularity in
response to concerns raised about the asymmetrical nature of interaction and pre-allocated
question-answer turns in traditional examiner-candidate interview tasks (e.g. Johnson, 2001;
Seedhouse and Nakatsuhara, 2018). Paired and group tasks enable candidates to demonstrate a
wider range of language functions by allowing them to negotiate ongoing interactional
organisation (Brooks, 2009; Galaczi, 2014). Fulcher (2003: 189) also notes the potential of these
tasks to tap into more cognitively and strategically demanding performances which are likely to
mirror non-test conversations. As such, these tasks are considered suitable to assess candidates’
interactional competence (Galaczi and Taylor, 2018; Plough, Banerjee and Iwashita, 2018) in
addition to more linguistically oriented competence.
However, one long-standing issue with these tasks is the interlocutor effect, which relates
to the potential impact on performance of the person(s) candidates happen to be paired or grouped
with. A number of studies have investigated the influence of candidates’ own and their partners’
characteristics such as personality, proficiency level, L1 background, gender and acquaintanceship
(e.g. Berry, 2007; O’Sullivan, 2008). Findings from these studies are often mixed, suggesting
complex interactions among those factors as well as with external factors such as the size of groups
(Nakatsuhara, 2011), making it practically impossible to control for all interlocutor variables.
Related concerns include the separability of test scores for individuals involved in the co-
construction of interaction (May, 2011), and score reliability due to increased variability associated
with co-participants (Bonk and Ockey, 2003; Van Moere, 2006). While these issues might appear
to diminish the value of these interactional tasks, Brown (2003: 20) notes that such variation cannot
be construct-irrelevant “especially where the construct can be interpreted as encompassing
interpersonal communication skills”. Nevertheless, attempts should be made to ensure fairness to
candidates, for example, by offering multiple test tasks or multiple test occasions (Van Moere,
2006).

Test delivery modes


The construct coverage of a speaking test depends on the types of elicitation tasks described so far, and computer-mediated tests used to be associated with only certain kinds of tasks, namely elicited imitation and fundamentally monologic task types. However, recent advances in technologies have
started widening the range of tasks that computer-mediated speaking tests can deliver. In other
words, the affordances of technology have started to show the potential to expand the range of constructs that computer-mediated speaking tests can tap into (see also Xi, this volume; Sasaki,
this volume).
For example, the use of video-conferencing technology to deliver a face-to-face speaking
test enables reciprocal interaction without relying on the proximity of speakers. It also reflects the
ways in which we communicate nowadays in the social, educational and professional domains. An
increasing body of research has investigated the cognitive, contextual and scoring validity of
video-conferencing-delivered tests, as well as examining their feasibility (e.g. Nakatsuhara, Inoue,
Berry and Galaczi, 2017; Ockey, Timpe-Laughlin, Davis and Gu, 2019; Zhou, 2015). These
studies also suggest that the speaking construct under the video-conferencing condition embraces more explicit negotiation of meaning and clearer indication of turn-taking than in face-to-face communication.
The application of virtual environments has also been explored for its potential to enhance the authenticity of communication. Ockey, Gu and Keehner (2017) explore the possibilities
offered by virtual environments in providing candidates with an immersive and true-to-life
experience that can facilitate the elicitation of collaborative discussions. Furthermore, the use of
Spoken Dialogue Systems (SDS) in conjunction with virtual environments seems like a promising
avenue. A review of literature on educational games indicates how virtual characters can generate
responses and ask follow-up questions while giving some support and feedback to learners (e.g.
Morton, Gunson and Jack, 2012). However, challenges still remain, particularly in terms of high word error rates in L2 speech recognition and the difficulty of predicting all possible L2 utterances even in highly defined contexts. As Litman, Strik and Lim (2018) note, current computer-human interactions are mostly limited to the utterance level and do not generate responses over several turns.
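As a concrete illustration of what a word error rate (WER) figure expresses, the sketch below implements the generic WER formula: the number of word-level substitutions, deletions and insertions needed to align the recogniser’s output with the reference transcript, divided by the number of reference words. This is a minimal, illustrative Python sketch under our own assumptions; it is not code from any ASE system discussed here, and the example utterances are invented.

```python
# Illustrative sketch only: a generic word error rate (WER) calculation,
# not the metric of any particular ASE system discussed in this chapter.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance computed over words rather than characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two word-level errors against a five-word reference give a WER of 0.4.
print(word_error_rate("I would like to travel", "I would like travel abroad"))
```
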
Marking models and rating approaches
We have thus far discussed different ways of eliciting candidates’ spoken language. Once elicited,
another critical issue relates to how these speech samples are rated. Marking models – the
approaches used for assigning scores/ratings to performances – have been shown to affect rater
marks (e.g. Barkaoui, 2011) and are therefore an important aspect of a test’s validity argument.
Hunter, Jones, and Randhawa (1996: 62) refer to a “continuum of scoring practices”, with a holistic approach as a “wide-angle assessment of the whole, without clearly delineated criteria” at one end and atomistic scoring with “microscopic assessment of constituent parts” but no overall perspective on candidate performance at the other (see Galaczi and Lim, this volume; Ockey, this
volume for a discussion of methods used for ensuring and evaluating judgment quality or different
measurement techniques).
A review of the literature suggests holistic, analytic, and part scoring as the most widely used human-mediated marking models in speaking assessment, with performance decision trees (Fulcher, Davidson and Kemp, 2011) and binary-choice scales (O’Grady, 2019) gaining in popularity. Advances in speech processing and machine learning technologies are also transforming the ways in which speaking is tested, as evidenced in the increasing use of automated approaches to the marking of speaking (Chen et al., 2018; Xi, Higgins, Zechner and Williamson, 2012). Definitions and discussions of the strengths and limitations of these approaches are documented extensively elsewhere (cf. Davis, 2018; Galaczi and Lim, this volume; Khabbazbashi
and Galaczi, 2020; Xi, this volume). Empirical research on how these different approaches
compare, however, is rather limited, particularly in speaking assessment. In those cases where such comparisons have been made (e.g. Chen et al., 2018; Harsch and Martin, 2013; Xi, 2007), there has been a reliance on correlations in reporting empirical relationships. This may disguise the impact of scoring approaches on individual candidate marks, and it is therefore important to consider alternative ways of comparison with an explicit focus on “practical significance” (Fulcher, 2003: 65). For example, in their comparison of holistic, analytic, and part marking models in speaking, Khabbazbashi and Galaczi (2020) showed that, despite high and statistically significant correlations, the choice of marking model had an impact on candidates’ final CEFR
classifications. An implication of this study relates to the comparison between human and machine
marks in automated systems. Automated speech evaluation technologies largely rely on human-awarded marks as the ‘gold standard’ for training and evaluating systems (Chen et al., 2018; see Fulcher, 2010: 296-297 for a discussion of issues related to using a ‘gold standard’). Given that those marks are influenced by the choice of marking model, that choice, by extension, has an impact on the source data for machine learning and for system evaluations. As such, these effects should be taken into account and made transparent when reporting human-machine agreement levels.
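To illustrate why correlation alone can mask practically important differences, the brief sketch below uses invented marks for ten hypothetical candidates and arbitrary cut scores: the two sets of marks correlate very highly, yet some candidates receive different CEFR bands depending on which set is used. All numbers, cut scores and band boundaries are assumptions for illustration and are not drawn from Khabbazbashi and Galaczi (2020) or any other study.

```python
# Hypothetical illustration: two sets of marks for the same ten candidates can
# correlate highly yet place several candidates in different CEFR bands once
# cut scores are applied. Scores, cut-offs and bands are invented for the sketch.
from statistics import correlation  # Pearson's r; requires Python 3.10+

model_a = [3.2, 4.1, 4.8, 5.5, 5.9, 6.4, 7.0, 7.6, 8.1, 8.8]
model_b = [3.0, 4.4, 5.2, 5.4, 6.2, 6.3, 7.4, 7.5, 8.5, 8.7]

def to_cefr(score: float) -> str:
    """Map a 0-9 scale to broad CEFR bands using illustrative cut scores."""
    if score < 4.0:
        return "A2"
    if score < 5.5:
        return "B1"
    if score < 7.0:
        return "B2"
    if score < 8.5:
        return "C1"
    return "C2"

bands_a = [to_cefr(s) for s in model_a]
bands_b = [to_cefr(s) for s in model_b]
exact_agreement = sum(a == b for a, b in zip(bands_a, bands_b)) / len(bands_a)

print(f"Pearson r: {correlation(model_a, model_b):.2f}")    # very high correlation
print(f"Exact CEFR band agreement: {exact_agreement:.0%}")  # yet imperfect classification agreement
```
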

Main research methods


The previous sections touched upon some of the important parameters in the assessment of
speaking; we now focus our attention on the different approaches used for examining these factors.
The field of speaking assessment has benefitted from a wide range of research methodologies,
including analysis of TLU-domains and speaking test corpora (Cushing, this volume), analysis of
scoring validity using CTT and IRT (Brown, this volume; Ockey, this volume), and analysis of
rater perceptions and rater behaviour (Galaczi and Lim, this volume) amongst others. However,
since these research methods are shared with the assessment of other skills, here we examine
research methodologies unique to spoken assessment.
Over the past three decades, there has been a proliferation of studies that have analysed speaking test discourse, rather than, or in addition to, test scores. Results of such
research are valuable sources for the design, description, and most crucially the validation of
speaking tests as they allow for an examination of the extent to which the intended construct of a
speaking test is actually reflected in elicited performances. We now review three methods of
analysis which have been found particularly useful in speaking assessment and consider their
strengths and limitations.

Conversation Analysis
One of the most popular qualitative analysis methods used to investigate discourse in spoken
language tests is Conversation Analysis (CA; Sacks, Schegloff and Jefferson, 1974). Since
Lazaraton’s (2002) pioneering application of CA techniques on a range of Cambridge
examinations in the early 1990s, CA has revealed various interactional features of spoken
discourse in interview, paired and group oral formats (e.g. Brown, 2003; Galaczi, 2014; Lam,
2018; Seedhouse and Nakatsuhara, 2018). These studies have mainly focused on how the
interactional organisation reflects the institutional aim of eliciting construct-relevant discourse, to
what extent the organisation resembles the CA benchmark of ‘ordinary’ conversation, and in what
ways high-scoring and low-scoring candidates demonstrate their interactional competence.
Following the conventions of CA, these studies regard the repeated listening to or viewing of audio or video recordings in producing transcripts as an important part of the analysis for discovery (Hutchby and Wooffitt, 1998). The analysis is based on analysts’ development of ‘emic’ perspectives to understand “why that now?” in interaction (Schegloff and Sacks, 1973: 299). It is worth highlighting that coding and quantification in CA are quite controversial, since the quantification of CA data is considered premature when our understanding of the phenomena that we may wish to count is still partial (Schegloff, 1993). However, Heritage (1995: 404)
acknowledges that quantification could be successful for well-defined elements in ‘institutional
talk’, to which speaking test interactions belong. It is therefore not uncommon to see some sort of
quantification in CA research in the field of speaking assessment, in order to supplement purely
qualitative CA findings (e.g. Galaczi, 2014; Nakatsuhara, 2011).

Language Function Analysis


While CA is used to describe interactional organisation, language function analysis is a useful tool for providing summative information on the nature of spoken utterances in relation to the cognitive and contextual demands posed by test tasks. For example, language functions observed in relevant
TLU domains can inform test and task design. At test development and piloting stages, they can
be used to verify or modify draft test specifications and materials. Their analysis can also be used
as part of on-going validation research of operational tests to monitor test version comparability
and to ensure that there is sufficient overlap between intended and observed language functions.
For example, O’Sullivan, Weir and Saville’s (2002) function checklist consists of an
extensive table of functions categorised under informational, interactional and managing
interaction functions. While originally designed for the analysis of paired spoken discourse in the
Cambridge General English examinations, the list has been modified and applied to other speaking
tests, such as the IELTS Speaking Test (e.g. Brooks, 2003) and the Aptis Speaking Test (e.g.
O’Sullivan and Dunlea, 2020). It has also been utilised to investigate examiners’ input language
during a speaking test (Nakatsuhara, 2018) and real-life group discussions in class (Ducasse and
Brown, 2011).
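In practice, applying such a checklist typically involves coding each candidate turn for the function(s) it realises and then tallying the codes by task or test version, so that observed functions can be compared with those intended in the test specifications. The sketch below shows one minimal way this tallying step might be implemented; the function labels and coded turns are invented for illustration and are not taken from O’Sullivan, Weir and Saville’s (2002) checklist.

```python
# Hypothetical illustration of tallying coded language functions across tasks.
# The function labels and coded data are invented for this sketch.
from collections import Counter

# Each tuple: (task, function code assigned by a trained coder to one turn).
coded_turns = [
    ("Task 1", "informational: providing personal information"),
    ("Task 1", "informational: expressing opinions"),
    ("Task 2", "interactional: agreeing"),
    ("Task 2", "interactional: asking for opinions"),
    ("Task 2", "managing interaction: initiating"),
    ("Task 2", "informational: expressing opinions"),
]

# Frequency of each function per task, e.g. to check observed functions
# against those intended in the test specifications.
for task in sorted({t for t, _ in coded_turns}):
    counts = Counter(code for t, code in coded_turns if t == task)
    print(task, dict(counts))
```
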
While the analysis of language functions is useful to generate an overall picture of what
candidate language is like, a statistical summary of elicited language functions cannot capture the
level of sophistication displayed in realising each language function (Green, 2012). This limitation
points to the need for a mixed-methods approach that also involves qualitative description of a
sample of actual utterances.

Micro-linguistic Analysis
Another useful approach to investigating features of candidate performance is micro-linguistic
analysis. This is commonly conducted in studies looking into either scoring validity (e.g. seeking to validate ‘subjective’ human ratings or to refine rating scales by investigating values on the ‘objective’ measures that represent the ratings; see Hsieh and Wang, 2017; Tavakoli, Nakatsuhara and Hunter, 2020) or context validity (e.g. evaluating test administration conditions such as the provision of planning time; see Iwashita, McNamara and Elder, 2001; O’Grady, 2019; Wigglesworth and Elder, 2010).
Such micro-linguistic analyses often involve calculating variables of Complexity (both lexical and syntactic), Accuracy and Fluency (CAF) derived from task-based research, and some of the analyses can now be facilitated or partly automated by software such as Praat (Boersma and Weenink, 2016). A discussion of lexical complexity variables, including the vocd-D value and the percentage of words found in certain vocabulary lists, can be found in Pallotti (2021). For syntactic complexity, there are different ways in which candidates might syntactically complexify their performance (Norris and Ortega, 2009), including the number of words per syntactically defined unit and the amounts of coordination and subordination. For accuracy, commonly used variables include the percentage of error-free clauses and the weighted clause ratio (Foster and Wigglesworth, 2016). Regarding fluency, de Jong (2016) provides a
comprehensive summary of the three types: speed fluency (e.g. speech rate), breakdown fluency
(e.g. the number and length of silent pauses over total speaking time) and repair fluency (e.g. the
number of repetitions, false starts and reformulations). For integrated tasks, the successful
reproduction of the input text is crucial for task achievement, and therefore, some studies have
compared the idea units in the input and candidate performances (e.g. Brown et al., 2005; Frost et
al., 2012).
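To make these measures concrete, the short sketch below computes a handful of illustrative indices (speech rate, pausing, repairs and error-free clauses) from a toy annotated response. The annotation format, the 0.25-second pause threshold and the numbers themselves are assumptions for this illustration rather than the operationalisations used in the studies cited above.

```python
# Illustrative CAF-style calculations on a toy annotated response.
# The annotation scheme (pause list, clause error flags) is assumed for this
# sketch and is not the operationalisation of any particular study.

response = {
    "word_count": 92,               # pruned word count of the spoken response
    "speaking_time_sec": 48.0,      # total response time including pauses
    "silent_pauses_sec": [0.6, 1.2, 0.4, 0.9],  # pauses above a 0.25s threshold
    "clauses_error_free": [True, False, True, True, False, True, True],
    "repairs": 3,                   # repetitions, false starts, reformulations
}

minutes = response["speaking_time_sec"] / 60

# Speed fluency: words per minute over total speaking time.
speech_rate = response["word_count"] / minutes

# Breakdown fluency: number of silent pauses per minute and mean pause length.
pauses = response["silent_pauses_sec"]
pauses_per_min = len(pauses) / minutes
mean_pause_len = sum(pauses) / len(pauses)

# Repair fluency: repairs per minute.
repairs_per_min = response["repairs"] / minutes

# Accuracy: percentage of error-free clauses.
clauses = response["clauses_error_free"]
pct_error_free = 100 * sum(clauses) / len(clauses)

print(f"Speech rate: {speech_rate:.1f} words/min")
print(f"Silent pauses: {pauses_per_min:.1f} per min (mean {mean_pause_len:.2f}s)")
print(f"Repairs: {repairs_per_min:.1f} per min")
print(f"Error-free clauses: {pct_error_free:.0f}%")
```
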
While there are criticisms of the use of such micro-linguistic analyses on the grounds
that the variables may be too crude and that low-inference analytic categories may fail to capture
the reality of human communication (Fulcher, 2015b), a number of studies have found fair to good
agreement between the objective variables and subjective human ratings (e.g. Inoue, 2016). Such
approaches can therefore offer useful recommendations to enhance the contextual and scoring
validity of tests.

Future directions
In this chapter, we first looked back at the history of speaking assessment and subsequently
considered where we are now in terms of critical parameters in the testing of speaking in the digital
age. One striking observation is the role that technology has played in shaping the speaking
construct - for example in narrowing the construct where there is an over-reliance on elicited
imitation tasks and in broadening the construct in the case of integrated listening/reading into
speaking tasks. In ASE systems, technology has removed human interaction from the equation,
whereas in video-conferencing delivery and virtual environments it has facilitated interaction in
innovative ways. As Galaczi (2010: 47) argues, there is no “best way” to test speaking and the
important criterion is to evaluate different approaches in light of “fitness for purpose”. In looking
into the future, it is hard to predict in what ways technology might influence speaking assessment.
What is important, however, is to always go back to the construct and ensure that it is technology
at the service of the construct and not the other way around.
In gazing into our crystal ball, here are some questions we can ask: what might the speaking
construct look like in the future? Are current models of communicative competence adequately
capturing the complexities of language use? Elder, McNamara, Kim, Pill and Sato (2017: 15), for
example, criticise the over-reliance of existing models on linguistic features at the expense of other
more complex “non-linguistic cognitive, affective, and volitional” factors involved in interaction.
But how can we, as language testers, draw a line between what is construct-relevant and construct-
irrelevant? In an era where technology affords “seamless” blending of visual and text-based input and of written and spoken modes of delivery, and where hybrid spoken/written discourse is common, is it still defensible to test the four skills independently (cf. scenario-based assessment;
Purpura, 2016)? What would the construct of L2 speaking look like in the future when in-ear
translations may become a reality? As Litman et al. (2018: 305) speculate, will the day “come
when what we want to assess is the ability to communicate both with humans and with machines”?
With the increasing use of ASE systems, will there be a new set of ethical questions? For instance,
can machines be the ultimate arbiter of whether or not L2 speakers can study or live abroad? What
about accountability? Who is responsible, should anything go wrong? Can hybrid human-machine approaches to scoring, as proposed by Isaacs (2018) and de Jong (2018), be used to address such ethical concerns as well as to make the best use of what ASE systems and human raters can offer?
The answers to these questions are unforeseeable. It is however essential that we consider
all newly emerging questions in light of “the adequacy and appropriateness of inferences and
actions based on test scores or other models of assessment” (Messick, 1989: 13). To this end, we
should continuously seek to understand the ever-changing TLU domains and what we mean by the
term ‘speaking’.

Further readings
Fulcher, G. (2003). Testing second language speaking. London: Longman/Pearson Education.
This book, although written some time ago, provides a thorough guide into designing and
implementing speaking tests, as well as useful and critical summaries of research in L2
speaking assessment.
Fulcher, G. (2015). Assessing second language speaking. Language Teaching 48(2), 198-216.
This article illustrates a timeline of L2 speaking assessment research since the mid-19th
century by summarising publications which advanced our understanding of 12 selected
themes in the field.
Lim, G. (ed.) (2018). Conceptualizing and operationalizing speaking assessment for a new
century [Special issue], Language Assessment Quarterly, 15(3).
The articles in this special issue consider important aspects of the speaking construct such as
interactional competence, collocational competence, fluency, and pronunciation and whether
and to what extent they have been assessed while reflecting on the potential role of
technology in enhancing assessment practices.
Taylor, L. (2011). Examining speaking: Research and practice in assessing second language
speaking. Cambridge: UCLES/Cambridge University Press.
This edited volume presents a review of relevant literature on the assessment of speaking and
provides a systematic framework for validating speaking tests.

References
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford
University Press.
Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and
rater performance. Assessment in Education: Principles, Policy & Practice, 18(3), 279–
293. https://doi.org/10.1080/0969594X.2010.526585
Barkaoui, K., Brooks, L., Swain, M. and Lapkin, S. (2012). Test-takers’ strategic behaviors in
independent and integrated speaking tasks. Applied Linguistics, 34(3), 304-324.
https://doi.org/10.1093/applin/ams046
Berry, V. (2007). Personality differences and oral test performance. Frankfurt: Peter Lang.
Bley-Vroman, R. and Chaudron, C. (1994). Elicited imitation as a measure of second-language
competence. In Mackey, A., Gass, S. (eds.), Research methodology in second-language
acquisition. Hillsdale, NJ: Lawrence Erlbaum, 245-261.
Boersma, P. and Weenink, D. (2016). Praat: Doing phonetics by computer [Computer software].
http://www.PRAAT.org/
Bonk, W. J. and Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group
oral discussion task. Language Testing, 20(1), 89-110.
Brooks, L. (2003). Converting an observation checklist for use with the IELTS Speaking Test.
Research Notes, 11, 20-21.
—— (2009). Interacting in pairs in a test of oral proficiency: Co-constructing a better
performance. Language Testing, 26(3), 341-
366. https://doi.org/10.1177/0265532209104666
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency.
Language Testing, 20(1), 1-25. https://doi.org/10.1191/0265532203lt242oa
Brown, A., Iwashita, N. and McNamara, T. (2005). An examination of rater orientations and
test-taker performance on English-for-Academic-Purposes speaking tasks. TOEFL
Monograph No. TOEFL-MS-29. https://www.ets.org/Media/Research/pdf/RR-05-05.pdf
Chapelle, C., Enright, M., Jamieson, J. (Eds.) (2008). Building a validity argument for the Test
of English as a Foreign Language. London: Routledge.
Chen, L., Zechner, K., Yoon, S., Evanini, K., Wang, X., Loukina, A., . . . Ma, M. (2018).
Automated scoring of nonnative speech using the SpeechRater SM v. 5.0 engine. ETS
Research Report Series, ETS RR–18-10. https://doi.org/10.1002/ets2.12198
Davis, L. (2018). Analytic, holistic, and primary trait marking scales. In J. I. Liontas, M.
DelliCarpini and TESOL International Association (eds.), The TESOL Encyclopaedia of
English Language Teaching, 1–6. https://doi.org/10.1002/9781118784235.eelt0365
—— (2016). Fluency in second language assessment. In D. Tsagari and J. Banerjee (eds.),
Handbook of second language assessment. Berlin: Mouton de Gruyter, 203-218.
De Jong, N. H. (2016). Predicting pauses in L1 and L2 speech: the effects of utterance
boundaries and word frequency, International Review of Applied Linguistics in Language
Teaching, 54. 113-132. https://doi.org/10.1515/iral-2016-9993
—— (2018). Fluency in Second Language Testing: Insights From Different
Disciplines. Language Assessment Quarterly, 15(3), 237-
254. https://doi.org/10.1080/15434303.2018.1477780
Ducasse, A. M. and Brown, A. (2011). The role of interactive communication in IELTS
Speaking and its relationship to candidates’ preparedness for study or training contexts.
IELTS Research Reports, 12. https://www.ielts.org/-/media/research-reports/ielts-rr-
volume-12-report-3.ashx
Elder, C., McNamara, T., Kim, H., Pill, J. and Sato, T. (2017). Interrogating the construct of
communicative competence in language assessment contexts: What the nonlanguage
specialist can tell us. Language & Communication, 57, 14-21.
https://doi.org/10.1016/j.langcom.2016.12.005
Field, J. (2011). Cognitive validity. In Taylor, L. (ed.), Examining speaking: Research and
practice in assessing second language speaking. Cambridge: UCLES/Cambridge
University Press, 65-111.
Foster, P. and Wigglesworth, G. (2016). Capturing accuracy in second language performance:
The case for a weighted clause ratio. Annual Review of Applied Linguistics, 36, 98-116.
https://doi.org/10.1017/S0267190515000082
Frost, K., Elder, C. and Wigglesworth, G. (2012). Investigating the validity of an integrated
listening-speaking task: A discourse-based analysis of test takers’ oral
performances. Language Testing, 29(3), 345–
369. https://doi.org/10.1177/0265532211424479
Fulcher, G. (1996). Does thick description lead to smart tests? A data-based approach to rating
scale construction. Language Testing, 13(2), 208–238.
https://doi.org/10.1177/026553229601300205
—— (2003). Testing second language speaking. London: Longman/Pearson Education.
—— (2010). Practical language testing. London: Hodder Education
—— (2015a). Assessing second language speaking. Language Teaching 48(2), 198-216.
—— (2015b). Re-examining language testing. A philosophical and social inquiry. London, New
York: Routledge.
Fulcher, G., Davidson, F. and Kemp, J. (2011). Effective rating scale development for speaking
tests: Performance decision trees. Language Testing, 28(1), 5–
29. https://doi.org/10.1177/0265532209359514
Galaczi, E. D. (2010). Face-to-face and computer-based assessment of speaking: Challenges and
opportunities. In L. Araújo (ed.), Computer-based assessment of foreign language
speaking skills. Luxembourg, LU: Publications Office of the European Union, 29-51.
https://www.testdaf.de/fileadmin/Redakteur/PDF/Forschung-
Publikationen/Volume_European_Commission_2010.pdf
—— (2014). Interactional Competence across Proficiency Levels: How do Learners Manage
Interaction in Paired Tests? Applied Linguistics, 35(5): 553-574.
https://doi.org/10.1093/applin/amt017
Galaczi, E. and Taylor, L. (2018). Interactional competence: Conceptualisations,
operationalisations, and outstanding questions. Language Assessment Quarterly, 15(3),
219-236. https://doi.org/10.1080/15434303.2018.1453816
Gathercole, S. E. and Baddeley, A. D. (1993). Phonological working memory: A critical building
block for reading development and vocabulary acquisition? European Journal of
Psychology of Education, 8(3), 259-272.
Green, A. (2012). Language functions revisited: theoretical and empirical bases for language
construct definition across the ability range. Cambridge: UCLES/Cambridge University
Press.
Harsch, C. and Martin, G. (2013). Comparing holistic and analytic scoring methods: Issues of
validity and reliability. Assessment in Education: Principles, Policy & Practice, 20(3), 281–
307. https://doi.org/10.1080/0969594X.2012.742422
Heritage, J. (1995) Conversation analysis: methodological aspects. In U. M. Quasthoff (ed.),
Aspects of oral communication. Berlin: Walter de Gruyter, 391-418.
Hsieh, C. and Wang, Y. (2017). Speaking proficiency of young language students: A discourse-
analytic study. Language Testing, 36(1), 27-50. https://doi.org/10.1177/0265532217734240
Huang, H.-T. D., Hung, S.-T. A. and Plakans, L. (2018). Topical knowledge in L2 speaking
assessment: Comparing independent and integrated speaking test tasks. Language
Testing, 35(1), 27–49. https://doi.org/10.1177/0265532216677106
Hunter, D.M., Jones, R.M. and Randhawa, B.S. (1996). The use of holistic versus analytic
scoring for large-scale assessment of writing. The Canadian Journal of Program
Evaluation, 11(2), 61–85.
Hutchby, I. and Wooffitt, R. (1998). Conversation analysis. Cambridge: Cambridge University
Press.
Inoue, C. (2016). A comparative study of the variables used to measure syntactic complexity and
accuracy in task-based research. The Language Learning Journal, 44(4), 487-505.
Isaacs, T. (2018). Shifting sands in second language pronunciation teaching and assessment
research and practice. Language Assessment Quarterly, 15(3), 273-
293. https://doi.org/10.1080/15434303.2018.1472264
Iwashita, N., McNamara, T. and Elder, C. (2001). Can we predict task difficulty in an oral
proficiency test? Exploring the potential of an information-processing approach to test
design. Language Learning 51(3), 401-436. https://doi.org/10.1111/0023-8333.00160
Jamieson, J., Eignor, D., Grabe, W. and Kunnan, A. J. (2008). Frameworks for a new TOEFL. In
C. A. Chapelle, M. K. Enright, and J. M. Jamieson (eds.), Building a validity argument for
the Test of English as a Foreign Language. New York, NY: Routledge, 55-95.
Johnson, M. (2001). The art of non-conversation: a re-examination of the validity of the oral
proficiency interview. New Haven, London: Yale University Press.
Khabbazbashi, N. (2017). Topic and background knowledge effects on performance in speaking
assessment. Language Testing, 34(1), 23–48. https://doi.org/10.1177/0265532215595666
Khabbazbashi, N. and Galaczi, E. (2020). A comparison of holistic, analytic, and part marking
models in speaking assessment. Language Testing, 37(3), 333–
360. https://doi.org/10.1177/0265532219898635
Lam, D.M.K. (2018). What counts as “responding”? Contingency on previous speaker
contribution as a feature of interactional competence. Language Testing, 35(3), 377–401.
https://doi.org/10.1177/0265532218758126
Lado, R. (1961). Language testing: The construction and use of foreign language tests. A
teacher’s book. London: Longman.
Latham, H. (1877). On the action of examinations considered as a means of selection.
Cambridge: Dighton, Bell and Company.
Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge:
UCLES/Cambridge University Press.
Lee, Y.-W. (2006). Dependability of scores for a new ESL speaking assessment consisting
of integrated and independent tasks. Language Testing, 23(2), 131–
166. https://doi.org/10.1191/0265532206lt325oa
Litman, D., Strik, H. and Lim, G. S. (2018). Speech technologies and the assessment of second
language speaking: Approaches, challenges, and opportunities, Language Assessment
Quarterly, 15(3), 294-309. https://doi.org/10.1080/15434303.2018.1472265
May, L. (2011). Interaction in a paired speaking test: The rater’s perspective. Frankfurt: Peter
Lang.
Messick, S. (1989). Validity. In R. L. Linn (ed.), Educational measurement (3rd edition).
New York, NY: Macmillan.
Morton, H., Gunson, N. and Jack, M. (2012). Interactive language learning through speech-
enabled virtual scenarios. Advances in Human-Computer Interaction, 2012, 1-14.
https://doi.org/10.1155/2012/389523
Nakatsuhara, F. (2011). Effects of the number of participants on group oral test
performance. Language Testing, 28(4), 483-508.
https://doi.org/10.1177%2F0265532211398110
—— (2018). Investigating examiner interventions in relation to the listening demands they make
on candidates in oral interview tests. In E. Wagner and G. Ockey (eds.), Emerging issues in
the assessment of second language listening. Amsterdam/Philadelphia: John Benjamins,
205-225.
Nakatsuhara, F., Inoue, C., Berry, V. and Galaczi, E. (2017). Exploring the use of video-
conferencing technology in the assessment of spoken language: A mixed-methods study.
Language Assessment Quarterly, 14(1), 1-18.
https://doi.org/10.1080/15434303.2016.1263637
Norris, J. M. and Ortega, L. (2009). Towards an organic approach to investigating CAF in
instructed SLA: The case of complexity. Applied Linguistics, 30(4), 555–578.
https://doi.org/10.1093/applin/amp044.
North, B. and Schneider, G. (1998). Scaling descriptors for language proficiency
scales. Language Testing, 15(2), 217–262. https://doi.org/10.1177/026553229801500204
Ockey, G. J. (2018). The degree to which it matters if an oral test task requires listening. In
Wagner, E. and Ockey, G. (eds.), Emerging issues in the assessment of second language
listening. Amsterdam/Philadelphia: John Benjamins, 193-204.
Ockey, G. J., Gu, L. and Keehner, M. (2017). Web-based virtual environments for facilitating
assessment of L2 oral communication ability. Language Assessment Quarterly, 14(4), 346-
359. https://doi.org/10.1080/15434303.2017.1400036
Ockey, G., Timpe-Laughlin, V., Davis, L. and Gu, L. (2019). Exploring the potential of a video-
mediated interactive speaking assessment. ETS Research Report Series, ETS RR–19-05.
https://doi.org/10.1002/ets2.12240
O’Grady, S. (2019). The impact of pre-task planning on speaking test performance for English-
medium university admission. Language Testing, 36(4), 505-526.
https://doi.org/10.1177/0265532219826604
O’Sullivan, B. (2008). Modelling performance in tests of spoken language. Frankfurt: Peter
Lang.
O’Sullivan, B. and Dunlea, J. (2020). Aptis General technical manual Ver 2.1 Technical Reports,
TR/2020/001.
https://www.britishcouncil.org/sites/default/files/aptis_technical_manual_v_2.1.pdf
O’Sullivan, B., Weir, C.J. and Saville, N. (2002). Using observation checklists to validate
speaking-test tasks. Language Testing, 19(1), 33-56.
https://doi.org/10.1191/0265532202lt219oa
Pallotti, G. (2021). Measuring complexity, accuracy and fluency (CAF). In P. Winke and T.
Brunfaut (eds.), The Routledge Handbook of Second Language Acquisition and Language
Testing. NY: Routledge, 201–210.
Palmer, H. E. (1921). The oral method of teaching languages. Cambridge: Heffers.
Plough, I., Banerjee, J. and Iwashita, N. (2018). Interactional competence: Genie out of the
bottle. Language Testing, 35(3), 427–445. https://doi.org/10.1177/0265532218772325
Pollitt, A. and Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic and N.
Saville (eds.), Performance testing, cognition and assessment: Selected papers from the
15th Language Testing Research Colloquium. Cambridge: UCLES/Cambridge University
Press, 74-91.
Purpura, J. E. (2016). Second and foreign language assessment. The Modern Language Journal,
100(Supplement 2016), 190–208. https://doi.org/10.1111/modl.12308
Russell, D. R. (2002). Writing in the academic disciplines: A curricular history (2nd edition).
Carbondale: Southern Illinois University Press.
Sacks, H., Schegloff, E. and Jefferson, G. (1974). The simplest systematics for the organization
of turn-taking for conversation. Language, 50(4), 696-735. https://doi.org/10.2307/412243
Sawaki, Y., Stricker, L. J. and Oranje, A. H. (2009). Factor structure of the TOEFL Internet-based
test. Language Testing, 26(1), 5-30. https://doi.org/10.1177/0265532208097335
Schegloff, E. A. (1993) Reflections on quantification in the study of conversation. Research on
Language and Social Interaction 26(1), 99-128.
https://doi.org/10.1207/s15327973rlsi2601_5
Schegloff, E. A. and Sacks, H. (1973). Opening up closings. Semiotica, 8, 289-327.
http://dx.doi.org/10.1515/semi.1973.8.4.289
Seedhouse, P. and Nakatsuhara, F. (2018). The discourse of the IELTS speaking test: The
institutional design of spoken interaction for language assessment. Cambridge: Cambridge
University Press.
Sollenberger, H. E. (1978). Development and current use of the FSI Oral Interview test. In Clark,
J. L. D. (ed.) Direct testing of speaking proficiency: Theory and application. Princeton, NJ:
Educational Testing Service, 89-103.
Sweet, H. (1899). The practical study of languages. London: Dent.
Tavakoli, P., Nakatsuhara, F. and Hunter, A. M. (2020). Aspects of fluency across assessed levels
of speaking proficiency. Modern Language Journal, 104(1), 169-191.
https://doi.org/10.1111/modl.12620
Trinity College London. (2021). ISE rating scales.
www.trinitycollege.com/qualifications/english-language/ISE/ISE-results-and-certificates/ISE-
rating-scales
Van Moere, A. (2006). Validity evidence in a university group oral test. Language Testing, 23(4),
411–440. https://doi.org/10.1191/0265532206lt336oa
—— (2012). A psycholinguistic approach to oral language assessment. Language Testing, 29(3),
325-344. https://doi.org/10.1177/0265532211424478
Weigle, S. C. (2004). Integrating reading and writing in a competency test for non-native
speakers of English. Assessing writing, 9(1), 27-55. https://doi.org/10.1016/j.asw.2004.01.002
Weir, C. J., Vidaković, I. and Galaczi, E. D. (2013). Measured constructs: A history of
Cambridge English language examinations 1913–2012. Cambridge: UCLES/Cambridge
University Press.
Wigglesworth, G. and Elder, C. (2010). An investigation of the effectiveness and validity of
planning time in speaking test tasks. Language Assessment Quarterly, 7(1), 1-24.
https://doi.org/10.1080/15434300903031779
Xi, X. (2007). Evaluating analytic scoring for the TOEFL® academic speaking test for
operational use. Language Testing, 24(2), 251-286.
https://doi.org/10.1177/0265532207076365
Xi, X., Higgins, D., Zechner, K. and Williamson, D. (2012). A comparison of two scoring
methods for an automated speech scoring system. Language Testing, 29(3), 371-394.
https://doi.org/10.1177%2F0265532211425673
Yan, X., Maeda, Y., Lv, J. and Ginther, A. (2016). Elicited imitation as a measure of second
language proficiency: A narrative review and meta-analysis. Language Testing, 33(4),
497–528. https://doi.org/10.1177%2F0265532215594643
Zhou, Y. (2015). Computer-delivered or face-to-face: Effects of delivery mode on the testing of
second language speaking. Language Testing in Asia, 5(2), 1-16.
https://doi.org/10.1186/s40468-014-0012-y
