Ubiratã Kickhöfel Alves and Jeniffer Imaregna Alcantara de Albuquerque (Eds.)
Second Language Pronunciation
Studies on Language Acquisition
Series Editors: Luke Plonsky, Martha Young-Scholten
Volume 64
Second Language Pronunciation
Different Approaches to Teaching and Training
Edited by Ubiratã Kickhöfel Alves and Jeniffer Imaregna Alcantara de Albuquerque
ISBN 978-3-11-073951-0
e-ISBN (PDF) 978-3-11-073612-0
e-ISBN (EPUB) 978-3-11-073614-4
ISSN 1861-4248
Library of Congress Control Number: 2022943012
Bibliographic information published by the Deutsche Nationalbibliothek

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie;
detailed bibliographic data are available on the internet at http://dnb.dnb.de.
© 2023 Walter de Gruyter GmbH, Berlin/Boston

Typesetting: Integra Software Services Pvt. Ltd.
Printing and binding: CPI books GmbH, Leck
www.degruyter.com
About the Authors
Akiyo Joto is a professor emeritus at the Prefectural University of Hiroshima in Japan. She holds
an MA in English linguistics awarded by Okayama University, Japan, and an MA in TEFL
conferred by Ball State University, USA. Her main research concerns the analysis of English
pronunciations of native Japanese speakers and its application to teaching English sounds to
Japanese learners from the perspective of contrastive phonetics between English and Japanese.
She is currently working on the development of a teacher’s manual of English sounds with video
instructions for elementary school English education in Japan. Email: joto@pu-hiroshima.ac.jp
Anabela Rato is an Assistant Professor and the Associate Chair of the Undergraduate Program
in Portuguese Studies at the Department of Spanish and Portuguese (University of Toronto,
Canada). She is also the Chair of the Canadian Association of Teachers of Portuguese (CATPor).
She received her Ph.D. in Language Sciences, with a specialization in English Linguistics, and
her Master’s degree in English Language, Literature, and Culture from the University of Minho,
Portugal. Her research interests include Second Language (L2) Speech Learning, Heritage
Language (HL) Phonological Acquisition, Speech Perception and Production, Phonetic Training,
and Applied Phonetics. Email: anabela.rato@utoronto.ca
Bastien De Clercq is a post-doctoral researcher and lecturer in the linguistics department of
the Vrije Universiteit Brussel. His research focusses on second language acquisition in a
range of languages (French, English, Dutch) and has explored various acquisitional
phenomena, from phonology to vocabulary, syntax and morphology. More specifically, his
doctoral research explored the role of linguistic complexity, as well as typological or
contrastive factors in SLA. Other research interests include the effects and effectiveness of
(phonological) training, such as High Variability Perceptual Training. He has published in The
Modern Language Journal and guest-edited a special issue in Second Language Research.
Email: bastien.De.Clercq@vub.be
Cosme Daniel Paz is a PhD student and graduate assistant in Agricultural Sciences at
Universidad Nacional de Mar del Plata (UNMdP), Argentina. He is a member of the Research
Group Cuestiones del Lenguaje at UNMdP - ANPCYT - INTA. He graduated as an Agricultural
Engineer from Universidad Nacional de Salta (UNSA), Argentina, in 2011. From 2012 to 2016, he
held a CONICET doctoral scholarship. Main research areas: Statistical analysis
related to L2 speech development. Email: cosmepaz@gmail.com
Denis Liakin is Full Professor of French and Linguistics in the Department of French Studies at
Concordia University in Montreal. Prof. Liakin completed a PhD in Linguistics at the University
of Western Ontario (2003) and joined Concordia University in 2004. His research interests
include effects of computer technology on L2 learning, corrective phonetics and second
language acquisition of syntax. His current SSHRC project investigates the pedagogical use of
mobile devices for improving L2 pronunciation. Email: denis.liakin@concordia.ca
Denise Cristina Kluge is a professor at Federal University of Rio de Janeiro (UFRJ) at the
Department of Anglo-Germanic Languages and also part of the Graduate Studies Program in
Language at Federal University of Paraná (UFPR). She graduated in Portuguese and English
teaching from Universidade do Vale do Rio dos Sinos - Unisinos (2000), and did her MA
(2004) and PhD (2009) in Linguistics at Federal University of Santa Catarina (UFSC) in Brazil.
Her research interests include speech perception and production, perceptual training, effect
of visual cues, acquisition/learning of an additional language and teaching pronunciation.
Email: deniseckluge@gmail.com
https://doi.org/10.1515/9783110736120-202
Diana Oliveira received her Ph.D. in Language Sciences, with a specialization in Applied
Linguistics, in 2020 and her Master’s degree in Portuguese as Foreign or Second Language
in 2016 from the University of Minho, in Portugal. She is a junior researcher at CEHUM,
UMinho, and is interested in individual differences in second language speech learning.
Email: oliveira.diana27@gmail.com
Elena Cotos is an Associate Professor of TESL/Applied Linguistics in the English Department
at Iowa State University. She is also the Director of the Center for Communication Excellence
of the Graduate College. Her research interests include English for academic and specific
purposes, corpus-based genre analysis, genre-based automated writing evaluation and
pedagogy, and language learning and assessment. She is the principal investigator for global
online and massive open courses offered in partnership with FHI360 and American English
E-Teacher Program, U.S. Department of State, Bureau of Educational and Cultural Affairs.
Email: ecotos@iastate.edu
Elena Kkese (PhD in Linguistics) has taught in secondary and tertiary education since 2004.
Elena’s research focuses on phonetics and its relation to phonology, bilingualism, sociophonetics,
sociolinguistics, teaching and education. Her research interests include the speech and visual
perception and production of phonetic and contextual information in L1 and L2 and the
implications for L2 pronunciation and literacy. She is the author of Identifying Plosives in L2
English: the case of L1 Cypriot Greek speakers, L2 Writing Assessment: The Neglected Skill of
Spelling, as well as Speech Perception and Production in L2. Email: elenakkese@hotmail.com
Ellen Simon is an Associate Professor in English Linguistics at Ghent University, Belgium. Her
research field is second language phonetics and phonology, and she has published in, among
others, Second Language Research, Journal of Child Language, Journal of Phonetics and International
Journal of Bilingualism. She has published two book volumes with Academia Press: Voicing in
Contrast (2010) on the acquisition of the English voicing contrast by native speakers of Dutch
and Media-induced Second Language Acquisition (Simon & Van Herreweghe, 2018) on the
acquisition of English by primary school children in Flanders. She is currently working on issues
of accent variation, intelligibility and the effect of training and exposure on L2 speech learning.
Email: ellen.simon@ugent.be
Idée Edalatishams received her PhD in Applied Linguistics and Technology from Iowa State
University, where she worked as a communication consultant at the Writing Center and the
Center for Communication Excellence and taught first-year composition and a range of
graduate and undergraduate ESL courses. Her primary research is in spoken corpus
linguistics, pronunciation, and multilingual speakers’ oral communication. She has
developed the Corpus of Teaching Assistant Classroom Speech and is the Faculty ESL
Specialist at George Mason University Writing Center, where she develops programming,
conducts research, and trains graduate consultants on supporting multilingual students’
academic communication. Email: iedalati@gmu.edu
Ilvi Blessenaar, MA is a speech and language therapist, clinical linguist, lecturer at the
Utrecht University of Applied Sciences Utrecht (HU), Department for Speech and Language
Therapy and a junior researcher at Research Group for Speech and Language Therapy. Her
teaching and research focus is on the role of Speech Language Therapists in second
language pronunciation in the Netherlands and Belgium. She is also active in the field of
children with Developmental Language Disorders (DLD) and Speech Sounds Disorders (SSD).
Email: ilvi.blessenaar@hu.nl
Jeniffer Imaregna Alcantara de Albuquerque is an Associate Professor at Universidade
Tecnológica Federal do Paraná (UTFPR), Brazil. She has been working at the Modern Language
Department at UTFPR since 2013. Prof. Albuquerque completed a PhD in Psycholinguistics at
Federal University of Rio Grande do Sul (UFRGS), in Brazil, with a split-site PhD visit to the
University of Groningen (Rijksuniversiteit Groningen), hosted by Prof. Dr. Wander Lowie, in 2019.
Her research interests include speech perception and production, intelligibility, dynamic models
of language development, development of additional languages and pronunciation teaching.
Lily Compton is the Graduate Communication Programs Coordinator at Iowa State
University’s Center for Communication Excellence. Her primary research is in curriculum and
instructional technology, online education, teacher education, and oral communication. She
taught and designed curriculum for courses for the oral communication skills of
International Teaching Assistants (ITAs), methods for teaching English as a Second
Language, and instructional technology for online language learning. She oversees the
institutional language tests for ITAs and trains the test raters. She also mentors and
supervises the English Speaking Consultants and instructors of the ITA oral communication
courses. Email: lcompton@iastate.edu
Lizet van Ewijk (PhD) is a speech and language therapist and senior Lecturer at the Utrecht
University of Applied Sciences Utrecht (HU), Department for Speech and Language Therapy
and a senior researcher at Research Group for Speech and Language Therapy. Her teaching
and research focus is on improving communication opportunities for adults with
communicative vulnerability. Email: lizet.vanewijk@hu.nl
María Claudia Troglia holds a degree in English Language Teaching (Universidad Nacional de Mar
del Plata, UNMDP, 2010). Currently, she is a teaching assistant for Discurso Oral II in the English
Teacher Training Program at UNMDP. She is also a teaching practice instructor (English Teacher
Training Program, Instituto Superior Idra, Mar del Plata, Argentina, 2019–2021). She is a member
of the research group Cuestiones del Lenguaje at UNMDP. Email: claudiatroglia@gmail.com
Natallia Liakina’s professional experience includes teaching French as a second language at the
university level in Ontario and in Quebec. Since 2006, she has taught at the French Language
Centre at McGill University. Her current research is focused on corrective phonetics and the
impact of new technologies such as speech technologies and augmented reality games on L2
teaching and learning both in the classroom setting and online. As part of her work at McGill, she
has taught FSL classes and developed educational materials. Email: natallia.liakina@mcgill.ca
Pauline Degrave is Assistant Professor in Dutch Didactics at UCLouvain, Belgium. Her key
research interests are foreign language acquisition - especially Dutch by French-speaking
learners - and the relationship between music and language. She explored the effect of musical
training and abilities as well as the use of music in foreign language classrooms. Specialized in
pedagogy (secondary school and higher education), she has been teaching Dutch to French
speakers for more than 10 years. She has published several Dutch handbooks and research
articles in International Review of Applied Linguistics in Language Teaching and Journal of
Language Teaching and Research. Email: pauline.degrave@uclouvain.be
Pedro Luis Luchini holds a Post-doctoral degree in Linguistics from Universidad Federal Rio
Grande Do Sul, Porto Alegre, Brazil (2019), a PhD in Letters from Universidad Nacional de
Mar del Plata (UNMdP) (2015), and an MA in ELT and Applied Linguistics (AL) from King’s College,
University of London, UK (2003). Currently, he is a full professor and research group director at
Cuestiones del Lenguaje, UNMdP, Argentina. Main research areas: AL with a focus on English
pronunciation. Email: luchinipedroluis@gmail.com
Pollianna Milan is a professor at Federal University of Paraná (UFPR) at the Department of
Literature and Linguistics. She holds a degree in Portuguese and Spanish Teaching and did
her MA and Ph.D. in Linguistics at Federal University of Paraná (UFPR) in Brazil. Her work has
appeared in book chapters in publications such as “Fonética e Fonologia de Línguas
Estrangeiras: subsídios para o ensino” (Alves et al., 2020). Her research interests include
speech perception and production, perceptual training, prosody, acquisition of languages,
especially the teaching and learning of Brazilian Portuguese as an additional language.
Email: pollimilan@hotmail.com
Quentin Decourcelle is a teacher of Dutch as a Foreign Language for native speakers of French.
He graduated from Ghent University with a Master in Linguistics and Literature, with English as
the main subject. In 2018, he successfully completed a Master of Advanced Studies in
Linguistics, in which he specialized in multilingual and foreign language learning and teaching.
His research interests include incidental language learning, grammar acquisition and language
training. He was affiliated to the English Section of the Linguistics Department at Ghent
University from 2018 to 2021. Email: decourcelle.quentin@hotmail.be
Ronaldo Lima Jr is a professor at the Federal University of Ceará, Brazil, where he teaches
English and general phonetics and phonology at both undergraduate and graduate levels. He
is the founder and current director of the Laboratory of Phonetics and Multilingualism
(LabPhoM) at the Federal University of Ceará. He has a doctorate in Linguistics, a master’s in
Applied Linguistics, and his main research interest is in the phonological development of
nonnative languages. Email: ronaldo.limajr@gmail.com
Susan Jackson is a PhD candidate at Concordia University in Montreal, Canada. She is
interested in second language acquisition theory as it applies to language teaching. In
particular, she examines both how input impacts the formation of novel phonological
representations and how this can inform pronunciation instruction. She teaches ESL and TESL
courses at universities in Quebec including Concordia, l’Université du Québec à Montréal and
McGill, and was acting director of the TESL program at l’Université du Québec en
Abitibi-Témiscamingue. Email: susan.jackson@concordia.ca
Sviatlana Karpava (PhD) is a Lecturer in Applied Linguistics/TESOL at the Department of
English Studies, University of Cyprus and Coordinator of the Testing, Teaching and Translation
Lab. She is a Management Committee Member and WG5 Co-Leader of the “European Family
Support Network Cost Action. A bottom-up, evidence-based and multidisciplinary approach”
(2019-2023). Her area of research is applied linguistics, morpho-syntax, semantics and
pragmatics, first and second language acquisition, bilingualism, multilingualism and dialect
acquisition, sociolinguistics, teaching and education. She is interested in heritage language
use, maintenance and transmission, language loss, shift and attrition, family language
policy and intercultural communication. Email: karpava.sviatlana@ucy.ac.cy
Thaïs Cristófaro Silva is a Professor in Linguistics at the Postgraduate Program in Linguistics
at Federal University of Minas Gerais (UFMG). Between 1994 and 2019 she managed the Phonology
Laboratory. She is a researcher at the National Research Council in Brazil (CNPq) and
FAPEMIG (Minas Gerais). Her Master’s degree is from UFMG and her PhD is from the School
of Oriental and African Studies, University of London. She has worked with Brazilian Indian
languages, Portuguese and English. Her research focuses mainly on the study of sound
variation and change using Exemplar Model approaches and Laboratory Phonology
methodological principles. Email: thaiscristofaro@gmail.com
Tim Kochem is a Lecturer in the English Department at Iowa State University. His primary
research is in L2 pronunciation pedagogy, language teacher education, educational technology,
and distance education. He worked as an English Writing, English Speaking, and Interpersonal
Communications Consultant at the Center for Communication Excellence for four years. He has
also taught a global online course for the Online Professional English Network (OPEN), Using
Educational Technology in the English Language Classroom, as well as introductory courses in
public speaking and linguistics at Iowa State University. Email: tkochem@iastate.edu
Tracey Derwing, Professor Emeritus, has extensively researched L2 pronunciation and fluency,
especially the relationships among intelligibility, comprehensibility, and accent. She has also
investigated native speakers’ speech modifications for L2 speakers and has conducted
workplace studies involving pragmatics and pronunciation. For several years she directed a
research center on immigration and integration. Currently, she serves on a committee that
advises the Canadian government on language training for newcomers. Much of Tracey’s work
has been conducted with Murray Munro – together they wrote Pronunciation Fundamentals:
Evidence-based perspectives for L2 teaching and research, in addition to dozens of research
articles. E-mail: tderwing@ualberta.ca
Ubiratã Kickhöfel Alves is an Associate Professor at the Graduate Program in Linguistics at
Universidade Federal do Rio Grande do Sul (UFRGS), Brazil. Prof. Alves completed a PhD in
Linguistics at Pontifícia Universidade Católica do Rio Grande do Sul (2008) and joined UFRGS in
2010. He carried out his Post-Doctoral research at Universidad Nacional de Mar del Plata
(Argentina) in 2014. He has advised Master’s Dissertations and PhD Theses in the field of L2
Phonetics and Phonology for more than ten years. He coauthored Pronunciation Instruction for
Brazilians: Bringing Theory and Practice Together (Cambridge Scholars Publishers, 2009).
Email: ukalves@gmail.com
Walcir Cardoso is a Professor of Applied Linguistics in the Department of Education at
Concordia University. He conducts research on the L2 acquisition of phonology,
morphosyntax, and vocabulary, and on the effects of computer technology (e.g., clickers,
text-to-speech synthesizers, automatic speech recognition) on L2 learning. The quality of his
research has been recognized by a Paul Pimsleur Award for Research in Foreign Language
Education (bestowed by the American Council on the Teaching of Foreign Languages), and a
UNESCO King Sejong Literacy Prize (a team award as a co-investigator with the Centre for the
Study of Learning and Performance). Email: walcir.cardoso@concordia.ca
Wellington Mendes is an English Teacher in the Federal Center for Technological Education of
Minas Gerais (CEFET-MG), where he develops research related to second language speech. His
Master’s degree is in Theoretical and Descriptive Linguistics from the Federal University of
Minas Gerais (UFMG), where he is also concluding his PhD on the acquisition of English as a
Second Language and its relationship with sound variation and change. He is certified both in
Teaching English as a Foreign Language by the University of Toronto and in English Language
Teaching by the Federal University of Minas Gerais. Email: wellington.matt@gmail.com
Yuri Nishio received her Ph.D. from Nagoya University, Japan, in 2007. Since 2016, she has
worked as a professor at Meijo University’s Faculty of Foreign Studies. She teaches English
phonetics and seminars related to second language acquisition. She is the head of the
Intercultural Cooperative Research Center for analyzing the effectiveness of study abroad
programs. She is interested in the mechanisms of perception and production of English
sounds by Japanese speakers, developing ICT materials to help Japanese learners improve
their pronunciation, and in creating comprehensive teaching guidelines for English
phonetics. Email: ynishio@meijo-u.ac.jp
Contents

About the Authors

Ubiratã Kickhöfel Alves, Jeniffer Imaregna Alcantara de Albuquerque
Introduction

Part I: Pronunciation development and intelligibility: Implications for teaching and training studies

Thaïs Cristófaro Silva, Wellington Mendes
Plural formation in English: A Brazilian Portuguese case study

Elena Kkese, Sviatlana Karpava
Effect of task, word length and frequency on speech perception in L2 English: Implications for L2 pronunciation teaching and training

Pedro Luis Luchini, Cosme Daniel Paz, María Claudia Troglia
L2 accented speech measured by Argentinian pre-service teachers

Jeniffer Imaregna Alcantara de Albuquerque, Ubiratã Kickhöfel Alves
Dynamic paths of intelligibility and comprehensibility: Implications for pronunciation teaching from a longitudinal study with Haitian learners of Brazilian Portuguese

Part II: L2 pronunciation teaching

Ronaldo Lima Jr
A dynamic account of the development of English (L2) vowels by Brazilian learners through communicative teaching and through explicit instruction

Tim Kochem, Idée Edalatishams, Lily Compton, Elena Cotos
An extra layer of support: Developing an English-speaking consultation program

Ilvi Blessenaar, Lizet van Ewijk
Putting participation first: The use of the ICF-model in the assessment and instruction of L2 pronunciation

Part III: L2 pronunciation training: Implications for the classroom

Susan Jackson, Walcir Cardoso
Orthographic interference in the acquisition of English /h/ by Francophones

Yuri Nishio, Akiyo Joto
Improving fossilized English pronunciation by simultaneously viewing a video footage of oneself on an ICT self-learning system

Natallia Liakina, Denis Liakin
Speech technologies and pronunciation training: What is the potential for efficient corrective feedback?

Part IV: Pronunciation in the laboratory: High variability phonetic training

Ellen Simon, Bastien De Clercq, Pauline Degrave, Quentin Decourcelle
On the robustness of high variability phonetic training effects: A study on the perception of non-native Dutch contrasts by French-speaking learners

Pollianna Milan, Denise Cristina Kluge
Effects of perceptual training in the perception and production of heterotonics by Brazilian learners of Spanish

Anabela Rato, Diana Oliveira
Assessing the robustness of L2 perceptual training: A closer look at generalization and retention of learning

Conclusion

Tracey M. Derwing
An overview of pronunciation teaching and training

Index
Ubiratã Kickhöfel Alves, Jeniffer Imaregna Alcantara de Albuquerque
Introduction
Pronunciation teaching and phonetic training in second language development: What do they have to offer?
The learning process of a new sound system may be challenging not only to students, but
also to their teachers. When facing this challenge, L2 learners need to develop new
strategies to perceive as well as to produce those new sounds. In turn, when trying to help
their students in this task, teachers may find it difficult to set the goals to be reached in
their pronunciation classes, as well as to decide on which aspects have to be taught and how
these aspects should be addressed in their classrooms.
In order to help both learners and teachers overcome these challenges, research on L2
pronunciation (be it in the classroom or in the language laboratory) plays a fundamental
role. Considering this scenario, as we go through the pages of the most consolidated
journals on L2 learning and teaching, we may easily notice that there has been a significant
increase in the number of studies focusing on L2 pronunciation instruction and
perceptual/production training in the last two decades. This growth accompanies the rising
number of studies on L2 acquisition in general, being the result of new developments in both
the fields of L2 speech and L2 teaching.
As for the developments in the field of L2 speech, the last twenty years have witnessed a
significant growth in proposals of new L2 perceptual models, such as the Native Language
Magnet Model (NLM – Kuhl 2000), the Perceptual Assimilation Model-L2 (PAM-L2 – Best and
Tyler 2007), the Second Language Linguistic Perception model (L2LP – Escudero 2005) and the
recent Revised Speech Learning Model (SLM-r – Flege and Bohn 2021),1 among others. Even
though all these models are related with regard to their empirical object of investigation,
each one of them reflects different views of language and phonetic primitives, ranging from
a psychoacoustic account, such as the SLM-r, to a direct-realist, articulatory basis, as
claimed in the PAM-L2. These different accounts and the discussions proposed in each of them
have contributed to different fields of Linguistics, such as Phonetics and Phonology,
Psycholinguistics and Language Acquisition, helping us understand how L2 sounds are learned.
Moreover, as these models are based on the recent developments of Formal Linguistics, they
also have a lot to contribute to these core research areas, as L2 data constitute a rich
source of evidence for these fields.
1 This is a revised (and updated) version of Flege’s (1995) Speech Learning Model.
Ubiratã Kickhöfel Alves, Federal University of Rio Grande do Sul
Jeniffer Imaregna Alcantara de Albuquerque, Technology Federal University of Paraná
https://doi.org/10.1515/9783110736120-001
Along with this better understanding of the L2 learning process, new theoretical constructs
have also been proposed in the L2 teaching field, and new classroom methodologies and goals
for L2 pronunciation teaching have been set. Munro and Derwing’s (1995) definitions of
‘intelligibility’, ‘comprehensibility’ and ‘accentedness’ have made it clear that these are
independent (though related) constructs, as an L2 learner may be very intelligible and yet
show traces of their L1 accent. In the last 35 years, these constructs have been revisited
and new research methodologies have also been developed (Albuquerque 2019; Derwing and
Munro 2015; Kang, Thomson, and Moran 2018; Munro and Derwing 2015; Nagle, Trofimovich, and
Bergeron 2019; Thomson 2018; Trofimovich et al. 2020, among many others), having thus
reshaped the goals of pronunciation teaching. In this scenario, Levis’ (2005, 2018, 2020)
intelligibility principle holds that the goal of pronunciation teaching is to promote
intelligible L2 speech, instead of aiming at accent-free productions. This has led teachers
to rethink their practices, also helping researchers to find new methodologies for measuring
the efficacy of pronunciation instruction or training, be it in the classroom or in the
laboratory.
In view of the academic progress in both the fields of L2 speech perception/production and
L2 teaching, research on L2 pronunciation has progressed quantitatively and qualitatively,
becoming even more heterogeneous and complex. This is clear in the literature reviews and
meta-analyses carried out in the last decade by Saito (2012), Lee, Jang, and Plonsky (2015),
Thomson and Derwing (2015) and Saito and Plonsky (2019). In their meta-analysis of 77 studies
of L2 pronunciation teaching published between 1982 and 2017, Saito and Plonsky (2019)
propose a framework for studies on instructed second language pronunciation, according to
three main criteria: (i) whether studies adhere to a ‘global’ (such as intelligibility and
comprehensibility judgements) or ‘specific’ (such as investigations on specific segments)
character of pronunciation performance; (ii) whether L2 speech is assessed by human judges
or with acoustic analyses; (iii) whether data are obtained spontaneously or in a more
controlled setting/instrument. This study not only shows the growth of the pronunciation
field in recent years, but also sets important methodological challenges that are still to
be faced by researchers in the field, paving the way for future investigations. There
remains no doubt, therefore, that L2 pronunciation instruction/training constitutes an
effervescent theme of investigation in the current L2 scenario, bridging the gap between
studies on L2 speech and L2 teaching.
The present volume brings together these different approaches to L2 pronunciation research
in the classroom or in the laboratory. The chapters consist of top-ranked proposals selected
for oral presentation at the symposium on “L2 Pronunciation Teaching and Training: Different
Approaches”, which took place at the AILA online Convention at the University of Groningen
in August 2021. This book presents 13 chapters (besides this ‘Introduction’ and the
‘Conclusion’), all of them written by well-known researchers from different universities
around the world, who focus their investigations on a variety of first and target languages.
The chapters address L2 pronunciation teaching or training in their diversity of approaches,
goals, methods and background theories, aiming to strengthen the (imperative) connection
between studies on L2 acquisition and L2 teaching.
The 13 chapters of this book are organized around four main themes, departing from L2 speech
studies and moving towards the role of pronunciation teaching and classroom and laboratory
approaches to pronunciation training. The first part (‘Pronunciation development and
intelligibility: implications for teaching and training studies’) consists of four chapters
that focus on the development of L2 pronunciation and its impact on speech intelligibility
and comprehensibility. In all of its chapters, the implications of these studies for the
pronunciation instruction/training scenario are highlighted, reinforcing the expected
connection between Formal and Applied research. The second set of chapters addresses L2
pronunciation teaching, focusing on different approaches and language perspectives to the
discussion on the role of classroom intervention studies in L2 development. The last two
modules of the book address pronunciation training. The three chapters of the third part
focus on the use of technology to foster pronunciation learning in the classroom. These
studies deal with the use of speech technologies in order to understand learners’
difficulties and how to address them. Finally, the last section of the book presents three
studies on High Variability Phonetic Training (HVPT), a laboratory approach in which learners
are trained with stimuli from a variety of talkers and in different phonetic environments
(Logan, Lively, and Pisoni 1993; Logan and Pruitt 1995; Thomson and Derwing 2016). Besides
their explicit focus on laboratory approaches to L2 speech development, the implications of
their results for L2 pronunciation teaching are also highlighted, once again aiming to
connect the classroom and laboratory environments, as well as the fields of L2 speech and L2
teaching. In what follows, we provide a brief description of each chapter.
In Chapter 1, Thaïs Cristófaro Silva and Wellington Mendes focus on the role
of orthography in the production of plural formation in English by Brazilian
learners (L1: Brazilian Portuguese). The authors investigate two orthographic pat-
terns in stop + sibilant sequences: one presenting two consonants in word-final
position (‘cups’, ‘cats’), and the other presenting a vowel letter between two
consonants (‘grapes’, ‘plates’). Their data show significantly higher rates of
vowel productions when the orthographic form contains the letter <e>, thus
suggesting that orthographic information is part of phonological representations.
The data are analyzed in light of the Exemplar Model (Bybee 2001, 2008; Johnson
1997) combined with the Revised Speech Learning Model (SLM-r) (Flege and
Bohn 2021). In view of this combined theoretical approach, we consider this
chapter to be representative of our claim that L2 studies can also contribute to
revisiting (and reshaping) models of Phonetics/Phonology. As suggestions for
pronunciation teaching are provided at the end of the chapter, the connection
between the fields of L2 speech and L2 teaching is also strengthened.
In the second chapter, Elena Kkese and Sviatlana Karpava present interest-
ing data on the English L2 learning process by speakers of Greek. The chapter
addresses L2 speech perception, focusing on the identification and discrimina-
tion of English vowels and consonants by Greek learners. The authors show that
task, word length, and frequency are significant predictors of perceptual accu-
racy. The results are discussed in light of the Native Language Magnet Model
(Kuhl 2000), thus reinforcing our claim that L2 data can be analyzed in light of
different models and perspectives. As for the pedagogical implications, the au-
thors discuss how the predictor variables can be considered in the classroom en-
vironment, enabling teachers to explore them when designing perceptual tasks.
The third and fourth chapters address the constructs of L2 speech intelli-
gibility, comprehensibility and accentedness through different theoretical-
methodological perspectives. In the third chapter, Pedro Luchini, Cosme Paz and
Claudia Troglia carry out an experimental study in which L2 productions in En-
glish by five international students (from Argentina, Belgium, China, Japan and
Poland) are assessed for measurements of comprehensibility and accentedness
(following Munro and Derwing’s 1995, 2015 definitions) by 22 Spanish-L1 Argenti-
nian prospective English language teachers. After assessing these constructs, the
Argentinian listeners were invited to complete a complementary activity in which
they were asked to identify the linguistic factors that they thought had played a role in
their assessment. Chapter 4, in turn, addresses the constructs of intelligibility and
comprehensibility under a Complex, Dynamic account of language development
(Beckner et al. 2009; De Bot 2017; De Bot, Lowie, and Verspoor 2007; Larsen-
Freeman and Cameron 2008; Lowie and Verspoor 2019; Verspoor 2017, among
many others). In view of the methodological challenges imposed by a dynamic
account, Jeniffer Albuquerque and Ubiratã Alves present the results of a 12-point
longitudinal data collection conducted with three Haitian speakers living in Brazil
(L2: Brazilian Portuguese), listened to by two different Brazilian participants, all
of them showing different levels of L2 experience. Despite their different ac-
counts of intelligibility and comprehensibility, these two chapters converge in
providing empirical support to the claim that intelligibility and comprehensibility
emerge in a speaker-listener interaction, and both members of the interaction
play a pivotal role in the development of intelligible/comprehensible speech
(Derwing and Munro 2015). Pedagogical implications are also highlighted in the
two chapters, in accordance with Levis’ (2005, 2018, 2020) ‘intelligibility princi-
ple’ in L2 teaching.
Chapter 5, which opens the second part of the book, is also grounded on a
Complex, Dynamic account of language development. In this chapter, Ronaldo
Lima Jr. also presents longitudinal data of the English vowels /i ɪ ɛ æ u ʊ/ pro-
duced by prospective Brazilian teachers of English who took part in a Phonetics
course in the third semester of their English Language Teaching major. The
study compares these learners’ productions before receiving explicit pronuncia-
tion instruction (first two semesters) and after taking the Phonetics course
(third and fourth semesters). We believe this chapter sustains our initial claim
that pronunciation teaching and training studies have accompanied the many
changes verified in the field of L2 acquisition in general, as we nowadays find
pronunciation teaching studies carried out in a wide variety of accounts, rang-
ing from more traditional acquisition approaches to complex accounts of lan-
guage development and teaching.
Still with a focus on L2 pronunciation, in the sixth chapter, Tim Kochem,
Idée Edalatishams, Lily Compton and Elena Cotos present an English-Speaking
Consultation (ESC) program carried out at Iowa State University. As explained
by the authors, with this program, the students at the University are offered
one-to-one pronunciation practice sessions focusing on specific segmental and
suprasegmental features of English, which also allows them to focus on general
needs related to a specific task they might carry out in their courses. As for the
theoretical bases, the authors explain that the ESCs are grounded on the Tech-
nological Pedagogical Content Knowledge (TPACK) framework (Mishra and
Koehler 2007), once again showing that the teaching of pronunciation may be
planned and carried out through a variety of theoretical accounts. Also, we con-
sider it important to highlight that L2 pronunciation projects like this one con-
tribute to bridging the gap between L2 speech and L2 teaching, showing the
multifaceted and complex nature of L2 pronunciation teaching research.
Closing the second set of chapters, Ilvi Blessenaar and Lizet van Ewijk ad-
dress L2 pronunciation teaching from a different perspective: the application of
the “International Classification of Functioning, Disability and Health” (ICF) in L2
teaching practice. The ICF is a framework proposed by the World Health Organiza-
tion (2001, 2013) that is commonly used in healthcare by Speech and Language
Therapists. The authors argue that the ICF model can assist the L2 professional
and the L2 learner in relating pronunciation difficulties to intelligibility in daily
life and help identify influencing factors. The application of the model is illus-
trated with a case study of a Syrian refugee living in the Netherlands. This chapter
reflects the interdisciplinary status of the field of L2 teaching, as new pedagogical
approaches may be adapted or developed from previous research carried out in a
variety of related fields of knowledge.
The third part of the book, which addresses pronunciation training and its
implications for the classroom, is opened with a chapter by Susan Jackson and
Walcir Cardoso. In this chapter, the authors carry out an artificial language
learning experiment in order to investigate whether the inconsistent grapheme-
to-phoneme correspondence for /h/-initial words in English has an impact on
Francophone learners’ ability to encode the fricative /h/ as part of a
newly-learned word. In this learning experiment, the students were taught English
pseudo-words by associating auditorily presented stimuli with non-objects and
were placed into one of three learning conditions: auditory + congruent spell-
ing, auditory + congruent/incongruent (inconsistent) spelling, and auditory
only. The accuracy rates in a subsequent word-picture matching task suggest
that the acquisition of a novel phoneme is more difficult when the grapheme-
phoneme correspondence of the target language is inconsistent. This study
shows how training studies (especially artificial language experiments) may
contribute to showing the main developmental processes and sources of diffi-
culties faced by learners. These results have important implications for pronun-
ciation teaching, especially for the design of pronunciation materials/classes.
The next two chapters in this third block deal with different training activities
that can be implemented in language classrooms. In Chapter 9, Yuri Nishio and
Akiyo Joto tested how effective an Information and Communications Technology (ICT)
self-learning system is in teaching English vowel and consonant sounds to
Japanese learners, focusing on the pronunciation of the names of the letters of
the English alphabet. The Japanese participants in the experiment were divided into
two groups, each trained on a different platform. In one of the
platforms, besides the native speakers’ video of the articulation of the sound, the
learners were shown a self-learning video, in which they could visualize their own
production. The results showed that both groups benefitted from training, and the
members of the group who were able to visualize their own face showed some ad-
vantages concerning the learning of consonants. Chapter 10, in turn, focuses on
Automatic Speech Recognition-based applications. In this chapter, Natallia Liakina
and Denis Liakin address the different types of implicit and explicit corrective
feedback provided by these apps and discuss their impact on the acquisition of L2
pronunciation. The authors also report on the results of an action research study on the
use of three different ASR-based tools, with a special focus on the learners’ percep-
tions of the usefulness of the different types of feedback provided by each one of
these tools. Together, these two chapters make it clear that the use of technologies
may be an aid in the L2 classroom, and training approaches using such technolo-
gies may be implemented in classroom activities aiming at the teaching of pronun-
ciation. In other words, teaching and training approaches may be merged with the
aim of helping learners achieve higher levels of speech intelligibility.
Finally, the last three chapters deal with High Variability Phonetic Training
(HVPT) and its empirical implications for L2 speech and teaching. In Chapter
11, Ellen Simon, Bastien de Clercq, Pauline Degrave and Quentin Decourcelle
investigate the robustness of HVPT on the perception of non-native Dutch con-
trasts by French-speaking learners. By ‘robustness’, the authors refer to (i) the
generalizability of the training to novel tokens and talkers; (ii) the long-term
effects of HVPT; and (iii) the effect of HVPT in non-optimal listening conditions.
Their results, which show variability in the efficacy of HVPT in most robustness
variables, are discussed in view of the moderating variables examined. This is
an innovative study as it focuses on a target language other than English, show-
ing that investigations on different L2 systems have become a common (and de-
sired) research practice in the last few years.
Also examining the effects of HVPT on retention and generalization, Pollianna
Milan and Denise Kluge carry out an experiment on the effects of HVPT on
the perception and production of heterotonics by Brazilian learners of Spanish.
This study is innovative not only concerning the L1 and L2 systems involved,
but also in its focus on heterotonic words. The study also innovates in adopting
a Complex, Dynamic Systems perspective in an HVPT study, focusing on both
individual and group analyses. In the same fashion as in Simon et al.’s study,
Milan and Kluge’s results also show variability among participants, which is ex-
plained according to the tenets of a Complex, Dynamic account. Finally, closing
the last module of this volume, Anabela Rato and Diana Oliveira present a sys-
tematic review of 27 perceptual training studies, carried out over the last 40
years, which include the testing of generalization and retention of learning. As
it provides a detailed picture of the HVPT research scenario, this chapter also
presents suggestions for future research, paving the way for new studies on per-
ceptual training both in the laboratory and in the classroom.
The concluding chapter of the book is authored by Tracey Derwing. This con-
clusion not only presents a summary of the current research questions addressed
in pronunciation teaching and training studies, but also predicts future scenarios
for both researchers and practitioners in the field. Given her vast experience in
both L2 teaching and L2 acquisition studies, Professor Derwing’s chapter pro-
vides suggestions to bridge the gap between pronunciation researchers and prac-
titioners, which is one of the most important goals set for this book.
All in all, the chapters in this volume are grounded on different views of
language acquisition (ranging from traditional accounts, such as fossilization,
to more innovative approaches, which view language as a Complex, Dynamic
system) and different teaching perspectives and frameworks (such as Celce-Murcia
et al.’s, the TPACK framework for pronunciation teaching and the ICF
model, among others). Therefore, it is not by chance that the label ‘different ap-
proaches’ is part of the title of this volume. We see these different approaches
as exciting and positive, as they reflect the interdisciplinary nature as well as
the growth this field has had throughout the years. We hope the chapters in
this volume contribute to new theoretical and methodological developments in
L2 pronunciation teaching and training studies, consolidating the contribution
of this research theme to the field of L2 acquisition.
References
Albuquerque, Jeniffer Imaregna Alcantara de. 2019. Caminhos dinâmicos em inteligibilidade
e compreensibilidade de línguas adicionais: Um estudo longitudinal com dados de fala
de haitianos aprendizes de Português Brasileiro [Dynamic paths of intelligibility and
comprehensibility in additional languages: a longitudinal study on speech data from
Haitian learners of Brazilian Portuguese]. Porto Alegre, Brazil: Universidade Federal do
Rio Grande do Sul dissertation.
Beckner, Clay, Nick C. Ellis, Richard Blythe, John Holland, Joan Bybee, Jinyun Ke, Morten
H. Christiansen, Diane Larsen-Freeman, William Croft & Tom Schoenemann. 2009.
Language is a Complex Adaptive System: Position paper. Language Learning 59(s1). 1–26.
Best, Catherine & Michael D. Tyler. 2007. Nonnative and second-language speech perception:
Commonalities and complementarities. In Ocke-Schwen Bohn & Murray J. Munro (eds.),
Language Experience in Second Language Speech Learning: In honor of James Emil Flege,
13–34. Amsterdam: John Benjamins.
Bybee, Joan. 2001. Phonology and Language Use. Cambridge: Cambridge University Press.
Bybee, Joan. 2008. Usage-based grammar and second language acquisition. In Peter
Robinson & Nick Ellis (eds.), Handbook of Cognitive Linguistics and Second Language
Acquisition, 216–235. New York: Routledge.
Celce-Murcia, Marianne, Donna M. Brinton, Janet M. Goodwin & Barry Griner. 2010. Teaching
Pronunciation: A Course Book and Reference Guide. Cambridge: Cambridge University
Press.
De Bot, Kees. 2017. Complexity Theory and Dynamic Systems Theory: Same or different?
In Lourdes Ortega & ZhaoHong Han (eds.), Complexity Theory and Language
Development: In Celebration of Diane Larsen-Freeman, 51–58. Amsterdam: John
Benjamins.
De Bot, Kees, Wander Lowie & Marjolijn H. Verspoor. 2007. A Dynamic Systems Theory
approach to second language acquisition. Bilingualism: Language & Cognition 10(1). 7–21.
Derwing, Tracey M. & Murray J. Munro. 2015. Pronunciation Fundamentals: Evidence-based
Perspectives for L2 Teaching and Research. Amsterdam: John Benjamins.
Escudero, Paola. 2005. Linguistic perception and second language acquisition. Utrecht:
Utrecht University dissertation.
Flege, James Emil. 1995. Second language speech learning: Theory, findings, and problems.
In Winifred Strange (ed.), Speech Perception and Linguistic Experience: Issues in Cross-
Language Research, 233–277. Timonium, MD: York Press.
Flege, James Emil & Ocke-Schwen Bohn. 2021. The Revised Speech Learning Model (SLM-r).
In Ratree Wayland (ed.), Second Language Speech Learning: Theoretical and Empirical
Progress, 3–83. Cambridge: Cambridge University Press.
Johnson, Keith. 1997. Speech perception without speaker normalization: An exemplar model.
In Keith Johnson & John Mullenix (eds.), Talker Variability in Speech Processing, 145–165.
San Diego: Academic Press.
Kang, Okim, Ron I. Thomson & Meghan Moran. 2018. Empirical approaches to measuring the
intelligibility of different varieties of English in predicting listener comprehension.
Language Learning 68(1). 115–146.
Kuhl, Patricia. 2000. A new view of language acquisition. Proceedings of the National
Academy of Sciences of the United States of America 97(22). 11850–11857.
Larsen-Freeman, Diane & Lynne Cameron. 2008. Complex Systems and Applied Linguistics.
Oxford: Oxford University Press.
Lee, Junkyu, Juhyun Jang & Luke Plonsky. 2015. The effectiveness of second language
pronunciation instruction: a meta-analysis. Applied Linguistics 36(3). 345–366.
Levis, John M. 2005. Changing contexts and shifting paradigms in pronunciation teaching.
TESOL Quarterly 39(3). 369–377.
Levis, John M. 2018. Intelligibility, Oral Communication, and the Teaching of Pronunciation.
Cambridge: Cambridge University Press.
Levis, John M. 2020. Revisiting the Intelligibility and Nativeness Principles. Journal of Second
Language Pronunciation 6(3). 310–328.
Logan, John S., Scott E. Lively & David B. Pisoni. 1993. Training listeners to perceive novel
phonetic categories: How do we know what is learned? Journal of the Acoustical Society
of America 94(2). 1148–1151.
Logan, John S. & John S. Pruitt. 1995. Methodological issues in training listeners to perceive
non-native phonemes. In Winifred Strange (ed.), Speech Perception and Linguistic
Experience, 351–377. Timonium, MD: York Press.
Lowie, Wander & Marjolijn H. Verspoor. 2019. Individual differences and the ergodicity
problem. Language Learning 69(s1). 184–206.
Mishra, Punya & Matthew J. Koehler. 2007. Technological pedagogical content knowledge
(TPCK): Confronting the wicked problems of teaching with technology. In Society for
Information Technology & Teacher Education International Conference, San Antonio,
USA, 2007, 2214–2226. Waynesville, NC: Association for the Advancement of Computing
in Education (AACE).
Munro, Murray J. & Tracey M. Derwing. 1995. Foreign accent, comprehensibility and
intelligibility in the speech of second language learners. Language Learning 45(1).
73–97.
Munro, Murray J. & Tracey M. Derwing. 2015. Intelligibility in research and practice: teaching
priorities. In Marnie Reed & John M. Levis (eds.), The Handbook of English Pronunciation,
377–396. Malden, MA: Wiley Blackwell.
Nagle, Charles, Pavel Trofimovich & Annie Bergeron. 2019. Toward a dynamic view of second
language comprehensibility. Studies in Second Language Acquisition 41(4). 647–672.
Saito, Kazuya. 2012. Effects of instruction on L2 pronunciation development: a synthesis of 15
quasi-experimental intervention studies. TESOL Quarterly 46(4). 807–819.
Saito, Kazuya & Luke Plonsky. 2019. Effects of second language pronunciation teaching revisited:
a proposed measurement framework and meta-analysis. Language Learning 69(3).
652–708.
Thomson, Ron I. 2018. Measurement of accentedness, intelligibility, and comprehensibility.
In Okim Kang & April Ginther (eds.), Assessment in Second Language Pronunciation,
11–29. London & New York: Routledge.
Thomson, Ron I. & Tracey M. Derwing. 2015. The effectiveness of L2 pronunciation instruction:
A narrative review. Applied Linguistics 36(3). 326–344.
Thomson, Ron I. & Tracey M. Derwing. 2016. Is phonemic training using nonsense or real
words more effective? In John Levis, Huong Le, Ivana Lucic, Evan Simpson & Sonca Vo
(eds.), Proceedings of the 7th annual Pronunciation in Second Language Learning and
Teaching Conference, Dallas, Texas, 2015, 88–97. Ames, IA: Iowa State University.
Trofimovich, Pavel, Charles L. Nagle, Mary Grantham O’Brien, Sara Kennedy, Kym Taylor Reid &
Lauren Strachan. 2020. Second language comprehensibility as a dynamic construct.
Journal of Second Language Pronunciation 6(3). 430–457.
Verspoor, Marjolijn H. 2017. Complex Dynamic Systems Theory and L2 pedagogy: lessons to
be learned. In Lourdes Ortega & ZhaoHong Han (eds.), Complexity Theory and Language
Development: In Celebration of Diane Larsen-Freeman, 143–162. Amsterdam: John
Benjamins.
World Health Organization. 2001. International Classification of Functioning, Disability and
Health. Geneva: World Health Organization.
World Health Organization. 2013. How to use the ICF: A practical manual for using the
International Classification of Functioning, Disability and Health (ICF). Geneva: World
Health Organization.
Part I: Pronunciation development and
intelligibility: Implications for teaching
and training studies
Thaïs Cristófaro Silva, Wellington Mendes
Plural formation in English: A Brazilian
Portuguese case study
Abstract: This study examines the role of orthography in the production of plural
formation in English by Brazilian Portuguese (BP) speakers. Two orthographic
patterns were examined for English nouns whose plural is pronounced as a (stop +
sibilant) sequence: [ps, ts, ks, bz, dz, gz]. One of the patterns presents two letters
word-finally – cups, cats, marks – whereas the other one presents a silent <e>
between two consonants: grapes, plates, cakes. The question we posed is
whether these different orthographic patterns would trigger different pronuncia-
tions for Brazilian L2 learners of English. An ongoing sound change involving
[Cs] ~ [Cis] in regular plural forms in BP was also considered. An experiment was
designed to test the production of regular plural forms in English and Brazilian
Portuguese to examine (stop + sibilant) sequences. Results showed that English
learners are more likely to pronounce a vowel when the orthographic pattern is
<Ces> rather than <Cs>. These results are discussed in the light of proposals
which suggest that phonological and orthographic representations are activated
in L2 production (Bassetti 2017; Hamann and Colombo 2017; Rastle et al. 2011).
The role played by an ongoing sound change from the L1 into L2 English is also
addressed. It was shown that [Cs] sequences consist of a robust pattern in Bra-
zilian Portuguese, which is adopted in L2 English. The [Cs] ~ [Cis] alternation
observed in BP and adopted in L2 English offers evidence that subphonemic
properties are part of phonological representations. The emergence of [z] is a
challenge for BP speakers learning English, as this pattern does not occur in
BP. Finally, some suggestions for the pronunciation teaching of regular plural
nouns in English are presented.
Keywords: plural formation, orthography, phonology, phonetic variability
Thaïs Cristófaro Silva, National Council for Scientific and Technological Development,
Research Supporting Foundation of Minas Gerais, Federal University of Minas Gerais
Wellington Mendes, Federal Center for Technological Education of Minas Gerais, Federal
University of Minas Gerais
https://doi.org/10.1515/9783110736120-002
1 Introduction
This paper aims to investigate the pronunciation of regular plural nouns in
English which are produced by Brazilian Portuguese speakers of L2 English
(BP-EL2). The investigation was twofold. Firstly, it considered whether different
orthographic patterns would trigger different pronunciations of plural forms, by
assessing how orthographic and phonological representations can be related.
Secondly, it considered how an ongoing sound change in
Brazilian Portuguese (BP) as a first language (L1) carries over into L2 English. An Exemplar
Model approach is proposed to account for the findings, which also incorporates
the revised Speech Learning Model (SLM-r) (Flege and Bohn 2021). The model to
be presented captures the relationship between speech production, perception
and orthography in L1 and L2 and conceives language as a dynamic system. This
first section reviews the literature on the relationship between orthography and
pronunciation. Then, a review of epenthesis in BP-EL2 production of past and
participle forms, as well as 3rd person singular present and regular plural forms, is
presented. This motivates the present study and offers insights on new ways of
approaching L2 pronunciation.
Studies on the relationship between orthography and phonology have in-
creased in recent years (Bassetti 2017; Colantoni, Steele, and Escudero 2015;
Hamann and Colombo 2017; Rafat 2015; Zhou 2021). The main research questions
on this topic aim to explain how L2 learners mediate the relationship between
their existing phonological and orthographic knowledge from the
L1 in order to build an L2.
The major contribution from works on the relationship between orthogra-
phy and phonology is to model representations as being multimodal, in which
perception, production and orthography interact. In the past, several works ad-
dressed the relationship between orthography and the pronunciation of BP-EL2
speakers. The main concern was the presence of an epenthetic vowel in BP-EL2
which would reflect a letter corresponding to a vowel. A major characteristic of
BP phonology is to insert an epenthetic vowel to prevent illicit consonantal
clusters which are orthographically represented by two contiguous consonantal
letters (Collischonn 2002). This strategy applies in the native lexicon, as in
dogma ['dɔ.gi.mə] or afta ['a.fi.tə], as well as in loanwords, word-initially or
word-finally, as in Skype [is.'kaj.pi] (Gomes 2019), and word-medially, as in
podcast [pɔ.dʒi.'kɛs.tʃi] (Nascimento 2016).¹ The epenthetic vowel occurs more
frequently as [i], but may also appear as [e].
Epenthetic vowels are claimed to be a characteristic of BP-EL2 speech (De-
latorre 2006; Gomes 2009; Silveira 2007). Delatorre (2006) investigated the pro-
duction of English past and participle forms by BP-EL2 speakers, as in moved or
robbed. Typically, two epenthetic vowels appeared in the pronunciation of BP-
EL2: one epenthetic vowel breaks up the word-internal consonant cluster and
the other one prevents word-final consonants, as in asked ['as.ke.dʒi] and
saved ['seɪ.ve.dʒi]. She claimed that the orthographic input, which was present
in a reading task, favoured higher rates of an epenthetic vowel, as opposed to a
free speech task, which did not present any orthographic stimulus. Therefore,
she argued that the orthographic input favoured the presence of epenthetic
vowels in BP-EL2 learners’ pronunciation.
Silveira (2007) investigated the production of word-final epenthesis in BP-
EL2 speakers. She compared words whose final letter was a consonant (e.g.
mad [mæd]) to words whose final letter was a silent ⟨e⟩ (e.g. made [meɪd]). Her
results showed that words ending in a silent ⟨e⟩ presented higher rates of epen-
thesis than words that ended in a consonantal letter. Akin to Delatorre (2006),
the results of Silveira (2007) showed that a reading task favoured higher rates
of epenthetic vowels than a free speech task, indicating that orthographic input
contributed to the manifestation of an epenthetic vowel.
A problem that arises from these works comes from the role of orthography
in the rule-based approach they assume. Epenthesis was seen as a general rule
in BP which prevents word-final consonants and certain consonantal clusters
from being manifested. In their view, such a rule was then transferred to BP-EL2.
However, in rule-based models, orthography lies outside of Grammar, so it can-
not contribute to shaping phonological representations. The same holds for the
role played by the task, as it lies outside Grammar.
Gomes (2009) also investigated BP-EL2 speakers’ vowel epenthesis in verbs
ending in <ed>. Her main contribution was to show that epenthesis decreases as
the time of exposure to the L2 increases: basic level speakers produced 78% of ep-
enthetic vowels; intermediate level speakers had 66% of epenthesis and advanced
speakers showed 23% of epenthetic vowels. Within Gomes’ (2009) rule-based
framework, epenthesis is understood as a phonological rule of vowel insertion.
Another case of epenthetic vowels reported in the literature involves word-final
consonant and sibilant sequences, [Cs], which typically appear in 3rd
person singular present and regular plural forms in English. BP-EL2 speakers
optionally insert an epenthetic vowel between the two word-final
consonants, for example, cakes [keɪks] ~ ['keɪ.kis] (Cristófaro-Silva 2011).
Interestingly, works that considered 3rd person singular present and regular plural
forms in English spoken by BP-EL2 speakers addressed voicing agreement
rather than epenthesis. Let us first consider works on voicing agreement and then
return to the alternation between [Cs] ~ [Cis].
¹ Alveopalatal affricates [tʃ, dʒ] may occur in BP when followed by a high front vowel, reflecting
a palatalization process: tia [tia] ~ [tʃia] (aunt), dia [dia] ~ [dʒia] (day).
Zanfra (2013) studied the voicing of sibilants in English by BP-EL2 speakers.
Although her focus was not specifically on plural forms, her results throw some
light on the current discussion. First, she considered cases where a word-final [s]
was expected to be pronounced (e.g. house, bus). Her results showed that [s] was
recurrent in words whose orthography ended in the letter <s>, as in bus, as op-
posed to words that presented a silent <e> word-finally, as in house. In the latter
case, higher rates of [z] were attested, suggesting that the silent letter <e> played a
role in the pronunciation of the word-final sibilant. It is worth mentioning that [z]
occurred followed by a vowel: ['haʊzi] house. The voicing of the sibilant, in this
case, is explained by a regressive assimilation rule involving adjacent segments in
BP word boundaries. Only voiceless sibilants occur word-finally followed by a
pause in BP, as in mês [mes] ‘month’. The regressive assimilation rule predicts that
if the next word begins with a vowel or a voiced consonant, then [z] occurs: mês
anterior [mez ə̃.te.ɾi.ˈoɾ] ‘previous month’ and mês bonito [mez bo.ˈni.tʊ] ‘beautiful
month’. If a voiceless consonant follows the sibilant, then [s] occurs: mês passado
[mes pa.ˈsa.dʊ] ‘last month’. Zanfra (2013) tested whether the BP voicing assimila-
tion rule involving adjacent segments in word boundaries would apply in BP-EL2
learners’ productions. Her results showed that sibilants tended to be voiced when
followed by a voiced consonant (e.g. The house backyard is huge) or by a vowel
(e.g. The mouse I saw is white). Conversely, a sibilant was voiceless when the fol-
lowing context was a pause (e.g. I won’t go if he goes.) or a voiceless consonant
(e.g. These pancakes are great). Zanfra (2013) suggested that BP-EL2 speakers
transfer the BP regressive assimilation rule into their L2 English.
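The contexts covered by the BP regressive assimilation rule can be summarised in a short Python sketch. This is only an illustration of the rule as described in this section: the function name, the simplified segment inventories and the use of an empty value to stand for a pause are assumptions made here, not part of Zanfra's (2013) analysis.

```python
import unicodedata

# Simplified, illustrative segment classes (an assumption, not a full BP inventory).
VOWELS = set("aeiouɐəɛɔɪʊ")
VOICED_CONSONANTS = set("bdgvzʒmnlɾrɲʎ")


def bp_final_sibilant(following_word=None):
    """Return the predicted surface form of a word-final sibilant in BP.

    following_word is a broad transcription of the next word; None or an
    empty string stands for a pause.
    """
    if not following_word:
        return "s"  # only voiceless sibilants occur before a pause
    # Decompose so that a nasalised vowel starts with its base symbol.
    first = unicodedata.normalize("NFD", following_word)[0]
    if first in VOWELS or first in VOICED_CONSONANTS:
        return "z"  # voicing spreads leftwards from a vowel or voiced consonant
    return "s"  # a voiceless consonant follows


# The examples from the text: pause, 'anterior', 'bonito', 'passado'.
for nxt in (None, "ə̃teɾioɾ", "bonitʊ", "pasadʊ"):
    print("me" + bp_final_sibilant(nxt), nxt or "(pause)")
```

Under these assumptions, the sketch yields [z] before anterior and bonito and [s] before passado or a pause, matching the mês examples given above.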
Fragozo (2017) investigated the voicing of sibilants in English regular plural
forms and 3rd person singular present forms produced by BP-EL2 speakers. She assessed the
extent to which a sibilant would be manifested as voiced after a voiced conso-
nant, as in dogs or clubs, which would reflect the acquisition of a progressive
assimilation rule from English. The underlying representation for regular plural
and 3rd person singular present is assumed to be /z/ (Hayes 2011). The progres-
sive assimilation rule predicts that if a vowel or a voiced consonant precedes
/z/, the output is [z], as in keys, dogs. If a voiceless consonant precedes /z/, it
surfaces as [s], as in cats. Finally, if a sequence of sibilants occurs, the outcome
is [ɪz], as in kisses. Fragozo (2017) also examined words in context to verify if
the regressive assimilation rule, which applies to BP, would be transferred to L2
English. She found that voiced sibilants in BP-EL2 productions tended to follow
the regressive assimilation rule from BP, whereas the English progressive
assimilation rule had a very low rate in her data (0.6%). She argues that the low
rates of voiced sibilants [z] in BP-EL2 follow from the fact that these
consonants are partially voiced in English. Data from her control group of native
speakers presented 44% of expected voiced sibilants. Thus, as sibilants are
partially voiced in English, they would not be accessible in L2 English.
Zanfra (2013) and Fragozo (2017) both investigated voicing agreement in
BP-EL2 speakers within a rule-based approach where there would be a competition
between a regressive assimilation rule from BP and a progressive assimilation
rule from English. Their major finding is that the English progressive
assimilation rule hardly applies in BP-EL2. The explanation for this finding lies
in the transfer of the regressive assimilation rule from BP into L2 English. A
question that arises from this assumption is whether a rule that is transferred
from the L1 to the L2 could change as time goes by. Another contentious issue
is the role played by orthography, as in house or bus (Zanfra 2013).
Orthography cannot be modelled within a rule-based approach as it is not part of
Grammar. Furthermore, the rule-based approach adopted by Zanfra (2013) and
Fragozo (2017) neglected the role played by an epenthetic vowel that may inter-
vene between the two word-final consonants, as in cakes [keɪks] ~ ['keɪ.kis].
This paper intends to be a contribution to the understanding of word-final
sibilants in BP-EL2 focusing on the English regular plural formation. Similar to
previous research, this paper considers the role played by orthography in the
development of plural formation in L2 (Delatorre 2006; Silveira 2007; Zanfra
2013). On the other hand, different from previous works which adopt rule-based
approaches, this paper models L2 phonology within an Exemplar Model ap-
proach by considering representation robustness and the role of fine phonetic
detail in shaping representations. Within this proposal, orthography is mod-
elled as part of the linguistic knowledge of literate speakers.
In order to understand our proposal, one has to consider an ongoing sound
change in BP which involves [Cs] ~ [Cis] word-finally in the plural forms of
nouns. We will refer to these cases as [Cs]-nouns. Consider Table 1.
Table 1 shows examples of BP [Cs]-nouns. The first column lists the orthography
of BP nouns in the singular form. The second column presents the corresponding
transcriptions, which show that a word-final unstressed high vowel
occurs. The third column lists the plural forms for the nouns listed in the first
column, where the letter <s> is added to the singular. The fourth column, which
corresponds to our main interest, shows that [Cs] alternates with [Cis]
word-finally.2 The last column presents the gloss.

Table 1: [Cs] nouns in BP.
BP singular   Transcription   BP [Cs]-nouns   Transcription              Gloss
clube         ['klu.bi]       clubes          ['klu.bs] ~ ['klu.bis]     clubs
duque         ['du.ki]        duques          ['du.ks] ~ ['du.kis]       dukes
jipe          ['ʒi.pi]        jipes           ['ʒi.ps] ~ ['ʒi.pis]       jeeps
pote          ['pɔ.tʃi]       potes           ['pɔ.ts] ~ ['pɔ.tʃis]      pots
The alternation between [Cs] ~ [Cis] word-finally in BP follows from the
reduction and eventual loss of unstressed high front vowels when flanked by
a consonant and a final sibilant (Cristófaro-Silva, Almeida, and Guedri
2008; Leite 2006; Soares 2016). The alternation between the presence and ab-
sence of an unstressed high vowel between a consonant and a sibilant also ap-
plies to BP-EL2 plural forms, as in cakes [keɪks] ~ ['keɪ.kis]. This paper intends
to investigate [Cs] ~ [Cis] in English regular plural forms produced by BP-EL2
speakers attempting to address the question of whether an ongoing sound
change from the L1 plays a role in L2 learning.
The role of orthography in L2 pronunciation will also be addressed in this
paper. BP has only <Ces>3 as the orthographic correlate for [Cs] ~ [Cis], as
shown in Table 1, whereas English, on the other hand, has two orthographic
correlates for [Cs]: <Ces> as in grapes and <Cs> as in cats.4
This paper is organized as follows. The next section presents the EMPL2
model (Exemplar Model in L2 Phonology). The third section describes the
methodology adopted in this study. The fourth section presents and discusses the
results and is followed by a suggestion for the teaching of English plural forms
to BP-EL2 speakers.
2 In BP, word-final voiceless alveolar fricatives remain voiceless regardless of the alternation
between [Cs] ~ [Cis]. As previously mentioned, only if the next word begins with a vowel or a
voiced consonant will [z] occur (e.g. mês anterior [mez ə̃.te.ɾi.ˈoɾ] ‘previous month’). A question
that arises is whether both the alternation between [Cs] ~ [Cis] and the presence of a following
vowel have an influence on the voicing property of the word-final sibilant in L2 English. This
question will be addressed at the end of our analysis.
3 BP presents words such as cheques (checks) and mangues (wetlands), which display the
<Cues> orthographic pattern. For the sake of clarity, this pattern will be represented in this
paper as <Ces>.
4 A restricted number of plural forms in English present the <Cues> orthographic pattern, e.g.,
tongues and techniques. Due to the limited number of examples, they are not considered in
this paper.
2 Modeling L2 phonological representations

Traditionally, phonological representations present only contrastive segments.
Phonemes are abstract representations and are transcribed between
slashes, as in /p/, whereas allophones are transcribed between square
brackets, as in [p]. The actual pronunciation of a given sound emerges from some sort
of processing, in which the abstract representation is turned into a pronounce-
able one. Only unpredictable segments are present in phonological representa-
tions, i.e., phonemes. Subphonemic or allophonic variation which is predictable
by the environment is not part of phonological representations but emerges
through processing. Within this view, representations are simple – as they pres-
ent only unpredictable segments – and processing is complex by some theoreti-
cal artefact (Johnson 1997).
It is likely that this traditional assumption led dictionaries to list only
phonemic or unpredictable segments. For example, /p/ covers both the aspirated
and the unaspirated stops that occur in paper [ˈpʰeɪ.pər], which is
typically transcribed in dictionaries as /'peɪ.pər/. The aspirated [pʰ] is usually not
represented, as it predictably occurs at the beginning of a stressed syllable,
which is the case in the first syllable of paper. The /p/ in the final syllable of
paper is not in a stressed environment and thus is not aspirated. A pronounceable
transcription for paper is [ˈpʰeɪ.pər]. Furthermore, dictionaries typically do
not indicate what the slashes represent, and some dictionaries either use square
brackets – but actually present phonemic representations – or do not present
any brackets at all (Horta, Cristófaro-Silva, and Soares 2021). Of course, for a
native speaker it is obvious that the initial sound in paper is aspirated but the [p] in
the second syllable is not aspirated. However, for L2 learners this is not
transparent and little or no information is provided on how to decode the symbols
presented in dictionaries.
The aspiration of [pʰ] has a high degree of predictability in English: it is
manifested at the beginning of stressed syllables. However, in other cases, as for
example /t/, the outcome is not so predictable if one considers a word like water,
where /t/ may be manifested either as an alveolar stop, as a flap or as a glottal
stop. In these three potential pronunciations, /t/ occurs in an intervocalic
position and variation is related to geographical or sociolinguistic factors and not
simply to contextual ones (Harris and Kaye 1990). The pronunciation of /t/ also
varies depending on word boundaries. For example, in It may!, /t/ is pronounced
as an alveolar stop but in It is! /t/ can be pronounced either as an alveolar stop,
as a flap or as a glottal stop, as it is in an intervocalic position. It seems plausible
to posit a single segment /t/ to account for intervocalic pronunciations across word-
boundaries since it is contextually predictable: when followed by a consonant, [t]
occurs, as in It may!, whereas in intervocalic position, as in It is!, either an alveolar
stop, a flap or a glottal stop may occur. However, these latter pronunciations would
be related to geographical or sociolinguistic factors which are not predictable.
Finally, /t/ may also be aspirated when it appears at the beginning of a stressed
syllable: [tʰeɪk] take, following a generalization that also applies to /p/ in English, as
previously seen. The discussion about the phonetic realisation of /p/ and /t/
shows that in some cases there is a high degree of predictability; however, in other
cases, several options for pronunciations may occur.
Once again, dictionaries tend not to fully provide information about pro-
nunciation to L2 learners. The vast amount of language variation that goes on
in a language may be to blame for not providing explanations to L2 learners on
how to decode transcriptions in dictionaries or textbooks. Nevertheless, fairly
predictable variation – as the aspirated or unaspirated stops in English – is not
provided in dictionaries or textbooks to L2 learners.
Another type of information about pronunciation which is not available to
L2 learners in dictionaries is predictable morphophonological alternations. This
is the case of plural formation in English. Progressive assimilation accounts for
the occurrence of [z] after nouns that end in a vowel, diphthong or voiced
consonant (e.g., trees [triːz], pies [pʰaɪz], bags [bæɡz]), as well as the
occurrence of [s] after nouns that end in a voiceless consonant (e.g., cats [kʰæts]).
The insertion of a vowel between the two sibilants prevents sequences of
adjacent sibilants, so [ɪz] is the plural morpheme for words that end in a
sibilant (e.g., kisses [kʰɪsɪz], bushes [bʊʃɪz], watches [wɒtʃɪz]). Thus, a
simple representation of the plural suffix /z/
is proposed to account for the pronunciation of regular plural in English nouns
(Hayes 2011). The choice for either morpheme [z, s, ɪz] is predictable depending
on the final sound that occurs in the noun. As the plural morpheme is predict-
able, the traditional proposal assumes that it does not have to be listed in dic-
tionaries. Only irregular plurals are listed. However, learners of English as L2
have to infer the pronunciation of plural forms without being offered any infor-
mation or instruction about it.
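The predictability of the three plural allomorphs can be made concrete with a short sketch. The segment sets below are simplified assumptions rather than a full inventory of English, and the function name `plural_allomorph` is hypothetical.

```python
# Simplified, hypothetical segment classes; not a complete English inventory.
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS_NON_SIBILANTS = {"p", "t", "k", "f", "θ"}


def plural_allomorph(final_sound):
    """Choose the regular plural allomorph from the noun's final sound."""
    if final_sound in SIBILANTS:
        return "ɪz"  # vowel insertion after a sibilant: kisses, bushes, watches
    if final_sound in VOICELESS_NON_SIBILANTS:
        return "s"   # voiceless agreement: cats
    return "z"       # vowels, diphthongs and voiced consonants: trees, pies, bags


# Final sounds of the singular nouns used as examples in the text.
plurals = {
    "cat": plural_allomorph("t"),
    "bag": plural_allomorph("ɡ"),
    "tree": plural_allomorph("iː"),
    "kiss": plural_allomorph("s"),
    "watch": plural_allomorph("tʃ"),
}
```

The sibilant check comes first because [s] and [ʃ] are themselves voiceless and would otherwise be assigned [s].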
We understand that the theoretical approach adopted by dictionaries and
textbooks which excludes subphonemic or predictable information limits the
amount of information needed by L2 learners with regards to pronunciation. We
argue that L2 learners must be offered detailed information about pronunciation,
which includes subphonemic or predictable information. The theoretical ap-
proach that accommodates this proposal is the Exemplar Model (Bybee 2001,
2008; Johnson 1997), combined with the Revised Speech Learning Model (SLM-r)
(Flege and Bohn 2021), which we refer to as the EMPL2 (Exemplar Model in L2
Phonology), as proposed by Cristófaro-Silva and Guimarães (2021).
Exemplar Models claim that linguistic representations are shaped from expe-
rience (Bybee 2001, 2008, 2010). Any exemplar which is experienced is mapped
and abstractly represented by phonological and semantic identity and similarity.
In terms of the sounds of a given language, any fine phonetic detail as well as
contextual information is mapped onto abstract representations. Within an Exem-
plar Model approach, aspirated as well as unaspirated stops in English are present
in phonological representations, as well as the contextual information which de-
fines that aspiration of stops occurs in stressed position. Abstract representations
also contain grammatical information which emerges from the categorization of
experienced exemplars: “Lexical organization provides generalizations and seg-
mentation at various degrees of abstraction and generality. Units such as mor-
pheme, segment, or syllable are emergent in the sense that they arise from the
relations of identity and similarity that organize representations” (Bybee 2001: 7).
Within this view, the three plural suffixes for English nouns – [z], [s], [ɪz] –
emerge from language experience offering grammatical generalizations. That
means that any given noun has a plural morpheme associated to it when the
plural is regular. Irregular plurals have a special grammatical representation.
Consider Figure 1.
Figure 1: Plural formation in English.
Figure 1 illustrates the network involved in plural formation in English
based on Bybee (1995, 2002). For the sake of the present discussion, only
orthographic representations are shown in Figure 1. A network that connects form
and meaning is formed by exemplars, which in Figure 1 are presented inside
grey or white boxes. Singular forms are illustrated in grey boxes. Plural forms
are illustrated in white boxes. From linguistic experience, it emerged that plural
forms in English can be regular or irregular, as shown on the top of Figure 1.
Irregular plurals may have no morpheme or have an idiosyncratic plural. Regu-
lar plurals present three potential morphemes: [s, z, ɪz]. Generalizations emerge
as nouns ending in a voiceless consonant will take [s], nouns ending in a
voiced consonant will take [z] and nouns ending in a sibilant will take [ɪz]. Any
new noun that is accessed by a speaker – be it in the singular or in the plural –
will be connected to this network. If a speaker experiences a word such as
sheep in the context of plural, it will connect it to the Ø-plural category. If it
were a regular plural, it should have been sheeps, which was not the case. Simi-
larly, any word that ends in a voiceless consonant will receive [s] as its plural
morpheme, unless its plural is irregular. Networks are built from experience. It
follows from this that any network in L2 will start being built from the moment
the learner combines form and meaning to create linguistic categories. As L2
speakers already have their mother tongue, they use it to accommodate input
from the L2. This explains why L2 speakers have an accent which improves as
access to L2 becomes greater. It also explains why speakers with the same mother
tongue have similar accents as they share a linguistic network from the L1.
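As a rough illustration of how such a network could yield generalizations from experience, the sketch below stores each experienced plural and predicts the plural marker of a new noun from stored exemplars. It is a toy under strong assumptions: the class `ExemplarNetwork`, its methods and the coarse final-sound labels are hypothetical and are not part of Bybee's proposal or of the model in Figure 1.

```python
from collections import Counter


class ExemplarNetwork:
    """Toy store of experienced plural forms (hypothetical illustration)."""

    def __init__(self):
        self.by_word = {}   # noun -> Counter of plural markers experienced
        self.by_class = {}  # final-sound class -> Counter of plural markers

    def experience(self, noun, final_class, marker):
        """Record one experienced plural exemplar for a noun."""
        self.by_word.setdefault(noun, Counter())[marker] += 1
        self.by_class.setdefault(final_class, Counter())[marker] += 1

    def predict(self, noun, final_class):
        """Use the noun's own exemplars if any, else generalize by class."""
        counts = self.by_word.get(noun) or self.by_class.get(final_class)
        if not counts:
            return None  # nothing experienced yet for this noun or class
        return counts.most_common(1)[0][0]


net = ExemplarNetwork()
net.experience("cat", "voiceless", "s")
net.experience("map", "voiceless", "s")
net.experience("dog", "voiced", "z")
net.experience("kiss", "sibilant", "ɪz")
net.experience("sheep", "voiceless", "Ø")  # irregular plural with no morpheme
```

In this toy, sheep keeps its Ø plural because its own exemplars take precedence, while an unseen noun ending in a voiceless consonant receives [s], the most frequent marker stored for that class.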
The SLM-r (revised Speech Learning Model) was proposed by Flege and
Bohn (2021) and presents important advances in relation to a former model pre-
sented in Flege (1995), mainly with regards to differences between ‘early’ and
‘late’ L2 learners. The authors claim that the model’s “primary aim is to provide
a better understanding of how the phonetic systems of individuals re-organize
over the life-span in response to the phonetic input received during naturalistic
L2 learning.” (Flege and Bohn 2021: 3). Two points follow from the primary aim
of the model: L2 evolves in a dynamic fashion over one’s life-span and phonetic
systems interact with the input received in the L2. Flege (1995) and Flege and
Bohn (2021) both assume that phonetic categories are the source of L2
pronunciation. This view makes the SLM-r compatible with Exemplar Models, which
also assume that the input is relevant in shaping grammatical representations.
The major contribution of the EMPL2 is to add to Exemplar Models and SLM-r
the relevance of detailed phonetic information in shaping L2 phonological rep-
resentations, besides connecting orthography to linguistic representations.
Exemplar Models and the SLM-r assume that language-specific phonetic cat-
egories are sufficiently rich in detail. Additionally, both Exemplar Models and
the SLM-r also share the assumption that perception and production interact in a
dynamic fashion to construct abstract representations. The SLM-r suggests a
three-level model: a sensory-motor level, a phonetic category level and a
lexico-phonological level (Flege and Bohn 2021: 12), which could be accommodated in
three levels of Exemplar Models: neuromotor production schemas, perceptual-
articulatory categories and constructions. The EMPL2 adds an orthographic level
which is present in literate speakers’ representations. Consider Figure 2.
Figure 2: EMPL2 Plural formation in English.
Figure 2 presents a network consisting of a zoom from Figure 1 for regular plu-
ral formation that takes the morpheme [s]. Orthographic representations are
presented inside angle brackets and phonetic representations are presented
without any brackets. All speakers have phonetic representations which are in
fact formed of all the exemplars experienced for the category. Thus, several ex-
perienced instances of cup form the exemplar for this word. For the sake of clar-
ity, this is simplified in Figure 2 to a single exemplar. Only literate speakers
have access to orthographic representations which are connected to their corre-
sponding phonetic category. In the diagram of Figure 2, all the words shown
end in a voiceless consonant, i.e., C-. A generalization that emerges from this
network is that any noun that ends in a voiceless consonant (except sibilants)
will receive the morpheme [s] if it is a regular plural. It is also inferred from the
network in Figure 2 that a noun in the plural that ends in a final [s] presents a
voiceless consonant word-finally in its singular form. It is also inferred from the
network in Figure 2 that plural forms for the nouns presented take <s> as the
orthographic representation of plural regardless of which consonant letter
ends the spelling of the noun.
Adult L2 learners have primarily written input. As they do not have the L2
network for orthography, they adopt the potential corresponding orthography
from their L1. Besides pronunciation and orthography, the model presented in
Figures 1 and 2 also includes a perceptual level which is connected to the network.
Modelling perception and production within a single model has been pro-
posed by the Bidirectional Phonetics-Phonology model (Boersma 2011; Boersma
and Hamann 2009). More recently, the model has been extended with a reading grammar
that encompasses orthography and is referred to as the BiPhon Model (Hamann and
Colombo 2017; Zhou 2021). The main difference between the EMPL2 proposal and
the BiPhon Model lies in how abstract representations are related to empirical
data. In the BiPhon model, simple abstract representations are processed in a
complex manner, whereas in the EMPL2, representations are complex and map-
ping is simple (Johnson 1997). We tested the EMPL2 model in plural formation in
English by BP-EL2 speakers focusing on the distribution of [s, z].5
3 Methodology
This study investigated the production of plural nouns in two languages: L1 Bra-
zilian Portuguese and L2 English. In BP’s regular plural forms, only a voiceless
consonant occurs word-finally. Thus, all nouns in their plural forms will be re-
ferred to as Cs-nouns. In English regular plural forms, either a voiceless or a
voiced sibilant may occur word-finally depending on the final consonant of the
noun in the singular. The data from English will be referred to as Cs-nouns and
Cz-nouns depending on how the sibilant is expected to be pronounced in English.
A set of 36 plural nouns ending in a (stop + sibilant) sequence was considered
in BP, all of which present a single orthographic pattern: <Ces>, as in cheques
[ʃɛks] ~ ['ʃɛ.kis] ‘cheques’. For the L2 English case study, a set of 36 words was
selected, where 15 words display the orthographic pattern <Ces>, as in grapes
[ɡreɪps], and the other 21 words display the orthographic pattern <Cs>, as in
maps [mæps]. This distribution is shown in Table 2.
5 Since our objective is to investigate the production of [Cs] ~ [Cis] sequences, analyzing [ɪz]
would take us beyond this paper as the absence of [i] would trigger two adjacent sibilants.
Cristófaro-Silva, Almeida and Guedri (2008) analyzed adjacent sibilants in BP. Future studies
could consider these cases in L2 English.
Table 2: BP and L2 English target words.

Brazilian Portuguese: Cs-nouns
Pattern   ps        ts        ks        bs        ds        gs
<Ces>     alpes     artes     cheques   árabes    baldes    açougues
          chopes    botes     cliques   Caribes   grades    bumerangues
          crepes    chutes    duques    clubes    lordes    jegues
          jipes     cortes    leques    orbes     redes     mangues
          naipes    dentes    toques    plebes    sedes     ringues
          xaropes   potes     truques   robes     tardes    sangues

English: Cs-nouns (ps, ts, ks) and Cz-nouns (bz, dz, gz)
Pattern   ps        ts        ks        bz        dz        gz
<Ces>     grapes    gates     cakes     tubes     sides     —
          ropes     kites     lakes     globes    codes
          tapes     notes     snakes    cubes     modes
<Cs>      cups      cats      parks     jobs      beds      dogs, bags,
          maps      boats     books     pubs      kids      pigs, drugs,
          ships     hats      masks     labs      hands     bugs, rugs
Table 2 shows the distribution of the target words used for the BP and L2 En-
glish experiments. The uppermost part of the table lists BP Cs-nouns divided by
cluster type. As mentioned earlier, all 36 BP nouns display the <Ces>
orthographic pattern. The bottom of the table lists L2 English targets, which
comprise both Cs-nouns (e.g. cups [kʌps]) and Cz-nouns (e.g. bags [bæɡz]). L2
English words have also been divided by their orthographic patterns: words
such as grapes, gates and cakes are spelled with <Ces> word-finally, whereas
words such as cups, cats and books end in <Cs>.
In order to disguise the purpose of the experiment, a set of 72 filler items
were added to the words listed in Table 2 during the trials. Filler items consisted
of singular nouns that did not have a consonant cluster in word-final position,
as in ball [bɔːl] for English and banana [ba.'nɐ̃.nə] ‘banana’ for BP. All filler
items were discarded for the purpose of analysis.6 Stimuli presentation was ran-
domized with the sort_rand macro of Microsoft PowerPoint 2019.
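The mixing of targets and fillers can be sketched as follows. This is an equivalent shuffle written for illustration, not the PowerPoint macro used in the study; the function names and the `seed` parameter are assumptions.

```python
import random


def randomized_trial_order(targets, fillers, seed=None):
    """Label each item as target or filler and shuffle them together."""
    items = [(word, "target") for word in targets]
    items += [(word, "filler") for word in fillers]
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    rng.shuffle(items)
    return items


def targets_only(trials):
    """Discard fillers, as is done before the analysis."""
    return [word for word, kind in trials if kind == "target"]


# Tiny example with two targets and two fillers taken from the text.
order = randomized_trial_order(["cups", "maps"], ["ball", "tree"], seed=1)
```

Keeping the target or filler label on each item makes it straightforward to drop the fillers after recording.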
The experiment comprised two tasks that were performed by all partici-
pants, which took place one after the other. The first one consisted of a picture-
counting task in which participants were asked to count and name the items
shown in the pictures. Short carrier sentences that did not include orthographic
stimuli of the target words were given, as illustrated in Table 3.
Table 3: Stimuli and expected answers in the picture-counting task.
Stimuli                               Expected answers
____ ____ argentinos foram vistos     quatro crepes argentinos foram vistos
____ ____ foi vista                   uma banana foi vista
____ ____ are seen                    two maps are seen
____ ____ is seen                     one tree is seen
6 BP filler items include the following words: abelha (bee), avenida (avenue), bambu
(bamboo), banana (banana), batata (potato), bingo (bingo), bolo (cake), brinquedo (toy), cadeira
(chair), caminho (path), caneta (pen), carteira (wallet), cobra (snake), copo (cup), corvo (crow),
estátua (statue), família (family), festa (party), flecha (arrow), foto (photo), gato (cat), gravata
(tie), lago (lake), lenço (handkerchief), logotipo (logotype), menino (boy), mesa (table), metrô
(metro), mochila (backpack), pizza (pizza), sapo (frog), sapato (shoe), sofá (sofa), tornado (tor-
nado), torta (pie) and vulcão (volcano). English filler items include the following words: arrow,
avenue, bamboo, banana, bee, bingo, boy, country, cowboy, crow, day, eye, family, key, logo,
metro, party, pen, photo, pie, pizza, potato, radio, sky, spa, statue, tie, tissue, tomato, tornado,
toy, tree, volcano, way, window and zoo.
Table 3 illustrates slides that were presented in the picture-counting task.
The first stimulus in Table 3 was used for BP and participants were expected to
pronounce it as quatro crepes argentinos foram vistos (four Argentine crepes were seen).
The second stimulus in Table 3 consisted of a BP filler item and participants
were expected to pronounce it as uma banana foi vista (a banana was seen).
The third stimulus in Table 3 was for English and participants were expected to
pronounce it as Two maps are seen. Finally, the fourth kind of stimulus con-
sisted of an English filler and participants were expected to pronounce it as one
tree is seen. In the BP sentences, each plural noun was followed by either a
vowel – as in crepes argentinos (Argentine crepes) (108 tokens) – or, alterna-
tively, by a voiceless consonant – as in crepes canadenses (Canadian crepes)
(108 tokens). This distribution intended to evaluate if the sibilant would be
voiced intervocalically. As for the English sentences, each plural noun was
followed by a vowel – as in two maps are – (216 tokens) and, later, by a pause – as in
two maps (216 tokens). This distribution intended to evaluate if voiced sibilants
would be pronounced voiceless when followed by a pause.
The second trial consisted of a reading task. Initially, participants were
asked to read 72 BP sentences aloud. These sentences included the 36 target
words listed in Table 2 (e.g. Os alpes italianos são belíssimos) and 36 filler
items. After that, participants were asked to read 72 English sentences, which
also included 36 target words (e.g. The lakes are clean) and 36 filler items. Simi-
lar to the picture-counting task, BP nouns in the reading task were followed by
either a vowel or a voiceless consonant, with 108 tokens for each condition. On
the other hand, L2 English nouns were followed by a vowel and, later, by a pause,
with 216 tokens for each condition. The overall number of syllables was con-
trolled for both languages: 4 in English and 12 in BP, considering the deletion
of the [i] vowel. Sentence-level intonation and the morphological class of each
word were also controlled.
A group of six Brazilians studying at the Federal Center for Technological
Education of Minas Gerais, in the city of Araxá, participated in this study.7 All
participants were high school students who had been taking English classes as part
of the school’s curriculum for about one year. The group consisted of 3 males
and 3 females and their ages ranged from 15 to 17. Each student was asked to
take the online Kaplan Placement Test and those with either B1 or B2
proficiency levels (intermediate learners) of the Common European Framework of
Reference for Languages were invited to take part in this research. They were not
given any information about the aim of the experiment.
7 This research has been approved by the ethics committee from the Universidade Federal de
Minas Gerais, reference number: CAAE: 15116119.9.0000.5149.
Prior to the experiment, all participants filled out consent and screening
forms, ensuring that their data would be used for scientific purposes only. Due
to the recent COVID-19 pandemic, all interactions were performed remotely
through a video call on Google Meet. Experiments were recorded with the Open
Broadcaster Software Studio at a 48 kHz sampling rate. The obtained recordings
were converted into the Waveform Audio File Format (WAV) with Adobe Premiere
2020, which was able to maintain the same sampling rate as the original
files. The average time to complete the experiment was 45 minutes. A total of
648 tokens were collected for the L2 English study. For the BP study, 432 tokens
were collected. Samples were edited and manually annotated using Praat
TextGrids (Boersma and Weenink 2020). RStudio (R Studio Team 2020) was
used for statistical analysis. The chosen test was Pearson’s chi-square test,
available in base R (function chisq.test), which assesses
the significance of the effect of each variable. The adopted significance threshold
was 0.05, in agreement with general linguistic investigations (Levshina 2015).
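For readers who want to see what the test computes, the following is a minimal sketch of Pearson's chi-square statistic for a 2 × 2 table, written in Python rather than R. It omits the continuity correction that R's chisq.test applies by default to 2 × 2 tables, so its values may differ from those returned by R. The function name and the counts in the example are invented for illustration and are not the study's data.

```python
import math


def pearson_chi_square_2x2(table):
    """Pearson's chi-square (no continuity correction) for a 2 x 2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n  # count expected under independence
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With df = 1, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = two orthographic patterns, columns = [Cs] vs. [Cis].
chi2, p_value = pearson_chi_square_2x2([[10, 20], [20, 10]])
significant = p_value < 0.05  # threshold adopted in the study
```

The closed form for the p-value holds only for one degree of freedom, which is the case for every comparison reported in this chapter (df = 1).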
Two main research questions were investigated. The first one is related to
the relationship between phonological and orthographical representations. BP
plural forms whose final orthography is <Ces> were examined as well as regular
plural forms in English whose orthographical representations were either <Ces>
or <Cs>. The hypothesis posited was that the orthographic pattern <Ces> will
present more realizations of a vowel intervening between the final consonants
than the orthographic pattern <Cs>. This would show that a letter <e> favours a
vowel to be manifested. We also assessed whether or not visual input influen-
ces the pronunciation of a vowel between the word-final consonants.
The second question addressed the role of an ongoing sound change in-
volving [Cs] ~ [Cis] in L2. The hypothesis posited is that the most common pat-
tern from the L1 will emerge in the L2. This will offer evidence that it is not just
sounds that are transferred from the L1 to the L2, but rather patterns that reflect
subphonemic alternations.
Finally, this research considered the voicing of word-final sibilants.
In BP only voiceless sibilants occur word-finally, unless a vowel follows, in
which case a voiced sibilant occurs. In English, voiced and voiceless sibilants occur
word-finally. When a vowel follows the sibilant, the voicing remains as it
formerly was (rather than changing as in BP). We posited that word-final
voiceless sibilants will be favoured in L2 English, as it is the more robust pattern in L1.
We also posited that a voiced sibilant occurs at higher rates in an intervocalic
position [Cis + vowel].
4 Results and discussion

A number of recent works have proposed that orthography is mapped into pho-
nological representations (Colantoni, Steele, and Escudero 2015; Gomes 2019;
Hamann and Colombo 2017). As discussed earlier, studies on BP learners
producing the regular past and participle forms in English showed that the
presence of a letter which corresponded to the vowel in <ed> orthography favoured
an epenthetic vowel to occur. The orthography in regular verbs is always <ed>
although it may be pronounced as [d, t, ɪd]. Considering that the orthography
for past tense and participle is always <ed>, it is not possible to determine
whether it is the presence or the lack of a letter <e> that favours a vowel to
occur. In the case study presented in this paper, there are two orthographic pat-
terns for the regular plural forms: <s> as in parks and jobs and <es> as in cakes
and tubes. Thus, we can test whether or not the presence of a letter <e> favours
a vowel to be manifested in English plural forms. Consider Figure 3.
[Bar chart: [Cs] rates per orthographic pattern. x-axis: orthographic patterns; y-axis: frequency (%). Brazilian Portuguese <Ces>: 62; English <Ces>: 83; English <Cs>: 96.]
Figure 3: [Cs] rates per Orthographic Pattern.
Figure 3 shows the rates of [Cs] in regular plural forms in BP and BP-EL2.8 The
leftmost column shows that regular plural forms in BP, whose orthography is
<Ces>, presented a consonant followed by a sibilant, [Cs], in 62% of the cases. That means
that when a letter <e> appears in the orthography of BP plural forms, a vowel is
manifested in 38% of the cases. The two rightmost columns report data from
English spoken by BP-EL2 speakers. When the orthography in the plural form is
<Ces>, a consonant followed by a sibilant [Cs] occurred in 83% of the cases,
whereas in the cases where the orthography was <Cs>, a consonant followed by a
sibilant occurred in 96% of the productions. This result shows that the
pronunciation of [Cs] is more recurrent when the orthography is <Cs> than when the
orthography is <Ces> in regular plural forms in English. In other words, a vowel
will appear at higher rates when the orthographic pattern is <Ces> than when it
is <Cs>. Thus, it is more likely that a plural form such as tapes will have a vowel
pronounced between the last two consonants than a plural form such as maps. The
difference between the data presented in the two rightmost columns is statistically
significant for the orthographic patterns (χ² = 36.113, df = 1, p < 0.01).9 The
explanation for such difference lies in the different orthographic patterns.
8 For the purpose of the present discussion, we refer to [Cs] as a (consonant + sibilant)
sequence. As will be discussed later, the sequence may be either [Cs] or [Cz]. At this stage,
voicing is not relevant.
Figure 3 shows results for the orthographic patterns <Ces> and <Cs> in BP-EL2.
We also considered whether the type of task favoured the pronunciation of a vowel.
According to Delatorre (2006) and Silveira (2007), visual input favours the occurrence
of a vowel. In our experiment, the picture-counting task had no orthographic visual
input, whereas orthography was available in the reading task. If Delatorre (2006)
and Silveira (2007) are correct, then vowels should occur at higher rates in the
reading task than in the picture-counting task. Results showed that 51% of [Cs]
forms were associated with the picture-counting task and 49% with the reading task,
and no statistically significant difference was found between the tasks (χ² = 0.66,
df = 1, p = 0.41). This shows that it is the orthographic pattern, rather than the
type of task, that favours the occurrence of a vowel in BP-EL2 when a vowel letter is
present in the orthographic form.
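The chi-square comparisons reported above can be illustrated with a short sketch of a
Pearson test of independence on a 2 x 2 table. The counts used here are hypothetical
(100 tokens per orthographic pattern, chosen only to mirror the 83% and 96% [Cs] rates),
since token counts per pattern are not given in this section; the statistic obtained
therefore does not reproduce the reported value of 36.113. The function name
chi_square_2x2 is ours and no continuity correction is applied.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns the chi-square statistic and its p-value (df = 1),
    without continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # With df = 1, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# HYPOTHETICAL counts (not from the study): rows are orthographic patterns,
# columns are [Cs] productions vs. productions with an intervening vowel.
hypothetical_counts = [
    [83, 17],  # English <Ces>
    [96, 4],   # English <Cs>
]

chi2, p_value = chi_square_2x2(hypothetical_counts)
print(f"chi-square = {chi2:.3f}, df = 1, p = {p_value:.4f}")
```

With the real token counts in place of the hypothetical ones, the same function could be
used to check both the orthographic-pattern and the task comparison.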
Our proposal differs from those of Delatorre (2006), Silveira (2007) and Zanfra (2013),
who suggested that visual access to orthography in a reading task favoured the
pronunciation of a vowel. Our claim is that once speakers are literate, orthogra-
phy is part of their grammar, i.e., it has a permanent impact on mental represen-
tations. The EMPL2 model adopted in the current paper differs from Delatorre
(2006), Silveira (2007) and Zanfra’s (2013) rule-based approach mainly by assum-
ing that orthography is part of linguistic knowledge and not external to it.
Figure 3 shows a striking result: [Cs] occurs at high rates in both BP and BP-EL2.
In other words, the rate at which a vowel intervenes between the last two
consonants is low: 38% in BP and 10.5% in BP-EL2. That means that in most
cases [Cs] occurs in regular plural forms in BP and in BP-EL2. Within an Exemplar
Model, the [Cs] pattern is more robust than [Cis]. We suggest that the robustness
of [Cs] in the L2 comes from the ongoing sound change in the L1, where [Cs]
occurs at higher rates than [Cis]. We claim that an ongoing sound change in the
L1, which reflects subphonemic information, plays an important role in shaping
L2 linguistic knowledge. In other words, phonetic detail has an impact on L2
phonology. This issue is further explored in the following pages.
9 We acknowledge that a bigger set of data may be required to shed new light on this matter.
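The vowel rates quoted above follow from the [Cs] rates in Figure 3 by simple
complementation. The sketch below makes that arithmetic explicit. It assumes, as our own
reading rather than something stated in the text, that the 10.5% figure for BP-EL2 is the
unweighted mean of the two English patterns, which presupposes equal token counts for
<Ces> and <Cs>. The variable and function names are illustrative only.

```python
# [Cs] rates (%) as reported for Figure 3.
cs_rates = {
    "BP <Ces>": 62.0,
    "English <Ces>": 83.0,
    "English <Cs>": 96.0,
}


def vowel_rate(cs_rate):
    """Rate (%) of forms with an intervening vowel, i.e. the complement of [Cs]."""
    return 100.0 - cs_rate


vowel_rates = {pattern: vowel_rate(rate) for pattern, rate in cs_rates.items()}

# ASSUMPTION: equal token counts per English pattern, so a plain mean is used.
english_vowel_mean = (
    vowel_rates["English <Ces>"] + vowel_rates["English <Cs>"]
) / 2
english_cs_mean = 100.0 - english_vowel_mean

for pattern, rate in vowel_rates.items():
    print(f"{pattern}: vowel in {rate}% of cases")
print(f"BP-EL2 overall: vowel in {english_vowel_mean}% of cases")
```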
The second research question we posited concerned the voicing of the
word-final sibilant in [Cs] and [Cis]. This was the main issue considered by Zanfra
(2013) and Fragozo (2017) within a rule-based approach. Their analyses claimed
that voicing in BP-EL2 did not reach the rates expected in English owing to constraints
on the distribution of sibilants in BP and to regressive assimilation. BP only presents
voiceless sibilants word-finally. However, across word boundaries, BP sibilants are
voiced when followed by a voiced consonant or a vowel: mês [mes] ‘month’, mês
bonito [mez ˈbo.ni.tu] ‘beautiful month’, mês anterior [mez ə̃.te.ɾi.ˈoɾ] ‘previous
month’. According to Zanfra’s (2013) and Fragozo’s (2017) proposals, a regressive
assimilation rule caused sibilants to be voiced when followed by a voiced
consonant or a vowel.
In this paper, we offer an alternative view to the preceding rule-based ap-
proaches. Within the scope of the EMPL2, it is suggested that generalizations
from an ongoing sound change in BP phonology are transferred into BP-EL2,
where phonetic detail plays an important role in shaping mental representa-
tions. General results showed that [Cs] occurred in 62% of cases in BP and in
89.5% of cases in BP-EL2 (Figure 3). We suggest that these results show that
[Cs] is a robust pattern which is adopted in English L2. Consider Figure 4.
All bars in Figure 4 show the rates of word-final voiceless sibilants where
the alternation [Cs] ~ [Cis] is observed in regular plural forms. The
white bars illustrate data from BP where sibilants are followed by a voiceless
consonant (1st and 2nd white bars) or by a vowel (3rd and 4th white bars).10 Results
show that a voiceless sibilant always occurs when it is followed by a voiceless
consonant (1st and 2nd white bars). This was expected, as only voiceless
sibilants occur word-finally in BP. The third white bar shows that in
85% of the cases in which [Cs] is followed by a vowel, a voiceless sibilant occurs.
However, when [Cis] occurs, voiceless sibilants were produced in 58% of
the cases, as seen in the fourth bar. In BP, it is traditionally assumed that a
sibilant is produced as voiced when flanked by two vowels. If
this generalization applied to all cases in our data, then the third and fourth
white bars should present 100% of voiced sibilants, which is not the case. In
intervocalic position, 42% of sibilants are voiced, and in cases
where [Cs] is followed by a vowel, 15% of voiced sibilants occurred (3rd white
bar). What the results presented in the third and fourth white bars show is that
a voiced sibilant may or may not occur in BP when the following environment
is a vowel. Thus, what takes place is not the application of a rule as posited by
Zanfra (2013) or Fragozo (2017), but rather a variable pattern involving the
[Cs] ~ [Cis] alternation.
10 The white bars aggregate BP data from the picture-counting task and the reading task, with a
total of 432 tokens. 216 tokens consist of target words followed by a voiceless consonant,
whereas the other 216 tokens consist of target words followed by a vowel. The grey bars
aggregate L2 English data for Cs-nouns and consider both the picture-counting task and the
reading task, with a total of 432 tokens. The black bars aggregate L2 English data for Cz-nouns
and also consider both production tasks, with a total of 432 tokens.
Figure 4: Rates of word-final voiceless sibilants per phonetic environment in BP and L2 English
(bar chart; y-axis: frequency, %). White bars, BP Cs-nouns: Cs + C-, Cis + C-, Cs + V, Cis + V.
Grey bars, English Cs-nouns: Cs#, Cs + V, Cis#, Cis + V. Black bars, English Cz-nouns: Cz#,
Cz + V, Ciz#, Ciz + V.
Within an Exemplar Model, the results from BP show that sibilants followed
by a voiceless consonant have a very robust pattern in BP (1st and 2nd
white bars). If generalizations from the [Cs] ~ [Cis] alternation in BP apply to L2
English, a voiceless sibilant is expected to occur word-finally in Cs-nouns
and Cz-nouns. This follows from the fact that voiceless sibilants occur categorically
in word-final position in BP (Cristófaro-Silva 2003). On the other
hand, exemplars for sibilants followed by a vowel may display variability in the
L2, being either voiced or voiceless. This follows from the findings shown
in the third and fourth white bars in Figure 4.
The grey and black bars illustrate results for regular plural forms in English
spoken by BP-EL2 speakers where [Cs] ~ [Cis] is observed. The grey bars illus-
trate results for plural forms which are expected to present a voiceless sibilant
word-finally: Cs-nouns. The black bars illustrate plural forms which are expected
to present a voiced sibilant word-finally: Cz-nouns. An overview of BP-EL2 data
shows that voiceless sibilants occur at high rates in Cs-nouns and Cz-nouns. It
is also observed that Cs-nouns display higher rates of voiceless sibilants than
Cz-nouns. This is expected, as BP only presents voiceless sibilants word-finally.
What we have to account for are the cases in which a voiceless sibilant
is expected in Cs-nouns but a voiced one occurs. Similarly, for Cz-nouns, we
have to account for cases in which a voiceless sibilant occurs when a voiced
one is expected. Consider Table 4.
Table 4: Expected and unexpected plural forms in BP-EL2.
Cs-nouns
Context      Expected [s]     Unexpected [z]
Cs#          %                %
Cs + V       %                %
Cis#         %                %
Cis + V      %                %
Cz-nouns
Context      Unexpected [s]   Expected [z]
Cz#          %                %
Cz + V       %                %
Ciz#         %                %
Ciz + V      %                %
Table 4 presents the rates of voiceless and voiced sibilants (cf. Figure 4). The
upper part of the table shows results for Cs-nouns and the lower part
shows results for Cz-nouns. Unexpected realizations of the plural morpheme
are presented in the shaded areas of the table.
Our proposal based on the EMPL2 accounts for the high rates of the expected
morpheme [s] (3rd column), as it reflects the more robust pattern in BP. Cases in
which an unexpected voiced sibilant [z] occurred in Cs-nouns tended to present an
adjacent vowel (4%, 10% and 19%), which, as in BP, favours voiced sibilants
(cf. 3rd and 4th white bars in Figure 4). The unexpected voiced sibilants in Cs-nouns
can be accounted for as the adoption of a subphonemic pattern observed in BP.
Cz-nouns present more unexpected voiceless sibilants than expected voiced ones
(except for the Ciz + V environment, which is discussed below). The high
number of unexpected [s] reflects the robust BP pattern, which is adopted in L2
English. The expected [z] occurs at a rate of 14% word-finally, which possibly reflects
the emergence of English phonology, where [z] occurs word-finally. In the other
three environments, higher rates of [z] are observed. Notice that in these contexts
an adjacent vowel occurs. The highest rates of [z] occur in intervocalic position
(40%), which is an environment that favours voiced sibilants in BP. These
results can be understood as reflecting exemplar patterns from the ongoing
sound change involving [Cs] ~ [Cis] in BP being adopted in BP-EL2 phonology.
In general, we can conclude that Cs-nouns present the expected plural
morpheme at higher rates than Cz-nouns. The rate of the expected plural morpheme in
Cz-nouns word-finally is low (14%). Although this is the expected pattern in
English, it appears to be challenging for BP learners. An adjacent vowel contributes
to the occurrence of a voiced sibilant, especially when the [Cis] pattern
is manifested (cf. lines 3 and 4 for Cs and Cz-nouns in Table 4).
Our results shed some light on the line of research carried out by Zanfra (2013)
and Fragozo (2017), who investigated the voicing of sibilants followed by a vowel
within rule-based approaches. We account for the fact that voiceless sibilants have
the highest rates in regular plural forms in BP-EL2, as [s] is the most robust exemplar
in word-final position in BP. We also account for the fact that the pattern [Cis]
favours a voiced sibilant in BP-EL2, as voiced sibilants are favoured in similar contexts
in BP (i.e. when they are followed by a word-initial vowel). This indicates that L1
exemplar patterns which reflect subphonemic information are adopted in the L2.
Finally, our analysis explains why [z] presents a low rate of production in BP-EL2: it is
an emergent pattern in the L2, since it does not occur word-finally in BP (unless
it is followed by a word-initial vowel, as stated above). Since [z] does not occur
word-finally in both languages, its exemplars are not robust in the L2. It is
through experience that such exemplars will become robust and more recurrent.
Thus, Fragozo’s (2017) interpretation that partial voicing in English prevents [z] from
occurring does not hold.
A final word must be said about intelligibility and comprehensibility (Derwing 2018).
It is likely that the facts discussed in this paper do not affect intelligibility in
BP-EL2. For example, the plurals monks and monkeys in BP-EL2 are likely to present
the same alternating forms, [mʌŋks] ~ [ˈmʌŋ.kɪs],11 which will possibly be resolved
by the context in which they occur. Our main concern in this paper was rather to
consider the role of orthography in L2 phonological representations (which seem to be
influenced by the orthographic patterns rather than by the visual presentation of the
words) and to account for the variable patterns observed in BP-EL2, which, in our
assumption, come from an ongoing sound change in BP. Further investigations on whether
or not intelligibility and comprehensibility are affected are desirable. The next
section considers some suggestions for teaching the pronunciation of English regular
plural nouns to BP learners.
11 Assuming that all the other segmental content is pronounced accordingly (Cristófaro-Silva
2011).
5 Suggestions for L2 pronunciation teaching

Considering the results found in the current paper, we propose the following
strategies to improve the pronunciation of English regular plural nouns by BP-
EL2 learners:
1. Discuss the ongoing [Cs] ~ [Cis] sound change in BP with learners.
2. Show students that in BP-EL2 the <Ces> orthographic pattern leads to
higher rates of [Cis] sequences than <Cs>.
3. Use Cs-nouns to show that [Cis] is not a pattern that occurs in English,
whereas [Cs] is widely attested.
4. Present voiced sibilants as a recurrent morpheme in English and practise them
with a following vowel (which favours the occurrence of [z]).
5. Present the generalization concerning plural formation in English to students:
nouns that end in voiceless consonants take the [s] morpheme and
nouns that end in voiced consonants take the [z] morpheme. Tell students that
this information is not available in dictionaries.
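The generalization in item 5 can be sketched as a small lookup, which teachers might
adapt when preparing practice materials. This is only an illustration under simplifying
assumptions: the sound classes and the function name plural_allomorph are ours, the
inventory of final sounds is deliberately reduced, and sibilant-final nouns are left out
because they fall outside the two-way [s] / [z] generalization stated above.

```python
# HYPOTHETICAL, simplified sound classes (IPA symbols); not taken from the chapter.
VOICELESS_NON_SIBILANTS = {"p", "t", "k", "f", "θ"}
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}


def plural_allomorph(final_sound):
    """Return the expected regular plural ending for a noun's final sound."""
    if final_sound in SIBILANTS:
        # Sibilant-final nouns are outside the two-way generalization sketched here.
        raise ValueError("sibilant-final nouns are not covered by this sketch")
    if final_sound in VOICELESS_NON_SIBILANTS:
        return "s"
    # In this simplified sketch, any other final sound is treated as voiced.
    return "z"


# Final sounds of a few nouns mentioned in the chapter (singular forms).
examples = {"map": "p", "tape": "p", "monk": "k", "tube": "b"}
for noun, final_sound in examples.items():
    print(f"{noun}: final [{final_sound}] -> plural [{plural_allomorph(final_sound)}]")
```

Note that the function works on the final sound, not the final letter: tape and tube end
in the letter <e> but in the sounds [p] and [b], which is precisely the mismatch between
orthography and pronunciation discussed in this paper.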
Besides the steps shown above, a considerable amount of practice is desirable.
This could include reading transcriptions, grouping words that end in the same
sound and identifying expected and unexpected pronunciations of English regular
plural nouns. The main objective of the steps we suggest is not to reach
native-like pronunciation, but rather to contribute towards fluency and
intelligibility in BP-EL2 (Alves et al. 2020).
6 Conclusions
This study examined the role of orthography in the production of plural forms
in English by Brazilian Portuguese (BP) speakers. It also considered the role
played by the ongoing [Cs] ~ [Cis] sound change in BP in L2 English. Results
showed that the orthographic pattern <Ces> favours the occurrence of a vowel at
higher rates than the <Cs> pattern. It was also shown that it was not visual access to
orthographic forms that triggered the occurrence of a vowel. We suggest that orthography
has a permanent effect on literate individuals’ mental representations. A bigger
set of data in future investigations may provide further insights into this proposal.
Concerning the role played by the BP ongoing sound change involving the
[Cs] ~ [Cis] alternation, it was shown that it has an impact on the L2. The analy-
sis based on the EMPL2 showed that robust patterns from the L1 are adopted in
L2, including fine phonetic detail that reflects subphonemic properties. The
proposal put forward in this paper offers a more comprehensive analysis than
previous rule-based models, as it explains the different pathways or trends that
BP-EL2 learners use to produce regular plural forms in English: from Cs-nouns
to Cz-nouns word-finally.
This study opens a number of questions that could be addressed in future
studies. All participants were classified as having an intermediate level of profi-
ciency in English. If our proposal is correct, we expect that students at advanced
levels will present similar distributions to those we found, but at lower rates.
This is because they will have had greater exposure to the L2 and therefore will
have more robust exemplars in the foreign language. A similar study could also
be carried out for the 3rd person singular present in English regular verbs, which
presents a similar distribution to the regular plural formation. Cases in which [z]
occurs word-finally in English singular nouns could be considered in order to as-
sess whether or not morphophonological generalizations contribute to improving
phonological knowledge. Other ongoing sound changes could be considered to
evaluate their impact on BP-EL2 phonology. Additionally, it might be worth as-
sessing whether the production of voiced and voiceless sibilants actually affects
intelligibility in BP-EL2. Finally, our recommendations for the teaching of English
plural forms could also be tested in order to verify whether they indeed improve
pronunciation as suggested.
References
Alves, Ubiratã, Susiele Silva, Luciene Brisolara & Ana Paula Engelbert (eds.). 2020. Fonética e
Fonologia de Línguas Estrangeiras: subsídios para o ensino. [Phonetics and phonology of
foreign languages: a support for teaching practices]. Campinas: Pontes Editores.
Bassetti, Bene. 2017. Orthography affects second language speech: Double letters and
geminate production in English. Journal of Experimental Psychology: Learning, Memory,
and Cognition 43(11). 1835–1842.
Boersma, Paul. 2011. A programme for bidirectional phonology and phonetics and their
acquisition and evolution. In Anton Benz and Jason Mattausch (eds.), Bidirectional
Optimality Theory, 33–53. Amsterdam and Philadelphia: John Benjamins Publishing
Company.
Boersma, Paul & Silke Hamann. 2009. Phonology in Perception. Berlin and New York: Walter
de Gruyter.
Boersma, Paul & David Weenink. 2020. Praat: Doing Phonetics by Computer [Computer
program]. Version 6.1.30, retrieved 3 November 2020 from http://www.praat.org.
Bybee, Joan. 1995. Regular morphology and the lexicon. Language and Cognitive Processes 10
(5). 425–455. https://doi.org/10.1080/01690969508407111 (accessed 28 May 2021).
Bybee, Joan. 2002. Word frequency and context of use in the lexical diffusion of phonetically
conditioned sound change. Language Variation and Change 14(3). 261–290.
Bybee, Joan. 2008. Usage-based grammar and second language acquisition. In Peter
Robinson & Nick Ellis (eds.), Handbook of Cognitive Linguistics and Second Language
Acquisition, 216–235. New York: Routledge.
Bybee, Joan. 2010. Language, Usage and Cognition. Cambridge: Cambridge University Press.
Colantoni, Laura, Jeffrey Steele & Paola Escudero. 2015. Second Language Speech.
Collischonn, Gisela. 2002. A epêntese vocálica no português do sul do Brasil [Vowel
epenthesis in the South of Brazil]. In Leda Bisol and Cláudia Brescancini (eds.), Fonologia
e Variação: Recortes do Português Brasileiro [Phonology and Language Variation: Issues
in Brazilian Portuguese], 205–230. Porto Alegre: EDIPUCRS.
Cristófaro-Silva, Thaïs. 2003. Fonética e Fonologia do Português: Roteiro de Estudos e Guia de
Exercícios [Phonetics and Phonology of Brazilian Portuguese: study guide and exercises],
7th edn. São Paulo: Contexto.
Cristófaro-Silva, Thaïs. 2011. Pronúncia do Inglês para Falantes do Português Brasileiro.
[English Pronunciation for Brazilian Speakers]. São Paulo: Contexto.
Cristófaro-Silva, Thaïs, Leonardo Almeida & Cristine Guedri. 2008. Phonological traces in the
loss of a plural marker in Brazilian Portuguese. Estudos Linguísticos [Linguistic Studies] 1
(1). 285–299. Lisboa: Edições Colibri/CLUNL. https://clunl.fcsh.unl.pt/wp-content/
uploads/sites/12/2018/02/thais-silva.pdf (accessed 03 July 2021).
Cristófaro-Silva, Thaïs & Daniela Guimarães. 2021. Paper submitted to Seminário de Ciências
da Fala [Speech Sciences Seminar], Federal University of Minas Gerais, 18–19 October.
Delatorre, Fernanda. 2006. Brazilian EFL learners production of vowel epenthesis in words
ending in -ed. Santa Catarina: Federal University of Santa Catarina thesis.
Derwing, Tracey. 2018. Putting an accent on the positive: New directions for L2 pronunciation
and instruction. International Symposium on Applied Phonetics, University of Aizu, Japan,
2018, 12–18.
Flege, James Emil. 1995. Second language speech learning: Theory, findings, and problems. In
Winifred Strange (ed.), Speech Perception and Linguistic Experience: Issues in Cross-
Flege, James Emil & Ocke-Schwen Bohn. 2021. The revised speech learning model (SLM-r). In
Ratree Wayland (ed.), Second Language Speech Learning: Theoretical and Empirical
Fragozo, Carina. 2017. Aquisição de regras fonológicas do Inglês por falantes de Português
Brasileiro [Acquisition of phonological rules in English by Brazilian Portuguese
speakers]. São Paulo: University of São Paulo dissertation.
Gomes, Maria Lúcia de Castro. 2009. A produção de palavras do inglês com o morfema ED por
falantes brasileiros: uma visão dinâmica [A dynamic view on the production of English
-ed morphemes by Brazilian speakers]. Curitiba: Federal University of Paraná dissertation.
Gomes, Matheus Freitas. 2019. A redução segmental em sequências #(i)sC no português
brasileiro [Vowel lenition in #(i)sC clusters in Brazilian Portuguese]. Belo Horizonte:
Federal University of Minas Gerais MA thesis.
Hamann, Silke & Ilaria Colombo. 2017. A formal account of the interaction of orthography and
perception. Natural Language and Linguistic Theory 35(3). 683–714.
Harris, John & Jonathan Kaye. 1990. A tale of two cities: London glottalling and New York City
tapping. Berlin and New York: Walter de Gruyter. https://doi.org/10.1515/tlir.
1990.7.3.251 (accessed 28 May 2021).
Hayes, Bruce. 2011. Introductory phonology. Oxford: John Wiley and Sons.
Horta, Bruno, Thaïs Cristófaro-Silva & Victor Soares. 2021. O Ensino de Pronúncia de Inglês
[Teaching English Pronunciation]. To appear in the journal Colineares. http://
natal.uern.br/periodicos/index.php/RCOL.
Johnson, Keith. 1997. Speech perception without speaker normalization: An exemplar model.
In Keith Johnson & John Mullenix (eds.), Talker Variability in Speech Processing, 145–165.
San Diego: Academic Press.
Leite, Camila Tavares. 2006. Seqüências de (oclusiva alveolar+ sibilante alveolar) como um
padrão inovador no português de Belo Horizonte [(alveolar stop + alveolar sibilant)
clusters as an innovative pattern in the Brazilian Portuguese spoken in the city of Belo
Horizonte]. Belo Horizonte: Federal University of Minas Gerais MA thesis.
Levshina, Natalia. 2015. How to do Linguistics with R: Data Exploration and Statistical
Analysis. Amsterdam and Philadelphia: John Benjamins Publishing Company.
Nascimento, Katiene. 2016. Emergência de padrões silábicos no português brasileiro e seus
reflexos no inglês língua estrangeira [Emerging sound patterns in Brazilian Portuguese
and their impact on English as a Foreign Language]. Fortaleza: Universidade Estadual do
Ceará dissertation.
Rafat, Yasaman. 2015. The interaction of acoustic and orthographic input in the acquisition of
Spanish assibilated/fricative rhotics. Applied Psycholinguistics 36(1). 43–66.
Rastle, Kathleen, Samantha McCormick, Linda Bayliss & Colin Davis. 2011. Orthography
influences the perception and production of speech. Journal of Experimental Psychology:
Learning, Memory, and Cognition 37(6). 1588–1594.
R Studio Team. 2020. RStudio: Integrated Development for R. [Computer program]. Retrieved
2 December 2020 from http://www.rstudio.com.
Silveira, Rosane. 2007. O papel desempenhado pelo tipo de tarefa e pela ortografia na
produção de consoantes em final de palavra [The role of task type and orthography on
the production of word final consonants]. Revista de Estudos da Linguagem [Language
studies journal] 15(1). 147–180.
Soares, Victor Hugo Medina. 2016. Encontros consonantais em final de palavra no português
brasileiro [Word-final consonant clusters in Brazilian Portuguese]. Belo Horizonte:
Federal University of Minas Gerais MA thesis.
Zanfra, Mayara. 2013. Phonological context as a trigger of voicing change: a study on the
production of English /s/ and /z/ in word-final position by Brazilians. Florianópolis:
Federal University of Santa Catarina MA dissertation.
Zhou, Chao. 2021. L2 speech learning of European Portuguese /l/ and /ɾ/ by L1-Mandarin
learners: Experimental evidence and theoretical modelling. Lisbon: University of Lisbon
dissertation.
Effect of task, word length and frequency
on speech perception in L2 English:
Implications for L2 pronunciation teaching
and training
Abstract: This study presents the findings of three perception tasks examining
the relative difficulty encountered by learners of L2 English in perceiving conso-
nants and vowels in high- and low-frequency words. The tasks focused on the
word level and involved a phoneme identification task, a discrimination task,
and a word dictation task. The participants were 130 students at public and pri-
vate universities in Greek-speaking Cyprus, exposed to L2 English as the lan-
guage of instruction. Overall, the findings indicate a task effect. Word length is
also a significant factor for speech perception based on the findings. Moreover,
the results of the study indicated difficulties with word frequency. According to
the item analysis, low-frequency words are more difficult to perceive, especially
with respect to consonants in the word dictation task. This could be attributed to
the acoustic-orthography interface in L2 phonology. Age, gender, as well as years
of L2 instruction and use, are statistically significant factors for speech perception.
The overall trend is in line with the Native Language Magnet Model
(NLM; Kuhl 2000), suggesting that non-native contrasts may be difficult to discriminate
when the prototype of an L1 category closely resembles two L2 phones.
Keywords: L2 speech perception, consonants, vowels, word frequency, word length, task effect
1 Investigating speech perception

Speech perception or the “understanding of speech” (Hacquard, Walter, and Marantz
2007; Raphael, Borden, and Harris 2007: 331; Saito 2015; Thomson
2012; Wang and Chen 2019) involves assigning meaning to an input sound
(speech signal) based on the information that is heard or perceived in it. The listener
needs to map the speech signal to linguistically meaningful units, which
entails a complicated decoding task. Infants, however, are able to discriminate
between speech sounds in their first language (L1) and even most speech sounds
in any second language (L2) before the age of around eight months (Best and
McRoberts 2003). Their general perceptual ability, that is, their ability to perceive
non-native speech sounds, seems to decrease in the second half of their first year
of life, since by that time infants become attuned to their L1, moving to
language-specific phonetic perception. Their ability to discriminate native speech
sounds continues into adulthood and is almost effortless; however, the ability to
discriminate non-native speech sounds declines considerably by adulthood (Goto
1971; Iverson et al. 2003; Werker and Tees 1984).
Elena Kkese, Cyprus University of Technology
Sviatlana Karpava, University of Cyprus
https://doi.org/10.1515/9783110736120-003
Among other perception models, the Native Language Magnet theory (NLM
and NLM-e: Kuhl 1993; Kuhl et al. 2008) provides an interesting model for speech
perception by trying to account for the mechanisms underlying the learning of
language-specific perception and providing evidence of how infants extract information
from the speech signal. Specifically, NLM-e describes infants’ phonetic perception
in four phases: in the first phase, infants can discriminate all
speech sounds; in phase two, they become more sensitive to native phonetic cues
compared to non-native patterns; in phase three, phonetic learning equals word-
learning; and in phase four, the outcome of analysing incoming speech is rela-
tively stable neural representations. Kuhl et al. (2008) refer to this as “native lan-
guage neural commitment”, which forms abstract phonetic categories stored in
memory and is language-specific as it is driven by earlier linguistic experience.
Because of this commitment, learning the speech sounds of a non-native
language in adulthood becomes difficult, leading to inadequate L2 perception.
The “native language neural commitment” interferes with the creation of new
mappings since by the time most L2 users start acquiring an L2, they are adults;
this implies that there has been loss of neural plasticity. Non-native perception,
however, could be improved by the creation of new mappings. In such cases, a
new perceptual system would be formed, allowing for the systems of the L1 and
L2 to coexist with minimal interference, as two separate regions of the brain;
one region would be responsible for processing the native language and the
other region would be responsible for processing the non-native language(s).
According to NLM, the perceptual magnet effect is what stops L2 users from perceiving
incoming speech objectively. The perceptual magnet effect refers to
L1 sound categories acting as attractors for newly perceived tokens. The closer a
speech token is to a native category prototype, the more difficult it is to
perceive; this is called the “gravitational pull” (Kuhl 1993; Kuhl et al. 2008).
Whereas the NLM holds that perceptual representations are stored in
memory, the Perceptual Assimilation Model (PAM and PAM-2: Best 1993, 1994,
1995; Best and Tyler 2007) takes an ecological approach to speech perception
(Best 1984) suggesting that listeners extract the invariants of articulatory gestures.
According to this model, children identify and learn to hear high-level articulatory
gestures, which differentiate L1 sound contrasts and facilitate L1 perception. These
L1-specific high-level articulatory gestures are used in new language environ-
ments. Beginner listeners assimilate L2 sounds to L1 sounds, which are perceived
as most similar given that non-native environments lack familiar articulatory ges-
tures. Discrimination is expected to be excellent in the cases when an L2 contrast
is perceptually assimilated to different native categories (two-category assimila-
tion). However, discrimination is expected to be poor when contrasting L2 sounds
are assimilated to the same L1 category (single category assimilation). In the case
of an L2 contrast, in which the one member is assimilated as a good version and
the other as a poor version of a native category (category-goodness assimilation),
the perceptual difficulty depends on the degree of difference in category goodness
between the two L2 phones while discrimination is expected to be moderate to
good. The next type involves cases where one L2 phone is categorised and the
other is not (uncategorised-categorised assimilation) while discrimination is good.
When both L2 phones are uncategorised (uncategorised-uncategorised assimila-
tion), discrimination may be poor or very good depending on the auditory and
phonetic similarities between the L2 phones. Finally, when the two non-native
phones are very different from the articulatory gestures of the L1 phonemes, these
are not perceived as speech sounds (non-assimilable assimilation); discrimination
may be poor to very good depending on the similarity of the sounds.
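The assimilation types just described, together with the discrimination level predicted
for each, can be summarised as a simple lookup table. The sketch below only restates the
predictions as worded in the preceding paragraph; the dictionary and function names are
ours and are not part of PAM itself.

```python
# Predicted discrimination for each PAM assimilation type,
# restating the summary given in the preceding paragraph.
PAM_PREDICTIONS = {
    "two-category": "excellent",
    "single-category": "poor",
    "category-goodness": "moderate to good",
    "uncategorised-categorised": "good",
    "uncategorised-uncategorised": "poor to very good",
    "non-assimilable": "poor to very good",
}


def predicted_discrimination(assimilation_type):
    """Return the predicted discrimination level for a PAM assimilation type."""
    return PAM_PREDICTIONS[assimilation_type.lower()]


for assimilation_type, prediction in PAM_PREDICTIONS.items():
    print(f"{assimilation_type}: {prediction}")
```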
In turn, the Speech Learning Model (SLM: Flege 1995, 2002) holds that
the problems in acquiring L2 sounds are the result of learners’ tendency to
relate new sounds to existing positional allophones. This process is called
“equivalence classification”, and because of it, L2 sounds get filtered out by L1
phonology. The model suggests that because “the mechanisms and processes
used in learning L1 sound system remain intact over the life span” (Flege 1995:
239), adults can learn to perceive new L2 properties accurately. Learners
can create new categories provided that they can perceive the phonetic differences
between L2 sounds. According to the model, L2 sounds are more difficult to
perceive if they are similar to L1 sounds, whereas L2 sounds that differ from
L1 sounds are easier to perceive. One main difference compared to
NLM is that SLM predicts one common phonological space for both the L1 and
the L2 systems. Recently, Flege and Bohn (2021) have revised the SLM; the revised
Speech Learning Model (SLM-r) is an individual differences model aiming
to account for how phonetic systems reorganise over the life span based on the
phonetic input received during L2 learning.
Therefore, the three above-mentioned perceptual models aim to explain L1
and L2 speech perception and production as well as the connection between
them. The cues listeners rely on may be articulatory or acoustic, as already outlined,
depending on the model. Emphasis is given to previous linguistic experience,
which influences the initial state of L2 perception, facilitating discrimination in
cases where the same acoustic cues are present in the L1. In all three models, what
seems to interfere with L2 perception is the similarity between the L1 and L2
sounds. The NLM, though, seems to account more satisfactorily for speech per-
ception since it explicitly refers to the mechanisms leading to the acquisition of
language-specific perception and provides evidence of how infants extract in-
formation from the signal by identifying distributional and probabilistic proper-
ties of the language used in the immediate environment (Kuhl 1993; Kuhl et al.
2008). Based on the NLM theory, therefore, the closer a novel sound is to an L1
category, the harder it will be for the listener to perceive it. In this context,
when listening to specific consonants and vowels in L2 English, Cypriot-Greek
(CG) listeners will experience a “gravitational pull” since the L1 category will
act as a magnet for novel sounds preventing their accurate perception.
The NLM, therefore, attempts to account for the inaccurate perception of L2
phonetic segments in the specific context; nonetheless, L2 perception is also ex-
pected to be influenced by extra-linguistic and linguistic factors. In the L2 con-
text under investigation, extra-linguistic factors involve age, gender, years of L2
instruction and use while linguistic factors refer to word length and word fre-
quency. With regard to extra-linguistic factors, age seems to be an important fac-
tor for the accurate acquisition of L2 sounds (Hurford 1991; Lenneberg 1967; Long
1990; Patkowski 1990; Scovel 1969; Walsh and Diller 1981). More specifically,
children exposed to the L2 by the age of six manage to speak the L2 without an
accent; nonetheless, if learning starts after the age of 12, foreign accent will
occur, which is thought to be the result of a neurological change due to normal
maturation (Lenneberg 1967; Patkowski 1990; Scovel 1969). Turning to gender,
female speakers seem to be more concerned with accuracy in L2 pronunciation
compared to male speakers (Moyer 2016), suggesting that the former may process
language in a different manner than their male interlocutors. Moving to years of
instruction, adult L2 learners seem to have an advantage, as they benefit more
from the amount of classroom exposure to the L2 and from proficiency (Krashen et al.
1978). Lastly, reported use of the L2 is another important factor for L2 acquisition:
learners who use the L2 more should acquire it better (Johnson and Krug
1980; Schumann 1978).
Linguistic factors, on the other hand, involve word length and word fre-
quency, which are factors that could further affect the perception of L2 sounds.
Concerning word length (Goldstein 1983; Kkese 2016), there is evidence that
words involving more syllables could be more easily detected than one-syllable
words (Goldstein 1983; Hulme et al. 2006; Lovatt, Avons, and Masterson 2000).
Based on word frequency (Kkese 2016; Kkese and Karpava 2019; Pierrehumbert
2003), high-frequency words can be accessed faster while there will be fewer
problems retrieving these words when information is missing or when there is
noise in the acoustic signal. Low-frequency words, however, cannot be identi-
fied on the basis of fewer perceptual cues and, as a result, cannot be that easily
predicted (Kkese 2016).
The present study aims to investigate the perception of the complete set of
English consonants and vowels by CG listeners of L2 English when these are
found in high- and low-frequency words, according to the NLM theory, taking
extra-linguistic and linguistic factors into consideration. Even though L2 En-
glish is widely used in Greek-speaking Cyprus (Kkese 2016), phonetic research
comparing L1 CG and L2 English is limited, focusing on plosive consonant per-
ception on a word (Kkese 2016, 2020a, 2020b; Kkese and Petinou 2017a, 2017b)
or utterance level (Kkese 2016), as well as consonant and vowel perception on a
word level (Karpava and Kkese 2020; Kkese and Karpava 2019).
1.1 Phonological systems of CG and English
To gain an insight into the inventories of consonants and vowels in SBE (Stan-
dard British English) and CG, it is important to briefly describe the differences
between the two systems. Even though SBE and CG share a similar alphabet,
there are many differences in terms of phonology that merit attention. To start
with, CG is a southeastern dialect of Greek, spoken in Greek-speaking Cyprus.
The dialect is closer to ancient Greek and differs from Greek at the phonological,
lexical, and syntactic levels (Petinou and Terzi 2002).
CG has a complicated consonant system consisting of approximately 51 sounds
(Table 1), including voiceless plosives and affricates, voiceless and voiced frica-
tives, nasals, and liquids (Arvaniti 1999). Consonants are further distinguished
based on consonant length (Arvaniti 2010; Kkese 2016). Turning to the vowel in-
ventory, this is constituted of the five simple vowels /i e a u o/ while there are no
diphthongs (Table 2). Moreover, vowels in CG do not differ in terms of duration
(Lengeris 2009), tense-lax or long-short distinction (Arvaniti 2007).
The target English variety investigated in the present study is Standard British
English (SBE), given that this is the variety to which students were exposed in English
phonetics and phonology modules at the specific universities.1 SBE has a
1 Participants at the specific universities had to attend one English phonetics and phonology
module two times a week (for three hours in total).
Table 1: CG consonantal inventory (adapted from Arvaniti 2010).
Labial Alveolar Postalveolar Palatal Velar
Plosive p pʰ: b t tʰ: d c cʰ: ɟ k kʰ: g
Affricate ts ʧ ʧ: ʤ
Fricative f f: v v: θ θ: ð ð: ʃ ʃ: ʒ ʒ: ç ç: j j: x x: ɣ ɣ:
s s: z z:
Nasal m m: n n: ɲ ŋ
Lateral l l: ʎ
Tap ɾ
Trill r
Table 2: CG vowel inventory.
         high  mid  low
front    i     e
central             a
back     u     o
Note: /o u/ are rounded vowels.
consonant system of only twenty-four sounds (Table 3). Concerning vowels, SBE
has a more complicated vowel system, consisting of at least twenty sounds
(Deterding 2004). Specifically, there are twelve monophthongs /i: ɪ ɛ æ u: ʊ ɔ: ɒ ɑ: ʌ ɜ: ə/
(Table 4), which are stressed phonemes except for the unstressed schwa [ə]
(Cruttenden 2014). Furthermore, this variety has eight diphthongs /aɪ eɪ ɔɪ aʊ əʊ
ɪə ɛə ʊə/: the five closing diphthongs /aɪ eɪ ɔɪ aʊ əʊ/ and the three centring
diphthongs /ɪə ɛə ʊə/ (Cruttenden 2014). Triphthongs are also present in SBE; these
consist of the five closing diphthongs with /ə/ added at the end, resulting in /aɪə eɪə ɔɪə
aʊə əʊə/. Duration differences between the lax and tense vowels are also important;
/ɪ ɛ æ ʊ ɒ ʌ/ are lax vowels while /i: u: ɔ: ɑ: ɜ:/ are tense.
One major difference between SBE and CG involves the consonantal invento-
ries of the two language varieties. Even though both varieties have a voiceless/
voiced distinction, the consonantal inventory of CG is considerably larger as indi-
cated in Table 1 due to additional consonants as well as the consonant length
distinction, which are lacking in SBE. Allophonic differences also account for
some of the differences; even though some consonants are shared by SBE and CG
Table 3: SBE consonantal inventory (adapted from Carr 1999).
Bilabial Labio-dental Dental Alveolar Post-alveolar Palatal Velar Glottal
Plosive p b t d k g
Affricate ʧ ʤ
Fricative f v θ ð s z ʃ ʒ h
Nasal m n ŋ
Approximant ɹ j
Lateral approximant l
Note: Other symbols: /w/ voiced labio-velar approximant
Table 4: SBE vowel inventory (monophthongs).
high mid low
front i: ɪ e æ
central ʌ ə ɜ:
back u: ʊ ɔ: ɒ ɑ:
Note: Only /u: ɔ: ʊ ɒ/ are rounded vowels

Centring diphthongs:
1. Three ending in /ə/: /ɪə eə ʊə/
Closing diphthongs:
1. Three ending in /ɪ/: /eɪ aɪ ɔɪ/
2. Two ending in /ʊ/: /əʊ aʊ/
such as the plosive consonants /p t k/, which occur across the two phonetic in-
ventories, CG also includes /pʰ tʰ kʰ/. Whereas in SBE these are allophonic differ-
ences (non-contrastive), in CG these are separate phonemes.
A second major difference between SBE and CG relates to the phonological
make-up of their vowel inventories. Namely, CG has seven fewer monophthongal vowel
categories than SBE, while differences can be observed between the ones that
are present based on vowel transcriptions alone. Whereas the five monophthongs
are orthographically similar in SBE and CG, they differ considerably at the pho-
netic level. This implies that there is not an orthographic-acoustic link between
SBE and CG, as the same grapheme can represent different phonemes in the two
languages. Specifically:
1. the grapheme ‘a’ can be represented with the phoneme /a/ in CG as in [ˈgata]
(cat), [ˈkap:a] (cape), but with /æ ɑ: ə/ in SBE as in [pʰæt] (pat), [pʰɑːt] (part),
[əˈpʰɑːt] (apart);
2. the grapheme ‘i’ can be represented with the phoneme /i/ in CG as in [ˈmiti]
(nose), [miˈsi] (half) but with /i: ɪ/ in SBE as in [ˈli:tə] (litre), [ˈlɪtə] (litter);
3. the grapheme ‘e’ can be represented with the phoneme /e/ in CG as in
[ˈmeres] (days), [ˈslaises] (fetes) but with /ɛ ɜ:/ in SBE as in [end] (end), [ɜːnd]
(earned);
4. the grapheme ‘o’ can be represented with the phoneme /o/ in CG as in [ˈkopos]
(trouble), [ˈponos] (pain) but with /ɔ: ɒ/ in SBE as in [pʰɔːt] (port), [kʰɒt] (cot);
5. the grapheme ‘u’ can be represented with the phoneme /u/ in CG as in
[sɣuˈrus] (curly ones), [ˈkuklus] (dolls) but with /u: ʊ/ in SBE as in [ˈlu:kə]
(lucre), [fʊl] (full).
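The five correspondences listed above can be summarised as a small lookup table. The following is only an illustrative sketch: the phoneme sets are copied from the list, while the dictionary and function names are hypothetical and not part of the study's materials.

```python
# Grapheme-to-phoneme correspondences exactly as listed in the text.
# The variable and function names are illustrative assumptions.
CG = {"a": ["a"], "i": ["i"], "e": ["e"], "o": ["o"], "u": ["u"]}
SBE = {
    "a": ["æ", "ɑː", "ə"],
    "i": ["iː", "ɪ"],
    "e": ["ɛ", "ɜː"],
    "o": ["ɔː", "ɒ"],
    "u": ["uː", "ʊ"],
}


def one_to_many(mapping):
    """Return the graphemes that correspond to more than one phoneme."""
    return sorted(g for g, phonemes in mapping.items() if len(phonemes) > 1)


def phoneme_count(mapping):
    """Total number of grapheme-phoneme correspondences in the mapping."""
    return sum(len(phonemes) for phonemes in mapping.values())


cg_ambiguous = one_to_many(CG)    # no CG grapheme is ambiguous
sbe_ambiguous = one_to_many(SBE)  # every listed SBE grapheme is ambiguous
cg_total = phoneme_count(CG)
sbe_total = phoneme_count(SBE)
```

In this sketch each CG grapheme maps onto a single vowel, whereas each of the same graphemes maps onto two or three SBE monophthongs, which is the one-to-many relation the list illustrates.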
Furthermore, CG completely lacks diphthongal and/or triphthongal vowel categories;
diphthongs are only present at the orthographic level but are not represented
phonetically. On the other hand, diphthongs are present in SBE on the
phonetic, but not on the orthographic level. Triphthongs, though, are quite
difficult to pronounce and recognise.
This study focused on the perception of consonants and vowels in L2 English
by CG listeners. Consonants and vowels that are considered relatively difficult for
this specific population (Arvaniti 1999; Karpava and Kkese 2020; Kkese 2016,
2020a, 2020b; Kkese and Karpava 2019; Kkese and Petinou 2017a, 2017b) were
chosen; this selection was further based on the researcher-instructors’ experience
with CG learners of L2 English. Given that there is a dearth of phonetic
and phonological studies in CG (Kkese 2016), it was of special interest to investi-
gate L2 speech perception of consonants and vowels since the L1 seems to have a
quite different sound inventory. Based on previous cross-linguistic studies on
speech perception, listeners map L2 sounds onto their L1 sound system. This im-
plies that the phonetic distances between the two sound systems may play an im-
portant role when it comes to the degree of success in L2 speech perception (Flege
and Wayland 2019; Guion et al. 2000; Wang and Chen 2019). Investigating L2
speech perception of consonants and vowels could provide a useful insight since
evidence from previous speech perception studies suggests that consonants and
vowels differ both acoustically and functionally. Consonants are said to be more
categorically perceived than vowels (Fry et al. 1962; Repp 1981, 1984), while the
two types of segments carry different kinds of information (Bonatti et al. 2005). As
a consequence, vowels are expected to be more challenging to L2 listeners (Per-
eira 2014). Concerning vowels, previous studies in production have indicated that
the size of the vowel inventory could influence vowel variability, with fewer
vowels imposing more vowel variability (Recasens and Espinosa 2006); more
complex L1 vowel inventories, thus, could facilitate listeners’ ability to attend to
cues in a native-like manner when perceiving L2 English vowels (Hacquard, Wal-
ter, and Marantz 2007; Kivistö-de Souza and Carlet 2014). Therefore, the present
study aimed to examine the relative difficulty encountered by CG listeners of L2
English in perceiving consonants and vowels focusing on consonant voicing and
vowel length. Specifically, the following research questions were investigated:
1. Is there a task effect on the consonant and vowel perception in L2 English?
Do learner variables such as age, gender, and years of L2 instruction and
use correlate with the results in the three tasks?
2. What is the effect of word length on the discrimination of L2 English conso-
nantal and vocalic contrasts?
3. What is the effect of word frequency on the perception of L2 English conso-
nant and vowel sounds?
Given that we assume that speech sounds are perceived categorically, vowels
are expected to be more difficult for the L2 listeners of English. One of the au-
thors’ intentions, thus, was to examine L2 speech perception as a function of
the type of task used for discrimination. L2 sounds were further examined
based on word length; words with fewer syllables were expected to be more dif-
ficult for L2 learners due to the lack of suprasegmental information (Kkese
2016). The authors also hypothesised that low-frequency words would be less
efficiently processed compared to high-frequency words since the former may
not be known to most people. This word frequency effect (Monsell, Doyle, and
Haggard 1989) suggests low-frequency words in L2 English may be distin-
guished with more difficulty by the L2 listeners. The findings seem to have sig-
nificant implications for L2 pronunciation teaching and training.
2 Methodology
2.1 Participants
A total of 130 adults with normal hearing and vision participated in this study; the selection
phase took place based on the participants’ self-reported language background
information, which was obtained via a language background questionnaire. The
questionnaire consisted of seven questions in the effort to gather some general
information about the participants, their first language, and further information
about their L2 English usage and exposure. Based on the participants’ responses,
84 were female and 46 male L1 CG speakers and their mean age was 20, ranging
from 17 to 28 (SD=2.91). They were all undergraduate students attending two public
universities and one private university in Greek-speaking Cyprus, exposed to L2 English
as the language of instruction; 59 participants were attending a public university
and 71 were students at the private university. With regard to L2 English, the par-
ticipants’ mean age of exposure to the language was 9.6, ranging from 0 to 19
(SD=3.42), and the mean number of years of formal instruction in L2 English was
10, ranging from 0 to 28 (SD=2.63). Concerning visits to English-speaking coun-
tries, 60 participants reported positively while 70 reported that they had never
been to an English-speaking country. Most students reported that they use L2 En-
glish in their everyday life (90 participants) while only 35 participants responded
that they do not generally use the language. Finally, in terms of L2 proficiency,
the mean IELTS score obtained was 6.5, ranging from 5 to 9 (SD=1.3),
indicating a low intermediate to advanced L2 English proficiency.
For the present study, non-probability convenience sampling was used given
that the participants were attending General English, Academic Writing and/or
Linguistics courses taught by the two researchers in L2 English. The only partici-
pants who were excluded from the sample were students whose L1 was not CG. It
is worth mentioning that participants had no previous knowledge of English pho-
netics and phonology at the beginning of the study. Participation was on a
completely voluntary basis and students were assured of the confidentiality
of their personal information. They agreed to take part in the study by signing a
consent form; the participants were divided into five groups (N=26) and they
were tested in consonant and vowel identification in familiar and non-familiar
real words spoken by native SBE speakers.
2.2 Procedure
All perceptual tasks took place in three quiet computer rooms at the universities
with individual computers and headphones (listening volume was set at 75 dB)
and were always closely monitored by the two researchers. The research period
involved one fall semester while data were collected in different sessions in
which the phoneme identification task was administered first (first session). The
next two sessions (sessions two and three) involved the administration of the dis-
crimination task; during the second session, the task focusing on consonants was
administered while session three involved the administration of the discrimina-
tion task focusing on vowels. The last task administered was the word dictation;
this task, however, was completed in six different sessions, as described below.
The four tasks were pre-recorded using Audacity 1.3 Beta software for recording
and editing sounds. All the speakers were native SBE speakers; for the first two
tasks, namely the phoneme identification and discrimination tasks, the same
female speaker (age 35) was used. For the word dictation task, one female and one
male speaker were used, as the acoustic input was taken from the online Macmillan
Dictionary.
During the first session, participants had to listen to five different words
that consisted of a minimal set, namely a target word and its four foils; they
were required to circle the word they could hear for a second time. Overall, the
participants had to respond to 20 minimal sets involving 100 words in total. For
sessions two and three, the discrimination tasks focusing on consonants and
vowels involved two-alternative forced-choice tasks in which participants had
to respond to 60 minimal pairs (120 words in total) in each task by circling the
one of the two words they heard for a second time. In sessions four to nine, the
participants had to listen to the six dictation tasks and record on a given score-
sheet the words they could hear; the task involved 120 words in total while
each session consisted of 20 words, namely ten words involving consonants
and ten involving vowels. For all the perceptual tests, participants had the
printed form in front of them, and they could listen to every task for a second
time, which allowed them to complete any missing information.
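The item counts given in this section can be cross-checked with simple arithmetic. The sketch below is illustrative only; the class and variable names are hypothetical, and the figures are those reported in the text (20 minimal sets of five words, 60 minimal pairs per discrimination task, six dictation sessions of 20 words).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDesign:
    """One perceptual task described by its presentation units."""
    name: str
    units: int           # minimal sets, minimal pairs, or dictation sessions
    words_per_unit: int  # words presented in each unit

    @property
    def total_words(self) -> int:
        return self.units * self.words_per_unit


# Figures as reported in the Procedure section.
designs = [
    TaskDesign("phoneme identification", units=20, words_per_unit=5),
    TaskDesign("discrimination (consonants)", units=60, words_per_unit=2),
    TaskDesign("discrimination (vowels)", units=60, words_per_unit=2),
    TaskDesign("word dictation", units=6, words_per_unit=20),
]

totals = {d.name: d.total_words for d in designs}
```

Under these figures the totals come to 100 words for phoneme identification and 120 words for each of the remaining tasks, in line with the totals stated above.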
2.3 Stimuli
The target sounds were a set of SBE consonants and vowels, which are problematic
for CG listeners of L2 English; these were placed word-initially, -medially or -finally
in real high- and low-frequency words, which had a transparent spelling. All three
tasks focused on the word level and the words were checked against the minimal
pairs for English RP (Received Pronunciation) lists by John Higgins (2008) and the
word list of Francis and Kucera (1982). The decision to examine these sets of pho-
nemes was driven by predictions of L2 speech perception models such as NLM,
PAM, and SLM, which take into account the influence of the L1 inventory when
predicting difficulties in L2 acquisition as well as evidence from previous studies
in L2 speech perception by CG listeners (Karpava and Kkese 2020; Kkese 2016,
2020a, 2020b; Kkese and Karpava 2019; Kkese and Petinou 2017a, 2017b). Partici-
pants had to undertake the three tasks, developed by the researchers, which
aimed to expose them to both L2 consonant and vowel contrasts in different trials.
They had to respond via a circling response mode and/or recording target words
while their answers were scored as correct or incorrect, generating an overall cor-
rect score percentage. Besides, correct score percentages for every consonant and
vowel category were further obtained.
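The scoring described above (each answer marked correct or incorrect, then converted to an overall and a per-category percentage) could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual procedure: the function name, the tuple layout, and the demo responses are all hypothetical.

```python
from collections import defaultdict


def score(responses):
    """Score (category, target, answer) tuples.

    Returns the overall percentage correct and a dict of
    per-category percentages. An answer counts as correct when it
    matches the target after trimming whitespace and ignoring case
    (an assumption made for this sketch).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, target, answer in responses:
        total[category] += 1
        if answer.strip().lower() == target.strip().lower():
            correct[category] += 1
    n = sum(total.values())
    if n == 0:
        raise ValueError("no responses to score")
    overall = 100 * sum(correct.values()) / n
    per_category = {c: 100 * correct[c] / total[c] for c in total}
    return overall, per_category


# Hypothetical responses, invented purely for illustration.
demo_responses = [
    ("/θ/", "thin", "thin"),
    ("/θ/", "thick", "sick"),
    ("/ð/", "then", "den"),
    ("/iː/", "sheep", "sheep"),
]
overall, by_category = score(demo_responses)
```

With the invented responses above, two of four answers are correct, so the overall score is 50%, while the per-category scores separate the fully correct /iː/ item from the missed /ð/ item.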
2.3.1 Phoneme identification task
The target sounds were the twenty-three SBE consonants /θ ð t d h s z ʃ ʧ ʤ v w f
j k g p b n m ŋ l r/ and the eleven monophthong vowels /ɪ i: e æ ɑ: ɒ ɔ: u: ʊ ɜ: ʌ/
arranged in minimal sets. For this task (see Appendix 1), 100 monosyllabic words
were arranged in 20 minimal sets, each consisting of five words. Out of the 20
minimal sets, ten involved consonants and the rest involved vowels. Two to four
distractors were used every 16 presentations; the distractors used for the minimal
sets involving consonants were made of vowels. Similarly, the distractors for the
vowel minimal sets consisted of consonants. The speaker was a female native
speaker of SBE.
2.3.2 Discrimination task
With reference to the consonants used for this task, these involved /ð/-/d/, /s/-/z/,
/θ/-/ð/, /p/-/f/, /t/-/d/, /m/-/n/, /θ/-/s/, /p/-/b/, /k/-/g/, and /l/-/r/. Vowels
included /ɪ/-/i:/, /e/-/i:/, /æ/-/e/, /ɒ/-/ɔ:/, /u:/-/ʊ/, /ɑ:/-/æ/, /ə/-/ɑ:/, /ɜ:/-/e/,
/ɔ:/-/ɜ:/, and /ʌ/-/ɜ:/. The focus of this task was on minimal pairs, so two dis-
crimination tasks (see Appendices 2a and 2b) were developed to address both
consonants and vowels. Each involved a total of 120 mono- or bi-syllabic words
presented in two fully randomised blocks of 60 minimal pairs. Distractors made
up eight of the minimal pairs; specifically, two to four distractors were used for
every 16 presentations. Consonants served as the distractors of the discrimination
task focusing on vowels, while the distractors were vowels for the minimal pairs
focusing on consonants. The same female native speaker of SBE was used.
2.3.3 Word dictation task
The SBE consonants included in the word dictation task were /ð z θ v d ŋ h b g ɹ/;
vowels involved /æ ɜː ɔː i: u: ɑ: e ʌ ə ʊ/. With regard to consonants, these involved
sounds which are problematic to CG speakers, mainly voiced consonants (Kkese
2016); as for vowels, five short and five long vowels were chosen given that in CG
there is no distinction in vowel length (Kkese 2016). The task (see Appendices 3a
and 3b) was made of 120 mono- and bi-syllabic words, out of which 60 involved
consonants and 60, vowels. There were ten conditions for consonant sounds
(6 words each) and ten conditions for vowel sounds (6 words each). The dicta-
tion task was split into six dictation sessions, each consisting of 20 words (ten for
consonants and ten for vowels). The acoustic input for isolated words from the
online Macmillan Dictionary was used while two speakers were employed, namely
a female and a male native SBE speaker.
3 Results
3.1 Target perception: Vowels vs. consonants
The researchers analysed the data for each task in terms of the participants’ target
perception of vowels and consonants; results are presented as percentages, that is,
averages of participants’ correct answers. The results of the study indicated
that the students were able to identify the vowel sounds /æ/, /ɜ:/, /e/, /ʌ/, /ʊ/, /e/
better in the phoneme identification task than in the other two tasks. This was not
the case for the sound /u:/, since performance was better in the dictation task;
also, for the sounds /ɔ:/, /ɑ:/, /i:/, performance was better in the discrimination
task, see Table 5. Taking each task into consideration, the most challenging vowels
in the phoneme identification task were the long vowels /ɑ:/ (20%), /u:/ (60.76%)
and /ɔ:/ (61.53%), which share backness.
It should be noted that the dictation task was quite challenging for the
participants, as their performance was below or barely above 50% accuracy for
most of the categories.
The most difficult vowels for perception in the dictation task were the long
vowels /ɜ:/ (21.90%), /ɑ:/ (42.47%) and /i:/ (47.68%); the short vowels /ʌ/
(40.62%), /æ/ (44.17%) and /ʊ/ (44.35%) also caused difficulties for the
participants. Performance on the other vowel sounds was slightly better (above 50%):
/ɔ:/ (53.81%) and /e/ (52.86%). It should be noted that the students were quite
successful regarding the perception of [ə] (74.62%).
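The 50% criterion applied to the dictation-task vowels can be made explicit with a short sketch. The percentages are those reported in the paragraph above; /u:/ is left out because its dictation score is not stated there. The function and variable names are hypothetical.

```python
# Dictation-task accuracy (%) for vowels, as reported in the text.
# /u:/ is omitted because no dictation score is given for it there.
dictation_vowels = {
    "ɜː": 21.90, "ɑː": 42.47, "iː": 47.68,
    "ʌ": 40.62, "æ": 44.17, "ʊ": 44.35,
    "ɔː": 53.81, "e": 52.86, "ə": 74.62,
}


def below(scores, threshold=50.0):
    """Return (category, percent) pairs under the threshold, hardest first."""
    hard = [(cat, pct) for cat, pct in scores.items() if pct < threshold]
    return sorted(hard, key=lambda item: item[1])


hardest = below(dictation_vowels)
```

Ordering the categories in this way places /ɜ:/ first and /i:/ last among the six vowels that fall under the 50% threshold.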
With respect to the discrimination task, the most challenging vowel pairs for
perception (below 50%) were /ɒ ɔ:/ (44.83%) and /u: ʊ/ (47.98%); see Table 5.
Table 5: Target perception: Vowels: Three tasks.
Category Phoneme identification task Dictation task Category Minimal pair task
/æ/ .% .% æ/e .%

/ɜ:/ .% .% ɜ:/e .%
/ɔ:/ .% .% ɔ:/ɜ: .%
/e/ .% .% æ/e .%
/ɑ:/ % .% ɑ:/æ .%
/ʌ/ .% .% ʌ/ɜ: .%
/u:/ .% .% u:/ʊ .%

/ʊ/ .% .% u:/ʊ .%
/e/ .% .% e/i: .%
/ɔ:/ .% .% ɒ/ɔ: .%
/i:/ N/A .% ɪ/i: .%
/ə/ N/A .% ə/ɑ: .%
As for the consonants, the results of the study indicated that the students were
able to identify the consonant sounds /z/, /d/, /b/, /g/, /t/ better in the phoneme
identification task than in the other two tasks. This was not the case for the sounds
/ð d/, /ð θ/, /p f/, /m n/, /θ s/, /l r/, which were perceived better in the discrimina-
tion task, and the sound /h/, which was perceived better in the dictation task.
Looking into each task, the most challenging consonants (below 50% of the
target-like perception) in the phoneme identification task were /ð/ (34.60%), /f/
(35.38%) and /θ/ (43.80%), which are similar in terms of manner of articulation.
The other difficult consonant sounds were /n/ (65.38%) and /s/ (73%), which are
comparable concerning the place of articulation.
The most difficult consonants for perception (below 50% of the target-like
perception) in the dictation task were /ð/ (23.13%) and /θ/ (37%), which are
close in terms of manner and place of articulation, /v/ (42.91%), as well as /ŋ/
(29%) and /g/ (40.07%), these latter sharing the same place and voicing. The
other consonant sounds that caused some difficulties were /d/ (54.44%) and /b/
(56.74%), which are similar with respect to the voicing and manner of articula-
tion, and /z/ (59.38%) and /r/ (70.20%), which are similar regarding voicing. It
should be noted that the students were successful in perceiving /h/ (85.18%).
As for the discrimination task, there were no challenging consonant sound
pairs for perception (below 50%). Still, some consonants caused difficulties: /ð θ/
(68.62%), /p b/ (70.51%) and /t d/ (73.44%), which differ in voicing, see Table 6.
Table 6: Target perception: Consonants: Three tasks.
/ð/ .% .% ð/d .%

/z/ .% .% s/z .%
/θ/ .% % ð/θ .%
/f/ .% N/A p/f .%
/d/ % .% t/d .%

/n/ .% N/A m/n .%
/s/ % N/A θ/s .%
/b/ .% .% p/b .%
/g/ . .% k/g .%
/t/ .% N/A t/d .%
/r/ N/A .% l/r .%
/v/ N/A .% N/A N/A
/ŋ/ N/A % N/A N/A
/h/ N/A .% N/A N/A
3.2 Non-target perception: Vowels
With reference to non-target perception, several vowels were replaced in the three
perceptual tasks; namely, the substitutions mostly involved the vowels /æ e ɜ:/.
Both in the phoneme identification and dictation tasks, /æ/ was mainly substituted
by /ʌ/, as well as by /ɑː/. Concerning the discrimination task, /e/ was mostly re-
placed by /æ/ since the two vowels are front and unrounded. The vowel sound /ɜ:/
was misperceived as /ɔː/, /ʊ/ and /uː/ in the phoneme identification task, depending
on duration. In the dictation task, /ɜ:/ was replaced by /ɛə/, /ʌ/, /ɑː/ and /æ/.
In the discrimination task, /e/ was mostly substituted by /ɜ:/.
In the phoneme identification task and dictation task, /ɔ:/ was replaced
by /ɒ/. In the discrimination task, /ɜ:/ was more often substituted by /ɔ:/, while /ɔ:/ was
more often misperceived as /ɒ/. In the dictation task, /ɔ:/ was also replaced by /əʊ/.
In the phoneme identification task and discrimination task, /e/ was replaced
by /iː/; in the dictation task, /e/ was substituted by /eɪ/, /ʌ/ and /æ/.
In all three tasks, /ɑ:/ was misperceived as /æ/. In the dictation task, /ɑ:/
was further substituted by /aʊ/, /aɪ/, /ʌ/, and by /ɒ/, which are central/back
vowels. In the discrimination task, /ʌ/ was misperceived as /ɜ:/.
In the phoneme identification task and the minimal pair task, the sound /u:/
was replaced by /ʊ/. In the dictation task, /u:/ was misperceived as /iː/ and /ɪ/.
In the minimal pair task, the sound /ɪ/ was substituted mostly by /i:/. In the dicta-
tion task, the sound /i:/ was misperceived as its short counterpart /ɪ/ and as /u:/.
In the discrimination task, /ə/ was replaced by /ɑː/ and in the dictation task by /ɪ/
and /aʊ/, see Table 7.
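A substitution analysis of this kind amounts to tallying, for each target vowel, which sounds were perceived in its place. The sketch below is one possible way to compute such a tally. It assumes that percentages are taken over the errors for each target, which the text does not specify, and the function name and demo trials are hypothetical.

```python
from collections import Counter, defaultdict


def substitution_table(trials):
    """Tally non-target responses per target sound.

    trials: iterable of (target, perceived) pairs.
    Returns {target: {perceived: percent of that target's errors}},
    with the most frequent substitution listed first. Correct
    responses are ignored (an assumption of this sketch).
    """
    errors = defaultdict(Counter)
    for target, perceived in trials:
        if perceived != target:
            errors[target][perceived] += 1
    table = {}
    for target, counts in errors.items():
        n = sum(counts.values())
        table[target] = {p: 100 * c / n for p, c in counts.most_common()}
    return table


# Hypothetical trials, invented purely for illustration.
demo_trials = [("æ", "ʌ"), ("æ", "ʌ"), ("æ", "ɑː"), ("æ", "æ"), ("uː", "ʊ")]
table = substitution_table(demo_trials)
```

In the invented data, /æ/ is replaced by /ʌ/ in two of its three errors and by /ɑː/ in the third, so /ʌ/ appears as the dominant substitution for that target.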
Table 7: Vowel perception across the three tasks: non-target perception and types of errors.
Test/Category Non-target Substitution
/iː/ /e/ /ʌ/ /ɑː/ /æ/ /ɒ/ /ɜː/ /ɔː/ /uː/ /ʊ/ /ɪ/ /əʊ/ /aɪ/ /aʊ/ /eə/ No production
P /æ/ % % % % % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
M æ/e % N/A % N/A N/A % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
D /æ/ % % % % % % % % % % % % % % % % %
Category Non-target /ɔː/ /uː/ /ʊ/ /ɒ/ /ɜ:/ /ɑː/ /ʌ/ /æ/ /e/ /i:/ /ɪ/ /əʊ/ /aɪ/ /aʊ/ /eə/ No production
P /ɜ:/ % % % % % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
M ɜ:/e % N/A N/A N/A N/A % N/A N/A N/A % N/A N/A N/A N/A N/A N/A N/A
D /ɜ:/ % % % % % % % % % % % % % % % % %
Category Non-target /ʊ/ /ɒ/ /uː/ /ɜː/ /ɔ:/ /ɑː/ /ʌ/ /æ/ /e/ /i:/ /ɪ/ /əʊ/ /aɪ/ /aʊ/ /eə/ No production
P /ɔ:/ % % % % % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
M ɔ:/ɜ: % N/A N/A N/A % % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
D /ɔ:/ % % % % % % % % % % % % % % % % %
Category Non-target /iː/ /ʌ/ /æ/ /ɪ/ /e/ /ɑː/ /ɒ/ /ʊ/ /uː/ /ɔ:/ /eɪ/ /əʊ/ /aɪ/ /aʊ/ /eə/ No production
P /e/ % % % % % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
M e/i: % % N/A N/A N/A % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
D /e/ % % % % % % % % % % % % % % % % %
Category Non-target /æ/ /e/ /ʌ/ /iː/ /ɑ:/ /ɒ/ /ɪ/ /ʊ/ /uː/ /ɔ:/ /eɪ/ /əʊ/ /aɪ/ /aʊ/ /eə/ No production
P /ɑ:/ % % % % % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
M ɑ:/æ % % N/A N/A N/A % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
D /ɑ:/ % % % % % % % % % % % % % % % % %
Category Non-target /æ/ /ɑː/ /e/ /ɜ:/ /ʌ/ /ɒ/ /ɔ:/ /uː/ /ɪ/ /ɔ:/ /eɪ/ /əʊ/ /aɪ/ /aʊ/ /eə/ No production
P /ʌ/ % % % % % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
M ʌ/ɜ: % N/A N/A N/A % % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
D /ʌ/ % % % % % % % % % % % % % % % % %
Category Non-target /ɜː/ /ʊ/ /ɔː/ /ɒ/ /u:/ /ʌ/ /e/ /iː/ /ɪ/ /æ/ /eɪ/ /əʊ/ /aɪ/ /aʊ/ /eə/ No production
P /u:/ % % % % % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
M u:/ʊ % N/A % N/A N/A % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
D /u:/ % % % % % % % % % % % % % % % % %
Category Non-target /uː/ /ɜː/ /ɒ/ /ɔː/ /ʊ/ /ʌ/ /e/ /iː/ /ɪ/ /æ/ /eɪ/ /əʊ/ /aɪ/ /aʊ/ /eə/ No production
P /ʊ/ % % % % % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
M u:/ʊ % % N/A N/A N/A % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
D /ʊ/ % % % % % % % % % % % % % % % % %
Category Non-target /ʊ/ /ɒ/ /uː/ /ɜː/ /ɔ:/ /ʌ/ /æ/ /e/ /ɪ/ /iː/ /eɪ/ /əʊ/ /aɪ/ /aʊ/ /eə/ No production
P /ɔ:/ % % % % % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
M ɒ/ɔ: % N/A % N/A N/A % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
D /ɔ:/ % % % % % % % % % % % % % % % % %
Category Non-target /ʌ/ /æ/ /ɪ/ /e/ /iː/ /ə/ /uː/ /ʊ/ /ɒ/ /ɜː/ /eɪ/ /əʊ/ /aɪ/ /aʊ/ /eə/ No production
P /iː/ N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
M ɪ/i: % N/A N/A % N/A % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
D /i:/ % % % % % % % % % % % % % % % % %
Category Non-target /ʌ/ /ɑː/ /ɪ/ /e/ /ə/ /æ/ /ɜː/ /ɔ:/ /ɒ/ /iː/ /eɪ/ /əʊ/ /aɪ/ /aʊ/ /eə/ No production
P /ə/ N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
M ə/ɑ: % N/A % N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
D /ə/ % % % % % % % % % % % % % % % % %
* P = Phoneme identification task; M = Discrimination task; D = Dictation task; N/A = Not available
3.3 Non-target perception: Consonants
The analysis of non-target perception revealed that, in the phoneme identification
task and in the dictation task, students misperceived /ð/, /θ/ and /t/ as /d/. In
the discrimination task, /d/ was most often substituted by /ð/. In addition, in the
dictation task, /ð/ was also perceived as /w/, /v/, /l/, and /j/.
In the phoneme identification task, /z/ was misperceived as /dʒ/; in the dis-
crimination task, /z/ was more commonly replaced by its voiceless counterpart /s/.
In the dictation task, /z/ was perceived as /s/, /d/ and /dʒ/, which could be due to
the similarity in voicing and manner of articulation. In the phoneme identification
task, /θ/ was misperceived as /d/ and /v/; in the discrimination task, /θ/ was more
commonly perceived as /ð/, while in the dictation task, the students perceived /θ/
as /b/, /f/, /t/, /d/, /k/, /p/, /v/ and /l/.
In the phoneme identification task, /f/ was misperceived as /v/, /j/ and /w/.
In the discrimination task, /f/ was substituted by /p/. In the phoneme identifica-
tion task, /d/ was replaced by /b/, /p/ and /t/. In the discrimination task, /d/ was
substituted by /t/. In the dictation task, /d/ was misperceived as /t/, /b/ and /k/.
In the phoneme identification task, /n/ was identified as /m/, /ŋ/, /r/ and /l/.
In the discrimination task, /n/ was replaced by /m/. In the phoneme identification
task, /s/ was changed into /z/ and /ʃ/; in the discrimination task, /θ/ was per-
ceived as /s/, which is mainly due to the matching in voicing and manner of
articulation.
In the phoneme identification task, /b/ was substituted by /p/; in the dis-
crimination task, /p/ was replaced mostly by /b/; in the dictation task, /b/ was
substituted by /p/, /d/, /k/, /t/ and /l/. In the phoneme identification task, /g/
was substituted by /p/; in the discrimination task, /g/ was not differentiated
from /k/. In the dictation task, /g/ was misperceived as /t/, /d/, /k/, and /r/,
based on voicing, manner and place of articulation.
In the phoneme identification task, /t/ was substituted by /n/, /m/, and /ŋ/.
In the discrimination task, /t/ was misperceived as /d/, as they are congruent in
manner and place of articulation. In the discrimination task, /r/ was most often
misperceived as /l/. In the dictation task, /r/ was substituted by /l/, /b/, /k/ and /p/. In
the dictation task, /v/ was represented by /t/, /w/, /l/, /f/, /b/, /n/, /s/, and /ð/; in
turn, /ŋ/ was replaced by /n/, /m/, and /ʃ/ (see Table 8).
Table 8: Consonant perception across the three tasks: non-target perception and types of
errors.
/t/ /d/ /θ/ /h/ /ð/ /w/ /b/ /f/ /v/
P /ð/ % % % % % N/A N/A N/A N/A N/A
M ð/d % N/A % N/A N/A % N/A N/A N/A N/A
D /ð/ % % % % % % % % % %
Category Non-target /dʒ/ /s/ /ʃ/ /tʃ/ /z/ /ð/ /d/ /b/ /v/
P /z/ % % % % % N/A N/A N/A N/A N/A

M s/z % N/A % N/A N/A % N/A N/A N/A N/A
D /z/ % % % N/A N/A N/A % % % N/A
Category Non-target /v/ /d/ /w/ /h/ /θ/ /ð/ /b/ /f/ /t/
P /θ/ % % % % % N/A N/A N/A N/A N/A

M θ/ð % N/A N/A N/A N/A % % N/A N/A N/A
D /θ/ % % % N/A N/A N/A N/A % % %
Category Non-target /v/ /w/ /h/ /j/ /θ/ /ð/ /f/ /p/ /t/
P /f/ % % % % % N/A N/A N/A N/A N/A

M f/p % N/A N/A N/A N/A N/A N/A % % N/A
D /f/ N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Category Non-target /g/ /p/ /b/ /t/ /d/ /ð/ /f/ /ʃ/ /r/
P /d/ % % % % % N/A N/A N/A N/A N/A

M d/t % N/A N/A N/A % % N/A N/A N/A N/A
D /d/ % N/A N/A % % N/A % N/A N/A %
Category Non-target /m/ /ŋ/ /r/ /l/ /n/ /ð/ /f/ /p/ /r/
P /n/ % % % % % N/A N/A N/A N/A N/A

M n/m % % N/A N/A N/A % N/A N/A N/A N/A
D /n/ N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Category Non-target /z/ /ʃ/ /dʒ/ /tʃ/ /s/ /θ/ /f/ /p/ /r/
P /s/ % % % % % N/A N/A N/A N/A N/A

M s/θ % N/A N/A N/A N/A % % N/A N/A N/A
D /s/ N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Category Non-target /p/ /d/ /k/ /g/ /b/ /θ/ /f/ /ʃ/ /t/
P /b/ % % % % % N/A N/A N/A N/A N/A

M b/p % % N/A N/A N/A % N/A N/A N/A N/A
D /b/ % % % % % N/A N/A N/A N/A %
Table 8 (continued)
/r/ /k/ /l/ /m/ /n/ /z/ /p/ /s/ /dʒ/ /g/ /j/ No production
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
% % % % % % % % N/A N/A % %
/r/ /k/ /l/ /m/ /n/ /w/ /p/ /t/ /dʒ/ /g/ /j/ No production
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
/k/ /z/ /p/ /m/ /n/ /z/ /p/ /s/ /dʒ/ /g/ /j/ No production
% % % N/A N/A N/A N/A N/A N/A N/A N/A %
/k/ /z/ /ʃ/ /m/ /n/ /z/ /p/ /s/ /dʒ/ /g/ /j/ No production
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
/v/ /w/ /k/ /n/ /z/ /dʒ/ /θ/ /s/ /m/ /g/ /j/ No production
% % % % % % % % N/A N/A N/A %
/r/ /v/ /w/ /l/ /m/ /n/ /s/ /θ/ /dʒ/ /g/ /j/ No production
% % % % % % % N/A N/A N/A N/A %
Table 8 (continued)
/t/ /d/ /θ/ /h/ /ð/ /w/ /b/ /f/ /v/
Category Non-target /t/ /d/ /p/ /b/ /k/ /g/ /f/ /θ/ /t/
P /g/ % % % % % N/A N/A N/A N/A N/A

M g/k % N/A N/A N/A N/A % % N/A N/A N/A
D /g/ % % % % % % N/A % % N/A
Category Non-target /m/ /n/ /ŋ/ /b/ /t/ /d/ /f/ /θ/ /t/
P /t/ % % % % N/A N/A N/A N/A N/A N/A

M t/d % N/A N/A N/A N/A % % N/A N/A N/A
D /t/ N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Category Non-target /l/ /r/ /m/ /n/ /ŋ/ /d/ /f/ /θ/ /b/
P N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
M l/r % % % N/A N/A N/A N/A N/A N/A N/A
D /r/ % % N/A % % N/A % % N/A %
Category Non-target /b/ /f/ /t/ /r/ /h/ /w/ /k/ /l/ /m/
P /v/ % % % % % % % % % %
M /ŋ/ % % N/A % N/A N/A N/A N/A % %
D /h/ % % N/A N/A % % N/A N/A N/A N/A
✶ P = Phoneme identification task; M = Discrimination task; D = Dictation task; N/A = Not Available
Table 8 (continued)
/r/ /k/ /l/ /m/ /n/ /z/ /p/ /s/ /dʒ/ /g/ /j/ No production
/r/ /v/ /w/ /l/ /m/ /n/ /s/ /θ/ /dʒ/ /g/ /j/ No production
% N/A N/A % % % N/A N/A N/A N/A N/A %
/r/ /v/ /w/ /l/ /ʃ/ /p/ /s/ /θ/ /dʒ/ /g/ /j/ No production
/k/ /p/ /g/ /ð/ /ʃ/ /p/ /s/ /θ/ /dʒ/ /g/ /j/ No production
% % % % N/A N/A N/A N/A N/A N/A N/A %
/n/ /p/ /θ/ /s/ /ð/ /d/ /v/ /k/ /dʒ/ /g/ /ʃ/ No production
% % % % % N/A N/A N/A N/A N/A N/A %

% % N/A N/A N/A % % % % % % %
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A %
3.4 Comparison of the three tasks: Vowel and consonant perception
The inferential analysis of the data was run in IBM SPSS Statistics 25. The researchers
used the participants’ mean scores per task and per category, as the number of test
items differed across the three tasks. The analysis showed that the students scored
higher in the phoneme identification and discrimination tasks than in the dictation
task, both for vowels and consonants. Overall, the phoneme identification task elicited
more target-like vowel sounds (77.6%) than the discrimination (71.34%) and the
dictation (48.67%) tasks; for consonants, better performance was associated
with the discrimination task (80.28%) than with the phoneme identification
(73%) or the dictation (49.81%) tasks. Vowel perception was better in the phoneme
identification task, whereas consonant perception seemed to be better in
the discrimination task and the dictation task. These findings suggest that there
is a task effect on vowel and consonant perception by L2 learners of English with
an L1 CG background (see Figure 1). According to a one-way ANOVA, there is no
statistically significant difference among the three tasks regarding vowel perception
(F(2,128) = 1.518, p = .124), but there is a statistically significant difference
regarding consonant perception (F(2,128) = 2.733, p = .002✶; ω² = 0.162✶✶), with a large
effect size.
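As an illustration of the kind of test reported above, the following is a minimal sketch of a one-way ANOVA with omega squared as the effect size (which the effect size reported alongside the ANOVA results appears to be). It assumes independent groups of scores and uses only the Python standard library. The function name and the sample scores are hypothetical and are not taken from the study, whose analysis was run in SPSS with settings that the chapter does not report.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, omega_squared) for a list of score lists.

    Assumes independent groups; this is an illustrative sketch, not the
    SPSS procedure used in the study.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within

    f_stat = ms_between / ms_within
    ss_total = ss_between + ss_within
    omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    return f_stat, df_between, df_within, omega_sq


if __name__ == "__main__":
    # Hypothetical accuracy scores (proportion correct) for three tasks.
    identification = [0.80, 0.75, 0.70, 0.85]
    discrimination = [0.82, 0.78, 0.74, 0.90]
    dictation = [0.50, 0.45, 0.55, 0.48]
    print(one_way_anova([identification, discrimination, dictation]))
```

The sketch returns the F ratio and its degrees of freedom but not a p-value, since the standard library has no F distribution; a statistics package would be needed for that step.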
Figure 1: Target perception: Vowels vs. consonants: Three tasks. [Bar chart; series: vowels, consonants; x-axis: phoneme identification task, minimal pair task, dictation task; y-axis: 0% to 100%.]
According to the paired samples t-test (run in IBM SPSS Statistics 25), the
difference between target vowel and consonant perception is statistically
significant in all three tasks: in the phoneme identification task
(t(129) = −3.293, p = .001✶✶; d = 0.826✶✶), in the discrimination task
(t(129) = −12.366, p = .000✶✶; d = 0.937✶✶), and in the dictation task
(t(129) = −9.958, p = .000✶✶; d = 1.190✶✶), each with a large effect size.
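The paired comparison of vowel and consonant scores can be sketched as follows. This is a minimal illustration under stated assumptions: each participant contributes one vowel score and one consonant score per task, and Cohen's d is computed as the mean difference divided by the standard deviation of the differences. The chapter does not state which formula for d was used, and the function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_test(x, y):
    """Return (t, df, d) for paired samples x and y.

    d is the absolute mean difference divided by the standard deviation
    of the differences (one common variant of Cohen's d for paired data;
    an assumption, not necessarily the variant used in the study).
    """
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    sd_diff = stdev(diffs)                      # sample SD (n - 1 denominator)
    t_stat = mean(diffs) / (sd_diff / sqrt(n))  # t with n - 1 degrees of freedom
    cohens_d = abs(mean(diffs)) / sd_diff
    return t_stat, n - 1, cohens_d


if __name__ == "__main__":
    # Hypothetical per-participant accuracy scores.
    vowels = [0.70, 0.65, 0.80, 0.75, 0.60]
    consonants = [0.78, 0.74, 0.82, 0.85, 0.71]
    print(paired_t_test(vowels, consonants))
```

A negative t value in this sketch means that the first sample (here, vowels) has the lower mean, which matches the sign convention of the values reported above if vowels were entered first.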
According to Pearson correlation, age is positively correlated with correct
vowel perception (r(130) = .215✶, p = .014✶), which means that the older the
learners, the better their ability to perceive L2 English vowels; this is probably
related to the increase in the years of learning L2 English. Years of learning L2
English are positively correlated with target consonant perception (r(130) = .231✶,
p = .008✶✶), which suggests that the quantity of L2 English input positively affects
the learners’ perception of consonants, and negatively correlated with correct
vowel perception (r(130) = −.231✶, p = .008✶✶) in the phoneme identification task.
With respect to the discrimination task, age is a statistically significant factor
for consonant perception (r(130) = .421✶✶, p = .000✶✶), which suggests that the older
the learners, the better their perception of L2 English consonants; this could
be related to the increase in the quality and quantity of the L2 English input. Other
significant factors for consonant perception are gender (r(130) = .193✶, p = .038✶),
as female participants seem to outperform male participants, years of studying the
L2 (r(130) = .499✶✶, p = .000✶✶), visits to English-speaking countries
(r(130) = −.299✶✶, p = .001✶✶), and contact with English
people (r(130) = −.350✶✶, p = .000✶✶). Age (r(130) = .293✶✶, p = .001✶✶) and contact
with English-speaking people (r(130) = .186✶, p = .039✶) are statistically significant
factors for vowel perception, which suggests a positive effect of the length,
quantity and quality of L2 exposure on L2 English vowel perception.
Regarding the dictation task, visits to English-speaking countries are negatively
correlated with correct vowel perception (r(130) = −.278✶, p = .020✶) and
correct consonant perception (r(130) = −.286✶, p = .017✶), which is somewhat
difficult to explain. Potentially, it could be due to differences in the duration of
the participants’ stays abroad and in the frequency, quality and quantity of their
actual communication with native English speakers abroad, which were not
thoroughly measured in this study.
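The correlations reported in this section can be illustrated with a short sketch of Pearson's r and the t statistic commonly used to test it. The variable names and values below are hypothetical, and the sketch assumes one background value (for example, age) and one accuracy score per participant.

```python
from math import sqrt


def pearson_r(x, y):
    """Return Pearson's correlation coefficient for two equal-length samples."""
    if len(x) != len(y):
        raise ValueError("samples must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def t_from_r(r, n):
    """Return the t statistic (n - 2 degrees of freedom) for testing r against 0."""
    return r * sqrt((n - 2) / (1 - r * r))


if __name__ == "__main__":
    # Hypothetical data: participant ages and consonant accuracy scores.
    ages = [18, 19, 21, 24, 30]
    consonant_accuracy = [0.62, 0.70, 0.68, 0.75, 0.81]
    r = pearson_r(ages, consonant_accuracy)
    print(r, t_from_r(r, len(ages)))
```

A negative r, as reported for visits to English-speaking countries in the dictation task, indicates only that higher values of one variable co-occur with lower values of the other; it does not by itself identify a cause.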
3.5 Word length effect: role of task and phonemes
The findings of this study showed that the word length effect depends on the
task and the type of the sound. In the phoneme identification and discrimina-
tion tasks, two-syllable words elicited more target perception of the vowel and
consonant sounds. In the dictation task, one-syllable words elicited more target
perception of the vowel sounds, whereas two-syllable words elicited more tar-
get perception of the consonant sounds and overall.
According to a one-way ANOVA, there is no statistically significant difference
among the three tasks regarding vowel perception in one-syllable words
(F(2,128) = 1.525, p = .121) or consonant perception in two-syllable words
(F(2,128) = 1.389, p = .177), but there is a statistically significant difference
regarding vowel perception in two-syllable words (F(2,128) = 1.931, p = .035✶;
ω² = 0.097✶), with a medium effect size, and consonant perception in one-syllable
words (F(2,128) = 2.803, p = .002✶✶; ω² = 0.168✶✶), with a large effect size (see Table 9).
Table 9: Accuracy rates and word length.
Sounds/Word length One syllable Two syllables
Phoneme identification task
Vowels .% .%

Consonants .% .%
Overall .% .%
Minimal pair task
Vowels % .%

Consonants .% .%
Overall .% .%
Dictation task
Vowels .% .%

Consonants .% .%
Overall .% .%
According to paired samples t-test, there is a statistically significant difference:

1. between target perception of vowels in one-syllable words and two-syllable
words (t(129)= 34.027, p=.000✶✶; d=0.192✶✶), with a large effect size, as
well as consonants in one-syllable words and two-syllable words (t(129)=
31.467, p=.000✶✶; d=0.751✶✶), with a large effect size, in the phoneme iden-
tification task;
2. between target perception of vowels in one-syllable words and two-syllable
words (t(129)= 60.604, p=.000✶✶; d=1.944✶✶), with a large effect size, as well
as consonants in one-syllable words and two-syllable words (t(129)= 10.440,
p=.000✶✶; d=0.969✶✶), with a large effect size, in the discrimination task;
3. between target perception of consonant test items in one- and two-syllable
words (t(129)= −2.881, p=.005✶✶; d=0.344), with a small effect size, but not
in terms of vowel test items (t(129)= −1.393, p=.168) in the dictation task.
3.6 Word frequency effect in the three tasks: Vowels vs. consonants
The results of the study indicated that the effect of word frequency on the perception
of vowels and consonants differs depending on the task and the type of target
sound (see Table 10). In the phoneme identification and dictation tasks,
high-frequency words were perceived more accurately (both vowels and consonants)
than low-frequency words, whereas in the discrimination task,
low-frequency words were perceived more accurately. Concerning vowels, higher rates
of target perception were elicited by low-frequency words in the phoneme
identification and discrimination tasks, and by high-frequency words in the dictation
task. As for consonants, high-frequency words elicited more target consonant sounds
in the phoneme identification and dictation tasks, and low-frequency words in the
discrimination task.
According to a one-way ANOVA, there is no statistically significant difference
among the three tasks regarding vowel perception in high-frequency
words (F(2,128) = .777, p = .638) or in low-frequency words (F(2,128) = .745, p = .715),
or regarding consonant perception in low-frequency words (F(2,128) = 1.661, p = .087), but
there is a statistically significant difference regarding consonant perception in
high-frequency words (F(2,128) = 2.851, p = .002✶✶; ω² = 0.171✶✶), with a large
effect size.
Table 10: Accuracy rates according to word frequency.
Sounds/Frequency High Low
Phoneme identification task
Vowels % %

Consonants .% .%
Overall % %
Minimal pair task
Vowels .% %

Consonants % %
Overall .% %
Dictation task
Vowels .% .%

Consonants .% %
Overall .% .%
According to the paired samples t-test, there is a statistically significant
difference:
1. between the target perception of high- and low-frequency vowel test items
(t(129) = −2.198, p = .030✶; d = 0.001), with a very small effect size, as well
as between the target perception of high- and low-frequency consonant test
items (t(129) = −10.104, p = .000✶✶; d = 0.882✶✶), with a large effect size, in
the phoneme identification task;
2. between the target perception of high- and low-frequency vowel test items
(t(129) = −70.218, p = .000✶✶; d = 1.305✶✶), with a large effect size, as well
as between the target perception of high- and low-frequency consonant test
items (t(129) = −95.822, p = .000✶✶; d = 0.896✶✶), with a large effect size, in
the discrimination task;
3. between the target perception of high- and low-frequency vowel test items
(t(129) = 7.938, p = .000✶✶; d = 0.948✶✶), with a large effect size, as well as
between the target perception of high- and low-frequency consonant test
items (t(129) = 5.676, p = .000✶✶; d = 0.678✶), with a moderate effect size,
in the dictation task.
4 Discussion and conclusion

The aim of this study was to investigate vowel and consonant sound perception
on the word level by L1 CG learners of L2 English. Through the findings, the
most challenging sounds for perception were revealed and a close insight was
gained into the underlying reasons and whether performance is due to cross-
linguistic influence, the individual differences of the learners, the task effect or
test item characteristics such as word length, and word frequency. The compar-
ison of the results based on the three tasks, in particular the phoneme identifi-
cation, discrimination task and dictation task, helped the authors to tap into
the depth of the perception process, investigate L2 phonology, and to examine
whether the findings of this study agree with the three most influential percep-
tual models: the Native Language Magnet Model, the Perceptual Assimilation
Model and the Speech Learning Model.
The Native Language Magnet model (NLM and NLM-e: Kuhl 1993; Kuhl
et al. 2008) suggests that perceptual representations are stored in memory and
L2 perception is constrained by the perceptual magnet effect. Therefore, accord-
ing to the NLM, there is L1 transfer as L1 sound categories are the ‘magnets’
that attract newly perceived tokens; consequently, the difficulty in L2 percep-
tion can be due to the similarity with the native category prototype. Based on
the Perceptual Assimilation Model (PAM and PAM-2: Best 1993, 1994, 1995; Best
and Tyler 2007), which follows an ecological approach to speech perception
(Best 1984), articulatory gestures that are used by L1 learners for perception are
also used by L2 learners for L2 discrimination; this suggests that L2 sounds are
assimilated to different native categories or to a single one, depending on how similar the
sounds are. According to the Speech Learning Model (SLM: Flege 1995, 2002),
the process of L2 perception is constrained by L1 phonology since there is one
common phonological space for both L1 and L2 systems. L2 learners compare
new sounds and the L1 positional allophones; therefore, L2 perception is easier
if L1 and L2 sounds are different.
Our findings support the NLM model as L2 learners seem to have difficulty
with the sounds that are similar in both the L1 and the L2, at least in terms of some
of their acoustic cues. Most of the vowel sounds that received high perception
scores in L2 English are different from the L1 CG vowels (/æ/, /ɜ:/, /ʌ/, /u:/, /ɔ:/,
/ɑ:/, /i:/) in terms of vowel length, highness, frontness, and roundedness.
This partially supports the Native Language Magnet theory (NLM and NLM-e:
Kuhl 1993; Kuhl et al. 2008) as the students found it easier to perceive the sounds
that are different from their L1 sounds or those that are not present in their L1
sound system. In terms of consonants, the L1 CG consonant sound inventory is
richer than L2 English. The results showed that the students had higher percep-
tion scores for the consonant sounds that are similar in both languages: /z/, /d/,
/b/, /g/, /t/, /ð d/, /ð θ/, /p f/, /m n/, /θ s/, /l r/, /h/. This could support the Per-
ceptual Assimilation Model (PAM and PAM-2: Best 1993, 1994, 1995; Best and
Tyler 2007) as well as the Speech Learning Model (SLM: Flege 1995, 2002) in case
the participants were more advanced in the L2.
Regarding the first research question, there is an effect of the task on the
vowel and consonant perception. The task effect may be due to the peculiarities
of each task, as in the phoneme identification and discrimination tasks, the stu-
dents had to listen to almost identical words and choose the target sound that
was repeated; what differed between the two tasks was the number of words in-
volved, since in the phoneme identification task, minimal sets of five words were
involved, while the discrimination task included only two words. However, in
the dictation task, there was a link between oral and written form, as the learners
had to listen to the aural input, decode it and then encode it by writing a relevant
word that is in line with the orthographic rules of the English language. L2 learn-
ers seemed to have difficulty in perceiving non-native sounds, mapping the
speech signal to meaning, decoding and encoding. According to the NLM theory,
listeners seem to be better at discriminating between- as opposed to within-
category contrasts for both consonants and vowels (Kuhl 1993).
The dictation task was found to be more difficult for the students than the
phoneme identification and discrimination tasks. Vowel perception was better
in the phoneme identification task, while consonant perception rates seemed to
be higher in the minimal pair and the dictation tasks. It was found that age,
gender, years of studying L2 English, visits to English-speaking countries as
well as contact with English people are significant factors that affect vowel and
consonant perception. The importance of these extra-linguistic factors has been
the focus of previous studies: age (Hurford 1991; Lenneberg 1967; Long 1990;
Patkowski 1990; Scovel 1969; Walsh and Diller 1981), gender (Moyer 2016; Oh
2011), years of studying L2 English (Best and Tyler 2007), visits to English-speaking
countries (Schumann 1978), and reported use in the L2 (Johnson and
Krug 1980; Krashen et al. 1978; Schumann 1978). Based on these studies, when
language learners start learning the L2 early and are exposed to enough com-
prehensible input, they are more successful.
The non-target perception of L2 English by CG speakers can be explained
by the differences in the sound systems and grapheme-phoneme correspond-
ences between L1 and L2, which is in line with previous studies (Flege and Way-
land 2019; Karpava and Kkese 2020; Kkese and Karpava 2019; Wang and Chen
2019). There are acoustic and functional differences between vowels and conso-
nants (Bonatti et al. 2005), as vowels tend to cause more difficulties to L2 listen-
ers (Pereira 2014). This depends on the L1 vowel inventory, how rich it is, and
whether L2 learners have a cue to process L2 sounds in a non-native language
(Hacquard, Walter, and Marantz 2007; Kivistö-de Souza and Carlet 2014). The
vowel system of English is more complex than that of CG and having in mind
that sounds are perceived categorically, this can explain the fact that vowels
are more difficult for L2 perception.
As for the second research question, the results of the three tasks may sug-
gest that some one-syllable words are more difficult than two-syllable words for
L2 perception, as they seem to lack information on primary stress. On the other
hand, polysyllabic words can be affected by different parameters including
loudness, length, pitch, and quality (Goldsmith 1990; Roach 2009). Moreover,
polysyllabic words adhere to various patterns of stress placement; for example, the
disyllabic adjective lovely [ˈlʌv.li] and its three-syllable noun counterpart
loveliness [ˈlʌv.li.nəs] are both stressed on the first syllable, as the transcriptions show.
Finally, with regard to the third research question, low-frequency words
were misperceived more in the dictation task. This is in line with Monsell,
Doyle, and Haggard (1989), who suggest that low-frequency words are misper-
ceived more in L2 English compared to high-frequency words due to the lack of
the familiarity effect. Based on the familiarity effect, words that are frequent
are acquired early and are more likely to be known compared to less frequent
words. This implies that high-frequency words need less processing time than
low-frequency words. However, this was not the case in the phoneme identification
and the discrimination tasks, as high-frequency words involving vowels
as the target sounds were misperceived more.
5 Pedagogical implications
The current study aimed to examine L2 perception in vowel and consonant
sounds by adult L2 listeners of L1 CG, who ranged from low intermediate to ad-
vanced L2 proficiency level. L2 learners tend to adjust the target L2 sounds to
the existing L1 cues as revealed through the results of the present study. How-
ever, the L2 perceptual ability may be influenced by further factors such as the
word length, lexical frequency, and type of task, as well as other individual dif-
ferences such as the age of acquisition, exposure to L2 learning, L2 proficiency,
and living in an L2-speaking country. Taken together, these factors may affect
perception, pointing to the need for pronunciation instruction experience and
specifically the need to incorporate bottom-up processing activities to support
L2 listening in the L2 classroom context. Bottom-up processing allows the L2
learners, especially at the early stages of L2 acquisition, to segment the speech
stream into meaningful units. Even though further studies are needed to test
the generalisability of the findings to different L2 learners (i.e., of different L2
proficiency and age profiles), pronunciation-focused teaching could consider-
ably help L2 learners of English in the investigated context and generally L2 set-
tings to master different acoustic-orthographic dimensions of L2 English at
both controlled and spontaneous speech levels (Saito 2015). Nonetheless, in the
educational system of Greek-speaking Cyprus, L2 learners of English do not re-
ceive any extensive pronunciation training.
The current study focused on speech perception in L2 English and therefore did
not examine speech production. Nonetheless, further studies could
focus on the impact of instruction on the perception and production abilities of L1
CG speakers of L2 English in an effort to help L2 English instructors understand
what makes the perception-production link difficult, thus helping the L2 listeners
overcome these difficulties (Kkese 2016). The current study also points to the
need for examining pronunciation instruction integrated into meaning-oriented
instruction contexts (Lee and Lyster 2016). Given that the study presented in this
chapter involved perception tasks conducted in an ‘isolated’ setting without any
sentential/communicative context, it would be very interesting to investigate how
pronunciation instruction integrated into meaning-oriented instruction contexts
could facilitate L2 perception and production (Bradlow et al. 1997; Thomson
2012). In this way, L2 learners, especially beginning or low-ability learners, would
be able to improve the intelligibility and accuracy of their L2 pronunciation
(Kkese 2016). In the present study, lexical frequency, word length, and task type
were identified as significant factors when it comes to L2 perception. Further-
more, in terms of lexical frequency, exposing L2 learners to both high- and low-
frequency words could help them respond more accurately and quickly to the L2
since lexical frequency affects language processing and, as a result, spoken word
recognition. High-frequency words, unlike low-frequency words, do not seem to
require semantic access that could lead to longer reaction time (Norris 2013;
Wang et al. 2021; Zhang et al. 2009), allowing learners to focus more extensively
on pronunciation. Likewise, word length may also influence the amount of time
needed for processing, since shorter words may lead to faster and more accurate
responses. Further, the task type also contributes to L2 perception since the dic-
tation task was found to be the most demanding task compared to the phoneme
identification and minimal pair tasks. This happened because in the dictation
task, participants had to retain information for a longer time since they had to
listen, perceive, and record the word they could hear on a given handout. Lexical
frequency, word length, and task type indicate that explicit instruction and
ample pronunciation practice in different situations (Kkese 2016) could, there-
fore, help L2 learners rely more on the acoustic-phonetic information in the
speech signal.
Appendix 1 – Phoneme discrimination task

For each line (1–20), first listen to the whole line. Then circle the one word that is
said twice. Note that meaning is not important for this exercise.
1 thigh thy die tie high
2 bean Ben ban bun barn
3 sue zoo shoe chew Jew
4 fall furl fool full foll(ow)
5 vow Wow! how thou thou(sand)
6 pull Poll(y) Paul pool pearl
7 fee V we ye he
8 bet beat but bat bit
9 goo Pooh! do two Boo!
10 park peck pack Puck peak
11 pan Pam pang pal par(agraph)
12 Patty party putty petty pity
13 so zo(ne) show Joe cho(sen)
14 can ken keen kin corn
15 pill dill kill gill Bill
16 word wooed would ward wad
17 Tay day pay bay gay
18 cooed could curd cod cord
19 Tim tin ting till tyr(anny)
20 bard bid bed bead bad
Appendix 2a – Word identification task consonants
The following pairs of words form minimal pairs that differ in only one sound.
Listen and circle the word (a) or (b) that you hear being pronounced.
 (a) gypping (b) jibbing

 (a) repute (b) refute
 (a) candour (b) gander
 (a) thigh (b) thy
 (a) sue (b) zoo
 (a) plaid (b) pled
 (a) blame (b) blain
 (a) chucking (b) chugging
 (a) mat (b) mad
 (a) rope (b) robe
 (a) earned (b) end
 (a) wordy (b) worthy
 (a) wraith (b) race
 (a) teeth (b) teethe
 (a) haulier (b) hoarier
 (a) marrow (b) narrow
 (a) mighty (b) nightie
 (a) side (b) scythe
 (a) youthful (b) useful
 (a) bus (b) buzz
 (a) receipting (b) receding
 (a) curdling (b) girdling
 (a) palate (b) ballot
 (a) thistle (b) this’ll
 (a) ball (b) boar

 (a) Elland (b) eland
 (a) vicar (b) vigour
 (a) ether (b) either
 (a) four (b) fir
 (a) cheap (b) chief
 (a) looser (b) loser
 (a) taunted (b) daunted
 (a) term (b) turn
 (a) enthuse (b) ensues
 (a) butting (b) budding
 (a) day (b) they
 (a) bowel (b) bower
 (a) cobble (b) corbel
 (a) were (b) wet
 (a) pick (b) pig
 (a) seamier (b) senior
 (a) wreath (b) wreathe
 (a) dose (b) doze
 (a) seed (b) seethe
 (a) passion (b) fashion
 (a) thanes (b) seines
 (a) punkah (b) bunker
 (a) lisle (b) rile
 (a) with (b) withe
 (a) hoar (b) her
 (a) tolling (b) doling
 (a) art (b) at
 (a) udder (b) other
 (a) sauce (b) saws
 (a) immobility (b) inability
 (a) sleuth (b) sluice
 (a) back (b) bag
 (a) clip (b) cliff
 (a) disperse (b) disburse
 (a) lavish (b) ravish
Appendix 2b – Word identification task vowels

The following pairs of words form minimal pairs that differ in only one sound.
Listen and circle the word (a) or (b) that you hear being pronounced.
 (a) bailee (b) bailey

 (a) leaving (b) living
 (a) Esther (b) Easter
 (a) arrant (b) errant
 (a) ob (b) orb
 (a) muscle (b) muzzle
 (a) two (b) to
 (a) gargle (b) gaggle
 (a) raider (b) radar
 (a) sir (b) set
 (a) past (b) fast
 (a) hoarse (b) hearse
 (a) forum (b) forearm
 (a) oompah (b) oompar
 (a) blood (b) blurred
 (a) us (b) errs
 (a) affluent (b) effluent
 (a) due (b) do
 (a) selling (b) ceiling
 (a) but (b) burr
 (a) met (b) meet
 (a) eat (b) it
 (a) oz (b) awes
 (a) arm (b) am
 (a) car (b) cat
 (a) dare (b) their
 (a) discerned (b) descend
 (a) fool (b) full
 (a) bat (b) bad
 (a) trustee (b) trusty
 (a) cos (b) cars
 (a) auburn (b) urban
 (a) exhort (b) exert
 (a) notch (b) nautch
 (a) suffer (b) surfer
 (a) a (b) ah
 (a) gnat (b) net
 (a) appealed (b) afield
 (a) cap (b) cab
 (a) bat (b) bet
 (a) coronel (b) kennel

 (a) tarsal (b) tassel
 (a) net (b) neat
 (a) cot (b) core
 (a) oomph (b) umph
 (a) ‘em (b) arm
 (a) itch (b) each
 (a) up (b) Earp
 (a) steeple (b) stipple
 (a) molasses (b) morasses
 (a) erred (b) Ed
 (a) theorem (b) serum
 (a) trad (b) tread
 (a) far (b) fat
 (a) pool (b) pull
 (a) not (b) nor
 (a) to (b) ta
 (a) cut (b) cur
 (a) dissever (b) deceiver
 (a) awning (b) earning
Appendix 3a – Dictation task consonants
Consonants
I. Condition [ð]
there /ðeə(r)/ [ð] (high frequency, initial position,  syllable, male voice)
thy /ðaɪ/ [ð] (low frequency, initial position,  syllable, male voice)
southern /ˈsʌðən/ [ð] (high frequency, middle position,  syllables, female voice)
heather /ˈheðə(r)/ [ð] (low frequency, middle position,  syllables, male voice)
smooth /smu:ð/ [ð] (high frequency, final position,  syllable, female voice)
lathe /leɪð/ [ð] (low frequency, final position,  syllable, female voice)
II. Condition [z]
zap /zæp/ [z] (low frequency, initial position,  syllable, male voice)
zebra /ˈzebrə/ [z] (high frequency, initial position,  syllables, male voice)
muzzle /ˈmʌz(ə)l/ [z] (low frequency, middle position,  syllable, male voice)
puzzle /ˈpʌz(ə)l/ [z] (high frequency, middle position,  syllable, female voice)
demise /dɪˈmaɪz/ [z](low frequency, final position,  syllables, female voice)
confuse /kənˈfjuːz/ [z] (high frequency, final position,  syllables, female voice)
III. Condition [θ]
thick /θɪk/ [θ] (high frequency, initial position,  syllable, male voice)
Thursday /ˈθɜː(r)zdeɪ/ [θ] (high frequency, initial position,  syllables, male voice)
ether /ˈiːθə(r)/ [θ] (low frequency, middle position,  syllables, male voice)
anthem /ˈænθəm/ [θ] (low frequency, middle position,  syllables, female voice)
wreath /riːθ/ [θ] (low frequency, final position,  syllable, male voice)
depth /depθ/ [θ] (high frequency, final position,  syllable, female voice)
IV. Condition [v]
vent /vent/ [v] (low frequency, initial position,  syllable, female voice)
vault /vɔːlt/ [v] (low frequency, initial position,  syllable, male voice)
beaver /ˈbiːvə(r)/ [v] (low frequency, middle position,  syllables, male voice)
cover /ˈkʌvə(r)/ [v] (high frequency, middle position,  syllables, female voice)
behave /bɪˈheɪv/ [v] (high frequency, final position,  syllables, male voice)
give /ɡɪv/ [v] (high frequency, final position,  syllable, male voice)
V. Condition [d]
dough /dəʊ/ [d] (low frequency, initial position,  syllable, male voice)
doctor /ˈdɒktə(r)/ [d] (high frequency, initial position,  syllables, female voice)
udder /ˈʌdə(r)/ [d] (low frequency, middle position,  syllables, female voice)
fodder /ˈfɒdə(r)/ [d] (low frequency, middle position,  syllables, male voice)
sad /sæd/ [d] (high frequency, final position,  syllable, male voice)
red /red/ [d] (high frequency, final position,  syllable, female voice)
VI. Condition [ŋ]
dung /dʌŋ/ [ŋ] (low frequency, final position,  syllable, male voice)
sing /sɪŋ/ [ŋ] (high frequency, final position,  syllable, female voice)
cunning /ˈkʌnɪŋ/ [ŋ] (low frequency, final position,  syllables, female voice)
finger /ˈfɪŋɡə(r)/ [ŋ] (high frequency, middle position,  syllables, male voice)
juncture /ˈdʒʌŋktʃə(r)/ [ŋ] (low frequency, middle position,  syllables, female voice)
tongue /tʌŋ/ [ŋ] (high frequency, middle position,  syllable, male voice)
VII. Condition [h]
hour /ˈaʊə(r)/ [h] (high frequency, initial position,  syllable, female voice)
heir /eə(r)/ [h] (low frequency, initial position,  syllable, male voice)
whelp /welp/ [h] (low frequency, middle position,  syllable, male voice)
vehicle /ˈviːəkl/ [h] (high frequency, middle position,  syllables, female voice)
downright /ˈdaʊnˌraɪt/ [h] (low frequency, final position,  syllables, male voice)
although /ɔːlˈðəʊ/ [h] (high frequency, final position,  syllables, female voice)
VIII. Condition [b]
blunder /ˈblʌndə(r)/ [b] (low frequency, initial position,  syllables, male voice)
blatant /ˈbleɪt(ə)nt/ [b] (low frequency, initial position,  syllables, male voice)
debunk /diːˈbʌŋk/ [b] (low frequency, middle position,  syllables, female voice)
table /ˈteɪb(ə)l/ [b] (high frequency, middle position,  syllable, male voice)
pub /pʌb/ [b] (high frequency, final position,  syllable, female voice)
club /klʌb/ [b] (high frequency, final position,  syllable, female voice)
IX. Condition [g]
gruff /ɡrʌf/ [g] (low frequency, initial position,  syllable, male voice)
gaunt /ɡɔːnt/ [g] (low frequency, initial position,  syllable, male voice)
cognate /ˈkɒɡneɪt/ [g] (low frequency, middle position,  syllables, female voice)
angry /ˈæŋɡri/ [g] (high frequency, middle position,  syllables, female voice)
frog /frɒɡ/ [g] (high frequency, final position,  syllable, male voice)
colleague /ˈkɒliːɡ/ [g] (high frequency, final position,  syllables, female voice)
X. Condition [ɹ]
rankle /ˈræŋk(ə)l/ [r] (low frequency, initial position,  syllable, female voice)
ribald /ˈrɪb(ə)ld/ [r] (low frequency, initial position,  syllables, male voice)
firm /fɜː(r)m/ [r] (high frequency, middle position,  syllable, male voice)
corn /kɔː(r)n/ [r] (high frequency, middle position,  syllable, female voice)
bicker /ˈbɪkə(r)/ [r] (low frequency, final position,  syllables, male voice)
colour /ˈkʌlə(r)/ [r] (low frequency, final position,  syllables, female voice)
Appendix 3b – Dictation task vowels
Vowels
I. Condition [æ]
ant /ænt/ [æ] (high frequency, initial position,  syllable, female voice)
amber /ˈæmbə(r)/ [æ] (low frequency, initial position,  syllables, female voice)
barren /ˈbærən/ [æ] (low frequency, middle position,  syllables, male voice)
clamour /ˈklæmə(r)/ [æ] (low frequency, middle position,  syllables, male voice)
add /æd/ [æ] (high frequency, initial position,  syllable, male voice)
ankle /ˈæŋk(ə)l/ [æ] (high frequency, initial position,  syllable, female voice)
II. Condition [ɜː]
urge /ɜː(r)dʒ/ [ɜː] (high frequency, initial position,  syllable, female voice)
earn /ɜː(r)n/ [ɜː] (high frequency, initial position,  syllable, male voice)
culvert /ˈkʌlvə(r)t/ [ɜː] (low frequency, middle position,  syllables, male voice)
immerse /ɪˈmɜː(r)s/ [ɜː] (low frequency, middle position,  syllables, female voice)
aver /əˈvɜː(r)/ [ɜː] (low frequency, final position,  syllables, female voice)
stir /stɜː(r)/ [ɜː] (high frequency, final position,  syllable, male voice)
III. Condition [ɔː]
oar /ɔː(r)/ [ɔː] (high frequency, initial position,  syllable, female voice)
almost /ˈɔːlməʊst/ [ɔː] (high frequency, initial position,  syllables, female voice)
adorn /əˈdɔː(r)n/ [ɔː] (low frequency, middle position,  syllables, male voice)
appal /əˈpɔːl/ [ɔː] (low frequency, middle position,  syllables, female voice)
roar /rɔː(r)/ [ɔː] (high frequency, final position,  syllable, male voice)
sore /sɔː(r)/ [ɔː] (high frequency, final position,  syllable, female voice)
IV. Condition [i:]
even /ˈiːv(ə)n/ [iː] (high frequency, initial position,  syllables, male voice)
eel /iːl/ [iː] (high frequency, initial position,  syllable, male voice)
treason /ˈtriːz(ə)n/ [iː] (low frequency, middle position,  syllables, male voice)
recede /rɪˈsiːd/ [iː] (low frequency, middle position,  syllables, female voice)
apogee /ˈæpədʒiː/ [iː] (low frequency, final position,  syllables, female voice)
ski /skiː/ [iː] (high frequency, initial position,  syllable, female voice)
V. Condition [u:]
use /juːz/ [u:] (high frequency, initial position,  syllable, female voice)
union /ˈjuːnjən/ [u:] (high frequency, initial position,  syllables, female voice)
traduce /trəˈdjuːs/ [u:] (low frequency, middle position,  syllables, male voice)
extrude /ɪkˈstruːd/ [u:] (low frequency, middle position,  syllables, male voice)
crew /kruː/ [u:] (high frequency, final position,  syllable, female voice)
lieu /luː/ [u:] (low frequency, final position,  syllable, male voice)
VI. Condition [ɑ:]
arm /ɑː(r)m/ [a:] (high frequency, initial position,  syllable, female voice)
arch /ɑː(r)tʃ/ [a:] (high frequency, initial position,  syllable, female voice)
alarm /əˈlɑː(r)m/ [a:] (high frequency, middle position,  syllables, male voice)
ghastly /ˈɡɑːs(t)li/ [a:] (low frequency, middle position,  syllables, male voice)
ajar /əˈdʒɑː(r)/ [a:] (low frequency, final position,  syllables, male voice)
spar /spɑː(r)/ [a:] (low frequency, final position,  syllable, female voice)
VII. Condition [e]
egg /eɡ/ [e] (high frequency, initial position,  syllable, female voice)
end /end/ [e] (high frequency, initial position,  syllable, male voice)
beget /bɪˈɡet/ [e] (low frequency, middle position,  syllables, male voice)
inept /ɪˈnept/ [e] (low frequency, middle position,  syllables, female voice)
stench /stentʃ/ [e] (low frequency, middle position,  syllable, male voice)
entry /ˈentri/ [e] (high frequency, initial position,  syllables, male voice)
VIII. Condition [ʌ]
utter /ˈʌtə(r)/ [ʌ] (low frequency, initial position,  syllables, female voice)
utmost /ˈʌtməʊst/ [ʌ] (low frequency, initial position,  syllables, female voice)
blood /blʌd/ [ʌ] (high frequency, middle position,  syllable, male voice)
flood /flʌd/ [ʌ] (high frequency, middle position,  syllable, male voice)
sunder /ˈsʌndə(r)/ [ʌ] (low frequency, middle position,  syllables, male voice)
other /ˈʌðə(r)/ [ʌ] (high frequency, initial position,  syllables, female voice)
IX. Condition [ə]
alone /əˈləʊn/ [ə] (high frequency, initial position,  syllables, female voice)
again /əˈɡen/ [ə] (high frequency, initial position,  syllables, male voice)
harangue /həˈræŋ/ [ə] (low frequency, middle position,  syllables, male voice)
raiment /ˈreɪmənt/ [ə] (low frequency, middle position,  syllables, female voice)
comma /ˈkɒmə/ [ə] (high frequency, final position,  syllables, female voice)
swagger /ˈswæɡə(r)/ [ə] (low frequency, final position,  syllables, male voice)
X. Condition [ʊ]
book /bʊk/ [u] (high frequency, middle position,  syllable, female voice)
truce /truːs/ [u] (low frequency, middle position,  syllable, male voice)
bullion /ˈbʊliən/ [u] (low frequency, middle position,  syllables, male voice)
gruel /ˈɡruːəl/ [u] (low frequency, middle position,  syllables, male voice)
should /ʃʊd/ [u] (high frequency, middle position,  syllable, female voice)
hood /hʊd/ [u] (high frequency, middle position,  syllable, female voice)
Pedro Luis Luchini, Cosme Daniel Paz, María Claudia Troglia
L2 accented speech measured
by Argentinian pre-service teachers
Abstract: Five international students from Argentina, Belgium, China, Japan
and Poland recorded a picture narrative in English that was later assessed for
measurements of comprehensibility and accentedness by a group of 22 Span-
ish-L1 Argentinian prospective English language teachers. After the listening
task, the raters completed a complementary activity where they identified the
linguistic factors that, in their view, had either eased or impaired the measurement
task. Results varied across the 5 speech samples due to the wide range of
phonetic-phonological/syntactic-semantic differences brought about by the transfer of
the speakers’ L1 backgrounds to L2 production. To determine the degree of association
between comprehensibility and accentedness, correlation analyses were
conducted. The correlation was significant for the Belgian and Japanese speakers,
but not for the rest of the speakers. Moderation among raters was highly varied
though not statistically significant. Data from the complementary task were
clustered into different linguistic factors: pronunciation, fluency, lexicogram-
mar and speech rate. Frequency analyses revealed that fluency and prosody
emerged as facilitating factors, while sounds and lexicogrammar appeared as
impeding factors. Upon these findings, some suggestions for L2 pronunciation
pedagogy and future research were made.
Keywords: L2 pronunciation teaching/learning, comprehensibility, accentedness,
linguistic factors
Pedro Luis Luchini, Cosme Daniel Paz, María Claudia Troglia, Universidad Nacional de Mar
del Plata
https://doi.org/10.1515/9783110736120-004
1 Introduction
For a long time, the teaching of English pronunciation was neglected in the field
of Applied Linguistics (Lee, Jang, and Plonsky 2014). Today, however, pronunciation
features in numerous academic settings worldwide and in well-known journals,
in which issues related to L2 pronunciation pedagogy, assessment and
research, such as intelligibility, comprehensibility and degree of accentedness, are
discussed (Bøhn and Hansen 2017; Derwing and Munro 2015; Derwing, Munro,
and Wiebe 1998; Munro and Derwing 1995).
Motivated by these investigations, in this study, we aim to explore the extent
to which the English produced by international students – with an intermediate
level of proficiency – affects comprehensibility and degree of accentedness as
measured by a group of 22 L1-Spanish prospective English language teachers. Lis-
teners assessed 5 recordings of picture narratives produced by 5 students from
Argentina, Belgium, China, Japan and Poland, respectively. Using a Likert-type
scale, they indicated degree of comprehensibility and accentedness. After per-
forming the measurement task, the listeners completed a complementary activity
whereby they wrote a brief report about each speaker’s productions, describing
the linguistic factors (pronunciation, fluency, lexicogrammar aspects, and speech
rate) that had facilitated or impaired the completion of the perceptual task.
The first part of the paper introduces the literature review, followed by the
method section, in which context, participants and materials are described.
The next section presents the results along with a general discussion. Finally,
some pedagogical implications for teaching L2 pronunciation are addressed,
and some avenues for future research are delineated.
2 Literature review
For the last twenty-five years or so, as a result of new technological advances and
the spread of worldwide globalization, English has become a lingua franca (Jen-
kins 2000; Seidlhofer 2011; Walker 2010). For non-native speakers, English has
thus become an additional language used for international communication. To
facilitate and safeguard global verbal interaction, L2 speakers and listeners need
to be both intelligible and comprehensible. L2 pronunciation plays an essential
role in communication as it constitutes the scaffolding of L2 speech; therefore, it
must be treated as a priority in language teaching (Levis 2005, 2006).
The prevailing requirement for L2 learners to strive for nativelikeness, which
still affects some pronunciation teaching practices, no longer seems to be a real-
istic goal to achieve successful communication. A more contemporary competing
ideology, however, recognizes that L2 learners’ speech needs to be easily under-
stood for communication to be successful, even if their foreign accents are salient
or very strong. In view of this new competing belief, and to meet this goal, teach-
ing practices need to be aligned with the principle of intelligibility (Levis 2005).
Instruction, then, should focus on those L2 speech aspects that have an effect on
understanding rather than on those that are comparably unproductive for that
matter. This presumption takes on particular relevance in the field of L2 pronunciation
teaching (Derwing, Munro, and Wiebe 1998).
In most current studies, researchers on L2 pronunciation teaching have sup-
ported the intelligibility principle. Comprehensibility is congruous with the instruc-
tional goal of providing students with the necessary tools and resources to become
intelligible speaker-listeners. Comprehensibility is thus pivotal for the achievement
of successful L2 real-world interactions (Derwing and Munro 2009). L2 learners can
preserve their L1 accents as long as they meet the minimum phonological require-
ment for intelligibility and comprehensibility in order to communicate efficiently
(Derwing and Munro 2005; Levis 2005, 2006; Saito 2013). Comprehensibility can be
defined as listeners’ judgment of how easy or difficult it is to understand
a given L2 speech sample (Munro and Derwing 1999). This dimension is a
judgment of level of difficulty and not of how much is understood. Comprehensibility
ratings thus indicate the amount of time or effort listeners need
to process L2 speech, even when what is being said is perfectly understood.
In most comprehensibility experiments, the evaluating listeners are gener-
ally native speakers who use a 9-point numerical scale, ranging from 1 (very easy
to understand) to 9 (impossible to understand) to measure the level of difficulty
in the L2-speaking samples. The construct of comprehensibility is aligned with
the principle of intelligibility in that the focus of instruction must be placed on
making students intelligible, and thus becomes a central pivot to achieve success
in communication (Derwing and Munro 2015; Isaacs and Trofimovich 2012).
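As a rough illustration of how ratings on such a 9-point scale might be summarized per speaker, consider the minimal sketch below. The rating values, the speaker labels and the helper name `summarize` are all invented for illustration; they are not data or procedures from any of the studies cited here.

```python
from statistics import mean, stdev

# Hypothetical ratings on a 9-point scale (1 = very easy to understand,
# 9 = impossible to understand). Values are invented, one list per speaker.
ratings = {
    "speaker_A": [2, 3, 2, 4, 3],
    "speaker_B": [6, 7, 5, 6, 8],
}

def summarize(scores):
    """Return (mean, sample standard deviation) after checking scale bounds."""
    if any(not 1 <= s <= 9 for s in scores):
        raise ValueError("ratings must lie on the 1-9 scale")
    return round(mean(scores), 2), round(stdev(scores), 2)

# One (mean, SD) pair per speaker; a higher mean indicates speech that
# listeners found harder to understand.
summary = {speaker: summarize(scores) for speaker, scores in ratings.items()}
```

A summary of this kind describes perceived difficulty only; it says nothing about how much of the message listeners actually understood.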
Another L2-speaking dimension that deserves attention is accentedness.
Derwing and Munro (2009) define foreign accent as the ways in which L2
speech differs from a (given) local variety of English, and the effects this differ-
ence may have on speakers and listeners. All speakers, whether they are native
or not, have an accent and neither is better or worse than the other. The fact
that foreign accents can be easily detectable, and even highly noticeable, does
not inevitably mean that they should obstruct communication, although on
some sporadic occasions this may happen. In most L2 pronunciation studies,
this L2-speech dimension is assessed by native-speaker judges using Likert-
type scales similar to those used to perform comprehensibility measurements.
Several studies have examined the impact that some linguistic aspects of L2
speech have on perceived comprehensibility and accentedness (Derwing and
Munro 2015). The construct of comprehensibility is consistent with the intelligibil-
ity principle in that comprehensibility contemplates the effort that listeners need
to make to understand L2 speech. Accentedness, however, is more in keeping
with the nativeness principle because it captures listeners’ perception of the extent
to which L2 speech is affected by the speakers’ L1. Although these two L2
speech dimensions are closely related, they are partially independent of
each other. This is why speakers can have heavily accented L2 speech and still be
fully comprehensible (Derwing and Munro 2015). The linguistic factors associated
with comprehensibility are more numerous and more varied than those tied to
accentedness. Accentedness correlates with the appropriate use of segments,
while comprehensibility shows a stronger association with suprasegmental fea-
tures (stress, rhythm and intonation), fluency and lexico-grammatical and discur-
sive aspects (Crowther et al. 2015a, 2015b, 2017; Isaacs and Trofimovich 2012;
Saito, Trofimovich, and Isaacs 2015; Saito et al. 2016a, 2016b).
To date, little research has delineated the qualities of perceived L2 compre-
hensibility and accentedness in ELF contexts (Pickering 2006) in a way that can
inform teaching practices. Additional empirical studies need to be conducted in
this context to measure these constructs and identify those linguistic factors
that can influence non-native listeners’ impressions of L2 accented speech.
More knowledge about these linguistic factors may help L2 teachers determine
which aspects of L2 speech deserve to be taught and which can be left out, enabling
them to set appropriate learning goals. This information may also
help teachers gain a better understanding of how to integrate
pronunciation skills with other linguistic areas such as grammar, lexis and discourse
competence, as well as improve the assessment of L2 speaking skills
(Celce-Murcia et al. 2010; Isaacs 2009; Kennedy and Trofimovich 2010; Saito
and Lyster 2011).
To address this research need, the current study sets out to explore the ex-
tent to which the English produced by international students affects compre-
hensibility and accentedness as measured by L1-Spanish prospective English
language teachers. The innovative nature of this classroom-based study lies in
the fact that comprehensibility and accentedness measurements do not rely
on expert native-speaker listener ratings (Piske, MacKay, and Flege 2001), but
on non-native speaker listeners. Specifically, 22 Spanish-L1 listeners judged a
set of 5 picture narratives, recorded by L2 learners from diverse L1
backgrounds, using a 9-point numerical scale. Raters then wrote evaluative
reports in which they identified and described the linguistic factors that had
either enhanced or obstructed their understanding.
3 Research questions
This study addresses four research questions. The first asks to what extent
English speech produced by international speakers with different accents
influences ratings of comprehensibility and accentedness. The second
and third questions inquire about the degree of association between these two
L2 speech dimensions and rater consistency, respectively. The last question
looks into the linguistic factors that, according to the listeners’ perceptions,
influenced the variables analyzed.
4 Method
4.1 Context and Argentinian participants
Data were collected from 22 Spanish-L1 students from a public university in
Argentina who were in the 2nd year of a TEFL program. The group consisted of
4 men and 18 women, aged 19–25 (SD = 2.01). After giving their consent to take
part in the research, they completed the listening task and the evaluative
report as part of a classroom activity. At the time of completing these tasks,
none reported having had hearing problems.
4.2 International students: Speech samples
Students from Argentina, Belgium, China, Japan and Poland recorded a picture
narrative, sequenced in a series of 8 pictures (Derwing et al. 2009). To avoid
task repetition effects (Bygate 2001; Lambert 2017), each student received a dif-
ferent set of 2 sequenced pictures of the same story. The missing photos were
replaced by blank spaces for the students to recreate their own narratives. They
were allotted 2 minutes of planning time before recording.
At the time of recording, these students were participating in an English as a
Foreign Language study program in St. Albans, England. Prior to data collection,
consent was requested from the school authorities to conduct the experiment.
The students were selected on the basis of their level of linguistic competence
in English (B1, as stipulated by the Common European Framework of Reference for
Languages, henceforth CEFR). As a requirement for entering this school, all
students had taken a placement test administered by the same institution.
4.3 Data collection procedures
Twenty-two student teachers, with a level of linguistic proficiency in English
equivalent to C1 (as stipulated by the CEFR), listened to and scored the international
students’ speech samples. The trainees completed the listening task individually
in a university classroom. They were granted autonomy to listen to the speech
samples as many times as they needed. To determine comprehensibility meas-
urements, listeners judged each recording using a Likert-type scale with a pro-
gression of 1–9, in which 1 corresponded to L2 speech that was very difficult to
understand, while 9 was equivalent to L2 speech that was very easy to under-
stand. To establish degree of accentedness, listeners used the same scale in
which 1 represented very accented speech, while 9 indicated native-like accent.
The complementary task required listeners to write a brief evaluation report
in which they described the linguistic factors that had facilitated or obstructed
the realization of the measurement tasks (comprehensibility & accentedness). In
this complementary task, the students were asked to refer to segmental aspects
(pronunciation of individual vowels and consonants, deletion or addition of
sounds), prosody (word/sentence stress, rhythm and intonation), speech rate
(speakers’ overall pacing and speed of utterance delivery), lexical and grammatical
accuracy (speakers’ choice of words to accomplish the given task; grammatical
aspects in relation to word order, morphology, tense inflections, plurals,
subject/verb agreement, among others) and fluency (flow, continuity, automatic-
ity, or smoothness of speech, often associated with frequency, length and distri-
bution of pauses).
4.4 Analysis
A descriptive analysis of the data was carried out using frequency and spider
graphs. Spearman correlation analyses were performed. A two-way analysis of
variance (ANOVA) was also conducted with a significance level of P = 0.05 for
each of the variables measured, considering raters and speaker backgrounds as
factors. The effect sizes were estimated with Cohen’s d coefficient.
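To illustrate the kind of analysis described above, the following Python sketch computes the F ratios of a two-way ANOVA with one rating per rater-speaker cell, together with Cohen's d based on a pooled standard deviation. The ratings, the function names and the pooled-variance form of Cohen's d are assumptions made for illustration only; the chapter does not state which software was used or how the effect sizes were referenced. P-values would additionally require the F distribution from a statistics package.

```python
from statistics import mean, stdev

# Hypothetical ratings (NOT the study's data): one row per rater,
# one column per speaker background, scores on the 1-9 scale.
hypothetical_ratings = [
    [6, 3, 7],
    [7, 2, 6],
    [6, 4, 8],
    [5, 3, 7],
]


def two_way_anova(ratings):
    """F ratios for speaker background and rater, one score per cell."""
    n_raters, n_speakers = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    rater_means = [mean(row) for row in ratings]
    speaker_means = [mean(row[s] for row in ratings) for s in range(n_speakers)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_speaker = n_raters * sum((m - grand) ** 2 for m in speaker_means)
    ss_rater = n_speakers * sum((m - grand) ** 2 for m in rater_means)
    ss_error = ss_total - ss_speaker - ss_rater

    df_speaker, df_rater = n_speakers - 1, n_raters - 1
    df_error = df_speaker * df_rater
    ms_error = ss_error / df_error
    return {
        "F_speaker": (ss_speaker / df_speaker) / ms_error,
        "F_rater": (ss_rater / df_rater) / ms_error,
        "df": (df_speaker, df_rater, df_error),
    }


def cohens_d(a, b):
    """Cohen's d for two groups of scores, using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


result = two_way_anova(hypothetical_ratings)
first_speaker = [row[0] for row in hypothetical_ratings]
second_speaker = [row[1] for row in hypothetical_ratings]
d = cohens_d(first_speaker, second_speaker)
```

In this layout a large F ratio for speaker background alongside a small F ratio for rater would correspond to the pattern reported in the chapter: ratings that differ across speakers but not systematically across raters.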
5 Results
This section answers the questions raised in this research. We first inquired the
extent to which the variety of accents produced by L2 international speakers influ-
enced the measurements of comprehensibility and accentedness according to the
perception of a group of Argentine listeners. Figure 1 shows the relative frequency
of assessment for comprehensibility and accentedness according to speakers’
backgrounds. Tables 1 & 2 show the statistical analysis for these variables.
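The relative frequencies plotted in a figure of this kind can be obtained by counting how often each scale point was chosen. The short Python sketch below shows one way to do this; the scores and the function name are hypothetical and are not taken from the study.

```python
from collections import Counter


def relative_frequency(scores, scale=range(1, 10)):
    """Percentage of ratings falling on each point of the 1-9 scale."""
    counts = Counter(scores)
    total = len(scores)
    return {point: 100 * counts[point] / total for point in scale}


# Hypothetical comprehensibility scores given by ten raters to one speaker.
hypothetical_scores = [6, 7, 6, 5, 7, 6, 8, 6, 7, 4]
freq = relative_frequency(hypothetical_scores)

# Share of raters who chose a score between 6 and 8.
share_6_to_8 = sum(freq[point] for point in (6, 7, 8))
```

Statements such as "86% of the assessment between 5 and 7" correspond to sums of this kind over a band of adjacent scale points.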
[Figure 1: five bar-graph panels (Argentina, Belgium, China, Japan, Poland);
x-axis: Likert-type scale (1–9); y-axis: relative frequency (%);
legend: accentedness, comprehensibility.]
Figure 1: Bar graph of the relative frequency for comprehensibility (black bars) and
accentedness (gray bars) for the different speaker backgrounds.
The Argentine speaker’s comprehensibility values mainly ranged from 6 to 7,
suggesting that the assessors found the speech sample relatively easy to
understand. This is not surprising if we consider that speaker and listeners
share the same L1 (Spanish). Regarding the accentedness index, the highest
frequencies fell between 4 and 6, revealing that this speaker’s foreign accent
was perceived as moderate. The fact that speaker and assessors shared the same
L1 could have affected perception, thereby resulting in a biased rating.
Concerning the Belgian speaker, the level of comprehensibility ranged from
3 to 8, with the highest frequency at the midpoint of the scale. These
data show that the degree of difficulty in understanding was moderate. With
regard to the degree of accentedness, the range of values was similar to
that assigned to the comprehensibility construct.
The Chinese speaker’s comprehensibility measurements varied between 5
and 8, a higher range than that obtained by the Argentine and Belgian
speakers. Eighty-six percent of listeners attributed a high rating (6–8) to
comprehensibility, indicating that listeners understood this speaker with
relative ease. With respect to the foreign accent index for this speaker, the
assessment range was mainly between 4 and 7. Eighty-one percent of the
evaluators assigned a score between 5 and 7. These results suggest that listeners
did not detect a strong foreign accent in this speech sample.
Regarding the Japanese speaker’s comprehensibility index, the assessment
range fluctuated between 1 and 7, with 77% of the assessment between 2 and 4,
indicating a low perception of comprehensibility. The degree of accentedness
registered a score similar to that assigned to comprehensibility, with 72% of the
score between 1 and 3. These scores indicate that this speaker was perceived as
having highly accented L2 speech.
With regard to the Polish speaker, the comprehensibility ratings varied
between 4 and 8, with 86% of the assessment between 5 and 7. These data suggest
that the listeners did not need to make great cognitive effort to understand what
was being said. The accentedness index ranged from 2 to 7, with a frequency of
86% between grades 4 and 6. In general, this speaker was perceived as having a
moderate degree of foreign accent.
The speakers’ L1 background is a statistically significant factor for the two
variables analyzed (P < 0.01; Tables 1 & 2).
Table 1: Variance analysis for accentedness.
Source of variation   SS   DF   MS   F    p-value
Model                 …    …    …    …    …
L1 background         …    …    …    …    …
Rater                 …    …    …    …    0.3229
Error                 …    …    …
Total                 …    …
Table 2: Variance analysis for comprehensibility.
Source of variation   SS   DF   MS   F    p-value
Model                 …    …    …    …    <…
L1 background         …    …    …    …    <…
Rater                 …    …    …    …    0.2366
Error                 …    …    …
Total                 …    …
Tables 3 and 4 compare the mean accentedness and comprehensibility ratings
for the different nationalities and indicate which differences are significant.
Table 3: Accentedness for the different nationalities. Values are the means ± SE.
The means were tested with two-way analysis of variance (ANOVA) for significant
effects. Different letters indicate significant differences (p < 0.05).
Nationality   Means ± SD   Letter   Effect size
Japan         … ± …        A        …
Argentina     … ± …        AB       …
Belgium       … ± …        B        −…
Poland        … ± …        BC       −…
China         … ± …        C        −…
Table 4: Comprehensibility for the different nationalities. Values are the means ± SE.
The means were tested with two-way analysis of variance (ANOVA) for significant
effects. Different letters indicate significant differences (p < 0.05).
Nationality   Means ± SD   Letter   Effect size
Japan         … ± …        A        …
Belgium       … ± …        B        …
Poland        … ± …        BC       −…
Argentina     … ± …        C        …
China         … ± …        C        −…
5.1 Correlations between comprehensibility and accentedness
The second question was about the degree of association between compre-
hensibility and accentedness of English spoken by 5 international students and
evaluated by a group of Argentine raters. Results of this association are shown
in Figure 2 below.
[Figure 2: five scatter-plot panels (Argentina, Belgium, China, Japan, Poland);
x-axis: comprehensibility (0–10); y-axis: degree of accentedness (0–10).]
Figure 2: Scatter plot of comprehensibility against accentedness for the different
speaker backgrounds. Only significant correlations (P < 0.01) are shown as a solid line.
Correlation analyses between the two variables were significant (P < 0.01)
for the Belgian and Japanese speakers, with a positive linear fit (r²) of 0.56
and 0.59, respectively. However, there was no association or tendency between
the variables studied for the 3 other speakers
(Argentina, Poland and China).
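As a rough illustration of how such an association can be quantified, the Python sketch below computes Spearman's rank correlation (Pearson's correlation applied to average ranks, so that tied scores are handled) and its square. The paired scores and all function names are hypothetical, and the chapter does not state how the reported r² values were derived, so this should be read as one possible procedure rather than the authors' own.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired ratings of one speaker by six raters (NOT study data).
comprehensibility = [2, 3, 3, 4, 5, 6]
accentedness = [1, 2, 3, 3, 4, 6]

rho = spearman(comprehensibility, accentedness)
rho_squared = rho ** 2
```

Because both scales run from 1 (very difficult to understand, very accented) to 9 (very easy to understand, native-like), a positive coefficient means that speech rated as more accented also tended to be rated as harder to understand.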
5.2 Rater consistency
In the third question, we delved into the variability among the raters’ assess-
ment results. Analyses of variance (Tables 1 & 2) do not show a statistically sig-
nificant rater effect for comprehensibility and accentedness, P = 0.2366 and
0.3229, respectively.
5.3 Linguistic factors
Our last question refers to the linguistic factors that, according to the raters’
opinions expressed in the complementary task, facilitated or hindered the reali-
zation of the measurement tasks. The most salient linguistic aspects, deter-
mined by frequency of occurrence, were counted, identified and clustered into
segments (pronunciation of individual vowels, consonants and deletion or ad-
dition of sounds), prosody (stress placement both at word and sentence levels),
rhythm (determined by the succession of stressed and unstressed syllables,
where stressed syllables tend to occur at roughly regular intervals of time),
lexico-grammar (the speaker’s choice of words to accomplish the task set, and
aspects related to word order, sentence structure, morphology, tense inflections,
plurals, agreement, among others), speech rate (speaker’s overall pacing and
speed of utterance delivery) and fluency (flow, continuity, automaticity, or
smoothness of speech, often associated with frequency, length and distribution
of pauses). The information from this analysis is summarized in the spider
graphs shown below.
With reference to the Argentine speaker, listeners identified lexicogrammar
aspects and sounds as factors that obstructed the listening task, with
obstruction rates of 25% and 35%, respectively. By contrast, prosody and
fluency were reported as facilitating task completion, at rates of about 28%
and 56%, respectively.
[Figure 3: spider graph for Argentina; axes: fluency, speech rate, sounds,
prosody, lexico-grammar (radial scale up to 60);
legend: obstructing factors, facilitating factors.]
Figure 3: Spider graph showing facilitating and obstructing factors for comprehensibility for
the Argentine speaker.
Regarding the Belgian speaker, the rate for sounds was 46%, emerging as
the most salient factor that hampered the measurement task. Conversely, listen-
ers rated prosody at 50% as the most noticeable factor that facilitated the listen-
ing task. Fluency, lexicogrammar and speech rate were assigned the same
frequency as both facilitating and obstructing factors. This could be due to
raters’ perceptual differences when judging unfamiliar accented speech.
As for the Chinese speaker, prosody averaged 39% as the main obstructing
factor for the measurement task, followed by sounds with an average of about
21%. By contrast, fluency and sounds were labeled as facilitators, averaging
33% and 29%, respectively. It is worth pointing out that lexicogrammar aspects
were rated as both hindering and enabling components for the completion of
the listening task. This final result could be partly attributed to the natural
variability in raters’ perceptual skills.
Concerning the Japanese speaker, raters identified no linguistic factor that
facilitated the completion of the assessment task. Instead, listeners rated
all factors as obstacles. Sounds were the most obstructing component for task
completion, averaging 30%. Fluency and prosody followed sounds in order of
importance, with an average of 22% each.
[Figure 4: spider graph for Belgium; axis labels: fluency, prosody, sounds
(radial scale up to 60); legend: obstructing factors.]
Figure 4: Spider graph showing facilitating and obstructing factors for comprehensibility
for the Belgian speaker.
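[Figure 5: spider graph for China; axis labels: fluency, prosody, sounds
(radial scale up to 60); legend: obstructing factors.]
Figure 5: Spider graph showing facilitating and obstructing factors for comprehensibility
for the Chinese speaker.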
These results illustrate the role that pronunciation (sounds & prosody) plays in
speech perception and production, and why it should be a concern in foreign lan-
guage classrooms.
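[Figure 6: spider graph for Japan; axis labels: fluency, prosody, sounds
(radial scale up to 60); legend: obstructing factors.]
Figure 6: Spider graph showing facilitating and obstructing factors for comprehensibility
for the Japanese speaker.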
Finally, as to the Polish speaker, listeners rated fluency as the major factor
that obstructed the listening task, at 39%. Lexicogrammar followed fluency in
order of importance with an average of 29%. In contrast, speech rate, sounds
and prosody were reported as enabling factors for the accomplishment of the
listening task, with averages of 18%, 27% and 36%, respectively. These findings
show the contribution of prosodic characteristics
to the perception of foreign accented speech.
In the next section, the four questions initially posed will be critically ana-
lyzed, covering general aspects of the results obtained.
[Figure 7: spider graph for Poland; axis labels: fluency, prosody, sounds
(radial scale up to 60); legend: obstructing factors.]
Figure 7: Spider graph showing facilitating and obstructing factors for comprehensibility for
the Polish speaker.
6 Discussion
For the first question, in general terms, the Argentine, Chinese, Polish and
Belgian speakers were perceived as highly comprehensible, while
the Japanese speaker received the lowest scores. Regarding accentedness, the
Argentine, Belgian and Japanese speakers were perceived as having a strong
foreign accent. The Chinese speaker, however, was perceived as having a weak
foreign accent, while the Polish speaker was rated at a medium level. In all,
the speakers’ L1 background had a statistically significant effect on the two
variables analyzed.
There is little research on the effect of speakers’ L1 on listener ratings of
L2 comprehensibility and accentedness, and the existing studies have revealed
mixed findings. Anderson-Hsieh, Johnson, and Koehler (1992) showed that prosody
was highly correlated with speakers’ L2 assessment scores regardless of their L1
background, while sound deviations were dependent on speakers’ L1. Kang
(2010) demonstrated that Asian learners (China/Japan) had a
stronger foreign L2 accent than other speakers with different L1 backgrounds
(Arabic, Russian, etc.) due to recurrent misuse of emphatic stress. Crowther
et al. (2014) confirmed that the relative association between comprehensibility
and accentedness with linguistic factors varies according to the speakers’ L1. In
their study, ratings of Chinese speakers’ L2 speech were highly influenced
by segmental aspects and those of Hindi speakers by lexico-grammar variables,
while ratings of Farsi speakers showed no correlation with any of the linguistic
factors examined. From our findings, we can conclude that the speakers’ L1s
played a crucial role in determining listeners’ ratings of the L2 speech
dimensions explored. The next research step would be to carry out an
investigation that allows us to identify and explain the nature of this
relationship.
The second question enquired about the correlation between the variables
of comprehensibility and accentedness. Comprehensibility and accentedness
seem to operate independently. Comprehensibility is associated with several
linguistic factors, including prosody, speech rate, lexis and grammatical as-
pects of speech, while accentedness is primarily tied to segmental accuracy and
word stress (Saito, Trofimovich, and Isaacs 2017; Trofimovich and Isaacs 2012).
In the present study, the listeners’ ratings for comprehensibility and
accentedness for the Argentine, Chinese and Polish speakers showed no
association. Although these speakers’ L2 speech was perceived as having a strong
foreign accent, raters considered it fairly comprehensible. However, for the
Belgian and Japanese speakers a positive linear correlation was observed. This
means that L2 speech perceived by raters as having a strong foreign accent also
required greater cognitive effort on their part to be understood.
In the third question, rater consistency was evaluated. Although there was
variation among the raters’ scores, these differences were not statistically sig-
nificant, which means that assessors largely behaved similarly in how they
rated.
The fourth question examined the complementary task in which listeners
identified the linguistic aspects that, in their understanding, had facilitated or
hindered the realization of the measurement tasks. Generally, fluency emerged
as a facilitating factor for the Argentine and Chinese speakers, while it became
a hindrance for the Japanese and Polish speakers. Fluency had the same pro-
portion as both facilitating and impeding factor for the Belgian speaker. For
both the Argentine and Belgian speakers, prosody served as a promoting factor.
However, for the Japanese and Chinese speakers, prosody hindered
understanding. Segmental accuracy, for its part, constituted an adverse factor
for the Argentine, Belgian and Japanese speakers, while for the Chinese and
Polish speakers it facilitated task completion. Speech rate scored similar
results both as a facilitator and an impeding factor for the Argentine, Chinese
and Belgian speakers. This factor was facilitating for the Polish speaker,
while for the Japanese speaker it became an obstacle. Finally, for the Argentine
and Polish speakers lexical-grammatical variables were facilitators, while for
the Belgian and Chinese speakers they facilitated and hindered in the same
proportion.
Common linguistic factors that either facilitated or hindered listeners’ L2
speech perception across the five international speakers were identified. This
was done by adding up, for each linguistic factor, the net contribution¹ of each
of the speakers’ L1 backgrounds. The result of this calculation indicates, in
general terms, that fluency and prosody may have helped listeners perform the
measurement task, while sounds and lexico-grammar may have acted as impedi-
ments. Lastly, speech rate seemed not to have influenced task accomplishment.
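The calculation described in this paragraph can be sketched in a few lines of Python: for each linguistic factor, the net contribution of a speaker is the facilitating frequency minus the obstructing frequency, and these differences are summed across speakers. The percentages, speaker labels and function name below are hypothetical and serve only to show the arithmetic.

```python
# Hypothetical (facilitating %, obstructing %) pairs per factor and speaker.
# These are illustrative values, NOT the frequencies reported in the study.
hypothetical_reports = {
    "Speaker A": {"fluency": (50, 10), "sounds": (10, 35)},
    "Speaker B": {"fluency": (20, 20), "sounds": (30, 45)},
}


def net_contributions(reports):
    """Sum, per factor, of (facilitating - obstructing) across speakers."""
    totals = {}
    for factors in reports.values():
        for factor, (facilitating, obstructing) in factors.items():
            totals[factor] = totals.get(factor, 0) + (facilitating - obstructing)
    return totals


totals = net_contributions(hypothetical_reports)
```

Under this reading, a positive total marks a factor that mostly helped listeners, a negative total marks one that mostly hindered them, and a total near zero suggests no clear influence in either direction.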
Teachers may take full advantage of understanding the influence that partic-
ular linguistic factors have on learners’ comprehensibility and accentedness.
They can, therefore, promote learners’ communicative success by moving beyond
conventional teaching targets and the boundaries of sounds to include
prosody, fluency and lexicogrammar components in their L2 pronunciation
classes. This paradigm shift, paired with an integrative approach to the teaching
of pronunciation, should enhance students’ communication skills. To this end,
integrating fluency with comprehensibility, making grammatical inaccuracies in
L2 speech noticeable to listeners and linking lexical knowledge to understanding
all further communicative competence and spontaneous production abilities. On
balance, pursuing L2 comprehensibility as a learning goal demands a broad-
ranging, holistic approach receptive to variability in the language classroom.
7 Pedagogical implications and limitations

In this section, some limitations to the study will be addressed and some peda-
gogical implications will be analyzed. To begin with, measuring and analyzing 5
international speech samples involved working with a limited range of accents.
A greater number and variety of speech samples from different speakers’ L1 back-
grounds would have allowed the researchers to establish further comparisons.
In the complementary task, the linguistic factors were analyzed according to
how frequently they were mentioned in the listeners’ reports. For future
studies, the linguistic factors to be considered could be delimited at the time
of the evaluation. This would allow for a more exhaustive and in-depth analysis
of their effect on comprehensibility and accentedness.
¹ The net contribution results from the difference between facilitating and obstructing factors’
frequency.
Variability in listener judgement of L2 speech was observed. To improve
rater consistency, assessors could be given a listening placement test before
carrying out the assessment procedure to determine their L2 level of proficiency
in this skill. Such a test would help reduce differences in their ratings.
Listeners could also receive specific training to increase their knowledge
of rating scales and of the consequences of inaccurate scoring, their
familiarity with and attention to the grading criteria used, and their
awareness of the effect that certain linguistic factors may have on the
perception of L2 accented speech.
The current experiment was part of a classroom-based project in which
listeners were given the option to replay the recordings as many times as needed
to complete the task set. Although listeners may have the chance to ask for
clarification whenever they experience some difficulty in understanding their
interlocutors in real-life communication, we are aware that an option to replay
what has been said is not available. In future studies, we suggest setting a limit
on the number of times raters can listen to speech samples.
Our findings indicate that fluency and prosody were relevant factors serving
to enhance perceived comprehensibility (Derwing, Munro, and Thomson
2004; Derwing et al. 2009). Therefore, these factors should be given high pri-
ority in the L2 pronunciation class. On the other hand, inaccurate segments
and lexico-grammatical errors played an adverse role in the perception of L2
accented speech, thus calling for critical teacher intervention. Fundamental
attention must target the teaching of these aspects in order to help students
improve comprehensibility. Listener perception of L2 accented speech seems
to be the consequence of a composite amalgamation of many pronunciation,
fluency and lexico-grammatical factors (Saito, Trofimovich, and Isaacs 2015;
Crowther et al. 2017). Interestingly, raters seem to highlight prosody (Rossiter
et al. 2010) and fluency regardless of speaker background and also assign rel-
ative weights to sounds and lexico-grammatical variables. This is consistent
with the idea that the rating of accent primarily requires listeners to focus on
these linguistic factors which have different degrees of intensity before at-
tending to other aspects.
Although the L2 pronunciation research agenda consistently lays special
emphasis on comprehensibility as a more realistic goal for ensuring communi-
cative success, compared to accent reduction or nativelikeness (Derwing and
Munro 2009; Levis 2005; Saito, Trofimovich, and Isaacs 2015), teachers of
English constitute a very special case. In the particular context of this experiment,
the participants were pre-service non-native English teachers; therefore, their
L2 speech must unquestionably be understandable. However, they should also
have a clear and acceptable accent, particularly if they are dealing with young
learners, because they will be expected to serve as L2 speech models and a
source of input in English for them. A foreign accent may also expose them to
social or professional discrimination from other non-native and native teachers
alike (Derwing, Rossiter, and Munro 2002). To avoid this, non-native teachers
need to reduce their foreign accents.
The most common task used to elicit L2 speech for measurements of com-
prehensibility and accentedness has been a picture narrative. Nearly all data
that show the linguistic interdependence between these two L2 speech dimen-
sions emerge from this single task type. Few studies have examined this phe-
nomenon across different task types (Crowther et al. 2015a, 2015b, 2017), and
based on their results, task-type effects should not be ignored (Skehan 2009;
Skehan and Foster 1997). It would thus be interesting to conduct similar studies
to the present one using different task types and compare results.
8 Conclusion
This study allowed us, on the one hand, to inquire about the influence of L2
accented speech on the attribution of comprehensibility and accentedness. On
the other hand, we were also able to distinguish the linguistic factors that had
an impact on the L2 speech perception of Argentine listeners. Each
international speaker’s L1 clearly affected the assessment
results of comprehensibility and accentedness. A direct correlation between
these variables was observed in the Belgian and Japanese speakers, but
not in the rest. Among the linguistic factors that influenced the Argentine lis-
teners’ L2 speech perception, fluency and prosody proved, in general, to have
helped them complete the measurement task successfully, while sounds and
lexicogrammar emerged as obstructing factors. These findings may shed light
on new pedagogical paradigms for teaching L2 pronunciation in diverse con-
texts, including ELF. It would certainly be valuable for other teachers and re-
searchers to replicate and expand on this study, incorporating speakers and
listeners from different L1 backgrounds. Cross-comparisons of this kind may
help elucidate relevant L2 pronunciation aspects and features that
should be a focus in pronunciation classrooms, so that teachers can help
learners become more efficient in their L2 production.
References
Anderson-Hsieh, Janet, Ruth Johnson & Kenneth Koehler. 1992. The relationship between
native speaker judgments of nonnative pronunciation and deviance in segmentals,
prosody, and syllable structure. Language Learning 42(4). 529–555.
doi:10.1111/j.1467-1770.1992.tb01043.x
Bøhn, Henrik & Thomas Hansen. 2017. Assessing pronunciation in an EFL context: Teachers’
orientations towards nativeness and intelligibility. Language Assessment Quarterly 14(1).
54–68. doi: 10.1080/15434303.2016.1256407
Bygate, Martin. 2001. Effects of task repetition on the structure and control of oral language.
In Martin Bygate, Peter Skehan & Merrill Swain (eds.), Researching Pedagogic
Tasks: Second Language Learning, Teaching and Testing, 23–48. London: Pearson
Education Limited.
Celce-Murcia, Marianne, Donna Brinton, Janet Goodwin & Barry Griner. 2010. Teaching
Press.
Crowther, Dustin, Pavel Trofimovich, Talia Isaacs & Kazuya Saito. 2015a. Does a speaking
task affect second language comprehensibility? The Modern Language Journal 99(1).
80–95. doi:10.1111/modl.12185
Crowther, Dustin, Pavel Trofimovich, Kazuya Saito & Talia Isaacs. 2014. Second language
comprehensibility revisited: Investigating the effects of learner background. TESOL
Quarterly 49(4). 814–837.
Crowther, Dustin, Pavel Trofimovich, Kazuya Saito & Talia Isaacs. 2015b. Second language
comprehensibility revisited: Investigating the effects of learner background. TESOL
Quarterly 49(4). 814–837.
Crowther, Dustin, Pavel Trofimovich, Kazuya Saito & Talia Isaacs. 2017. Linguistic dimensions
of L2 accentedness and comprehensibility vary across speaking tasks. Studies in Second
Language Acquisition 40(2). 443–457. doi:10.1017/S027226311700016X
Derwing, Tracey & Murray Munro. 2005. Second language accent and pronunciation teaching:
A research-based approach. TESOL Quarterly 39(3). 379–397. doi:10.2307/3588486
Derwing, Tracey & Murray Munro. 2009. Putting accent in its place: Rethinking obstacles to
communication. Language Teaching 42(4). 476–490.
Derwing, Tracey & Murray Munro. 2015. Pronunciation Fundamentals: Evidence-based
Derwing, Tracey, Murray Munro & Ron Thomson. 2004. Second language fluency: Judgments on
different tasks. Language Learning 54(4). 655–679.
Derwing, Tracey, Murray Munro, Ron Thomson & Marian Rossiter. 2009. The relationship
between L1 fluency and L2 fluency development. Studies in Second Language Acquisition
31(4). 533–557.
Derwing, Tracey, Murray Munro & Grace Wiebe. 1998. Evidence in favor of a broad framework
for pronunciation instruction. Language Learning 48(3). 393–410.
Derwing, Tracey, Marian Rossiter & Murray Munro. 2002. Teaching native speakers to listen to
foreign-accented speech. Journal of Multilingual and Multicultural Development 23(4).
245–259.
Isaacs, Talia. 2009. Integrating form and meaning in L2 pronunciation instruction. TESL
Canada Journal 27(1). 1–12.
Isaacs, Talia & Pavel Trofimovich. 2012. Deconstructing comprehensibility: Identifying the
linguistic influences on listeners’ L2 comprehensibility ratings. Studies in Second
Language Acquisition 34(3). 475–505. doi:10.1017/S0272263112000150
Jenkins, Jennifer. 2000. The Phonology of English as an International Language. Oxford:
Oxford University Press.
Kang, Okim. 2010. Relative salience of suprasegmental features on judgments of L2
comprehensibility and accentedness. System 38(2). 301–315.
doi:10.1016/j.system.2010.01.005
Kennedy, Sara & Pavel Trofimovich. 2010. Language awareness and second language
pronunciation: A classroom study. Language Awareness 19(3). 171–185.
Lambert, Craig. 2017. Tasks, affect and second language performance. Language Teaching
Research 21(6). 657–664. doi:10.1177/1362168817736644
Lee, Junkyu, Juhyun Jang & Luke Plonsky. 2014. The effectiveness of second language
pronunciation instruction: A meta-analysis. Applied Linguistics 36(3). 345–366.
doi:10.1093/applin/amu040
Levis, John. 2005. Changing contexts and shifting paradigms in pronunciation teaching.
Levis, John. 2006. Pronunciation and the assessment of spoken language. In Rebecca Hughes
(ed.), Spoken English, TESOL and Applied Linguistics, 245–270. London: Palgrave
Macmillan. doi:10.1057/9780230584587_11
Munro, Murray & Tracey Derwing. 1995. Foreign accent, comprehensibility, and intelligibility
in the speech of second language learners. Language Learning 45(1). 73–97.
doi:10.1111/j.1467-1770.1995.tb00963.x
Munro, Murray & Tracey Derwing. 1999. Foreign accent, comprehensibility, and intelligibility
in the speech of second language learners. Language Learning 49(1). 285–310.
doi:10.1111/0023-8333.49.s1.8
Pickering, Lucy. 2006. Current research on intelligibility in English as a Lingua Franca. Annual
Review of Applied Linguistics 26. 219–233. doi:10.1017/S0267190506000110
Piske, Thorsten, Ian MacKay & James E. Flege. 2001. Factors affecting degree of foreign accent
in an L2: A review. Journal of Phonetics 29. 191–215. doi:10.1006/jpho.2001.0134
Rossiter, Marian, Tracey Derwing, Linda Manimtim & Ron Thomson. 2010. Oral fluency: The
neglected component in the communicative language classroom. The Canadian Modern
Language Review 66(4). 583–606.
Saito, Kazuya. 2013. Effects of instruction on L2 pronunciation development: A Synthesis of 15
Saito, Kazuya & Roy Lyster. 2011. Effects of form-focused instruction and corrective feedback
on L2 pronunciation development of /ɹ/ by Japanese learners of English. Language
Learning 62(2). 595–633. doi:10.1111/j.1467-9922.2011.00639.x
Saito, Kazuya, Pavel Trofimovich & Talia Isaacs. 2015. Second language speech production:
Investigating linguistic correlates of comprehensibility and accentedness for learners at
different ability levels. Applied Psycholinguistics 37(2). 217–240. doi:10.1017/
S0142716414000502
Saito, Kazuya, Pavel Trofimovich & Talia Isaacs. 2017. Using listener judgments to investigate
linguistic influences on L2 comprehensibility and accentedness: A validation and
generalization study. Applied Linguistics 38(4). 439–462. doi:10.1093/applin/amv047
Saito, Kazuya, Stuart Webb, Pavel Trofimovich & Talia Isaacs. 2016a. Lexical profiles of
comprehensible second language speech. Studies in Second Language Acquisition 38(4).
677–701. doi:10.1017/S0272263115000297
Saito, Kazuya, Stuart Webb, Pavel Trofimovich & Talia Isaacs. 2016b. Lexical correlates of
comprehensibility versus accentedness in second language speech. Bilingualism:
Language and Cognition 19(3). 597–609. doi:10.1017/S1366728915000255
Seidlhofer, Barbara. 2011. Understanding English as a Lingua Franca. Oxford: Oxford
University Press.
Skehan, Peter. 2009. Modelling second language performance: Integrating complexity,
accuracy, fluency and lexis. Applied Linguistics 30(4). 510–532.
Skehan, Peter & Pauline Foster. 1997. Task type and task processing conditions as influences
on foreign language performance. Language Teaching Research 1. 185–211.
Trofimovich, Pavel & Talia Isaacs. 2012. Disentangling accent from comprehensibility.
Bilingualism: Language and Cognition 15(4). 905–916. doi:10.1017/S1366728912000168
Walker, Robin. 2010. Teaching the Pronunciation of English as a Lingua Franca. Oxford: Oxford
University Press.
Jeniffer Imaregna Alcantara de Albuquerque,
Ubiratã Kickhöfel Alves
Dynamic paths of intelligibility
and comprehensibility: Implications
for pronunciation teaching from
a longitudinal study with Haitian learners
of Brazilian Portuguese
Abstract: An agenda of studies has shed light on pronunciation phenomena through the lens
of intelligibility and comprehensibility (Derwing and Munro 2015; Munro and Derwing 1995),
treated as complex, dynamic and multimodal constructs (Albuquerque 2019; Nagle, Trofimovich,
and Bergeron 2019; Nagle et al. 2021; Zielinski and Pryor 2020). This chapter presents the
results of a 12-point longitudinal data collection conducted with three Haitian speakers (all
of them with different lengths of residence in Brazil and different proficiency levels in
Portuguese) as heard by two Brazilian listeners (with different levels of experience in second
languages and different degrees of contact with foreigners), and discusses intelligibility and
comprehensibility in the speaker-listener binomial relationship. The study included an oral
repetition task (aiming to obtain the listeners’ oral comprehension of the speakers’
productions) and a comprehensibility task (with a 9-point Likert scale). Results indicate
individual differences between listener-speaker relationships, as variability may lead to
learning (Lowie and Verspoor 2019). Intelligibility and comprehensibility results reveal an
influence of the participants’ personal profiles, i.e., contact with foreigners (for the
listeners), and formal versus informal language learning and amount of time in an immersion
context (for the speakers). Both constructs varied across the binomial relationships, and
they seemed connected both to the speakers’ improvement in lexical complexity and
pronunciation and to the listeners’ ability to accommodate new data from the speakers’
productions. Our general findings suggest benefits of a binomial listener-speaker pairing
design in the analysis of intelligibility and comprehensibility, with important implications
for studies on language development and L2 pronunciation teaching.
Keywords: intelligibility, comprehensibility, Portuguese as a Second Language,
Complex Adaptive Systems
Acknowledgements: The longitudinal study from which our data is drawn was partly funded by
the Brazilian government (CAPES and CNPq funding agencies). We are deeply grateful to the
participants in our data collections.
Jeniffer Imaregna Alcantara de Albuquerque, Technology Federal University of Paraná
Ubiratã Kickhöfel Alves, Federal University of Rio Grande do Sul
https://doi.org/10.1515/9783110736120-005
1 Introduction
Although there seems to be no one-size-fits-all view regarding pronunciation
teaching (Levis 2020), one of the most prominent research agendas since the
late 1980s has been the one pursued by Tracey Derwing and Murray Munro in their
discussions of intelligibility and comprehensibility. A wealth of previous research
by these authors and their collaborators has investigated several variables and
contexts underlying these constructs: listener judgments and their connection to
pronunciation changes (Derwing, Munro, and Wiebe 1998); the distinction between
the constructs of intelligibility, comprehensibility and accentedness (Derwing and
Munro 1997); a closer analysis of comprehensibility judgments and their specific
features (Isaacs and Trofimovich 2012); the influence of methodological features
on speech assessment (O’Brien 2014); and pedagogical aspects of intelligibility
(Derwing and Munro 2015; Levis 2020). Among these contributions, there are fewer
works on longitudinal and dynamic analyses of speech rating (Albuquerque and Alves
2020; Derwing and Munro 2013; Nagle, Trofimovich, and Bergeron 2019; Nagle et al.
2021), which is our major focus in this chapter.
To cope with this dynamic view of language, many studies see development as a
scenario of constant change rather than a one-point-in-time picture (De Bot 2017;
Larsen-Freeman 2015; Lowie 2017; Lowie and Verspoor 2019; Verspoor et al. 2011;
Verspoor, Lowie, and Van Dijk 2008). Yet dynamic studies on intelligibility and
comprehensibility do vary in their notion of “time” and timescales, operationalizing
it either as a real-time measurement through multiple clicks (Nagle, Trofimovich,
and Bergeron 2019; Nagle et al. 2021) or as a change in an L2 learner’s trajectory
over months or years (Albuquerque 2019; Albuquerque and Alves 2020; Zielinski and
Pryor 2020). In addition, another important premise of Complex Dynamic Systems
Theory (CDST) is that variability is part of the system’s changes and a potential
force towards learning (Lowie and Verspoor 2019; Van Geert and Van Dijk 2002). As
Larsen-Freeman (2020: 295) argued, variability should not be set aside in language
teaching and learning theories; instead, it should be considered an “indispensable
source of information”, since it may lead to new learning processes.
It is by taking both speaker and listener as part of a binomial relationship and
assuming intelligibility and comprehensibility to be complex and dynamic concepts
that we present our longitudinal study. Unlike most of the available research, we
investigate both constructs in a longitudinal analysis of a non-mainstream language
(Brazilian Portuguese) being developed by Haitian speakers. Thus, following a
dynamic account, our main goal is to investigate possible variability patterns in
the intelligibility and comprehensibility of the productions of Haitian learners of
Brazilian Portuguese, as judged by Brazilian participants.
This chapter begins by presenting intelligibility and comprehensibility as
complex and dynamic concepts, aligned with the process of language development.
This is followed by some important principles and concepts of Complex Dynamic
Systems Theory, especially with regard to variability and how it can serve as the
basis for a deeper discussion of a binomial speaker-listener relationship. The
chapter then presents the methodological design and the research results. Finally,
it offers some final considerations and some important implications for studies on
language development and L2 pronunciation teaching.
2 Background literature
2.1 L2 oral intelligibility and comprehensibility:
Contingency in the migration processes
Brazil, especially the south of the country, has received a great number of refugees
who faced natural disasters and war in their homelands. According to the United
Nations High Commissioner for Refugees (UNHCR) 2019 report, 79.5 million people were
forcibly displaced from their countries, seeking international protection. Brazil
ranks fifth in the world in the number of asylum-seekers received, providing around
260 thousand people with temporary or long-term asylum (UNHCR 2020).
In this scenario, an unprecedented migration process from Haiti took place
after the 2010 earthquake that devastated the country. Learning Brazilian
Portuguese therefore became the most urgent demand for those migrants. According to
Cadely (2012 apud Silva 2015), Haitians speak Haitian Creole and about 10% of
the population speak French (those who were able to receive formal instruction).
Also, some speak and understand a little Spanish (due to geographical influence,
since the country is surrounded by Spanish-speaking countries). As fascinating as
this language mixture may sound (especially when assuming a complex dynamic system
perspective), this new teaching context raised many questions. However, not many
studies have conducted a deeper investigation of oral production and comprehension
in these communities.
In the last five years, some studies have focused on the difficulties in oral
production and comprehension which emerge from the linguistic contact between
Haitians and Brazilians (Albuquerque and Alves 2017, 2020; Machry Da Silva 2017;
Silva 2015) and on comprehension strategies which may help Haitian speakers to
sound more intelligible. In line with these works, the present study aims to fill
an L2 diversity gap (since most studies concern English as an L2 and few focus on
contributions from other languages), as well as to provide input for further
studies on the interaction between Haitians and Brazilians.
The data discussed in this chapter may help pronunciation studies to fill
at least three important gaps. The first is connected to the fact that most
works focus on developmental data of English as an L2, i.e., on the oral production
of non-native speakers of English, whether basic or advanced learners, in
perception or intelligibility and comprehensibility studies whose judges are
usually native listeners. By analyzing data from Brazilian Portuguese, we may
broaden the scope of linguistic and sociolinguistic findings to lesser-researched
languages.
The second gap is related to the common assumption that learners can usually
achieve higher levels of proficiency in an L2 if they are in an immersion
context, i.e., developing the language by living in a country where this language
is used as a native one. However, it is important to situate the Haitian
learners in the Brazilian context. According to Norton (2013), when learning a
language, it is important that learners have access to both symbolic and material
resources, the first connected to cultural aspects and the second to material
benefits that may emerge from learning the language, such as getting a job, good
housing, etc. Just living in the country, unfortunately, does not guarantee
complete immersion in the language or access to these symbolic and material
resources. A report by the UNHCR (2020) indicates that even after living for more
than one year in Brazil, a great number of Haitians neither feel fully immersed in
the country nor are able to achieve a solid basic level in Brazilian Portuguese
(Albuquerque 2019).
Last, but not least, the third gap, and maybe the key one in our investigation,
is connected to the lack of studies that focus on Haitian learners’ personal
trajectories. By taking into account individual development over time, we also
assume that language, as well as associated constructs such as intelligibility and
comprehensibility, are complex, dynamic systems. Therefore, by adopting a
CDST approach, one can propose major implications for both language development
and L2 pronunciation teaching.
2.2 Intelligibility and comprehensibility: Findings and gaps
Throughout the last 35 years, intelligibility and comprehensibility have been
consistently discussed as key issues in pronunciation teaching and learning.
Although we can trace one of the first definitions of the constructs back to
Abercrombie (1949), who had already raised the need to stop aiming for accuracy in
L2 learners’ production and to aim instead at what he called “a comfortable and
intelligible pronunciation”, the more solid and almost uninterrupted research
agenda comes from Tracey Derwing, Murray Munro and colleagues.
Munro and Derwing (2015: 14) define intelligibility as the “extent to which
listeners’ perceptions match speakers’ intentions (actual understanding)”, and
comprehensibility as the “perceived degree of difficulty experienced by the listener
in understanding speech”. As the authors mention, intelligibility and
comprehensibility are not completely distinct constructs, but partially
interconnected ones, since they refer to different dimensions and components (more
objective and/or subjective strategies of oral comprehension) of information
retrieval. Research methods usually operationalize intelligibility through dictation
tasks (orthographic transcription), comprehension questions, and true/false
sentences, among other methods, with transcription tasks being the most frequent
approach (Munro and Derwing 2015). Comprehensibility, in turn, is generally measured
with a Likert scale (with scales varying from 1 to 5 or 1 to 9).
Alongside a wealth of contributions, Derwing, Munro and colleagues have also
pointed out some theoretical and methodological gaps which need further
investigation. In a special issue dedicated to the authors’ great contribution to
the field of Applied Linguistics, Munro and Derwing (2020) had the opportunity to
discuss the foundations of some of the concepts and the methodological choices of
the original 1995 work. Munro and Derwing (2020) explain the use of the term
“perceived comprehensibility” as an insertion demanded by the paper’s reviewer at
the time, something which they later regretted. In addition, in this chapter we
raise another important concern, connected to the term “understanding” and to what
it really stands for when considering at least two main aspects: (i) the language
conception underlying the construct or its form of measurement; (ii) advances in
cognitive research which relate linguistic findings to deeper meaning networks.
Some issues also remain unsettled regarding the empirical operationalization of
these constructs. Transcription as a way of measuring intelligibility has received
positive recognition for the practical advantage of compiling a great number of
responses in a short time span (Derwing and Munro 2015), but it has also received
some criticism (Alves, Albuquerque, and Bondaruk 2021; Kang, Thomson, and Moran
2018; Munro and Derwing 2020; Zielinski 2006). Zielinski (2006) comments that when
participants transcribe something wrongly, the mistake may be connected not only
to speech production issues, but also to broader cognitive functions related to
memory and orthographic processing, for example. Munro and Derwing (2020)
acknowledge this argument and add some questions about memory load, which Kang,
Thomson, and Moran (2018) had previously referred to, explaining that a
transcription task could increase working memory load, with a potential impact on
the results. Aiming to reflect on the role of transcription in intelligibility
studies, and following Albuquerque (2019), Alves, Albuquerque, and Bondaruk (2021)
and De Weers (2020), we propose that the construct could be operationalized in a
way that promotes a more active reply from participants, by allowing them to
recover either fine detail or more general information (i.e., individual sounds,
groups of sounds, semantic content) through an oral repetition task. This method of
data collection was employed in the present study.
Notwithstanding the gaps discussed above (which are intrinsic to any observed
phenomenon), there is a growing discussion concerning both intelligibility and
comprehensibility as dynamic processes. Although the term ‘dynamic’ did not appear
in Derwing and Munro’s first works, in many of the investigations (and related
studies) that followed the seminal contribution of 1995 the authors shed light on
the dynamic aspects of oral interaction, e.g., on listeners’ variability in
judgements and its relation to both intelligibility and comprehensibility ratings;
the dependence of speakers’ intelligibility on their life trajectories and on the
multiple variables that have influenced their language development; and the
interdependence of listener and speaker in an interaction. This motivates us to
pursue a dynamic account of intelligibility and comprehensibility.
2.3 Intelligibility and comprehensibility through
a dynamic lens
A dynamic view applied to oral production and perception studies may have
different outcomes and be organized under slightly different premises. Recent works
such as Nagle, Trofimovich, and Bergeron (2019) and Nagle et al. (2021) provide some
interesting results for comprehensibility by adopting a dynamic perspective.
As previously mentioned in this chapter, in a dynamic approach language
development must be understood as change in time. The notion of time
adopted by the above-mentioned works is usually obtained through a longitudinal
study, where comprehensibility is assumed as an action-centered construct. Nagle,
Trofimovich, and Bergeron (2019) and Nagle et al. (2021) assessed comprehensibility
by capturing how it changes in real time through clicks, i.e., instead of measuring
the construct only once, the authors extracted several measurements in one session,
using a timescale of 2–5 minute intervals (in the 2019 study) and of 2–3 minute
intervals with 100-millimeter scales, obtaining seven ratings per interlocutor in a
17-minute task interaction (in the 2021 study). The general results of the 2019
study pointed to a great deal of individual variability and to the fact that clips
that received lower grades would frequently receive lower global ratings. In turn,
Nagle et al. (2021) showed not only a U-shaped function of comprehensibility ratings
throughout time (i.e., beginning with high levels and also finishing with high
ones), but also a ‘pairability’ among interlocutors’ ratings, in the sense that
their evaluations became quite similar over time.
The findings reported by Nagle, Trofimovich, and Bergeron (2019) and
Nagle et al. (2021) pave a more organic path towards comprehensibility. How-
ever, some questions concerning how individual variability plays a role still re-
main, since the overall results of these studies focus more on revealing group
tendencies and group alignment.
In addition, another study focusing on a dynamic view of comprehensibility
is Zielinski and Pryor (2020), an exploratory investigation of everyday English
use in individual trajectories over time. The study was conducted over a 10-month
timescale with 14 L2 English learners (8 beginners and 6 intermediate), who were
interviewed four times during this period. Besides taking into account the
comprehensibility of beginners, who are not commonly investigated, the study
highlights learners’ non-linear trajectories of English use, showcasing the
importance of individual variability. Our study is aligned with this investigation,
since we understand that in order to analyze ‘change’, whether in L2
comprehensibility or intelligibility, individual variability must be accounted for.
These issues considered, the current study sees both intelligibility and
comprehensibility as imbricated in a comprehension gradient, in which there are
stages of more macro or micro tuning of the association among different subsystems
(e.g., phonic, lexical, syntactic, semantic) and a constant process of cognitive
accommodation (Alves, Albuquerque, and Bondaruk 2021). It is important to state
that this recognition or tuning process does not follow a linear order. We stress
the need for insights into the process of intelligibility and comprehensibility
development and into how it changes over time for the listener-speaker binomial
relationship.
3 Study design and research question

Seeking to contribute to recent studies that see both intelligibility and
comprehensibility as complex and dynamic constructs, we present our study with three
Haitian speakers (all of them with different lengths of residence in Brazil and
different proficiency levels in Portuguese) and three Brazilian listeners
(with different experiences with non-native languages and different degrees of
contact with foreigners). We intend to show six different binomial trajectories
based on the combinations between speakers and listeners, so as to indicate that
developmental paths of intelligibility and comprehensibility work at a binomial
level. To do so, we explore possible interactions between complex characteristics
of both the speaker and the listener.
We are particularly interested in observing how the inter-variability in
speaker-listener personal trajectories develops over time, showing possible
patterns that may emerge from this binomial relationship. This question derives
from what we understand by the participants’ development process, which, as
conceived in this study, may only take place in time and be analyzed at the level
of the individual (speaker-listener) binomial. Although traditional inferential
hypotheses are not usually set in a dynamic approach, we can predict that the
development patterns of intelligibility and comprehensibility will be nonlinear,
exhibiting characteristics of both the speaker and the listener in interaction.
In order to organize these speaker-listener binomials, we selected three speakers
(1, 2 and 3) and three listeners (A, B and C) from a larger pool of participants.
The listeners were selected following the correlation criteria of Baba and Nitta
(2014) and Yu and Lowie (2020), i.e., we ran correlation tests between the week of
data collection and the mean comprehensibility value of each week for each one of
the 13 listeners (from Albuquerque 2019). The three participants who showed the
highest and the lowest correlations were selected, and they were compared to
examine whether the initial state influenced their speech intelligibility and
comprehensibility. A higher correlation indicates that a participant’s oral
intelligibility and comprehensibility seem to have changed more than those of
the other listeners. For intelligibility, the highest correlations were those of
participant A (r = 0.719, p < 0.01) and participant C (r = 0.701, p < 0.02), and the
lowest was participant B’s (r = −0.137, p < 0.01). As for comprehensibility, the
highest correlations were those of participant A (r = 0.634, p < 0.01) and
participant C (r = 0.622, p < 0.02), and the lowest was participant B’s (r = −0.042,
p < 0.01). Therefore, the three selected sets of binomials were: (i) S1-LA, S1-LB,
S1-LC; (ii) S2-LA, S2-LB, S2-LC; (iii) S3-LA, S3-LB, S3-LC. All participants had
signed a consent form for Albuquerque’s (2019) study, according to the norms of the
ethics committees in Brazil.
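The correlation criterion described above can be illustrated with a minimal sketch. It assumes hypothetical mean ratings for a single listener across the 12 data points; the values and the function name `pearson_r` are illustrative only and are not taken from the study, which does not report the software used for this step.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: 12 collection points and one listener's mean
# comprehensibility ratings (9-point scale) at each point.
data_points = list(range(1, 13))
mean_ratings = [4.2, 4.8, 4.5, 5.1, 5.0, 5.9, 5.6, 6.3, 6.1, 6.8, 7.0, 7.2]

r = pearson_r(data_points, mean_ratings)
print(f"r = {r:.3f}")
```

Under this reading, a listener whose ratings drift steadily over the data points yields a correlation close to 1 or −1, whereas a listener whose ratings fluctuate without a trend yields a value close to 0.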
4 Methods
4.1 Participants
Due to space limitations, and as our focus is on the binomial relationships
between speakers and listeners, a sample of six participants (three Haitian
speakers and three Brazilian listeners) was selected from Albuquerque (2019),
as mentioned in the previous section.
In a CDST approach, discussing the participants’ profiles is essential and
works along with more descriptive and inferential results. Hence, Tables 1 and 2
present some aspects of their personal profiles.
Table 1: Haitian speakers’ profiles.
Speakers | S1 | S2 | S3
Age |  |  | 
Gender | Female | Male | Female
L1 | Haitian Creole | Haitian Creole | Haitian Creole
L2 | French | French | French
L3 | Portuguese | Portuguese | Portuguese
Formal training of Portuguese in hours at the beginning of the research (November/2018) |  h (Basic ) |  h (Basic ) |  h (Basic )
Formal training of Portuguese in hours at the end of the research (April/2019) |  h (Basic ) |  h (Basic ) |  h (Pre-Intermediate)
Time in Brazil at the beginning of the research (November/2018) |  months |  months |  months
Time in Brazil at the end of the research (April/2019) |  months |  year and  months |  year
Contact with Portuguese | Only in Portuguese classes | At the Portuguese classes; at work; with some Brazilian friends; social events with Haitian and Brazilian friends | At the Portuguese classes; small everyday interactions (e.g. shopping for groceries, going to the bank, etc.)
Source: Adapted from Albuquerque (2019).
Table 2: Selected Brazilian listeners’ profiles.
Listeners | LA | LB | LC
Age |  |  | 
Gender | Female | Male | Male
L1 | Brazilian Portuguese | Brazilian Portuguese | Brazilian Portuguese
L2 | Advanced English | Advanced English | Advanced English
L3 | Basic French | Basic German | No other language knowledge
Contact with foreigners (speakers of other languages) | No contact | Yes (monthly) | Yes (weekly)
Experience with teaching foreign languages | Yes ( years) | Yes ( years) | No experience
Source: Adapted from Albuquerque (2019).
We highlight that all of the speakers had different lengths of residence in
Brazil and different proficiency levels in Portuguese. As for the listeners, an
essential aspect to stress is that they had distinct opportunities for contact
with foreigners speaking L2 Portuguese, as well as different teaching experiences
and different degrees of contact with non-native languages other than English.
4.2 Instruments and measurements
Intelligibility was operationalized through an oral sentence repetition task, in
which listeners would listen to the speakers’ sentences and, right afterwards,
repeat what they could understand of the production. We chose this
operationalization due to implications from previous studies: (i) as a task, oral
production does not demand great effort or a heavy working memory load (Kang,
Thomson, and Moran 2018) when compared to the traditional transcription task;
(ii) Zielinski (2006) criticizes transcription tasks, since they involve more
direct access to orthographic knowledge, which may not reveal what listeners
understood; (iii) the oral production task may be closer to a real-life task
(taking into account common oral exchanges between speaker and listener); (iv) an
oral production task allows listeners to reply with the content they could actually
recover from the sentence heard, whether a sound, a group of sounds, full words or
the whole idea as it was semantically conveyed. In addition, comprehensibility was
measured with a 9-point Likert scale, as in Derwing and Munro (2015), with ‘1’
indicating “very difficult to understand” and ‘9’, “very easy to understand”.
The sentences were obtained from WhatsApp conversations between the
first author and the Haitian speakers. The conversation themes were freely
generated following two major topics: a) description of their daily activities as
well as personal information; b) oral prompts selected from the Celpe-Bras
(official Certificate of Proficiency in Portuguese for Foreigners) archive.¹
Complete sentences were taken from the conversations, with a maximum limit of
8 words (respecting the attention span proposed by Sternberg and Sternberg 2012).
4.3 Data collection procedures
For both speakers and listeners, the study used a time window of 6 months,
with data collected every 15 days, which resulted in a total of 12 data points in
time. The data collection points for speakers and listeners can be seen in Figures 1
and 2 and were based on Yu and Lowie’s (2020) layout.
As for recording and sentence editing, all speakers’ data were segmented
in Praat, version 6.0.53 (Boersma and Weenink 2019). Moreover, the audio files
were edited in Audacity, version 2.3.2 (2019), and normalized to an intensity of
−5 dB. Speakers received weekly general oral and written linguistic feedback
from the first author, so that they could keep practicing their Portuguese.
1 https://www.ufrgs.br/acervocelpebras/acervo/
Figure 1: Speakers’ recording dates (dd/mm/yyyy): 02/11/2018, 16/11/2018, 30/11/2018,
14/12/2018, 04/01/2019, 18/01/2019, 01/02/2019, 15/02/2019, 01/03/2019, 15/03/2019,
29/03/2019, 12/04/2019.
Figure 2: Listeners’ receiving dates (dd/mm/yyyy): 05/11/2018, 19/11/2018, 03/12/2018,
17/12/2018, 07/01/2019, 21/01/2019, 04/02/2019, 18/02/2019, 04/03/2019, 18/03/2019,
01/04/2019, 15/04/2019.
As for the listeners’ group, on the first day of the data collection they
received an email with all the necessary guidelines to perform the audio
evaluation. Each week, they would receive the data pack and download it to
their personal computer. Overall, listeners were asked to evaluate all the
audio files, save them in a .zip file and send them back to the researcher. All
data were obtained on the AEPI² app (Bondaruk, Albuquerque, and Alves 2018).
4.4 Data analysis
The intelligibility data were coded from the listeners’ oral productions, based on
what they were able to repeat or explain from the Haitian speakers’ productions.
As this was an oral production task, listeners were told they could orally
reproduce a wide range of units: from individual sounds to full words or the whole
sentence. In order to score the responses, the researcher took into account content
words, i.e., if listeners were not able to retrieve the articles or prepositions in
a sentence such as “In Brazil it is too hot”, this would not be considered a
mistake, since these are function words and do not carry the main meaning of the
sentence. Each content word was scored as a point, and the total number of correct
words was converted into percentage values. In addition, the comprehensibility
data were coded using the raw Likert scale scores for each data point.
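The scoring rule above can be sketched as follows. The function-word list, the tokenization, and the function names are assumptions made for illustration (the chapter does not specify them), and the example uses the English gloss given in the text rather than an actual Brazilian Portuguese sentence from the data.

```python
import re

# Assumed, illustrative list of function words for the English gloss;
# the study itself scored Brazilian Portuguese sentences by hand.
FUNCTION_WORDS = {"a", "an", "the", "in", "on", "at", "of", "to", "it", "is"}

def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", sentence.lower())

def intelligibility_score(target, repetition, function_words=FUNCTION_WORDS):
    """Percentage of the target's content words found in the repetition."""
    content = [w for w in tokenize(target) if w not in function_words]
    if not content:
        raise ValueError("target sentence has no content words")
    repeated = set(tokenize(repetition))
    correct = sum(1 for w in content if w in repeated)
    return 100.0 * correct / len(content)

target = "In Brazil it is too hot"
# Missing function words are not penalized under this rule.
print(intelligibility_score(target, "Brazil too hot"))
# A missing content word lowers the score.
print(intelligibility_score(target, "Brazil is hot"))
```

In this sketch, whether a word such as “too” counts as a content word depends entirely on the assumed list; a hand-coded scheme like the one described in the chapter would settle such cases by researcher judgment.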
The data were analyzed using moving min-max graphs and Monte Carlo simulations,
according to the methodology proposed by Verspoor, De Bot, and Lowie (2011). In
the moving min-max graphs, the data are plotted together with their moving
minima and maxima, and variability patterns may be seen through different
bandwidths. This way, we had access to the binomial’s developmental trajectory
and to potential changes in both intelligibility and comprehensibility over
time. The moving average of both intelligibility and comprehensibility
performances and the moving minima and maxima of the two constructs were
extracted with a predetermined moving window of 2 positions (as the full series
comprises 12 points in time). Each point in time presented a set of 14
sentences that were extracted from the conversations with the Haitian speakers
(four sentences for Speaker 1, five sentences for Speaker 2 and five sentences
for Speaker 3³). Monte Carlo simulations were run to explore possible
unexpected changes in the binomial developmental trajectory. The simulations
were calculated by resampling the original data and reshuffling them 5000 times
(Van Geert, Steenbeek, and Kunnen 2012).

² A step-by-step description of the data storage and the operationalization of
intelligibility and comprehensibility on AEPI can be found in Alves,
Albuquerque, and Bondaruk (2021).
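As an illustration of the moving window described above, the following Python sketch computes the moving minimum, maximum and average of a series with a window of 2 positions. The function name and the example values are hypothetical; the sketch only shows why a 12-point series yields 11 window positions, and it is not the software used in the study.

```python
def moving_min_max(series, window=2):
    """Return (mins, maxs, means) over a sliding window of the given size."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and the series length")
    starts = range(len(series) - window + 1)
    chunks = [series[i:i + window] for i in starts]
    mins = [min(chunk) for chunk in chunks]
    maxs = [max(chunk) for chunk in chunks]
    means = [sum(chunk) / window for chunk in chunks]
    return mins, maxs, means


# Made-up intelligibility percentages for 12 data points (illustration only).
example_scores = [50, 75, 25, 100, 100, 60, 40, 36, 56, 95, 100, 100]
example_mins, example_maxs, example_means = moving_min_max(example_scores)
# The bandwidth at each position is the distance between moving max and min.
example_bandwidth = [hi - lo for lo, hi in zip(example_mins, example_maxs)]
```

A wider bandwidth at a given position corresponds to more variability between the two adjacent data points covered by the window.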
5 Results
Albuquerque (2019) pointed out that, through a product-oriented, inferential
analysis of the whole group, it could be generally observed that
intelligibility decreased from data point 1 (first data collection point) to 12
(last data collection point). Although obtaining a descending curve seemed
counterintuitive, the individual speaker-listener binomial analyses showed that
none of the binomial relationships between speakers and listeners presented
such a steep decrease at data point 12. In contrast, non-linear developmental
trajectories were the main pattern observed among the binomial relationships.
Moreover, the descriptive differences among the binomials pointed to individual
differences that seemed to depend on the speaker-listener relationship, i.e.,
it could not be stated that a speaker is intelligible by him/herself or that
the listener is able to linearly understand random speakers throughout time.
Not only intelligibility but also comprehensibility seems to vary among the
binomial relationships, and both may be connected to the speakers’ improvement
in lexical complexity and pronunciation and to the listeners’ ability to
accommodate new data and nuances from the speakers’ productions.

³ The uneven number of words produced by the participants is connected to the
participants’ proficiency level, i.e., Speaker 1 was the least proficient among
all participants and was not able to produce full/complete sentences in the
first data points. Therefore, it was decided to maintain the speaker’s
productions in the sample as she was the only more basic participant in the
study, and because a CDST approach reinforces the need for natural data.
5.1 Intelligibility results
In this section, we provide an exploratory analysis of the dynamicity of the
speaker-listener binomial relationship in oral interaction. To this end, the
graphs in Figures 3–8 explore not only the listeners’ relationships toward
different speakers, but also how speakers were rated by distinct listeners.
Figures 3–5 present the developmental trajectories of the three speakers’
performances when judged by different listeners. Over the time span of six
months, several fluctuation periods can be observed throughout the 12-point
longitudinal data collections. As the CDST predicts, these trajectories are non-
linear and they display moments of progression and regression in intelligibility
development (Van Geert and van Dijk 2002; van Dijk, Verspoor, and Lowie
2011; Yu and Lowie 2020).
[Line graph, Speaker 1: intelligibility (0–100%) over data points 1–12; series:
Listeners A, B, C and Average.]
Figure 3: Intelligibility binomials for Speaker 1 (S1-LA, S1-LB, S1-LC).
Generally, the listeners’ rating scores seemed to be speaker-dependent, i.e.,
scores were higher at some moments, between 55% and 100%, for Speakers 2 and 3
(who had been learning Brazilian Portuguese formally and living in Brazil for a
longer period) than for Speaker 1 (a more basic learner), who did not often
reach 100% intelligibility and who showed the lowest peak (30%). Also, the
speakers seemed to present different learning stages. Standing by the premise
that none of the stages are linear, Speaker 1 seemed to present three stages:
one from data point 1 to 8, another from data point 8 to 10, and a smaller one
from 10 to 12. Data point 8 could be taken as a developmental “jump” (in which
intelligibility was at 36% for most listeners): from there, it reached 56% for
Listener A and 80% for Listener C at data point 9, and 95% for Listener B at
data point 10. Speakers 2 and 3, however, displayed more diffuse developmental
stages across the different listeners, especially considering some peaks (e.g.,
between data points 3 and 5) and valleys (e.g., for Speaker 3 at data point 7
and for Speaker 2 at data point 8).
[Figures 4 and 5: line graphs for Speaker 2 and Speaker 3: intelligibility
(0–100%) over data points 1–12; series: Listeners A, B, C and Average.]
Figures 6–8 present the intelligibility graphs in which we focus on the lis-
teners’ rating patterns.
[Line graph, Listener A: intelligibility (0–100%) over data points 1–12; series:
Speakers 1, 2, 3 and Average.]
Figure 6: Intelligibility binomials for Listener A (LA-S1, LA-S2, LA-S3).
[Line graph, Listener B: intelligibility (0–100%) over data points 1–12; series:
Speakers 1, 2, 3 and Average.]
Figure 7: Intelligibility binomials for Listener B (LB-S1, LB-S2, LB-S3).

[Line graph, Listener C: intelligibility (0–100%) over data points 1–12; series:
Speakers 1, 2, 3 and Average.]
Figure 8: Intelligibility binomials for Listener C (LC-S1, LC-S2, LC-S3).
Again, when focusing on the listeners and their rating trajectories, a great
deal of fluctuation can be observed. A converging point among the graphs seems
to lie in how different listeners rate speakers who have a basic knowledge of
Portuguese: Speaker 1 received the lowest intelligibility ratings among all the
speakers, and for all listeners, at data point 8, varying from 33% to 38%.
Nevertheless, Speaker 1 is also the one who received the highest intelligibility
ratings from all listeners towards the last data point. As for developmental
stages, one can neither clearly delineate them nor point out that specific
listeners are more dynamic raters than others. Notwithstanding, it may be
observed that until their third/fourth data point listeners did not seem to
vary a lot in their ratings, and that a more dynamic perception started to take
place from data point 7/8 onwards.
Interestingly, Figures 6–8 portray a very diverse scenario, in which it is diffi-
cult to point out specific developmental stages for all listeners and all speakers,
since results portray a potential influence of the speaker-listener relationship
over intelligibility ratings. As visual inspection may work as a resourceful tool to
analyze variability in longitudinal studies (Van Dijk, Verspoor, and Lowie 2011),
we present the moving min-max graphs for all listeners and speakers in their bi-
nomial settings.
We present below the results for the min-max graphs for the intelligibility
construct, in the selected binomial relationships: (i) S1-LA; S1-LB; S1-LC;
(ii) S2-LA; S2-LB; S2-LC; (iii) S3-LA; S3-LB; S3-LC. Overall, it can be observed
that results reached ceiling effects, which will be discussed at the end of
this chapter together with the methodological issues of the study. Thus, all
min-max graph analyses will take into account the min results.
In general, in Figure 9 some fluctuation for the binomial settings S1-LA and
S1-LB can be observed towards the first half of the data points, from data point 1 to
5, and for S1-LC, from data point 1 to 3. Yet, all binomial settings seem to present a
rather stable development in the mid part of the data points and an increase of
the min values from data points 9 to 10 (S1-LA, from 0% to 33% and S1-LB, from
0% to 66%) and from data points 8 to 9 (S1-LC, from 20% to 50%), which may
indicate a developmental change for some of the binomial relationships.
In the min-max graphs of S2-LA; S2-LB; S2-LC shown in Figure 10, one can
observe some growing fluctuations in the intelligibility scores in two moments
of the binomial setting S2-LA (from data points 1 to 2, from 33% to 75%, and
from data points 6 to 7, from 33% to 50%). Moreover, in the graphs of S2-LB, we
can also find moments of fluctuations which may indicate that intelligibility in-
creases (from data points 5 to 6, from 0% to 66%, and from data points 8 to 9,
from 0% to 25%) and a great descending moment (from data points 1 to 4, from
60% to 0%). Also, for S2-LC a possible, but minor developmental peak may be
observed in one moment (from data points 4 to 5, from 50% to 60%), as well as
a great descending moment (from data points 9 to 10, from 50% to 0%).
As for the binomial relationships S3-LA; S3-LB; S3-LC, despite the fact that we
can also observe ceiling effects, different valley and peak patterns can be
seen. A rather wider bandwidth (the lowest and highest values of fluctuation
moments) can be observed in the graphs in Figure 11, which can be connected to
how listeners were accommodating the speaker’s productions. One can notice a
major growing fluctuation for S3-LA in one moment (from data points 7 to 8,
from 0% to 75%). In addition, S3-LB presented two slightly similar moments
indicating an increase in intelligibility scores (from data points 7 to 8, from
0% to 71%, and from data points 10 to 11, from 0% to 37%). Both binomial
relationships presented a min line showing a larger increase from data points 7
to 8. Finally, S3-LC presented two major growing moments in its intelligibility
scores (from data points 3 to 4, from 25% to 62%, and from data points 8 to 10,
from 62% to 100%).
The variability ranges can point to different accommodation processes in each
listener-speaker relationship. They might be the result of coincidental
fluctuations or of significant “tuning” moments, which may be evoked by an “oh,
I got it” assumption, in which speakers decide to use new and risky forms and
listeners try to accommodate this content.
[Figure 9 panels: Speaker 1 - Listener A, Speaker 1 - Listener B, Speaker 1 -
Listener C. Each panel plots intelligibility performance (0–100%) with its
moving min and max over window positions 1–11.]
Figure 9: Moving min-max intelligibility binomials for Speaker 1 (S1-LA, S1-LB, S1-LC).
[Figures 10 and 11: moving min-max intelligibility panels (0–100%, window
positions 1–11) for the Speaker 2 and Speaker 3 binomials.]
Taking into account the previously mentioned binomial relationships, we can
observe that variability is present in all of them and that all of them, at
different data points, seem to present developmental jumps. In order to take a
closer look at the significant peaks in the binomial pairs, we ran Monte Carlo
simulations (Verspoor, De Bot, and Lowie 2011). These simulations (5000
iterations) revealed that, among the nine binomials, significant
intelligibility peaks (p ≤ 0.05) were found for Speaker 1-Listener A
(p = 0.0248), Speaker 3-Listener A (p = 0.000) and Speaker 2-Listener B
(p = 0.0438). For the other binomial relationships, in contrast, the peaks were
likely the result of coincidental fluctuations.
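The logic of such a resampling test can be sketched in Python as follows. This is a hedged illustration rather than the procedure actually run for this chapter: the peak criterion used here (the largest rise between consecutive moving averages) is an assumption, since the chapter does not state the criterion, and all function names are hypothetical.

```python
import random


def moving_average(series, window=2):
    """Moving average over a sliding window."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def peak_criterion(series, window=2):
    """Assumed criterion: largest rise between consecutive moving averages."""
    smoothed = moving_average(series, window)
    return max(later - earlier for earlier, later in zip(smoothed, smoothed[1:]))


def monte_carlo_p(series, iterations=5000, window=2, seed=0):
    """Share of reshuffled series whose peak is at least as large as observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = peak_criterion(series, window)
    shuffled = list(series)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # destroys the temporal order of the scores
        if peak_criterion(shuffled, window) >= observed:
            hits += 1
    return hits / iterations
```

If the observed peak is rarely matched by randomly reordered versions of the same scores, the resulting proportion is small, which is read as the peak being unlikely to stem from coincidental fluctuation.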
When tracing back characteristics from the speakers’ and listeners’ profiles,
we may observe that Listener A’s background as a more experienced language
teacher may have helped this listener tune in to the productions of both
Speakers 1 and 3, i.e., the participant may be very much adapted to “class-like”
speech production. Speakers 1 and 3, in turn, could be placed at the two
extremes of a proficiency scale (in a more traditional classification), i.e.,
during the data collection, Speaker 1 was starting formal classes of Portuguese
and had been living in Brazil for a short time, and Speaker 3 had been having
classes for over a year and living in Brazil for more than six months. More
importantly, both of them received formal training in a “class-like” scenario.
Listener B, in contrast, had some experience in teaching but closer contact
with foreigners speaking Portuguese as an L2, which might have helped this
listener accommodate the more informal speech of Speaker 2, who had less formal
training in Portuguese (and did not engage as much in formal lessons as Speaker
3, for example).
5.2 Comprehensibility results
Graphs in Figures 12–17 explore not only the listeners’ relationships toward dif-
ferent speakers, but also how speakers were rated by distinct listeners concern-
ing the comprehensibility dimension.
The graphs present the 12-point longitudinal data collection on the X-axis and
the degree of difficulty felt by the listeners when rating speakers’
productions on the Y-axis, in which 1 stands for “very difficult to understand”
and 9 for “very easy to understand”.
Similarly to the intelligibility analysis, one can observe that listeners’ and
speakers’ trajectories are non-linear and thus show moments of progression and
regression in comprehensibility development. In addition, in tune with the
intelligibility results, Speaker 1 also presented the lowest comprehensibility
scores at many data points and for different listeners (2.2 at data point 6 for
Listener B and 4 at data points 7 and 8 for Listeners A and C, respectively),
i.e., her oral productions were generally considered more difficult to
understand. Speakers 2 and 3’s graphs, on the other hand, display a different
scenario in some aspects. These speakers’ variability seems to be distributed
differently: for Speaker 2, major fluctuations seem to appear from data point 6
onwards, whereas for Speaker 3 two moments are seen, one that appears to run
from data point 1 to data point 5, and another from data point 8 until the last
data point.
[Line graph, Speaker 1: comprehensibility (Likert scale, 1–9) over data points
1–12; series: Listeners A, B, C and Average.]
Figure 12: Comprehensibility binomials for Speaker 1 (S1-LA, S1-LB and S1-LC).
[Figures 13 and 14: line graphs for Speaker 2 and Speaker 3: comprehensibility
(Likert scale, 1–9) over data points 1–12; series: Listeners A, B, C and
Average.]
Figures 15–17 present the comprehensibility graphs in which we focus on
the listeners’ possible rating patterns.
[Line graph, Listener A: comprehensibility (Likert scale, 1–9) over data points
1–12; series: Speakers 1, 2, 3 and Average.]
Figure 15: Comprehensibility binomials for Listener A (S1-LA, S2-LA and S3-LA).
[Line graph, Listener B: comprehensibility (Likert scale, 1–9) over data points
1–12; series: Speakers 1, 2, 3 and Average.]
Figure 16: Comprehensibility binomials for Listener B (S1-LB, S2-LB and S3-LB).
[Line graph, Listener C: comprehensibility (Likert scale, 1–9) over data points
1–12; series: Speakers 1, 2, 3 and Average.]
Figure 17: Comprehensibility binomials for Listener C (S1-LC, S2-LC and S3-LC).
Once again, when focusing on the listeners and their rating trajectories, a
great deal of fluctuation can be observed. Notwithstanding, when taking a
closer look at each listener, it is clear that Listener A started rating the
three different speakers similarly. Listener A displayed an initial pattern
which groups the distinct speakers in a way that ratings from data point 1 to 3
range from 6 to 7 on the Likert scale, meaning that all speakers seemed to be
relatively easy to understand in the first data collection points. Yet, some
event seemed to cause major changes from data point 4 onwards. A similar
movement may be observed for Listener C. In contrast, Listener B exhibited an
interesting and different movement throughout time, since initial ratings (in
data points 1 and 2) converged to a similar rating range in the last point
(with Likert scores varying from 4 to 6). In addition, Listeners A and C
generally displayed lower ratings for Speaker 1, i.e., they identified Speaker
1’s productions as being mostly more difficult to understand than those of
Speakers 2 and 3.
As with Figures 3–8, it is challenging to indicate specific developmental
stages for all listeners and all speakers in our discussion of Figures 12–17,
since once again the results portray a potential influence of the
speaker-listener relationship on comprehensibility ratings. As visual
inspection may work as a resourceful tool to analyze variability in
longitudinal studies (Van Dijk, Verspoor, and Lowie 2011), we present the
moving min-max graphs for all listeners and speakers in a binomial setting.
Overall, in Figures 18–20, it can be observed that not all results reach
ceiling effects as in the intelligibility results, perhaps due to the nature of
the construct, which is a more subjective measure of perceived comprehension
difficulty. A visual inspection may suggest that the S1-LB binomial
relationship presents more variability than S1-LA, for example. In the min-max
graphs of the S1-LA binomial, we can observe some fluctuations in many moments,
but variability may be taking place in data points 3 to 5, since sentences are
now scored as more difficult to understand (scores go from 5 points to 1 point
on the Likert scale). Moreover, in the S1-LB graph, we can observe three main
developmental stages, which can be analyzed through the bandwidth (the
variation between min and max results in time). The first one is a rather
narrow bandwidth, which may indicate less variability, around data points 2 and
3, oscillating between a min of 1 and a max of 5. This is followed by two
slightly wider bandwidths, which may indicate more variability, one around data
point 4, in which the scores move from a min value of 2 to a max value of 7,
and the other from data point 6 onwards. In the last of this series of pairs,
S1-LC, one can visualize two to three stages in which variability changes,
starting with a fairly narrow bandwidth (from data points 1 to 4), followed by
a wide bandwidth (from data points 5 to 7) and, finally, reaching a rather wide
bandwidth (from data points 8 to 9, with a min value of 1 and a max of 9).
[Figure 18 panels: Listener A - Speaker 1, Listener B - Speaker 1, Listener C -
Speaker 1. Each panel plots comprehensibility performance (Likert scale, 1–9)
with its moving min and max over window positions 1–11.]
Figure 18: Moving min-max comprehensibility binomials for Speaker 1 (S1-LA, S1-LB, S1-LC).
In the min-max graphs of the S2-LA binomial, as ceiling effects were observed,
the analyses were based on the min results. One can notice two main stages
where we believe a significant change would occur: from data points 1 to 2
(where scores on the Likert scale go from 1 to 5 points, indicating that the
data are easier to understand) and a quite abrupt change in the
comprehensibility scores from data points 8 to 10. In addition, in the S2-LB
graph, fewer ceiling effects as well as more stable scores are observed. In the
least stable moment, where increased variability may be observed, there is a
rather narrow and descending moment (from data points 8 to 10, in which Likert
scale scores varied from a max of 7 or 8 to a min of 1). In addition, for
S2-LC, a great descending peak from data points 8 to 11 can be found (in which
Likert scale scores varied from a max of 7 to a min of 1, indicating that
productions were perceived as difficult to understand).
[Figures 19 and 20: moving min-max comprehensibility panels (Likert scale, 1–9,
window positions 1–11) for the Speaker 2 and Speaker 3 binomials.]
In the min-max graphs of the S3-LA binomial, we can observe a major descending
moment (between data points 2 and 3, in which Likert scale scores moved from 5
to 2), a more stable comprehensibility rating process in data points 3 to 7,
followed by a major increase in comprehension levels in data point 7 (in which
Likert scale scores varied from 2 to 7, meaning the production was easier to
understand). In addition, the S3-LB graph presents a rather narrow and somewhat
stable development from data points 1 to 5, a subtle decrease in data point 5
(Likert scale judgements go from 2 to 8 points) and a rather wide increase in
the Likert scale scores from data points 6 to 7 (from 2 to 9 points). Finally,
S3-LC presents a bandwidth with possibly significant variability between data
points 2 and 3, whose Likert scale points go from 3 to 9.
To this point, we have showcased the descriptive analyses based on moving
min-max graphs from the comprehensibility results. When compared to the
intelligibility rates, we see that if participants have to deal with a more
subjective dimension of how difficult or easy it is to understand someone, more
fluctuations and variations of rather narrow and wide bandwidths may be found
in the same pairings. Taking into account the binomial relationships previously
mentioned, it can be observed that variability is present in all of them and
that all of them, at different data points, seem to present developmental
jumps. In order to take a closer look at the significant peaks in the binomial
pairs, we ran Monte Carlo simulations (Verspoor, De Bot, and Lowie 2011). These
simulations (5000 iterations) revealed that, among the nine binomials, a
significant comprehensibility peak (p ≤ 0.05) was found only for Speaker
3-Listener A (p = 0.0156), and a marginal significance was found for Speaker
3-Listener B (p = 0.0534). In the other binomial relationships, in contrast,
the peaks were likely the result of coincidental fluctuations.
When aligning the Monte Carlo results with the speakers’ and listeners’
profiles, we may observe that Listeners A and B shared some similarities in
their profiles which may come in handy when dealing with so much fluctuation,
given that the first was a more experienced teacher and the second had more
experience with foreign speech. Speaker 3, as we have previously explored, had
more formal experience with Portuguese (and may thus be portrayed as a more
accurate learner) and had been living in Brazil for a longer period than
Speaker 1, yet less time than Speaker 2. A possible explanation may be provided
as we consider complex accommodating and tuning-in processes. Speaker 2, for
example, presented a very diverse context for Portuguese usage, since the
participant spoke Portuguese not only at work, but also used it freely in
general conversation. The amount and diversity of contact with Portuguese had
an impact on this learner’s production and on the reception of their speech.
6 Discussion
In this chapter, we aimed to display some important features and nuances of
variability and potential individual development concerning the intelligibility
and comprehensibility constructs by discussing binomial listener-speaker
relationships. The pairs of participants who took part in the analysis
consisted of three Haitian learners (referred to as the ‘speakers’), all of
them with different lengths of residence in Brazil and proficiency levels in
Brazilian Portuguese, and three Brazilians (referred to as the ‘listeners’),
who had distinct experiences with other L2s (like English or German) and
exhibited different degrees of contact with foreigners. The findings provided
interesting results on the importance of longitudinal studies in exploring how
speaker and listener may work as a binomial when one assumes ‘understanding’ to
be a non-linear process over time. Also, the study raised necessary
theoretical-methodological issues.
This study took into consideration some gaps concerning the intelligibility
construct raised by some authors throughout the years: criticism of the use of
transcription in intelligibility tasks (Alves, Albuquerque, and Bondaruk 2021;
Munro and Derwing 2020; Zielinski 2006) and concerns related to working memory
load (Kang, Thomson, and Moran 2018). Taking this criticism into account, in
this study we chose to adopt an oral repetition task. The task was a holistic
attempt to help listeners recover oral information more freely, i.e.,
participants could retrieve either small pieces of information (sounds,
syllables, isolated words) or bigger blocks of information (parts of the
sentence or its general idea). One of the major contributions of this study
lies in not requiring listeners to recover words in their orthographic form
but, instead, as idea chunks or as the whole idea, as it was semantically
displayed. An example can be seen in a sentence produced by Speaker 2.
Table 3 presents Listeners A and C’s comments on what they understood of
Speaker 2’s productions. It is important to state that these comments were all
collected in the AEPI app in written form. The researchers asked all listeners
to provide all sorts of impressions on the productions: from more detailed
notes (related to sound comprehension) to more general ideas (semantic
content).

Table 3: Speaker-Listener examples of the oral repetition task.
Speaker
  Aimed production by the speaker: “Curitiba is too hot” (in BP “Curitiba é
  muito calor”).
  Actual produced sentence by the speaker: “Curitiba is too ‘hor’” (in BP
  “Curitiba é muito caror”).
Listener
  Listener A’s comprehension: “I think he said ‘Curitiba is too warm’, but I am
  not sure because there was a problem with a sound”.
  Listener C’s comprehension: “I understood he said ‘Curitiba is very
  expensive’, but the pronunciation of the final sound caused me problems, it
  may be another word”.

It can be observed that the listener who had some previous teaching experience
(Listener A) was able to recover the idea, as it was semantically displayed, by
using a different word from the original audio while maintaining the intended
meaning. In turn, Listener C was misled by a sound exchange made by the speaker
and, instead of a semantic strategy, seemed to rely on a more phonetic cue.
Despite the ceiling effects (which will be further discussed), listener
experience/contact with foreign speech proved to be an important individual
characteristic to promote intelligibility over time. As for the speakers, the
ones at the extremes of what could be called “a formal learning continuum”
presented some variability that could be well accommodated by different
listeners. However, Speaker 2, who had undergone a later formal training
process, presented a more “irregular production” pattern, since his speech
involved some variability which emerged from his background contact with other
Brazilians (e.g., social gatherings, work meetings).
As for comprehensibility, results also portrayed this construct as developing
non-linearly over time for all binomial relationships. As a more subjective
measurement, the general descriptive analysis through the moving min-max graphs
showed fewer ceiling effects and a rather wide variability when compared to the
intelligibility findings. Although Monte Carlo results have pointed out
significant variability peaks for only two binomials, Speaker 3-Listener A and
Speaker 3-Listener B, they tell us something about a potential learning
process, as these participants learned how to accommodate S3’s patterns over
time, since variability may also work as a potential learning cue (Lowie and
Verspoor 2019). Although the other pairs did not display significant results,
they may also provide further explanation of what underlies Speakers 1 and 2’s
production patterns and Listener C’s comprehension strategies. Our results are
aligned with Zielinski and Pryor (2020), who provided interesting results on
beginners’ variability over time and reflected on the importance of the context
they are inserted in (when learners can have a real daily practice of the
language). Although our binomial pairings did not reveal any group tendencies,
results are aligned with Nagle, Trofimovich, and Bergeron (2019) and Nagle et
al. (2021) in pointing out the role of variability in the comprehensibility
dimension.
Generally, both intelligibility and comprehensibility constructs have shown a
non-linear behavior over time, with each binomial relationship presenting
different patterns and variability contours. Also, we side with Ranta and
Meckelborg (2013) and Zielinski and Pryor (2020) in observing that learning a
language in an immersion environment does not guarantee a constant increase in
proficiency over time, as made clear in our results.
6.1 Intelligibility and comprehensibility ratings
The study faced some issues concerning the lengths of the sentences. Since
speakers had different proficiency levels, they initially differed in the size
of the lexicon they could draw on, which, in turn, led to shorter sentences
being produced by Speaker 1, for example, when compared to Speakers 2 and 3.
Therefore, sentences ranged from 3 to 8 words, with 5 or 6 words being the most
frequent pattern. This may have influenced the ceiling effects of both
intelligibility and comprehensibility results. Yet, intelligibility scores
might have suffered a larger influence, since intelligibility is a more
objective measure and the chances of getting all words either wrong or right
are high in shorter sentences (which could explain the ceiling effects). This
sort of issue has also occurred in other studies which work with more
naturalistic data. In the same fashion, Munro and Derwing (2020) report that
the listeners’ performance in the 1995 experiment was probably connected to
sentence length (which varied from 7 to 13 words).
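The link between sentence length and all-or-nothing scores can be made concrete with a small worked example in Python. It rests on a strong simplifying assumption that is not made in the chapter, namely that each content word is recovered independently with the same probability; the function name and the probability value of 0.8 are purely illustrative.

```python
def all_or_nothing(n_words, p_word):
    """Probabilities of recovering all words or none of them.

    Assumes (for illustration only) that each of the n_words content words
    is recovered independently with the same probability p_word.
    """
    all_right = p_word ** n_words
    all_wrong = (1 - p_word) ** n_words
    return all_right, all_wrong


# Illustrative comparison of a 3-word and an 8-word sentence at p_word = 0.8.
short_right, short_wrong = all_or_nothing(3, 0.8)
long_right, long_wrong = all_or_nothing(8, 0.8)
```

Under these assumptions, a 3-word sentence is scored at 100% about half of the time (0.8 cubed is 0.512), whereas an 8-word sentence reaches 100% in roughly 17% of cases, which is consistent with shorter sentences pushing scores toward the ceiling.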
When taking comprehensibility into account, many scale lengths have been used
by researchers, but there is no agreement on which would be the best fit (Munro
2018). Zielinski and Pryor (2020) argue that it is of major importance that
scales provide raters with a comfortable range to evaluate comprehensibility.
In their study, they used a 5-point scale, which may have influenced the way
participants rated beginners and intermediate learners, since variability may
differ depending on the perceived fluency. In our study, despite the room for
identifying more subtle differences, since we used a 1–9 scale, a similar
effect may have taken place, since we also had speakers with distinct
proficiency levels and the listeners might have used different criteria to
score their comprehension. Thus, Speaker 1, who was the least proficient one,
may have reached lower and higher levels more frequently than Speakers 2 and 3.
Zielinski and Pryor (2020) also point to an effect mentioned by Munro and
Derwing (2015): unsupervised rating. According to the authors, participants may
not be able to constantly control the underlying conditions for rating, and in
a longitudinal study this effect can increase, since conditions have to be
frequently revisited.
6.2 Pronunciation teaching and learning implications
Tracey Derwing and Murray Munro have had a major impact on how both in-
telligibility and comprehensibility have been researched, and their findings
have served as a true pedagogical tool towards pronunciation teaching and
learning. This way, intelligibility and comprehensibility may be seen through a
lens on how learners’ “mistakes” can be overlooked by taking into account in-
dividual variability as a key element to analyze development.
Although not all studies attempt to conduct longitudinal investigations, this
kind of research can reveal that development does not generally follow a flat
or balanced line. In contrast, it is by analyzing variability that potential
learning processes may emerge. According to Van Dijk, Verspoor, and Lowie
(2011), more traditional paradigms assume a contrast between competence and
performance, in which the latter is usually connected to an intrinsic and
intense variability process and the former is related to the stability of
sounds and forms. Thus, according to more traditional paradigms, learners’
mistakes are connected to their system’s irregularities and should be
discarded, diminished, eliminated. However, we understand that in order to
learn, individuals have to make mistakes.
Instead of being left out, as if it were “noise” in a balanced system,
“variability is not something to be ignored, but rather offers an indispensable
source of information.” (Larsen-Freeman 2020: 295). This information has a huge
impact on both pronunciation teaching and learning processes, since language
teachers will have to analyze their students’ development over time, and in a
way that individual variability is not set aside so that group tendencies may
take place.
As we reflect more specifically upon the constructs of intelligibility and
comprehensibility, we conclude that variability can be seen as an important
strategy in listener-speaker pairing in class, i.e., rather than being paired
repeatedly in the same groups, learners will probably expand their production and
comprehension strategies if they learn how to accommodate new details, e.g.,
differences in vowels and consonants. Also, by varying what sort of content
learners have to retrieve and how they retrieve it, teachers may be helping students
to develop both fine-grained detail and more holistic recovery processes.
Last, but not least, we would like to draw attention once again to the
interdependence of speaker and listener in oral communication, as a sort of
comprehension dance. As it has long been known that individuals are not intelligible or
comprehensible by themselves, but rather in a context- or even person-dependent way, it seems
that viewing the constructs as not only partially interconnected (Derwing and
Munro 2015) but also as tuned in a speaker-listener binomial relationship may
have important implications for studies on language development and L2
pronunciation teaching. In this sense, it is important to mention the possibility that
talker familiarity may be taking place longitudinally, as the listeners slowly learn
how to deal with accented speech (Albuquerque and Alves 2017). We should
highlight the importance of “learning to listen”, which is made possible as both
speakers and listeners are exposed to variation in the language input, leading them to
become familiar with different varieties of the language (Leung 2012, 2014), including
L2-accented speech. These results, therefore, highlight the importance of teaching
not only how to pronounce, but also how to listen. This latter type of learning is of
paramount importance regardless of whether we are dealing with native or
non-native speakers of a language.
7 Conclusion
This longitudinal study was designed as an exploratory attempt to bring
non-mainstream data, from Haitian learners of Brazilian Portuguese, to a
well-known field of research on intelligibility and comprehensibility. Although the
pairs of speakers and listeners selected for this study formed a small group to
be analyzed, the aim of the investigation was to highlight the binomial
members’ personal trajectories in order to observe individual differences over time instead of
regular group tendencies.
Intelligibility results have shown significant variability peaks in the binomial
relationships for Speaker 1-Listener A, Speaker 3-Listener A and Speaker 2-Listener
B. As for comprehensibility, binomial relationships which presented significant
results were Speaker 3-Listener A and Speaker 3-Listener B. Generally, only one
pair, Speaker 1-Listener A, displayed significant variability patterns that could
be connected to both intelligibility and comprehensibility tasks. Thus, we can
observe, once again, how personal features such as having had previous
contact with foreigners’ speech samples (for listeners) and receiving formal L2
tuition (for speakers) seem to have an effect on intelligibility and comprehensibility.
We hope this chapter has contributed to paving the way for future studies
that take the individuals and their interactions as the locus of analysis in
intelligibility and comprehensibility studies. We therefore stress the importance
of longitudinal studies on intelligibility and comprehensibility, as well as of taking
variability as an important cue for learning, in order to improve not only L2
teaching methods, but also learners’ strategies towards L2 development.
References
Abercrombie, David. 1949. Teaching pronunciation. English Language Teaching 3(5). 113–122.
Albuquerque, Jeniffer Imaregna Alcantara. 2019. Caminhos dinâmicos em Inteligibilidade e
Compreensibilidade de Línguas Adicionais: um estudo longitudinal com dados de fala de
haitianos aprendizes de Português Brasileiro [Dynamic paths in intelligibility and
comprehensibility in Additional Languages: a longitudinal study with data from Haitian
learners of Brazilian Portuguese]. Porto Alegre: Universidade Federal do Rio Grande do
Sul dissertation.
Albuquerque, Jeniffer Imaregna Alcantara & Ubiratã Kickhöfel Alves. 2017.
Compreensibilidade em L2: Uma discussão sobre o efeito da experiência do ouvinte e do
tipo de meio em excertos do Português Brasileiro produzidos por um Falante haitiano
[L2 Comprehensibility: a discussion on listener’s experience effects and type of medium
in the Brazilian Portuguese data produced by a Haitian speaker]. Revista X 12(2). 43–64.
Albuquerque, Jeniffer Imaregna Alcantara & Ubiratã Kickhöfel Alves. 2020. Os construtos de
‘inteligibilidade’ e ‘compreensibilidade’ em dados do Português Brasileiro como língua
adicional: um olhar via Sistemas Dinâmicos Complexos [The constructs of ‘intelligibility’
and ‘comprehensibility’ in Brazilian Portuguese data as an Additional Language through
the lens of Complex Dynamic Systems]. Signótica 32. Retrieved from: https://www.revis
tas.ufg.br/sig/article/view/58214.
Alves, Ubiratã Kickhöfel, Jeniffer Imaregna Alcantara Albuquerque & Patrick D. Bondaruk.
2021. L2 intelligibility and comprehensibility: trying out new measurements with AEPI.
Anales de Lingüística 5. 21–39. Retrieved from: https://revistas.uncu.edu.ar/ojs3/index.
php/analeslinguistica/article/view/4587
Baba, Kyoko & Ryo Nitta. 2014. Phase transitions in development of writing fluency from a
complex dynamic systems perspective. Language Learning 64(1). 1–35.
Boersma, Paul & David Weenink. 2019. Praat: doing phonetics by computer [Computer
software]. Version 6.0.53. http://www.praat.org/.
Bondaruk, Patrick D., Jeniffer Imaregna Alcantara de Albuquerque & Ubiratã Kickhöfel Alves.
2018. AEPI – Aplicativo para Estudos de Percepção e Inteligibilidade [AEPI – An app for
perceptual and intelligibility studies]. Version 0.01. https://en:aepi.e-pi.co. (Accessed
21 February 2021).
Cadely, Jean-Robert. 2012. Haiti: The politics of language. Journal of Teaching and Education
1(3). 389–394.
De Bot, Kees. 2017. Complexity Theory and Dynamic Systems Theory: Same or different? In
Lourdes Ortega & ZhaoHong Han (eds.), Complexity Theory and Language Development:
In Celebration of Diane Larsen-Freeman, 51–58. Amsterdam: John Benjamins.
Derwing, Tracey & Murray Munro. 1997. Accent, comprehensibility and intelligibility: Evidence
from four L1s. Studies in Second Language Acquisition 19(1). 1–16. https://doi.org/
10.1017/S0272263197001010
Derwing, Tracey & Murray Munro. 2013. The development of L2 oral language skills in two L1
groups: A 7‐year study. Language Learning 63(2). 163–185.
Derwing, Tracey & Murray Munro. 2015. Pronunciation Fundamentals: Evidence-based
Derwing, Tracey, Murray Munro & Grace Wiebe. 1998. Evidence in favor of a broad framework
for pronunciation instruction. Language Learning 48(3). 393–410.
De Weers, Noortje. 2020. A critical (re)assessment of the effect of speaker ethnicity on speech
processing and evaluation. Burnaby: Simon Fraser University dissertation.
Isaacs, Talia & Pavel Trofimovich. 2012. Deconstructing comprehensibility: Identifying the
linguistic influences on listeners’ L2 comprehensibility ratings. Studies in Second
Kang, Okim, Ron I. Thomson & Meghan Moran. 2018. Empirical approaches to measuring the
intelligibility of different varieties of English in predicting listeners comprehension.
Larsen-Freeman, Diane. 2015. Ten “lessons” from complex dynamic systems theory: What is
on offer. In Zoltán Dörnyei, Peter D. MacIntyre & Alastair Henry (eds.), Motivational
Dynamics in Language Learning, 1–11. Bristol: Multilingual Matters.
Larsen-Freeman, Diane. 2020. Epilogue. In Wander Lowie, Marije Michel, Merel Keijzer &
Rasmus Steinkrauss (eds.), Usage-Based Dynamics in Second Language Development,
295–300. Bristol: Multilingual Matters.
Leung, Alex Ho-Cheong. 2012. Bad influence? – An investigation into the purported negative
influence of foreign domestic helpers on children’s second language English acquisition.
Journal of Multilingual and Multicultural Development 33(2). 133–148.
Leung, Alex Ho-Cheong. 2014. Input multiplicity and the robustness of phonological
categories in child L2 phonology acquisition. Concordia Working Papers in Applied
Linguistics 5. 401–415.
Levis, John. 2020. Revisiting the Intelligibility and Nativeness Principles. Journal of Second
Lowie, Wander. 2017. Lost in state space? Methodological considerations in Complex Dynamic
Theory approaches to second language development research. In Lourdes Ortega &
ZhaoHong Han (eds.), Complexity Theory and Language Development: In Celebration of
Diane Larsen-Freeman, 123–141. Amsterdam: John Benjamins.
Lowie, Wander & Marjolijn Verspoor. 2019. Individual differences and the ergodicity problem.
Language Learning 69(S1). 184–206. doi:10.1111/lang.12324.
Machry da Silva, Susiele. 2017. Aprendizagem do português por haitianos: percepção das
consoantes líquidas /l/ e /ɾ/. [The learning of Portuguese by Haitians: a perception study
of liquid consonants /l/ and /ɾ/]. Ilha do Desterro: A Journal of English Language,
Literatures in English & Cultural Studies 70(3). 47–62.
Munro, Murray. 2018. Dimensions of pronunciation. In Okim Kang, Ron I. Thomson &
John M. Murphy (eds.), The Routledge Handbook of Contemporary English Pronunciation,
413–431. New York: Routledge.
Munro, Murray & Tracey Derwing. 1995. Foreign accent, comprehensibility and intelligibility in
the speech of second language learners. Language Learning 45(1). 73–97.
Munro, Murray & Tracey Derwing. 2015. A prospectus for pronunciation research in the 21st
century: A point of view. Journal of Second Language Pronunciation 1(1). 11–42.
Munro, Murray & Tracey Derwing. 2020. Foreign accent, comprehensibility and intelligibility,
redux. Journal of Second Language Pronunciation 6(3). 283–309.
Nagle, Charles, Pavel Trofimovich & Annie Bergeron. 2019. Toward a dynamic view of second
language comprehensibility. Studies in Second Language Acquisition 41(4). 647–672.
https://doi.org/10.1017/S0272263119000044
Nagle, Charles, Pavel Trofimovich, Mary G. O’Brien & Sara Kennedy. 2021. Beyond
linguistic features: Exploring the behavioral and affective correlates of
comprehensible second language speech. Studies in Second Language Acquisition 44(1).
255–270. https://doi.org/10.1017/S0272263121000073
Norton, Bonny. 2013. Identity and Language Learning: Extending the Conversation. Bristol:
Multilingual Matters.
O’Brien, Mary. 2014. L2 learners’ assessments of accentedness, fluency, and
comprehensibility of native and nonnative German speech. Language Learning 64(4).
715–748.
Ranta, Leila & Amy Meckelborg. 2013. How much exposure to English do international
graduate students really get? Measuring language use in a naturalistic setting. Canadian
Modern Language Review 69(1). 1–33.
Silva, Adelaide H. Pescatori. 2015. Uma ferramenta para o ensino do acento primário do PB
para falantes nativos do crioulo haitiano [A tool for teaching primary accent in Brazilian
Portuguese to native speakers of Haitian Creole]. Organon 30(58). 175–191.
Sternberg, Robert & Karin Sternberg. 2012. Cognitive Psychology. Belmont: Cengage Learning.
United Nations High Commissioner for Refugees (UNHCR). 2020. Venezuelanos no Brasil:
Integração no Mercado de Trabalho e Acesso a Redes de Proteção Social [Venezuelans in
Brazil: integration to the job market and access to social protection networks].
https://www.acnur.org/portugues/wpcontent/uploads/2020/07/Estudo-sobre-
Integra%C3%A7%C3%A3o-de-Refugiados-eMigrantes-da-Venezuela-no-Brasil.pdf.
(Accessed 21 April 2021).
Van Dijk, Marijn, Marjolijn Verspoor & Wander Lowie. 2011. Variability in second language
development from a dynamic systems perspective. In Marjolijn Verspoor, Kees de Bot &
Wander Lowie (eds.), A Dynamic Approach to Second Language Development: Methods
and Techniques, 55–84. Amsterdam: John Benjamins.
Van Geert, Paul, Henderien Steenbeek & Saskia Kunnen. 2012. Monte Carlo techniques:
statistical simulation for developmental data. In Saskia Elske Kunnen (ed.), A Dynamical
Systems Approach to Adolescent Development, 43–53. New York: Routledge.
Van Geert, Paul & Marijn Van Dijk. 2002. Focus on variability: new tools to study intra-
individual variability in developmental data. Infant Behavior and Development 25(4).
340–374.
Verspoor, Marjolijn, Kees De Bot & Wander Lowie. 2011. A Dynamic Approach to Second
Language Development: Methods and Techniques. Amsterdam: John Benjamins.
Verspoor, Marjolijn, Wander Lowie & Marijn Van Dijk. 2008. Variability in second language
development from a dynamic systems perspective. Modern Language Journal 92(2).
214–231.
Yu, Hanjing & Wander Lowie. 2020. Dynamic Paths of Complexity and Accuracy in Second
Language Speech: A Longitudinal Case Study of Chinese Learners. Applied Linguistics
41(6). 855–877.
Zielinski, Beth. 2006. The intelligibility cocktail: An interaction between speaker and listener
ingredients. Prospect: An Australian Journal of TESOL 21(1). 22–45.
Zielinski, Beth & Elizabeth Pryor. 2020. Comprehensibility and everyday English use. In John
Levis, Tracey Derwing & Murray Munro (eds.), The Evolution of Pronunciation Teaching
and Research: 25 Years of Intelligibility, Comprehensibility and Accentedness, 75–101.
Amsterdam: John Benjamins.
Part II: L2 pronunciation teaching
Ronaldo Lima Jr
A dynamic account of the development
of English (L2) vowels by Brazilian learners
through communicative teaching and
through explicit instruction
Abstract: This study analyzed the longitudinal production of English vowels [i ɪ ɛ
æ u ʊ] by ten Brazilian undergraduate students of English Language Teaching
throughout the first four semesters of their college studies. In their first and second
semesters, they have integrated communicative language lessons in English. In
the third semester, they take a mandatory course in English phonetics and phonology,
in which they receive explicit instruction on the sounds of English, including
the vowels in focus. In their fourth semester of college studies, they resume
integrated communicative language lessons in English. Therefore, it was possible
to compare the development in the production of these vowels after each semester,
without and with explicit instruction on English vowel
sounds. Participants were recorded reading target words in a carrier sentence
every semester from semester 1 through 4, and the recordings were analyzed
acoustically. Euclidean distances between pairs of vowels were calculated, and
these distances were used to fit a Bayesian mixed-effects model to the data. The
analyses showed that the development of the target vowels is extremely dynamic
in nature, with a great amount of variability in the data. The main results
are: vowel contrasts were present in every recording, with most of them appearing
right after the English Phonetics and Phonology course; most learners
increased their contrasts of the target vowels; learners developed their vowels at
a different pace and at different moments; not all learners were able to create
new categories for the target vowels.
Keywords: second language development, foreign language speech, vowels,
Complex Dynamic Systems
Acknowledgement: This project has been partially funded by the Brazilian National Council for
Scientific and Technological Development (CNPq), grant number 471868/2014-0.
Ronaldo Lima Jr, Federal University of Ceará
https://doi.org/10.1515/9783110736120-006
1 Introduction
There are several aspects of the pronunciation of English that cause difficulties
for Brazilian learners, and vowels are among the most challenging ones. Since
Brazilian Portuguese has only seven vowels, /i e ɛ a ɔ o u/, it comes as no sur-
prise that learning a language with more vowels, as is the case of English, will be
especially challenging for Brazilian learners, who will need to create new vowel
categories in the vocalic space. Among the English vowels, the pairs /i ɪ/, /ɛ æ/
and /u ʊ/ are particularly challenging for Brazilian learners due to the expected
difficulty to perceive and produce L2 sounds which are very similar yet not con-
trasted in the learner’s L1 (Flege 1995; Flege and Bohn 2021).
When acquiring their native language, people learn how to accommodate
the variation of the acoustic signal into prototypical phonological categories so
that communication can take place, and the brain does so by taking statistics of
the input and assigning exemplars to the corresponding categories (Bybee 2003;
Cristófaro Silva 2003; Kuhl et al. 2008; Leather 2003; Pierrehumbert 1990). Hav-
ing learned the L1 so well is what makes it challenging to perceive and produce
L2 sounds that are very close, but not identical, to an L1 sound, especially when
there are two L2 sounds competing for some acoustic (and perceptual) space that is
occupied by a single vowel of the L1. This is the case of English vowels /i ɪ ɛ æ u ʊ/,
which tend to be perceived and produced by Brazilian learners within the pro-
totypical categories of Brazilian Portuguese /i ɛ u/, respectively (Bion et al.
2006; Lima Jr 2015; Nobre-Oliveira 2007; Rauber 2006). That is why (i) analy-
ses of the production of these six English vowels by Brazilian learners will be
presented in this paper.
The word development, instead of acquisition, in the title of this chapter was
intentionally chosen since it will be argued that language (whether L1 or L2) is a
complex dynamic system, and the development of L2 is a dynamic process (Beck-
ner et al. 2009; De Bot 2008; De Bot, Lowie, and Verspoor 2007; Larsen-Freeman
1997). Under such perspective of language development, the phonological cate-
gories created for communication in the L1 are seen as attractor states for the L2
(Lima Jr 2013). Attractors are states of temporary accommodation of a complex
dynamic system, where the system finds temporary stability. These states are
temporary due to the dynamic nature of such systems, which may move, or even
keep moving, from one attractor state to another. That is why development is bet-
ter suited than acquisition as the former captures the dynamic, never-ending
change in time as the system moves through different attractor states.
Some attractor states require more energy for the system to move away
from them than others. De Bot, Lowie, and Verspoor (2007) illustrate this fact
with the image of a surface, like a table, with some holes on it, of different sizes
and depths, and a ball moving from one hole to another. As we tilt the surface,
depending on how we do it, the ball resting in one hole might get out of it and
stop in another one, and the bigger and deeper the hole, the more we must tilt
the surface to get the ball out of it. In other words, more energy will be needed
to take the ball from one attractor state to another depending on how strongly
that state is attracting the ball.
In this metaphor, the table/surface is the learner (and their L2 developing
system); the ball is their pronunciation of the L2, in this case, the English vowels
/i ɪ ɛ æ u ʊ/; and the holes are the prototypical phonological categories of the L1,
in this case, Brazilian Portuguese /i ɛ u/. The energy to tilt the surface is related
to the nature, strength, frequency, quantity, quality, etc. of perturbation intro-
duced to the systems; in this case, perturbations might be language lessons, ex-
posure to the L2, interaction with L2 speakers, experiences abroad, etc. That is
why (ii) this paper seeks to compare two types of perturbation: having communi-
cative language lessons and having explicit instruction on pronunciation.
Another typical characteristic of complex dynamic systems is the non-
linearity between cause and effect, between perturbation and movement of the
system. To illustrate this characteristic, Bak and Weismann (1997) use the image
of someone dropping sand on a surface. In the beginning, it is possible to drop
several grains of sand, one onto the other, with the sand forming a cone-shaped
pile. However, as more grains are added to the system, the pile becomes steeper
and steeper, with the system reaching a critical point at which one single grain
of sand may cause an avalanche, which, in turn, may also cause other ava-
lanches, not predictable in number or dimension. As Johnson (1997) puts it, a
linear relation is like the volume knob of a radio, with each and every nuance of
change on the knob causing the same change of volume. A non-linear relation,
on the other hand, is like the tuning knob of a radio, for at the same time that a
small change on the knob might cause a great effect (getting out of a station),
great changes might also have no result at all (as when navigating through static
radio frequencies).
This means that potential effects of communicative language lessons or ex-
plicit instruction will probably not be seen equally among all learners, and
some effects might not be seen immediately, as they might contribute to getting
a learner’s L2 developing system closer to a critical point, but may not necessar-
ily cause the aforementioned avalanche. Added to the dynamic nature of such
systems, this means that L2 development is better studied through longitudinal
studies (De Bot and Larsen-Freeman 2011; Lima Jr 2016a; Verspoor, De Bot, and
Lowie 2011), and this is why (iii) the data presented in this paper comprise four
semesters of language development of the same learners – Brazilian college
students of English Language Teaching.
The word complex in complex dynamic systems means that the overall behav-
ior of the system is more than the mere sum of the behaviors of its components,
as its behavior emerges from the iterative interaction of the many components of
the system within themselves and with the environment. This makes the L2 learn-
ing experience extremely idiosyncratic, for each learner and their L2 developing
system may behave differently at different moments of their development. This
means that L2 phonological development is better examined in a longitudinal
study that, besides looking into group tendencies, also takes individual routes of
development into account (De Bot and Larsen-Freeman 2011; Lima Jr 2016a; Ver-
spoor, De Bot, and Lowie 2011), and that is why (iv) the inferences drawn in this
study come from a mixed-effects regression model fit to the data, which allows
for a view of group tendency while also looking into individual deviations from
this tendency.
Putting together items i–iv embedded in the previous paragraphs, the main
goal of this paper is revealed: to investigate possible effects of communicative
language lessons and of explicit pronunciation instruction on the development
of English vowels /i ɪ ɛ æ u ʊ/ by Brazilian learners in the first four semesters of
their college studies in English Language Teaching. The data are of acoustic na-
ture, the analysis was done by means of a mixed-effects regression model, and
the discussion is conducted in light of a complex dynamic system approach to
language development.
2 Method
2.1 Participants
Ten Brazilian college students majoring in English Language Teaching at a
university in Brazil contributed production data for this study. They were
all native speakers of Brazilian Portuguese, aged 18–20 at the first
recording, with no experience in an English-speaking country, no contact with
native speakers of English, no experience learning foreign languages other
than English, and no extracurricular English lessons. In
Brazil, English is mandatory in the seven years of middle and high school
(starting at 11 years of age), but with a focus on reading, so teenagers whose
parents want them to learn how to speak English (and who can afford it)
usually take extracurricular English courses at private language institutes,
which was not the case for any of the participants.
Despite not having had English speaking or pronunciation classes before
college studies, the participants had heterogeneous levels of general language
proficiency, with some learners at a more basic level, but others with fluent
conversational abilities from having studied on their own through podcasts, videos,
music, games, and other kinds of media. They had all been admitted to a
renowned university in Brazil to study to become English teachers, so a special
interest in the language and minimum knowledge of it to be able to attend col-
lege-level classes in English are common among them.
The participants followed the regular structure of their major and took at
least five mandatory courses per semester, each one with a total of 64 hours/se-
mester. In the first semester, one of the courses is taught in English and the
others in Portuguese. From the second semester on, all courses are taught en-
tirely in English in a Communicative Language Teaching style, that is, instructors
usually use authentic texts, videos, audio segments or other media to engage students
in discussions and in groupwork activities to explore grammar, vocabulary
and other language aspects raised by the material and the discussions. In the
third semester, they take a mandatory English Phonetics and Phonology course,
in which they gain basic technical knowledge of phonetics and phonology, and
practice pronunciation aspects of English that are particularly challenging for
Brazilian learners, including the vowels in focus in this study. Brazilian learners
tend to pursue an American-colored pronunciation, and most Brazilian English
teachers speak with an approximation of some North American dialect.
The initial idea was to follow the development of the pronunciation of Bra-
zilian college students majoring in English Language Teaching throughout
their entire college studies (9 semesters). A total of 50 students were invited to
participate in an oral production task once a semester, and 47 agreed to partici-
pate. However, participants either stopped fulfilling the criteria to be in the
study (i.e., take all mandatory courses without failing or dropping out of any of
them) or lost interest in taking part in the research along the way, and therefore
could not be contacted anymore or simply did not show up to be recorded, with
only four students recorded in the fifth semester, when data collection was in-
terrupted. This is a typical challenge of many longitudinal studies, which
makes longitudinal data of L2 speakers even more valuable.1
1 Participants in this study are coded with letters, and this loss of participants along the way
is the reason some letters are skipped in this paper (see Table 1). The data of the missing
participants appeared in a preliminary analysis previously reported (Lima Jr 2016b), but they did
not do all four recordings reported in this paper.
2.2 Data
The ten participants were recorded individually at the end of the first, second,
third and fourth semesters of their college studies in English Language Teach-
ing. The recordings were conducted in a silent room with a supercardioid Shure
150B lapel microphone connected to a Zoom 4Hn recorder. The audio was cap-
tured in mono, with a sampling rate of 44 kHz, and later saved in .wav format.
Students were recorded reading words inserted in the carrier sentence
“I said token this time”, which controls for the prosodic context of the target
word. The corpus was composed of three words for each target vowel. The
words, presented in Table 1, were all monosyllabic and with a CVC structure,
with most Cs being voiceless plosives, to prevent acoustic bias from neighbor-
ing segments and to help later identify, segment and label the vowels in PRAAT
(Boersma and Weenink 2019).
Table 1: Corpus for data collection.
/i/     /ɪ/     /ɛ/     /æ/     /u/     /ʊ/
peak    pick    peck    pack    boot    book
Pete    Pitt    pet     pat     poop    put
teak    tick    tech    tack    toot    took
The sentences were shown in a slide presentation, with each slide containing the
carrier sentence with a different, randomly selected target word. Each word was
presented four times, generating 12 tokens per vowel per participant, which gen-
erated 72 tokens per participant, and a total of 720 vowels per semester. In the
end, 2,880 vowels were identified, segmented, and labeled in PRAAT.
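The token counts above follow directly from the elicitation design, and the arithmetic can be retraced in a few lines. The sketch below uses only the figures stated in the text; the variable names are illustrative and not part of the original study materials.

```python
# Design figures taken from the text; names are illustrative only.
WORDS_PER_VOWEL = 3   # Table 1: three words for each target vowel
REPETITIONS = 4       # each word was presented four times
TARGET_VOWELS = 6     # /i ɪ ɛ æ u ʊ/
PARTICIPANTS = 10
RECORDINGS = 4        # one recording at the end of each semester

tokens_per_vowel = WORDS_PER_VOWEL * REPETITIONS
tokens_per_participant = tokens_per_vowel * TARGET_VOWELS
tokens_per_semester = tokens_per_participant * PARTICIPANTS
total_tokens = tokens_per_semester * RECORDINGS

print(tokens_per_vowel, tokens_per_participant,
      tokens_per_semester, total_tokens)
```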
A common method to extract formant values is through Linear Predictive
Coding (LPC), which is an algorithm that decomposes the acoustic signal and
estimates the resonances generated in the vocal tract. However, automatic
LPC analyses have been criticized (e.g., Vallabha and Tuller 2002; Wempe and
Boersma 2003) because they may introduce systematic errors in the formant
extraction depending on the parameters set beforehand by the researcher.
With the automatic LPC analysis, the researcher needs to define the order of
the LPC (i.e., the quantity of formants to be found) and the maximum (ceiling)
frequency in which to look, which is usually set as 5 kHz for men and 5.5 kHz for
women. However, different men and women might have different frequency ceil-
ings, which, if not set accordingly, might lead the LPC into identifying peaks that
do not exist and overlooking peaks that do.
A solution to this issue is to double-check the fit of the LPC estimation of
each vowel to the Fast Fourier Transform spectrum. Even though this method is
more time-consuming, it allows the researcher to adjust, when necessary, the
ceiling frequency or the order of the LPC for specific speakers. This is how the
F1 and F2 values of this study were extracted, with the assistance of two PRAAT
scripts (Arantes 2010, 2011).
Once extracted, F1 and F2 values were used to create vowel space plots for
individual speakers to compare their development over time. F1 and F2 values
were also used to calculate the Euclidean Distances between the vowels in each
target pair for each learner to compare those distances from semester to semes-
ter, and to compare them with the distances found for ten native speakers of
American English published previously (Lima Jr 2015).
The Euclidean Distance is a measure of dissimilarity that can be used to
measure the distance between two points in a Cartesian (x-y) coordinate system,
which is the case of F1-F2 vowel space plots. It is the square root of the sum of the
squared differences in F1 and F2 values between two vowels.2 To prevent the bias of F2
values, which increase in much larger increments than F1, the values used for
Euclidean Distance calculations were z-score normalized. Finally, the Euclidean
Distance values were used to fit a Bayesian mixed-effects model.
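The distance measure described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the formant values are invented, and the choice to z-score each formant over all of one speaker's tokens in one recording is an assumption, since the text does not specify the normalization domain.

```python
import math
import statistics


def z_scores(values):
    """Z-score a list of values using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def euclidean_distance(point_a, point_b):
    """Distance between two (F1, F2) points in the vowel space."""
    (f1_a, f2_a), (f1_b, f2_b) = point_a, point_b
    return math.sqrt((f1_a - f1_b) ** 2 + (f2_a - f2_b) ** 2)


# Hypothetical tokens (vowel, F1 in Hz, F2 in Hz); NOT data from the study.
tokens = [
    ("i", 300.0, 2300.0), ("i", 320.0, 2250.0),
    ("ɪ", 420.0, 2000.0), ("ɪ", 440.0, 1950.0),
    ("ɛ", 600.0, 1800.0), ("ɛ", 620.0, 1780.0),
    ("æ", 610.0, 1790.0), ("æ", 630.0, 1770.0),
]

# Assumption: normalize each formant over all tokens of this one recording.
f1_z = z_scores([f1 for _, f1, _ in tokens])
f2_z = z_scores([f2 for _, _, f2 in tokens])


def mean_point(vowel):
    """Mean normalized (F1, F2) of all tokens of one vowel."""
    idx = [i for i, (v, _, _) in enumerate(tokens) if v == vowel]
    return (statistics.mean([f1_z[i] for i in idx]),
            statistics.mean([f2_z[i] for i in idx]))


dist_i_ih = euclidean_distance(mean_point("i"), mean_point("ɪ"))
dist_eh_ae = euclidean_distance(mean_point("ɛ"), mean_point("æ"))
print(f"/i ɪ/: {dist_i_ih:.2f}   /ɛ æ/: {dist_eh_ae:.2f}")
```

With these invented values the /i ɪ/ pair comes out far more distant than the /ɛ æ/ pair, which is the kind of contrast the distance is meant to quantify. Normalizing first keeps the larger raw F2 differences from dominating the result.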
3 Results
The first step in the analysis was to visually inspect individual vowel spaces,
comparing the distributions of each speaker’s vowels across the four recordings.
For this comparison, the vowel spaces were plotted by recording (four plots per
speaker), and each plot contained every occurrence of the six English vowels,
as well as the mean F1 and F2 values for each vowel. Figure 1 shows an example
of such a plot, containing the vowels of speaker A in their first recording.
The individual vowels produced by the speaker are plotted as the smaller
and lighter phonetic symbols, and the larger and darker phonetic symbols mark
the mean values of F1 and F2 for each vowel. The ellipses represent one
standard deviation from the mean.
In the vowel space of Figure 1, it is easy to see that speaker A already had
two separate categories for the /i ɪ/ pair. The occurrences of each of these vow-
els are very far from one another, generating clearly separated averages with
2 $\sqrt{(F1_a - F1_b)^2 + (F2_a - F2_b)^2}$.
154 Ronaldo Lima Jr
Figure 1: Vowel space of speaker A’s productions in recording 1. Smaller and lighter
phonetic symbols are the individual vowels produced by the speaker, plotted at their F1-F2
coordinates; larger and darker phonetic symbols are located at the mean F1-F2 values for
each vowel, surrounded by a 1-standard-deviation ellipse. Colors represent different vowels.
ellipses that do not touch each other. It is also easy to see that, in
contrast, speaker A’s /ɛ/ and /æ/ overlap completely, with nearly
identical F1-F2 means and ellipses that overlap almost entirely. The vowels /u ʊ/,
despite being slightly more separated than /ɛ æ/, still occupy the same area of the
vowel space, with their ellipses overlapping almost completely.
To compare possible changes throughout the four recordings, images of the
four vowel spaces, like the ones in Figure 2, were inspected.
By visually inspecting the four vowel spaces, one can see that: (i) the /i ɪ/
pair is kept as separate categories throughout the four recordings; (ii) the /ɛ æ/
pair separates in the third recording (possibly as an effect of the English
Phonetics and Phonology course) and remains separate in the fourth recording;
and (iii) the /u ʊ/ pair seems to be separating in the third recording, still with
some overlap of the ellipses, but overlaps again in the fourth recording.
The four plots depict the dynamic nature of phonological development as
well as its gradient emergence. Sometimes it was not easy to decide whether
Figure 2: Vowel spaces of speaker A’s productions in the four recordings. Smaller and lighter
phonetic symbols are the individual vowels produced by the speaker, plotted at their F1-F2
coordinates; larger and darker phonetic symbols are located at the mean F1-F2 values for
each vowel, surrounded by a 1-standard-deviation ellipse. The vowel space in the top-left
corner is the first recording, top right is the second, bottom left is the third, and bottom
right is the last recording. Colors represent different vowels.
two vowels were overlapping or whether the distance between them would be enough
for the productions to be heard as two different vowels. Also, the data collected
are simply four photographs of the vowel spaces at four different points in time
within two years of language development, and different configurations probably
occurred at other moments within those two years, in both directions, bringing
pairs of target vowels closer together and moving them farther apart. Nonetheless,
one recording per semester was what was feasible at the time, and, for research
purposes, it is useful to classify the pairs of target vowels in each
recording for every participant as either overlapping or distinct, so
some criteria were needed.
When at least half of the ellipses of two vowels overlapped, the vowels were
considered overlapping right away; when less than half of the ellipses
overlapped, or when they did not overlap at all, the vowels were marked as
potential candidates for separate vowel categories. To confirm the status of
those candidates, the Euclidean Distance between the vowels in each pair was
used. As explained in the Method section, the Euclidean Distance is a measure
of dissimilarity that can be used to calculate the distance between two points
in an x-y plane, like the F1-F2 vowel space. Since F2 values change in greater
increments than F1 values, the Euclidean Distances need to be calculated with
normalized/standardized values (z-scores in this case). To give the reader an
idea of the scale of distances resulting from this
calculation of Euclidean Distances with normalized F1-F2 values, Figure 3
presents the productions of speaker A in all four recordings with the distances
between /i/ and /ɪ/ (1.26, 1.14, 1.14 and 1.02 for recordings 1, 2, 3 and 4) and the
distances between /ɛ/ and /æ/ (0.09, 0.12, 0.97 and 0.68, respectively) marked
on the plot.
In a previous study with the same method of data collection and analysis
(Lima Jr 2015), the Euclidean Distances between the normalized mean formant
values of a group of ten native speakers of American English were reported as
0.46 for /i ɪ/, 0.38 for /ɛ æ/ and 0.33 for /u ʊ/. Therefore, in this study, the
potentially separate vowels (based on the overlap of the ellipses) were in fact
considered separate vowel categories only if their Euclidean Distance was at
least 0.3, a threshold just below the smallest of the native-speaker distances.
It is based on these two criteria that Table 2 shows in which recordings there
is a contrast between the target vowels for each speaker.
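The two criteria can be summarized in a short Python sketch. It is only an illustration of the decision rule as described here: the function name `is_separate` and the idea of expressing ellipse overlap as a fraction between 0 and 1 are assumptions, since in the study the overlap was judged visually. The distances for speaker A's /ɛ æ/ pair are the ones reported above; the overlap fractions paired with them are hypothetical.

```python
# Cutoffs taken from the criteria described in the text.
OVERLAP_CUTOFF = 0.5    # "at least half of their ellipses overlapping"
DISTANCE_CUTOFF = 0.3   # minimum normalized Euclidean Distance

# Native-speaker reference distances reported from Lima Jr (2015).
NATIVE_DISTANCES = {"i-ɪ": 0.46, "ɛ-æ": 0.38, "u-ʊ": 0.33}


def is_separate(overlap_fraction, distance):
    """Apply the two-step rule to one vowel pair in one recording.

    overlap_fraction: share of the ellipses that overlap (hypothetical
    numeric stand-in for the visual judgment).
    distance: normalized Euclidean Distance between the two vowel means.
    """
    if overlap_fraction >= OVERLAP_CUTOFF:
        return False  # step 1: considered overlapping right away
    return distance >= DISTANCE_CUTOFF  # step 2: confirm the candidate


# Speaker A, /ɛ æ/, recordings 1 to 4: distances from the chapter,
# overlap fractions invented for illustration only.
distances = [0.09, 0.12, 0.97, 0.68]
overlaps = [0.95, 0.90, 0.10, 0.20]

decisions = [is_separate(o, d) for o, d in zip(overlaps, distances)]
print(decisions)

# The cutoff sits just below the smallest native-speaker distance.
print(DISTANCE_CUTOFF < min(NATIVE_DISTANCES.values()))
```

Under these assumed overlap values, the rule gives no contrast in the first two recordings and a contrast in the last two, in line with the entries for speaker A's /ɛ æ/ pair in Table 2.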
As can be seen, there are all types of developmental routes: a learner who
did not develop separate vowel categories at all (speaker D); learners who
developed them along the way, especially after taking the English Phonetics and
Phonology course (recording 3; learners A and N, for instance); and learners who
created new phonetic categories but then lost them (K and L). Of the 10 learners, 7
already had separate vowel categories for [i ɪ] in recording 1, and the other 3 learners
Figure 3: Euclidean Distances between /i/ and /ɪ/ and between /ɛ/ and /æ/ for Speaker
A in all four recordings.
Table 2: Pairs of target vowels consisting of two separate categories marked YES for each
recording of every participant.
Speaker  Recording  /i ɪ/  /ɛ æ/  /u ʊ/    Speaker  Recording  /i ɪ/  /ɛ æ/  /u ʊ/
A        1          YES    no     no       G        1          YES    no     no
         2          YES    no     no                2          YES    no     YES
         3          YES    YES    YES               3          YES    no     YES
         4          YES    YES    YES               4          YES    no     YES
B        1          YES    no     no       K        1          YES    no     no
         2          YES    no     no                2          YES    no     no
         3          YES    no     no                3          YES    YES    no
         4          YES    no     no                4          YES    no     no
D        1          no     no     no       L        1          YES    no     YES
         2          no     no     no                2          YES    no     YES
         3          no     no     no                3          YES    YES    YES
         4          no     no     no                4          YES    no     no
E        1          no     no     YES      M        1          no     YES    no
         2          no     no     YES               2          no     YES    no
         3          no     no     YES               3          no     YES    no
         4          no     no     no                4          no     YES    no
F        1          YES    no     no       N        1          YES    no     YES
         2          YES    no     no                2          YES    no     YES
         3          YES    YES    no                3          YES    YES    YES
         4          YES    YES    YES               4          YES    YES    YES
did not develop these categories in the other three recordings. For the /ɛ æ/ pair,
only one student already had separate categories in recording 1 (M),
three learners developed separate categories in recording 3 (right after
the Phonetics and Phonology course) and kept them in recording 4 (A, F and N),
and two participants also produced separate vowels in recording 3 but
no longer in recording 4 (K and L). For the high back vowels, two learners
produced them separately in recordings 1 through 3 but not in recording 4 (E and L),
three learners created separate vowel categories for them along the way (A, F and
G), and only one had them separate from recording 1 onwards (N). Only
three learners reached recording 4 with separate phonetic categories for all three
pairs (A, F and N). The column with the most YES entries is the one for the /i ɪ/
pair, and the one with the fewest is the one for /ɛ æ/, confirming previous findings
that, of those three pairs, /ɛ æ/ is the most challenging for Brazilians (Lima Jr 2015).
Lastly, in an attempt to obtain a general vowel development index for each
learner and for the group as a whole, the sum of the Euclidean Distances of the
three target pairs of vowels was used to fit a Bayesian mixed-effects model. The
expectation was that learners would increase their distances as they advanced
in their studies. In the model, the fixed effects were the intercept and the
slope of the trend for the population of all 10 learners, and the random effects
were the deviations in intercept and in slope of each subject’s own trend
from the population values.3 Regularizing priors were used,4 allowing for sums of
Euclidean Distances within a realistic range, and allowing for both positive and
3 Model: sum.euclidean.dist ~ recording + (recording|subject).
4 Priors were set as normal distributions, with mean 1.5 and standard deviation 1 for the
intercept, mean zero and standard deviation 1 for the slopes, and mean zero (truncated at
zero) and standard deviation 0.5 for sigma.
negative slopes. Figure 4 presents graphs containing the main results of the
model.
In Figure 4, each panel represents one participant. The four black dots in
each panel are the sums of the Euclidean Distances of the three pairs of L2
vowels in each recording, and the black dashed line, which is repeated in every
individual plot, represents the tendency of the group as a whole, derived from
the fixed effects given by the model,5 which favors the hypothesis that learners
should increase their distances with time of study. The orange lines in each
graph are a sample of 100 probable lines predicted by the model for each
speaker considering the random effects. They show that not all speakers had a
positive correlation between the sums of Euclidean Distances and time. The
expectation was that learners should increase the distances between contrasting
vowels as they advance in their study of English, but only six of them (A, B, F,
G, L, N) ended up with a clear positive correlation, some of them with lines
much higher and steeper than that of the group tendency. Of the other four,
one had a clear negative slope (D), and the others had lines that indicate
either stagnation or extremely mild movements. This result highlights
the degree of variance found among speakers.
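How a group line and an individual line relate in a model of this kind can be sketched in a few lines of Python. This is only an illustration of the linear structure (group value plus individual deviation), not a re-fit of the model: the function name `trend`, the coding of the recordings as 0 to 3, and the deviations assigned to the two illustrative learners are all assumptions. The group intercept and slope are the posterior medians reported for the fixed effects, and 1.17 is the native-speaker reference sum.

```python
# Posterior medians reported for the group-level (fixed) effects.
GROUP_INTERCEPT = 1.16
GROUP_SLOPE = 0.14
# Sum of the native speakers' Euclidean Distances for the three pairs.
NATIVE_SUM = 1.17


def trend(recording, intercept_dev=0.0, slope_dev=0.0):
    """Sum of Euclidean Distances predicted by one linear trend.

    intercept_dev and slope_dev stand for a learner's deviations (random
    effects) from the group values; with both at zero this is the group line.
    """
    return (GROUP_INTERCEPT + intercept_dev) + (GROUP_SLOPE + slope_dev) * recording


# Assumption: recordings are coded 0, 1, 2, 3 (the coding is not stated).
RECORDINGS = (0, 1, 2, 3)

group_line = [round(trend(r), 2) for r in RECORDINGS]

# Hypothetical deviations for two illustrative learners (not study estimates).
rising_learner = [round(trend(r, 0.2, 0.15), 2) for r in RECORDINGS]
falling_learner = [round(trend(r, -0.3, -0.2), 2) for r in RECORDINGS]

print("group:  ", group_line)
print("rising: ", rising_learner)
print("falling:", falling_learner)
print("below native reference:", [y < NATIVE_SUM for y in falling_learner])
```

The same group values therefore accommodate learners whose own lines rise faster than the group line as well as learners whose lines fall, which is the kind of individual variation the orange lines display.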
Lastly, the dotted blue line, with no slope and repeated in all individual
plots at 1.17, marks the sum of the Euclidean Distances from normalized mean
F1-F2 values of a group of ten native speakers of American English. This serves
as a reference, showing that three of the four learners that showed no satisfactory
progress (D, E and M) had sums of Euclidean Distances below that of the group
of native speakers; and that all learners with positively correlated lines had dis-
tances above that of the native speakers. Most learners produced their vowels
with Euclidean Distances greater than those of the group of native speakers
(above the fixed dotted line). This does not mean that they necessarily produced
vowels in separate phonetic categories because, in many cases, even though the
mean F1-F2 values were somewhat distant, the one-standard-deviation ellipses in
their vowel spaces were still overlapping due to variability, which did not happen
with the group of native speakers. This means that at some point in their
developmental routes, the learners were able to produce some of the target words with
distinct vowel categories, but not all of them, or not all the time, resulting in
great variance, and thus large ellipses in their vowel spaces, whereas the native
speakers were able to maintain their vowel categories completely separate (with
5 More specifically, the median of the posterior distributions for the intercept (1.16) and the
slope (0.14).
Figure 4: Result from the Bayesian mixed-effects model fit to the sum of Euclidean Distances.
Each panel corresponds to one participant; the black dots are the sums of the Euclidean
Distances of the three pairs of L2 vowels in each recording, four for each participant; the black
ellipses far from each other) at smaller Euclidean Distances. It is as if consistency
may compensate for a smaller distance.
4 Discussion
There was a lot of variability in the observed development of the learners, which
was expected given that each learner has their own complex dynamic L2 system.
Each system is made up of so many elements, whose interactions with one another
and with the environment make L2 performance emerge, that it
is unrealistic to expect all learners to exhibit the same developmental pattern.
Each lesson, be it a holistic communicative lesson or some explicit pronunciation
instruction, is a perturbation of the system, but each system is at a different
stage, some closer to a critical point that might lead to an avalanche (returning
to the metaphor from the introduction) and others still at the beginning of the
sand-accumulation process. Since the cause-effect relation is non-linear in complex
dynamic systems, it is only natural that one observes different behaviors
from different learners’ L2 systems.
This confirms the need to analyze L2 developmental data individually,
even if also looking into group tendencies, for a lot of information is lost in a
more traditional design looking only at grouped data (Lima Jr 2016a; Verspoor,
Lowie, and Van Dijk 2008; Verspoor and Van Dijk 2012). A linear regression
looking only at group tendencies would lead one to ignore the fact that some
students really excelled in their developmental (increasing) trajectory, such as
participants A, F, G, L and N (see their [orange] trend lines in Figure 4); and to
also ignore participants who had decreasing trend lines, such as speaker D.
Another characteristic of complex dynamic systems is that they are sensi-
tive to initial/previous states. Among the productions of all learners, there was
a total of 11 vowel contrasts already present in recording 1 (7 for /i ɪ/, 1 for /ɛ æ/
and 3 for /u ʊ/). Even controlling for some individual variables (only learners
who had never been to an English-speaking country, did not have contact with
English native speakers, and had not taken extracurricular English lessons
Figure 4 (continued)
dashed line is the trend line for the group; the orange lines are probable trend lines from the
model for each participant; and the blue dotted line is the sum of Euclidean Distances for the
three target pairs of vowels for a control group of native speakers of English.
could participate), learners still arrive at college with different experiences in
the L2, different quantities and types of exposure to media in the L2, and different
levels of motivation to learn the L2, to mention just a few of the individual variables
at play. Even if students had taken a placement test before the beginning of this
research, this would be just one more variable to be considered, because being placed
at the same proficiency level by a standardized test, although useful for research (and
I am not arguing against it), does not imply being at the same (initial) state. Also, the
goal of this study was to investigate the effects of communicative lessons and
of explicit pronunciation instruction on the development of English vowels by
Brazilian learners, especially the emergence of some distance between the vowels
in each challenging pair, but it is necessary to acknowledge the emergence
of distinct vowel categories through reading-based lessons in regular school
and/or through self-study with media in English before getting to college.
As already mentioned in the results section, since the development of the
L2 system is dynamic, with potential ongoing changes all the time, the data
collected and analyzed are just four photographs of the current yet momentary
state of the students’ L2 vowels at the end of each school term. Considering the
emergence of new vowel contrasts at those four moments, only one new vowel
contrast emerged in the second recording (between /u ʊ/ for speaker G), when
students had had communicative lessons in English (but just one 64-hour course).
In the third recording, on the other hand, when students had taken the English
Phonetics and Phonology course, six new vowel contrasts were observed (between
/ɛ æ/ for speakers A, F, K, L and N, and between /u ʊ/ for speaker A). Besides the
Phonetics and Phonology course, students also took four other communicative-based
64-hour courses in English, which might also have influenced their development.
In the last recording, when students had taken five other communicative-based
64-hour courses in English, the emergence of one more vowel contrast (between
/u ʊ/ for speaker F) was observed, as was the “loss” of four vowel contrasts
(between /ɛ æ/ for speakers K and L, and between /u ʊ/ for speakers E and L);
notice that speaker L “lost” two vowel contrasts in recording 4.
The contrasts apparently “unlearned” at some point also reveal the non-
linear nature of language development, showing that the system is constantly
moving. This “unlearning” could be a real forgetting of the contrasts, which
could be retriggered by some other perturbation of the system; but it could also
represent an attempt to adjust the entire system to new perturbations, which
could eventually take the system to an even better state in terms of L2 command.
The non-linearity between cause and effect also accounts for the fact that not all
students immediately created new vowel categories after taking the English Pho-
netics and Phonology course, and that one learner in particular (speaker D) did
not display any vowel contrast whatsoever in any of the recordings. It is possible
that, later on, and triggered by other perturbations of their systems, those learn-
ers that showed no (immediate) effect will move their systems away from the at-
tractor states of the prototypical L1 vowel categories.
In total, there were 11 vowel contrasts in the first recording, 12 in the second,
18 in the third (the semester they took the Phonetics and Phonology course), and
15 in the last one. No student had distinct vowels for all three pairs in the first
two recordings; three learners presented distinct vowels for all three pairs in the
third recording (speakers A, L and N); and, in the last recording, speaker L did
not present the contrasts in all three pairs anymore, but another learner
(speaker F) showed distinct vowels in all pairs. Together with the discussion
so far, this result highlights the positive role of explicit pronunciation
instruction in the development of new vowel categories, without diminishing
the positive influence that communicative lessons also have in creating and/or
maintaining newly created vowel contrasts.
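The totals and the gains and losses discussed above can be checked directly against Table 2. The following Python sketch transcribes the table (one string per recording, with one Y/n flag per vowel pair) and tallies it; the encoding and the function names `contrasts_per_recording` and `changes` are choices made here for illustration and are not part of the original analysis.

```python
PAIRS = ("i-ɪ", "ɛ-æ", "u-ʊ")

# Table 2 transcribed: for each speaker, recordings 1 to 4 in order;
# each string holds one flag per pair in PAIRS order (Y = YES, n = no).
TABLE_2 = {
    "A": ["Ynn", "Ynn", "YYY", "YYY"],
    "B": ["Ynn", "Ynn", "Ynn", "Ynn"],
    "D": ["nnn", "nnn", "nnn", "nnn"],
    "E": ["nnY", "nnY", "nnY", "nnn"],
    "F": ["Ynn", "Ynn", "YYn", "YYY"],
    "G": ["Ynn", "YnY", "YnY", "YnY"],
    "K": ["Ynn", "Ynn", "YYn", "Ynn"],
    "L": ["YnY", "YnY", "YYY", "Ynn"],
    "M": ["nYn", "nYn", "nYn", "nYn"],
    "N": ["YnY", "YnY", "YYY", "YYY"],
}


def contrasts_per_recording(table):
    """Total number of YES cells in each of the four recordings."""
    return [sum(rows[r].count("Y") for rows in table.values()) for r in range(4)]


def changes(table, r):
    """Contrasts gained and lost between recording r+1 and r+2 (r is 0-based)."""
    gained, lost = [], []
    for speaker, rows in table.items():
        for pair, before, after in zip(PAIRS, rows[r], rows[r + 1]):
            if before == "n" and after == "Y":
                gained.append((speaker, pair))
            elif before == "Y" and after == "n":
                lost.append((speaker, pair))
    return gained, lost


def all_three(table, r):
    """Speakers with a contrast in all three pairs in recording r+1."""
    return [s for s, rows in table.items() if rows[r] == "YYY"]


print(contrasts_per_recording(TABLE_2))
for r in range(3):
    gained, lost = changes(TABLE_2, r)
    print(f"recording {r + 1} -> {r + 2}: gained {gained}, lost {lost}")
print(all_three(TABLE_2, 2), all_three(TABLE_2, 3))
```

The tally reproduces the counts given in the text: 11, 12, 18 and 15 contrasts across the four recordings, one gain before recording 2, six gains before recording 3, and one gain and four losses before recording 4.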
Finally, the results section attempted to categorize students’ productions
into “yes” and “no” concerning the presence of separate vowels in the three
pairs in focus. However, language development is not categorical, but gradient
in nature. It was not always easy to decide if two vowels should be considered
“with” or “without” a contrast. That is why some criteria needed to be defined
and followed for the categorization of the results. Nevertheless, the gradience
found in the data cannot be overlooked. There were cases of students classified
with “no contrast”, for instance, who were on the brink of creating new catego-
ries. The binary classification of participants may give the wrong impression
that all learners with a “no” in Table 2 produced the contrasts equally over-
lapped, which was not the case. Some students moved their vowels apart, just
not enough to fulfill the pre-established criteria. Likewise, not all speakers with
contrasting vowels in Table 2 produced them equally well. Some produced
them at the threshold of the criteria, whereas others produced truly separated
vowels, with the ellipses far from touching each other. There was variation
even within the same speaker. Speakers F, G and K, for instance, all marked
with separate categories for [i ɪ] in all recordings, produced contrasts much
more separate in the last two recordings, showing an influence of the explicit
instruction that the categorical treatment of the data does not capture.
5 Conclusion
The goal of this study was to investigate possible effects of communicative lan-
guage lessons and of explicit pronunciation instruction on the development of
English vowels /i ɪ ɛ æ u ʊ/ by Brazilian learners in the first four semesters of
their college studies in English Language Teaching. This was done by analyzing
the emergence of new vowel categories for the L2 vowels, and the developmen-
tal route of each learner through visual inspection of vowel spaces, calculation
of Euclidean Distances between contrasting vowels, and the results of a Bayes-
ian mixed-effects model with the sum of the Euclidean Distances, which helped
look both into group trend and individual variation.
The analyses showed a lot of variability in the development of the target
vowels by the learners, which is expected when L2 developing systems are seen
as complex dynamic systems. Many learners developed new vowel categories
throughout the first four semesters, and more contrasts are expected to develop
as they continue their studies. The main conclusion is that, even though com-
municative lessons play an important role in the development (and also in the
maintenance) of the L2 vowel system, explicit pronunciation instruction had a
greater impact on the emergence of new vowel contrasts.
Future investigations of this nature should include an analysis of the duration
of the vowels as well as the analysis of less monitored production (reading a text
or speaking spontaneously). Future research could also include perceptual studies
as an attempt to witness the emergence of both perceptual and productive vowel
categories. Lastly, as has been argued throughout this paper, investigations of L2
development are more informative if done with longitudinal data, so collecting
data at more time points within those years and/or for more than two years
would provide even more information from which to draw inferences about
the L2 developmental process.
References
Arantes, Pablo. 2010. Formants.Praat. [Computer software].
Arantes, Pablo. 2011. Collectformants.Praat. [Computer software].
Bak, Per & Michael Weissman. 1997. How nature works: The science of self-organized
criticality. American Journal of Physics 65(6). 579–80.
Beckner, Clay, Richard Blythe, Joan Bybee, Morten H Christiansen, William Croft, Nick C. Ellis,
John Holland, Jinyun Ke, Diane Larsen-Freeman & Tom Schoenemann. 2009. Language is
a complex adaptive system: position paper. Language Learning 59(s1). 1–26.
Bion, Ricardo Augusto Hoffmann, Paola Escudero, Andréia S. Rauber & Barbara O. Baptista.
2006. Category formation and the role of spectral quality in the perception and
production of English front vowels. In Richard M. Stern (ed.), Ninth International
Conference on Spoken Language Processing, Pittsburgh, USA, 2006, 1363–1366. Baixas,
France: International Speech Communication Association.
software]. Version 6.1.03. http://www.praat.org/ (accessed 8 October 2019).
Cristófaro Silva, Thaïs. 2003. Descartando fonemas: a representação mental da fonologia de
uso [Discarding phonemes: the mental representation of use phonology]. In Dermeval da
Hora & Gisela Collischonn (eds.), Teoria Linguística: Fonologia e Outros Temas [Linguistic
Theory: Phonology and other topics], 200–251. João Pessoa: Editora Universitária.
De Bot, Kees. 2008. Introduction: second language development as a dynamic process. The
Modern Language Journal 92(2). 166–178.
De Bot, Kees & Diane Larsen-Freeman. 2011. Researching second language development from
a Dynamic Systems Theory perspective. In Marjolijn Verspoor, Kees De Bot & Wander
Lowie (eds.), A Dynamic Approach to Second Language Development: Methods and
Techniques, 5–24. Amsterdam: John Benjamins Publishing.
De Bot, Kees, Wander Lowie & Marjolijn Verspoor. 2007. A dynamic systems theory approach
to Second Language Acquisition. Bilingualism: Language and Cognition 10(1). 7–21.
Flege, James Emil. 1995. Second language speech learning: theory, findings, and problems. In
Winifred Strange (ed.), Speech perception and linguistic experience: Issues in cross-
language research, 233–277. York: York Press.
Flege, James Emil & Ocke-Schwen Bohn. 2021. The revised Speech Learning Model (SLM-R). In
Johnson, Keith. 1997. Acoustics and Auditory Phonetics. Malden: Blackwell Publishing.
Kuhl, Patricia, Barbara T. Conboy, Sharon Coffey-Corina, Denise Padden, Maritza Rivera-
Gaxiola & Tobey Nelson. 2008. Phonetic learning as a pathway to language: new data
and native language magnet theory expanded (NLM-E). Philosophical Transactions of the
Royal Society B: Biological Sciences 363(1493). 979–1000.
Larsen-Freeman, Diane. 1997. Chaos/Complexity science and Second Language Acquisition.
Leather, Jonathan. 2003. Phonological acquisition in multilingualism. In María del Pilar
García Mayo and María Luisa García Lecumberri (eds.), Age and the Acquisition of English
as a Foreign Language, 23–58. Clevedon: Multilingual Matters.
Lima Jr, Ronaldo Mangueira. 2013. Complexity in second language phonology acquisition.
Revista Brasileira de Lingüística Aplicada 13(2). 549–576.
Lima Jr, Ronaldo Mangueira. 2015. A influência da idade na aquisição de seis vogais do inglês
por alunos brasileiros [The influence of age on the acquisition of six English vowels by
Brazilian learners]. Organon 30(58). 15–31.
Lima Jr, Ronaldo Mangueira. 2016a. A necessidade de dados individuais e longitudinais para
análise do desenvolvimento fonológico de L2 como sistema complexo [The need of
individual and longitudinal data for the analysis of L2 phonological development as a
complex system]. ReVEL 14(27). 203–225.
Lima Jr, Ronaldo Mangueira. 2016b. Análise longitudinal de vogais do inglês-L2 de brasileiros:
dados preliminares [A longitudinal analysis of English-L2 vowels by Brazilians:
preliminary data]. Gradus: Revista Brasileira de Fonologia de Laboratório [Brazilian
Journal of Laboratory Phonology] 1(1). 145–176.
Nobre-Oliveira, Denize. 2007. The effect of perceptual training on the learning of English
vowels by Brazilian Portuguese speakers. Florianópolis: Universidade Federal de Santa
Catarina dissertation.
Pierrehumbert, Janet. 1990. Phonological and phonetic representation. Journal of Phonetics
18(3). 375–394.
Rauber, Andréia Schurt. 2006. Perception and production of English vowels by Brazilian EFL
speakers. Florianópolis: Universidade Federal de Santa Catarina dissertation.
Vallabha, Gautam K. & Betty Tuller. 2002. Systematic errors in the formant analysis of steady-
state vowels. Speech Communication 38(1). 141–160.
Verspoor, Marjolijn, Kees De Bot & Wander Lowie. 2011. A Dynamic Approach to Second
Language Development: Methods and Techniques. Amsterdam: John Benjamins.
Verspoor, Marjolijn & Marijn Van Dijk. 2012. Variability in a Dynamic Systems Theory
approach to Second Language Acquisition. The Encyclopedia of Applied Linguistics,
6051–6059.
Verspoor, Marjolijn, Wander Lowie & Marijn Van Dijk. 2008. Variability in Second Language
Development from a Dynamic Systems Perspective. The Modern Language Journal 92(2).
214–231.
Wempe, Ton & Paul Boersma. 2003. The interactive design of an F0-related spectral analyser.
In Maria-Josep Solé, Daniel Recasens & Joaquín Romero (eds.), Proceedings of the 15th
International Congress of Phonetic Sciences, Barcelona, 2003, 1–4. Rundle Mall: Causal
Productions Pty Ltd.
Tim Kochem, Idée Edalatishams, Lily Compton, Elena Cotos
An extra layer of support: Developing
an English-speaking consultation program
Abstract: Pronunciation is an important aspect of effective communication in
academic settings, including for graduate students and postdoctoral scholars
(Ranta and Meckelborg 2013; Yanagi and Baker 2016). However, the level of
support these postgraduate students and scholars may require, or desire, can vary
greatly depending upon their aptitude and proficiency, as well as their future
plans for staying in the target language environment. A common belief about lan-
guage learning is that studying in a naturalistic setting presents learners with
ample opportunities for authentic exposure, which should be sufficient for growth
(Lightbown and Spada 2006). However, pronunciation has been shown to rarely
improve after the first year in a target language environment without explicit
instruction (Derwing and Munro 2013). This leaves many learners in a precarious
position: their pronunciation is good enough not to require additional coursework,
but still problematic enough to require additional assistance to reduce
intelligibility errors. At Iowa State University, such assistance
is provided through English-Speaking Consultations (ESCs), which offer students
one-on-one pronunciation practice focusing on specific segmental and supraseg-
mental features of English, and also focusing on more general needs related to
highly contextualized oral communication tasks (e.g., a conference presenta-
tion). Importantly, the ESCs draw upon the technological pedagogical content
knowledge framework (TPACK: Mishra and Koehler 2007), and the training of
consultants incorporates content knowledge (e.g., phonetics and phonology),
pedagogical knowledge (e.g., learner strategies and task-based learning), and
technological knowledge (e.g., tools for conferencing and for language prac-
tice), leveraging the intersections of these three areas. This chapter describes
the ESCs as well as the training of consultants, both of which can serve as
models for academic centers or communication and language programs at
other universities willing to develop or expand programs focused on pronun-
ciation support for their international students.
Keywords: international graduate students, English pronunciation instruction,
oral communication, language teacher training, technology-enhanced pronunciation
instruction
Tim Kochem, Idée Edalatishams, Lily Compton, Elena Cotos, Iowa State University
https://doi.org/10.1515/9783110736120-007
1 Introduction
Effective oral communication is a staple in academic settings in the United
States (US), especially for postgraduate and postdoctoral students. These indi-
viduals are often required to deliver presentations (both formally at conferences
and informally in front of peers), act as teaching assistants, and conduct re-
search with peers and colleagues. While native speakers of English may require
some instruction on the specifics of these oral communication tasks, interna-
tional students who are nonnative speakers of English may require both task
knowledge and more general skills of English, such as grammar or pronuncia-
tion. Compounding this issue is that most English language instructors within
higher education regularly assess international students on general English
skills, but many do not possess a knowledge of English that would allow them
to deliver effective instruction to their students, who are typically placed in
their courses based on institutional test scores or entrance exam results (such
as TOEFL or IELTS). Students whose scores indicate a more advanced level of
proficiency also find themselves in a precarious situation: their English skills
may be good enough to pass a test and to not require additional coursework,
yet are still problematic enough to require additional assistance in order to
meet the oral communication demands of academia.
Of all the English skills, pronunciation is perceived to be one of the most
difficult skills to teach (Baker 2014; Couper 2017). In fact, studies have found
that amongst English as a second language (ESL) instructors, pronunciation is
often neglected due to instructors’ lack of confidence, skill, or knowledge
(Baker 2014; Derwing 2019). A common belief is that residing in a naturalistic
setting presents ample opportunities to engage with the target language, which
should be sufficient for language growth (Lightbown and Spada 2006). How-
ever, this belief does not hold quite as true for pronunciation, as some studies
have found that pronunciation rarely improves past the first year in a target
language environment without explicit instruction (e.g., Derwing and Munro
2013). The need for explicit instruction creates an unusual challenge, in that
international students studying in the US may be unable or unwilling to spend time
taking a full course in oral communication, but still need fine-tuning if they are
to be successful in their academic careers.
To meet this challenge, Iowa State University’s Center for Communication
Excellence (CCE) developed a model for assisting international students with
their English-speaking needs. Following an overview indicating the importance
of English-speaking ability in general, we introduce the English-Speaking
Consultation (ESC) model, and then describe the specifics of ESC consultant
training. Next, we cover how technology plays an essential role in the
ESCs, including how the consultants are trained to use technology and how
technology is leveraged for supplementary language instruction. To do this,
we use the technological pedagogical content knowledge framework (TPACK:
Mishra and Koehler 2007) as an underpinning to connect the consultants’
training in technology with their ability to deliver effective ESL pronunciation
tutoring. Finally, this chapter integrates personal insights and reflections
from two English-speaking consultants to further contextualize how ESCs are
conducted, as well as how the ESC training has influenced their knowledge of
pronunciation training.
2 Pronunciation and oral communication

2.1 Second language pronunciation development
Pronunciation development may be one of the most debated topics in second
language (L2) learning. We have seen pronunciation move from its primacy as a
key construct in the Audiolingual Method (Saito and Lyster 2012) to its near
non-existence under Communicative Language Teaching (Purcell and Suter 1980).
Today, a renewed focus is being placed on the role of pronunciation in lan-
guage learning due to research showing that pronunciation instruction works
(e.g., Lee, Jang, and Plonsky 2015; Saito 2012; Saito and Plonsky 2019; Thomson
and Derwing 2015). Furthermore, current research suggests that a stronger em-
phasis should be placed on the importance of producing comprehensible and
intelligible speech, which is considered a rewarding goal that will be more
valuable to learners in terms of effective oral communication (Derwing 2019;
Levis 2005, 2018, 2020).
It should be noted, however, that regardless of the end goal, the develop-
ment of pronunciation takes time. Even within a naturalistic setting where the
target language is the primary language, we can reasonably expect that gains
in pronunciation will plateau after a year without explicit instruction (Derwing
and Munro 2013). Although there is no set framework or timeline for the acqui-
sition of pronunciation features, there are published lists of likely errors for
speakers of particular first languages (e.g., Swan and Smith 2001). There is no
guarantee, however, that individual learners will produce all those errors, as
aptitude, proficiency, motivational factors, and language experience all play a
role in the acquisition of pronunciation (Munro 2018). The type of instruction
that learners receive may also play a role in the acquisition process; some in-
structors, for instance, may rely too heavily upon more controlled or listening
discrimination activities rather than those activities which promote spontaneous
speech (Baker 2014). While studies have shown that pronunciation instruction
using controlled activities was effective, it remains unclear how well this
type of instruction leads to spontaneous (or automatized) speech (Saito and
Plonsky 2019).
Researchers and teachers would claim that pronunciation instruction is ef-
fective in helping learners to obtain metalinguistic awareness, noticing, and
understanding of L2 pronunciation features through controlled phonetic in-
struction (Derwing and Munro 2005). This is in line with the vast literature on
instructed second language acquisition (SLA), which has revealed the positive
effects that explicit instruction has on the initial stages of SLA – “noticing, pat-
tern identification, restructuring, [and] error avoidance” (Saito and Plonsky
2019: 691–692). However, scholars mostly agree that learners should be given
opportunities to automatize speech features through the use of free or sponta-
neous speech tasks (Saito and Plonsky 2019; Spada and Tomita 2010). The need
for these opportunities becomes more apparent when looking at international
graduate students studying in the US, as these learners enter into academia at
varying language proficiency levels but must learn to automatize their oral
communication to a sufficient degree in order to fulfill their responsibilities as
graduate students and teaching or research assistants and to complete their
program.
2.2 Oral communication for international graduate students
Successful oral communication is a must-have for all graduate students. This
includes the ability to converse with colleagues and professors, deliver formal
academic presentations, and network with other professionals within their
fields at conferences and other collegial venues. Mastering oral communication
is dependent on a number of factors, including willingness to communicate
(MacIntyre et al. 1998), dedication to developing oral communication skills
(Flege and Liu 2001; Ranta and Meckelborg 2013; Saito 2015), and effective pro-
nunciation (Derwing and Munro 2013). For native speakers of English, this is a
less demanding task as they are only challenged with learning and using the
discourse conventions and features of a given context. International students,
in addition to that, must oftentimes learn to use effective pronunciation that is
both comprehensible and intelligible.
The role of pronunciation within successful oral communication is multi-
faceted (Levis 2018) and can possibly affect the degree of effort needed to com-
prehend or decode a speaker’s intended message (i.e., comprehensibility) and/or
a listener’s ability to actually understand the speaker’s message (i.e.,
intelligibility). At the word level, issues with segmentals (e.g., consonant and vowel
sounds) can have a detrimental impact on the listener’s ability to decode the
word, thus leading to compromised comprehensibility or intelligibility (Bent,
Bradlow, and Smith 2007; Jenkins 2002). While incorrect vowel pronunciation
could be argued to be more detrimental (Bent, Bradlow, and Smith 2007), con-
sonants have also been shown to play a role in the listening process of decod-
ing information, especially word-initial and word-final consonant clusters
and high functional load errors (Im and Levis 2015; Munro and Derwing 2006;
Zielinski 2008). At the discourse level, the prosodic features of language (i.e.,
suprasegmentals) play an integral role in a speaker’s ability to produce com-
prehensible and intelligible language. This includes having a firm command
of the rhythm, intonational patterns, use of prominence, word stress, and
thought groups that native English speakers employ. Many studies have pointed
out a range of negative effects that misplaced or misused suprasegmentals
can have on comprehensibility or intelligibility (e.g., Gallego 1990; Hahn
2004; Kang 2010; Sereno, Lammers, and Jongman 2016). Some researchers have
concluded that prosodic errors are often more serious than segmental errors
(e.g., Anderson-Hsieh, Johnson, and Koehler 1992). Without a sound founda-
tion in all aspects of effective pronunciation, students may find it difficult to
navigate through the oral communication exchanges that higher education re-
quires (Ranta and Meckelborg 2013; Yanagi and Baker 2016).
International students come to study at universities in the US for many rea-
sons, including the high quality of programs, more options and degrees to
choose from, and the appeal of English skills to potential employers. This last
point, the appeal of English skills, is a double-edged sword. On the one edge,
successful completion of a graduate program within the US has the potential to
show future employers that an international student has the skills to communi-
cate in the language. Yet on the other edge, the student must first complete the
program, which can be rather difficult if the English language proficiency is
low. That is, success is dependent on the student’s command of English. While
struggling students may have access to English courses once entering the uni-
versity, these are oftentimes restricted to instruction in writing and grammar,
even though challenges with oral communication have been highlighted for
decades (Ferris and Tagg 1996; Ferris 1998; Liu 2001; Morita 2004; Sawir 2005;
Kim 2006; Yanagi and Baker 2016).
The importance of international graduate students’ oral communication
abilities in general, and comprehensible pronunciation in particular, can be
viewed in light of their development as professionals in their field and growth
toward a career path. International graduate students perceive productive
language skills such as speaking as most challenging (Berman and Cheng
2001). Japanese speakers of English, for example, consider pronunciation as
the most difficult language skill (Yanagi and Baker 2016). In their study,
Yanagi and Baker (2016) found that, in a classroom context, graduate students
reported small group and large group discussions as the most common
oral communication tasks. Along with asking questions, these tasks represent
areas that graduate students are most concerned about and are also identified
by professors as most problematic (Kim 2006). Such perceived difficulties in
oral communication correlate with graduate students’ academic
performance (Berman and Cheng 2001), potentially impacting their prospects of
securing jobs that require a higher grade point average.
From the perspective of academic development, graduate students’ ability to
sustain effective oral communication with peers, colleagues, and professors
(both native and non-native speakers of English) can impact their chances of par-
ticipation in group projects and research collaborations, possibly affecting their
inclusion in future group research projects, getting published, and growing as
researchers. If a graduate student’s pronunciation is deemed as not comprehensi-
ble by members of their academic community, it might influence opportunities
for professional development and eventually finding a job and pursuing their de-
sired career. Many graduate students (especially at the doctoral level) are also
teaching assistants (TAs), either teaching classes on their own or assisting a
professor in teaching a course. Depending on the type of teaching responsibility,
graduate TAs may have to deliver lectures, lead discussions, supervise lab ses-
sions, or hold office hours. In performing all of these tasks, comprehensible and
intelligible pronunciation is needed for effective communication between the
graduate TAs and their students. International TAs (ITAs) generally feel confident
in their knowledge of the course content and use of teaching strategies, but be-
lieve that their oral proficiency in general, and pronunciation in particular, create
problems for them in getting their messages across (Myles and Cheng 2003).
If undergraduate students taking classes with ITAs do not consider the pro-
nunciation of these instructors understandable, they might negatively evaluate
the ITA’s teaching performance (Hoekje and Williams 1992). Of course, students
from different language backgrounds and levels of experience with non-native
accents rate the comprehensibility of L2 instructors’ speech differently (Saito
and Plonsky 2019). While the role of listeners in this communication should not
be neglected (Rajadurai 2007; Subtirelu 2017), graduate students can benefit
from individual support on specific aspects of their pronunciation that might
interfere with the comprehensibility and intelligibility of their speech.
Research on graduate oral communication highlights the importance of
providing international graduate students with English for Academic Purposes
instruction that focuses on oral communication skills related to giving oral pre-
sentations, participating in small and large group discussions, and asking and
answering questions in class (Berman and Cheng 2001). Kim (2006) adds that
due to the differences in educational practices between the US and interna-
tional graduate students’ home countries, language instruction should also
help students gain meta-knowledge about the oral communication skills
required of them in the context of US academia, and assist them in developing an
understanding of the value of, and need for, active participation through speaking.
Explicit instruction on prosodic features of speech can also activate non-native
speakers’ knowledge of the prosodic structures in their first languages, helping
them obtain better results in their pronunciation practice (Liu 2020). These
identified needs are targeted in the English-Speaking Consultations (ESCs),
which are in high demand at Iowa State University. The next section describes
how and by whom this support is provided.
3 English-Speaking Consultations (ESCs)

3.1 Context for ESCs
The ESC model was developed and implemented in the Center for Communica-
tion Excellence (CCE) housed within Iowa State University’s Graduate College.
The CCE aims to support the academic and professional communication needs of
all graduate students and postdoctoral scholars at the university. Since its
founding in 2015, the CCE has launched seven programs specialized in different aspects
of written and oral communication, all of which are grounded in research
from communication genres and scholarship of teaching and learning. The
English Language Development Program (ELDP) is one of the seven programs. It
was designed to provide opportunities for individualized language practice and
improvement by offering three types of service: English writing consultations
(EWCs), English-speaking consultations (ESCs), and Peer Speaking Practice Groups
(PSPGs). The EWCs are one-on-one tutoring sessions focusing on macro- and
micro-level aspects of writing, and the ESCs focus on oral communication profi-
ciencies including pronunciation. The PSPGs, in turn, engage small groups of par-
ticipants to practice speaking on various topics, and assigned facilitators generally
focus on pronunciation topics as they arise. This chapter focuses on elements of
the ESCs with a specific emphasis on pronunciation.
3.2 Types and procedural components of ESCs
The ESCs are 50-minute-long tutoring sessions. They can be of two types:
Type 1 – Assistance with specific English-speaking tasks, and Type 2 – Develop-
ment of general oral communication English skills. For Type 1 ESCs, consultants
focus on helping students prepare for speaking tasks such as conference presen-
tations, thesis/dissertation defenses, job talks, etc. Often, students and novice
scholars feel apprehensive or nervous about high-stakes performances (e.g.,
thesis/dissertation defenses or job interviews), or concerned that they might be
viewed as incompetent by peers or faculty during low-stakes performances (e.g.,
group reports or class presentations). In Type 1 ESCs, consultants give concrete
recommendations with regard to the target task, which helps to increase
speaker confidence. In Type 2 ESCs, consultants focus on specific language traits
to help students develop their speaking skills. Students seeking these consulta-
tions may be new to the US or at the beginning of their graduate studies, or they
may have experienced difficulties in expressing themselves clearly in academic
settings. Type 2 ESCs are often scheduled on a recurring basis so that consultants
can develop individualized plans after conducting a needs analysis. For each
type of consultation, specific recommended procedures have been developed and
put in place. Figure 1 shows the flowchart for recommended procedures for both
consultation types.
Overall, there are four key components: needs analysis, consensus build-
ing, formative tasks, and recommended next steps. Needs analysis is the first
step in the Type 1 ESCs and Phase 1 in the Type 2 ESCs. Since Type 1 ESCs are
based on a specific task, the needs analysis is completed in approximately five
minutes to get information about the task, context, and target audience. For
Type 2 ESCs, the needs analysis process may take one to two sessions. Consul-
tants use two carefully structured tools, the self-assessment interview and the
English language skills diagnostic tool, to (1) help establish rapport and credi-
bility, (2) discuss expectations about time and effort, and (3) analyze strengths
and weaknesses. After the needs analysis, consultants and learners work
together to establish short- and long-term goals. Consensus building is a key
component of all ESCs to encourage learners’ sense of self-efficacy and whole-
hearted buy-in to the process. Thus, Step 2 in Type 1 ESCs and Step 2 in Type 2
ESCs (Phase 1) include a discussion about priorities based on the amount of
time available and the needs analysis. This step usually takes about 5–10
minutes.
Figure 1: Recommended procedures for English-speaking consultations.
The next step is the implementation of formative tasks. For Type 1 ESCs,
role-play is the most common formative task to allow the consultants to focus on the
specific language task. During the role-play, the consultants work on organization,
coherence, transitions, and lexicogrammar to ensure that the presentation is
effective. Additionally, consultants highlight specific pronunciation issues such as
word stress, enunciation, vowel or consonant sounds, etc. For Type 2 ESCs, forma-
tive tasks may include situational role-plays, fluency exercises, accuracy exer-
cises, and other activities depending on the insights from the needs analysis.
These formative tasks allow consultants to identify an area of improvement and
model how the task can be extended as self-practice. Since these tasks provide
opportunities for targeted instruction and continuous feedback, they require
longer interaction, typically 30–35 minutes.
The final key component of the ESCs is the recommendation of next steps.
In the Type 1 ESCs, the consultants make specific recommendations for how
students can independently practice specific aspects of oral communication
(e.g., fluency, prominence, signal words). Although these tend to be one-time
sessions, students wishing to improve and rehearse their presentation may
choose to schedule another Type 1 appointment. For the Type 2 ESCs, which
occur repeatedly, the recommendation of next steps is provided at the end of
each session. Commonly, this involves a recommended 10–15 minute homework
activity following up on the formative tasks. For example, Figure 2 shows
a screenshot of a Successive Speaking (4/3/2) activity for self-practice on pre-
sentation tasks or job interviews. The 4/3/2 refers to the number of minutes the
student practices speaking. For the first round, the student sets the timer for
four minutes and speaks continuously until the timer runs out. The second and
third rounds are set for three and two minutes, respectively. The goal of this
activity is to focus on fluency, and by giving the same talk repeatedly and
being focused on the timer, the student can move away from fixating on accu-
racy first (over time, this activity can help the student to increase accuracy as
well).
Figure 2: Screenshot of Successive Speaking (4/3/2) task.
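For students who prefer to run the Successive Speaking (4/3/2) activity with a script rather than a phone timer, the three rounds described above could be sketched as follows. This is only an illustrative sketch and not part of the CCE materials: the names ROUND_MINUTES, build_schedule, and run_practice are hypothetical, and the prompts are our own wording.

```python
import time

# Round lengths in minutes, taken from the 4/3/2 description above.
ROUND_MINUTES = (4, 3, 2)


def build_schedule(round_minutes=ROUND_MINUTES):
    """Return (round_number, seconds) pairs, one pair per speaking round."""
    return [(number, minutes * 60)
            for number, minutes in enumerate(round_minutes, start=1)]


def run_practice(schedule, sleep=time.sleep, announce=print):
    """Announce each round, wait for its duration, then announce its end.

    `sleep` and `announce` are parameters (hypothetical design choice) so the
    routine can be rehearsed or tested without actually waiting nine minutes.
    """
    for round_number, seconds in schedule:
        announce(f"Round {round_number}: give the same talk for "
                 f"{seconds // 60} minutes without stopping.")
        sleep(seconds)
        announce(f"Round {round_number} finished.")
```

Calling run_practice(build_schedule()) would start the four-minute round, followed by the three- and two-minute rounds. Because the talk stays the same while the time shrinks, the timer rather than accuracy sets the pace, which is the fluency focus the activity is meant to create.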
The final step shown in Figure 1 at the end of both consultation types is the “En-
glish-Speaking Consultation Evaluation”. This step is not directly related to the
learning tasks but rather a protocol for the CCE to collect feedback from students
about their experiences in an effort to improve the protocol and consultation
quality. The accumulated feedback is compiled and reviewed each semester.
It is worth mentioning that students often make appointments with more
than one consultant, which is why it is important to enact a procedure for infor-
mation transfer among consultants. This is particularly true for Type 2 ESCs,
where a consultant and the student mutually agree on an individualized plan
that needs to be implemented over time. To ensure the necessary continuity,
the consultant who works with a student for the first time creates a document,
takes consultation notes including the needs analysis, short- and long-term
goals, and learning goals, and stores the individualized document in a shared,
yet secure, digital folder. When the student attends ESCs with other
consultants, those consultants can access the student’s consultation history
entered by the
previous consultant. The students can also refer to their personal consultation
documents to review their own progress and access the homework tasks and
suggested resources.
4 English-speaking consultants
4.1 General ESC training
The English-speaking consultants are graduate students with ESL experience,
typically recruited from the Applied Linguistics and Technology Program in
the Department of English at Iowa State University. They are prepared and
certified through a self-paced training that includes both theoretical and prac-
tical aspects. Figure 3 shows the outline of the training. During the first seven
weeks, the consultant trainees complete six modules in the learning manage-
ment system Canvas. The first module is an introduction to the ESCs, and the
next five modules focus on pragmatics, listening, speaking, lexicogrammar,
and pronunciation. The trainees are encouraged to complete one module per
week; however, the pronunciation module consists of numerous concepts, so
they are allowed two weeks to complete this module.
Once trained, the consultants operate within three dimensions to assist the
students as summarized in Table 1.
4.2 Pronunciation training module
Of the five major aspects of oral communication covered in the ESC training,
pronunciation receives the most attention as it is arguably both the most com-
plex to teach and the most sought-after instruction from international students
in speaking consultations. Even for those consultations that focus on a
particular discourse setting (e.g., interview, presentation), there is always an element
of pronunciation instruction that is explicitly desired by students, who might
say “please let me know if my pronunciation needs any work.”
Figure 3: Training outline for English-speaking consultants.
Table 1: Dimensions targeted by English-speaking consultants.
Language needs awareness:
– Diagnose oral English communication difficulties and understand individual
  needs for language improvement
– Discuss oral communication issues in a culturally sensitive environment
– Identify cross-cultural differences in oral communication
– Locate and effectively use resources for independent and/or guided
  English-speaking practice and development
Speaking and pronunciation:
– Identify vowel and consonant differences between the first language and the
  English language
– Use correct vowels and consonants for effective oral communication
– Use thought groups, intonation patterns, word stress, focus, and/or volume
  appropriately for effective oral communication
– Use appropriate speech patterns in English to enhance overall fluency
– Use different communication strategies to maintain and repair oral discourse
Listening, lexicogrammar, and pragmatics:
– Use strategies to listen effectively to successfully respond during oral
  communication
– Expand and use English lexicogrammar to improve overall fluency during oral
  communication
– Differentiate implicit and explicit meanings of utterances
– Use appropriate utterances for various situational settings
It is not unusual for students to know which features of pronunciation they
would like a consultant to focus on (for instance, intonation or word stress);
however, additional features may need attention as well. Students may not be
aware of specific pronunciation errors, and they may seek help because a pro-
fessor or colleague mentioned they should work on their pronunciation, with-
out providing any concrete examples of their mispronunciations. This gives
further justification for the extended consultant training in pronunciation
features. Consultations are first and foremost a service provided for international
students, so the needs and goals of the student come first. Still, the ability to
evaluate, assess, and prioritize pronunciation errors can sometimes reveal er-
rors that are more detrimental to a student’s ability to produce comprehensible
and intelligible speech.
Therefore, the module for pronunciation focuses on training the consultants
in intelligibility-based pronunciation instruction (Levis 2018). In other words, the
goal of pronunciation instruction is geared towards producing speech that is eas-
ily decoded and understood. This is accomplished by centering on six major
topics and respective subtopics, which are listed in Table 2.
Table 2: Major pronunciation topics and subtopics covered in the Pronunciation
module of the ESC training.
Topic: Subtopic(s)
Thought groups:
– Common grammatical structures
– Pausing
– Speech rate
Word stress:
– ‘Clear’ vowel vs. schwa
– Consonant quality
– Length of vowels
– Pitch/volume movement
– Cognates
Focus (or prominence):
– Default placement
– Contrastive stress
– Emphasis
Intonation:
– Pitch movement
– Grammatical meaning or attitude
– Listener interpretation
– Contextual factors
Volume:
– Continuum of ‘default volume’ for first languages
– Listener interpretation
– Self-awareness
– Contextual factors
Segmentals:
– Perception-production link
– Segmental pronunciation as physical activity
– Segmental pronunciation as habit
– Training priorities
– English consonants
– ‘Clear’ vowels
– Schwa
– Connected speech
For each topic, trainees are first presented with real-life scenarios where
students express their concerns. Here’s an example of a scenario for thought
groups:
An hour ago, you were meeting with a student who speaks so fast that you – and others,
according to the student – find following what she says very difficult. The student you are
meeting now is not nearly as fluent, tripping over almost every other word, her speech epito-
mizing the label ‘broken English.’ Using standard pronunciation training terminology, how
would you define each student’s problem? What during-appointment and homework activi-
ties would you assign for each student? Why?
The trainees begin with some of their own ideas based on their own language
teaching experiences and knowledge. They then go through readings grounded
in theory and research that were specifically written for this training. The mod-
ule also includes recommended techniques and tools for actual ESCs, which
were curated by an experienced practitioner and researcher of L2 pronunciation
instruction.
Before trainees can be taught to tutor in pronunciation, they must have a
firm understanding of the phonetic and phonological features of spoken En-
glish. After trainees have shared their understanding of the topic, they engage
with instructional materials that enhance their knowledge of these pronuncia-
tion features. For example, in the above scenario about thought groups, the lit-
erature review for the training program emphasizes the importance of thought
grouping followed by five features that are common in thought groups. Figure 4
shows excerpts from the training module related to the scenario.
Figure 4: Excerpt from the training material.
Figure 4 is but a small snapshot of the instructional materials that trainees en-
gage with throughout the pronunciation module. As Figure 4 shows, the train-
ees are given instruction not only on the phonetic and phonological aspects of
language, but also on their direct and indirect impact on producing
comprehensible and intelligible speech. Developing effective oral communication,
rather than focusing solely on accent reduction, is a more reasonable and
oftentimes more achievable goal for learners, who are also immersed in their own
studies.
The trainees are also exposed to content that promotes their understanding
of pronunciation errors common to speakers of certain first languages and
guides them in identifying those errors. That is, in addition to an overview
of common errors, there is an extended literature review that describes in
detail how to identify such errors, why they occur, and how to approach
them. This provides the trainees with ways to contextualize their instruction,
which is often preferable to teaching
pronunciation in a vacuum (Levis 2018).
Once the trainees have sufficient knowledge of the phonetic and phonological
features of English, as well as a firm understanding of their impact on producing
comprehensible and intelligible speech, they move on to the final section of
each topic, which covers activities and techniques for effective teaching. Here,
they are provided with extensive pedagogical knowledge, as well as key take-
aways towards the end of the teaching section (see Figure 5). The takeaways pro-
vide a quick reference point for trainees when consulting, in case they need a
refresher or an activity on the go. This section is arguably the most important, as
teachers who report a lack of confidence in their ability to teach pronunciation
will sometimes provide instruction as written in a textbook or forego instruction
altogether (Baker 2014; Derwing 2019; Levis and Kochem in press).
Figure 5: Techniques and tools.
To end each topic, the trainees are asked to revisit the initial teaching scenario
and revise their answers as necessary. Even if a trainee was correct in their
identification of the pronunciation error at the beginning, at this point they
should elaborate more on not only the issue at hand, but how they would pro-
vide instruction that would be both meaningful and effective for the individual
students.
5 Technology for pronunciation training and consulting
Aside from training in pronunciation features and effective means for language
instruction, consultants are also taught how to incorporate technology resources
into their consultations. The use of technology provides a two-fold advantage for
pronunciation instruction. First, it allows the consultants to find helpful resour-
ces that their students will benefit from for specific target features. For example,
if a student were having difficulty with vowels, it may benefit the consultant to
use a program such as English Accent Coach (https://www.englishaccentcoach.
com/) to help that student to differentiate between English vowel sounds. From
there, the consultant could move into technology resources that can help the stu-
dent hear the sounds in discourse, such as YouGlish (https://youglish.com/) or
the mobile application Languages with Music: Lyrics Training (LyricsTraining
2020). The second advantage is that the use of technology extends learning
beyond the consultation. Pronunciation development takes time and explicit
instruction, presumably more than the one-hour consultation per week can af-
ford. This additional practice with technology can lead to identifying specific
pronunciation errors, which the consultant and the student can work on in
future sessions.
To explain how the trainees are instructed to use technology for pronunciation
tutoring during ESCs, and to provide a clear connection between theory and prac-
tice, we use Mishra and Koehler’s (2007) technological pedagogical content knowl-
edge (TPACK) framework (see Figure 6). Earlier in this chapter, we addressed two
concepts of the framework: content knowledge (CK), i.e. what a consultant needs
to know about phonetic and phonological processes, and pedagogical knowledge
(PK), i.e. how a consultant provides effective instruction to teach L2 pronunciation.
The third form of knowledge is technological knowledge (TK), which refers to the
general understanding of information technology and the use of technology for
daily operations, e.g. operating a computing device, browsing the internet, etc.
The addition of technology creates new and adapted knowledge bases when inter-
sected with CK and PK: technological content knowledge (TCK) and technological
pedagogical knowledge (TPK).
The training provides consultants with examples of TCK and TPK, such as
using TED Talks or YouTube to mine relevant authentic materials related to the
students’ discipline and needs that can result in purposeful learning within a spe-
cific context. For example, consultants may select a video on artificial intelligence
if working with students from computer science. Consultants may wish to narrow
down choices of videos based on the variety of English (e.g., American English),
difficulty of content (e.g., for a high school or undergraduate student versus a
graduate student), etc. Using these sources gives students access to not only authentic
materials but also quality presentations with suitable subject-matter as
opposed to a textbook approach with pre-made instructional recordings. Thus,
students see real-life and exemplary samples of discipline-related presentations
that they can aspire to work toward as their learning goals.
Figure 6: TPACK (http://tpack.org/).
Furthermore, the trainees are provided with example teaching strategies
with technology throughout their training for individual skills. These are accessi-
ble through Wiki Pages (for an example, see Figure 7) that both trainees and con-
sultants can modify as they find additional helpful resources. Likewise, they can
add a descriptive area of interest (e.g., what pronunciation features the resource
covers) and any helpful tips or strategies for utilizing the technological resource
(e.g., what activities can be used in tandem with the tool). A range of technologies
is included, from YouTube videos with detailed descriptions and analyses
of pronunciation features to more interactive tools that can be used for
listening discrimination, controlled, or guided practice (Celce-Murcia et al. 2010).
Figure 7: Pronunciation Wiki Page.
The benefits of giving trainees and consultants the opportunity to co-construct
the technology Wiki Pages are two-fold. On the one hand, this creates a space for
them to find and learn about new technologies that can be adopted or adapted for
pronunciation tutoring. On the other hand, this requires them to critically evaluate
technological resources before posting to the Wiki Page, with an emphasis on:
1. Intended purpose (i.e., what pronunciation features the tool covers),
2. Potential effectiveness of the tool for achieving its intended purpose, and
3. Possibility to further leverage the tool through the use of additional or sup-
plementary instruction.
A YouTube video, for instance, might be a great tool to practice listening to tar-
get language sounds. However, it is important to select videos with potential
for language learning. In that regard, closed-captions and visuals in the video
can scaffold students’ attention to target sounds. Additionally, using the transcripts
and functions such as pause and rewind can help foster their attention
to the target language sounds. These are technological content and technologi-
cal pedagogical skills.
However, simply listening to the target language sounds on YouTube and
using functions such as closed-captions and transcripts cannot improve stu-
dents’ pronunciation skills. Since the selected videos would likely be created
for purposes unrelated to language instruction, the trainees and consultants
need to identify relevant learning objectives and supplement the learning with
practice activities. For example, with the techniques listed in Figure 5, e.g. Ana-
lyze2Imitate, consultants can model how to use the transcript from a YouTube
video and mark all pauses from a segment with a slash, followed by capital-
izing words that have the emphasis as illustrated in the example below:
/ let’s TALK about these FIVE / MAJOR / TYPES / of CHEMICAL REACTIONS /
After that, they could model the technique of Analyzing your own practice talk to
imitate the pauses and emphasis using the marked transcript. Students then can
practice to demonstrate understanding of the tasks and later extend this practice
to their own recordings to facilitate comparison of their version with the version
from the video. These carefully selected instructional techniques, combined with
the understanding of students’ needs, would illustrate the trainee’s or consultant’s
intersectional grasp of technological, pedagogical, and content knowledge, that is,
technological pedagogical content knowledge (TPACK). In an instructional video, an
expert may give a very clear and detailed description of a pronunciation feature,
which may be further complemented by visual aids. However, these videos rarely have additional
practice for their viewers, which is a common pitfall of such instruction. Under-
standing and acknowledging this pitfall, the trainee or consultant can provide
practice activities that complement or supplement the instructional video (for an
example, see Table 3).
Table 3: Example of TPACK Knowledge Building.
CK     Knowledge about phonetics and phonology
PK     Knowledge about using free videos as multimodal resources for instructional use
TK     Knowledge about YouTube and how to search for videos and use functions like the pause and rewind buttons
TCK    Knowledge about selecting specific and appropriate YouTube videos for listening to target language sounds based on topic, vocabulary level, type of English, etc.
TPK    Knowledge about using transcripts or closed captioning to facilitate listening to target language sounds, or using the playback speed to adjust the pace of speech according to learners’ level
TPACK  Knowledge about using a selected YouTube video with specific instructional techniques to meet a desired learning objective, e.g. reducing pauses or emphasizing important words
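The transcript-marking step described in this section (a slash for each pause, capitals for each emphasized word) can also be sketched in a few lines of code, for consultants who wish to prepare marked handouts quickly. The sketch below is only an illustration under stated assumptions: the function names, and the idea of entering pauses and emphasis as word positions noted down while listening, are hypothetical and are not part of the ESC training materials.

```python
def mark_transcript(words, stressed, pause_after):
    """Build a marked transcript: '/' for pauses, capitals for emphasized words.

    words       -- the transcript segment as a list of words
    stressed    -- positions (0-based) of emphasized words; hypothetical input
                   that the consultant notes down while listening to the video
    pause_after -- positions (0-based) of words that are followed by a pause
    """
    parts = ["/"]  # the segment opens with a boundary slash
    for i, word in enumerate(words):
        parts.append(word.upper() if i in stressed else word.lower())
        if i in pause_after:
            parts.append("/")
    return " ".join(parts)


def pause_differences(model_pauses, student_pauses):
    """Compare pause positions in the model video and in a student's recording."""
    return {
        "missing": sorted(model_pauses - student_pauses),  # pauses the student skipped
        "extra": sorted(student_pauses - model_pauses),    # pauses the student added
    }


# Illustrative input based on the example sentence in this section.
words = "let's talk about these five major types of chemical reactions".split()
stressed = {1, 4, 5, 6, 8, 9}
pause_after = {4, 5, 6, 9}

print(mark_transcript(words, stressed, pause_after))
print(pause_differences(pause_after, {2, 4, 9}))
```

The second function mirrors the comparison stage, in which students set their own recording against the version from the video. It only compares where the pauses fall, so judging the quality of emphasis still depends on the consultant's ear.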
6 Narratives on ESC training and practices

To provide a deeper contextualization of the training experience, this section
features the narratives of two consultants who not only went through the train-
ing, but also helped shape it further through their feedback and revisions. Each
consultant was asked to answer a series of questions that focused on the con-
nection between their knowledge of pronunciation pedagogy and their consult-
ing practices. This is an important connection for consultants to be able to
make, as it helps them to reflect and self-critique – two crucial components of
tutor (or teacher) development. In short, they reflected on how their previous
experiences converged and diverged with the pronunciation materials in the
ESC training, and how they adapted what they learned to their consulting prac-
tices. It is our intention that the narratives below will help others to recognize
the importance and the realities of pronunciation training for language profes-
sionals. Pseudonyms were used for each consultant.
6.1 Billy
“L2 pronunciation is one of my main interests in pursuing a graduate degree in
Applied Linguistics, so I had an intimate understanding of the phonetic and
phonological features, as well as a working knowledge of effective teaching
strategies prior to ESC training. What I really learned in the training was how to
modify my teaching strategies to accommodate one-on-one tutoring sessions.
In a classroom setting, I can have the students working with each other while I
sit in the background and listen to their speech, identifying any errors or topics
for further instruction. In a one-on-one setting, there is nowhere for me to
‘hide’, so to speak – I act as the input, interlocutor, and instructor in this set-
ting. This presents some unique challenges, some of which I was not previously
prepared for.
However, the first component of the training that was quite helpful for one-
on-one consulting, which was not exclusive to the pronunciation module but
certainly helped regardless, was the identification of Type I and II consulta-
tions. Following the flowchart in Figure 1 helps me to stay on track and identify
the needs of the learner first, which to me is a more crucial step than in a class-
room setting. Oftentimes, classes are designed for a specific purpose (e.g., aca-
demic, business), but in a consultation, you really have no idea who your next
student will be, what background they’re coming from, and what they want to
work on. Starting by identifying the needs of the student, the flowchart pro-
vides a nice pathway to achieve their goals. Sometimes it’s a very specific goal,
such as practicing their thesis or dissertation presentation, and sometimes it’s
much more general, such as talking with labmates or other colleagues.
An additional benefit that I gained, and continue to gain from my consul-
tant peers and through professional development, which we engage in bi-
weekly, is how to leverage the use of technology for both face-to-face tutoring
and for continued instruction outside of the consultations. The need for continued
instruction cannot be overstated when it comes to pronunciation instruction
– gains in pronunciation often require the breaking down of some speech
habits while simultaneously building new ones. Quite often, this is a matter of
first language influence (though it is certainly not limited to it), where students
are using the speech features of their native language in the L2. To break this
cycle, students require much more instructional time than the one-hour consul-
tation can give.
Therefore, by implementing technological resources (such as English Ac-
cent Coach, YouGlish, or dozens of other tools), we can assign ‘homework’ for
our students to help them continue their learning at least a little bit every day.
This continual learning strategy typically results in more automatized speech
though it does require additional effort on the part of the student. Likewise, this
approach with technology is both appealing for most students and it provides op-
portunities for them to encounter language (e.g., grammar, vocabulary) that they
may otherwise not be aware of. For example, using a web page for consonant clus-
ters (e.g., https://usefulenglish.ru/phonetics/practice-consonant-clusters) can ex-
pose students not only to common consonant clusters in English, but also to new
vocabulary items, which can help learners develop new ways to express them-
selves as well as concepts and ideas that are relevant to their field or discipline.”
6.2 Ali
“One of the qualities I looked for in a graduate assistantship was the opportu-
nity to support academic communication. Over the first few years of my gradu-
ate studies in Applied Linguistics, I had taught or tutored students on academic
writing skills. Regardless of their level, discipline, and first language back-
ground, I would always sense their need for oral communication support. With
the opportunity of involvement in the development of the ESC model and train-
ing, I felt empowered to examine and analyze oral communication needs of
graduate students in depth and to understand what types of support can best
address these needs. As I started offering consultations, Type I consultations
were easier for me. Since the student would come in with a specific task, needs
analysis seemed to be a much more straightforward process. Type II consultations,
however, required me to greatly pull from my ESL educator knowledge to
identify the different aspects of English speaking and pronunciation that a stu-
dent needed to focus on. At times, choosing or designing activities that targeted
specific linguistic features such as a vowel/consonant, or a prosodic feature
such as thought grouping, could be challenging. Many times, students had con-
cerns about their “fluency,” without recognizing the numerous factors that play
a role in this broad concept. My ESC training and ESL educator knowledge in
general and in pronunciation teaching in particular would help me explain con-
cepts such as connected speech, shortened vowels in unstressed syllables,
filled/unfilled pauses, speech rate, etc. and walk the student through the steps
of setting priorities and practicing in these areas. As I gained experience work-
ing with students from the same language backgrounds, and as our repertoire
of ESC activities grew larger, I began to feel more confident in finding activities
targeting specific linguistic features.
As someone who enjoys working with individual students more than teach-
ing a class, offering ESCs has been a wonderfully rewarding experience, allow-
ing me the opportunity to connect with graduate students with non-English
language backgrounds (like myself) in ways that no other graduate assistant-
ship would allow me to. I was able to draw on my own experiences of learning
English as a second language and encountering linguistic challenges as a grad-
uate student to understand other students’ challenges, especially those very
new to living and studying in a dominantly English-speaking country. In meet-
ing with a student on a regular basis for Type II consultations, we would have
informal conversations about difficulties in interpersonal communication and
sometimes chat about awkward positions we had found ourselves in, because
of not understanding a joke or other reasons related to the cultural differences
between our country of origin and the US. Sharing such experiences and later
laughing about such memories together were valuable to me and motivating for
the student. Many of the students I worked with had not been in the country
long enough to establish relationships with anybody other than a few friends
who spoke the same native language as theirs. ESCs would allow them to de-
vote at least an hour every week to speak in English, both conversationally and
in practicing actual academic oral communication tasks.
Working as an English-speaking consultant also prepared me for a career
in supporting academic communication needs of students, with a holistic view
on academic communication that accounts for not only written communication,
but also oral communication and the skills needed in performing a variety of
tasks such as teaching as a graduate assistant, giving a presentation, holding a
conversation with a colleague, etc. While writing as a common academic com-
munication mode has received extensive attention in both practice and re-
search, speaking and pronunciation are areas that students can always use
more help with. Working as an English-speaking consultant shaped my career
as an ESL specialist to a great extent, but also motivated me to conduct re-
search on particular characteristics of non-native English speech in academic
settings.”
7 Implications and conclusion

The goal of this chapter was to showcase the ESC consultation and training
models developed by the CCE at Iowa State University. In particular, we sought
to highlight the training that consultants receive in L2 pronunciation, as well as
how technology training is provided and how the training has impacted the
knowledge and practices of two English-speaking consultants. As a result, we
are able to share a few implications for other higher education institutions that
seek to provide an additional layer of support for their own international popu-
lation. Namely, we hope to inform how other universities can develop and im-
plement their own ESC version, as well as what reasonable expectations and
outcomes they should anticipate.
While we do not recommend a copy-and-paste approach if an institution
should choose to develop their own version of ESCs, there are a few common
starting questions that we would recommend they answer before moving forward.
First, conduct an analysis of content-specific needs within the institution;
that is, what do faculty and students need in terms of oral communication
development? While answering this question, also look for potential
collaborators – which departments might be willing to ‘hire’ consultants for their
own department or program? By starting with these two questions, an institution
can get a good picture of who their target audience is, the scope of potential services
offered, and a rough estimate of how many consultants they may need.
Next, identify other stakeholders who may be interested, such as International
Student Services Organizations or the Center of Excellence in Learning and
Teaching. Their input may help to further shape potential ESCs. Also, once notified
that such a form of support will soon exist, these stakeholders can direct
potential clients to the consultants. After conducting a needs analysis with faculty,
students, and other stakeholders, develop and pilot a small-scale ESC program
with one or two consultants. While models will differ based on resources
and needs, the models presented in this chapter all have a share-alike Creative
Commons license (i.e., CC-BY-NC-SA; excluding the TPACK framework). These
can serve as a great starting point instead of a blank sheet.
The next recommendation we would offer is in the actual training and devel-
opment of speaking consultants. This is not a quick process, nor should the train-
ees be left to their own devices to ‘learn on the go’. In order for trainees to
provide effective instruction for oral communication skills, especially pronuncia-
tion instruction, they require explicit training, field experiences, and supervised
consultations. This three-pronged approach – learning by reading, learning by
watching, and learning by doing – gives the trainees multiple opportunities to
hone their tutoring techniques and their ability to adapt their instruction for indi-
vidual students. Through this, trainees are able to deliver pronunciation in-
struction which is individualized and contextualized, as well as systematic
and integrative, that is, instruction that provides an equal focus on form and
meaning, which is ultimately the goal of effective pronunciation instruction
(Lord 2008; Saito 2013).
To help students continue this style of individualized and contextualized in-
struction outside of consultations, institutions may also provide technology train-
ing for their trainees. Making explicit connections between different knowledge
bases, for example the TPACK framework (Mishra and Koehler 2007), can help
the trainees see how and when to use technology to supplement or support their
future consulting. It also enhances the trainees’ ability to teach their own stu-
dents how to use technology, as learner training with technology is an often ne-
glected part of language instruction (Hubbard 2013).
As shown in the narratives, trainees who are offered this type of opportu-
nity have the potential to enhance their own understanding of what effective
oral communication is and how to provide explicit instruction beyond the one-
on-one tutoring environment. By interacting with language learners from across
a wide range of disciplines and language backgrounds, the consultants will
gain an understanding of how to better accommodate students who may vary
in terms of their aptitudes, proficiencies, and language learning goals (Munro
2018). However, a future area worth exploring is how much students improve
as a result of the consultations. This would provide invaluable information
about the current effectiveness of the ESC program and would likely lead
to future improvements within the ESC training program.
Through the development of an extensive ESC program, we were better
able to provide yet another layer of support for international students and
scholars at Iowa State University. While consultants support this population in a
number of ways, perhaps the most common issue that they encounter
is improving L2 pronunciation. Therefore, other institutions which
have already developed, or would like to develop, their own ESC version should
take careful consideration when preparing trainees to be aware of phonetic and
phonological features of English, and how to deliver effective instruction in
these areas. Development of pronunciation is not an easy task, nor is it a short
one, but through careful individualized instruction, we can provide an equal
opportunity for all students who study within US higher education.
References
Anderson‐Hsieh, Janet, Ruth Johnson & Kenneth Koehler. 1992. The relationship between
native speaker judgments of nonnative pronunciation and deviance in segmentals,
prosody, and syllable structure. Language Learning 42(4). 529–555.
Baker, Amanda. 2014. Exploring teachers’ knowledge of second language pronunciation
techniques: Teacher cognitions, observed classroom practices, and student perceptions.
Bent, Tessa, Ann R. Bradlow & Bruce L. Smith. 2007. Intelligibility of non-native speech. In
Ocke-Schwen Bohn & Murray J. Munro (eds.), Language Experience in Second Language
Speech Learning: In Honor of James Emil Flege, 331–348. Philadelphia, PA: John
Benjamins.
Berman, Robert & Liying Cheng. 2001. English academic language skills: Perceived difficulties
by undergraduate and graduate students, and their academic achievement. Canadian
Journal of Applied Linguistics 4(1). 25–40.
Press.
Couper, Graeme. 2017. Teacher cognition of pronunciation teaching: Teachers’ concerns and
issues. TESOL Quarterly 51(4). 820–843.
Derwing, Tracey M. 2019. Utopian goals for pronunciation research revisited. In John Levis,
Charles Nagle & Erin Todey (eds.), Proceedings of the 10th Pronunciation in Second
Language Learning and Teaching Conference, Ames, USA, 2018, 27–35. Ames, IA: Iowa
State University.
Derwing, Tracey M. & Murray J. Munro. 2005. Second language accent and pronunciation
teaching: A research‐based approach. TESOL Quarterly 39(3). 379–397.
Derwing, Tracey M. & Murray J. Munro. 2013. The development of L2 oral language skills in two
L1 groups: A 7‐year study. Language Learning 63(2). 163–185.
Ferris, Dana. 1998. Students’ views of academic aural/oral skills: A comparative needs
analysis. TESOL Quarterly 32(2). 289–316.
Ferris, Dana & Tracy Tagg. 1996. Academic oral communication needs of EAP learners: What
subject‐matter instructors actually require. TESOL Quarterly 30(1). 31–58.
Flege, James Emil & Serena Liu. 2001. The effect of experience on adults’ acquisition of
a second language. Studies in Second Language Acquisition 23(4). 527–552.
Gallego, Juan Carlos. 1990. The intelligibility of three nonnative English-speaking teaching
assistants: An analysis of student-reported communication breakdowns. Issues in
Hahn, Laura D. 2004. Primary stress and intelligibility: Research to motivate the teaching of
suprasegmentals. TESOL Quarterly 38(2). 201–223.
Hoekje, Barbara & Jessica Williams. 1992. Communicative competence and the dilemma of
international teaching assistant education. TESOL Quarterly 26(2). 243–269.
Hubbard, Philip. 2013. Making a case for learner training in technology enhanced language
learning environments. CALICO Journal 30(2). 163–178.
Im, Jiyon & John Levis. 2015. Judgments of non-standard segmental sounds and international
teaching assistants’ spoken proficiency levels. In Greta Gorsuch (ed.), Talking Matters:
Research on Talk and Communication of International Teaching Assistants, 113–142.
Stillwater, OK: New Forums Press.
Jenkins, Jennifer. 2002. A sociolinguistically based, empirically researched pronunciation
syllabus for English as an international language. Applied Linguistics 23(1). 83–103.
Kang, Okim. 2010. ESL learners’ attitudes toward pronunciation instruction and varieties of
English. In John Levis & Kimberly LeVelle (eds.), Proceedings of the 1st Pronunciation
in Second Language Learning and Teaching Conference, Ames, USA, 2009, 105–118.
Ames, IA: Iowa State University.
Kim, Soonhyang. 2006. Academic oral communication needs of East Asian international
graduate students in non-science and non-engineering fields. English for Specific
Purposes 25(4). 479–489.
pronunciation instruction: A meta-analysis. Applied Linguistics 36(3). 345–366.
Levis, John M. 2020. Conversations with experts – In conversation with John Levis, Editor of
Journal of Second Language Pronunciation. RELC Journal 52(3). 1–14.
Levis, John M. & Tim Kochem. In press. Pronunciation tutoring as teacher preparation. In
Veronica G. Sardegna & Anna Jarosz (eds.), English pronunciation teaching: theory,
practice, and research findings. Bristol: Multilingual Matters.
Lightbown, Patsy & Nina Spada. 2006. How Languages are Learned, 3rd edn. New York, NY:
Oxford University Press.
Liu, Jiang. 2020. A combination of metalinguistic instruction and task repetition in teaching
Chinese prosody. In Okim Kang, Shelley Staples, Kate Yaw & Kevin Hirschi (eds.),
Proceedings of the 11th Pronunciation in Second Language Learning and Teaching
Conference, Flagstaff, Arizona (USA), 2019, 154–162. Ames, IA: Iowa State University.
Liu, Jun. 2001. Asian students’ classroom communication patterns in US universities: An emic
perspective. Greenwood Publishing Group.
Lord, Gillian. 2008. Podcasting communities and second language pronunciation. Foreign
Language Annals 41(2). 364–379.
LyricsTraining. 2020. Learn Languages with Music: Lyrics Training (Version 1.6.7).
https://lyricstraining.com/
MacIntyre, Peter D., Richard Clément, Zoltán Dörnyei & Kimberly A. Noels. 1998.
Conceptualizing willingness to communicate in a L2: A situational model of L2 confidence
and affiliation. The Modern Language Journal 82(4). 545–562.
Mishra, Punya & Matthew J. Koehler. 2007. Technological pedagogical content knowledge
(TPACK): Confronting the wicked problems of teaching with technology. In Society for
Information Technology & Teacher Education International Conference, San Antonio,
Texas (USA), 2007, 2214–2226. Waynesville: Association for the Advancement of
Computing in Education (AACE).
Morita, Naoko. 2004. Negotiating participation and identity in second language academic
communities. TESOL Quarterly 38(4). 573–603.
Munro, Murray J. 2018. How well can we predict second language learners’ pronunciation
difficulties? CATESOL Journal 30(1). 267–281.
Munro, Murray J. & Tracey M. Derwing. 2006. The functional load principle in ESL
pronunciation instruction: An exploratory study. System 34(4). 520–531.
Myles, Johanne & Liying Cheng. 2003. The social and cultural life of non-native English
speaking international graduate students at a Canadian university. Journal of English for
Academic Purposes 2(3). 247–263.
Purcell, Edward T. & Richard W. Suter. 1980. Predictors of pronunciation accuracy: A
reexamination. Language Learning 30(2). 271–287.
Rajadurai, Joanne. 2007. Intelligibility studies: A consideration of empirical and ideological
issues. World Englishes 26(1). 87–98.
Ranta, Leila & Amy Meckelborg. 2013. How much exposure to English do international
graduate students really get? Measuring language use in a naturalistic setting. Canadian
Modern Language Review 69(1). 1–33.
Saito, Kazuya. 2012. Effects of instruction on L2 pronunciation development: A synthesis of 15
Saito, Kazuya. 2013. Reexamining effects of form-focused instruction on L2 pronunciation
development. Studies in Second Language Acquisition 35(1). 1–29.
Saito, Kazuya. 2015. Variables affecting the effects of recasts on L2 pronunciation
development. Language Teaching Research 19(3). 276–300.
Saito, Kazuya & Roy Lyster. 2012. Effects of form‐focused instruction and corrective feedback
on L2 pronunciation development of /ɹ/ by Japanese learners of English. Language
Learning 62(2). 595–633.
Saito, Kazuya & Luke Plonsky. 2019. Effects of second language pronunciation teaching
revisited: A proposed measurement framework and meta‐analysis. Language Learning
69(3). 652–708.
Sawir, Erlenawati. 2005. Language difficulties of international students in Australia: The
effects of prior learning experience. International Education Journal 6(5). 567–580.
Sereno, Joan, Lynne Lammers & Allard Jongman. 2016. The relative contribution of segments
and intonation to the perception of foreign-accented speech. Applied Psycholinguistics
37(2). 303–322.
Spada, Nina & Yasuyo Tomita. 2010. Interactions between type of instruction and type of
language feature: A meta‐analysis. Language Learning 60(2). 263–308.
Subtirelu, Nicholas Close. 2017. Students’ orientations to communication across linguistic
differences with international teaching assistants at an internationalizing university in
the United States. Multilingua 36(3). 247–280.
Swan, Michael & Bernard Smith. 2001. Learner English: A teacher’s guide to interference and
other problems. Cambridge: Cambridge University Press.
Yanagi, Miho & Amanda A. Baker. 2016. Challenges experienced by Japanese students with
oral communication skills in Australian universities. TESOL Journal 7(3). 621–644.
Zielinski, Beth W. 2008. The listener: No longer the silent partner in reduced intelligibility.
System 36(1). 69–84.
Ilvi Blessenaar, Lizet van Ewijk
Putting participation first: The use
of the ICF-model in the assessment
and instruction of L2 pronunciation
Abstract: L2 pronunciation training should unequivocally be linked to complex
daily life experiences (Derwing 2017). Each client comes from a different back-
ground, participates in a different environmental context and engages in different
activities within those contexts (Threats 2008). This is a particularly challenging
aspect in the L2 practice (Derwing 2017). The International Classification of Func-
tioning, Disability and Health, also known as the ICF-Model (WHO 2001, 2013),
offers a conceptual framework that acknowledges the intricate dimensions of
human functioning and incorporates personal and contextual factors that can
influence participation in daily life (Heerkens and de Beer 2007; Ma, Threats,
and Worrall 2008). This paper provides an exploration of the application of
this model to pronunciation and intelligibility difficulties in L2 learning. We
apply the model to a specific L2 learner, Mahmout, and demonstrate how its use
allows for consideration of factors much broader than the phonological or phonetic
challenges Mahmout faces. Mahmout must be able to generalize what
he has learned into functional communicative competences to improve his participation.
The ICF-model (WHO 2001, 2013) is used globally in a broad array of
healthcare professions, including Speech and Language Therapists (SLTs). Yet,
it is not a customary tool, nor probably an obvious one, used by L2-professionals
(Blake and McLeod 2019). Of course, our goal is not to classify pronunciation
problems of L2 learners as disabilities. The model proves a useful tool to view the
individual L2 learner as a whole, and part of a larger system. It may allow L2
professionals to tailor their intervention to the individual’s needs and situation
and consequently to establish priorities in instruction to enable
appropriate goal setting for each individual (Blake and McLeod 2019). It allows
identification of influencing barriers or facilitating factors within the stagna-
tion or improvement of pronunciation (Blake and McLeod 2019; Howe 2008).
Keywords: ICF, intelligibility, pronunciation, participation, second language
Ilvi Blessenaar, Lizet van Ewijk, University of Applied Sciences Utrecht
https://doi.org/10.1515/9783110736120-008
1 Introduction
The International Classification of Functioning, Disability and Health (ICF) was
introduced by the World Health Organisation (WHO) in 2001, with two main
goals: to offer a conceptual framework for health and health-related states, and
to create a common language for researchers, clinicians, educators, and policy
makers. The introduction of the ICF was a milestone towards more holistic and
person-centred care, as it offers a biopsychosocial perspective on health and in-
cludes the influence of personal and environmental factors. It offers a philoso-
phy, a way of acknowledging the complex dimensions of human functioning
and its interaction with its environment. It assists in comprehensively describing
a person’s individual functioning profile, which in turn helps to better understand
the person’s specific needs.
With the introduction of the ICF, WHO provided a new perspective on the
terms health and disability, acknowledging that all people can and will experience
some level of ‘disability’ in their life. The ICF aims to be universally applicable to
all people, without link to aetiology. Despite its obvious roots in healthcare, this
approach to health and life opens up possibilities for application to (groups of)
people who may not have poor health status in the biological sense but are
hindered in their ability to function and participate fully in life due to other
(external) challenges. These challenges can span the total breadth of human
experiences and can be related to the person as much as to the system around
the person. ICF is a social model that treats limitations in functioning as a
socially created problem and not an attribute of an individual (Cerniauskaite
et al. 2011; Jelsma 2009; Üstün et al. 2003). In other words, if society, the environment,
were maximally adjusted, the person would not experience limitations.
If an L2 learner lived in a community that was completely accepting
of linguistic and cultural differences, the L2 learner would experience far
fewer constraints on his/her functioning in a new language context.
The ICF model is used globally in a broad array of healthcare professions,
yet it is neither a customary tool, nor probably an obvious one, in the field
of L2 learning and teaching around the world. “Pathologizing” L2 speech is a
harmful and unwanted practice in the broad field of L2. The authors strongly
condemn this phenomenon and the discrimination associated with it and do not
wish to contribute to it in any way. In fact, the aim and philosophy of the model
and the WHO’s core value is quite the contrary of pathologizing: “equity, inclu-
sion and the aim of all to achieve a life where each person can exploit his or her
opportunities to the fullest possible degree” (WHO 2002). The application of the
model in the L2 context contributes towards placing pronunciation at the heart
of L2 learning (as opposed to accent) by considering the participation of the L2
learner as a leading frame of reference.
The goal of this chapter is to provide an exploration of the application of
this model to pronunciation and intelligibility difficulties in L2 learning and to
provide a comprehensive overview of the ICF framework. This chapter illus-
trates the possible usefulness of the ICF for a more holistic approach to L2 pro-
nunciation and intelligibility. We will apply the model to a specific L2 learner,
Mahmout, and demonstrate how its use allows for consideration of factors
much broader than the phonological or phonetic differences Mahmout faces.
Mahmout wishes to improve his pronunciation and must be able to generalize
what he has learned into functional communicative competences to improve
his participation. The ICF provides tools to tailor pronunciation instruction to
the individual’s needs and unique circumstances, and it may also help educators
support students who aspire to become L2 instructors in their
learning process.
2 What is ICF?
2.1 A little bit of history
The first attempt to classify and describe consequences of health and health-
related experiences was the development of the International Classification of
Impairments, Disabilities, and Handicaps (ICIDH) in the 1980s. This classifica-
tion system already aimed to advance the idea that health is much more than
the absence of illness. This system, however, did not reflect the (influence of
the) complex interrelations and interactions between various factors in people’s
lives (Ma, Threats, and Worrall 2008). In 1993 the WHO started developing the
ICF. After numerous field trials and consultations, all 191 WHO Member States
in the Fifty-fourth World Health Assembly officially endorsed it on 22 May 2001
to be used in their policymaking, and scientific standardisation in research,
planning, and care. By doing so, the WHO shifted the perspective on health
from cause (illness, disability, handicap) to impact (WHO 2002). It also reso-
nates with the current WHO definition of health (1948), which describes health
as “a state of complete physical, mental and social well-being and not merely
the absence of disease or infirmity” (WHO 1948).
The recently suggested update of this definition (Huber et al. 2011), which conceptualizes
health as “the ability to adapt and self-manage in the face of social,
physical, and emotional challenges”, takes the concept of health even further
away from the biomedical approach to health. With its six dimensions ranging
from bodily functions to the spiritual/existential dimension, health is clearly
seen as something much broader than the absence of disease.
We can define the ICF as a universal, neutral and social model. It describes
domains of functioning applicable to every human being. In line with WHO’s
core value, it applies to all people irrespective of their culture, health condition,
gender or age. It espouses a neutral perspective. Ultimately, the ICF is about
people and its premise is that of focussing on the positive abilities of the indi-
vidual (Cerniauskaite et al. 2011; Üstün et al. 2003; WHO 2002).
2.2 The ICF model
What exactly does the ICF-model entail? The model is composed of three do-
mains (Figure 1): ‘Body Structures and Functions’, ‘Activities and Participation’
and ‘Contextual Factors’. The ICF categories were defined using neutral
language without negative connotations, so that they can indicate neutral aspects of
health and health-related states under the umbrella term of functioning.
[Diagram components: Formulated wants, needs and goals; Body Functions and Structures; Activities; Participation; Contextual factors (Environmental factors, Personal factors)]
Figure 1: (Adapted) Visual representation of ICF-framework (WHO 2001).
The starting point within the ICF philosophy is always the perspective of the per-
son him- or herself. This means that the ‘formulated wants, needs and goals’ are
the starting point of any conversation and possible intervention. The diagram
identifies the three levels of human functioning classified by ICF: functioning at
the level of the physical body, the whole person, and the whole person in a social
context. The five constructs of body structures/functions, activities, participation,
environmental factors, and personal factors are identified, and bidirectional ar-
rows represent the interactions among the different components, reflecting the
ongoing influence of environmental factors on body functions, activities, and
participation, and vice versa (WHO 2001; WHO 2002). To illustrate the model on
a basic level, we use the examples of a broken leg and stuttering in Table 1:
Table 1: Basic examples ICF.

| Level | Construct | Example A: General | Example B: Communication |
|---|---|---|---|
| Perspective of the person | Formulated wants, needs and goals | “I want to be able to regain full function of my leg and return to my former hobbies and work.” | “I want to address my stutter for the first time in my life because I want to start my own business.” |
| Contextual factors | Personal factors | Woman (A),  years old, store manager, good overall health, likes sports. Motivated for physical therapy. | Man (B),  years old, African descent, works in construction, introvert. |
| Contextual factors | Environmental factors | Lives in second-floor apartment without elevator, receives a lot of support from siblings and parents. | In his culture, stuttering is seen as a taboo, so he never received treatment. |
| Level of the physical body | Body functions and structures | The bone in the right upper leg is broken in  places. | Mild to severe stutter, severity increases with stress. |
| Level of the whole person | Activities | This means she cannot walk, run, drive, ride her bike, play sports, jump, etc. | This means he experiences trouble with speaking, especially with strangers or on the telephone. |
| Level of person in social context | Participation | Because she cannot walk, drive, ride her bike, she is not able to go to work. Because she cannot run and jump, she cannot play tennis or go hiking. She experiences participation problems because of her broken leg. | He wants to start his own contracting business, but his stutter makes him insecure. He will have to talk to a lot more people. He does not know if he will be able to secure clients. |
These examples show the bidirectional influence between the various levels
of the ICF. The fact that person A has a broken bone hinders work and
hobbies for this particular client and, therefore, participation in society. A support
network, on the other hand, might facilitate recovery. In example B, stuttering
severity increases for person B in stressful situations. This influences his
choice to start a new business, a life event associated with many stressors. Fur-
thermore, cultural attitudes might negatively affect progress.
2.3 ICF and its application in speech, language and communication difficulties in L1 and L2 learning
Speech and Language Therapists¹ (SLTs) around the world use the ICF to classify
and clarify communication problems clients experience in daily functioning and
how this affects their participation in everyday life. The use of the ICF has been
endorsed since the early 2000s by many SLT organizations around the world, to
inform standards of practice, including the American Speech-Language-Hearing
Association (ASHA), the Dutch Association for SLTs (NVLF, “Nederlandse Vereniging
voor Logopedie en Foniatrie”) and many others (Heerkens and de Beer 2007;
Ma, Threats, and Worrall 2008). As the framework incorporates personal and con-
textual factors that can positively or negatively influence participation in daily
life, it allows the SLT to systematically capture the communicative elements a cli-
ent has difficulties with, and take into consideration what a client is able to do
(Heerkens and de Beer 2007; Threats 2008). The ASHA formulates the use of the
ICF as follows:
American Speech-Language-Hearing Association on ICF:
(. . .) capitalize on strengths and address weaknesses related to underlying structures and
functions, facilitate the individual’s activities and participation by assisting the person to
acquire new skills and strategies; modify contextual factors to reduce barriers; enhance fa-
cilitators of successful communication and participation and to provide appropriate accom-
modations and other supports, as well as training in how to use them.
(American Speech-Language-Hearing Association 2004)
We appreciate that a framework with its roots in health-related communication
difficulties might raise some eyebrows in relation to L2 pronunciation learning.
However, when relating the definition above to L2 learning, one can only conclude
that this definition is very applicable to L2 learners. The aetiology of possible barriers
in activities and participation is of little importance, if we focus on intelligibility,
functioning and participation.
¹ Around the world, different titles are used for comparable professions: In Europe, the title
‘Speech and Language Therapist’ is more common, while in North-America ‘Speech-Language
Pathologist’ is the most common term. We wish to include all variations here.
2.4 The five ICF components and intelligibility
In the following sections, we address the aspects of the model that are most
relevant to L2 pronunciation. We will discuss the various aspects of the ICF and
describe its constructs and their use more closely. In Figure 2 different domains
and corresponding paragraphs are referenced.
[Diagram components: Formulated wants, needs and goals; Body Functions and Structures; Activities; Participation; Contextual factors (Environmental factors, Personal factors)]
Figure 2: (Adapted) Visual representation of ICF-framework (WHO 2001) with reference to explanatory paragraphs.
2.4.1 Body functions and structures
Body Structures and Functions are at the base of spoken communication. This
is true for every individual. Body structures are defined as “anatomical parts of
the body, such as organs, limbs and their components” (WHO 2002) and are
less relevant in the context of L2 learning, as they are generally considered to
be intact. Body Functions, on the other hand, are defined as “the physiological
functions of body systems, including psychological functions” (WHO 2002).
The Practical Manual for using the ICF (WHO 2013) states that the production
of speech sounds is considered a body function. Difficulties at this level entail
input, organisation, and production of speech, at both the segmental and suprasegmental
level (Baker et al. 2001; McLeod and McCormack 2007). The
concept of Body Functions is not linked to cause: problems with both L1 and
L2 intelligibility can have a wide variety of causes, such as brain injury, pho-
nological problems, weak general articulation skills, etc. (Cormack and Wor-
rall 2008). For L2, this construct is typically described in terms of linguistic
assessment results, such as the possible presence of segmental errors, problems
in the identification and discrimination of phonemes, or errors in word stress,
to name a few. These linguistic factors could become (part of) the target of instruc-
tion. Deciding which linguistic factors should be addressed in instruction is a col-
laborative process between the professional and client, taking into consideration a
multitude of factors (which are mapped in the other components of the ICF).
2.4.2 Activities and participation
Learners usually expect that training of their L2 pronunciation will automatically
result in a positive effect on their daily lives (O’Halloran and Larkins
2008). However, for L2 learners, integrating what they have practiced in the
classroom into their everyday communication settings can be a challenge. One
of the goals of the ICF is to identify limitations in functioning in daily life and
determine an individual’s functioning as a whole (WHO 2013). The activities
and participation levels allow for unravelling of proficiency on these varied lev-
els, and the possibility to target training accordingly. Although presented as
two distinct modules in the framework, the two levels are often mapped as a
continuum, rather than two clearly marked categories.
Activities are defined as ‘the execution of a task or action by an individual’
or ‘what people can do inherently without assistance or barriers’ (WHO 2002).
One of the most important aspects of this part of the ICF is to make the distinc-
tion between what a person can do in a standard environment (i.e., their level
of capacity), and what they actually do in their usual environment (i.e., their
level of performance) (WHO 2002). The mismatch between capacity and perfor-
mance is important to investigate, particularly in relation to the (lack of) trans-
fer of skills from classroom to real life. For example, L2 learners can perform
well on a single word production task, but in the complex communicative de-
mands of spontaneous speech, they might be far less proficient. This is sup-
ported by recent research that shows task-based language instruction can be
very beneficial in L2 pronunciation learning (Gordon 2021; Gurzynski-Weiss,
Long, and Solon 2017). In addition, the relationship between task specificity and
comprehensibility² has been shown repeatedly (Crowther et al. 2015a, 2018). The
discrepancy between abilities in the classroom compared to their experiences in
daily life can be significant (O’Halloran and Larkins 2008). For example, being
proficient in a small classroom with well-known peers in a controlled exercise
that focuses on the production or perception of one sound in high-frequency
words (capacity) has little predictive value for proficiency in using that speech
sound in low-frequency words in a conversation with a stranger, or an authority
figure at work (performance). Activities range from basic to complex along a con-
tinuum, with controlled and targeted tasks on one end, moving toward multidi-
mensional complex activities on the other end.
Participation constitutes the “involvement in a life situation”, which implies
a role in society and entails choice and judgement (O’Halloran and Larkins
2008). It therefore by definition deals with ‘performance’ of the learner and always
incorporates all elements that are essential for successful communication.
Potential problems in speech sound production or perception that were identified
in ‘body functions’, which were possibly also present in the capacity and/or per-
formance of ‘activities,’ are now only a small fragment of the whole picture.
This also explains discrepancies between the client’s and the professional’s perspectives,
in terms of assessment of proficiency. The level of capacity of ‘activities’ is
generally judged by the professional (and client), whereas the level of perfor-
mance can only be assessed by the person themselves (or a proxy). For example,
the professional can judge an individual’s intelligibility to be moderate to good
on a Likert-scale or score a speech sound as ‘correct’. The L2 professional as-
sesses speech in an ‘ideal’ situation with all the knowledge, skills, expertise and
experience they have. The L2 learner, on the other hand, may qualify his/her
own intelligibility as insufficient. When queried, the L2 learner will likely talk
about a communicative situation in which he/she was not understood or in
which a misunderstanding occurred because of his/her speech (intelligibility in
context). In summary, the distinction between capacity and performance pro-
vides the L2 professional with complementary information. Even when the client
is able to produce speech sounds correctly under certain circumstances (capac-
ity), this does not necessarily translate to the ability to use these sounds in real
life situations (performance). Furthermore, addressing ‘performance’ underscores
that it is overall intelligibility that is critical for communication, which
is not necessarily directly or unequivocally the result of the identified unacquired
patterns in perception or production, as identified in ‘body functions’ (Derwing
and Munro 2009; Munro and Derwing 2009). In this example, even if the learner
has difficulties with a particular segmental contrast both in class (capacity) and
real life (performance), this contrast may have a very limited functional load and
may well be of limited influence on the learner’s overall intelligibility (Munro
and Derwing 2006; Suzukida and Saito 2021). This contrast is therefore much less
useful to work on if we focus on participation.
² Comprehensibility is defined by Derwing and Munro (2009) as: the listener’s perception of
how easy or difficult it is to understand a given speech sample (Derwing and Munro 2009: 4).
2.4.3 Contextual factors
WHO defines contextual factors as “the complete background of an individual’s
life and living” (WHO 2001). It is comprised of two categories: personal and environmental
factors. Environmental factors refer to all aspects of the external
world of an individual’s life that may have an impact on his or her functioning.
Personal factors are described as features of the individual in general and
entail demographic information and personality traits. Also included are per-
sonal factors that are inherently part of the L2 learner, such as gender, age, eth-
nic background, and past experiences (Howe 2008).
Environmental factors are specified with respect to the perspective of the
person: something that may be a facilitator to one individual can be a barrier to
another. These factors are subdivided into individual and societal levels. WHO
describes them across five domains (Garcia, Laroche, and Barrette 2002; Howe
2008; Macintyre 2007; McLeod and McCormack 2007; WHO 2002). For each do-
main, we will list concrete examples applicable to L2 learners:
1. Products and technologies: e.g., a telephone with access to a translation app
can be a facilitator for some L2 learners, whereas automatic voice recognition
can hinder L2 learners’ access to certain services (Blake and McLeod
2019).
2. Natural and human-made changes to the environment: e.g., background
noise can be a barrier to comprehensibility or intelligibility in the workplace
(Munro 1998).
3. Support and relationships: the presence of a loved one can help the L2 learner
overcome the fear of public speaking, whereas the presence of strangers can
hamper attempts. Specific communication behaviours of other people, such
as speech rate (Anderson-Hsieh and Koehler 1988), language complexity or
not providing enough time for the L2 learner to react, can hinder communica-
tive participation (Blake and McLeod 2019).
4. Attitudes: other people’s negative attitudes can be a hindrance for L2 learn-
ers to function in the workplace and can cause a risk of discrimination or
bullying. On the other hand, cultural sensitivity and a respectful attitude
can support L2 learners in their process. In addition, societal awareness
can be a great benefit, whilst the lack thereof, a major drawback (Blake and
McLeod 2019; Derwing 2003; Dragojevic and Giles 2016).
5. Services, systems, and policies: programmes such as language buddies or
mentor programmes can provide support and act as facilitators. Immigration
policies and legislation regarding residence permits can influence the
L2 learning process positively or negatively (Derwing et al. 2014a; Levis and
Wu 2018).
For L2 professionals, the benefit of considering these factors is twofold. First, it
allows the professional to identify negative factors or barriers as well as positive
factors or facilitators in the life, and, therefore, in the learning process of
an L2 learner. Positive factors could possibly be capitalized on, and barriers
considered. For example, a barrier to L2 learning could be great uncertainty
about residence permits. Experiencing stress may well hinder involvement in
training and the motivation to learn the L2. A facilitator could be a social envi-
ronment where the L2 learner has regular contact with L1 speakers at work.
This could be used by focussing on transfer of the elements worked on in L2
training, in interaction with the L1 speakers.
The second benefit of mapping these factors is that it allows the L2 profes-
sional to make a distinction between factors that can be influenced, such as –
but not limited to – knowledge, opinions, behaviour, individual psychological
assets or coping skills (Howe 2008), and those that cannot be influenced. For
example, L2 learners may be less intrinsically motivated when they are referred
to training by their boss, rather than seeking training based on a personal de-
sire to ameliorate their intelligibility. This is likely to affect the way they ap-
proach practice and application of newly learned skills. However, when they
experience the benefits of improved intelligibility during instruction or through
conversations with peers and instructors, this motivation might quickly change
for the better. Motivation is therefore an element that can be influenced by a
professional. On the other hand, an L2 professional, for example, has no influ-
ence on the fact that an L2 learner has had no access to formal education after
elementary school, or even the fact that they experienced discrimination or rac-
ism because of their speech.
In addition, with an emphasis on overall intelligibility, one could argue
that the system, rather than the person themselves, deserves influencing. After all,
one of ICF’s main principles is that if the environment is optimally
adapted, limitations no longer exist (Cerniauskaite et al. 2011). If we as a
society were to succeed at preparing interlocutors for a wide variety of linguistic
diversity, this would immediately affect the participation of L2 learners. For
example, increasing understanding and empathy for L2 learners and providing
awareness training for employers, school and university staff, as well as co-workers,
could be a valuable contribution to the level of participation of the
L2 learner, as research confirms (Derwing, Rossiter, and Munro 2002).
By considering all domains and their interactions, an L2 professional can
determine their impact on a person’s functioning, set achievable goals, create
realistic expectations, and formulate specific recommendations. The starting
point is a close examination of the challenges an L2 learner has in their partici-
pation in society (2.4.2 participation) due to issues in intelligibility. These can
then be related to the possible issues in segmentals, suprasegmentals, perception,
production (2.4.1, body structures and functions) to determine which of
these issues are most detrimental to intelligibility. An example of an aspect that
can be considered in prioritising particular linguistic structures for instruction
is functional load (considered part of the external factors in the model). An-
other example of taking an external factor into consideration when choosing
linguistic targets that are most likely to actually impact on communicative suc-
cess for L2 learners is targeting segmentals that hinder native listeners’ under-
standing more than others (Munro and Derwing 2006; Suzukida and Saito
2021). Furthermore, there is of course ample evidence for addressing prosody in
training, in addition to segmental features (Caspers 2010; Caspers and Horłoza
2012; Derwing et al. 2014b; Derwing, Munro, and Wiebe 1998; Hahn 2004; Kis-
sling 2013; Lee, Jang, and Plonsky 2015; Lee, Plonsky, and Saito 2020; Zhang
and Yuan 2020; Trofimovich and Baker 2006). Ultimately, the impact of certain
linguistic challenges on the participation (2.4.2) of that specific person is decisive.
The ICF model therefore provides the starting point of instruction. A careful
investigation of possible barriers and facilitators in the context of the L2
learner (2.4.3 Contextual factors) ensures that the L2 professional can tune in to
what an L2 learner needs to master to function adequately in their lives and set
realistic and obtainable goals in the light of the individual learner’s unique cir-
cumstances. Using the notions of capacity and performance, instruction and
training can be increased in difficulty by starting with relatively simple activity
(2.4.2) tasks, building up to intelligibility in context, which is necessary for par-
ticipation (2.4.2) in real life.
3 ICF in relation to existing definitions and roles

3.1 Accent, intelligibility and ICF
An important issue in the field of L2 pronunciation research is the distinction
between accent, comprehensibility, and intelligibility. Accent is the way in
which the speech of a speaker differs from the local variety of the language
spoken and the effect it has on both the speaker and the listener (Munro and
Derwing 2009). In light of the ICF, ‘accent’ refers for the most part to the level
of Body functions (2.4.1): the way consonants, vowels, diphthongs and prosody
are produced differently from how (local) L1 speakers produce them. As dis-
cussed, these features are identified in body functions, but do not automatically
or unequivocally correlate to a person’s functioning in daily life (Cormack and
Worrall 2008; McLeod and McCormack 2007; Wambaugh and Mauszychi 2010).
Comprehensibility is defined by Derwing (2018: 321) as “degree of effort required
to understand speech”. Crowther et al. (2015b) concluded that the level of com-
prehensibility is determined by much more than accurate pronunciation alone. It
involves a complex interaction between the broad linguistic dimensions of L1 and
L2, proficiency level and speaking task. Common ground and knowledge of con-
text of the speaker and listener also influence comprehensibility. Intelligibility
refers to whether the speaker is understandable to the communication partner. It
refers to the end result of the speech act of the L2 speaker, rather than to whether
speech sounds are produced similarly to those of L1 speakers (accent), or how much effort
it takes for the listener to understand it (comprehensibility) (Levis 2005; Munro
2008; Munro and Derwing 2009). Indeed, intelligibility can be high, and commu-
nication very successful with the presence of an accent (Levis and Wu 2018), al-
though heavily accented speech can of course be quite unintelligible (Munro
2008; Munro and Derwing 1995, 2011).
The field of L2 pronunciation has long been influenced by the ‘nativeness
principle’, which holds that the ultimate goal for pronunciation instruction
should be to eliminate the accent (Levis 2005). As research has already pointed
out multiple times, L2 instruction should rather focus primarily on achieving
an actual change in intelligibility (Derwing 2017, 2018; Marx 2002; Munro 2008;
Munro and Derwing 2009). Considering the ICF, intelligibility directly relates to
the level of participation (2.4.2): a lack of intelligibility (and/or comprehensibility)
in real context is what hampers people in their daily communication. ‘The
accent issue’ is resolved when focussing on whether aspects in someone’s
speech affect their opportunities for participation in society. Because the ICF
considers functional implications of intelligibility, it can contribute to an im-
provement of L2 learners’ communication in their everyday environment. Using
this approach, we automatically move away from focussing on accent and
from modifying speech sounds, simply because they are produced differently
to the norm. In summary, we argue that the ICF could make a useful contribution
to adhering to the core value of pronunciation instruction: improving overall
communication.
3.2 ICF and linguistic theory: Dynamic systems theory
The ICF is not the only model that aims to reveal interacting domains or factors.
Parallels between the core principles of ICF and Dynamic Systems Theory
(DST), the science of complex systems, have been made for numerous areas in
healthcare (Andrews 1996; Beckman, Fernandez, and Coulter 1996; Fannin
2016; McDougall, Wright, and Rosenbaum 2010). In fact, George Engel (1977),
who is considered the founder of the principles of the ICF, has from the outset
established the relationship between the holistic biopsychosocial model and
systems theory. De Bot, Lowie, and Verspoor (2007) applied the principles of
DST to L2 and argue that a DST approach may help us to develop a more realis-
tic representation of L2 development than other linguistic theories. DST de-
scribes that cognitive, social, and environmental factors continuously interact,
resulting in the emergence of creative communicative behaviours. DST describes
language development as a process that takes place through interaction
between the individual and their environment. After all, the main purpose of
language is participating in social experiences. DST implies that focussing on
one aspect of this process only, cannot but provide an oversimplification of re-
ality. Only when we consider the dynamic interaction of all factors, are we
able to appreciate the actual complexity of the process (de Bot, Lowie, and Ver-
spoor 2007; Verspoor 2013). DST and the ICF have in common that they reflect
multidimensionality, a holistic approach, non-linearity and perpetually changing
human circumstances. The ICF attempts to tease apart this multidimensionality
by the use of interacting domains, recognising that the whole is greater than
the sum of its parts (McDougall, Wright, and Rosenbaum 2010). Thus, we suggest
that the ICF could be a way of translating DST principles into daily practice
and creating a realistic representation of the life of an L2 learner. By doing so,
the gap between the theoretical and highly conceptual DST and the way this
translates into daily practice for L2 professionals could be addressed. Further
exploration of this application and empirical research is of course necessary.
3.3 On roles and responsibilities: The need for collaboration
The role of SLTs in the L2 field is controversial (Grant 2014; Muller, Ball, and
Guendouzi 2000; Schmidt and Sullivan 2003). The problematic role of commerce
in ‘accent modification’ and ‘accent reduction’, and the medicalisation of
normal language learning processes (Derwing and Munro 2009), have unfortunately
muddied the waters of L2 pronunciation instruction for SLTs. Several researchers
have already shared valid ethical considerations on this matter
(Derwing et al. 2014a; Thomson and Foote 2019). This controversy has overshadowed
the potential of collaboration between SLTs and L2 trainers and the possibilities of
capitalising on each other’s strengths. The knowledge of SLTs on the impact of
communication difficulties on daily functioning, their knowledge of phonetics and
phonology, and their therapeutic techniques and skills could be a very useful addition to
the research and daily practice concerning L2 learners. Applied linguists and L2
instructors, teachers and educators, on the other hand, have tremendous expertise
in and experience with L2 acquisition, L2 learning processes, L2 classroom
dynamics and didactics, theoretical frameworks, etc. A combination of these
strengths could greatly benefit the assessment and training of the pronunciation and
intelligibility of L2 learners. In essence, the cause of intelligibility problems
is of secondary importance if we focus on the intelligibility, functioning and
participation of the L2 learner and look for the expertise needed to achieve this.
By referring to the term ‘L2 professional’ throughout this chapter, we wish
to focus on collaboration and expertise and move away from (self-inflicted) pro-
fessional boundaries. From the perspective of the L2 learner, it does not matter
who can help them improve their intelligibility, as long as it helps them live a
fulfilling life in a society and it is done in an ethical way.
Furthermore, as the issue of immigration is a permanent resident in current
affairs and the spotlight globally remains on integration and participation of im-
migrants in society, there continues to be a high demand for L2 instruction in
general and more specifically on pronunciation and intelligibility (Blake, Kneebone,
and McLeod 2017; European Commission 2021; Verbakel, van den Brink,
and Groot 2020). We simply cannot afford to shy away from interprofessional collaboration.
As the ICF allows for the use of a standard set of internationally recognized
terms, it is ideally suited for broad use. This facilitates cooperation with
other professionals greatly and provides a much broader view on functioning, in-
tegration, and participation in society. It has also been translated into numerous
languages and as a result offers a solid base for international collaboration, both
in research and in practice.
3.4 ICF as a framework in L2 assessment and instruction
We propose using the ICF as a mind-set in the assessment and instruction of
pronunciation aimed at the improvement of intelligibility. There are three reasons
why we feel using the ICF would be of benefit.
Firstly, variables that influence L2 language and language learning can
only be factored in if you are aware that they exist. Systematic gathering of in-
formation within the framework of the ICF will ensure that an L2 professional
examines the intelligibility problems of the L2 learner in relation to many other
(personal and environmental) factors, which in turn supports appropriate and
meaningful goal setting (Threats 2008). The LONT assessment (the SLT assessment
protocol of Dutch as L2, “Logopedisch Onderzoeksprotocol NT2”; Blessenaar
et al. 2018), which has been developed at the HU University of Applied
Sciences Utrecht in the Netherlands, is an example of how the ICF has been
used as a framework for L2 pronunciation and intelligibility assessment.³ The
LONT assessment is a protocol aimed at obtaining a qualitative analysis of pronunciation
and intelligibility in adolescent and adult L2 learners. It consists of
three parts. The first part is a short screening in which general intelligibility in
conversation is observed and the factors that hinder it most are identified, for
example prosody, pronunciation of vowels and diphthongs, but also voice quality
and general articulation skills. This part focuses on ‘body structures and
functions’ and ‘activities’. Based on the screening, the professional can decide
which components of part three need to be examined in more detail.
The second part is an extensive semi-structured interview in which the other
aspects of the ICF are mapped in collaboration with the client. In Table 2, exam-
ples of questions are included for every ICF domain.
Table 2: Example questions LONT interview organised by ICF domain (Blessenaar et al. 2018).
Domain                  Example
Body functions          Can you describe what bothers you in your pronunciation of Dutch?
Activities              How well can you understand people who speak Dutch?
Participation           When do you experience the most difficulties speaking Dutch?
Personal factors        How important is being intelligible in Dutch for you?
Environmental factors   Are you always well understood when you speak Dutch?
³ For information on the LONT assessment (The SLT assessment protocol of Dutch as L2
“Logopedisch Onderzoeksprotocol NT2”, Blessenaar et al. 2018), please contact the authors:
ilvi.blessenaar@hu.nl, lizet.vanewijk@hu.nl.
In the third part, phonemes at the word level, prosody at the lexical level
and at the discourse level can be assessed, based on the results of the screening.
The LONT assessment can be conducted repeatedly to monitor progress (Blesse-
naar et al. 2018). Furthermore, it allows for dynamic assessment, using cues to
probe the client’s response to instruction. There are no normative data, as the
assessment does not strive towards comparison between clients.
Secondly, by determining which relationships exist and how they interact,
the L2 professional and the L2 learner can identify the aspects of speech that are
most detrimental to intelligibility and thus set priorities in training for that spe-
cific L2 learner. Achievable goals are set together, to work towards intelligible
conversational speech within a person’s environment, with the possibility to
evaluate these goals in detail over time. This way the L2 professional will be able
to formulate a realistic prognosis that includes relevant barriers and facilitators
and give necessary recommendations. The ICF and, more specifically, the rela-
tionships between the domains are reflected in the goals that are formulated, the
priorities that must be made, and the training means that need to be chosen.
This means that, in short, training is designed to capitalize on strengths and ad-
dress weaknesses, to facilitate the individual’s activities and participation by as-
sisting the person to acquire new skills and strategies; and to modify contextual
factors to reduce barriers and enhance facilitators of successful communication
and participation (American Speech-Language-Hearing Association 2004).
Thirdly, L2 professionals are of course aware of the individuality of learners
without the use of the ICF. They understand that each L2 learner comes from a
different background, has had different experiences, and encounters different
communicative activities. The diversity among L2 learners is huge, ranging from
exchange students who stay in a country for a semester and spend their time with
other international students at university, to someone who moves across the
world for love and learns to integrate into a new family for the rest of their life,
to refugees hoping to return home when wars are over. All these circumstances
bring about very different starting points, influences, and motivations
in the process of learning a language. However, research shows that this
knowledge does not automatically translate into the way L2 professionals adapt
their choices or actions to best suit the learners’ needs (Cormack and Worrall
2008). The ICF could provide a framework for translating this knowledge and its
implications for instruction into practice, and force L2 professionals to ensure
that L2 instruction is directly related to real life.
To summarize, because the ICF considers functional implications of intelligi-
bility, it can contribute to an improvement of L2 learners’ communication in their
everyday environment. When attention is paid to the multifaceted character of
functioning, a more tailored and therefore effective training can be designed
(WHO 2001). The ICF provides additional areas of consideration to enable appro-
priate goal setting for each individual: it considers limitations and social factors,
ultimately to bring about change in the lives of people. After all, each L2 learner
comes from a different background, participates in a different environmental
context, and engages in different activities within those contexts (Threats 2008).
4 Case example of application of ICF: The case of Mahmout
In the following section, we will apply the ICF to a case. In Figure 3 an elabo-
rate ICF-scheme is included with all the information listed below mapped onto
the ICF domains. We will refer to the different ICF domains as follows:
– BS/F: Body Structures/Functions (see 2.4.1).
– A: Activities (see 2.4.2).
– P: Participation (see 2.4.2).
– PF: Personal Factors (see 2.4.3).
– EF: Environmental factors (see 2.4.3).
As for the L2 learner, Mahmout L. is a 28-year-old Syrian man looking for help
to improve his intelligibility in Dutch. He formulated his wishes for pronuncia-
tion improvement in the following way:
Dutch people often don’t understand me. I want to improve my speech, because I want to be
a teacher again.
4.1 Assessment
During the initial session, Mahmout was assessed using the LONT (Blessenaar et al.
2018). The first part of this protocol consists of a screening of spontaneous speech
to determine which aspects of speech influence intelligibility the most (A). It also
contains an extensive topic list for an interview to map out environmental factors (EF),
personal factors (PF) and the possible challenges in relation to activities and par-
ticipation (A&P). The second part consists of an assessment of vowels, diphthongs,
consonants, clusters and of suprasegmentals, such as word stress, intonation, and
rhythm (BF).
Figure 3: ICF case – Mahmout L.

LONT results
Mahmout is a 28-year-old former high school biology teacher (PF) born and
raised in Aleppo in Syria (PF). He arrived in the Netherlands in 2016, after he
fled Aleppo on his own, because he feared for his life and his family’s (EF). He
left his wife behind with the intention of applying for family reunification when
he arrived in Europe (EF). During this journey, he experienced several traumatic
incidents: near drowning, violence, fear of border patrol and starvation (PF). He
stayed at an immigration center for one year, but now has permanent resi-
dency and his wife joined him in the Netherlands in 2018 (EF). Now, he works
part-time as a computer consultant, while he tries to improve his Dutch to be
able to get into a teacher-training program (EF). Their financial and housing situ-
ation is unstable and precarious (EF). There are no relevant medical issues (PF).
Mahmout speaks Syrian Arabic at home and a lot with friends and fam-
ily (over the phone). He watches Syrian television and CNN and reads a lot of
English (EF). English was his second language and is better than his Dutch (PF). His
proficiency in Dutch is at a B1 level⁴ (Council of Europe 2001, 2020) (PF).
When asked ‘Can you explain what made you decide to seek help?’, Mahmout
elaborated candidly and comprehensively and was perceived as an outgoing
person who was not afraid to speak Dutch (P). He mentioned several examples
of activities in his daily life during which he experienced limitations: recurring
miscommunications with strangers, acquaintances, and friends, as well as frequent
problems during phone calls and difficulties in group conversations. He
also described finding some Dutch speech sounds very difficult because
they do not exist in Syrian Arabic (A). He is highly motivated and decided to seek
help himself (PF).
Mahmout also described (possible future) problems in his participation in
society. He missed a promotion at work because of his limited intelligibility and
fears he will not be accepted into teacher training next year. Mahmout eventually
wants to function as a Dutch-speaking professional. Mahmout describes
feeling limited in his social abilities because of his intelligibility. He would like
to make more meaningful connections with Dutch people (P).
⁴ The CEFR is an international standard for describing language ability. It describes language
ability on a 6-point scale, starting at A1 (Beginners) going up to C2 (Mastery). It defines 5 skills
on every level: Listening, Reading, Spoken Interaction, Spoken Production, Writing. B1 is defined
as intermediate, independent user: Can understand the main points of clear standard
input on familiar matters regularly encountered. Can deal with most situations likely to arise
where the language is spoken. Can produce simple connected text on topics which are familiar
or of personal interest. Can describe experiences and events, dreams, hopes and ambitions
and briefly give reasons and explanations for opinions and plans (Council of Europe 2001, 2020).
In his spontaneous speech, considerable influence from English and Arabic was
evident (BF). In addition, he persistently put the word stress on the first syllable,
and he showed inconsistent rhythm and intonation patterns. His intelligibility
was reduced by limited articulatory movement, a fast speech rate, and weak
general articulation skills (BF). He scored a 3 on a 5-point scale for overall
intelligibility, which corresponds to ‘moderately intelligible’ (A).
The formal assessment of segments showed significant segmental errors,
mostly in vowels and diphthongs, both at the word level and in
spontaneous speech (BF). For example, the Dutch diphthong /œy/ and the vowel /ø/,
which do not exist in Arabic, were substituted by vowels and diphthongs that
do exist in his mother tongue:
– Dutch Huis /hœys/ ‘house’ was pronounced [haʊs]
– Dutch Deur /døːr/ ‘door’ was pronounced [dɔr]
4.2 Training plan
Based on the ICF (Figure 3) and through co-creation, a training plan was made
together with Mahmout. This plan formulates goals, means, priorities and rec-
ommendations as a direct result of the relationships between the aspects within
the ICF and existing evidence in research. The goal below was formulated at
the level of participation:
In 4 months’ time, Mahmout is able to clearly convey a complex message in Dutch on the
phone and in conversations, without using English and he feels confident doing so.
This goal was formulated based on all the relevant information collected within
the ICF framework. The training plan lists the following priorities:
– contrast /œy/ - /ɑu/ and production of /œy/
– contrast /o/, /ø/- /ɔ/ and production of /o/, /ø/
– contrast /ɛɪ/ - /aːi/ and production of /ɛɪ/
– word stress
– improvement of general articulation skills
The choice of these priorities is research-based (Derwing and Munro 2005; Grant
2014; Levis 2016) and based on five principles: First of all, we focus on both segmen-
tal and suprasegmental aspects of Mahmout’s speech. Segmental and suprasegmen-
tal errors contribute at least equally to intelligibility; moreover, the existence of both
types of errors can exacerbate intelligibility difficulties (Caspers 2009, 2010; Gordon
and Darcy 2016). Attention was paid to both a global and a segmental approach, as
Mahmout’s general articulation skills were weak (Derwing, Munro, and
Wiebe 1998). Secondly, in order to improve production skills, perception exercises
were included for the selected contrasts (Derwing and Munro 2005; Lee,
Jang, and Plonsky 2015; Sakai and Moorman 2018). Thirdly, there was a focus on
form in the initial stages of addressing a contrast (Gordon and Darcy 2016;
Thomson and Derwing 2015), but this was quickly integrated with meaning and context
relevant to Mahmout. We provide authentic practice material to attain the
ultimate goal of intelligible spontaneous speech (Darcy 2018; Levis 2005). We
chose the above-stated contrasts because the analysis of the LONT assessment provided
us with a clear image of which features were most detrimental to Mahmout’s in-
telligibility. Additionally, research on the sound frequency of Dutch segmentals
and the segmentals most important for intelligibility and comprehensibility in-
formed these choices (Luyckx et al. 2007; Neri, Cucchiarini, and Strik 2006). For
example, the improvement of vowels and diphthongs has a higher priority
than that of consonants for improving intelligibility in spontaneous speech in Dutch (Neri,
Cucchiarini, and Strik 2006). The fourth and fifth principles concerned the role
of explicit corrective feedback (Kissling 2013; Lee and Lyster 2017; Saito and
Lyster 2012) and self-monitoring (Pawlak and Szyszka 2018).
The fact that Mahmout is self-aware (PF) and shows a high level of intrinsic
motivation (PF) can be considered a facilitating factor in the prognosis. He is
also very invested in Dutch society (EF). The unstable economic factors (EF),
the fact that his exposure to Dutch is quite limited at the moment (EF) (Gurer
2019), and that he spends a considerable amount of time with a third language
(English) (EF) can be considered barriers in the prognosis. Additionally, the
presence of trauma (PF) can potentially create barriers (Schick et al. 2016).
The following recommendation was discussed with Mahmout: his contact
with (conversational) Dutch should be increased (Gurer 2019). This could be
achieved by signing up for a mentor program that matches volunteers to L2
learners to enhance their opportunities to practice conversational Dutch in a
daily setting. We also recommended that he (temporarily) limit his exposure to
English and, for example, watch Dutch television (Derwing 2018; Piske, MacKay,
and Flege 2001).
The training consisted of authentic exercises (task-based, cf. Gordon 2021)
with a focus on intelligibility and on applying what was learned during instruction
in daily life. He was urged to practice on a regular basis. During the course
of training, his expectations of qualifying to enter the teacher-training program should
be discussed and possibly be adjusted if he does not meet the criteria for C1
level.⁵
4.3 Outcome and conclusion
The extensive assessment of Mahmout’s intelligibility led to a good understanding
of the complexity of the process of improving his intelligibility. Together,
we recognized that his home and financial situation were the cause of
the decrease in time spent on the homework (EF) he received in his Dutch
course. Together with Mahmout, we adjusted the plan and decided to postpone
pronunciation training, to allow him to address his other circumstances. In the
meantime, we did look into ways of increasing his opportunities to practice
Dutch (EF).
Once Mahmout felt ready, instruction was started. In the course of the next
16 weeks, Mahmout received fifteen 45-minute sessions of instruction.
The sessions roughly had two sections: a section focussing on segmentals, and a
section focussing on suprasegmentals and articulation skills, both initially at
word level (BF). We then moved forward with task-based exercises and assign-
ments directly related to his social and employment situations: practising differ-
ent sentences highly frequent in his social situations (A) and scripted dialogues
in job related phone calls or meetings (A). The next phase consisted of simulating
helpdesk computer calls, preparing for teacher assessments and sessions in
which he gave a mini-lecture on high school biology (A). These specific instructions
were possible because we had mapped out what Mahmout would actually
need to do in his social, academic and working life. Drawing the parallel with
DST, we mapped Mahmout’s system in a holistic non-linear approach and looked
at cognitive, social, and environmental factors that reflect perpetually changing
human circumstances.
⁵ The CEFR is an international standard for describing language ability. It describes language
ability on a 6-point scale, starting at A1 (Beginners) going up to C2 (Mastery). It defines 5 skills
on every level: Listening, Reading, Spoken Interaction, Spoken Production, Writing. C1 is defined
as advanced user: Can understand the majority of spoken language even when less structured.
Can express him/herself fluently, including metaphorical language. Can produce detailed
descriptions about complex subjects, formulate specific points of view and round off with an
appropriate conclusion. Can use the language flexibly and effectively socially and professionally.
Can accurately articulate ideas and opinions and skillfully contribute to a conversation (Council
of Europe 2001, 2020).
By doing so, we worked towards the goal we set together with Mahmout.
Mahmout’s motivation only grew during the course of this intervention, and
he reported feeling increasingly confident speaking Dutch (PF). Overall, he
indicated he was much more aware of when and where he had to pay extra at-
tention to his intelligibility and how to actively influence it. He reported using
self-correction a lot more frequently in daily situations (A) and a second LONT
assessment after 16 weeks indicated that the number of segmental and supra-
segmental mistakes decreased significantly, on word level (BF) as well as in
spontaneous speech (A). At work, his superiors also noticed a clear improve-
ment in meetings and calls with clients: his intelligibility and comprehensibil-
ity increased (P). The fact that Mahmout had asked for feedback from his
colleagues turned out to provide a great new social opportunity to connect with
his co-workers (P). In addition, he gained new Dutch contacts through the men-
tor program (EF) and he indicated his time spent speaking Dutch drastically in-
creased (EF). He started watching Dutch singing competitions instead of English
ones and grew to be a huge fan of a famous Dutch soap (EF).
Looking back on the goal we set together:
In 4 months, Mahmout is able to clearly convey a complex message in Dutch on the phone
and in conversations without using English and he feels confident doing so.
Mahmout himself stated that he reached this goal and quickly formulated a
new one for himself:
“I want to apply the learned techniques in my daily life and further improve my intelligibil-
ity to become a biology teacher.”
5 Conclusion
This case demonstrates how the ICF was used as a tool to determine functional
goals for L2 intervention (Threats 2006). Using the model, we demonstrated
how goals could be set that are relevant and attainable for the individual,
Mahmout (Blake and McLeod 2019). He was able to set a realistic goal and
significantly improve his intelligibility in all aspects of his life. The concepts of
capacity (intelligibility in a simple context such as in structured classroom ex-
ercises), and performance (intelligibility in real life situations) provide insight
into the well-known conundrum in L2 instruction: transfer of knowledge and
skills to daily life.
We have touched upon the overlap between ICF and Dynamic Systems The-
ory and argued that the ICF model could be a way of manifesting DST principles
and translating theory into daily practice. In summary, the ICF could make a
useful addition to the tools L2 professionals have to consider the complex rela-
tionships between learner characteristics, circumstances, goals, attitudes, and
context. We have illustrated its use with a single case, supported by an exten-
sive and growing body of literature (cf. Blake and McLeod 2019). Of course, it
would be beneficial to increase the empirical knowledge on this application by
exploring multiple cases and expanding the research in L2 contexts. Addition-
ally, the ICF could be a catalyst for improved collaboration between different L2
professionals with complementary expertise. That may be one of the missing
pieces of the polymorphous puzzle called L2 pronunciation, which we all
need to complete.
References
American Speech-Language-Hearing Association. 2004. Preferred Practice Patterns for the
Profession of Speech-Language Pathology. doi:10.1044/policy.PP2004-00191.
Anderson-Hsieh, Janet & Kenneth Koehler. 1988. The effect of foreign accent and speaking
rate on native speaker comprehension. Language Learning 38(4). 561–613.
Andrews, James. 1996. Theory and practice in speech-language pathology: A review of
systemic principles. Seminars in Speech and Language 17(2). 97–106. doi:10.1055/
s-2008-1064090.
Baker, Elise, Karen Croot, Sharynne McLeod & Rhea Paul. 2001. Psycholinguistic models of
speech development and their application to clinical practice. Journal of Speech,
Language, and Hearing Research 44(3). 685–702.
Beckman, John F., Charles E. Fernandez & Ian D. Coulter. 1996. A systems model of health
care: A proposal. Manipulative & Physiological Therapeutics 19(3). 208–215.
Blake, Helen L., Laura Bennetts Kneebone & Sharynne McLeod. 2017. The impact of oral
English proficiency on humanitarian migrants’ experiences of settling in Australia.
International Journal of Bilingual Education and Bilingualism 22(6). 1–17. doi:10.1080/
13670050.2017.1294557.
Blake, Helen L. & Sharynne McLeod. 2019. Speech-language pathologists’ support for
multilingual speakers’ English intelligibility and participation informed by the ICF. Journal
of Communication Disorders 77. 56–70. doi:10.1016/j.jcomdis.2018.12.003.
Blessenaar, Ilvi, Emmy van Bommel, Marietta Aprea, Leonoor Oonk & Lizet van Ewijk. 2018.
Logopedisch onderzoeksprotocol NT2 [The SLT assessment protocol of Dutch as L2].
Utrecht: Hogeschool Utrecht.
Caspers, Johanneke. 2009. The perception of word stress in existing and non-existing Dutch
words by native speakers and second language learners. Linguistics in the Netherlands
26(1). 25–38. doi:10.1075/avt.26.04cas.
Caspers, Johanneke. 2010. The influence of erroneous stress position and segmental errors on
intelligibility, comprehensibility and foreign accent in Dutch as a second language.
Linguistics in the Netherlands 27. 17–29. doi:10.1075/avt.27.03cas.
Caspers, Johanneke & Katarzyna Horłoza. 2012. Intelligibility of non-natively produced Dutch
words: Interaction between segmental and suprasegmental errors. Phonetica 69(1–2).
94–107. doi:10.1159/000342622.
Cerniauskaite, Milda, Rui Quintas, Christine Boldt, Alberto Raggi, Alarcos Cieza, Jerome
Edmond Bickenbach & Matilde Leonardi. 2011. Systematic literature review on ICF from
2001 to 2009: Its use, implementation and operationalisation. Disability and
Rehabilitation 33(4). 281–309. doi:10.3109/09638288.2010.529235.
Cormack, Jane M. C. & Linda E. Worrall. 2008. The ICF body functions and structures related to
speech-language pathology. International Journal of Speech-Language Pathology
10(1–2). 9–17. doi:10.1080/14417040701759742.
Council of Europe. 2001. The Common European Framework in its political and educational
context 1.1 What is the Common European Framework? Strasbourg: Council of Europe
Publishing.
Council of Europe. 2020. Common European Framework of Reference for Languages: Learning,
Teaching, Assessment: Companion Volume. Strasbourg: Council of Europe Publishing.
Crowther, Dustin, Pavel Trofimovich, Talia Isaacs & Kazuya Saito. 2015a. Does a speaking task
affect second language comprehensibility? Modern Language Journal 99(1). 80–95.
doi:10.1111/modl.12185.
Crowther, Dustin, Pavel Trofimovich, Talia Isaacs & Kazuya Saito. 2018. Linguistic dimensions
of second language accentedness and comprehensibility vary across speaking
tasks. Studies in Second Language Acquisition 40(2). 443–457.
Crowther, Dustin, Pavel Trofimovich, Kazuya Saito & Talia Isaacs. 2015b. Second language
comprehensibility revisited: investigating the effects of learner background. TESOL
Quarterly 49(4). 814–837. doi:10.1002/tesq.203.
Darcy, Isabelle. 2018. Powerful and effective pronunciation instruction: how can we achieve
it? The CATESOL Journal 30(1). 13–45.
de Bot, Kees, Wander Lowie & Marjolijn Verspoor. 2007. A Dynamic Systems Theory approach
to second language acquisition. Bilingualism: Language and Cognition 10(1). 7–21.
doi:10.1017/S1366728906002732.
Derwing, Tracey M. 2003. What do ESL students say about their accents? Canadian Modern
Derwing, Tracey M. 2017. The role of phonological awareness. In Peter Garrett & Josep M. Cots
(eds.), The Routledge Handbook of Language Awareness, 339–354. New York: Routledge.
https://doi.org/10.4324/9781315676494
Derwing, Tracey M. 2018. The efficacy of pronunciation instruction. In Okim Kang, Ron
I. Thomson & John M. Murphy (eds.), The Routledge Handbook of Contemporary English
Pronunciation, 320–334. New York: Routledge.
Derwing, Tracey M., Helen Fraser, Okim Kang & Ron I. Thomson. 2014a. L2 Accent and ethics:
issues that merit attention. In Ahmar Mahboob & Leslie Barrat (eds.), Englishes in
Multilingual Contexts, 63–80. New York: Springer. doi:10.1007/978-94-017-8869-4_5.
Derwing, Tracey M. & Murray J. Munro. 2005. Second language accent and pronunciation
teaching: A research-based approach. TESOL Quarterly 39(3). 379–397.
Derwing, Tracey M. & Murray J. Munro. 2009. Putting accent in its place: Rethinking obstacles
to communication. Language Teaching 42(4). 476–490. doi:10.1017/
S026144480800551X.
Derwing, Tracey M., Murray J. Munro, Jennifer A. Foote, Erin Waugh & Jason Fleming. 2014b.
Opening the window on comprehensible pronunciation after 19 years: A workplace
training study. Language Learning 64(3). 526–548. doi:10.1111/lang.12053.
Derwing, Tracey M., Murray J. Munro & Grace Wiebe. 1998. Evidence in favor of a broad
framework for pronunciation instruction. Language Learning 48(3). 393–410.
Derwing, Tracey M., Marian J. Rossiter & Murray J. Munro. 2002. Teaching native speakers to
listen to foreign-accented speech. Journal of Multilingual and Multicultural Development
23(4). 245–259. doi:10.1080/01434630208666468.
Dragojevic, Marko & Howard Giles. 2016. I don’t like you because you’re hard to understand:
The role of processing fluency in the language attitudes process. Human Communication
Research 42(3). 396–420. doi:10.1111/hcre.12079.
Engel, George Lucas. 1977. The need for a new medical model: A challenge for biomedicine.
Science 196(4286). 129–136. doi:10.1126/science.847460.
European Commission. 2021. “Statistics on migration in Europe”. European Commission.
https://ec.europa.eu/info/strategy/priorities-2019-2024/promoting-our-european-way-
life/statistics-migration-europe_en. (accessed 25 May 2021).
Fannin, Danai Kasambira. 2016. The intersection of culture and ICF-CY personal and
environmental factors for alternative and augmentative communication. Perspectives of
the ASHA Special Interest Groups 12(1). 63–82.
Garcia, Linda J., Chantal Laroche & Jacques Barrette. 2002. Work integration issues go beyond
the nature of the communication disorder. Journal of Communication Disorders 35(2).
187–211.
Gordon, Joshua. 2021. Pronunciation and task-based instruction: Effects of a classroom
intervention. RELC Journal 52(1). 94–109. doi:10.1177/0033688220986919.
Gordon, Joshua & Isabelle Darcy. 2016. The development of comprehensible speech in L2
learners. Journal of Second Language Pronunciation 2(1). 56–92. doi:10.1075/
jslp.2.1.03gor.
Grant, Linda. 2014. Pronunciation myths: Applying Second Language Research to Classroom
Teaching. Ann Arbor: University of Michigan Press.
Gurer, Cuneyt. 2019. Refugee perspectives on integration in Germany. American Journal of
Qualitative Research 3(2). 52–70. doi:10.29333/ajqr/6433.
Gurzynski-Weiss, Laura, Avizia Yim Long & Megan Solon. 2017. TBLT and L2 pronunciation.
Studies in Second Language Acquisition 39(2). 213–224. doi:10.1017/
S0272263117000080.
Hahn, Laura D. 2004. Primary stress and intelligibility: Research to motivate the teaching of
suprasegmentals. TESOL Quarterly 38(2). 201–223. doi:10.2307/3588378.
Heerkens, Yvonne F. & Joost de Beer. 2007. International classification of functioning
disability and health: Gebruik van de ICF in de logopedie. Logopedie en Foniatrie 4.
112–119.
Howe, Tami J. 2008. The ICF Contextual Factors related to speech-language pathology.
International Journal of Speech-Language Pathology 10(1–2). 27–37. doi:10.1080/
14417040701774824.
Huber, Machteld, J. André Knottnerus, Lawrence Green, Henriëtte van der Horst, Alejandro
R. Jadad, Daan Kromhout, Brian Leonard, Kate Lorig, Maria Isabel Loureiro, Jos W. M. van
der Meer, Paul Schnabel, Richard Smith, Chris van Weel & Henk Smid. 2011. How should
we define health? BMJ (Online) 343(7817). 1–3. doi:10.1136/bmj.d4163.
Jelsma, Jennifer. 2009. Use of the International Classification of Functioning, Disability

and Health: A literature survey. Journal of Rehabilitation Medicine 41(1). 1–12.
doi:10.2340/16501977-0300.
Kissling, Elizabeth M. 2013. Teaching pronunciation: Is explicit phonetics instruction
beneficial for FL learners? Modern Language Journal 97(3). 720–744. doi:10.1111/
j.1540-4781.2013.12029.x.
Lee, Andrew H. & Roy Lyster. 2017. Can corrective feedback on second language speech
perception errors affect production accuracy? Applied Psycholinguistics 38(2). 371–393.
doi:10.1017/S0142716416000254.
Lee, Bradford, Luke Plonsky & Kazuya Saito. 2020. The effects of perception- vs. production-
based pronunciation instruction. System 88. 1–13. doi:10.1016/j.system.2019.102185.
doi:10.1093/applin/amu040.
TESOL Quarterly 39(3). 369–377. doi:10.2307/3588485.
Levis, John M. 2016. Research into practice: How research appears in pronunciation teaching
materials. Language Teaching 49(3). 423–437. doi:10.1017/S0261444816000045.
Levis, John M. & Anna Wu. 2018. Pronunciation: research into practice, and practice into
research. The CATESOL Journal 30(1). 1–12.
Luyckx, Kim, Hanne Kloots, Evie Coussé & Steven Gillis. (2007). Klankfrequenties in het
Nederlands. [Soundfrequencies in Dutch]. In Dominiek Sandra, Rita Rymenans, Pol
Cuvelier & Peter Van Petegem (eds.), Tussen taal, spelling en onderwijs: Essays bij het
emeritaat van Frans Daems, 141–154. Gent: Academia Press.
Ma, Estella P., Travis T. Threats & Linda E. Worrall. 2008. An introduction to the International
Classification of Functioning, Disability and Health (ICF) for speech-language pathology:
Its past, present and future. International Journal of Speech-Language Pathology 10(1–2).
2–8. doi:10.1080/14417040701772612.
Macintyre, Peter D. 2007. Willingness to communicate in the second language: Understanding
the decision to speak as a volitional process. Modern Language 91(4). 564–576.
Marx, Nicole. 2002. Never quite a ‘native speaker’: accent and identity in the L2 – and the L1.
Canadian Modern Language Review 59(2). 264–281.
McDougall, Janette, Virginia Wright & Peter Rosenbaum. 2010. The ICF model of functioning
and disability: Incorporating quality of life and human development. Developmental
Neurorehabilitation 13(3). 204–211. doi:10.3109/17518421003620525.
McLeod, Sharynne & Jane McCormack. 2007. Application of the ICF and ICF-children and youth
in children with speech impairment. Seminars in Speech and Language 28(4). 254–264.
doi:10.1055/s-2007-986522.
Muller, Nicole, Martin J. Ball & Jacqueline Guendouzi. 2000. Accent reduction programmes:
Not a role for speech-language pathologists? Advances in Speech Language Pathology
2(2). 119–129. doi:10.3109/14417040008996796.
Munro, Murray J. 1998. The effects of noise on the intelligibility of foreign-accented speech.
Studies in Second Language Acquisition 20(2). 139–154.
Munro, Murray J. 2008. Foreign accent and speech intelligibility. In Jette G. Hansen Edwards &
Mary L. Zampini (eds.), Phonology and Second Language Acquisition, 193–218.
Amsterdam: John Benjamins.
Munro, Murray J. & Tracey M. Derwing. 1995. Foreign accent, comprehensibility, and
intelligibility in the speech of second language learners. Language Learning 45(1).
73–97.
Munro, Murray J. & Tracey M. Derwing. 2006. The functional load principle in
ESL pronunciation instruction: An exploratory study. System 34(4). 520–531.
doi:10.1016/j.system.2006.09.004.
Munro, Murray J. & Tracey M. Derwing. 2009. Putting accent in its place: rethinking obstacles
to communication. Language Teaching 42(4). 476–490. doi:10.1017/S0261444811000103.
Munro, Murray J. & Tracey M. Derwing. 2011. The foundations of accent and intelligibility in
pronunciation research. Language Teaching 44(3). 316–327. doi:10.1017/
S0261444811000103.
Neri, Ambra, Catia Cucchiarini & Helmer Strik. 2006. Selecting segmental errors in non-native
Dutch for optimal pronunciation training. IRAL – International Review of Applied
Linguistics in Language Teaching 44(4). 357–404. doi:10.1515/IRAL.2006.016.
O’Halloran, Robyn O. & Brigette Larkins. 2008. The ICF activities and participation related to
speech-language pathology. International Journal of Speech-Language Pathology
10(1–2). 18–26. doi:10.1080/14417040701772620.
Pawlak, Mirosław & Magdalena Szyszka. 2018. Researching pronunciation learning strategies:
An overview and a critical look. Studies in Second Language Learning and Teaching 8(2).
293–323. doi:10.14746/ssllt.2018.8.2.6.
Piske, Thorsten, Ian R. A. MacKay & James. E. Flege. 2001. Factors affecting degree of foreign
accent in an L2: A review. Journal of Phonetics 29(2). 191–215. doi:doi:10.006/
jpho.2001.0134.
Saito, Kazuya & Roy Lyster. 2012. Investigating the pedagogical potential of recasts for L2
vowel acquisition. TESOL Quarterly 46(2). 387–398. doi:10.1002/tesq.25.
Sakai, Mari & Colleen Moorman. 2018. Can perception training improve the production
of second language phonemes? A meta-analytic review of 25 years of perception training
research. Applied Psycholinguistics 39(1) 187–224. doi:10.1017/S0142716417000418.
Schick, Matthis, Andre Zumwald, Bina Knöpfli, Angela Nickerson, Richard A Bryant, Ulrich
Schnyder, Julia Müller & Naser Morina. 2016. Challenging future, challenging past: the
relationship of social integration and psychological impairment in traumatized refugees.
European Journal of Psychotraumatology 7(1). 28057. doi:10.3402/ejpt.v7.28057.
Schmidt, Anna Marie & Shannon Sullivan. 2003. Clinical training in foreign accent
modification: A national survey. Contemporary Issues in Communication Science and
Disorders 30(Fall). 125–135.
Suzukida, Yui & Kazuya Saito. 2021. Which segmental features matter for successful L2
comprehensibility? Revisiting and generalizing the pedagogical value of the functional
load principle. Language Teaching Research 25(3). 431–450. doi:10.1177/
1362168819858246.
Thomson, Ron. I. & Tracey M. Derwing. 2015. The effectiveness of L2 pronunciation
instruction: A narrative review. Applied Linguistics 36(3). 326–344. doi:10.1093/applin/
amu076.
Thomson, Ron I. & Jennifer A. Foote. 2019. Pronunciation teaching: Whose ethical domain is it
anyways? In John Levis, Charles Nagle & Erin Todey (eds.), Proceedings of the 10th
Pronunciation in Second Language Learning and Teaching Conference, vol. 2018,
213–235. Ames, IA: Iowa State University.
Threats, Travis T. 2006. Towards an international framework for communication disorders:

Use of the ICF. Journal of Communication Disorders 39(4). 251–265.
doi:10.1016/j.jcomdis.2006.02.002.
Threats, Travis T. 2008. Use of the ICF for clinical practice in speech-language pathology.
International Journal of Speech-Language Pathology 10(1–2). 50–60.
doi:10.1080/14417040701768693.
Trofimovich, Pavel & Wendy Baker. 2006. Learning second language suprasegmentals: Effect
of L2 experience on prosody and fluency characteristics of L2 speech. Studies in Second
Language Acquisition 28(1). 1–30. doi:10.1017/S0272263106060013.
Üstün, T. Berdihan, Somnath Chatterji, Jerome Bickenbach, Nenad Kostansjek & Marguerite
Schneider. 2003. The International Classification of Functioning, Disability and Health: a
new tool for understanding disability and health. Disability and Rehabilitation 25(11–12).
565–571. doi:10.1080/0963828031000137063.
Verbakel, Doreen, Manouk van den Brink & Annemarie Groot. 2020. Anderstaligen met een
behoefte aan taalondersteuning [L2 speakers with a need for language support].
’s-Hertogenbosch: ECBO.
Verspoor, Marjolein. 2013. Dynamic systems theory as a comprehensive theory of second
language development. In María del Pilar García Mayo, María Junkal Gutierrez Mangado &
Maria Martinez Adrian (eds.), Contemporary Approaches to Second Language
Acquisition, 199–220. Philadelphia: John Benjamins.
Wambaugh, Julie L. & Shannon C. Mauszychi. 2010. Application of the WHO ICF to apraxia of
speech. Journal of Medical Speech and Language Pathology 18(4). 133–140.
WHO. 1948. Constitutions of the World Health Organization. New York: World Health
Organization.
WHO. 2001. International Classification of Functioning, Health and Disability. Geneva: World
Health Organization.
WHO. 2002. Towards a Common Language for Functioning, Disability and Health ICF. Geneva:
World Health Organization.
WHO. 2013. How to use the ICF: A practical manual for using the International Classification of
Functioning, Disability and Health (ICF). Geneva: World Health Organization.
Zhang, Runhan & Zhou Min Yuan. 2020. Examining the effects of explicit pronunciation
instruction on the development of L2 pronunciation. Studies in Second Language
Acquisition 42(4). 905–918. 1–14. doi:10.1017/S0272263120000121.
Part III: L2 pronunciation training: Implications
for the classroom
Susan Jackson, Walcir Cardoso
Orthographic interference
in the acquisition of English /h/
by Francophones
Abstract: A number of studies have demonstrated the effect of orthographic
input on L2 perception and production. While some evidence points to its abil-
ity to facilitate the acquisition of L2 phonological contrasts (Escudero, Hayes-
Harb, and Mitterer 2008; Weber and Cutler 2004), other evidence suggests L1
orthographic transfer can have a negative effect depending on the congruency
of the grapheme-to-phoneme correspondence (GPC) between the L1 and the L2
(Escudero 2015; Hayes-Harb, Nicol, and Barker 2010).
This pilot study looks at Francophone L2 learners of English who are known
to have difficulty acquiring the English phoneme /h/, often either deleting it
from the beginning of a word or stressed syllable (e.g. ‘owever) or inserting it
where it does not belong (e.g. [h]after). While the grapheme <h> in French is con-
sistently silent, its pronunciation in English is inconsistent: it is articulated only
in onset position and with certain lexical and rule-governed exceptions.
Participants were taught English pseudo-words by associating auditorily
presented stimuli with non-objects and were placed into one of three learning
conditions: auditory + congruent spelling, auditory + congruent/incongruent
(inconsistent) spelling, and auditory only. Accuracy rates in a subsequent
word-picture matching task suggest that the acquisition of a novel phoneme is
more difficult when the GPC of the target language is inconsistent. This may
inform the approach to teaching this difficult phoneme.
Keywords: orthography, English /h/, L1 French, word learning, pronunciation
Susan Jackson, Walcir Cardoso, Concordia University
https://doi.org/10.1515/9783110736120-009
1 Introduction
A long-time concern of second language pronunciation research is learners’
mixed success in acquiring certain novel segments of the target language. While
some segments are acquired relatively easily and early, others are acquired
later, or in some cases, not at all (Archibald 2021; O’Brien 2021). One such case
is that of Francophone learners of English /h/, a segment that is frequently
deleted at all levels of proficiency, even when the other phonemes of English
have been mastered (see e.g., Janda and Auger 1992). As such, h-deletion
(indicated by a single quotation mark ‘, as in ‘owever, ‘istory instead of
/h/owever and /h/istory, respectively) is a recognizable feature of
French-accented English. Learners’ difficulty with /h/ is an unusual case in
that it is neither a problem of articulation (/h/-insertion is also common) nor
necessarily one of perception: their discrimination of [h]/Ø pairs (e.g., eat
and heat) has been shown to be weaker than that of other contrasts such as
[i]/[ɪ] or [t]/[θ], but still well above chance (e.g., LaCharité and Prévost
1999; Mielke 2008).
Yet, this phenomenon has not been well studied, with only a handful of excep-
tions (see e.g., Janda and Auger 1992; John 2006; LaCharité and Prévost 1999;
Mah 2011).
French has <h> orthographically (<h> represents the letter h as in hour and
hot), but it does not correspond to any phoneme that is overtly realized in the
language. In certain cases, it does have a phonological status as a phantom con-
sonant (Walker 2001) triggering liaison-blocking (i.e., h-aspiré), a phenomenon
that blocks across-word resyllabification of coda-onset sequences such as les ha-
macs ‘the hammocks’, pronounced [le.a.mak], not ✶[le.za.mak]. More commonly,
<h> is purely orthographic with no influence on neighboring sounds, so that a
phrase such as les habits ‘the clothes’ undergoes resyllabification and thus is pro-
nounced [le.za.bi], not ✶[le.a.bi]. Regardless, <h> is uniformly silent in French
and learners may transfer this knowledge over to their productions of English.
In English, on the other hand, the pronunciation of <h> varies: it is usually
pronounced at the beginning of words with the exception of a handful of loan-
words from French where it is silent, such as in hour or honour, as well as some
dialect-dependent deletions in words such as herb in American English. It is
also pronounced at the head of non-word-initial syllables with primary or sec-
ondary stress (e.g., inherent and alcohol), with certain exceptions in some dia-
lects, for example in the word Nottingham in many British varieties. However, it
is subject to categorical deletion at the head of weak syllables (e.g., vehicle)
and variable deletion in function words (e.g., hers, him, have) when not phrase
initial or subject to focus. In all other positions, <h> is silent, including when
part of a consonant cluster, e.g., ghost and though. Considering the numerous
instances of <h> being silent or deleted (categorically or variably), a learner may
encounter it far more frequently in writing than they will hear it in speech.
Moreover, the environment in which it is deleted depends on syllable stress,
which is a particularly challenging feature of English phonology for Franco-
phone learners (Dupoux et al. 1997; Peperkamp, Vendelin, and Dupoux 2010).
This means that the contexts in which it should and should not be pronounced
may seem unpredictable to the learner. Therefore, there is not only an incongruent
grapheme-to-phoneme mapping between French and English, but also an
inconsistent grapheme-to-phoneme correspondence (GPC) in English, which we
propose is a contributing factor to the difficulties Francophones have with En-
glish /h/.
The role that orthography plays when learning new words is one way in
which second language (L2) acquisition can be set apart from first language (L1)
acquisition. Unlike L1 learners who are exposed to auditory input well before
they learn to read, L2 learners typically encounter the spoken and written forms
of words together, often in formal instruction through reading and writing. Even
before this, in a bilingual country such as Canada, children may become aware
of the written forms of words in the second language, widely available on prod-
uct packaging, for example, before they learn their pronunciation.
While there is considerable evidence that orthography is encoded as part of
a lexical entry and has an effect on speech processing in the L1 (e.g., Castles,
Wilson, and Coltheart 2011; Frost and Ziegler 2007; Saletta, Goffman, and Hogan
2016), research has also demonstrated its effect on L2 speech processing and
production (e.g., Bürki et al. 2019; Escudero 2015; Hayes-Harb, Nicol, and
Barker 2010; Shea 2017; Showalter and Hayes-Harb 2015; Rafat 2016). Therefore,
it is worthwhile looking at the effect of written input when investigating L2 pho-
nology, especially sounds in the second language that may pose particular
problems for learners. While Francophone learners of English are likely affected
by the mismatch between the grapheme-to-phoneme correspondence for <h> in
their L1 and in the target language, they may find this segment particularly
challenging due to the complexity around when it is pronounced and when it is
silent. The question we explore in this pilot study is whether Francophone
learners exploit English orthography during word learning and, if so, whether
the observed variability in the pronunciation of <h> is a contributing factor in
their difficulty encoding /h/ as part of a lexical representation.
2 Background
2.1 The effect of orthography on L2 speech processing
and lexical representations
Evidence from several studies looking at the role of orthography in L2
phonological acquisition points to both positive and negative effects. For
example, it has been shown that learners are able to exploit their knowledge of
the orthographic form of minimally contrastive words in a second language to
establish separate lexical entries even without an ability to discriminate
between them auditorily (Cutler, Weber, and Otake 2006; Escudero, Hayes-Harb, and Mitterer
2008; Weber and Cutler 2004). In a novel word learning experiment, Escudero,
Hayes-Harb, and Mitterer (2008) found that highly proficient Dutch-English bi-
linguals who were presented with the spelling of English pseudo-words con-
taining the perceptually challenging contrast /ɛ/–/æ/ performed much better in
a forced choice task using eye-tracking than those who did not see the spelling
during word learning; those who were exposed to the spellings of the words
showed an asymmetric pattern of confusion, in that they tended to look more
often at the /ɛ/ words, regardless of which word was presented. Having en-
coded the spelling during word learning, the participants rejected words they
knew were spelled with <a> and had therefore encoded a representational dis-
tinction between the words, even without perceiving a phonetic distinction.
2.2 Transparency of the L1 orthographic system
Certain factors have been shown to influence the degree to which learners attend
to spelling during word learning. One is the transparency of the L1 orthographic
system, or orthographic depth, which can lead to either an over- or under-reliance
on orthography. Transparency is defined as the number of one-to-one or one-to-
many relationships between phonemes and graphemes. A language with a trans-
parent orthography has a larger number of one-to-one correspondences and is,
therefore, a reliable representation of a word’s phonological form, as is the case
with Spanish and German. An orthography with an abundance of one-to-many or
many-to-one relationships, as is the case with English, is considered opaque.
Learners whose L1 has an opaque orthography may experience less interference
from the L2 orthography during word learning simply because they are
accustomed to not relying on it. The inverse may also be true: L1 speakers of
languages with phonologically transparent orthographies may over-rely on the
orthographic forms. In a study by Erdener and Burnham (2005), Turkish
(transparent) participants outperformed Australian English (opaque)
participants in their productions of Spanish (transparent) words after training
but performed less well on their productions of Irish (opaque) words, the
rationale being that their reliance on orthography led to more confusion.
French is considered to have an opaque orthography, but unlike English, the
opacity is not bidirectional: the pronunciation of a written word is relatively
predictable, while the spelling of an unknown word upon hearing it is not
predictable due to the frequent use of “silent” letters (Marjou 2019).
2.3 Congruency of the grapheme-phoneme relationship
Another factor which has been shown to have an influence is congruency of the
grapheme-phoneme relationship of a particular contrast between the L1 and
the L2. In a novel word learning task with L1 English speakers, Hayes-Harb,
Nicol, and Barker (2010) found that incongruencies in the GPC for which there
was no counterpart in English (for example, the spelling <faza> paired with
the auditory [fɑʃə]) led participants to perform more poorly on an auditory
word-picture matching task, demonstrating interference from their L1
orthographic conventions. However, if a particular contrastive pair shows a similar
correspondence across the L1 and L2, regardless of phonological similarity,
learners are able to make use of the spelling during word learning. For exam-
ple, Escudero, Simon, and Mulak (2014) found that their L1 Spanish partici-
pants who were exposed to both auditory and orthographic forms of Dutch
pseudo-words during training performed better on contrasts that were phono-
logically different but congruent in both languages (one-to-one match between
both), and worse on vowel pairs in which the GPCs in Dutch were incongruent
with those of their native language, Spanish. Escudero (2015), however,
found that there was no effect of orthographic transparency, and orthography
only helped learners as a redundant cue on perceptually easy contrasts.
A one-to-many correspondence may be the result of an allophonic alterna-
tion, and here too, L1 orthography has been found to cause interference in
word processing in the L2. Shea (2017) tested L1 English learners of Spanish on
their processing of intervocalic stop-approximant alternation (e.g., <nada>
‘nothing’, [naða]). The shared stop graphemes (<b, d, g>) correspond to one
phone in the L1 but two variants in the L2. In a lexical decision task with cross-
modal and within-modal priming, participants activated the L1 stop variant
more strongly than the L2 approximant allophone when primed by the written
form of a word such as cabello [kaβeʝo], but not when the prime was auditory.
In another study which examined allophonic alternations, Hayes-Harb,
Brown, and Smith (2018) found a similar effect in a production task where
written input interfered in L1 English speakers’ acquisition of German coda
devoicing in a novel word learning task that included minimal pairs such as
<trop>/<trob>. Learners in the with-spelling condition failed to neutralize the
coda voicing in their productions of words spelled with <b, d, g> word-finally,
while those not exposed to the spelling performed similarly to native speaker
controls. This effect persisted even after participants received explicit instruction
as to the allophonic contrast. Both these studies point to the persistent, strong
influence of L1 grapheme-to-phoneme relationships in L2 lexical representations.
In summary, evidence of orthography’s role in the establishment of accurate
lexical representations is mixed. If the L2 learner relies on it, it may facilitate
word learning and help to create a distinction between difficult phonemic con-
trasts. However, if there is an incongruent relationship between the GPCs in the
L1 and L2, the effect may be inhibitory. More detrimental may be an inconsis-
tency in that relationship within the target language itself. For French learners of
English /h/, if they do attend to the orthography when learning novel words, the
variable pronunciation of /h/ may inhibit them from establishing accurate repre-
sentations of /h/ words, which might explain their errors in production.
2.4 The study
In the present study, we further explore the role of orthography in L2
phonological encoding during word learning by looking at its role in the
difficulties Francophones have with English /h/. We do this by simulating a word learning
experience, training Francophone learners on English pseudo-words paired with
novel objects and exposing them to consistently spelled labels (an <h> always
corresponding with [h] in the audio), inconsistently spelled labels (an <h>
sometimes corresponding with [h] and sometimes silent), or no spelling at all
(auditory only). After word learning, participants were tested on their recall by
way of an auditory word-picture matching task.
2.5 Research questions and hypotheses
The first research question we asked was whether Francophone learners of En-
glish attend to the spelling of a word during word learning. If they do not, there
should be no difference in their ability to encode /h/ as part of a newly learned
word whether presented with the spelling during word learning or the pronun-
ciation alone.
The second research question we asked was whether an inconsistency in
the GPC during word learning would affect the Francophone participants’ abil-
ity to encode /h/ as part of a newly learned word. If it does not, then results
should be similar between the consistent and inconsistent spelling conditions.
However, if it does, then we would expect lower accuracy rates for participants
who were exposed to inconsistent spelling during learning.
3 Method
3.1 Participants
The participants were 16 native speakers of French, aged 21 to 53, from a
primarily Francophone region of Quebec: Rouyn-Noranda. As of 2016, 96.8% of the
population of Rouyn-Noranda were French native speakers, with a 33.4% rate of
bilingualism (L2 English). The percentage reporting only French spoken at home
was 92% (Statistics Canada 2017). Participants self-rated their level of
English, which ranged from elementary to upper intermediate, and reported on
their daily use of English, which was on average less than 15% (see Appendix).
All participants
reported that either no English or less than 10% of English was spoken at home
during their childhood, and no one reported ever having lived in a primarily
Anglophone environment. Participants were randomly assigned to the three
learning conditions described below, five each to the two spelling conditions
and six to the auditory-only condition:
Consistent spelling: Both auditory and orthographic presentation with a
consistent grapheme-to-phoneme correspondence. Participants heard words such as
[hul] and saw the word spelled <houl>. Likewise, they were presented with its
minimal pair [ul] and saw the word spelled <oul>. In all cases, the
pronunciation matched spelling conventions, so all h-words were pronounced
with /h/.
Auditory only: Only auditory presentation of words matched with images.
Inconsistent spelling: Both auditory and orthographic presentation with an
inconsistent grapheme-to-phoneme correspondence with regard to the target <h>,
in that it was pronounced in some words and silent in others. For example,
participants may have heard [in] with a corresponding spelling of <hean>. For
filler words, the spelling followed conventions and matched the pronunciation.
3.2 Materials
The stimuli were 20 monosyllabic pseudo-words conforming to both English
and French phonotactics. Targets were /h/-initial words and their vowel-initial
minimal pair counterparts (e.g., hez – ez). Fillers were also minimal pairs that
contrasted by using a variety of consonants and vowels (e.g., mep – tep) that
are contrastive in both languages despite some minor differences at the pho-
netic level (see Table 1 for the full list).
All words were recorded by a female native speaker of English using a
Zoom H4n recorder and then normalized for volume. Each word was paired
with a photographic image of a non-object that was sourced from the NOUN da-
tabase (Horst and Hout 2015). The words were arranged in ten counterbalanced
blocks of four words. The first five blocks together used all 20 pseudo-words
with no minimal pairs presented in the same block (e.g., mep, keft, houl, ud).
The second five blocks used the same 20 words but were rearranged, so as to
contain minimal pairs (e.g., ud, hud, zalk, malk) and, as such, require learners
to phonologically discriminate between them while associating each to an
object.
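The block structure described above can be illustrated with a short sketch. It is only one possible arrangement, not the authors' actual counterbalancing scheme: the function names and the deterministic rotation used for the first five blocks are assumptions made for illustration, and the word lists are taken from Table 1.

```python
# Hypothetical sketch: one way to arrange 20 pseudo-words into ten blocks of four.
# The rotation scheme is an assumption, not the study's counterbalancing procedure.
TARGET_PAIRS = [("houl", "oul"), ("hud", "ud"), ("hez", "ez"),
                ("hobe", "obe"), ("hean", "ean")]
FILLER_PAIRS = [("mep", "tep"), ("keft", "eft"), ("foap", "oap"),
                ("zalk", "malk"), ("tood", "zood")]

# Interleave target and filler pairs: T0, F0, T1, F1, ...
ALL_PAIRS = [pair for tf in zip(TARGET_PAIRS, FILLER_PAIRS) for pair in tf]


def phase_one_blocks(pairs):
    """Five blocks of four words; no block holds both members of a minimal pair."""
    n = len(pairs)
    blocks = []
    for i in range(0, n, 2):
        blocks.append([
            pairs[i][0],              # first member of pair i
            pairs[i + 1][0],          # first member of pair i + 1
            pairs[(i + 2) % n][1],    # second member of a different pair
            pairs[(i + 3) % n][1],    # second member of yet another pair
        ])
    return blocks


def phase_two_blocks(targets, fillers):
    """Five blocks of four words; each holds one target pair and one filler pair."""
    return [list(t) + list(f) for t, f in zip(targets, fillers)]


first_half = phase_one_blocks(ALL_PAIRS)
second_half = phase_two_blocks(TARGET_PAIRS, FILLER_PAIRS)
```

In this sketch every word appears once in each half, and only the second half forces the learner to discriminate between the two members of a minimal pair within a block.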
To create the inconsistent learning condition, two of the five <h>-words
were paired with their vowel-initial audio counterpart (i.e., the <h> was silent).
However, because vowel-initial words are never pronounced with an /h/ onset
in English, no sets of this kind were created (i.e., <hean> + [in] but not <ean> +
[hin]). This process eliminated two words, so in order to create balanced
blocks, these two <h>-words were repeated in separate blocks in the learning
phase, meaning they were encountered twice as often.
Table 1: Training words with orthographic and phonetic forms.
Learning condition      h- and vowel-initial         Fillers
Consistent spelling     houl [hul]    oul [ul]       mep [mɛp]     tep [tɛp]
                        hud [hʌd]     ud [ʌd]        keft [kɛft]   eft [ɛft]
                        hez [hɛz]     ez [ɛz]        foap [fop]    oap [op]
                        hobe [hob]    obe [ob]       zalk [zɑlk]   malk [mɑlk]
                        hean [hin]    ean [in]       tood [tud]    zood [zud]
Inconsistent spelling   houl [hul]    oul [ul]       mep [mɛp]     tep [tɛp]
                        hud [hʌd]     ud [ʌd]        keft [kɛft]   eft [ɛft]
                        hez [hɛz]     ez [ɛz]        foap [fop]    oap [op]
                        hobe [ob]     –              zalk [zɑlk]   malk [mɑlk]
                        hean [in]     –              tood [tud]    zood [zud]
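As a minimal sketch of how the inconsistent learning condition in Table 1 could be derived from the consistent one, the snippet below re-pairs two <h>-spellings with their vowel-initial audio, drops the corresponding vowel-initial labels, and doubles the exposure of the two silent-<h> words. The function and variable names are hypothetical, and the assumption that the vowel-initial spelling is the <h>-spelling minus its first letter holds only for the items in Table 1.

```python
# Spelling -> audio for the target words in the consistent condition (Table 1).
CONSISTENT_TARGETS = {
    "houl": "hul", "oul": "ul",
    "hud": "hʌd", "ud": "ʌd",
    "hez": "hɛz", "ez": "ɛz",
    "hobe": "hob", "obe": "ob",
    "hean": "hin", "ean": "in",
}
SILENT_H = ["hobe", "hean"]  # the two <h>-words whose <h> is silent in Table 1


def make_inconsistent(consistent, silent_h):
    """Return (labels, exposures) for a sketch of the inconsistent condition."""
    labels = dict(consistent)
    for h_word in silent_h:
        vowel_word = h_word[1:]                   # e.g. "hobe" -> "obe" (assumption)
        labels[h_word] = consistent[vowel_word]   # <hobe> is now heard as [ob]
        del labels[vowel_word]                    # no <obe> + [hob] set is created
    # The two silent-<h> words are presented twice to keep the blocks balanced.
    exposures = {word: (2 if word in silent_h else 1) for word in labels}
    return labels, exposures


labels, exposures = make_inconsistent(CONSISTENT_TARGETS, SILENT_H)
```

With these inputs, eight target labels remain and account for ten presentations, which matches the number of target presentations in the consistent condition.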

3.3 Procedure
The experiment was conducted online using gorilla.sc (Anwyl-Irvine et al. 2020),
a web-based experiment builder. Participants were given a URL and a unique
access code to begin the experiment using their own computer and headphones.
They were offered a $5 Amazon electronic gift certificate for their time.
In the word learning phase, participants were presented with an image of a
novel object on-screen and simultaneously heard the audio of the label for that
object, a non-word conforming to English phonotactics. Depending on the ex-
perimental group in which they were placed, they either only heard the audio,
or they were also presented with the spelling of the label. In each block, four
words were presented in a random sequence with a three-second delay between
each. Participants were then told they would be tested on their memory of
these four words. All instructions were given in French to ensure comprehen-
sion, due to the variability in English proficiency, as indicated earlier.
On the next screen, participants saw the four objects randomly displayed in a
grid. One of the words was presented auditorily, with or without its spelling dis-
played depending on the experimental group, and participants were instructed to
click on the corresponding object (see Figure 1). If they answered correctly, they
were given feedback in the form of a green checkmark, and the next test word
from the set was presented with the same four images. If they selected the incor-
rect image, a red X appeared briefly, the image was removed from the grid, and
they could try again with the remaining three images. Images were removed until
the response was correct. Any incorrect response resulted in the task for the whole block being
repeated: before moving on to the next block of four words, participants needed to
get all four correct on the first try. Ten blocks of four words were presented in this
manner. The training allowed for up to five attempts to score 100% for each block,
but only one participant required more than two rounds. Training took on average
20 minutes to complete.
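The block logic described above (remove a wrongly clicked image, repeat the block until all four words are correct on the first try, allow up to five attempts) can be sketched roughly as follows. This is an illustrative reconstruction, not the Gorilla task itself: the function names, the `respond` callback, and the word list are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical responder signature: given the target word and the images still
# on screen, return the image the participant clicks.
Responder = Callable[[str, List[str]], str]

def run_trial(target: str, images: Sequence[str], respond: Responder) -> int:
    """Return how many clicks were needed to select the target image."""
    remaining = list(images)
    clicks = 0
    while True:
        choice = respond(target, remaining)
        clicks += 1
        if choice == target:
            return clicks              # green checkmark
        remaining.remove(choice)       # red X: the wrong image leaves the grid

def run_block(words: Sequence[str], respond: Responder,
              max_attempts: int = 5,
              rng: Optional[random.Random] = None) -> Optional[int]:
    """Repeat the four-word test until every word is right on the first click.

    Returns the number of attempts used, or None if the limit is reached.
    """
    rng = rng or random.Random(0)
    for attempt in range(1, max_attempts + 1):
        order = list(words)
        rng.shuffle(order)             # test words in random order
        if all(run_trial(w, words, respond) == 1 for w in order):
            return attempt
    return None

# Usage with a hypothetical block and a participant who never errs.
block = ["houl", "ud", "hez", "obe"]
attempts = run_block(block, lambda target, remaining: target)
```

Returning `None` when the attempt limit is reached keeps the "did not reach 100%" case distinct from any valid attempt count.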
Figure 1: Screens with correct response vs. incorrect response.

To maintain engagement throughout the word learning phase (see Bell 2018
for the rationale), participants were congratulated for completing each block and
they collected tokens: pieces of pie to complete a full pie in the first five blocks
(Figure 2), and penguins to collect a family of penguins for the second five blocks.
[Congratulations! 100%! You have obtained a piece of pie. Let’s try 4 new words. When you are ready, click on the button below.]
[Wonderful! You have the complete pie! We will proceed to the second part. When you are ready, click on the button below.]
Figure 2: Screens depicting gamified elements.
Once the learning phase was completed, participants were asked to take a break
of 30 minutes during which a countdown timer appeared on the screen, and the
experiment was locked. In the main test that followed, images of objects were
presented one by one in sets of ten with either the correct audio or the minimal
pair counterpart. No spelling appeared on the screen during the test. Participants
were instructed to click on a green ‘thumbs up’ icon if they thought they heard
the correct label, or a red ‘thumbs down’ icon if they thought the label they
heard was incorrect. They completed four sets in all, totaling 40 trials. No feed-
back was given during the test, but they received a final score at the end. The
experiment took on average 60 minutes to complete, including the break.
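The scoring rule implied by the thumbs-up/thumbs-down test can be sketched as follows: a matched trial is correct when the participant accepts the label, and a mismatched trial is correct when the participant rejects it. The data layout (tuples of pair type and response) and the function name are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def percent_correct_by_pair_type(
        trials: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute percent correct separately for matched and mismatched trials.

    Each trial is (pair_type, response), where pair_type is 'match' or
    'mismatch' and response is 'up' (label accepted) or 'down' (rejected).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair_type, response in trials:
        expected = "up" if pair_type == "match" else "down"
        correct[pair_type] += int(response == expected)
        total[pair_type] += 1
    return {k: 100.0 * correct[k] / total[k] for k in total}

# Hypothetical responses from one participant (not data from the study).
example = [
    ("match", "up"), ("match", "up"), ("match", "up"), ("match", "down"),
    ("mismatch", "down"), ("mismatch", "up"),
]
scores = percent_correct_by_pair_type(example)
```

Because each trial is a two-way choice, a participant responding at random would be expected to score about 50% on either pair type.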
3.4 Analysis
A Kruskal-Wallis H test was used to investigate differences in correct responses on
targets between the three learning condition groups, first for all targets and
then for matched and mismatched targets separately. This was followed by pairwise
comparisons of correct responses between conditions using multiple
Mann-Whitney U tests, first on all targets and then on the matched and
mismatched scores separately. Finally, scores from the
Inconsistent Spelling group were compared between words learned with silent
<h> versus those in which <h> was pronounced.
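For readers unfamiliar with these rank-based tests, the two statistics can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the H statistic is computed without the tie correction that statistical packages normally apply, no p-values are derived, and the scores in the usage lines are invented rather than taken from the study.

```python
from itertools import chain
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1      # positions i..j are zero-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(*groups: Sequence[float]) -> float:
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> float:
    """Smaller of the two Mann-Whitney U statistics."""
    ranks = average_ranks(list(a) + list(b))
    u_a = sum(ranks[:len(a)]) - len(a) * (len(a) + 1) / 2
    return min(u_a, len(a) * len(b) - u_a)

# Invented percent-correct scores for three groups (illustration only).
consistent, auditory, inconsistent = [90, 85, 80], [75, 70, 65], [60, 55, 50]
h = kruskal_wallis_h(consistent, auditory, inconsistent)
u = mann_whitney_u(consistent, inconsistent)
```

An omnibus H test followed by pairwise U tests, as described above, multiplies the number of comparisons, which is one reason such pairwise results are usually read cautiously with small groups.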
4 Results
The percent of correct responses on the word–picture matching test was calcu-
lated for each participant in each experimental group for matched and mis-
matched word–picture pairs separately. Group means and standard deviations
(SD, in parentheses) are presented in Table 2.
Table 2: Mean percent correct for the matched and mismatched word–picture pairs, by word
learning group (Learning Condition).
Learning Condition Mean Percent Correct (SD)
Targets Fillers
Match Mismatch Match Mismatch
Consistent Spelling (n=) . (.) . (.) . (.) . (.)
Auditory (n=) . (.) . (.) . (.) . (.)
Inconsistent Spelling (n=) . (.) . (.) . (.) . (.)
As Table 2 illustrates, performance on the matched pairs was high for both the
targets and fillers across all three conditions, but poorer on the mismatched
items in each case. For mismatched targets (e.g., when they saw an image of a
houl and heard [ul] or vice versa), correct scores were near chance for the Audi-
tory group and well below chance for the Inconsistent Spelling group.
A Kruskal-Wallis H test was run to investigate the overall impact of learning
condition on the percent of correct responses. Distributions of test scores for target
pairs were not similar between groups, as assessed by visual inspection of a
boxplot, and the overall difference between groups was not statistically significant.
Looking at matched and mismatched targets separately, the same test revealed a
statistically significant difference between learning condition groups for
matched targets alone, H(2) = 9.319, p = .009,
but none for mismatched targets.
Multiple Mann-Whitney U tests were then run to determine if there were significant
differences in test scores on targets between pairs of learning condition
groups. While scores were not significantly different between the Consistent
Spelling and Auditory groups or between the Inconsistent Spelling and Auditory groups, scores
for the Consistent Spelling group (mean rank = 7.5) were statistically significantly
higher than those for the Inconsistent Spelling group (mean rank = 3.5),
U = 2.5, z = −2.128, p = .032. Analysing matched and mismatched targets separately
revealed that only scores on the matched target pairs differed significantly
between the Consistent Spelling and Inconsistent Spelling groups,
U = .000, z = −2.730, p = .008.
Nonetheless, a visual pattern in the data can be seen in Figure 3.
[Figure 3 (bar chart): y-axis, percent correct (0–100); x-axis, learning condition (Consistent Spelling, Auditory, Inconsistent Spelling); series, Target Match and Target Mismatch.]
Figure 3: Mean percent correct for target matched and mismatched word–picture
pairs by word learning condition. Bars indicate standard error.
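The mean ranks and U value reported for the Consistent versus Inconsistent Spelling comparison can be cross-checked with the standard rank-sum identity U = R − n(n + 1)/2, where R is a group's rank sum. The sketch below assumes five participants in each of the two groups, which is a reading of the appendix rows rather than a figure stated in the results; the function name is illustrative.

```python
from typing import Tuple

def u_from_mean_rank(mean_rank: float, n_group: int,
                     n_other: int) -> Tuple[float, float]:
    """Return (U for this group, U for the other group) from a mean rank."""
    rank_sum = mean_rank * n_group
    u_group = rank_sum - n_group * (n_group + 1) / 2
    return u_group, n_group * n_other - u_group

N_PER_GROUP = 5  # assumption: five participants per spelling group

u_inconsistent, u_consistent = u_from_mean_rank(3.5, N_PER_GROUP, N_PER_GROUP)
# u_inconsistent is 2.5 and u_consistent is 22.5; the smaller value
# matches the reported U = 2.5.

# Sanity check: the two rank sums together must equal N(N + 1)/2 for N = 10.
total_rank_sum = 7.5 * N_PER_GROUP + 3.5 * N_PER_GROUP   # 55.0
```

Under that assumption the reported mean ranks (7.5 and 3.5) and the reported U are mutually consistent.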
In the Inconsistent Spelling condition, participants were presented with two
words spelled with an <h> that was not pronounced in the audio: <hean> [in] and <hobe>
[ob]. One question was whether participants performed more poorly on those
words and better on those where the <h> was pronounced: <houl>, <hud>, and
<hez>. A Mann-Whitney U test revealed that scores were statistically significantly
higher for those words learned with /h/ (mean rank = 29.5) than those learned
with silent <h> (mean rank = 19.5), U = 420, z = 2.75, p = .006. A second question
was whether /h/- or vowel-initial words were more difficult in terms of encoding.
For the Inconsistent Spelling group only, there were nearly twice as many correct
responses when participants heard an /h/-initial word, regardless of the pair type
(matched or mismatched). For the matched pairs, they responded with
92% accuracy on the /h/-initial matches, but only 52% on the vowel-initial matches.
For the mismatched pairs, they responded with 48% accuracy on the /h/-initial
mismatches, but only 24% on the vowel-initial mismatches. See Table 3 for mean raw scores.
Table 3: Mean raw scores by word-picture pair type.
Match Mismatch
Audio Picture Mean Audio Picture Mean
houl houl  houl oul 

hud hud  hud ud 
hez hez  hez ez 
hobe hobe  hobe obe 
hean hean  hean ean 
oul oul  oul houl 

ud ud  ud hud 
ez ez  ez hez 
obe obe  obe hobe 
ean ean  ean hean 
5 Discussion
This study was a preliminary investigation into whether the presence or absence
of a written form would affect Francophones’ encoding of English /h/
during word learning, and whether inconsistency in the grapheme-phoneme
correspondence had the effect of making accurate encoding more difficult.
The first research question we addressed was whether Francophone learn-
ers of English rely on the orthography during word learning. The difference in
response accuracy rates on targets between the Consistent and Inconsistent
conditions provides evidence that they do, as these scores should have been
similar if the presence or absence of <h> in the spelling was inconsequential.
This result is inconsistent with studies that demonstrate that learners whose
L1 orthographic system is opaque, such as French, rely less on the spelling (e.g.,
Erdener and Burnham 2005), but it does fit with the bidirectional nature of opac-
ity in languages such as French (i.e., spelling is more predictive of pronunciation
than pronunciation is of spelling). This suggests that Francophones do in fact
rely on the orthography when learning the pronunciation of a word if given the
opportunity. As anecdotal evidence, observe the statement by one participant
after the experiment: “I gave myself reference points with the image and the
word, but associating the words I heard and the image was much more difficult.”
It was possible that relying on the orthography to help establish a distinction
would have led to higher scores in the Consistent Spelling condition over
the Auditory condition, but no significant difference was found. There was,
however, a difference between the matched word-picture pairs alone and we
can see a visual trend in the graph in Figure 3. It is possible, then, that with a
larger sample size, this difference would be more pronounced, suggesting a
facilitative effect of orthography, at least with a consistent GPC.
The second research question we asked was whether an inconsistency in
the grapheme-to-phoneme correspondence during word learning – a reality of
English <h> – would affect their ability to encode /h/ during the experiment.
Presuming learners did attend to the spelling during the learning task, this in-
consistency would have made the task more difficult, and our results indicate
this to be the case. The significantly lower accuracy rates in the Inconsistent
Spelling condition, as compared to the Consistent Spelling condition, point to
the variability in <h> pronunciation having an inhibitory effect on participants’
ability to encode /h/ as part of the word. While we expected a weaker perfor-
mance on words whose GPC was incongruent with the L1 (e.g., <houl>, <hud>,
and <hez>), consistent with findings by Hayes-Harb, Nicol, and Barker (2010),
participants in this condition fared poorly across all targets. This suggests that
the unreliability of the grapheme-to-phoneme correspondence was detrimental
to accurate encoding, which, to our knowledge, is a novel finding.
High scores in the Consistent Spelling condition also demonstrate that exposure
to the spelled form did not result in interference from the learners’ L1, as has
been seen in other studies (e.g., Escudero and Wanrooij 2010; Hayes-Harb,
Brown and Smith 2018; Hayes-Harb, Nicol, and Barker 2010); the
one-to-one match in the GPC may have been helpful regardless of whether that
match is identical in the L1 and L2 (Escudero, Simon, and Mulak 2014).
We would like to acknowledge some limitations that may have affected the
generalizability of our findings. Aside from the small number of participants,
there was a range of proficiency levels, although no clear pattern emerged
connecting these levels with individual rates of accuracy. Moreover, participants were
asked to self-rate their proficiency as well as their amount of daily use of
English. While the latter is logistically difficult to verify, proficiency can be
controlled for by independent means.
Concerning the task itself, the question arises as to whether word learning in
such a short period of time, over the course of a single experiment, is really a
memorization task and not truly representative of the process of lexicalization.
Escudero, Hayes-Harb, and Mitterer (2008) suggest that some aspects of lexicali-
zation can occur in even less than one hour by referring to results from Shatzman
and McQueen (2006), which show that learners are able to transfer prosodic pat-
terns to newly learned words within a single experiment session. One option
would be to wait overnight or to conduct the study over several days, as there is
good evidence that while learners can acquire a spoken form in a short period of
time, an incubation period of at least 12 hours during which the learner has slept
is needed for it to enter into lexical competition (Dumay and Gaskell 2007).
5.1 Pedagogical implications
While the results in this study did not show an overall significant difference
between the Spelling and Auditory conditions in word learning, low scores in the
Inconsistent Spelling condition highlighted a possible inhibitory effect on learners’
ability to encode /h/ as part of a word, especially given that it mirrored the
real-world variability of /h/ pronunciation. If this pattern is replicated in a
larger study, the question to ask is how such findings may be used to inform a
pedagogical approach to teaching this difficult segment: how can /h/ be pre-
sented to learners in a way that might help them establish more target-like rep-
resentations in their mental lexicon and be able to produce /h/ accurately?
One possibility is to set aside the spelling when teaching the pronunciation
of h-words and use pictures instead. The purpose would be to develop and rein-
force an association between the phonological form of a word and its meaning
without the interference of orthography. Learners could play word-picture
matching games such as pronunciation bingo or be asked to listen to a story or
song containing minimal pairs and choose the correct image from a worksheet.
Participating in picture identification exercises would not only highlight the
difference between a minimal pair, but also strengthen the association of /h/
with individual words. This would be especially important at the lower profi-
ciency levels, before the spelling has become part of the learner’s representa-
tion of a word through practice with reading and writing.
One issue pointed out by Trofimovich and John (2011) is that the number
of minimal pairs formed by an /h/-initial word and its vowel-initial counterpart is small,
and many of them do not lend themselves easily to illustration (e.g., had-add).
Nonetheless, strengthening the representations of some words with /h/ onsets
may help learners both to notice when /h/ should be pronounced and to generalize
to other /h/-initial words. Pairing pronunciation with other channels of
sensory perception is also possible if pictures are not feasible, such as the use
of tactile or kinesthetic reinforcement (Celce-Murcia et al. 2010; Chan 2018).
Learners could use touch or gestures with /h/ words when learning new vocab-
ulary or when reciting rhymes and songs.
Finally, increasing the frequency of /h/ in oral input in the instructional setting
is another potentially helpful strategy. Aside from the handful of words
where initial <h> is silent, a great number of other potential /h/ tokens are
deleted in natural speech, or they occur in environments that hinder their perceptual
salience (Jackson and Cardoso 2017). However, /h/ is deleted less often in careful
speech, such as that used in the classroom, and this could be reinforced through
the addition of activities such as reading aloud to students (for the rationale, see
Collins et al. 2009).
Together, these approaches – using pictures, kinesthetic reinforcement, and
increasing the frequency of /h/ in oral input – may well help learners distinguish these
words and establish accurate representations. At later stages, it would then be
possible to explicitly teach words where /h/ is silent. In pronunciation materials
used in the ESL classroom, there is some focus on the instances where /h/ is silent
but much less attention is typically given to the wider variability in /h/ production.
Therefore, learners may also be taught the phonological contexts in which it is deleted.
6 Conclusion
The question addressed in this pilot study was whether the difficulty Franco-
phone learners have with English /h/ may be partially due to orthographic inter-
ference and its inconsistent grapheme-to-phoneme correspondences in English.
Although the sample size was small, the results do point to this being a contributing
factor. This issue may be compounded by the unpredictability of when /h/
should be pronounced and when it should not: it is either uniformly silent for
some words, or subject to rule-governed deletion in contexts that may not be
recoverable for a Francophone learner (e.g., when at the head of a weak syllable).
The range of scores in the current study indicates that it would be worthwhile
investigating other variables to see which most strongly correlate with accurate ver-
sus inaccurate encoding of /h/. One of the more obvious is level of English profi-
ciency. Future research comparing learners of different levels of proficiency might
uncover an effect of experience with English as more advanced learners may have
trouble overcoming an entrenched pattern. Notably, the Inconsistent Spelling con-
dition contained all upper intermediate level speakers (a result of random group
assignment), and accuracy rates were lowest in this condition. Also interesting may
be individual ability to discriminate between /h/-initial words and their vowel-
initial counterparts. A typical reason given for why Francophones have such difficulty
with /h/ is its weak perceptual salience (e.g., Collins et al. 2009). However,
while they do not discriminate between /h/- and vowel-initial pairs as well as Anglophones,
they have been shown to perform above chance on discrimination tasks
(e.g., Mah 2011; Mielke 2008). In addition, adding a production task could determine
whether the scores on word learning correlate with accurate productions of
/h/-initial words. It may be the case that the inconsistencies in the grapheme-to-phoneme
correspondences also account for erroneous insertion of /h/ on vowel-initial
words. Finally, research with real words rather than novel words and
using word familiarity as a variable may reveal an effect of entrenched forms of
familiar words due to the influence of orthography, and further evidence that learning
the correct form auditorily at earlier stages is key.
Together, looking at the influence of orthography in perception, lexical en-
coding, and production, along with the effect of individual differences, will
contribute to a more complete picture of the problem Francophone learners
have with English /h/.
References
Anwyl-Irvine, Alexander L., Jessica Massonnié, Adam Flitton, Natasha Kirkham & Jo
K. Evershed. 2020. Gorilla in our midst: An online behavioural experiment builder.
Behavior Research Methods 52(1). 388–407.
Archibald, John. 2021. Ease and Difficulty in L2 Phonology: A Mini-Review. Frontiers in
Communication 6. https://doi.org/10.3389/fcomm.2021.626529
Bell, Kevin. 2018. Game on!: Gamification, Gameful Design, and the Rise of the Gamer
Educator. Baltimore: Johns Hopkins University Press.
Bürki, Audrey, Pauline Welby, Mélanie Clément & Elsa Spinelli. 2019. Orthography and second
language word learning: Moving beyond “friend or foe?” The Journal of the Acoustical
Society of America 145(4). EL265–EL271.
Castles, Anne, Katherine Wilson & Max Coltheart. 2011. Early orthographic influences on
phonemic awareness tasks: Evidence from a preschool training study. Journal of
Experimental Child Psychology 108(1). 203–210.
Celce-Murcia, Marianne, Donna M. Brinton & Janet M. Goodwin. 2010. Teaching
Pronunciation: A Reference for Teachers of English to Speakers of Other Languages. 2nd
edn. Cambridge: Cambridge University Press.
Chan, M. J. 2018. Embodied Pronunciation Learning: Research and Practice. CATESOL Journal
30(1). 47–68.
Collins, Laura, Pavel Trofimovich, Joanna White, Walcir Cardoso & Marlise Horst. 2009. Some
input on the easy/difficult grammar question: An empirical study. The Modern Language
Journal 93(3). 336–353.
Cutler, Anne, Andrea Weber & Takashi Otake. 2006. Asymmetric mapping from phonetic to
lexical representations in second-language listening. Journal of Phonetics 34(2). 269–284.
Dumay, Nicolas & M. Gareth Gaskell. 2007. Sleep-associated changes in the mental
representation of spoken words. Psychological Science 18(1). 35–39.
Dupoux, Emmanuel, Christophe Pallier, Nuria Sebastian & Jacques Mehler. 1997. A
destressing “deafness” in French? Journal of Memory and Language 36(3). 406–421.
Erdener, V. Doǧu & Denis K. Burnham. 2005. The role of audiovisual speech and orthographic
information in nonnative speech production. Language Learning 55(2). 191–228.
Escudero, Paola. 2015. Orthography plays a limited role when learning the phonological forms
of new words: The case of Spanish and English learners of novel Dutch words. Applied
Psycholinguistics 36(1). 7–22.
Escudero, Paola, Rachel Hayes-Harb & Holger Mitterer. 2008. Novel second-language words
and asymmetric lexical access. Journal of Phonetics 36(2). 345–360.
Escudero, Paola, Ellen Simon & Karen Mulak. 2014. Learning words in a new language:
Orthography doesn’t always help. Bilingualism: Language and Cognition 17(2). 384–395.
Escudero, Paola & Karen Wanrooij. 2010. The effect of L1 orthography on non-native vowel
perception. Language and Speech 53(3). 343–365.
Frost, Ram & Johannes C. Ziegler. 2007. Speech and spelling interaction: The interdependence
of visual and auditory word recognition. In M. Gareth Gaskell (ed.), The Oxford Handbook
of Psycholinguistics, 107–118. Oxford: Oxford University Press.
Hayes-Harb, Rachel, Kelsey Brown & Bruce L. Smith. 2018. Orthographic input and the acquisition
of German final devoicing by native speakers of English. Language and Speech 61(4). 547–564.
Hayes-Harb, Rachel, Janet Nicol & Jason Barker. 2010. Learning the phonological forms of new
words: effects of orthographic and auditory input. Language and Speech 53(3). 367–381.
Horst, Jessica S. & Michael C. Hout. 2015. The Novel Object and Unusual Name (NOUN)
Database: A collection of novel images for use in experimental research. Behavior
Research Methods 48(4). 1393–1409.
Jackson, Susan & Walcir Cardoso. 2017. The acquisition of English /h/ by Francophones: Input
frequency and perceptual salience in a corpus study. In Jaime Demperio, Suzanne
Springer, & Beau Zuercher (eds.), Proceedings of the Meeting on English Language
Teaching. Québec: Université du Québec à Montréal Press.
Janda, Richard D. & Julie Auger. 1992. Quantitative evidence, qualitative hypercorrection,
sociolinguistic variables – And French speakers’ ‘eadhaches with English h/Ø. Language
& Communication 12(3–4). 195–236.
John, Paul. 2006. Variable h-epenthesis in the interlanguage of Francophone ESL learners.
Montreal, Canada: Concordia University MA thesis.
LaCharité, Darlene & Philippe Prévost. 1999. Le rôle de la langue maternelle et de
l’enseignement dans l’acquisition des segments de l’anglais langue seconde par des
apprenants francophones. Langues et linguistique 25. 81–109.
Mah, Jennifer. 2011. Segmental representations in interlanguage grammars: the case of
francophones and English /h/. Montreal, Canada: McGill University dissertation.
Marjou, Xavier. 2019. OTEANN: Estimating the Transparency of Orthographies with an Artificial
Neural Network. Retrieved from https://arxiv.org/abs/1912.13321v3
Mielke, Jeff. 2008. Interplay between perceptual salience and contrast: /h/ perceptibility in
Turkish, Arabic, English, and French. In Peter Avery, Elan Dresher & Keren Rice (eds.), Contrast
in Phonology: Theory, Perception, Acquisition, 173–192. Berlin, New York: Mouton de Gruyter.
O’Brien, Mary. 2021. Ease and Difficulty in L2 Pronunciation Teaching: A Mini-Review. Frontiers
in Communication 6. https://doi.org/10.3389/fcomm.2020.626985
Peperkamp, Sharon, Inga Vendelin & Emmanuel Dupoux. 2010. Perception of predictable
stress: A cross-linguistic investigation. Journal of Phonetics 38(3). 422–430.
Rafat, Yasaman. 2016. Orthography-induced transfer in the production of English-speaking
learners of Spanish. The Language Learning Journal 44(2). 197–213.
Saletta, Meredith, Lisa Goffman & Tiffany P. Hogan. 2016. Orthography and modality influence
speech production in adults and children. Journal of Speech, Language, and Hearing
Research 59(6). 1421–1435.
Shatzman, Keren B. & James M. McQueen. 2006. Segment duration as a cue to word
boundaries in spoken-word recognition. Perception & Psychophysics 68(1). 1–16.
Shea, Christine. 2017. L1 English/L2 Spanish: Orthography–phonology activation without
contrasts. Second Language Research 33(2). 207–232.
Showalter, Catherine E. & Rachel Hayes-Harb. 2015. Native English speakers learning
Arabic: The influence of novel orthographic information on second language phonological
acquisition. Applied Psycholinguistics 36(1). 23–42.
Statistics Canada. 2017. Focus on Geography Series, 2016 Census. Statistics Canada
Catalogue no. 98-404-X2016001. Ottawa, Ontario. Retrieved May 7th from Statistics
Canada: https://www12.statcan.gc.ca/census-recensement/2016/as-sa/fogs-spg/Facts-
cma-eng.cfm?LANG=Eng&GK=CMA&GC=485&TOPIC=5
Trofimovich, Pavel & Paul John. 2011. When ‘three’ equals ‘tree’: Examining the nature of
phonological entries in L2 lexicons of Quebec speakers of English. In Pavel Trofimovich &
Kim McDonough (eds.), Applying priming methods to L2 learning, teaching and research:
Insights from psycholinguistics, 105–129. Amsterdam: John Benjamins.
Walker, Douglas C. 2001. French Sound Structure (Vol. 1). Calgary: University of Calgary Press.
Weber, Andrea & Anne Cutler. 2004. Lexical competition in non-native spoken-word
recognition. Journal of Memory and Language 50(1). 1–25.
Appendix: Participant characteristics and assigned groups
Learning Condition Participant Gender Age Level of English % of daily use
Consistent Spelling P F  intermediate 

P F  elementary –
P F  upper intermediate 
P F  elementary >
P M  intermediate –
Auditory P F  intermediate 
P F  beginner >
P M  upper intermediate 
P F  intermediate –
P F  elementary >
P F  low intermediate –
Inconsistent spelling P F  upper intermediate –

P M  upper intermediate –
P F  upper intermediate –
P F  upper intermediate 
P F  upper intermediate >
Improving fossilized English pronunciation
by simultaneously viewing a video footage
of oneself on an ICT self-learning system
Abstract: Teaching pronunciation by using the names of the letters of the alpha-
bet can contribute to accurate pronunciation, as half of all English phonemes are
included when the letters of the alphabet are pronounced (e.g., /b/+/iː/ for B).
However, when pronouncing the names of the letters, Japanese learners tend to
replace some English sounds with similar Japanese ones, and this can lead to
fossilization of incorrect pronunciation. This paper thus examined whether an In-
formation and Communication Technology (ICT) self-learning system is effective
in improving the fossilized sounds found in learners’ pronunciation of the names
of the alphabet letters in English. This system offers learners an opportunity to
view real-time videos of themselves. The approach was found to improve fossil-
ized English pronunciation, especially with consonants.
Keywords: English pronunciation, ICT, fossilized language, self-video, Japanese learners
1 Introduction
The importance of learning English skills has been a focus of education courses
around the world due to the globalization of economies. In order to communicate
with other people in English, there are many skills to be mastered: English gram-
mar, vocabulary, and syntax, which together constitute the basic knowledge of
English itself, but there is also socio-cultural understanding, listening, and
speaking, as well as non-verbal communication skills such as facial and manual
gestures (Acton 1984; Smotrova 2017). Unquestionably, pronunciation plays the
most crucial role in oral interaction, and pronunciation errors may lead to severe
breakdowns in communication; therefore, the teaching and learning of correct
pronunciation are key issues to be discussed in second language acquisition
(Jarosz 2019).
Acknowledgments: This study was supported by a Grant-in-Aid for Scientific Research promoted
by JSPS (the Japan Society for the Promotion of Science; Grant No. 17K02951, 18K00787).
VERSON2 and Nissho Co. helped with the development of the ICT materials.
Yuri Nishio, Meijo University
Akiyo Joto, Prefectural University of Hiroshima
https://doi.org/10.1515/9783110736120-010
In Japan, English sounds have not been taught systematically in English courses
in primary and secondary education. Additionally, English phonetics is not a re-
quired subject for university students taking a teacher-training course. This
means that teachers in Japan are often less than confident about teaching En-
glish (Joto, Miyake, and Nishio 2017). This results in insufficient training in En-
glish pronunciation for students.
Regarding English pronunciation, segments are likely to impact intelligibil-
ity at the lexical level in that mispronunciation may lead listeners to fail to de-
code the intended words (Levis 2018: 37). With regard to Japanese phonemes,
there are only five vowels /a, i, u, e, o/, and there are 24 consonants (e.g., /k, s, t, n/)
(Kokusaikoryukikin 1989). Even though voiceless plosives are phonemically
transcribed in the same way in both English and Japanese, they are pronounced
differently phonetically: English voiceless stops are aspirated at the onset of
stressed syllables, while their Japanese counterparts are not. For the pronuncia-
tion of the Japanese /i/, the front of the tongue is a little lower than for the En-
glish /i/, which is tensed with the mouth pulling strongly sideways. For such
reasons, Japanese learners of English need to have a basic knowledge of the
articulatory differences between the two languages.
Couper (2006) found that explicit teaching on pronunciation brought sig-
nificant improvement. Learners need to be made aware of their pronunciation,
and incorrect pronunciations should be corrected immediately; if not, their pro-
nunciation will not change even when they have been learning English for a
long time. In the end, incorrect sounds become fossilized (Gass and Selinker
1992; Selinker 1972).
We must also consider how students learn pronunciation amidst Japan’s
scanty English input. Japan is defined as being in the “Expanding Circle” (Kachru
1985), where English is taught as a foreign language and people do not need to
use English daily. Additionally, Japanese people’s contact with native English
speakers is extremely limited. Under such circumstances, it is suggested that
information and communication technology (ICT) training can be effective because it is
conducive to independent study (Pennington and Rogerson-Revell 2019). Several
variations on ICT training materials have been developed, such as videos that
present visual articulation aids (Lambacher 2010) and native speakers’ pronunci-
ation of sounds (Hazan et al. 2006).
For our study, we developed ICT materials that included real-time self-videos
that allowed the learners to compare their pronunciation with a video of a native
speaker pronouncing the target sounds. In previous developments (Hazan et al.
2006), learners could watch a video of a native speaker pronouncing sounds, but
they did not have the opportunity to compare their own pronunciation simulta-
neously unless they used a mirror to watch their mouth moving. We, therefore,
developed our ICT training with a self-video, and in our study, we examined how
the ICT training with a self-video could improve the learners’ pronunciation in
comparison with the ICT training without a self-video. The ICT material to be
learned should involve familiar lexical items, which would be retained more easily
by the learners (Carley and Mees 2020). Building on this idea, we chose the
names of the letters in the English alphabet because they are introduced to English
beginners at quite an early stage, so learners should know how to pronounce the
names of the letters of the alphabet. If they are unable to pronounce some of the
names of the letters of the alphabet, these will be considered as having become
fossilized. In addition, half of all English phonemes are included when the letters
of the alphabet are pronounced (e.g., /b/+/iː/ for B). It is assumed that if /biː/ for
B is pronounced correctly, words that include the /biː/ sequence, such as beach /biːʧ/,
can also be pronounced correctly. Furthermore, none of the previous studies on English
phonetics have dealt with the sounds in the names of the letters of the alphabet.
Therefore, we investigated whether the ICT materials we developed using the
alphabet could be effective in helping Japanese university students to improve
their English pronunciation of consonants and vowels. Our goal is to demonstrate
how these ICT materials can help both teachers and learners improve their
English pronunciation.
Our research questions are as follows:
1. Can ICT self-training help participants improve their pronunciation of the
names of the letters of the alphabet?
2. Is ICT training with a self-video more beneficial to participants than ICT
training without a self-video?
3. What do participants think about the ICT materials provided?
2 Learning pronunciation
2.1 English pronunciation in Japanese education
Japanese education systems have changed drastically due to both historical and
economic reasons. Sasaki (2008) describes a 150-year history of school-based En-
glish education and assessment in Japan, going back to around 1860. Her study
shows how, prior to 1970, learning English was regarded as a unilateral means of
importing foreign culture and knowledge. However, from 1970 to 1990, English
education was influenced by rapid globalization, Japan’s economic growth, and
the internationalization of English itself. Since globalization introduced the
necessity for business leaders to have a good command of English, strong demands
from the business world prompted rapid changes in English education. The New
Courses of Study implemented by the Ministry of Education, Culture, Sports,
Science and Technology (Joto, Miyake, and Nishio 2017; MEXT 2017b) became obligatory,
and from 2020 English activities were introduced in the third and fourth grades of
elementary school (9- and 10-year-olds), and English was taught as a required
subject to fifth and sixth graders. The English education now delivered focuses
on fostering increased communicative competence in English.
Several surveys have been conducted to shed light on the realities of English
education in Japan. According to MEXT (2015), 67% of the elementary school
teachers who participated in a MEXT survey experienced challenges and anxiety
relating to teaching English pronunciation. Most incumbent elementary school
teachers teach all subjects, namely, Japanese, math, science, social studies, and
English, although English was not included as a subject when they took their
teacher education courses. As a result, 93% of the responses indicated that the
teachers tried not to teach English pronunciation in their classrooms and relied instead on
Assistant English Teachers for their English pronunciation models. Joto, Miyake,
and Nishio (2017) found a negative correlation between anxiety around English
pronunciation and teachers’ experience and training that involved knowledge of
English phonetics. In their study, the teachers had not received any systematic
training in English pronunciation from the Municipal Board of Education or dur-
ing their nationwide service training.
Furthermore, the core curriculum of the teacher training course was changed
in 2017 (MEXT 2017a), after which the pedagogical importance of English
pronunciation was stressed. Although English phonetics is not a required subject as
such, knowledge of English pronunciation is now taught as one of the items in
English Studies. However, in Japan’s education system, teachers still
have insufficient knowledge about pronunciation, and this can be attributed to
the lack of opportunity to learn pronunciation from their own teachers.
2.2 Difficulties experienced by Japanese learners of English pronunciation
2.2.1 Different consonants and vowels
For Japanese learners of English, one of the reasons for the difficulties they experi-
ence with English pronunciation is that the English phonemes are very different
from those of Japanese. Lado’s (1957) Contrastive Analysis Hypothesis (CAH)
addresses this point: it suggests that by comparing a first language
(L1) with an L2, it is possible to predict which pronunciation features will be either
the easiest or the most difficult for the learner to master. Flege’s (1995) Speech
Learning Model (SLM) predicts that if an L2 learner perceives an L2 speech sound
to be similar to a known L1 speech sound, the two sounds will be combined and
assimilated. In contrast, if the L2 sound is perceived as new, then a new category
will be established with properties that may eventually match the properties of the
true L2 sound. Another model, the Perceptual Assimilation Model (PAM) (Best
1995; Best and Tyler 2007), holds that how well a non-native contrast is
discriminated depends on how its sounds are perceptually assimilated to native
phonological categories.
There have been several studies on how the Japanese perceive vowels that
are similar in English and Japanese. Shimizu (2016) describes the acoustic and
phonetic characteristics of Japanese (L1) and English (L2) vowels produced by
Japanese ESL learners and compares them with those of 11 native English
speakers. He focuses on the first (F1) and the second (F2) formants of vowels in
both the L1 and the L2 of Japanese ESL learners. The Japanese learners tended
to use their own vowel regions in the vocal tract to produce American English
(AE) vowels that are similar to L1 sounds. Shimizu concludes that these results seem
to support the PAM (Best 1995; Best and Tyler 2007) in the way the learners acquire
their L2 vowels.
Oh et al. (2011) investigated the effect of age of acquisition on first and second
language vowel production by Native Japanese (NJ) adults and children as well as
by age-matched Native English (NE) adults and children. After living in the USA
for one year, the NJ children produced the English “new” vowels /ɪ/, /ɛ/, /ɑ/,
/ʌ/, and /ʊ/ more accurately, in a native-like manner, whereas the NJ adults did
not reach accurate production.
Lambacher et al. (2005) examined whether a six-week identification training
would be effective in improving the identification and production of the Ameri-
can English (AE) mid and low vowels /æ/, /ɑ/, /ʌ/, /ɔ/, /ɝ/ by native Japanese.
The identification performance of the participants improved after identification
training with feedback, and the training also had a positive effect on their pro-
duction of the targeted AE vowels.
From these studies, as Oh et al. (2011) mentioned, it was evident that native
Japanese children acquired native-like vowels, but native Japanese adults did not
reach the native levels, although six weeks of identification training could have a
positive effect on their production (Lambacher et al. 2005). However, Japanese
university students in Japan used the same L1 vocal tract regions to produce American
vowels (Shimizu 2016), so we can assume that English vowels are more challenging
to acquire because some of the phonemes are quite similar to the Japanese ones,
especially for Japanese adults living in Japan.
English consonants are also different from the Japanese ones. Riney and
Anderson-Hsieh (1993) mentioned that standard Tokyo Japanese includes the
consonants /p, t, k, b, d, g, ts, s, z, m, n, ɾ, h, j/, whereas American English has the
following consonants: /p, b, t, d, k, g, f, v, θ, ð, s, z, ʃ, ʒ, ʧ, ʤ, m, n, ŋ, l, r, j, w, ʍ, h/.
Comparing the two inventories, /f/, /v/, /θ/, /ð/, /ʃ/, /ʒ/, /ʧ/, /ʤ/, and /ʍ/ do
not exist among the Japanese consonants.
Regarding the perception of the English consonants, Yamada and Adachi
(1998) studied comprehensive data inquiring about which English phonemes
were difficult to identify. The participants listened to the words, which con-
sisted of the target consonant and vowel /iː/, and distinguished the correct
word. As the following results show, most of the target sounds were correctly
identified in fewer than 50% of the cases: /z/ showed an accuracy rate of 52%
[misidentified as /ð/ (23%) and as /ʤ/ (18%)]; /f/ presented an accuracy rate of 37%
[misidentified as /ð/ (26%) and as /s/ (20%)]; /θ/ was correct in 37% of the cases
[misidentified as /s/ (30%) and as /ʃ/ (17%)]; /ð/ had an accuracy of 34%
[misidentified as /z/ (28%), /ʤ/ (12%), and /v/ (11%)]; /v/ was identified correctly
29% of the time [misidentified as /ð/ (25%), /z/ (13%), and /b/ (10%)]. The
results of the perception task suggest that these consonants were difficult to
distinguish because no equivalent consonants exist in Japanese.
Regarding pronunciation, Yamada and Adachi (1999) explained which En-
glish consonants were mispronounced and substituted by Japanese phonemes,
for example, /s/ was substituted by /ʃ/; /f/ by /ɸ/; and /r/ and /l/ by /ɾ/. Joto
(2020) found that Japanese learners mispronounced the English fricatives /s/
and /ʃ/ as the Japanese fricative /ɕ/. Joto (2009) also investigated how native
English speakers judged the English consonants pronounced by Japanese university
students based on their intelligibility rates. Those with intelligibility
scores lower than the average (2.47, where 3 is the full mark) were /ʤ/ (major) 2.15;
/w/ (wet) 2.01; /ð/ (then) 1.92; /θ/ (thick); /w/ (womb) 1.78; /z/ (zee) 1.73; /j/ (yeast)
1.55; /w/ (wood) 1.57. The English phonemes which have a similar counterpart
in Japanese, namely /j/ and /w/, were particularly problematic; however,
even when the phonemes in Japanese did not have similar counterparts, the
English phonemes tended to be substituted by the Japanese ones.
Vance (1987) explained the articulatory differences between Japanese and
English, which include: (a) lip rounding, which is weaker in Japanese than in
English; (b) jaw position, which is more open in English than in Japanese; and
(c) a “tongue blade articulator” in Japanese versus a “tongue tip articulator” in
English.
The results from several studies show that for Japanese learners of English,
not only English sounds that are similar to Japanese ones, but also new sounds
that do not exist in the L1 system, can be considered to be problematic.
2.2.2 Fossilized pronunciation
The pronunciation errors made by Japanese learners are considered as a by-product
of the process of interlanguage or as fossilization. Major (1987) explained how the
developmental stages of interlanguage related to phonology: the interference of the
first language decreased in the course of learners’ L2 learning process, but the
developmental process fluctuated, which resulted in their pronunciation changing at
different points of their interlanguage. However, if L1 phonological features still
remained in the L2 sounds once a stable stage had been reached, these
sounds were considered to be fossilized. Fossilization was a
term coined by Selinker (1972), who described it as the cessation of development in
a language system or subsystem, which affected most second language (L2) learn-
ers and users, particularly in the phonological, grammatical, and lexical areas of a
language (Han and Odlin 2005). Fossilization is generally explained as a clear inter-
ference by the L1. In principle, it should be logical to expect all errors to be transi-
tory. A lapse of five years was necessary before an element was considered to be
definitively fossilized (Gass and Selinker 1992).
In the studies mentioned previously, regarding vowels (e.g., Shimizu 2016)
and consonants (e.g., Yamada and Adachi 1999), some English phonemes are
more likely to appear as errors when pronounced by Japanese learners, and
these phonemes are considered to be fossilized sounds. We are interested in
showing which English phonemes are generally regarded as fossilized sounds
and intend to investigate whether, with proper training, there is an opportunity
to improve the pronunciation of these sounds.
2.3 Alphabet learning
What materials should be used for training in pronunciation? Familiar and com-
mon ways consist of having learners listen to individual phonemes in words or
minimal pairs showing the contrasts (Carley and Mees 2020). In our study, we
used the names of the letters in the English alphabet itself as the target for the
pronunciation training, so that A was learned as the diphthong /eɪ/, B as a
consonant+vowel /biː/, etc. The English alphabet is introduced during the early
stages of learning, and there are several studies showing that alphabet
knowledge, e.g., knowing that the sound corresponding to the letter A in ‘apple’ is /æ/,
is essential for reading, spelling acquisition, and comprehension in L1 children
(Piasta and Wagner 2010). The teaching of a letter with its corresponding sounds
is called phonics, which is helpful in learning to read (Ehri 2013, 2020).
In Japan, the English alphabet is introduced in the first textbook for third-
year pupils in elementary schools. Teaching pronunciation using the alphabet
can contribute to accurate pronunciation because 24 phonemes, about half of
the total English phonemes, appear when the letters of the alphabet are
pronounced (e.g., /eɪ/ for A, /b/+/iː/ for B). There are eight vowels that appear in
the alphabet: /ɛ/ in F, S, X; /ʌ/ in W; /iː/ in B, C, D, G, P, T, V, Z; /uː/ in Q, U and
W; /eɪ/ in A, H, J, K; /aɪ/ in I, Y; /oʊ/ in O; /ɑɚ/ in R. This also applies to the
consonants in English: /b/ in B; /s/ in C, S, X; /d/ in D, W; /f/ in F; /ʤ/ in G, J;
/ʧ/ in H; /k/ in K, Q, X; /l/ in L; /m/ in M; /n/ in N; /p/ in P; /j/ in Q, U; /t/ in T;
/v/ in V; /w/ in Y; and /z/ in Z. However, Japanese learners tend to replace some
of the English sounds with similar Japanese ones (e.g., Z [ziː]→[ʥiː], A [eɪ]→[e]+[i]).
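The figure of 24 phonemes can be checked with a short tally. The sketch below is illustrative only: the broad transcriptions in `LETTER_NAMES` are an assumption (General American letter names, with W written /dʌbljuː/ without a schwa so that it contributes only the vowels listed above), and the names `LETTER_NAMES`, `VOWELS`, and `inventory` are hypothetical. Under these assumptions the tally yields 8 vowels and 16 consonants, the latter including /d/ (in D and W) and /v/ (in V).

```python
# Broad transcriptions of the 26 English letter names, one list of segments each.
# ASSUMPTION: illustrative General American forms, not taken from the study's data.
# W is written /dʌbljuː/ (no schwa), so it adds only the vowels /ʌ/ and /uː/.
LETTER_NAMES = {
    "A": ["eɪ"],           "B": ["b", "iː"],      "C": ["s", "iː"],
    "D": ["d", "iː"],      "E": ["iː"],           "F": ["ɛ", "f"],
    "G": ["ʤ", "iː"],      "H": ["eɪ", "ʧ"],      "I": ["aɪ"],
    "J": ["ʤ", "eɪ"],      "K": ["k", "eɪ"],      "L": ["ɛ", "l"],
    "M": ["ɛ", "m"],       "N": ["ɛ", "n"],       "O": ["oʊ"],
    "P": ["p", "iː"],      "Q": ["k", "j", "uː"], "R": ["ɑɚ"],
    "S": ["ɛ", "s"],       "T": ["t", "iː"],      "U": ["j", "uː"],
    "V": ["v", "iː"],      "W": ["d", "ʌ", "b", "l", "j", "uː"],
    "X": ["ɛ", "k", "s"],  "Y": ["w", "aɪ"],      "Z": ["z", "iː"],
}

# The eight vowel phonemes named in the text; every other segment is a consonant.
VOWELS = {"ɛ", "ʌ", "iː", "uː", "eɪ", "aɪ", "oʊ", "ɑɚ"}


def inventory(letter_names):
    """Return (vowels, consonants) as sets of distinct phonemes."""
    phonemes = {seg for segs in letter_names.values() for seg in segs}
    return phonemes & VOWELS, phonemes - VOWELS


vowels, consonants = inventory(LETTER_NAMES)
print(len(vowels), len(consonants), len(vowels) + len(consonants))  # 8 16 24
```

A different transcription convention (for example, including the schwa in W) would change the total, so the count depends on the assumptions stated above.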
Additionally, Japanese loan words are used for the letters of the alphabet as
follows: /eː/ as A; /biː/ as B; /ɕiː/ as C; /diː/ as D; /iː/ as E; /eɸ/ as F; /dʑi/ as G;
/eiʧ/ as H; /ai/ as I; /ʥeː/ as J; /keː/ as K; /eɾu/ as L; /emu/ as M; /enu/ as N; /oː/ as
O; /piː/ as P; /kjɯː/ as Q; /aːɾu/ as R; /esu/ as S; /tiː/ as T; /jɯː/ as U; /bɯi/ as V;
/dabuɾjɯː/ as W; /ekkɯsɯ/ as X; /wai/ as Y; /dzetto/ as Z. If the Japanese loan
word influences the pronunciation of the English alphabet, Japanese learners
of English will pronounce A as /eː/ instead of /eɪ/. If some of the names of the
letters of the alphabet are pronounced like the Japanese sounds, these sounds
can be considered fossilized because the Japanese learners learned the alphabet
a long time ago.
2.4 Factors influencing L2 pronunciation
In terms of the factors that influenced L2 pronunciation in adults, Purcell and
Suter (1980) identified four significant predictors of accented speech: first
language, aptitude for oral mimicry, length of time in the L2 environment, and
strength of concern for pronunciation accuracy. Thompson (1991), in a study of
Russian speakers of English, found that the factors that best predicted native
speakers’ perception of accent were the age at arrival, gender, self-ratings of oral
mimicry skills, and overall proficiency in the L2. Flege (1995) found that, in addi-
tion to age at arrival, the factors of length of residence, the speaker’s gender, and
relative use of the first and second languages all affected the degree of perceived
accent. The above studies on pronunciation were all set in environments involving
long periods of residence, such as immigration or families relocating for a parent’s
job. Nation and Newton (2009) mentioned, from the pedagogical perspective,
that five factors have been shown to have major effects on the learning of another
sound system: the age of the learner, the learner’s first language, the learner’s cur-
rent stage of proficiency, the experience and attitudes of the learner, and the con-
ditions for teaching and learning.
Couper (2006) studied the impact of explicit teaching on pronunciation
improvement. In the same vein, Zhang and Yuan (2020) reported positive
effects of explicit pronunciation instruction on accuracy in pronunciation.
From the previous studies, we can conclude that the factors that would
affect Japanese learners of English as a foreign language would be age, aptitude
for oral mimicry, the strength of concern for pronunciation accuracy, and explicit
pronunciation instruction. Therefore, the participants in our study should be
carefully chosen to have English educational backgrounds that are as similar as
possible.
2.5 Pronunciation training
2.5.1 Traditional pronunciation training
Traditionally, there are two major approaches to teaching pronunciation: the in-
tuitive-imitative approach and the analytic-linguistic approach (Celce-Murcia
2001). The intuitive-imitative approach is based on the learner’s ability to imitate
sounds and speech. As one of the factors influencing pronunciation is oral
mimicry (Purcell and Suter 1980; Thompson 1991), learners with a good ear for
mimicry can acquire the L2 sounds well. In this traditional way of teaching,
teachers tend to demonstrate how to produce particular segments and
suprasegmentals without any explicit instruction and have students listen to the
sounds and repeat them.
Stevick (1978) mentioned that learners are able to copy new sound forms
easily, but three things can cause difficulties for learners in doing so. First,
the learners might overlook some features. In this case, the teacher can help
them by providing a model that is appropriate to their level. Second,
the learners might sound bad to themselves although they are copying well.
Students are very sensitive about their pronunciation when producing
foreign sounds, either in the classroom or in public, so they should be helped
to develop a more positive attitude. Third, learners can become anxious
about making the sounds. In this case, the teacher should not point out the
learners’ errors but should find ways to reduce their anxiety.
The analytic-linguistic approach provides knowledge in the field of phonetics,
referring students to phonetic charts and articulatory features of the sounds,
which can make the process of acquiring the pronunciation of a foreign language
more conscious.
Both the intuitive-imitative approach and the analytic-linguistic approach
presume that teachers teach pronunciation to students in classrooms. In recent
decades, however, new technologies have been developed for teaching
pronunciation outside classrooms.
2.5.2 ICT pronunciation training
Several ICT training software applications, which have been developed based on
second-language speech processing research results, are available on the
Internet. These programs involve auditory or visual input (pictures of a speaker or
video clips in which target sounds or words are pronounced), which can help
learners improve their L2 pronunciation and speech perception (Hardison 2010).
Auditory-visual integration is considered crucial, as shown by the McGurk
Effect (McGurk and MacDonald 1976), in which visual information from the mouth
and the sound together affect the decoding process.
Various types of electronic visual display, such as those for viewing amplitude and
pitch and for viewing and measuring duration and frequency range, have been
helpful in improving learners’ pronunciation (Lambacher 2010).
One of the studies relating to Japanese learners was that of Hazan et al.
(2006), which investigated the sensitivity of second language learners to the
phonetic information contained in visual cues when identifying a non-native
phonemic contrast. Spanish and Japanese learners of English were tested on
their perception of /b/-/p/ in three conditions: audio (A), visual (V), and audio-
visual (AV) modalities. The A condition involved listening to the target sounds,
the V condition involved watching video clips of a native speaker’s face, and
the AV condition had them combined. Although the Spanish students showed
better performance overall, both learner groups achieved higher scores in the
AV condition. The same experiment was conducted for /l/-/r/ with Korean and
Japanese learners. Overall, these results show the impact of the learner’s
language background, although correlations between scores for the auditory and
visual conditions suggest that increasing auditory proficiency in identifying a
non-native contrast is linked with increased proficiency in using visual cues to
the contrast.
Lambacher (2010) reported the use of a CALL tool that utilizes acoustic data
in real time to help Japanese L2 learners improve their perception and production
of English consonants. This involved speech-learning software running on a
networked workstation. The software provided an acoustic analysis of the learners’
recorded utterances. This system allowed learners to recognize the difficulties in
their English consonants by visualizing their own pronunciation.
Pennington and Rogerson-Revell (2019) reviewed the most recently available
technologies for teaching and learning pronunciation that could be applied to
pronunciation pedagogy. These included visuals, acoustic feedback using pitch
contours or wave patterns, and artificial intelligence (AI) with robot-assisted
language learning. They also provided references to usages on the Internet. The
effectiveness of these new technologies has yet to be investigated.
In our current study, we developed two types of ICT materials: one involving
video clips of native speakers explaining how to pronounce words, and the other
using the same video clips followed by real-time videos that enable the learners
to view their own articulation. The native video clips provide visuals for the
articulation together with corresponding model sounds, a combination that has been
shown to be effective (Hazan et al. 2006). In addition, explanations of the pronunciation provide
explicit instruction (Couper 2006; Zhang and Yuan 2020) and foster awareness of
the accurate pronunciation (Purcell and Suter 1980). In the second ICT system,
the real-time videos of the learner’s own pronunciation enable the learners to rec-
ognize and monitor whether their oral mimicry is correct or not (Purcell and
Suter 1980; Thompson 1991). This enables learners to develop their mimicry and
ultimately master the correct L2 sounds.
3 Method
3.1 Participants
Twenty intermediate-level Japanese private university students who were
majoring in English participated in our study. They were asked to complete personal
background questionnaires (see Table 1). They were all third- and fourth-year
students. Some students started to learn English when they were less than six years
old, but most started when they were between 7 and 12 years of age. All of them
had studied abroad in America, Canada, or Australia for three to five months dur-
ing their second year. The experience of studying abroad helped the students to
improve their English proficiency, and they were now expected to have a good
command of English. In addition, they attended regular English classes and were
in an environment that provided authentic input from native speakers.
Table 1: Participants’ backgrounds.
Questions | EX Group | CO Group
Grade | 3rd year, 4th year of university | 3rd year, 4th year of university
Age | (.) | (.)
Age of studying English | Less than  years old: , – years old: , – years old: , – years old:  | Less than  years old: , – years old: , – years old: , – years old: 
Place of studying English | Cram school or English school: , Elementary school:  | Cram school or English school: , Elementary school: 
Teacher of English | Japanese: , Foreigner: , Japanese and foreigner:  | Japanese: , Foreigner: , Japanese and foreigner: 
Experience of traveling to foreign countries | Australia, America, Hong Kong for a week | Guam, Thailand, America, Korea, Malaysia for a week
Experiences of studying in foreign countries | Study in Canada, America, Australia for  months to  months:  | Study in Canada, America, Australia for  months to  months: 
Study time for listening a week |  min. (.) |  min. (.)
Study time for reading a week |  min. (.) |  min. (.)
Study time for writing a week |  min. (.) |  min. (.)
Study time for speaking a week |  min. (.) |  min. (.)
Knowledge of English pronunciation | Yes: , No:  | Yes: , No: 
Learn how to pronounce the English alphabet | Yes: , No:  | Yes: , No: 
Corrected by someone | Yes: , No:  | Yes: , No: 
English qualification | TOEIC: . (.) | TOEIC:  (.)
Note: Numerals describe the number of participants. The number in parentheses is SD (standard deviation).
Their TOEIC listening and reading scores ranged from 555 to 915, indicating
Common European Framework of Reference (CEFR) levels of B1 to B2
(TOEIC Official HP), so their English levels were intermediate or close to
advanced. They were divided into two groups, an experimental group (EX) and a
control group (CO), based on their demographic variables and their TOEIC scores:
the EX group: n = 10 (females = 9, male = 1), average age = 21, and average
TOEIC score = 731; the CO group: n = 10 (females = 7, males = 3), average age = 21,
and average TOEIC score = 772. The Kruskal-Wallis test was conducted to ensure
that both groups were at the same level of English proficiency (p > .427), and it
indicated no significant difference between the groups. Most participants had traveled
to several countries, such as Hong Kong or Thailand, for a short time, from three days
to one week. They were exposed to English on a daily basis because they were
taking several English classes every day, such as English Communication, Reading,
Writing, or Discussion courses. Outside the curriculum, the university provided a
facility called the Global Plaza to encourage students to communicate with foreign
teachers freely. They were asked about their total time of contact with English per
week, including listening, reading, writing, and speaking. The Kruskal-Wallis test
showed that differences in the duration of their English study were not significant
(listening: p = .967 > .05; reading: p = .539 > .05; writing: p = .902 > .05; speaking:
p = .427 > .05). Both groups were thus considered to have similar backgrounds for
English proficiency and experience.
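For readers who wish to see how such a two-group comparison works, the following is a minimal sketch of the Kruskal-Wallis H test for two independent samples, written with the Python standard library only. It is not the procedure or software used in the study: the function name `kruskal_wallis_two_groups` is hypothetical, the scores are invented for illustration, and the p-value relies on the chi-square approximation with one degree of freedom.

```python
import math
from collections import Counter


def kruskal_wallis_two_groups(a, b):
    """Kruskal-Wallis H test for two independent samples.

    Returns (H, p). H is tie-corrected; p uses the chi-square approximation
    with df = 1, for which the upper tail equals erfc(sqrt(H / 2)).
    """
    pooled = sorted(list(a) + list(b))
    n = len(pooled)

    # Give each distinct value the mean of the ranks it occupies (ties share a rank).
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    h = 12 / (n * (n + 1)) * sum(
        sum(rank[x] for x in group) ** 2 / len(group) for group in (a, b)
    ) - 3 * (n + 1)

    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    correction = 1 - sum(t**3 - t for t in Counter(pooled).values()) / (n**3 - n)
    if correction == 0:  # all observations identical: no evidence of a difference
        return 0.0, 1.0
    h = max(h / correction, 0.0)  # guard against tiny negative rounding errors
    return h, math.erfc(math.sqrt(h / 2))


# HYPOTHETICAL test scores, invented for illustration (not the participants' data).
ex_scores = [620, 655, 690, 705, 730, 745, 760, 790, 805, 860]
co_scores = [600, 680, 715, 740, 775, 795, 810, 830, 870, 905]

h, p = kruskal_wallis_two_groups(ex_scores, co_scores)
print(f"H = {h:.3f}, p = {p:.3f}")
```

A p-value above .05 means only that no difference was detected; with ten participants per group, it does not by itself establish that the groups are equivalent.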
Seven and eight students in the EX and the CO groups, respectively, had taken
an English phonetics course, knew how to pronounce the names of the letters of
the alphabet, and had also received feedback on their pronunciation of the
names of the letters from their English teachers. Therefore, the participants
were considered comparable with respect to the factors that Nation and
Newton (2009) suggested would influence pronunciation, namely proficiency
levels, English experiences and attitudes, and conditions for teaching and
learning. This research was approved by the ethical board of the university
where the first author works, and all participants consented to participating
in the experiment.
3.2 Materials
3.2.1 Pre- and post-test
For the pre- and post-tests, both groups were asked to record videos of themselves
pronouncing the names of the letters of the alphabet from A to Z by using their
cellphones.
3.2.2 Training sessions
The EX group and the CO group used two different platforms (see Appendix A).
Both platforms showed the native speaker’s video clip from three angles
(the front, the side, and a close-up of the mouth from the front), in which he
gave an explanation of how to pronounce the sound (e.g., “B, B, try to pronounce
B by breathing out air on your hand”). The alphabet letter and the corresponding
phonetic symbol were displayed (e.g., B-b /biː/). On the EX platform, a
self-video was shown next to the native speaker’s video. The learner could see his
or her own face while pronouncing the letter name and, at the same time, watch
the native speaker’s articulation on the video and listen to his pronunciation. The
learner then tried to mimic the native speaker’s pronunciation.
The learners pressed each alphabet letter from A to Z once and then pressed
Review to review the material from A to Z again, without stopping.
3.2.3 Questionnaires
After both groups had completed their ICT training, the participants were asked
to fill in a paper-and-pencil questionnaire. The questions were as follows:
Q1: Was this PC program useful? Q2: Were the native speaker’s videos helpful?
Q3: Were the explanations of the pronunciation by the native speaker helpful?
Q4: Was your self-video helpful? Q5: Was your self-voice recording helpful? Q6:
Was the IPA (International Phonetic Alphabet) useful? Q7: Was this PC program
easy to use? Q8: Which of the contents were most useful? Choose the three items
that were the most helpful in improving your pronunciation and rank them (both
groups had the following options: ‘Native speaker’s video’, ‘Explanations on
pronunciation’, ‘Self voice pronouncing’, and ‘the IPA’; only the EX group had ‘Your
self-video’ as an additional option).
3.3 Procedure
All participants answered the paper-based questionnaires about their personal
backgrounds, their English learning experiences, the time when they started to
study English, their experience studying abroad, their regular studying hours,
and their TOEIC scores. These were used to divide the participants equally
between the CO and the EX groups.
As a pre-test, both CO and EX participants used their own cellphones to
record their pronunciation of the alphabet from A to Z at a natural speed,
pausing for one second between the letters. Each cellphone had a
high-quality camera and sound recording system, so the participants
recorded themselves pronouncing the alphabet using the video app on
their own phone and then sent the video clip to the author via e-mail or the
LINE app. After that, while sitting at a PC, the members of the two groups
studied their specific materials on the ICT site individually for about 30 minutes.
After the ICT training, the CO and EX participants all recorded their
pronunciations of the alphabet as a post-test. Finally, they answered the questionnaire
about the usefulness of and their satisfaction with the ICT material.
3.4 Analysis and assessment of pronunciation
In order to allow the analyses of the sounds of the names of the letters of the
alphabet, two male native speakers of American English were asked to
pronounce the alphabet so that their productions could be compared with the
Japanese speakers’ pronunciation. The productions of the Japanese and American
speakers were digitally recorded and saved in WAV file format on a computer.
These speech materials were listened to by the two authors, who were trained to
transcribe the sounds in the IPA, and the two authors’ inter-rater reliability was
shown to be high by Cronbach’s coefficient alpha, which was .865.
Additionally, we examined the sounds using Praat.
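As an illustration of the reliability index mentioned above, the sketch below computes Cronbach’s coefficient alpha with each rater treated as one “item” and each recorded token as one case. The function name `cronbach_alpha` and the two short rating lists are hypothetical; they are not the authors’ judgements and are far smaller than the actual data set.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of raters' score lists, aligned by token.

    alpha = k / (k - 1) * (1 - sum of per-rater variances / variance of totals),
    where k is the number of raters and variances are sample variances.
    """
    k = len(ratings)
    totals = [sum(scores) for scores in zip(*ratings)]  # per-token summed score
    rater_variance_sum = sum(variance(r) for r in ratings)
    return k / (k - 1) * (1 - rater_variance_sum / variance(totals))


# HYPOTHETICAL judgements for six tokens (1 = judged correct, 0 = judged incorrect).
rater_a = [1, 0, 1, 1, 0, 1]
rater_b = [1, 0, 1, 0, 0, 1]

alpha = cronbach_alpha([rater_a, rater_b])
print(round(alpha, 3))
```

With only two raters, alpha rises as the raters’ judgements covary; identical ratings give an alpha of exactly 1.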
4 Results
4.1 The effectiveness of the ICT training
The results of the pre- and post-tests for both the EX Group and the CO Group were
transcribed using the IPA (see Appendix B). The total number of samples was 1040
(26 letters of the alphabet for 20 participants for both pre- and post-tests). To examine
the differences between the pre- and post-tests of the EX and CO groups, a two-way
repeated-measures ANOVA was conducted (see Table 2) using SPSS 25. Statistically
significant differences between the pre- and post-tests were found following the ICT
training for both conditions: the EX group (with self-videos) and the CO group
(without self-videos) [F(1, 18) = 14.96, p < .001]. The effect size was 0.454, which
means it was very large. In terms of the differences between the EX group and the CO group,
the results showed no significant differences [F(1, 18) = .316, p = .581 > .05], and the
effect size was 0.017, indicating it was small. The interaction between the groups
and the tests was not statistically significant [F(1, 18) = 2.10, p = .164 > .05], but the
effect size was medium (0.105). These results show that the ICT training was
effective in improving the participants’ pronunciation of the alphabet, as shown by the
improvements between the pre- and post-tests. However, the results of the EX group
with its self-video training and the CO group without self-video training were not
statistically different.
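The effect sizes reported for the two main effects can be recovered from the F ratios and degrees of freedom, assuming that the reported index is partial eta squared (an assumption suggested by the η column of Table 2, not stated in the text). The helper name `partial_eta_squared` is hypothetical.

```python
def partial_eta_squared(f_ratio, df_effect, df_error):
    """Partial eta squared from an F ratio: F*df1 / (F*df1 + df2)."""
    return f_ratio * df_effect / (f_ratio * df_effect + df_error)


# F ratios and degrees of freedom as reported for the two main effects.
pre_post = partial_eta_squared(14.96, 1, 18)  # pre- vs. post-test
group = partial_eta_squared(0.316, 1, 18)     # EX vs. CO group

print(round(pre_post, 3))  # 0.454
print(round(group, 3))     # 0.017
```

Both values agree with the effect sizes given for these two effects, which supports reading the reported index as partial eta squared.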
Considering the results shown in Appendix B, which shows the participants’
productions, there were differences in the difficulty experienced by the EX and CO
groups. Based on the percentages obtained in the pre- and post-tests for both
groups, the alphabet sounds fell into four categories of correctness: 100%–80%
(B, D, E, I, M, Q, S, U), 80%–50% (A, C, F, K, N, O, T), 50%–30% (L, P, X), and
30%–0% (R, V, W, Y, Z). In quantitative research, the mean scores are examined
to compare one condition with another. As Table 2 shows, the difference between
pre- and post-tests was statistically significant for both groups, though the
differences between ICT with and without a self-video were not significant. That means
the self-learning system, regardless of whether it includes a self-video or not,
could prove to be helpful in improving English alphabet pronunciation. However,
some of the names of the letters of the alphabet were quite well pronounced even
before the ICT training, such as B, D, E, I, M, Q, S, and U. The pronunciations of
the names of other letters were found to be more difficult, and there were both
similarities and differences in the improvement of the EX and CO groups (see Appendix B).
Regarding improvement in the individual sounds, we investigated whether
a learner would improve more on a particular alphabet letter when using the
ICT material with a self-video, as the EX group did, or using the ICT material
without a self-video, as the CO group did. In the following section, we will ex-
amine which letters of the alphabet improved most in each of the groups.
Table 2: Means, Standard Deviation, and Two-Way ANOVA Statistics for Pre- and Post-tests
between EX and CO Groups.
Variable EX Group CO Group ANOVA
M SD M SD Effect F ratio df η
Pre-test . . . . G . . .
Post-test . . . . Pre&Post .✶✶✶ . .
G×Pre&Post . . .
Note: N = 10. ANOVA = analysis of variance, G = group, Pre&Post = pre- and post-tests.
✶✶✶ p < .001
4.2 Improvements in the EX and CO groups
Appendix B shows three categories of improvement. The sounds are represented in
either broad or narrow transcription. Most sounds are given in broad transcription,
marked by “/ /” (e.g., /k/), whereas narrow transcription, marked by “[ ]”
(e.g., [kʰ]), is used where phonetic details such as aspiration need to be shown.
Regardless of which ICT training the learners undertook, some alphabet letters
achieved high accuracy, 90% or 100%, for both the EX and the CO groups in both the
pre- and post-tests (D, E, I, M, Q, U). In the second category (which comprises
A, B, C, F, G, H, K, L, N, O, P, R, S, T, V, X, Z), some errors appeared in the
pre-test, and after learning the alphabet with the ICT training, with or without
the self-video, the error rates decreased. In the third category, some letters,
such as J, W, and Y, were problematic, and there was little improvement in either
the EX or the CO group.
Tables 3 and 4 show how the learners improved their pronunciation in each group.
The number of improvements indicates how many learners improved on each alphabet
letter. For example, two students did not pronounce A accurately in the pre-test,
but in the post-test the same two learners pronounced it correctly. The number ‘2’
thus indicates personal development, whether through the EX or the CO treatment.
The other criterion, called the improvement rate, shows the percentage of learners
who improved from the pre-test to the post-test. It is calculated as follows: if
four learners cannot pronounce the name of a letter correctly in the pre-test and
two of them pronounce it correctly after the training, the improvement rate is 50%
(2/4 × 100 = 50). This makes it easy to compare the improvement of the EX and CO
groups.
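The improvement rate described above can be written as a small function. This is a minimal sketch: the function and argument names are illustrative, and treating a letter with no pre-test errors as undefined is an assumption meant to mirror the ‘―’ entries in Tables 3 and 4.

```python
def improvement_rate(errors_pre: int, improved: int) -> float:
    """Percentage of learners with a pre-test error who were correct in the post-test."""
    if errors_pre <= 0:
        # Assumption: with no pre-test errors there is nothing to improve.
        raise ValueError("no pre-test errors, so the rate is undefined")
    if not 0 <= improved <= errors_pre:
        raise ValueError("improved must be between 0 and errors_pre")
    return improved / errors_pre * 100


# Worked example from the text: four learners with errors, two of whom improved.
print(improvement_rate(4, 2))
```

The printed value for the worked example is 50.0, matching the 50% given in the text.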
Table 3: Number of Improvements in the EX Group.
Alphabet  Errors  Improvements  Number of errors  Number of improvements  Improvement rates (%)
A /eɪ/ eː eɪ   
B /biː/ ― ― ― ― ―
C /siː/ ɕiː siː   
F /ɛf/ eɸ ɛf   
G /ʤiː/ ʥiː ʤiː   
H /eɪʧ/ eiʨ eɪʧ   
J /ʤeɪ/ ʥei ―   
K [kʰeɪ] keː [kʰeɪ]   

L /ɛl/ eɯ ɛl   
N /ɛn/ enɯ ɛn   
O /oʊ/ oː oʊ   
P [pʰiː] [piː] [pʰiː]   
R /ɑːr/ aːɾ ɑːr   
S /ɛs/ eɵ ɛs   
T [tʰiː] tiː [tʰiː]   
V /viː/ vɯi viː   
W /dʌbḷjuː/ dabɯɾjɯ ―   
X /ɛks/ ekks ―   
Y /waɪ/ ɰaɪ ―   
Z /ziː/ dziː ziː   
Mean . . .

SD . . .
Note: Phonemic transcription is given in “/ /” and phonetic transcription in “[ ]”.
‘―’ indicates no items, and ‘―’ in the column for improvement rates indicates that
the number of errors has increased. Mean indicates the average over the 19 letters
of the alphabet, and SD indicates the standard deviation of the Mean.
Table 4: Number of Improvements in the CO Group.
Alphabet  Errors  Improvements  Number of errors  Number of improvements  Improvement rates (%)
A /eɪ/ eː eɪ   
B /biː/ viː biː   
C /siː/ ɕiː ―   
F /ɛf/ ɛf ɛf   
G /ʤiː/ ʥiː ʤiː   
H /eɪʧ/ eiʨ eɪʧ   
J /ʤeɪ/ ʥei ʤeɪ   
K [kʰeɪ] keː [kʰeɪ]   
L /ɛl/ eɯ el   
N /ɛn/ ― ― ― ― ―
O /oʊ/ oː oʊ   
P [pʰiː] [piː] [pʰiː]   
R /ɑɚ/ aːɯ ɑɚ   
S /ɛs/ eɵ ɛs   
T [tʰiː] tiː tʰiː   

V /viː/ bi viː   
W /dʌbḷjuː/ dabɯɾjɯ ―   
X /ɛks/ ekks ɛks   
Y /waɪ/ ɰaɪ waɪ   
Z /ziː/ dziː ziː   
Mean . . .

SD . . .
Note: Phonemic transcription is given in “/ /” and phonetic transcription in “[ ]”.
‘―’ indicates no items, and ‘―’ in the column for improvement rates indicates that
the number of errors has increased. Mean indicates the average over the 19 letters
of the alphabet, and SD indicates the standard deviation of the Mean.
4.3 Vowel improvements in the EX and CO groups
Some alphabet letters had high scores, that is, a high level of correct pronunciation,
even before the ICT training. These included B, D, E, I, M, Q, S, and U, which contain
the following vowels: /iː/ in B, D, E; /aɪ/ in I; /ɛ/ in M and S; and /uː/ in Q and U.
For these vowels, the Japanese names of the letters sound similar to the English ones.
For example, Japanese speakers pronounce B as /biː/, D as /diː/, E as /iː/, and I as /ai/.
The other letters containing /iː/, namely C, G, P, T, V, and Z, were not pronounced
correctly. The tense vowel /iː/ tends to be replaced by the Japanese /iː/, but because
the two sounds are quite similar, the pronunciation problems in those letters were due
not to the vowel but to the consonants they contain. The English diphthong /eɪ/, by
contrast, is produced differently by Japanese learners: the Japanese A, J, and K are
pronounced /eː/, /ʥeː/, and /keː/, respectively. In the pre-test, two learners in each
of the EX and CO groups did not pronounce this diphthong correctly, but in the
post-test they pronounced /eɪ/ correctly. J, however, was not pronounced appropriately:
the vowel /eɪ/ was accurate, but the consonant was not. This problem will be discussed
in the next section.
The diphthong /aɪ/ was less difficult to pronounce. The letter I was pronounced
perfectly in both the pre- and post-tests by both the EX and the CO groups. Although
the /aɪ/ in Y was pronounced correctly, the consonant /w/ was not. The consonants
/ʤ/, /w/, and /k/ will be discussed in the following section.
The diphthong /oʊ/ showed slightly different improvement rates: the improvement rate
in the EX group was 100% (5 out of 5), whereas the CO group showed an improvement rate
of 50% (2 out of 4). The participants in the EX group may have learned its production
after watching themselves pronounce /oʊ/, which is more rounded than the Japanese /oː/,
in their self-videos.
The short vowel /ɛ/ is found in the letters S and X. The former was pronounced
correctly in the pre- and post-tests by both the EX and the CO groups. The /ɛ/ in X
was pronounced correctly, but the consonant /k/ was influenced by the Japanese
geminate /kk/, as seen in some productions of [ekks].
The most difficult letter was R /ɑɚ/, which showed a lot of variation in the sounds
used, such as [aːɾ], [aːɾɯ], [a˞ː], and [aː]. The Japanese sound inventory has neither
/r/ nor /l/, so Japanese learners tend to assimilate both to the Japanese /ɾ/ and
substitute it for them. Additionally, the English [ɑ] is an open back unrounded vowel,
whereas its Japanese counterpart /a/ is an open front unrounded vowel. The English /r/
is produced with the tongue tip curling slightly upward toward the rear part of the
alveolar ridge (Carley 2020; Carley and Mees 2020). The Japanese participants tried to
make the /r/ sound by curling up the tip of the tongue, but the result was /aːɹ/, the
Japanese /aː/ followed by the consonant /ɹ/ rather than the rhotic vowel /ɚ/, and thus
different from the English diphthong /ɑɚ/. Generally speaking, the vowel pronunciation
of the names of the letters, except for /ɑɚ/, was fairly easy to understand for each
letter. Some errors were reduced following the ICT training in both the EX and CO groups.
4.4 Consonants and the EX and CO groups
A variety of consonant sounds can be found in the names of the letters of the
alphabet: B /b/, C /s/, D /d/, F /f/, G and J /ʤ/, H /ʧ/, K /k/, L /l/, M /m/, N /n/,
P /p/, Q and U /j/, S /s/, T /t/, V /v/, Y /w/, and Z /z/. We will focus on the
problematic consonants: the stops, the fricatives, the affricates, and the
approximants, followed by the three-syllable alphabet letter W.
4.4.1 Stops
There are several contrasts between voiceless and voiced stops, such as P /p/ and
B /b/, and T /t/ and D /d/. Additionally, a voiceless stop should be aspirated at the
beginning of a word, and whether the stop consonants are aspirated or not is a crucial
aspect of the acquisition of L2 English. Therefore, the VOT (voice onset time) was
measured in Praat to check whether the aspiration in voiceless stops was long enough.
According to Lisker and Abramson (1964), voiceless unaspirated stops have VOT values
from 0 to 25 ms, while voiceless aspirated stops have VOT values ranging from 60 to
100 ms. The stop consonants /p/, /t/, /k/ tend to be difficult for Japanese L2 learners
because /p/, /t/, /k/ in the Japanese inventory are not aspirated at the beginning of
a Japanese word. Therefore, the participants were expected to pronounce /p/, /t/, /k/
in English without any aspiration, even at the beginning of a word. Table 5 shows that
the VOT is shorter in the Japanese counterparts of the English consonants. Of these
consonants as pronounced by the Japanese participants, /k/ showed relatively good
performance, whereas /p/ and /t/ were especially difficult. As for /p/, seven
participants in the EX group and four in the CO group did not pronounce it correctly
in the pre-test (see Tables 3 and 4), but in the post-test, after the training, 71%
(5 of 7) in the EX group and 50% (2 of 4) in the CO group improved. The VOT of /p/ in
the EX group lengthened from 24 ms in the pre-test to 64 ms in the post-test (40 ms
longer), and that in the CO group also lengthened, from 46 ms to 78 ms (32 ms longer).
The alveolar /t/ showed the most similar results in the EX and CO groups: 50% (3 of 6)
in the EX group improved, and 50% (1 of 2) in the CO group also improved. The VOTs
also became longer (49 ms → 74 ms in the EX group and 58 ms → 84 ms in the CO group).
From this perspective, the training, both with and without a self-video, can
contribute to improvements in the aspiration of the stop consonants based on the VOTs,
but only for /p/ did the improvement rates suggest that the ICT with a self-video had
a slight advantage.
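The VOT comparison above can be illustrated with a short sketch that labels a measured VOT using the two ranges cited from Lisker and Abramson (1964) and reports the change from pre-test to post-test. The function name and the label for values between the two ranges are assumptions made for illustration; the VOT values are the group means for /p/ reported in the text.

```python
def classify_vot(vot_ms: float) -> str:
    """Label a voiceless stop by VOT, using the ranges cited in the text."""
    if 0 <= vot_ms <= 25:
        return "unaspirated"
    if 60 <= vot_ms <= 100:
        return "aspirated"
    # Assumption: values outside both cited ranges get a neutral label.
    return "outside cited ranges"


# Mean VOT (ms) for /p/ as reported in the text: (pre-test, post-test).
p_vot = {"EX": (24, 64), "CO": (46, 78)}

for group, (pre, post) in p_vot.items():
    print(group, classify_vot(pre), "->", classify_vot(post), f"(+{post - pre} ms)")
```

With these means, /p/ in the EX group moves from the unaspirated range into the aspirated range, while /p/ in the CO group starts between the two ranges and also ends in the aspirated range.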
The voiced stop consonants /b/, /d/ were pronounced relatively well. How-
ever, one instance of /b/ was pronounced with the upper teeth touching the
lower lip but with no friction: the place of articulation was the same as /v/, but
the manner was the same as that of a stop sound. Additionally, in the produc-
tion of /d/, the blade of the tongue reached a wide area of the alveolar ridge,
which is a similar pronunciation to that in Japanese. Further analyses are still
needed to clarify this.
Table 5: VOT (Voice Onset Time) in EX group, CO group, and English native speakers.
Ex Co Native
Pre Post Pre Post
M SD M SD M SD M SD M SD
K /keɪ/ . . . . . . . . . .
P /piː/ . . . . . . . . . .
T /tiː/ . . . . . . . . . .
4.4.2 Fricatives
Fricatives are difficult to pronounce correctly. For /s/ in C, the improvement rate
was 75% (3 out of 4) in the EX group, but no student improved in the CO group. The
participants used the Japanese /ɕ/ instead of /s/ for C, but after the training, the
EX group used /s/ correctly. The participants found /z/ in the letter Z to be one of
the most challenging consonants. Japanese speakers do not distinguish between /z/,
/dz/, and /dʑ/, and the participants pronounced the name of the letter Z with the
Japanese /dz/ or /ʥ/ in both the pre- and post-tests. The improvement rates were 22%
(2 out of 7) in the EX group and 40% (2 out of 5) in the CO group. The /f/ in F
improved considerably, from the Japanese /ɸ/ in the pre-test to the English /f/ in the
post-test, with an improvement rate of 75% (3 out of 4) in the EX group and 100%
(3 of 3) in the CO group. The /v/ in V was difficult to improve, with an improvement
rate of 37.5% (3 out of 8) in the EX group and 25% (2 out of 8) in the CO group.
For the fricatives mentioned above, the articulation in English is very different
from that in Japanese, so we assume that the participants in the EX group watched the
native speaker’s mouth moving and tried to mimic the model sounds. Nevertheless, the
pre- and post-tests revealed that /z/ and /v/ in particular were still pronounced
incorrectly despite the training. Even so, the information from the self-videos and
the real-time sound of one’s own voice seems to have been effective to some extent for
the EX group, because its average improvement rate for the four consonants was 58%,
whereas it was 41% in the CO group.
4.4.3 Affricates
The affricates /ʧ/, in H, and /ʤ/, in G and J, were also problematic. The partic-
ipants in both groups pronounced the Japanese sounds [ʨ] for /ʧ/ and [ʥ], [ʣ]
for /ʤ/. Surprisingly, no one in either group pronounced them correctly in the
pre-test. After the training, the improvement rate for /ʧ/ in H was 30% in the EX
group and 10% in the CO group. Likewise, for /ʤ/ in G and J, the improvement
rate was 30% in the EX group and 10% in the CO group. Therefore, the EX
group was able to use the articulation information more and mimic the sounds
better than the CO group.
4.4.4 Approximants
The approximants /l/ in L and /w/ in Y were also very challenging phonemes. L should
be pronounced as [ɛɫ], but only three participants, all in the CO group, could
pronounce the dark /l/; the other participants could not pronounce it at all.
However, the participants were regarded as having proper pronunciation if they
pronounced /l/ and not the Japanese /ɾ/.
The /w/ in Y was also very difficult. Japanese does not have a /w/ sound, and
Japanese speakers usually substitute the Japanese /ɰ/ for it. The participants in the
EX group tried to round their lips, but not enough, and the sound did not improve
at all.
4.4.5 Three-syllable word W
W [dʌbɫjuː] is the only letter name with three syllables. The Japanese W is pronounced
as [dabɯɾjɯ], and this pronunciation appeared three times in the pre-test in both the
EX and CO groups. The post-test for the EX group was even worse, as W was pronounced
incorrectly seven times. No progress was shown for either group. The learners might
not know that W is a three-syllable word; besides, the dark /l/ in W [dʌbɫjuː] and the
consonant cluster [bɫ] were difficult for the Japanese learners.
4.5 Satisfaction with and usefulness of ICT materials
Table 6 shows the results of the questionnaire asking what the participants in
both groups thought about the ICT training program. Questions 1 through 8
were the same for both groups, with the exception of Q4, because the ICT train-
ing program for the CO group did not include a self-video. Regarding Q8, the
participants identified which content they thought was useful, choosing three
of the alternatives, but for the CO group, the item ‘your self-video’ was not in-
cluded. Although one participant in the EX group gave a slightly lower score
(the average score was 2.57 out of 5 for eight items) than the others, the other
participants from both groups gave positive answers for all the items. In gen-
eral, they found the ICT program useful, and the native speaker’s videos and
explanations were considered to be very helpful. The EX group also found the
self-video useful. The self-pronouncing voice was more beneficial for the CO
group than for the EX group. The IPA was less helpful than the other items. The
most useful items of the content were the explanations on how to pronounce,
the native speaker’s video, and, for the EX group, the self-video. The CO group
listed the native speaker’s videos, the explanation on how to pronounce, and
the self-pronouncing voice as most helpful.
Table 7 shows all suggestions made by the participants about improving the ICT
materials. Several comments suggested that the video clips of the alphabet were a
little long and that if the video clips could play automatically, without clicking,
the time required could be shortened. One participant wanted to hear a female voice
as well as a male voice.
Table 8 shows all other comments from the participants. For both groups,
some participants were surprised that they did not know how to pronounce the
alphabet, and they found the W to be especially difficult. The explanation of
the pronunciation by the native speaker was regarded as very easy to under-
stand and useful. For the EX group, the self-video was very useful for compari-
son with the native video clips, while the CO group realized how important it
was to watch the native speaker’s mouth carefully and mimic the sounds. Their
comments indicated that their awareness of pronunciation would increase fol-
lowing the ICT training.
Table 6: Satisfaction and usefulness of ICT materials in both the EX and the CO groups.
EX Group CO Group
M SD M SD
Q1 Was this PC Program useful? . . . .
Q2 Were the native speaker’s videos helpful? . . . .
Q3 Was the explanation of the ways of pronunciations by the native speaker helpful? . . . .
Q4 Was your self-video helpful? . . ― ―
Q5 Was your self-pronouncing voice helpful? . . . .
Q6 Was the IPA useful? . . . .
Q7 Was this PC Program easy to use? . . . .
Q8 Which contents were useful? Choose three and rank them.
Native speaker’s video No No
The explanation of the ways of pronunciation No No
Your self-video No ―
Your self-pronouncing voice No
The IPA
Table 7: Suggestions to improve the ICT program (Q9).
EX Group
– I want to listen to a female’s voice as well as a male’s voice.
– It would be better if my pronunciation could be judged with a score.
– It would be better if learners could turn the microphone on and off more easily.
– Pauses between video clips would allow for more practice.
– If I could hear my own voice clearly and naturally, it would be better.
CO Group
– Every time I have to click to watch a video clip, it might be better if a video clip
could be played back.
– The training was a bit too long.
– One set of the alphabet was a bit too long.
– It might be more effective if the explanation were shorter.
– It might be useful to learn not only the alphabet but also some vocabulary.
Table 8: Other comments (Q 10).
EX Group
– I was shocked by how unaware I was regarding the openness of the mouth and the
placement of the tongue.
– I was able to understand how to pronounce English sounds accurately when I watched
the movement of the mouth and the tongue of the native speaker from the front and the
side angles.
– I did not understand all of the phonetic symbols, so it would be difficult to base
my pronunciation exclusively on the reading of phonetic symbols.
– The explanations of the native speaker were easy to understand, such as “the
movement of your mouth when you eat a very sour pickled plum [umeboshi].”
– This ICT program was very straightforward and useful. I could watch videos and
listen to the sounds. I want to continue using it several times a day.
– “W” was more difficult than I expected. Hearing my own voice and comparing it with
the native speaker’s voice was very useful.
– I was able to compare my video with the native speaker’s video simultaneously, and
I tried very carefully to mimic the native speaker’s articulation. Then I found my
pronunciation was getting better.
CO Group
– I felt that making sounds that do not exist in Japanese was challenging.
– I was able to learn how to pronounce English sounds, and this would be helpful in
conversations in English.
– I have not practiced the pronunciation of the alphabet since I was an elementary
student. At that time, I watched the native English teacher’s mouth carefully and
tried to mimic his articulation. This helped me improve my pronunciation.
– I found that the English sounds could be improved if I became conscious of my
pronunciation.
– I think watching the native speaker’s mouth very carefully is the key to improving
pronunciation. Watching and mimicking – more than listening to the native speaker’s
sounds – encourage me to practice pronunciation.
– The native speaker’s explanation was straightforward. This ICT program is very
effective for practicing pronunciation. I wish I could have learned with this program
when I was in elementary school.
– I was surprised I had not watched the IPA presented on the screen. I also found out
how often my pronunciation was inaccurate. I gained a better understanding by watching
the native speaker’s video, and I was able to improve my pronunciation greatly by
opening my mouth wider.
– It was very easy to understand how to pronounce English alphabet because I was able
to watch how to pronounce it from different angles.
5 Discussion
This paper aimed to investigate whether an ICT training system could help Japanese
learners of English improve their pronunciation of the names of the letters of the
English alphabet. The names of the letters contain about half of all English phonemes,
and they are introduced at the beginning of the learning process, so learners have
known them for a long time. If some of the phonemes still cannot be pronounced
correctly, they are regarded as having fossilized.
Based on the results of the pre- and post-tests, both the EX group, who learned with
the ICT and a self-video, and the CO group, who learned without the self-video,
improved their pronunciation. These results indicate that the answer to Research
Question 1 is positive, and the participants felt the ICT was useful. As for Research
Question 2, whether the ICT training with or without a self-video would be more
beneficial to participants, no statistically significant difference was found.
However, there were some differences between the two groups. Among the vowels, /oʊ/,
as in O, could not be pronounced correctly by about half of the participants in both
the EX and CO groups in the pre-test. After the training, 100% of these learners in
the EX group (5 of 5) pronounced it correctly, and 50% of those in the CO group
(2 of 4) improved their pronunciation. The self-video appears to have helped the EX
group to understand the roundness of the lips better than the CO group. On the other
hand, the production of R /ɑɚ/, which was challenging, improved to 50% in both the EX
and CO groups. Lambacher et al. (2005) also investigated improvements in the
identification and production of /ɑ/ after a six-week identification training period,
though, according to Oh et al. (2011), native Japanese adults showed lower accuracy
rates in the pronunciation of /ɑ/ during a one-year stay in America. In our study,
the unrounded open back vowel /ɑ/ was replaced by the Japanese unrounded open vowel
/a/, as in Shimizu’s (2016) study: Japanese students tend to substitute Japanese
sounds for American English sounds, which seems to confirm the PAM (Best 1995; Best
and Tyler 2007).
The vowels /iː/ and /ɪ/ are also very challenging for Japanese native speakers because
Japanese contrasts the long and short lengths of /iː/ and /i/, not the vowel quality
(Heidlmayr, Ferragne, and Isel 2021). Heidlmayr, Ferragne, and Isel investigated
Japanese adults’ hearing abilities, testing them twice (one week after they started
living in Canada and one year later), but their pronunciation did not become similar
to that of native English speakers. Shimizu (2016) mentions that the F1 and F2
formants of English /iː, ɪ, ɛ, æ/, as pronounced by five male university students,
were similar to those of a native male English speaker, but the formants of six female
university students were still different from those of a native female English
speaker. We could conclude that the /iː/ in the productions of B, C, D, E, etc. by
Japanese speakers was similar to that of English speakers, based on the perceptions of
the authors and the analysis of the F1 and F2 formants. However, we need further
analyses to compare the Japanese versions with their English counterparts. For now, we
may conclude that the EX intervention was able to help participants improve /oʊ/ in O,
and that both types of ICT training could help to improve /ɑɚ/ to some extent.
Regarding the consonants, the voiceless stops /p, t, k/ were improved by both types of
ICT training, though the self-video in the EX group seems to have been scarcely
advantageous for highlighting aspiration in /t/. The fricatives /s, z, f, v/ were also
difficult phonemes because these sounds are similar to the Japanese phonemes
/ɕ, ʣ, ʥ, ɸ, b/, respectively, so they were easily substituted. This can be explained
by the SLM (Flege 1995). After the training, /s/ and /f/ improved, but /z/ and /v/ did
not. The EX group benefited more than the CO group in the production of the four
fricatives. The most difficult were the affricates /ʧ/ and /ʤ/, which should be
articulated with rounded lips and intense friction. For this kind of articulatory
information, the ICT with a self-video can be extremely useful for learners.
The approximants /l/ and /w/ were very challenging phonemes, especially
the production of the dark /l/, because the learners do not tend to learn the dis-
tinction between the two kinds of /l/, the light /l/ and the dark /l/ at school, so
they do not know where the tongue should be positioned. /w/ in Y was substi-
tuted by the Japanese /ɰ/, which is not rounded. The three-syllable word W was
incredibly difficult, and even after the ICT training, no progress was shown.
Regarding all the above-mentioned consonants, Yamada and Adachi, in a
study on perception (1998) and another study (1999) on production, reported
that these consonants were problematic. This was also shown by Joto (2009),
who indicated that these phonemes had low intelligibility ratings. Even the chal-
lenging phonemes /ʧ/ and /ʤ/ showed progress when using the ICT with a self-
video, so we can suggest that for the sounds whose mouth movements are more
clearly visualized, the self-videos are the most powerful aid for learners. First,
they watch the native speaker’s video and understand how to pronounce the
sounds, and then they can compare their own production with the native articu-
lation by using the self-videos in real-time. They can also gain awareness of pro-
nunciation by paying careful attention to the articulation. With regard to the
participants’ ad-hoc questionnaires, learners mentioned that they were satisfied
with both ICT training programs, and they commented that both the native vid-
eos and the explanations were very useful. Regarding the ICT with self-videos,
the students recognized that the self-videos were beneficial for improving their
pronunciation. As Purcell and Suter (1980) insisted, concern for accuracy is one
of the important predictors of improving pronunciation.
The ICT training for both groups improved their pronunciation according to
the pre- and post-tests. Although the results for the EX and CO groups did not
differ in terms of improvement rates or number of improvements, we can con-
clude that the EX group benefitted more than the CO group because participants
in the EX group could see their self-videos and check their mouth movements,
simultaneously comparing them with the native speaker’s.
Regarding pedagogical implications, Nation and Newton (2009) suggested
that teachers could understand the influence of the L1 by becoming familiar with
the sound system of the learners’ first language and thus gaining ideas for creat-
ing the effort and attention needed to bring about the desired changes. As Couper
(2006) mentioned, appropriately focused instruction could lead to changes in
learners’ phonological interlanguage even when this might appear to have be-
come fossilized. We strongly suggest that teachers give instruction regarding pro-
nunciation systematically and regularly and hold that most fossilized phonemes
could be changed.
An additional way to use this ICT system is to apply bottom-up and top-
down training, as the system now provides evaluations on vowels, consonants,
syllables, rhythms, intonations, and individual sounds. Alphabet training in
this research is regarded as a top-down type of training. Learners can practice
at the Alphabet site and learn how to pronounce the alphabet, where each letter
contains a consonant plus a vowel (B /biː/) or just a vowel (A /eɪ/). If they
find that some phonemes are difficult to pronounce or if they do not recognize
how to pronounce them, they can access the Vowels and Consonants sites to
make sure they can make these sounds correctly. Learners can access each site,
moving back and forth, and then increase their practice on their own. In con-
trast, for the bottom-up training, the learners access the Vowels and Conso-
nants sites first to learn the segments and then access the other site to learn to
put into practice a variety of words, phrases, and sentences, such as the alpha-
bet, rhythms, intonations, and sound changes. These features of ICT training
can improve the learners’ self-learning autonomy.
Several developments are needed in this ICT self-learning system to improve the
vowels because there are fewer explanations for the vowels than for the consonants.
Furthermore, the native speaker’s videos cannot show the inside of the mouth and how
the tongue is moving up and down or forward and back. As one participant commented,
“It would be better if I could see the inside of the native speaker’s mouth.”
Pennington and Rogerson-Revell (2019) reviewed the technologies currently available
for teaching pronunciation, focusing on feedback about their usefulness and
limitations. Based on these results, we acknowledge that the ICT alphabet training
site does not provide real-time feedback, so we have been developing a feedback system
in which the utterances in the intonation or rhythm training sites within this
self-learning training system are immediately transcribed into text. If the
effectiveness of the real-time feedback system is proven, we will also add it to the
alphabet training site.
6 Conclusions
Although the alphabet is introduced in the early stages of learning, students
seldom really learn the specific sounds of the names of the letters. In our study,
after practicing for short periods, such as 30 minutes, the participants showed
an improvement in their pronunciation. This training provides insights in help-
ing learners to recognize how to articulate sounds. Noticing and recognizing
how to articulate is essential, as is emphasized by Purcell and Suter (1980). The
participants’ pronunciation of the consonants improved through the ICT self-
learning system, using a self-video.
We also suggest this system can be used by teachers as well as learners. In
particular, elementary school teachers who teach English pronunciation to
young learners will be able to prevent them from developing fossilized sounds
that are affected by their native Japanese.
Note:
URL for the ICT system:
https://npl-mock.glexa.net/intonation
Appendix A
The ICT platform for the EX group
The ICT Platform for the CO group

Appendix B
Results of Pre- and Post-Pronunciation Training for the EX and CO Groups.
EX-Pre EX-Post CO-Pre CO-Post
Alphabet Native Pronunciation Numbers (%) Pronunciation Numbers (%) Pronunciation Numbers (%) Pronunciation Numbers (%)
A eɪ eɪ   eɪ   eɪ   eɪ  
eː   eː   eː   eː  
B biː biː   biː   biː   biː  

viː  
C siː   siː   siː   siː  

ɕiː   ɕiː   ɕiː   ɕiː  
θiː   θiː  
D diː diː   diː   diː   diː  
E iː iː   iː   iː   iː  
F ɛf ɛf   ɛf   ɛf   ɛf  
ɛf✶   ɛf✶   ɛf✶   ɛf✶  
eɸ   eɸ   eɸ   eɸ  
G ʤiː ʤiː   ʤiː   ʤiː   ʤiː  

dʑiː   dʑiː   dʑiː   dʑiː  
dziː   dziː   dziː   ʑiː  
ʑiː   ʒiː   diː   dziː  
ðiː  
zi:  
H eɪʧ eɪʧ   eɪʧ   eɪʧ   eɪʧ  
eɪʨ   eɪʨ   eɪʨ   eɪʨ  
I aɪ aɪ   aɪ   aɪ   aɪ  
J ʤeɪ ʤeɪ   ʤeɪ   ʤeɪ   ʤeɪ  

ʥeɪ   ʥeɪ   ʥeɪ   ʥeɪ  
ʣeɪ   ʒeɪ   ʑeɪ   ʒeɪ  
ʒeɪ   ʣeɪ   ʣeɪ   ʣeɪ  
ʥeː   ʥeː   ðeɪ  
ʣeː  
K keɪ keɪ   keɪ   keɪ   keɪ  

keː   k✶eɪ   keː   keː  
k✶eː   keː  
L ɛɫ/ɛl ɛɫ/ɛl   ɛɫ/ɛl   ɛɫ/ɛl   ɛɫ/ɛl  

erɯ   elɯ   ɛɯ   eo  
eɯ   eo  
elɯ  
M ɛm ɛm   ɛm   ɛm   ɛm  
N ɛn ɛn   ɛn   ɛn   ɛn  

enɯ  
eɴ  
el  
Note: ɛf✶ indicates /f/ with no friction; k✶eː and k✶eɪ indicate that the VOT of /k/ is less than 40 ms.
Improving fossilized English pronunciation by simultaneously
EX-Pre EX-Post CO-Pre CO-Post
Alphabet Native Pronunciation Numbers (%) Pronunciation Numbers (%) Pronunciation Numbers (%) Pronunciation Numbers (%)
O oʊ oʊ   oʊ   oʊ   oʊ  
oː   oː   o:  
P piː piː   piː   piː   piː  

p✶i:   p✶i:   p✶i:   p✶i:  
Q kjuː kjuː   kjuː   kjuː   kjuː  
R ɑɚ ɑɚ   ɑɚ   ɑɚ   ɑɚ  
aːɾ   aːɹɯ   aːɯ   aːɯ  

aːɾɯ   a˞ː✶   aːɹɯ   aːl  
a˞ː✶   aːl   aːɾ   aːɹɯ  
aː   aː   a˞ ː✶  
aʊ  
S ɛs ɛs   ɛs   ɛs   ɛs  
eθ   eθ  
T tiː tiː   tiː   tiː   tiː  

t✶iː   t✶iː   t✶iː   t✶iː  
diː   diː  
U juː juː   juː   juː   juː  
V viː viː   viː   viː   viː  

v✶iː   v✶iː   v✶iː   v✶iː  
vɯi   vɯi   vɯi   vɯi  
bɯi   bɯi  
biː  
W dʌbḷjuː dʌbḷjuː   dʌbḷjuː   dʌbḷjuː   dʌbḷjuː  
dabɯɾjɯ   dabɯɾjɯ   dabɯɾjɯ   dabɯɾjɯ  
dabɯɾɯ   dabjɯ   dabjɯ   dabjɯ  
dabɾjɯː   davɾjɯː   ðavɾjɯː   dabɾjɯ  
davɯjɯː   dablɯː   davɯjɯ   dabɯjɯː  
davjɯː   dabɯju   ðavjɯː  
davɯɾjɯ   dabɯɾɯ  
davɯɾjɯ  
X ɛks ɛks   ɛks   ɛks   ɛks  

ekks   ekks   ekks   ekks  
ekθ   ekθ  
jeks  
Y waɪ waɪ   waɪ   waɪ   waɪ  

ɰai   ɰai   ɰai   ɰai  
Z ziː ziː   ziː   ziː   ziː  

ʣiː   ʣiː   ʣiː   ʣiː  
ʥiː   ʥiː   ʥiː   ð iː  
dzed   zet  
dzet  
Note: p✶iː indicates that the VOT of /p/ is less than 25 ms. a˞ː✶ indicates rhotic /a/. t✶iː indicates that the VOT of /t/ is less than 35 ms. v✶iː indicates
/v/ with no friction, which sounds like /b/. In W, there is a variety of wrong sounds, especially /ð/ and /v/ with no friction.
References
Acton, William. 1984. Changing fossilized pronunciation. TESOL Quarterly 18(1). 71–85.
Best, Catherine T. 1995. A direct realist view of cross-language speech perception. In Winifred
Strange (ed.), Speech Perception and Linguistic Experience: Issues in Cross-language
Research. Timonium: York Press.
Best, Catherine T. & Michael D. Tyler. 2007. Non-native and second-language speech
perception: Commonalities and complementarities. In Ocke-Schwen Bohn & Murray
J. Munro (eds.), Language Experience in Second Language Speech Learning: In honor of
James Emil Flege, 13–34. Amsterdam: John Benjamins.
Carley, Paul & Inger M. Mees. 2020. American English Phonetics and Pronunciation Practice.
New York: Routledge.
Celce-Murcia, Marianne. 2001. Teaching English as a Second or Foreign Language, 3rd ed.
Boston: Heinle & Heinle Publisher.
Couper, Graeme. 2006. The short and long-term effects of pronunciation instruction. Prospect
21(1). 46–66.
Ehri, Linnea C. 2013. Orthographic mapping in the acquisition of sight word reading, spelling
memory, and vocabulary learning. Scientific Studies of Reading 18(1). 5–21.
https://doi.org/10.1080/10888438.2013.819356
Ehri, Linnea C. 2020. The science of learning to read words: A case for systematic phonics
instruction. Reading Research Quarterly 55(S1). S45–S60.
Flege, James E. 1995. Second language speech learning: Theory, findings, and problems. In
Winifred Strange (ed.), Speech Perception and Linguistic Experience: Issues in Cross-language
Research, 233–277. Timonium: York Press.
Gass, Susan M. & Larry Selinker (eds.). 1992. Language Transfer in Language Learning:
Revised edition. Amsterdam: John Benjamins.
Han, ZhaoHong & Terence Odlin (eds.). 2005. Studies of Fossilization in Second Language
Acquisition. Clevedon: Multilingual Matters. https://doi.org/10.21832/9781853598371
Hardison, Debra M. 2010. Visual and auditory input in second-language speech processing.
Language Teaching 43(1). 84–95. https://doi.org/10.1017/S0261444809990176
Hazan, Valerie, Anke Sennema, Andrew Faulkner, Marta Ortega-Llebaria, Midori Iba, &
Hyunsong Chung. 2006. The use of visual cues in the perception of non-native consonant
contrasts. The Journal of the Acoustical Society of America 119(3). 1740–1751.
https://doi.org/10.1121/1.2166611
Heidlmayr, Karin, Emmanuel Ferragne & Frederic Isel. 2021. Neuroplasticity in the
phonological system: The PMN and the N400 as markers for the perception of non-native
phonemic contrasts by late second language learners. Neuropsychologia 156. 107831.
https://doi.org/10.1016/j.neuropsychologia.2021.107831
Jarosz, Anna. 2019. English Pronunciation in L2 Instruction: The case of Secondary School
Learners. Cham: Springer.
Joto, Akiyo. 2009. komyunikeshon noryoku wo koryo shita nihongobogowasha no eigoonsei ni
kansuru hokatsutekikenkyu [A comprehensive study of English sounds produced by
native speakers of Japanese from the perspective of communicative ability]
Kagakukenkyuhi Hojokin Kenkyuseika Hokokusho [Kaken Research Report].
Joto, Akiyo. 2020. Intelligibility and acoustic features of the English fricatives /s/ and /ʃ/
produced by native speakers of Japanese. Nihon Gengo Onsei Gakkai [JALS Japan
Association of Language and Speech] 2. 39–54.
Joto, Akiyo, Misuzu Miyake & Yuri Nishio. 2017. Shogakko eigokatsudo ni shisuru
hatsuonshido manyuaru no sakuseini mukete: Eigohatsuon shido no jittaichosa to
kyokashobunseki wo motoni [Toward the development of a teacher’s manual for teaching
English pronunciation in elementary school English activities: based on a questionnaire
survey of English sound education and an analysis of English textbooks]. JACET Chugoku-
Shikoku Chapter Research Bulletin 14. 143–160.
Kachru, Braj B. 1985. Standards, codification and sociolinguistic realism: The English
language in the Outer Circle. In Randolph Quirk & Henry George Widdowson (eds.),
English in the World: Teaching and Learning the Language and Literature, 11–30.
Kokusaikoryukikin. 1989. Kyoshiyo nihongohandobukku hatsuon kaiteiban [Japanese
handbook for teachers, pronunciation, revised]. Tokyo: Bonjinsha.
Lado, Robert. 1957. Linguistics Across Cultures. Ann Arbor: University of Michigan Press.
Lambacher, Stephen G. 2010. A CALL tool for improving second language acquisition of
English consonants by Japanese learners. Computer Assisted Language Learning 12(2).
137–156. https://doi.org/10.1076/call.12.2.137.5722
Lambacher, Stephen G., William L. Martens, Kazuhiko Kakehi, Chandrajith A. Marasinghe &
Garry Molholt. 2005. The effects of identification training on the identification and
production of American English vowels by native speakers of Japanese. Applied
Psycholinguistics 26(2). 227–247.
Lisker, Leigh & Arthur S. Abramson. 1964. A cross-language study of voicing in initial stops:
Acoustic measurements. Word 20(3). 384–422.
McGurk, Harry & John MacDonald. 1976. Hearing lips and seeing voices. Nature 264. 746–748.
Major, Roy C. 1987. Foreign accent: recent research and theory. International Review of
Applied Linguistics in Language Teaching 25(3). 185–202.
MEXT. 2015. Heisei 26 nendo, shogakko gaikokugokatsudo jisshijokyochosa no kekka no gaiyo
[The survey of elementary English education, 2014]. https://www.mext.go.jp/
a_menu/kokusai/gaikokugo/1362148.htm
MEXT. 2017a. Shin kyoshoku katei koa karikyuramu [New teacher training course core
curriculum]. https://www.mext.go.jp/component/b_menu/shingi/toushin/__icsFiles/
afieldfile/2017/11/27/1398442_1_3.pdf
MEXT. 2017b. Shogakko shin gakushushidoyoryo kaisetsu gaikokugo katsudo gaikokugo hen
[Course of Study guides, Foreign language activities]. https://www.mext.go.jp/content/
20201029-mxt_kyoiku01-100002607_11.pdf
Nation, I. S. P. & Jonathan Newton. 2009. Teaching ESL/EFL Listening and Speaking. New York:
Routledge.
Oh, Grace E., Susan Guion-Anderson, Katsura Aoyama, James E. Flege, Reiko Akahane-Yamada
& Tsuneo Yamada. 2011. A one-year longitudinal study of English and Japanese vowel
production by Japanese adults and children in an English speaking setting. Journal of
Phonetics 39(2). 1–25. https://doi.org/10.1016/j.wocn.2011.01.002.
Pennington, Martha C. & Pamela Rogerson-Revell. 2019. English Pronunciation Teaching and
Research. London: Palgrave Macmillan.
Piasta, Shayne B. & Richard K. Wagner. 2010. Learning letter names and sounds: Effects of
instruction, letter type, and phonological processing skill. Journal of Experimental Child
Psychology 105(4). 324–344. https://doi.org/10.1016/j.jecp.2009.12.008.
Purcell, Edward D. & Richard W. Suter. 1980. Predictors of pronunciation accuracy: a
reexamination. Language Learning 30(2). 271–287.
Riney, Tim & Janet Anderson-Hsieh. 1993. Japanese pronunciation of English. JALT Journal 15(1).
21–36.
Sasaki, Miyuki. 2008. The 150-year history of English language assessment in Japanese
education. Language Testing 25(1). 63–83. https://doi.org/10.1177/0265532207083745.
Selinker, Larry. 1972. Interlanguage. International Review of Applied Linguistics 10. 209–231.
https://doi.org/10.1515/iral.1972.10.1-4.209.
Shimizu, Katsumasa. 2016. Nihonjin gakushusha ni yoru eigoboin no shutoku ni tsuiteno
kosatsu [A study on the acquisition of English vowels by Japanese ESL Learners]. JACET
Chubu Journal 14. 51–62.
Smotrova, Tetyana. 2017. Making pronunciation visible: Gesture in teaching pronunciation.
Stevick, Earl W. 1978. Toward a practical philosophy of pronunciation: Another view. TESOL
Quarterly 12(2). 145–150.
Thompson, Irene. 1991. Foreign accents revisited: The English pronunciation of Russian
immigrants. Language Learning 41(2). 177–204.
Vance, Timothy J. 1987. An Introduction to Japanese Phonology. New York: State University of
New York Press.
Yamada, Tsuneo & Takahiro Adachi. 1998. Eigo risuningu kagakuteki jotatsuho [Scientific
ways to improve your skill of English]. Tokyo: Kodansha.
Yamada, Tsuneo & Takahiro Adachi. 1999. Eigo supikingu kagakutekijotatsuho [Scientific ways
to improve your speaking skill of English]. Tokyo: Kodansha.
Zhang, Runhan & Zhou-min Yuan. 2020. Examining the effects of explicit pronunciation
instruction on the development of L2 pronunciation. Studies in Second Language
Acquisition 42(4). 905–918. https://doi.org/10.1017/s0272263120000121
Natallia Liakina, Denis Liakin
Speech technologies and pronunciation
training: What is the potential for efficient
corrective feedback?
Abstract: In this paper, we will first examine different types of implicit and ex-
plicit corrective feedback (CF) that automatic speech recognition (ASR)-based
applications can provide and discuss their impact on the acquisition of L2 pro-
nunciation in light of SLA findings. Second, we will report the results of our
action research on the use of three different ASR-based tools in two university-
level French pronunciation courses, with specific reference to learners’ percep-
tions of the utility of different types of automatic corrective feedback provided
by these tools. To conclude, we will offer avenues of discussion and practical
suggestions for the effective and sensible integration of ASR-based applications
in the teaching and learning of L2 pronunciation, in and beyond the classroom.
Keywords: speech recognition, pronunciation, second language acquisition, learner autonomy
1 Introduction
Intelligible speech is integral to L2 acquisition and use, and is essential for ef-
fective communication (Arteaga 2000; Levis and McCrocklin 2018; Morin 2007;
Thomson and Derwing 2014). Although students frequently express the need or
desire to improve their pronunciation, teachers often neglect it in favor of the
development of other skills in the traditional language classroom (Isaacs 2009;
Lang et al. 2012; Lebel 2011; Saito 2012) and, at the same time, it is rare for stu-
dents to receive sufficient instruction and feedback on pronunciation from their
teacher due to the lack of time and/or appropriate resources and training (Col-
lins and Muñoz 2016; Cucchiarini and Strik 2013; Morin 2007; Neri, Cucchiarini,
and Strik 2002). Input alone (exposure inside and outside the classroom) is in-
sufficient for pronunciation advancement (Elliott 1995; Flege 1981; Fortune and
Tedick 2015; Han and Odlin 2006; Kennedy 2011; Solon 2016), so learners need
to have extensive opportunities for output during classroom interactions or
Natallia Liakina, McGill University

Denis Liakin, Concordia University
https://doi.org/10.1515/9783110736120-011
individual practice mediated by technology. In addition to that, students are

not able to monitor their own oral production when practicing autonomously
or during learning activities without receiving any corrective feedback (CF);
they need knowledge, strategies, and resources to develop their pronunciation
(McCrocklin 2014).
This paper reflects on the potential of speech technologies for efficient CF
in a university-level French course, with attention to the ways technology
provides students with instantaneous targeted feedback and, with it, novel
opportunities to improve L2 pronunciation in personalized and effective ways (Bajorek
2017; Blake 2013; Liakin, Cardoso, and Liakina 2015).
2 Corrective feedback and L2 pronunciation training
According to numerous SLA studies conducted in traditional classroom and
laboratory settings, pronunciation-based CF is essential for second language
speaking development (Derwing and Rossiter 2003; Kartushina et al. 2016;
Thomson 2011) and can lead to improvement in terms of perception (Lee and
Lyster 2016, 2017) as well as accuracy, comprehensibility, and intelligibility
(Baker and Burri 2016; Derwing, Munro, and Wiebe 1998; Hincks 2003; Lord
2005; Saito and Lyster 2012b).
Corrective feedback can be provided in a variety of formats and can be explicit
or implicit, in addition to being audio- or video-based. It can be immediate
when the CF is part of a communicative activity, or it can be delayed, for example,
when the learners receive their grade, assessment comments, and corrections
some time after submitting an oral assignment. In terms of the information
provided to the learners, researchers distinguish between binary and targeted
feedback. Binary feedback is an overall assessment informing learners in a
general way whether their pronunciation was correct (Chapelle 2001), including a score
or grade, without specifying what went wrong. Targeted feedback, conversely,
draws attention to specific errors, allows the learners to understand the reasons
that motivated the grade or assessment comment, and suggests ways to address
these errors (Crompton and Rodrigues 2001; Gass, Behney, and Plonsky 2013; Wiggins
2012).
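To make the distinction concrete, the following minimal Python sketch contrasts the two kinds of information. It assumes, purely for illustration, that an ASR engine has already returned a text transcript of the learner's utterance. The function names `binary_feedback` and `targeted_feedback`, the word-level comparison, and the 0-100 score are hypothetical choices and do not describe any tool discussed in this chapter.

```python
from difflib import SequenceMatcher


def _words(text: str) -> list[str]:
    """Lower-case and split on whitespace (an assumed, very simple tokenizer)."""
    return text.lower().split()


def binary_feedback(target: str, transcript: str) -> dict:
    """Overall judgement only: a correct/incorrect flag and a 0-100 similarity score."""
    t, h = _words(target), _words(transcript)
    score = round(100 * SequenceMatcher(None, t, h).ratio())
    return {"correct": t == h, "score": score}


def targeted_feedback(target: str, transcript: str) -> list[dict]:
    """List every word-level mismatch so the learner can see what to retry."""
    t, h = _words(target), _words(transcript)
    issues = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, t, h).get_opcodes():
        if tag != "equal":
            issues.append({"expected": " ".join(t[i1:i2]),
                           "heard": " ".join(h[j1:j2])})
    return issues
```

Called on the same utterance, the first function returns only a flag and a score, whereas the second names the word that was not recognized as intended; suggesting how to address the error would still require pedagogical content that this sketch does not attempt to provide.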
As for the ways the feedback is formulated, two main types of oral CF pro-
vided by instructors to correct pronunciation are reformulations of the errone-
ous statement and prompts (Lyster and Ranta 1997; Ranta and Lyster 2007).
A reformulation is when learners are provided with the correct form through
explicit corrections or pronunciation-based recasts that enable learners to vali-

date their knowledge. As for the prompts, they are visual or auditory cues given
to the learners to self-repair without providing the correct form, such as clarifi-
cation requests, repetition of the error, metalinguistic clues, and elicitation.
They can come from an instructor or peers (Kennedy, Blanchet, and Trofimo-
vich 2014) or software (Hincks 2003).
According to the results of recent meta-analyses of a substantial number
of studies on pronunciation-based CF, oral CF is significantly more effective
than no CF, and its effect is more beneficial when it is integrated into explicit
phonetic instruction and pronunciation-focused activities (see Lee, Jang, and
Plonsky 2015; Saito and Plonsky 2019). Also, immediate feedback seems to be
more beneficial than delayed feedback (Li, Zhu, and Ellis 2016).
In terms of efficiency of different types of feedback, Gooch et al. (2016)
found in their experimental study targeting the production of /ɹ/ by Korean
learners of ESL that while both prompts and recasts seem to be equally effective
in improving controlled production of the targeted sound, prompts contribute
more to improvement in spontaneous production in communicative contexts.
These findings suggest that prompts and recasts should be
used in combination as CF techniques. First, prompts will make learners draw
on their previous knowledge acquired through explicit instruction, and second,
the use of recasts will help them to notice the negative evidence directed at the
intelligibility of the output, which will allow learners to practice the correct
form in response to the model pronunciation (positive evidence) (Ellis and
Sheen 2006; Loewen and Philp 2006; Lyster, Saito, and Sato 2013; Nicholas,
Lightbown, and Spada 2001; Saito and Lyster 2012a).
Similar results were found by Lee and Lyster (2017) in their study comparing
the effects of different types of CF on speech perception and production.
In this study, conducted with 100 Korean learners of English, they investigated
the efficiency of different CF conditions: no CF, a prompt (What did you say?),
visual CF (Wrong!), and three auditory CF types, namely a target condition providing
the correct pronunciation model (No, she said ship), a nontarget condition
(No, not sheep), and a combination condition (No, she said ship, not sheep).
According to the results, the combination type was the most beneficial type of CF
for perception development, which can be explained by the fact that the learners
were able to compare the correct model and their erroneous pronunciation.
As for production, the target condition was more effective than all other
types.
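The three auditory conditions can be sketched as message templates. This is an illustrative Python fragment, not the procedure used by Lee and Lyster (2017); the function name `auditory_cf` and its parameters are hypothetical, and the template wording simply follows the examples quoted above.

```python
# Templates mirror the example wordings cited in the text (assumed fixed phrasing).
AUDITORY_TEMPLATES = {
    "target": "No, she said {target}.",                      # correct model only
    "nontarget": "No, not {nontarget}.",                     # learner's error only
    "combination": "No, she said {target}, not {nontarget}.",  # both, for comparison
}


def auditory_cf(condition: str, target: str, nontarget: str) -> str:
    """Build the feedback sentence for one of the three auditory CF conditions."""
    if condition not in AUDITORY_TEMPLATES:
        raise ValueError(f"unknown CF condition: {condition!r}")
    return AUDITORY_TEMPLATES[condition].format(target=target, nontarget=nontarget)
```

Only the combination template places the model and the erroneous form side by side, which is the property the authors point to when explaining its advantage for perception.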
To conclude, the research suggests that it is important to orchestrate a
range of feedback techniques and make choices based on linguistic target, con-
text, proficiency level, etc. (Ellis 2012; Lyster, Saito, and Sato 2013; Ranta and
Lyster 2018). A key point is that CF cannot be useful if learners have not en-
gaged in any initial explicit learning opportunities, or if they are not provided
with clear information on erroneous utterances beyond information about cor-
rectness (Hattie and Timperley 2007).
While researchers, teachers, and learners consider CF a crucial component
of L2 pronunciation teaching and learning, the frequency of CF episodes
targeting pronunciation in L2 classrooms is very low and represents only 22.4%
of teacher-learner interactions (Brown 2016). Since opportunities for pronuncia-
tion training and immediate personalized corrective feedback are limited in the
traditional classroom setting, can the use of new speech technologies be a via-
ble solution to provide the learners with effective pronunciation practice with
meaningful feedback?
3 New speech technologies, new opportunities for pronunciation learning?
Research on Computer-Assisted Pronunciation Training (CAPT) has suggested the
effectiveness of two tools for learning outcomes and users’ learning experience:
automatic speech recognition (ASR) and text-to-speech synthesis (TTS) (Bajorek
2017; Blake 2013; Cucchiarini and Strik 2013; Golonka et al. 2014; Liakin,
Cardoso, and Liakina 2015, 2017b; Morton and Jack 2010; Wang and Young 2015).
Automatic Speech Recognition (ASR) is a computing process that instantly
transcribes spoken language into text. In the context of pronunciation instruc-
tion, researchers propose using ASR to teach the pronunciation of a foreign lan-
guage and to assess students’ oral production (Liakin, Cardoso, and Liakina
2015; McCrocklin 2014).
Text-to-speech (TTS) is a natural-language modeling process that changes
units of text into speech for audio presentation. Text-to-speech programs usu-
ally feature different speed levels for their voices (speech output), including
both female and male speakers with different pitches (low and high), different
accents, and a highlight function that displays the words, sentences, and para-
graphs being read by the program.
These technologies can be used to encourage practice and repetition (Chapelle
and Jamieson 2008; Garcia, Nickolai, and Jones 2020; McCrocklin 2016),
personalize learning (Derwing 2010; Tsutsui 2004), provide immediate
visual feedback on pronunciation (Mroz 2018; Neri, Cucchiarini, and Strik
2002; Neri et al. 2008; Wang and Young 2015), and encourage learner
autonomy (Chapelle and Jamieson 2008; McCrocklin 2016).
A CAPT system may include either audio or visual feedback, which pinpoints
pronunciation errors while students make unlimited attempts at practicing
a target language in the absence of teacher involvement. In a CAPT system,
ASR technology automatically transcribes students’ voice recordings into written
words. The speech visualization technology integrated into the system visually
shows the deviation of students’ pronunciation from that of native model
speakers, and this feedback can provide learners with particular metacognitive
strategies to facilitate mastery of the target language sound patterns (Tsai
2019).
Speech recognition software varies greatly in terms of its validity, reliability,
and the quality of the feedback provided to the users; therefore, further research
is necessary to determine if speech recognition software actively supports L2 pro-
nunciation development (Bajorek 2017; Liakin, Cardoso, and Liakina 2017b; Mroz
2018).
3.1 TTS
There exists very little research on the effects of the use of TTS as an L2
pedagogical tool. Liakin, Cardoso, and Liakina (2017a) investigated the acquisition
in production of French liaison (i.e., the pronunciation of a latent word-final
consonant) in a mobile TTS-based learning environment. The study used a
pre/post/delayed-posttest design with two experimental groups (a TTS group and
a group supervised by a French instructor) and a control group. The results indicated
that the two groups that received instruction, namely the TTS and teacher-led groups,
outperformed the control group in liaison production. This study confirmed
TTS’ ability to aid in pronunciation learning.
The results obtained by Bione and Cardoso (2020) suggest that synthetic
voices have the potential to deliver intelligible and comprehensible input, simi-
lar to human speech. Their study evaluated a modern English TTS system in an
EFL context in Brazil in terms of its speech quality, ability to be understood by
L2 users, and potential for focus on specific language forms in comparison with
a native English speaker. The results of the study indicate that the TTS
and human voices were perceived similarly in terms of comprehensibility,
while ratings for naturalness were unfavorable for the synthesized
voice. For text comprehension, dictation, and aural identification tasks, partici-
pants performed relatively similarly in response to both voices.
3.2 ASR
The majority of studies that have investigated the effects of ASR on the acquisi-
tion of L2 pronunciation have shown that, despite many limitations, this technol-
ogy has the potential to be effective. In the context of pronunciation teaching,
researchers suggest two possible applications for ASR: (1) to teach the pronuncia-
tion of a foreign language; and (2) to assess students’ oral production. A series of
studies show that computer-assisted pronunciation instruction using ASR can be
effective in the acquisition of L2 phonological features (Bodnar et al. 2016; Cuc-
chiarini, Neri, and Strik 2009; Garcia, Nickolai, and Jones 2020; Liakin, Cardoso,
and Liakina 2015, 2017b; McCrocklin 2016; Mroz 2018, 2020; Mushangwe 2015;
Neri et al. 2008; Penning de Vries et al. 2014; Seferoglu 2005; Strik et al. 2009,
2012, among others).
Liakin, Cardoso, and Liakina (2015) investigated the effects of mobile ASR-based
learning on the acquisition of the problematic French vowel /y/ in production
and perception. The study involved three groups of learners: one
received instruction via ASR, another via a French instructor, and the third
acted as the control group. Their findings indicated that the group that received
ASR-based instruction improved significantly in /y/ production from pretest to
posttest, in comparison with the two other groups.
An experimental study by Mroz (2020) aimed to determine the impact of
mobile-based ASR in Gmail on the intelligibility and proficiency of intermediate
learners of French as a foreign language, and whether any individual factors
influenced learning outcomes. The results of this study showed that ASR users
significantly outperformed non-ASR users on intelligibility, particularly when
exposed to instruction on spelling-to-sound patterns, and demonstrated the
most significant growth in proficiency.
In Garcia, Nickolai, and Jones (2020), the authors presented a 15-week
classroom study measuring the student outcomes of instructor-led pronuncia-
tion lessons versus entirely ASR-based pronunciation training in lower-level
Spanish courses. The study found that both instructor-led and ASR-based in-
struction techniques yielded statistically significant gains in pronunciation rat-
ings. ASR seems to outperform traditional instruction when targeting specific
phonemes, especially in the short term, while the instructor-led group, which
received explicit instruction on pronunciation, saw longer-term gains regarding
comprehensibility. The data suggest that ASR-based instruction shows promise
to improve certain aspects of L2 pronunciation.
3.3 Efficiency of ASR-based corrective feedback
While more and more studies investigate the usefulness of speech technologies
to develop different L2 skills, only a limited number of researchers have investigated the
efficiency of different types of automated, immediate CF that can be provided
to learners using ASR and TTS-based tools for L2 pronunciation practice. The
following studies guided us in designing our action research, which will be pre-
sented in the following section.
Cucchiarini, Neri, and Strik (2009) conducted a research experiment with
30 adult immigrants who were divided into three groups that used:
(1) an ASR-based Computer Assisted Pronunciation Training (CAPT) system developed
specifically for Dutch L2 learners and providing CF on a limited number of
problematic Dutch sounds; (2) a CAPT system with no CF; or (3) regular
teacher-fronted classroom instruction with no CAPT system. The ASR-based feedback
consisted of the transcription of the utterance produced by the learners
with mispronounced phonemes identified in red, a smiley, and a comment indicating
that there was an error. The system also allowed the learners to listen
and to compare their pronunciation with the model. In order to regulate the
level of anxiety, a maximum of three errors was signaled for each recording.
According to the results, the group with ASR-based CF outperformed the two other
groups on the production of the targeted sounds; however, the difference in
improvement among the three groups was not statistically significant for the phonemes
not targeted by the automatic feedback. While this study demonstrated the positive
impact of the ASR-based CF on the production of the sounds targeted by
the training, the researchers concluded that limiting CF to a restrained number
of problematic sounds is not an effective strategy to obtain significant overall
learning effects and pronunciation quality.
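The idea of marking mispronounced phonemes while capping the number of signaled errors can be sketched as follows. This is a hedged illustration in Python, not the system described by Cucchiarini, Neri, and Strik (2009): it assumes that some recognizer has already produced a phoneme sequence for the learner's utterance, it uses square brackets in place of red highlighting, and the names `flag_errors` and `MAX_FLAGGED` are hypothetical.

```python
from difflib import SequenceMatcher

MAX_FLAGGED = 3  # cap on signaled errors per recording, as reported for the study


def flag_errors(target, produced, limit=MAX_FLAGGED):
    """Bracket up to `limit` target phonemes that were not matched in `produced`.

    Returns the annotated target sequence and the number of flagged phonemes.
    Extra phonemes inserted by the learner are ignored in this simplified sketch.
    """
    annotated, flagged = [], 0
    matcher = SequenceMatcher(None, target, produced, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        for phoneme in target[i1:i2]:
            if tag == "equal" or flagged >= limit:
                annotated.append(phoneme)          # matched, or cap already reached
            else:
                annotated.append(f"[{phoneme}]")   # signaled to the learner
                flagged += 1
    return annotated, flagged
```

Errors beyond the cap are left unmarked rather than dropped, so the learner still sees the full target sequence; which errors to prioritize when more than three occur is a pedagogical decision that this sketch resolves simply by order of appearance.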
To better understand CF in an ASR system, Wang and Young (2014) re-
searched the effects of two different types of immediate automated CF provided
through a pedagogical ASR-based intelligent computer-assisted speaking learn-
ing (iCASL) system for autonomous practice of English pronunciation. The partic-
ipants – 38 adult ESL learners from Taiwan – were divided into an experimental
and a control group and had to complete weekly reading activities independently
during an eight-week period. While the control group received only implicit CF
that consisted of a speaking score and a waveform diagram, the experimental
group benefited from additional explicit targeted CF, including a corrective com-
ment, a list of words pronounced correctly and with errors, and recasts of the
learners’ utterances. Finally, the learners had access to audio recordings with
full sentences and single-word forms that could be played at a natural and slow
pace. According to the results, 94% of control group participants reported being
confused and not being able to interpret the overall assessment scores and the
waveforms. Therefore, it is not surprising that only the experimental group’s par-
ticipants, exposed to both implicit and explicit multi-modal targeted CF, attained
significant improvement rates in pronunciation.
Bajorek (2017) considers, among other fundamental points, the importance
of L2 pronunciation and how targeted feedback on spoken production can support
language learners. Her findings indicate that the software programs reviewed (Rosetta
Stone, Duolingo, Babbel, and Mango Languages) provide insufficient
feedback to learners about their speech and, thus, have unrealized potential.
The author recommends that learners be provided, wherever possible, with tar-
geted feedback so that they can act on this information and improve their
speech via explicit instruction. Accordingly, ASR can be helpful for learners in
providing immediate targeted feedback, but this capability must be explained
through explicit instructions rather than being used as an unexplained assess-
ment tool.
To conclude, ASR is a very promising technology that should allow stu-
dents to get immediate feedback on their pronunciation, thus making them
more independent in learning this aspect. However, as presented in Liakin, Car-
doso, and Liakina (2017b), many participants of their two studies experienced a
great deal of frustration when they were unable to understand why the applica-
tions could not understand them and how they could correct themselves:
“Sometimes I didn’t know what to change, so I just said the same thing over and over.”
“[. . .] I didn’t even know what I was doing wrong.”
“[. . .] when I was getting to the thirteenth, the fourteenth [try] and I’m just like ‘I don’t
know how you want me to say it!’”
There is also little research on learners’ perceptions of the use of ASR and TTS
for French as a second/foreign language pronunciation learning in general
and, more specifically, on the immediate automated feedback they receive.
4 Action research study: Integration of three automatic speech recognition apps into a corrective pronunciation course
As illustrated at the beginning of this chapter, previous research mainly fo-
cused on only one specific ASR-based tool used to support one type of task,
mainly reading tasks. No studies were conducted in a French as a Second Lan-
guage (FSL) intact classroom with elementary-level learners.
This study adopts an action research approach to examine the student per-
ceptions of speech technology as a pronunciation-learning tool and the imme-
diate feedback it provides. It aims to explore the use of three different types of
ASR and TTS supported tools that allow learners to practice pronunciation and
receive instantaneous, automatic feedback, not only on structured read and re-
peat tasks but also on a broader range of communicative tasks to transfer the
new skills acquired in a controlled environment into more spontaneous oral
communication contexts. Our goal was to explore the potential of different ASR
and TTS applications and the types of automated corrective feedback they offer
at different stages of the pronunciation learning process, as suggested by SLA
and CALL research and second language pedagogy.
4.1 Research questions
The following research questions guided our investigation:

1. What are students’ perceptions of the immediate automated corrective feed-
back provided by speech technology?
2. How do learners perceive the use of speech technology as a learning tool
for pronunciation training?
4.2 Method
4.2.1 Participants
Fifty-seven young adult FSL learners participated in this study (45 female,
12 male). All participants were recruited from three intact L2
French classrooms taught by teacher-researchers at two large North American
universities. They were either native English speakers or had native-like profi-
ciency in the language. All participants had a beginner-level proficiency in
French (A1 level, according to the Common European Framework of Reference
for Languages; Council of Europe, 2001) and were enrolled in elementary-
level specialized corrective pronunciation classes.
4.2.2 Pedagogical design and tasks
The pedagogical design of the study was based on the framework for effective
pronunciation teaching in communicative contexts (Celce-Murcia et al. 2010),
which includes the following steps:
– listening in the form of the perceptual teaching aimed at developing phono-
logical awareness and proposing perception and discrimination activities;
– repetition/imitation in the form of guided practice;
– communication in the form of the reuse of spontaneous production in vari-
ous contexts such as speech acts, presentations, etc. (Celce-Murcia et al.
2010: 45).
The development of pedagogical tasks was also guided by SLA research find-
ings that suggest the efficiency of explicit learning (Derwing and Munro 2015;
Ellis 1994), of focus-on-form (Long 2000) integrated into a thematic framework
and based on the communicative and task-based methods (Elliott 1997; Gatbon-
ton and Segalowitz 2005; Trofimovich and Gatbonton 2006; Yule, Powers, and
Macdonald 1992) and of the noticing hypothesis (Schmidt 1994, 1995).
Four pedagogical sequences were developed by the teacher-researchers to
allow the students to practice the phonemes targeted by the course curriculum
outside of the classroom. The tasks were grouped in four 1.5-hour assignments
that students needed to complete outside of regular contact hours as homework.
Each assignment included the following activities in context, each focusing
on specific segmental and suprasegmental elements and with a focus on vocab-
ulary and formulaic expression of a theme (e.g. silent and pronounced final
consonants, rounded vowels /œ/-/ø/, qualitative adjectives and description of
a person, nasal vowels, enumeration intonation and food):
– a review of the articulation, the grapheme-phoneme correspondences and
the pronunciation rules;
– auditory discrimination and grapheme-phoneme correspondence autocor-
rected exercises;
– reading tasks with a focus on targeted phonemes;
– communicative tasks.
In order to achieve these pedagogical goals, three different ASR and TTS-based
tools were integrated: iSpraak, a teaching tool designed for pronunciation train-
ing in a great variety of languages; Pronunciator, a multi-language learning
platform; and Speech to Text Translator TTS app for mobile devices, a free dicta-
tion tool.
4.2.3 ASR and TTS tools used in this study
iSpraak (https://www.ispraak.com) is an online activity generator that automates
speech evaluation for second language learners. Instructors can create activities
by providing the platform with a short text in the L2 and an optional MP3 file to
serve as the audio model. The application can also generate text-to-speech (TTS)
audio files if a model recording does not yet exist, and they can be played with a
feminine or masculine synthetic voice at a regular or slow pace. iSpraak is a ped-
agogical tool that was created specifically for pronunciation practice in a wide
variety of languages, which enables learners to perform simple reading tasks. In
terms of immediate corrective feedback, iSpraak was the only tool that we found
that provided users with targeted explicit corrective feedback; i.e., a transcription
of the utterance with mispronounced words and/or phonemes highlighted in
blue, a score, the possibility to listen to the model, and exposure to different pro-
nunciations of problematic words via the website forvo.com. In our study, this
tool was used for listening and reading tasks with a focus on targeted phonemes
providing both natural and synthetic voice models.
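The kind of targeted feedback described above (a transcription with problematic words flagged, plus a score) can be illustrated with a minimal sketch. It is not iSpraak's actual algorithm, which is not documented in this chapter. It simply assumes that an ASR engine has already returned a transcript and compares that transcript, word by word, with the target text. The function names `tokenize` and `flag_mismatches` and the example sentences are hypothetical.

```python
import difflib
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Letters (including accented ones) form a token; an apostrophe or hyphen
    is kept only when it sits between letters, as in "l'ami".
    """
    return re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text.lower())


def flag_mismatches(target, transcript):
    """Compare the target text with an ASR transcript at word level.

    Returns (flagged, score): `flagged` lists the target words that the
    transcript did not reproduce, and `score` is the share of target words
    that were matched (0.0 to 1.0).
    """
    target_words = tokenize(target)
    heard_words = tokenize(transcript)
    matcher = difflib.SequenceMatcher(a=target_words, b=heard_words, autojunk=False)

    flagged = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" mark target words that were recognized as
        # something else or not recognized at all.
        if tag in ("replace", "delete"):
            flagged.extend(target_words[i1:i2])

    matched = len(target_words) - len(flagged)
    score = matched / len(target_words) if target_words else 0.0
    return flagged, score


# Hypothetical example: the learner's "chat" was recognized as "chien".
flagged, score = flag_mismatches("Il a un petit chat.", "il a un petit chien")
print(flagged, f"{score:.0%}")
```

A word-level comparison of this kind can only tell learners which words were not recognized, not why. This limitation is one reason a score and a transcription alone may be hard to interpret without explicit instruction.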
Speech to Text Translator TTS is a free speech recognition, text-to-speech,
and instant live translation application that was not designed for pedagogical
purposes. However, users can still do simple and complex pedagogical tasks like
reading, looking for the spelling of a new word, recording a vocal message (e.g.
leaving a grocery list to a roommate, making a reservation, placing an order,
inviting someone to a party, etc.), or practicing a presentation (e.g. presenting someone’s
eating habits). As for the CF, it is implicit since the only way to validate or verify
the pronunciation is through an analysis of the speech synthesis of the spoken
message, the transcription, and the translation of the spoken message. In this
study, the participants used the app for reading tasks.
Pronunciator (https://www.pronunciator.com), a comprehensive multi-
language learning tool, is a Web or mobile platform for iOS and Android.
The learners can do pronunciation and vocabulary drills and interactive tasks
like pronunciation in context or speech functions. As for CF, it is a combination
of binary and implicit feedback types: the users receive a score and a general
audio message for the overall assessment associated with the percentage of the
score (e.g.: Try again!). While the system doesn’t allow the users to see the tran-
scription of their utterances, they can listen to their recording and compare it
with the model. By using this platform, the participants were engaged in listen-
ing, reading and communicative tasks.
4.2.4 Instruments
At the end of the first week (first use) and at the end of the fourth week (last
use), the participants were invited to respond to a survey questionnaire consisting
of six statements regarding their perceptions of each speech technology as a
learning tool for pronunciation training and of the immediate automated
corrective feedback provided by each of the three tools. A five-point Likert scale
measured the degree to which students disagreed or agreed with each statement:
(1) strongly disagree, (2) disagree, (3) neutral, (4) agree, and (5) strongly agree.
In order to better understand the quantitative results, participants were in-
vited to express their opinion on each statement of the questionnaire at the end
of the experiment. For each tool used in the study, the statements asked partici-
pants if: (a) the tool increased their motivation to learn about French pronuncia-
tion; (b) the tool allowed them to become aware of some of their pronunciation
problems; (c) the tool allowed them to evaluate their own pronunciation (to de-
cide whether their pronunciation was correct or incorrect); (d) the tool is user-
friendly; (e) the immediate feedback was helpful; and, finally, (f) they thought
this is a great tool to learn and practice pronunciation.
To guarantee confidentiality and to avoid factors that could affect data col-
lection or interpretation of the statements, the survey was administered at
home, without the presence of the teacher, and using English, the language of
instruction at the university where the study took place.
The data from the survey questionnaire were analyzed using descriptive
statistics, in which we established the mean values and associated standard de-
viations for each item under consideration.
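As an illustration of this descriptive analysis, the following minimal sketch computes the mean and standard deviation of the ratings for one statement. The ratings shown are invented for the example and are not data from the study. The use of the sample standard deviation (`statistics.stdev`) is also an assumption, since the chapter does not specify which formula was applied. The helper name `describe` is hypothetical.

```python
from statistics import mean, stdev

# Scale labels as given in the questionnaire description.
LIKERT_LABELS = {
    1: "strongly disagree",
    2: "disagree",
    3: "neutral",
    4: "agree",
    5: "strongly agree",
}


def describe(ratings):
    """Return (mean, sample SD) of Likert ratings, each rounded to two decimals."""
    invalid = [r for r in ratings if r not in LIKERT_LABELS]
    if invalid:
        raise ValueError(f"ratings outside the 1-5 scale: {invalid}")
    if len(ratings) < 2:
        raise ValueError("at least two ratings are needed for a standard deviation")
    return round(float(mean(ratings)), 2), round(stdev(ratings), 2)


# Hypothetical ratings for one statement and one tool (NOT data from the study).
week1 = [3, 2, 4, 3, 3]
week4 = [4, 5, 3, 4, 4]
for label, ratings in (("Week 1", week1), ("Week 4", week4)):
    m, sd = describe(ratings)
    print(f"{label}: mean = {m:.2f}, SD = {sd:.2f}")
```

Running the same function on the Week 1 and Week 4 ratings for each tool and statement would yield the kind of mean and SD pairs reported in the results tables.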
4.3 Results
The data compiled by means of the survey questionnaire were analyzed via a
simple mean calculation with associated standard deviation (descriptive statis-
tics). Means were used to measure the students’ ratings of the statements
adopted in the study.
4.3.1 Research question 1
Quantitative results
For our first research question, What are students’ perceptions of the immediate
automated corrective feedback provided by speech technology?, we analyzed survey
statements 1–3: (1) “the immediate feedback was helpful”; (2) “the tool allowed
me to evaluate my own pronunciation”; (3) “the tool allowed me to
become aware of some of my pronunciation problems.” Table 1 illustrates the
results for each of these three items collected after the first and the last use of
the tools.
Table 1: Survey results (Research question 1).
Statements                                        iSpraak             Speech to Text TTS  Pronunciator
                                                  Week 1    Week 4    Week 1    Week 4    Week 1    Week 4
                                                  Mean SD   Mean SD   Mean SD   Mean SD   Mean SD   Mean SD
1. The immediate feedback was helpful.            .    .    .    .    .    .    .    .    .    .    .    .
2. The tool allowed me to evaluate my own         .    .    .    .    .    .    .    .    .    .    .    .
   pronunciation.
3. The tool allowed me to become aware of some    .    .    .    .    .    .    .    .    .    .    .    .
   of my pronunciation problems.
Although the students’ perceptions were positive for all three statements, the re-
sults of the responses to the questions regarding the usefulness of corrective
feedback clearly indicate that the iSpraak application, which offered a combina-
tion of implicit and explicit corrective cues, comments, and targeted feedback,
was the most highly rated. The Pronunciator platform received the lowest scores:
the speech recognition was not always reliable because of some technical issues
with ASR, and the scores and general appreciation messages that students re-
ceived were inconsistent and difficult to interpret.
What we found very interesting was the change in perception of the Speech
to Text Translator TTS app between the first and the last use of the tool, particu-
larly the appreciation of the implicit feedback that the app offered in the form
of a transcription of the utterances. This assessment was much more favorable
at the end of the intervention. Our interpretation of this change in learners’
perception of this tool toward the end of the study is that the learners took the
necessary time to learn the application and were better equipped to decode the
implicit feedback and to correct themselves, thanks to the explicit teaching in
their corrective phonetics course.
Qualitative results
All the participants were invited to express their opinion¹ on each statement
from the questionnaire at the end of the experiment.
As for the helpfulness of immediate feedback (Statement 1), on the positive side,
all learners (n=57) appreciated the variety and complementarity of CF types
that can be exemplified with the following excerpts: “It is always nice to have
immediate feedback to be able to correct you mistakes.” – “All feedbacks are
helpful.” – “I can find quickly my pronunciation errors from these tools.” These
responses are likely due to the fact that immediate feedback allowed them to
see their progress, in addition to receiving messages of encouragement in Pro-
nunciator, as illustrated by one of the participants’ comments: “My favorite tool
for feedback is Pronunciator because it encourages me to keep working on
pronunciation.” At the same time, the participants appreciated seeing their
mistakes immediately in Speech to Text Translator TTS and iSpraak: “It is very
useful to know exactly what it is that you are mispronouncing.”, allowing them to
“correct mistakes before [they] remember things wrong.”
The less positive comments on the helpfulness of the immediate feedback
concerned the perceived inaccuracy of the tool and the inability to autocorrect
after completing the tasks. In terms of the assessment of their pronunciation,
several participants stated that sometimes iSpraak was a source of slight disap-
pointment since they often obtained an almost perfect score even if they mis-
pronounced several words. As for the helpfulness of the binary feedback
provided by Pronunciator, it was reported that “ . . . it could have been better if
it provided you with more specific feedback (like iSpraak), like telling you ex-
actly which words were mispronounced.”
As for Statement 2 (The tool allowed me to evaluate my own pronunciation),
the learners appreciated all the tools: “I found all of the tools helpful in showing
me what the standard pronunciation should be, and in catching my mistakes.”
They additionally liked the possibility to see their pronunciation score in Pro-
nunciator and iSpraak: “It gives me scores that I can evaluate myself.”
¹ We have chosen to keep the participants’ comments in their original and unmodified form.
They also appreciated the fact that immediate feedback allows them to vi-
sualize what has been said, to listen to, and compare their pronunciation
against a model in Pronunciator, thus contributing to their practice and learn-
ing experience, as highlighted by a participant: “It is a helpful tool to aware of
my pronunciation problems. It let me compare between my pronunciation and
right pronunciation.”
However, as can be observed in Table 1, they had some difficulties with
Speech to Text Translator TTS at the beginning, since there were ads in the app.
At the same time, some students expressed a lack of confidence and mentioned
the fact that not all tools provided explicit feedback on the errors: “I wish it told
me what was wrong with my pronunciation but I realize that is hard for an appli-
cation to do.”
Finally, the learners appreciated the fact that immediate feedback allowed
them to become aware of their mistakes and the quality of their pronunciation
(Statement 3), thus helping them to identify the elements pronounced incorrectly
and to know exactly what to correct: “I do really recommend the apps iSpraak
and Pronunciator. For the first one, you can know what kind of pronunciation you
should stress on and for the second one, you could imitate what you have heard.” –
“iSpraak and Speech-to-text TranslatorTTS made me realize most what I was doing
wrong.” – “ . . . very useful to know what I said wrong.”
Students particularly liked iSpraak. This can be exemplified with the fol-
lowing excerpt: “I like the system of this tool. First, I listen to the sound clip to be
familiar the right pronunciation and rhythm of the sentence. And then, I record
my pronunciation. Also, they give me quick feedbacks that I can aware where I
need to fix.”
Despite these findings, it should be noted that some participants felt
frustrated, especially those who had pronunciation problems and would have
preferred the chance to listen to their own recordings.
These results suggest that the type of corrective feedback impacts one’s
learning experience and that tools that provide targeted explicit and implicit
feedback are perceived as more useful. It could also be concluded that learners
need more coaching in spelling decoding and correction strategies if the correc-
tive feedback they receive is implicit. Scores with or without a general comment
are less useful and less appreciated.
4.3.2 Research question 2
Quantitative results
For our second research question, How do learners perceive the use of speech
technology as a learning tool for pronunciation training?, we analyzed the survey
statements 4–6 from Table 2: (4) “I think this is a great tool to learn and practice
pronunciation”; (5) “the tool increased my motivation to learn about French pro-
nunciation”; and (6) “the tool is user-friendly.”
Table 2: Survey results (Research question 2).
Statements                                        iSpraak             Speech to Text TTS  Pronunciator
                                                  Week 1    Week 4    Week 1    Week 4    Week 1    Week 4
                                                  Mean SD   Mean SD   Mean SD   Mean SD   Mean SD   Mean SD
4. I think this is a great tool to learn and      .    .    .    .    .    .    .    .    .    .    .    .
   practice pronunciation.
5. The tool increased my motivation to learn      .    .    .    .    .    .    .    .    .    .    .    .
   about French pronunciation.
6. The tool is user-friendly.                     .    .    .    .    .    .    .    .    .    .    .    .
Similar to the case of the three previous statements, all statements for the second
research question indicated that the students’ perceptions were positive; the only
exception was for Speech to Text Translator TTS after the first week. Here, again,
the assessment was much more favorable at the end of the intervention since the
students were able to learn how to use the tool.
Qualitative results
For Statement 4 (I think this is a great tool to learn and practice pronunciation),
the learners appreciated all three tools and how they complemented each
other: “I think they’re all great tools for different reasons, and I enjoyed the vari-
ety.” – “Each tool offered something different in terms of pronunciation help [. . .]” –
“I would use three all again [. . .] for pronunciation.” Students also appreciated
the possibility of unlimited practice.
Even though these tools might be a source of frustration for some students,
especially for learners with many pronunciation problems, they seemed to ap-
preciate that learning can be done autonomously, not in front of the group, mo-
tivating the practice: “It is certainly very useful for practicing. It can get a bit
frustrating at times, but it also encourages to keep doing it to get better. It’s like a
very personal challenge.”
As for the motivation statement (The tool increased my motivation to learn
about French pronunciation), students were unanimous and positive: “The styles
of the tools are various, but all tools inspire my motivation to learn about French
pronunciation equally.” – “It made me realize many errors in my knowledge of pro-
nunciation, and it made want to fix them.” – “Most of these tools actually motivate
to do better in French language, because when I am corrected, it shows me what I
need to work on.”
A small number of students mentioned a few drawbacks such as the limited
number of activities and the time needed to become accustomed to the tools: “I feel that
using the tools helped me to want to learn more, but they took a while to get ac-
customed to.”
Finally, for the statement The tool is user-friendly, although students found
them easy to use (“They are easy to use. Interface is simple and to the point.”),
we can observe that students had some difficulties with Speech to Text Transla-
tor TTS at the beginning (“ . . . has weird layout. Too many ads.”), but at the
end their perception had changed considerably. We want to stress that it is
crucial to train learners well, both technically and pedagogically, to give them
time to master the tool, and to consider the potential negative influence of
pop-ups and advertisements in apps, as was the case with Speech to Text Translator
TTS. Such distractions are common in applications that offer free access
and are not pedagogically conceived.
In sum, the qualitative analysis allowed us to identify the perceived bene-
fits and drawbacks of speech technologies as pronunciation practice tools and
their potential for effective corrective feedback illustrated in Table 3. These fac-
tors can either enhance or, on the contrary, have a negative effect on a learning
experience and should be considered when a pedagogical decision is made
about the integration of ASR and TTS-based tools into the curriculum.
Table 3: Perceived benefits and downsides of speech technologies.
Benefits
– visualization of what has been said/spoken
– possibility of listening to and comparing their pronunciation to a model
– identification of the elements pronounced incorrectly
– awareness of their mistakes and of the quality of their pronunciation
– targeted corrective feedback
– monitoring of the progress
– encouragement to improve the pronunciation (source of motivation) and
willingness to continue in order to improve results
– unlimited practice
– intuitive and easy-to-use tools
Drawbacks
– absence of explicit feedback on the errors
– difficulty with results interpretation
– pop-ups and advertisements in the app
– unusual layout of the tool
– limited number of activities
– impossibility to listen to own recording
– source of frustration for learners with many pronunciation problems
– lack of confidence in the software
– feeling of frustration
5 Discussion and pedagogical implications
Built on the findings in three interconnected fields of second language studies
(SLA, CALL, and L2 pedagogy), our action research responds to “the strong
call for studies that explore how pronunciation-focused CF can be implemented
in the most efficient and effective manner” (Saito 2021: 422).
The positive results obtained in this exploratory action research study shed
light on the interactions between the learners and different types of automated
corrective feedback provided by speech technology on their pronunciation and
suggest ways of effective integration of ASR and TTS-based applications into the
L2 curriculum to enhance the learning experience and make it more meaningful.
In terms of the usefulness of different types of corrective feedback, the re-
sults are in line with previous findings in the fields of SLA and CALL. The com-
bination of explicit targeted feedback and automated audio recasts provided by
iSpraak was perceived as the most helpful CF technique. From the very begin-
ning of the experiment, such CF allowed the participants to validate their
knowledge (a score), to identify the errors (problematic words and phonemes
highlighted), to listen to a model (audio recasts), to correct the errors
immediately, and to see whether they had made progress. These findings are
consistent with researchers who claim that the effectiveness of pronunciation-focused
CF is maximized when the learners know exactly what they need to correct (Bajorek
2017; Cucchiarini and Strik 2013; Wang and Young 2014) and when they are provided
with an audio model (or a recast) rather than a prompt (Saito 2021). It is important
to note that the results of our study suggest that targeted feedback provided in
the form of enhanced visual prompts and followed by audio recasts is useful,
especially during
guided practice activities when the learners are still acquiring knowledge about
new pronunciation features. This is important at this stage because learners need
a solid scaffolding to support their learning and to prepare them to engage in
more complex oral tasks.
As for the prompts that were given implicitly via written transcription of the
utterances in iSpraak and Speech to Text Translator TTS, it was interesting to ob-
serve how the learners’ perceptions changed over time. As noted by many re-
searchers, CF is beneficial only when it is built on the previous knowledge and if
it can be understood by the learners based on what they already know (see
Ammar and Spada 2006; DeKeyser 2007; Lyster, Saito, and Sato 2013). Also, CF is
more efficient when the learners “have enough phonetic knowledge, conversa-
tional experience, and perceptual awareness of target sounds” (Saito 2021: 422).
At the beginning of the experiment, our participants, all near-beginners, had
very limited knowledge and experience. They were therefore not yet equipped to
decode the implicit feedback, to reflect on their successes, or to identify the
gaps in their pronunciation that needed to be addressed, and they did not perceive
the ASR-based dictation app as a useful tool for learning pronunciation. After four
weeks of intensive training, including explicit instruction and extensive practice
of grapheme-phoneme correspondence, which is very challenging for learners of
French, the implicit CF in the form of prompts was appreciated as much as the
targeted explicit feedback with audio recasts. These findings suggest that the
integration of ASR-based dictation tools should be carefully prepared and that
appropriate scaffolding should be provided, in the form of explicit teaching and
training on how to fix pronunciation errors based on the transcription of the
utterance provided by the app, in order to avoid learners’ frustration and loss of
motivation (Liakin, Cardoso, and Liakina 2017b). Finally, in terms of the place of
pronunciation activities supported by ASR and TTS dictation tools, it appears that
they are
suitable for semi-guided activities and more communicative tasks that are
part of spontaneous practice.
As for binary feedback, consisting of an overall assessment (score and gen-
eral appreciation message) introduced during the final phase of the sequences
for communicative practice in context, the appreciation of CF remained neutral
during the whole length of the experiment. Such CF was considered a signal
that something went wrong and a prompt to try again, but not as a sufficient
teaching technique to support learning and correct pronunciation. However, it
is important to draw attention to the implicit auditory feedback available in
Pronunciator: on top of the binary feedback, the learners could listen to their
own recorded utterances and compare them to the model recordings. Many
participants found this feature extremely helpful and expressed the wish
to have the same feature available for all the tools and pronunciation activities,
which could be one of the criteria for teachers who choose ASR-TTS tools for
their students or learning app developers working in the field of CAPT.
In sum, most of the participants found all the apps and the combination of
different types of CF very useful and complementary, which supports the recom-
mendations of many researchers to orchestrate different CF techniques (Ammar
and Spada 2006; Hattie and Timperley 2007; Lyster and Ranta 1997).
In terms of pedagogy, the design of the technology-mediated tasks was in-
spired by the framework for effective pronunciation teaching, which allowed
progressive learning with multiple outcomes, from explicit teaching of pronun-
ciation rules to the practice of grapheme-phoneme correspondence; perception
to production; and from controlled to spontaneous processing and output
(Celce-Murcia et al. 2010). The explicit instruction and oral feedback provided
during class time enabled the learners to identify and understand errors, and
then correct themselves based on different types of automated ASR-based feed-
back. This supports the claims made by many researchers that explicit phonetic
knowledge is necessary for pronunciation-focused CF to be effective, so it could
lead to significant learning gains in pronunciation.
In sum, the results of this study suggest that ASR-based automated immedi-
ate feedback, provided in a variety of formats, has a positive impact on pronunci-
ation learning when it is combined with training on learning strategies, explicit
instruction on articulation techniques, and spelling-to-sound patterns integrated
into level-appropriate realistic tasks. Altogether, this encourages meaningful pro-
nunciation practice in context and empowers learners to become more motivated
and more autonomous in their pronunciation practice outside of the classroom
(Dickerson 2015; Liakin, Cardoso, and Liakina 2017b; McCrocklin 2016).
6 Concluding remarks
The first goal of this study was to review the different types of speech technolo-
gies to better understand how they could be used for pronunciation instruction,
with specific attention on how ASR- and TTS-based applications may offer an
array of opportunities for immediate automated corrective feedback. The second
goal of this study was to better understand learners’ perceptions of the utility of
the different types of corrective feedback provided by the abovementioned appli-
cations (i.e., iSpraak, Speech to Text Translator TTS, Pronunciator). With regard to
the first objective, based on research about the role of CF in L2 pronunciation,
it was determined that ASR- and TTS-based applications can provide corrective
feedback that can be more explicit or implicit, focusing on more binary or tar-
geted feedback via audio or visual channels. With regard to the second objective,
which was addressed through an action research project in two university-level
French pronunciation courses, participants indicated in survey and short-answer
written responses that they found the pedagogical use of the three abovementioned
applications, and the various types of corrective feedback they provide, to be
useful. A key point to consider, however, is that the three applications were only
viewed as complementary tools after the learners received training about L2
pronunciation and how to use the tools for independent practice, indicating that
teachers still play a critical role. Given the findings in this paper, there is poten-
tial for efficient corrective feedback in CAPT systems that utilize ASR and TTS,
though future research is needed to better understand the role of training in ef-
fectively using these resources.
References
Ammar, Ahlem & Nina Spada. 2006. One size fits all? Recasts, prompts and L2 learning.
Studies in Second Language Acquisition 28(4). 543–574.
Arteaga, Deborah L. 2000. Articulatory phonetics in the first-year Spanish classroom. The
Modern Language Journal 84(3). 339–354.
Bajorek, Joan. 2017. L2 Pronunciation in CALL: The unrealized potential of Rosetta Stone, Duolingo,
Babbel, and Mango Languages. Issues and Trends in Educational Technology 5(2). 24–51.
Baker, Amanda & Michael Burri. 2016. Feedback on second language pronunciation: A case
study of EAP teacher’s beliefs and practices. Australian Journal of Teacher Education 41(6).
1–19. doi:10.14221/ajte.2016v41n6.1 (accessed 16 June 2021).
Bione, Tiago & Walcir Cardoso. 2020. Synthetic voices in the foreign language context.
Language Learning & Technology 24(1). 169–186.
Blake, Robert J. 2013. Brave New Digital Classroom: Technology and Foreign Language
Learning. Washington: Georgetown University Press.
Bodnar, Stephen, Catia Cucchiarini, Helmer Strik & Roeland van Hout. 2016. Evaluating the
motivational impact of CALL systems: Current practices and future directions. Computer
Assisted Language Learning 29(1). 186–212.
Brown, Dan. 2016. The type and linguistic foci of oral corrective feedback in the L2 classroom:
A meta-analysis. Language Teaching Research 20(4). 436–458.
Celce-Murcia, Marianne, Donna M. Brinton & Janet M. Goodwin. 2010. Teaching
Pronunciation: A Course Book and Reference Guide, 2nd edn. Cambridge: Cambridge
University Press.
Chapelle, Carol. 2001. Computer Applications in Second Language Acquisition: Foundations
for Teaching, Testing, and Research. Cambridge: Cambridge University Press.
Chapelle, Carol & Joan Jamieson. 2008. Tips for Teachers: Computer-assisted Language
Learning. New York: Pearson Longman.
Collins, Laura & Carmen Muñoz. 2016. The foreign language classroom: Current perspectives
and future considerations. The Modern Language Journal 100(S1). 133–147.
https://doi.org/10.1111/modl.12305 (accessed 16 June 2021).
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning,
Teaching, Assessment. Cambridge, U.K.: Press Syndicate of the University of Cambridge.
Crompton, Peter & Sherwin Rodrigues. 2001. The role and nature of feedback on students
learning grammar: A small scale study on the use of feedback in CALL in language
learning. Proceedings of the Workshop on Computer Assisted Language Learning,
Artificial Intelligence in Education Conference, 70–82.
Cucchiarini, Catia, Ambra Neri & Helmer Strik. 2009. Oral proficiency training in Dutch L2: The
contribution of ASR-based corrective feedback. Speech Communication 51(10). 853–863.
Cucchiarini, Catia & Helmer Strik. 2013. Second language learners’ spoken discourse: Practice
and corrective feedback through Automatic Speech Recognition. In Hwee Ling Lim & Fay
Sudweeks (eds.), Innovative Methods and Technologies for Electronic Discourse Analysis,
169–189. Hershey: Information Science Reference.
DeKeyser, Robert. 2007. Skill Acquisition Theory. In Bill VanPatten & Jessica Williams (eds.),
Theories in Second Language Acquisition: An Introduction, 97–113. Mahwah, NJ:
Lawrence Erlbaum Associates Publishers.
Derwing, Tracey M. 2010. Utopian goals for pronunciation teaching. In John Levis & Kimberly
LeVelle (eds.), Proceedings of the 1st Pronunciation in Second Language Learning and
Teaching Conference, Ames, USA, 2009, 17–19. Ames, IA: Iowa State University.
Derwing, Tracey M., Murray J. Munro & Grace Wiebe. 1998. Evidence in Favor of a Broad
Framework for Pronunciation Instruction. Language Learning 48(3). 393–410.
doi:10.1111/0023-8333.00047 (accessed 16 June 2021).
Derwing, Tracey M. & Marian J. Rossiter. 2003. The Effects of Pronunciation Instruction on the
Accuracy, Fluency, and Complexity of L2 Accented Speech. Applied Language Learning 13(1).
1–17.
Dickerson, Wayne. 2015. Using orthography to teach pronunciation. In Marnie Reed & John
Levis (eds.), The Handbook of English Pronunciation, 488–503. Chichester: Wiley
Blackwell.
Elliott, A. Raymond. 1995. Foreign language phonology: field independence, attitude, and the
success of formal instruction in Spanish pronunciation. The Modern Language Journal 79(4).
530–542.
Elliott, A. Raymond. 1997. On the teaching and acquisition of pronunciation within a
communicative approach. Hispania 80(1). 95–108.
Ellis, Rod. 1994. The Study of Second Language Acquisition, 2nd edn. Oxford: Oxford University
Press.
Ellis, Rod. 2012. Language Teaching Research and Language Pedagogy. Oxford:
Wiley–Blackwell.
Ellis, Rod & Young Sheen. 2006. Reexamining the role of recasts in second language
acquisition. Studies in Second Language Acquisition 28(4). 575–600. doi:10.1017/
S027226310606027X (accessed 16 June 2021).
Flege, James E. 1981. The phonological basis of foreign accent: A hypothesis. TESOL Quarterly
15(4). 443–455.
Fortune, Tara W. & Diane J. Tedick. 2015. Oral proficiency assessment of English-proficient K-8
Spanish immersion students. Modern Language Journal 99(4). 637–655.
Garcia, Christina, Dan Nickolai & Lillian Jones. 2020. Traditional versus ASR-based
pronunciation instruction: An empirical study. CALICO Journal 37(3). 213–232.
Gass, Susan M., Jennifer Behney & Luke Plonsky. 2013. Second Language Acquisition: An
Introductory Course, 4th edn. New York: Routledge.
Gatbonton, Elizabeth & Norman Segalowitz. 2005. Rethinking communicative language
teaching: A focus on access to fluency. The Canadian Modern Language Review 61(3).
325–353. http://dx.doi.org/10.3138/cmlr.61.3.325 (accessed 16 June 2021).
Golonka, Ewa M., Anita R. Bowles, Victor M. Frank, Dorna L. Richardson & Suzanne Freynik.
2014. Technologies for foreign language learning: A review of technology types and their
effectiveness. Computer Assisted Language Learning 27(1). 70–105. doi:10.1080/
09588221.2012.700315 (accessed 16 June 2021).
Gooch, Debbie, Paul Thompson, Hannah M. Nash, Margaret J. Snowling & Charles Hulme.
2016. The development of executive function and language skills in the early school
years. Journal of Child Psychology and Psychiatry 57(2). 180–187.
Han, ZhaoHong & Terence Odlin. 2006. Studies of Fossilization in Second Language
Acquisition. Bristol: Multilingual Matters.
Hattie, John & Helen Timperley. 2007. The power of feedback. Review of Educational Research
77(1). 81–112.
Hincks, Rebecca. 2003. Speech Technologies for Pronunciation Feedback and Evaluation.
ReCALL: The Journal of EUROCALL 15(1). 3–20. doi:10.1017/S0958344003000211
(accessed 16 June 2021).
Isaacs, Talia. 2009. Integrating form and meaning in L2 pronunciation instruction. TESL
Canada Journal 27(1). 1–12.
Kartushina, Natalia, Alexis Hervais-Adelman, Ulrich Hans Frauenfelder & Narly Golestani.
2016. Mutual influences between native and non-native vowels in production: Evidence
from short-term visual articulatory feedback training. Journal of Phonetics 57. 21–39.
Kennedy, Sara. 2011. Le développement de la parole L2 d’étudiants universitaires non-natifs.
Paper presented at the Journée d’étude sur la phonétique des langues secondes,
Université du Québec à Montréal, 1 April.
Kennedy, Sara, Josée Blanchet & Pavel Trofimovich. 2014. Learner pronunciation, awareness,
and instruction in French as a second language. Foreign Language Annals 47(1). 76–96.
Lang, Yong, Lin Wang, Lianxia Shen & Yinying Wang. 2012. An integrated approach to the
teaching and learning of zh. Electronic Journal of Foreign Language Teaching 9(2).
215–232.
Lebel, Jean-Guy. 2011. Nécessité de la correction phonétique en FLE. Paper presented at the
Journée d’étude sur la phonétique des langues secondes, Université du Québec à
Montréal, 1 April.
Lee, Andrew H. & Roy Lyster. 2016. Effects of different types of corrective feedback on
receptive skills in a second language: A speech perception training study. Language
Learning 66(4). 809–833.
Lee, Andrew H. & Roy Lyster. 2017. Can corrective feedback on second language speech
perception errors affect production accuracy? Applied Psycholinguistics 38(2). 371–393.
Lee, Junkyu, Juhyun Jang & Luke Plonsky. 2015. The Effectiveness of Second Language
Pronunciation Instruction: A Meta-Analysis. Applied Linguistics 36(3). 345–366.
Levis, John & Shannon McCrocklin. 2018. Reflective and effective teaching of pronunciation. In
Akram Faravani, Mitra Zeraatpishe, Hamid Reza Kargozari & Maryam Azarnoosh (eds.),
Issues in Syllabus Design, 77–89. Rotterdam: Sense Publishers.
Li, Shaofeng, Yan Zhu & Rod Ellis. 2016. The effects of the timing of corrective feedback on the
acquisition of a new linguistic structure. Modern Language Journal 100(1). 276–295.
Liakin, Denis, Walcir Cardoso & Natallia Liakina. 2015. Learning L2 pronunciation with a
mobile speech recognizer: French /y/. CALICO Journal 32(1). 1–25.
Liakin, Denis, Walcir Cardoso & Natallia Liakina. 2017a. The pedagogical use of mobile speech
synthesis (TTS): Focus on French liaison. Computer Assisted Language Learning 30(3).
348–365.
Liakin, Denis, Walcir Cardoso & Natallia Liakina. 2017b. Mobilizing instruction in a second-
language context: Learners’ perceptions of two speech technologies. Languages 2(3).
1–21.
Loewen, Shawn & Jenefer Philp. 2006. Recasts in the adult L2 classroom: Characteristics,
explicitness and effectiveness. Modern Language Journal 90(4). 536–556.
Long, Michael H. 2000. Focus on form in task-based language teaching. In Richard D. Lambert
& Elana Shohamy (eds.), Language Policy and Pedagogy: Essays in Honor of A. Ronald
Walton, 179–192. Amsterdam: John Benjamins Publishing Company.
Lord, Gillian. 2005. (How) can we teach foreign language pronunciation? On the effects of a
Spanish phonetics course. Hispania 88(3). 557. doi:10.2307/20063159
(accessed 16 June 2021).
Lyster, Roy & Leila Ranta. 1997. Corrective feedback and learner uptake: Negotiation of form in
communicative classrooms. Studies in Second Language Acquisition 19(1). 37–66.
Lyster, Roy, Kazuya Saito & Masatoshi Sato. 2013. Oral corrective feedback in second
language classrooms. Language Teaching 46(1). 1–40.
McCrocklin, Shannon. 2014. The potential of Automatic Speech Recognition for fostering
pronunciation learners’ autonomy. Ames: Iowa State University dissertation.
McCrocklin, Shannon. 2016. Pronunciation learner autonomy: The potential of Automatic
Speech Recognition. System 57. 25–42.
Morin, Regina. 2007. A neglected aspect of the standards: Preparing foreign language
Spanish teachers to teach pronunciation. Foreign Language Annals 40(2). 342–360.
Morton, Hazel & Mervyn Jack. 2010. Speech interactive computer-assisted language learning:
a cross-cultural evaluation. Computer Assisted Language Learning 23(4). 295–319.
Mroz, Aurore. 2018. Seeing how people hear you: French learners experiencing intelligibility
through automatic speech recognition. Foreign Language Annals 51(3). 617–637.
Mroz, Aurore. 2020. Aiming for advanced intelligibility and proficiency using mobile ASR.
Journal of Second Language Pronunciation 6(1). 12–38.
Mushangwe, Herbert. 2015. Using voice recognition software in learning of Chinese as a
foreign language pronunciation. The Journal of Language Teaching and Learning 5(1).
52–67.
Neri, Ambra, Catia Cucchiarini & Helmer Strik. 2002. Feedback in Computer Assisted
Pronunciation Training: when technology meets pedagogy. Proceedings of the 10th
International CALL Conference, 179–188. Antwerp: University of Antwerp.
Neri, Ambra, Ornella Mich, Matteo Gerosa & Diego Giuliani. 2008. The effectiveness of
computer assisted pronunciation training for foreign language learning by children.
Computer Assisted Language Learning 21(5). 393–408.
Nicholas, Howard, Patsy M. Lightbown & Nina Spada. 2001. Recasts as feedback to language
learners. Language Learning 51(4). 719–758.
Penning de Vries, Bart, Catia Cucchiarini, Stephen Bodnar, Helmer Strik & Roeland van Hout.
2014. Spoken grammar practice and feedback in an ASR-based CALL system. Computer
Assisted Language Learning 28(6). 550–576.
Ranta, Leila & Roy Lyster. 2007. A cognitive approach to improving immersion students’ oral
language abilities: The awareness-practice-feedback sequence. In Robert DeKeyser (ed.),
Practice in a Second Language: Perspectives from Applied Linguistics and Cognitive
Psychology, 141–160. New York: Cambridge University Press.
Ranta, Leila & Roy Lyster. 2018. Form-focused instruction. In Peter Garrett & Josep M. Cots
(eds.), The Routledge Handbook of Language Awareness, 40–56. New York: Routledge.
Saito, Kazuya. 2012. Effects of instruction on L2 pronunciation development: A synthesis of 15
Saito, Kazuya. 2021. Effects of corrective feedback on second language pronunciation
development. In H. Nassaji & E. Kartchava (eds.), The Cambridge Handbook of Corrective
Feedback in Second Language Learning and Teaching, 407–428. Cambridge: Cambridge
University Press.
Saito, Kazuya & Roy Lyster. 2012a. Effects of form–focused instruction and corrective
feedback on L2 pronunciation development of /r/ by Japanese learners of English.
Saito, Kazuya & Roy Lyster. 2012b. Investigating the pedagogical potential of recasts for L2
vowel acquisition. TESOL Quarterly 46(2). 387–398.
Saito, Kazuya & Luke Plonsky. 2019. Effects of second language pronunciation teaching revisited:
A proposed measurement framework and meta-analysis. Language Learning 69(2).
652–708.
Schmidt, Richard. 1994. Deconstructing consciousness in search of useful definitions for
Applied Linguistics. Consciousness in second language learning 11. 237–326.
Schmidt, Richard. 1995. Attention and Awareness in Foreign Language Learning. Honolulu:
University of Hawaii at Manoa.
Seferoglu, Gölge. 2005. Improving students’ pronunciation through accent reduction software.
British Journal of Educational Technology 36(2). 303–316.
Solon, Megan. 2016. Do Learners Lighten Up? Phonetic and Allophonic Acquisition of
Spanish /l/ by English-Speaking Learners. Studies in Second Language Acquisition 39(4).
1–32.
Strik, Helmer, Jozef Colpaert, Joost van Doremalen & Catia Cucchiarini. 2012. The DISCO ASR-
based CALL system: Practicing L2 oral skills and beyond. Proceedings of the Conference
on International Language Resources and Evaluation (LREC 2012). 2702–2707.
Strik, Helmer, Khiet Phuong Truong, Febe de Wet & Catia Cucchiarini. 2009. Comparing
different approaches for automatic pronunciation error detection. Speech Communication
51(10). 845–852.
Thomson, Ron I. 2011. Computer Assisted Pronunciation Training: Targeting second language
vowel perception improves pronunciation. CALICO Journal 28(3). 744–765.
Trofimovich, Pavel & Elizabeth Gatbonton. 2006. Repetition and Focus on Form in processing
L2 Spanish words: Implications for pronunciation instruction. The Modern Language
Journal 90(4). 519–535.
Tsai, Pi-hua. 2019. Beyond self-directed computer-assisted pronunciation learning: A
qualitative investigation of a collaborative approach. Computer Assisted Language
Learning 32(7). 713–744.
Tsutsui, Michio. 2004. Multimedia as a means to enhance feedback. Computer Assisted
Language Learning 17(3–4). 377–402. doi:10.1080/0958822042000319638 (accessed
16 June 2021).
Wang, Yi Hsuan & Shelley C. Young. 2014. A study of the design and implementation of the
ASR-based iCASL System with corrective feedback to facilitate English learning.
Educational Technology & Society 17(2). 219–233.
Wang, Yi Hsuan & Shelley C. Young. 2015. Effectiveness of feedback for enhancing English
pronunciation in an ASR‐based CALL system. Journal of Computer Assisted Learning 31(6).
493–504.
Wiggins, Grant. 2012. Seven keys for effective feedback. Feedback for Learning 70(1). 10–16.
Yule, George, Maggie Powers & Doris Macdonald. 1992. The variable effects of some task-
based learning procedures on L2 communicative effectiveness. Language Learning 42(2).
249–277.
Part IV: Pronunciation in the laboratory: High
variability phonetic training
Ellen Simon, Bastien De Clercq, Pauline Degrave,
Quentin Decourcelle
On the robustness of high variability
phonetic training effects: A study on the
perception of non-native Dutch contrasts
by French-speaking learners
Abstract: There is growing evidence in the literature for the positive effect of
high variability phonetic training (HVPT) on the perception of non-native con-
trasts. In the present study, we aim to examine the robustness of perceptual
training effects. We define robustness along three dimensions: (1) the generaliz-
ability of the training to novel tokens and talkers, (2) the long-term retention
effects of the training, and (3) the effect of training in non-optimal listening
conditions, i.e., with noise added to the signal.
The participants are 48 adult L1 French learners of Dutch in Belgium, 27 of
whom are enrolled in secondary education, while the others are university stu-
dents (N=21). Participants are assigned to an experimental (N=27) or a control
group (N=21). Both groups take a pre-test, post-test and delayed post-test, each of
which consists of a lexical identification task with and without noise. The experimental
group is trained on five Dutch sound contrasts in five multimodal HVPT ses-
sions, consisting of perceptual identification tasks with feedback and metalin-
guistic information.
The results show a nuanced picture: overall, where training effects are
found, learners are able to generalize these to novel tokens and talkers, thus
confirming the effectiveness of HVPT for pronunciation training. However, the
results also reveal considerable variability in the effectiveness of HVPT along
most robustness variables, which can to a large extent be attributed to the
moderating variables we examined, namely the type of learner (secondary education
vs. university) and the type of sound contrast.
Keywords: phonetic training, L2 perception, HVPT, Dutch, French
Acknowledgements: We wish to thank Hubert Naets from the Research Centre CENTAL (UCLou-
vain) for his invaluable help with the development of the online environment for the experiment.
Ellen Simon, Quentin Decourcelle, Ghent University

Pauline Degrave, Ghent University; UCLouvain
Bastien De Clercq, Vrije Universiteit Brussel
https://doi.org/10.1515/9783110736120-012
1 Introduction
It is generally acknowledged that listening to a non-native language is difficult,
especially with respect to the perception of non-native contrasts which do not
occur in the native language (see for instance Williams and Escudero 2014 for an
overview). Driven by the large number of language learners who report having
difficulty with non-native speech perception, a productive research line has
emerged which addresses the impact of phonetic training on L2 perception.
These studies generally report positive effects of phonetic training, showing that
phonetic training can help improve learners’ non-native perception (see Lee,
Jang, and Plonsky 2015 and Sakai and Moorman 2018, for overviews of experi-
mental studies on this topic). A well-known training approach that has been re-
ported to be effective is called High Variability Phonetic Training (HVPT, see
Thomson 2018 for a detailed overview of training studies within this framework).
In this approach, first developed by Logan, Lively, and Pisoni (1991), learners are
exposed to auditory stimuli from multiple speakers and in multiple contexts
(e.g., target vowels flanked by different consonants). The idea is that the variabil-
ity in the realization of a particular phoneme, due to differences in, for instance,
vocal tract size, dialect and speaking rates will help the learner to build a more
robust phonological category. This will in turn enhance perception and word rec-
ognition across different contexts (Logan, Lively, and Pisoni 1991: 4–5).
HVPT was originally purely auditory-based, exposing learners to a large num-
ber of stimuli containing the target L2 sounds. However, it can be combined with
providing learners with explicit metalinguistic information (e.g., a comparison
with the native language) as well as with articulatory-based instruction. Articula-
tory-based instruction focuses on the position of the articulators during the pro-
duction of L2 vowels and consonants and compares it to the articulatory setting
during corresponding L1 sounds (Saito and Plonsky 2019). As Saito and Plonsky
(2019: 662) note, this approach makes use of visual materials, such as diagrams
and animations. The idea behind it is that learners base their sound representa-
tions on the articulatory gestures made for the production of the speech sounds.
In Best’s Perceptual Assimilation Model (PAM), listeners are hypothesized to “ex-
tract invariants about articulatory gestures from the speech signal, rather than
forming categories from acoustic-phonetic cues” (Best and Tyler 2007: 24). If
learners thus receive information on the position of the tongue, lips and velum, this
will help them to build representations for L2 sounds. Hazan et al. (2005) report
on an HVPT study, in which they compared the effectiveness of auditory training
with that of audio-visual perceptual training on perception and production. L1
Japanese speakers were trained on the L2 English contrasts /v/-/b/-/p/ and /l/-/r/.
The results revealed that learners’ perception improved in both conditions, but
more so in the audiovisual condition, in which participants were presented with a
natural face in addition to the auditory stimuli.
While previous studies using HVPT have generally found a positive effect
on L2 perception, in the present study, we aim to gain insight into how robust
these effects are. We identify three dimensions of robustness: (1) the generaliz-
ability of training effects to novel tokens and novel talkers, (2) the duration of
the impact of training, and (3) the effect of training on listening in non-optimal
conditions, i.e., with background noise. In what follows, we briefly discuss pre-
vious studies addressing these dimensions. In addition, we discuss the poten-
tial role of moderating factors on the effectiveness of the training.
1.1 Generalizability and long-term benefits of training
Most studies using HVPT include novel tokens and tokens produced by novel
talkers in the posttest or set up a separate posttest called the ‘generalization test’.
An example of a study that successfully applied HVPT and reported generaliza-
tion to novel tokens and talkers is the study by Bradlow et al. (1997) on a percep-
tual training programme for L1 Japanese learners of L2 English. The training
focused only on the contrast between /r/ and /l/ and consisted of 45 sessions
over a period of 3–4 weeks. The results showed substantial gains in identifica-
tion, from 65% in the pretest to 81% in the posttest and a similarly high percent-
age in the generalization tests with novel words and a novel speaker.
A subset of training studies using HVPT also look at the long-term effects of
the training by including a delayed posttest or retention test several weeks or
months after the end of the training. An early study by Pisoni, Lively, and
Logan (1994) tested the long-term effects of training Japanese listeners on the
identification of English /r/ and /l/ using Logan, Lively, and Pisoni’s (1991)
HVPT. They found that accuracy decreased only by 2% from the posttest at the
end of the training to a posttest three months later and no significant decrease
in accuracy was observed for the tests of generalization. Six months after the end of
training, participants still obtained higher scores than at pretest level. Similarly,
Wang and Munro (2004) trained native speakers of Mandarin and Cantonese
on three English vowel contrasts in a programme consisting of 2–3 training
sessions of 50–60 minutes per week over a period of two months. The pro-
gramme had a positive effect on learners’ performance on an identification task
in a posttest as well as in a retention test three months after the programme
had ended. Nishi and Kewley-Port (2007) also found a long-term retention effect
of perceptual training in native Japanese listeners trained on American English
vowels. Both listeners trained on nine vowels and listeners trained on a subset
of three difficult vowels showed improved perception to novel tokens and novel
talkers in a generalization task and in a delayed posttest after three months. In
a study by Rato (2014), native speakers of European Portuguese were similarly
trained on six difficult English vowels in three HVPT training sessions. Partici-
pants who were trained on the vowel contrasts performed significantly better in
a posttest, including a generalization test with novel tokens and talkers (for
two of the three contrasts), and in a delayed posttest after two months. It
should be noted that not all studies report large gains. Aliaga-García and Mora
(2009), for instance, found only a small effect on L2 perception of an HVPT programme
targeting two consonant and two vowel contrasts in English which are
known to be problematic for native speakers of Catalan.
In sum, the results suggest that the HVPT paradigm can lead to changes in
listeners’ perception which are long-lasting (or at least still observable after six
months, as in Pisoni, Lively, and Logan’s 1994 study). A possible explanation
for the positive effects of training may be that the training triggers listeners to
subconsciously pay attention to the relevant acoustic cues which were under-
used before training took place, and that this newly acquired sensitivity to rele-
vant cues may be permanent.
1.2 Perception in non-optimal conditions
It is well known that non-native perception is seriously challenged when the lis-
tening conditions are not optimal, as in the case of background noise or distor-
tions of the signal through, for instance, a bad telephone connection. Indeed, as
Cutler et al. (2004: 3668) point out: “As non-native listeners, we are all too famil-
iar with the phenomenon that listening to non-native language seems dispropor-
tionately difficult under disadvantageous listening conditions, such as against a
noisy background.” Research indeed confirms that non-native listeners have dif-
ficulty with speech recognition when noise has been added to the signal (see the
special issue edited by Garcia-Lecumberri, Cooke, and Cutler (2010) on this topic).
Mattys et al. (2012) make a distinction between speech degradation with and
without energetic masking: the former occurs when there is physical overlap be-
tween the target signal and a nontarget signal, such as background noise. The
target signal itself, however, is intact. The latter occurs when speech is filtered,
for instance in the case of telephone transmission, when the lower frequencies
are not transmitted. A number of studies report on perception experiments which
have tried to mimic a context of listening in a noisy environment by adding noise
to the stimuli, thus creating speech degradation with energetic masking. An ex-
ample is the study by Lengeris and Nicolaidis (2015): they set up a programme to
train native speakers of Greek on the perception of seven English consonants
which are known to be difficult for native speakers of Greek. The pre- and post-
test as well as the training sessions consisted of closed-set identification tasks,
with feedback provided only in the training sessions. In the pre- and posttest,
but not in the training, stimuli were presented either in quiet or with multi-talker
babble played simultaneously at a signal-to-noise ratio of −2 dB. The results
showed a modest yet significant improvement in consonant identification for par-
ticipants who had received the training compared to a control group. As ex-
pected, performance was lower in noise than in quiet, but training had a positive
effect in both conditions. Other studies have examined how training listeners in
adverse conditions, operationalized by adding multi-talker babble noise to the
stimuli, affects speech perception in quiet and in noise, and have
generally found positive effects (e.g., Leong et al. 2018).
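To make the notion of a signal-to-noise ratio concrete, the following minimal Python sketch shows one way of scaling a noise signal so that it stands at a requested SNR relative to a speech token before the two are added. It is an illustration only and not the procedure of any study cited above: the function names (`mean_power`, `measure_snr_db`, `mix_at_snr`) are hypothetical, and a pure tone and uniform random noise stand in for a recorded stimulus and multi-talker babble.

```python
import math
import random


def mean_power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def measure_snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech and noise stand at `snr_db`, then add them.

    Returns the mixed signal and the scaled noise.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # Noise power required for the requested SNR.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    scaled_noise = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled_noise)]
    return mixed, scaled_noise


# Synthetic stand-ins (assumptions): a 220 Hz tone for the speech token and
# uniform random noise for the babble, both one second long at 16 kHz.
rate = 16000
speech = [math.sin(2 * math.pi * 220 * t / rate) for t in range(rate)]
rng = random.Random(0)
babble = [rng.uniform(-1.0, 1.0) for _ in range(rate)]

mixed, scaled_babble = mix_at_snr(speech, babble, snr_db=-2.0)
achieved = measure_snr_db(speech, scaled_babble)
```

A negative SNR means that the noise carries more power than the speech: at −2 dB the noise power is roughly 1.6 times the speech power, which helps explain why identification is harder in this condition than in quiet.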
1.3 Language learning difficulty: Moderating variables in instruction
The notion of second language difficulty further accounts for the many ways in
which HVPT can affect the three core dimensions of the language acquisition
process, i.e. its route, rate and final level of attainment (Ellis 2015). Language
learning difficulty itself has been conceptualized as a multifaceted notion, as a
series of moderating variables influencing the effectiveness of any instructional
treatment. In their taxonomy of L2 difficulty, Housen and Simoens (2016) distin-
guish between learner-related, feature-related, and context-related difficulty.
Learner-related difficulty is described as the “encounter of language features
with the language learner’s individual capacities and abilities” (Housen and
Simoens 2016: 167). From the perspective of phonological acquisition, such diffi-
culty may for example arise as a function of a learner’s phonological awareness
(Anthony and Francis 2005), but also as a function of individual differences,
such as age of learning, motivation and learning styles or strategies (Archibald
2021; Dörnyei 2009; Moyer 1999). These factors are well known in the second lan-
guage literature, but have more rarely been examined in relation to the impact of
HVPT (see also Thomson 2018).
Feature-related difficulty, then, can refer to the inherent cognitive require-
ments posed by a language feature independent of the above-mentioned learner-
related features (Housen and Simoens 2016) or, in phonological terms, can also
be understood to refer to markedness or frequency in the input (Archibald 2021),
or to phonological features which pose inherent articulatory difficulties and may
for example arise late in L1 acquisition. Finally, context-related difficulty stems
from learning conditions (e.g., instructed vs. naturalistic) (Housen and Simoens
2016) or the nature of phonological instruction (e.g., implicit vs. explicit, see Pel-
tekov 2020).
Crucially, language learning difficulty can also arise at the interface of
these three sources of difficulty. Such is the case when HVPT manipulates
input frequency through repeated exposure in an instructional setting. From a
contrastive perspective, the interface between language and learner-related fac-
tors can also be understood in terms of the relation between the learner’s L1
and L2 phonology, as formalized by Flege’s Speech Learning Model (Flege 1995;
Flege and Bohn 2021) and Best’s Perceptual Assimilation Model (Best 1994; Best
and Tyler 2007). These models start from the idea that listeners interpret non-
native sounds in terms of the phonetic categories of their native language. The
English contrast between /i/ and /ɪ/, for instance, is hard for native speakers of
(Brazilian and European) Portuguese to perceive, as they categorize both sounds in
terms of their native category /i/, and recognition of words differing in
these sounds, such as ‘sit’ versus ‘seat’, is often problematic (Lima Jr. this vol-
ume; Rato 2014). Similarly, native speakers of Dutch tend to have difficulty per-
ceiving the contrast between English /ɛ/ and /æ/, since Dutch has only one
vowel in that area of the acoustic vowel space, which is transcribed as /ɛ/, but
has various phonetic realizations depending on the regional accent (Escudero,
Simon, and Mitterer 2008). Given the different correspondences between native
and target language phonemes and the range of (spectral and durational)
acoustic cues which may signal contrasts, HVPT is likely to be
more effective for some target features than for others.
1.4 The present study
On the basis of the literature reviewed above, we can conclude that, overall, pre-
vious studies examining the effect of high variability phonetic training on the
perception of difficult L2 contrasts suggest that training leads to gains in the per-
ception of target contrasts and that the learning that has taken place can be gen-
eralized to novel contexts and novel talkers. In addition, there is evidence that
these gains may last until well after the end of the training session. There is also
some evidence that phonetic training in optimal listening conditions may en-
hance the perception of L2 speech with background noise, though more research
on this context is needed. However, previous training studies have relied on con-
siderably different methodologies, using treatments of different (or unreported)
length or frequency and differing in the type of instruction and feedback that is
provided. Studies also often focus on one particular issue, such as the type of
training, the long-term retention of the training effect or the addition of noise to
the stimuli. Moreover, a large number of studies include training on one L2 con-
trast only (see e.g., the studies on English /r/-/l/ discussed above).
In the present study, we aim to examine the robustness of HVPT by including,
in one and the same study design, a number of variables that may influence
training effects. Specifically, we examine (1) the generalizability of the perceptual
training to novel tokens and novel talkers, (2) the long-term retention
effects of the training, and (3) the effect of training in quiet on perception in a
noisy environment. All of these factors can provide us with information on the
robustness of the training effects: specifically, training effects are more robust
if they can be generalized to novel tokens and novel talkers; they are more ro-
bust if they have long-term effects; and finally, they are more robust if they ex-
tend to the perception of L2 sounds in adverse listening conditions, such as a
noisy background. By examining the effect of novel tokens and novel talkers,
the effect of leaving time between the training and a perception test and the
effect of the training on stimuli in quiet and with noise added to the signal, we
can contribute to the existing body of research on phonetic training by showing
the robustness of HVPT effects in the context of our study.
In addition, we explore the moderating role of learner- and target-related fac-
tors on the robustness of the training effects. The effect of learner profile is ex-
plored by comparing training effects in two groups of French-speaking learners
of Dutch. In contrast to the bulk of training studies which focus on L2 English
(Sakai and Moorman 2018), the current study is set in Belgium and focuses on L2
Dutch vowel and consonant contrasts which do not occur in the learners’ native
language (see Section 3.2). French is the only official
language in the French-speaking part of Belgium, the Walloon region, whereas
in Flanders the official language is Dutch (Hamers and Blanc 2000). Most Walloons
do not hear or speak Dutch on a daily basis and the same holds true for the
Flemish, who generally do not use French in everyday life. The exception is Brus-
sels, which is officially bilingual and where there is individual bilingualism in
part of the population. As such, Dutch is taught as a foreign language in second-
ary and tertiary education in Wallonia. All schools in Wallonia have to offer at
least one foreign language from 5th grade onwards (pupils aged 10–11). They
have the choice between English, Dutch and German, and can also offer Spanish.
Most pupils in the first year of secondary school take English as their foreign lan-
guage, followed by Dutch and German (Mettewie 2021). In tertiary education,
Dutch language and literature programmes are offered to students majoring in
linguistics, applied linguistics and literature, as well as, in some universities, to
students majoring in other programmes (e.g. law or economics) but with a minor
in Dutch. In the current study, we explore the potential difference in training
effects between two groups of Dutch language learners with different profiles: a
group of secondary school pupils for whom Dutch is a compulsory subject and a
group of university students enrolled in a Dutch language programme.
The effect of target feature is examined by including five contrasts in the
training. We have selected five Dutch contrasts which have been reported to be
difficult for native speakers of French (see Section 3.2 for details). As a result,
the analysis will allow us to compare the robustness of the training across tar-
get features.
In the next section (Section 2), we formulate the research questions and hy-
potheses, followed by information on the methodology (Section 3). The results
are presented and discussed in Sections 4 and 5, respectively.
2 Research questions and hypotheses

The main research question we address in this study is how robust high vari-
ability phonetic training effects are for the perception of L2 Dutch contrasts by
French-speaking learners of Dutch in Belgium. We identify three dimensions of
robustness, namely (1) the generalizability of the perceptual training to novel
tokens and novel talkers, (2) the long-term retention effects of the training, and
(3) the effect of training in quiet on perception in a noisy environment, opera-
tionalized by adding noise to the stimuli. In addition, we explore the moderat-
ing role of learner-related factors and language feature on the training impact.
We can hence formulate a main research question (RQ1), which can be broken
down into three subquestions (RQ1a-c), and two more explorative research
questions (RQ2 and RQ3):
RQ1: How robust are HVPT effects on the perception of Dutch contrasts for
French-speaking learners of Dutch in Belgium?
RQ1a: Do HVPT training effects extend to novel tokens and novel talkers?
RQ1b: Is there long-term retention of HVPT training on perceptual identifi-
cation, i.e., are benefits observable in a delayed posttest?
RQ1c: Do training effects of HVPT in quiet extend to phoneme identification
in adverse listening conditions?
RQ2: To what extent is the robustness of HVPT moderated by learner-related
factors (learner profile)?
RQ3: To what extent is the robustness of HVPT moderated by the type of lan-
guage feature?
On the basis of the literature reviewed in Section 1, we hypothesize that the re-
sults will lead to positive responses to RQ1a and RQ1b. In general, previous re-
search examining the effects of HVPT reports that learners benefit from the
training and that they can generalize the acquired knowledge of or sensitivity
to the use of relevant cues to novel tokens and novel talkers (e.g., Bradlow
et al. 1997; Nishi and Kewley-Port 2007; Pisoni, Lively, and Logan 1994). Since
the HVPT framework explicitly uses multiple talkers and multiple contexts in
order for learners to develop stable phonetic categories for L2 sounds, the gen-
eralizability of the training to novel contexts and talkers is expected within the
framework. Long-term retention of training was observed in earlier studies (e.g.,
Nishi and Kewley-Port 2007; Wang and Munro 2004) and we expect to find it in
the current study as well, though we also note that long-term effects may depend
on the length and duration of the training sessions.
We also hypothesize a positive response to RQ1c, although research on the
effects of training in quiet on perception in noise is limited (but see Lengeris
and Nicolaidis 2015, discussed in Section 1). We predict that performance on
perceptual identification of stimuli in noise will be lower than of stimuli in
quiet, but that training will have an effect on both groups of stimuli.
The extent to which we can provide positive responses to RQ1a-c will pro-
vide us with insight into the overall robustness of the training, thereby answer-
ing the main research question (RQ1).
With respect to RQ2, we explore the question of which type of learner benefits
most from phonetic training using two educational profiles: younger secondary
school pupils with lower intramural exposure to Dutch and an older group of stu-
dents enrolled in a Dutch programme at university. As factors such as proficiency,
age and language exposure have been underresearched as independent variables
in HVPT studies (see Thomson 2018), these two profiles represent an explorative
dimension of the study. A study by Alshangiti and Evans (2014) compared the ef-
fect of HVPT in Arabic learners of English with a higher or lower proficiency level
in English and found mixed results. They observed that high proficiency learners
benefited more from training than low proficiency learners on the perception of
speech in noise (measured through a verbal repetition task of stimuli presented in
noise), but that low proficiency learners showed greater improvement in vowel
identification (measured through a closed-set identification task). The authors hy-
pothesize that the greater improvement in vowel identification in the low profi-
ciency learners may be because these learners had more room to improve than the
high proficiency learners. On the basis of these findings, it is difficult to formulate
hypotheses about the effects of proficiency level. Unlike Alshangiti and Evans’
(2014) study, our study does not include training in noise (only pre- and post-test
items were presented in noise, see 3.2.3) and we may hence hypothesize that the
lower proficiency learners in our study would benefit more from training than the
high proficiency learners. Alternatively, the fact that the university students have
chosen Dutch as one of their (minor or major) subjects may reflect a more positive
attitude towards Dutch compared to the secondary school pupils for whom Dutch
is a compulsory subject. As a result, we may also expect larger gains in L2 Dutch
perception for the university students compared to the secondary school pupils.
Since motivation or attitude was not measured and controlled for in our study,
this prediction will necessarily remain speculative.
Finally, in response to RQ3, we predict that the perceptual training may be
beneficial for the perceptual identification of all five contrasts. However, we
also predict that we will observe some differences in the level of difficulty of
the five contrasts in the pretest and that these differences may affect the magni-
tude of the training effects.
3 Methodology
3.1 Participants
The study was conducted in Wallonia, i.e., the French-speaking part of Bel-
gium, with a sample of 48 participants: 27 were pupils in a secondary school
and 21 were students at a university.
The secondary school pupils were in general education at the time of testing.
They were enrolled in the fourth, fifth or sixth year of secondary education. They
typically have four hours of Dutch classes a week and are able to understand
Dutch at an A2 level in the Common European Framework of Reference for Languages
(CEFR). The university students were recruited in a class of Dutch language
and grammar given at UCLouvain, a French-speaking university in the
French-speaking part of Belgium. They were 1st or 2nd year students enrolled in
an introductory Dutch proficiency course with a B1 entry requirement, either as
part of a programme of Linguistics and Literature, or as part of an optional Dutch
module in a Law major. They had a minimum of nine hours of Dutch-spoken
classes per week, including courses on Dutch language proficiency, grammar
and literature.
To select the participants, an initial sample of secondary pupils (N=31) and
university students (N=33) completed a background questionnaire enquiring into
personal information (age, gender, study orientation, place of birth and of resi-
dence, nationality, hearing problems) and their language background (mother
tongue(s), exposure to Dutch, knowledge of other languages). They also com-
pleted the listening comprehension component of Dialang (Lancaster University
n.d.). From this sample, we selected only French-speaking participants who did
not have Dutch as one of their mother tongues and who reported no hearing
problems. The selected participants were divided into two groups of similar size,
i.e., a training group or a control group, each with the same proportion of univer-
sity and secondary school participants. Since some students dropped out of uni-
versity, changed schools or study programmes or did not take part in some parts
of the experiment due to internet connectivity issues, the final sample consists of
27 respondents in the training group (15 secondary pupils and 12 university stu-
dents) and 21 participants in the control group (12 secondary pupils and 9 univer-
sity students). The respondents were not paid for their participation. Table 1
gives an overview of some general characteristics of the different groups.
Table 1: Final selection of participants.
                     Secondary school                      University
                     Control           Training            Control            Training
N                    12                15                  9                  12
Age                  …–…               …–…                 …–…                …–…
                     (M = …,           (M = …,             (M = …,            (M = …,
                     s.d. = …)         s.d. = …)           s.d. = …)          s.d. = …)
Gender               … women           … women             … women            … women
                     … men             … men               … men              … men
Study orientation    /                 /                   … Linguistics &    … Linguistics &
                                                           Literature         Literature
                                                           … Law              … Law
Mother tongue(s)     … French          … French            French             … French
                     … French-Lingala  … French-Chinese                       … French-Turkish
The Dutch learning experience of the samples of secondary pupils and of uni-
versity students can be contrasted at several levels.
First, the secondary pupils show a lower mean Dutch level than the univer-
sity students. The results obtained for the listening comprehension part of the
Dialang test (Lancaster University n.d.) situate the secondary pupils between A1
and B1 (C1 for one pupil) and the university students between B2 and C2. Mean
self-reported proficiency levels are also lower for the secondary pupils (control =
2.25, s.d. = 0.83; training = 2.06, s.d. = 1.05) than for the university students (control
= 2.62, s.d. = 0.92; training = 3.08, s.d. = 0.99). Self-reported proficiency
was measured by means of a five-point Likert scale, with one standing for ‘very low’ and
five for ‘very high’.
Secondly, their exposure to Dutch is different: the university students take
a weekly minimum of nine hours of Dutch classes with a B1 entry requirement,
whereas the secondary pupils are taught four hours of Dutch per week with an
A2 entry requirement. Concerning their contact with Dutch outside school or
university, the selected pupils did not frequently engage in extracurricular L2
Dutch activities, such as watching Dutch-spoken television or media. For in-
stance, only one pupil watched Dutch-spoken television at least once a week. A
vast majority of the pupils reported engaging in these activities less than once
a month. In contrast, the majority of the university respondents watch Dutch
films or television more often (at least once a week).
Thirdly, even though data on the participants’ motivation and attitudes
were not gathered, the university students, who have chosen to study Dutch as
a major or minor at university, are more likely to have a higher motivation and a
more positive attitude towards learning Dutch than secondary pupils whose mo-
tivation is often low in French-speaking schools in Wallonia (Mettewie 2015).
3.2 Materials
3.2.1 Target contrasts and stimuli
Five target contrasts were selected, including five vowels and three consonants:
(1) /i/ vs /ɪ/, (2) /ɑ/ vs /aː/, (3) /ə/ vs ø, (4) /x/ vs /k/, (5) /h/ vs ø. The contrasts
were selected on the basis of a handbook for French learners of Dutch which
discusses the most problematic Dutch sounds for native speakers of French
(Hiligsmann and Rasier 2007).
The contrasts /i/-/ɪ/, /ɑ/-/aː/ and /x/-/k/ are difficult to perceive for native
speakers of French, as French has only one of the two members of the contrast,
respectively /i/, /a/ and /k/. Typically, French learners of Dutch produce both
members of the Dutch contrast as the closest French sound, thereby failing to
distinguish between minimal pairs such as zit /ɪ/ -ziet /i/ (‘sits’-‘sees’), man /ɑ/
-maan /aː/ (‘man’-‘moon’) and lag /x/ -lak /k/ (‘laugh’-‘varnish’). Dutch /h/
does not have a direct counterpart in French and, in contrast to the previously
mentioned contrasts, is often not realised by French learners of Dutch (or real-
ised as a glottal stop). These learners would then fail to make a contrast be-
tween the members of minimal pairs such as hals /h/ -als ø (‘neck’-‘if’). Finally,
while the central vowel /ə/ exists in French, it is generally silent in word-final
position (with the exception of monosyllabic function words such as je, le or
ne). In Dutch, by contrast, the presence of the vowel plays an important gram-
matical role in the formation of the past tense (hij werkt-hij werkte; ‘he works’,
‘he worked’) and the declension of adjectives (een oud huis-het oude huis; ‘an
old house’-‘the old house’).
For each contrast, eight monosyllabic Dutch minimal pairs were selected
for the pretest and training.1 Half of this list was also used for the posttests,
alongside four additionally recorded monosyllabic minimal pairs for each con-
trast (see below).
Stimuli were recorded by six native speakers of Dutch (three female and
three male), two of whom were used for the posttests only (see below). All were
working in the Dutch section of an Applied Linguistics department at Ghent
University, a Dutch-speaking university in Flanders. They all used Dutch on a
daily basis in a professional setting, including in teaching. They were raised
monolingually in childhood, though later in life they had all learnt additional
languages, including English, French and German. One speaker used both
Dutch and French at home at the time of the recording. They grew up in East-
Flanders (N=3), West-Flanders (N=2) or Flemish Brabant (N=1), but all spoke
Standard Dutch without a detectable regional accent. Their ages ranged from
30 to 62 (M=41). The speakers were instructed to read the list of stimuli in the
carrier phrase Ik heb X gezegd: ‘I have said X’. They were asked to read at a
comfortable pace using a normal, falling intonation pattern, to repeat the sen-
tence whenever they hesitated and to take as many breaks as they needed. The
recordings took about 30 minutes per person. They were made in a sound-
attenuated booth with the audio software Reaper, using a Renkforce CU-4
microphone (4 speakers) or in a quiet room with a Marantz solid state re-
corder PMD620 and a Sony ECM-MS907 microphone (2 speakers).
1 Due to the nature of the contrast, the /ə/ vs ø pairs always involved one monosyllabic
stimulus and one disyllabic stimulus (e.g. hij maakt /maːkt/ - hij maakte /maːktə/; ‘he makes’-‘he
made’).
3.2.2 Training
Five training sessions were developed around the five target contrasts. The first
four training sessions were divided into two parts focusing on different contrasts
and with an optional break between the two parts. As shown in Table 2, sessions
3B, 4 and 5 include repetitions of previously presented contrasts, although with
different training materials (text, stimuli and visuals) to present the same infor-
mation. In the fifth and final session all contrasts were briefly repeated, so that
by the end of the training each contrast had been included in three training ses-
sions. Table 2 presents the target contrasts in each of the training sessions.
Table 2: Target contrasts per training session.
Session no.   Target contrasts
1             A. /ɪ/-/i/
              B. /ɑ/-/aː/
2             A. /x/-/k/
              B. /h/-ø
3             A. /ə/-ø
              B. Repetition vowels: /ɪ/-/i/, /ɑ/-/aː/
4             A. Repetition consonants: /x/-/k/, /h/-ø
              B. Repetition: /ə/-ø
5             A & B. Repetition of all five target contrasts
In each session, learners were trained on the contrast through a combination of
training methods. Although the number of screens devoted to each component
was not always exactly the same, the following components were found in all
sessions:2
1. Introduction: Illustration of the contrast by one minimal pair (auditory
input).
2. Auditory exposure: Auditory exposure to five minimal pairs by different
speakers accompanying written words on the screen.
3. Metalinguistic explanation: Explanation of the phonological (meaning-distinguishing)
nature of the contrast in Dutch.
4. Contrastive information: Comparison with French and brief explanation of
the difficulty of the contrast.
5. Audiovisual input: Two videos showing the lower part of the face of a
speaker, illustrating mouth/lip movements during articulation.
6. Articulatory information: Information on the articulatory setting, explained
in words and accompanied by auditory example stimuli, waveforms (e.g., to
illustrate length differences) or cross-sections illustrating tongue position.
7. HVPT: Forced-choice identification task with feedback.
2 A detailed overview of each session can be obtained by contacting the authors.
In addition, for two contrasts, /x/-/k/ and /h/- ø, information on spelling was
provided. This was done for /x/-/k/, because in Dutch both consonants can be
represented by multiple graphemes, namely <g> and <ch> for /x/ and <c> and <k>
for /k/. For /h/, it was mentioned that in Dutch, /h/ is always produced when it
is present in spelling (as <h>), which contrasts with silent <h> in French. For
the other contrasts, Dutch spelling transparently corresponds to each pho-
neme’s pronunciation in the target stimuli, so additional spelling information
was not deemed necessary.
Each session or session part ended with a forced-choice identification task
with feedback. Participants were told that they would hear different native
speakers pronounce Dutch words, corresponding to minimal pairs. After a
sound file was played, participants were instructed to click on the correspond-
ing word from a minimal pair, with a green check or a red cross appearing
after, respectively, a correct and an incorrect response. When participants selected an
incorrect response, the stimulus was played again and they had to select the
correct answer in order to proceed to the following stimulus (Figure 1). Selecting
the correct button at that point does not necessarily imply that participants
had perceived the contrast, but participants did receive visual feedback and
additional audio exposure to the stimulus. The stimuli in the training sessions,
including the forced-choice identification tasks, were all produced by the same
four native speakers of Dutch.
[Figure 1 screen: the instruction “Klik op het woord dat je hebt gehoord.” with two response buttons, “knap” and “knaap”.]
Figure 1: Illustration of forced-choice identification task (‘Click on the word that you heard’).
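The trial logic described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the software used in the study: the function name `run_feedback_trial` and the `respond` callback (standing in for a participant's click) are hypothetical, and audio playback and the green check or red cross are reduced to comments.

```python
def run_feedback_trial(target, foil, respond):
    """Run one forced-choice trial with feedback.

    `respond(options, plays)` is a hypothetical stand-in for the participant:
    it receives the two response options and the number of times the stimulus
    has been played so far, and returns the chosen word.
    Returns (first_response_correct, number_of_plays).
    """
    options = [target, foil]
    plays = 0
    first_response_correct = None
    while True:
        plays += 1  # the sound file is played (or replayed after an error)
        choice = respond(options, plays)
        correct = choice == target
        if first_response_correct is None:
            first_response_correct = correct  # only the first click reflects perception
        # visual feedback: green check if correct, red cross otherwise
        if correct:
            return first_response_correct, plays


if __name__ == "__main__":
    # A simulated listener who errs once and then picks the remaining option.
    def listener(options, plays):
        return options[1] if plays == 1 else options[0]

    print(run_feedback_trial("knap", "knaap", listener))
```

Keeping the first response separate from the forced correction mirrors the caveat in the text: the second click shows only that the participant proceeded, not that the contrast was perceived.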
3.2.3 Pretest
At the time of the pretest, all participants completed an informed consent form
and an online background questionnaire, enquiring into their language learner
profile and their exposure to and knowledge of Dutch. As mentioned above, at
this stage participants also completed the listening comprehension component
of Dialang (Lancaster University n.d.).
Next, participants completed an online auditory forced-choice identification
task, which was designed to be similar to the HVPT component of the training.
As in the training, participants were instructed that they would hear different na-
tive speakers pronounce Dutch words, corresponding to minimal pairs. However,
during the pretest, participants did not receive feedback and immediately pro-
ceeded to the following stimulus upon selecting their answer.
The identification task consisted of two parts. In the first part, 160 stimuli
were presented in random order without noise, with a break after 80 stimuli. In
the second part, the same 160 stimuli were presented in random order with
noise, half with signal-to-noise ratio 0 (SNR 0) and half with SNR 8, with a
break after 80 stimuli. Noise was applied to the stimuli in Praat (Boersma and
Weenink 2019) using a script by McCloy (2013). In each part, each target feature
of the 5 target contrasts appeared 16 times and was produced by 4 native speak-
ers in a balanced design. Initial instructions were presented in French, but
switched to Dutch once the first part of the identification task started. The order
in which the minimal pairs were presented in the response categories was kept
constant. The stimuli used in the identification task were the same as those
used in the training, albeit with the addition of the two noise conditions.
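For readers unfamiliar with signal-to-noise ratios, the following Python sketch shows one common way of mixing noise into a signal at a target SNR. It is not the Praat script by McCloy (2013) used in the study. It assumes that the SNR values are expressed in decibels and defined over root-mean-square (RMS) amplitude; the function names and the synthetic sine-plus-Gaussian-noise example are purely illustrative.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that 20*log10(rms(signal) / rms(scaled noise)) equals
    `snr_db`, then add it to `signal`. Both inputs must have the same length.
    Returns (mixed, scaled_noise)."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    target_noise_rms = rms(signal) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(signal, scaled)]
    return mixed, scaled


if __name__ == "__main__":
    random.seed(1)
    # Synthetic stand-ins for a recorded word and a noise sample (assumptions).
    signal = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
    noise = [random.gauss(0, 1) for _ in range(1600)]
    for snr_db in (0, 8):
        _, scaled = mix_at_snr(signal, noise, snr_db)
        achieved = 20 * math.log10(rms(signal) / rms(scaled))
        print(snr_db, round(achieved, 6))
```

Under this definition, SNR 0 means that speech and noise have equal RMS amplitude, whereas at SNR 8 the speech is about 2.5 times stronger in amplitude than the noise.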
Each part was preceded by a short training phase, designed to accustom
the participants to the testing format using a contrast (/y/ vs /u/) which did not
feature in the test itself and which is phonemic in both French and Dutch.
3.2.4 Posttest 1 and Posttest 2
The two posttests followed a similar procedure to the pretest identification
task, but introduced new stimuli and speakers in order to test the participants’
ability to generalise the training to new contexts and speakers. Both parts of
the test contained 40 familiar stimuli spoken by 4 familiar speakers, 40 new
stimuli spoken by 4 familiar speakers, 40 familiar stimuli spoken by 2 new
speakers, and 40 new stimuli spoken by 2 new speakers. The two posttests
were identical in all respects.
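The stimulus counts reported for the pretest and posttests can be checked with a short calculation. The Python sketch below assumes one particular reading of the design, namely that each of the two members of each contrast appears 16 times per test part; the variable names are illustrative only.

```python
# Pretest: 5 contrasts x 2 members per contrast x 16 presentations per member
# (the reading of "each target feature ... appeared 16 times" assumed here).
N_CONTRASTS = 5
MEMBERS_PER_CONTRAST = 2
PRESENTATIONS_PER_MEMBER = 16
pretest_part = N_CONTRASTS * MEMBERS_PER_CONTRAST * PRESENTATIONS_PER_MEMBER

# Posttests: four cells of 40 stimuli per part, as listed in Section 3.2.4.
posttest_cells = {
    ("familiar stimuli", "familiar speakers"): 40,
    ("new stimuli", "familiar speakers"): 40,
    ("familiar stimuli", "new speakers"): 40,
    ("new stimuli", "new speakers"): 40,
}
posttest_part = sum(posttest_cells.values())

# Each test has two parts: one in quiet and one with added noise.
PARTS_PER_TEST = 2
observations_per_test = PARTS_PER_TEST * pretest_part

print(pretest_part, posttest_part, observations_per_test)
```

Both test formats therefore contain 160 stimuli per part and 320 per test, which matches the maximum of 320 observations per participant per test mentioned in Section 3.4.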
3.3 Procedure
To facilitate data collection during the COVID-19 pandemic, all parts of the
study were developed as a website which participants could access
from home. Participants completed each part individually on a smartphone,
tablet or computer, using headphones. During their Dutch class, participants
completed the consent form and filled in the background questionnaire. One
week later, the participants were asked to complete the listening component of
Dialang. Participants were subsequently selected for further participation in the
study and assigned to the control or training groups.
Next, participants started the experimental phase, consisting of the pretest,
the five trainings, the posttest and the delayed posttest. The secondary pupils
completed the pretest and the first training during their Dutch class. The other
parts (training and posttests) were completed online at home, as the study coin-
cided with the COVID-19 pandemic. This also forced the second training to be
postponed by one week. At this point in time, secondary education had transi-
tioned to remote teaching, so that the subsequent training sessions and the post-
tests were completed at home. The university students, who attended online
classes from the beginning of the academic year, completed the entire experi-
mental phase at home, after their Dutch grammar class. Teaching staff and re-
searchers were available online in case of problems.
The pretest, trainings and first posttest were completed between October 2020
and December 2020. There was an interval of three to four days between each
training, as well as between training 5 and posttest 1. Posttest 2 was administered
one month after posttest 1. The pretest and posttests lasted about 15 minutes each,
and the training sessions each had a duration of about 20 minutes.
3.4 Statistical analyses
Data were included in the statistical analysis only if they came from participants
who had taken part in all pre- and posttests and, for the training group,
had completed all five training sessions. Some participants experienced technical
difficulties which forced them to restart the pre- or posttests, or to stop a test
prematurely. In these cases, each participant’s responses were reviewed individ-
ually and complete datasets were prioritised. For example, if a participant re-
started a test after 20 trials and proceeded to fully complete the test afterwards,
the first attempt was discarded and the second was maintained. Considering the
high number of observations per participant, incomplete attempts were also included
as long as they contained more than two thirds of the observations. On
average, there were 317.71 out of 320 observations per participant per
test (standard deviation = 10.95).
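The inclusion rule for restarted or interrupted tests can be written as a small decision procedure. The sketch below is an assumption-laden illustration rather than the authors' actual review, which was carried out by hand for each participant: it assumes that the two-thirds threshold is relative to the 320 expected observations per test, and the function names are hypothetical.

```python
from fractions import Fraction

EXPECTED_OBSERVATIONS = 320      # observations per participant per test
MIN_SHARE = Fraction(2, 3)       # incomplete attempts need more than this share


def keep_attempt(n_observations, expected=EXPECTED_OBSERVATIONS):
    """True if an attempt contains more than two thirds of the expected observations."""
    return Fraction(n_observations, expected) > MIN_SHARE


def select_attempt(attempt_sizes, expected=EXPECTED_OBSERVATIONS):
    """Pick the attempt to analyse from a participant's attempts at one test.

    Complete attempts are prioritised; otherwise the largest incomplete
    attempt above the threshold is used; otherwise None is returned.
    """
    complete = [n for n in attempt_sizes if n == expected]
    if complete:
        return complete[-1]
    usable = [n for n in attempt_sizes if keep_attempt(n, expected)]
    return max(usable) if usable else None


if __name__ == "__main__":
    print(select_attempt([20, 320]))   # restart after 20 trials, then a full test
    print(select_attempt([250]))       # a single incomplete but usable attempt
    print(select_attempt([20, 100]))   # nothing usable
```

With 320 expected observations, the threshold falls between 213 and 214 responses, so an attempt needs at least 214 observations to be retained under this reading.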
The data from the lexical identification task were analysed using a mixed-
effects logistic regression model under the generalized linear mixed models
framework, using the lme4 package (Bates et al. 2015) in R (R Core Team
2020). The following explanatory variables were maintained for inclusion in the
statistical model:
– Participant ID (random factor)
– Timing: pretest, posttest 1, posttest 2
– Group: control, training
– Noise: NA, SNR 8, SNR 0
– New stimulus: no, yes
– New speaker: no, yes
– Profile: secondary school, university
– Target contrast: /ɑ/ vs /aː/, /x/ vs /k/, /h/ vs ø, /i/ vs /ɪ/, /ə/ vs ø
Independent variables were dummy coded, with the exception of the Target
contrast variable, which was effects coded.
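The difference between the two coding schemes can be made concrete with a minimal Python sketch. The choice of reference level (for dummy coding) and of held-out level (for effects coding) below is an assumption made for illustration; the text does not state which levels were used in the study, and the function names are hypothetical.

```python
def dummy_code(levels, reference):
    """Dummy (treatment) coding: one 0/1 column per non-reference level."""
    others = [lv for lv in levels if lv != reference]
    return {lv: [1 if lv == o else 0 for o in others] for lv in levels}


def effects_code(levels, held_out):
    """Effects (sum-to-zero) coding: like dummy coding, except that the
    held-out level is coded -1 in every column, so each column sums to zero."""
    others = [lv for lv in levels if lv != held_out]
    return {
        lv: [-1] * len(others) if lv == held_out
        else [1 if lv == o else 0 for o in others]
        for lv in levels
    }


if __name__ == "__main__":
    groups = ["control", "training"]
    contrasts = ["/ɑ/ vs /aː/", "/x/ vs /k/", "/h/ vs ø", "/i/ vs /ɪ/", "/ə/ vs ø"]
    print(dummy_code(groups, reference="control"))
    for level, row in effects_code(contrasts, held_out="/ə/ vs ø").items():
        print(level, row)
```

With dummy coding, a coefficient compares one level with the reference level. With effects coding, a coefficient expresses how far a level deviates from the unweighted mean across all levels (here on the log-odds scale), which is convenient for a five-level factor such as Target contrast that has no natural baseline.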
The model presented in Section 4.2 was built on the basis of the theoretical
predictions of the study and includes Participant ID as a random factor, Timing,
Group, Participant profile, Target contrast and Noise as main effects and a set
number of two-, three- and four-way interactions. The variables New stimulus
and New speaker were only introduced in three-way interactions with Timing
and Group, since new stimuli or speakers were only introduced in the posttests
(see Table 3).
Table 3: Overview of coefficients in the mixed-effects logistic regression model.
Random factors            Participant ID
Main effects              Group, Timing, Noise, Target contrast, Profile
Two-way interactions      Group:Timing
Three-way interactions    Group:Timing:Noise, Group:Timing:New speaker,
                          Group:Timing:New stimulus, Group:Timing:Profile,
                          Group:Timing:Target contrast
Four-way interactions     Group:Timing:New stimulus:Target contrast,
                          Group:Timing:New speaker:Target contrast,
                          Group:Timing:Noise:Target contrast,
                          Group:Timing:New stimulus:Profile,
                          Group:Timing:New speaker:Profile,
                          Group:Timing:Noise:Profile
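To show how the terms in Table 3 fit together, the Python sketch below assembles them into a model formula in lme4-style notation. This is a hedged reconstruction for illustration only: the column names (`correct`, `Participant`, `NewSpeaker` and so on) are hypothetical, a random intercept per participant is assumed, and the authors' actual model call is not given in the text.

```python
# Hypothetical column names; the terms themselves follow Table 3.
MAIN_EFFECTS = ["Group", "Timing", "Noise", "Contrast", "Profile"]
TWO_WAY = ["Group:Timing"]
THREE_WAY = [
    "Group:Timing:" + variable
    for variable in ["Noise", "NewSpeaker", "NewStimulus", "Profile", "Contrast"]
]
FOUR_WAY = [
    "Group:Timing:" + robustness + ":" + moderator
    for moderator in ["Contrast", "Profile"]
    for robustness in ["NewStimulus", "NewSpeaker", "Noise"]
]

fixed_terms = MAIN_EFFECTS + TWO_WAY + THREE_WAY + FOUR_WAY

# "(1 | Participant)" is lme4 notation for a random intercept per participant
# (an assumption about how the random factor was specified).
formula = "correct ~ " + " + ".join(fixed_terms) + " + (1 | Participant)"

print(len(fixed_terms))
print(formula)
```

In lme4, a formula of this shape would typically be passed to `glmer` with `family = binomial` to fit a mixed-effects logistic regression on the binary correct/incorrect responses.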
Due to the high number of variables and interactions, our discussion will
focus on those coefficients that are statistically significant. Higher-order inter-
actions exploring the moderating effects of learner profile and target contrast
will be discussed first, whereas lower-order interactions will only be explored
where higher-order interactions are not statistically significant.
Post-hoc comparisons for fixed effects were carried out using the emmeans
package (Lenth 2020), with pairwise or control vs. treatment comparisons for
significant main effects or interactions. Post-hoc tests were always carried out
for pretest vs. posttest 1 and pretest vs. posttest 2, with the exception of the var-
iables New stimulus and New speaker, for which no pretest data were available.
For these variables, new stimuli or speakers were always compared to familiar
stimuli or speakers, for posttests 1 and 2 separately. P-values were adjusted for
multiple comparisons using emmeans’s dunnettx method.
4 Results
4.1 Descriptive statistics
Table 4 summarizes the participants’ mean scores on the lexical identification
task for the pretest and the two posttests for the robustness variables (Timing,
Noise, New speaker, New stimulus).3
A general comparison of the control and training groups’ scores reveals relatively
high average scores at all testing times, with means ranging between
.779 and .817. Gains for the training group amount to .038 (posttest 1) and .029
(posttest 2), and are more limited for the control group, where participants’
scores increased by .013 (posttest 1) and .003 (posttest 2).
The results also reveal that the participants’ scores when evaluating new
speakers are on average higher in both the control and treatment group than
when evaluating familiar speakers, although this difference is never higher
than .014. New stimuli are judged with similar accuracy to familiar stimuli,
with differences never exceeding .005 at any testing time in either group.
A closer look at the impact of noise on lexical identification accuracy reveals
that across all testing times and in both groups, increased levels of noise
lead to a noticeable decrease of up to .095 in scores. At the same time, gains
between pre- and posttests can be observed for all noise conditions in the
training group, with the strongest gains observed for the high-noise condition
(posttest 1 = .057; posttest 2 = .047) and the lowest gains for the low-noise
condition (posttest 1 = .025; posttest 2 = .011). Differences between pre- and
posttests never exceed .015 in the control group, with the only exception being an
increase in scores of .035 from pretest to posttest 1 in the high-noise condition.
3 The different numbers of responses per category in Tables 4 and 5 are the result of the
connectivity issues described in Section 3.4.
As a whole, these results suggest that training effects can be observed in both
posttests and can be extended to new stimuli and new speakers. The presence of
noise impacts participant performance in general, but the training group still
seems to benefit from instruction in all three noise conditions.
Table 4: Mean response accuracy for the main robustness variables.
                          Pretest               Posttest 1            Posttest 2
                          N    Mean  St.dev.    N    Mean  St.dev.    N    Mean  St.dev.
Timing only
            Control       …    …     …          …    …     …          …    …     …
            Training      …    …     …          …    …     …          …    …     …
New speaker
No          Control       …    …     …          …    …     …          …    …     …
            Training      …    …     …          …    …     …          …    …     …
Yes         Control       N/A  N/A   N/A        …    …     …          …    …     …
            Training      N/A  N/A   N/A        …    …     …          …    …     …
New stimulus
No          Control       …    …     …          …    …     …          …    …     …
            Training      …    …     …          …    …     …          …    …     …
Yes         Control       N/A  N/A   N/A        …    …     …          …    …     …
            Training      N/A  N/A   N/A        …    …     …          …    …     …
Noise
No noise    Control       …    …     …          …    …     …          …    …     …
            Training      …    …     …          …    …     …          …    …     …
SNR …       Control       …    …     …          …    …     …          …    …     …
            Training      …    …     …          …    …     …          …    …     …
SNR …       Control       …    …     …          …    …     …          …    …     …
            Training      …    …     …          …    …     …          …    …     …
A closer look at the moderating variables Profile and Target contrast in rela-
tion to Timing and Group (Table 5) reveals much more limited and less stable
gains in the training group for the secondary school pupils (posttest 1 = .022;
posttest 2 = .006) than for the university participants (posttest 1 = .056; posttest
2 = .055). The control group shows negligible increases for both learner profiles,
with the exception of posttest 1 for the university participants, where the in-
crease of .022 is similar to that of the secondary school training group.
Finally, a comparison of the various target contrasts reveals important dif-
ferences, with contrasts such as /i/ vs /ɪ/ (control = .623; training = .602) being
perceived much closer to chance level than contrasts such as /ə/ vs ø (control =
.855; training = .858) in the pretest. Gains between pre- and posttests are simi-
larly variable. The strongest gains in the training group are observed for /i/ vs /ɪ/
(posttest 1 = .081; posttest 2 = .060), for which the pretest scores were also the
lowest. At the same time, other contrasts for which the training group ob-
tained lower scores in the pretest did not see similar gains. For example, gains
for /ɑ/ vs /aː/, for which the second lowest scores were observed in the pretest,
were limited to .018 (posttest 1) and .009 (posttest 2). Conversely, some contrasts
that caused few problems for the training group in the pretest still saw larger
gains. While participants in the training group distinguished /x/ from /k/
relatively well in the pretest, their performance still increased by a
comparatively large margin at posttest 1 (.041) and posttest 2 (.044).
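Since the gains reported in this section appear to be differences between posttest and pretest mean accuracy, the posttest means they imply can be recovered by simple addition. The following minimal sketch does so for the training group on /i/ vs /ɪ/, using only figures quoted in the text. The function name is hypothetical, and both the reading of a gain as "posttest mean minus pretest mean" and the rounding to three decimals are assumptions.

```python
def implied_posttest_mean(pretest_mean: float, gain: float) -> float:
    """Return the posttest mean implied by a pretest mean and a reported gain.

    Assumes gain = posttest mean - pretest mean (an interpretation of the text).
    """
    return round(pretest_mean + gain, 3)


# Training group, /i/ vs /ɪ/: pretest mean and gains as quoted in the text.
pretest = 0.602
gains = {"posttest 1": 0.081, "posttest 2": 0.060}

implied = {test: implied_posttest_mean(pretest, g) for test, g in gains.items()}
for test, mean in implied.items():
    print(f"{test}: implied mean accuracy = {mean:.3f}")
```

Under these assumptions, the lowest-scoring contrast in the pretest would still remain well below the pretest level of a contrast such as /ə/ vs ø after training.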
Table 5: Mean response accuracy for the moderator variables.
N Mean St.dev. N Mean St.dev. N Mean St.dev.
Profile
Secondary Control  . .  . .  . .
school Training  . .  . .  . .
University Control  . .  . .  . .
Training  . .  . .  . .
Target contrast
/ɑ/ vs /aː/ Control  . .  . .  . .
Training  . .  . .  . .
/x/ vs /k/ Control  . .  . .  . .
Training  . .  . .  . .
/h/ vs ø Control  . .  . .  . .
Training  . .  . .  . .
/i/ vs /ɪ/ Control  . .  . .  . .
Training  . .  . .  . .
/ə/ vs ø Control  . .  . .  . .
Training  . .  . .  . .
4.2 Mixed-effects logistic regression model
As mentioned above, our presentation of the mixed-effects logistic regression
model will first focus on higher-order interactions involving the two moderator
variables (Profile and Target contrast), before turning to lower-order
interactions. Where specific post-hoc comparisons are not mentioned, they are not
statistically significant. In addition, lower-order interactions are not reported
if they are also involved in statistically significant higher-order interactions.⁴
Results are grouped according to the robustness variables New stimulus, New
speaker and Noise. While we also consider the participants’ performance on the
delayed posttest to be an indicator of the robustness of training effects, the
results for posttest 2 will be discussed alongside posttest 1 within the previously
mentioned sections.
4.2.1 New stimulus
No statistically significant effects were observed for the interaction between
Profile, Timing, Group and New stimulus (p > .05). A statistically significant
interaction was found between Target contrast, Timing, Group and New stimulus,
with post-hoc tests indicating that the control group and training group both
performed worse on new stimuli for the contrast /h/ vs ø in posttest 1 (Control:
Est. = .812, z = 4.834, p < .0001; Training: Est. = .533, z = 3.510, p = .011)
and in posttest 2 (Control: Est. = .489, z = 3.171, p = .0351; Training: Est. =
.582, z = 4.097, p = .001). In addition, a statistically significant difference was
found in the training group for /ə/ vs ø, where participants perceived new stimuli
more accurately at posttest 1 (Est. = .533, z = −3.479, p = .0127). For all other
contrasts and testing times, training effects were similar for new and familiar
stimuli, as no further statistically significant differences were found for
interactions with New stimulus.
⁴ The full output of the statistical analysis can be obtained by contacting the authors.
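The post-hoc results above give an estimate and a z value but no standard error. If z is taken to be a Wald statistic (estimate divided by its standard error), the standard error can be backed out directly. The sketch below does this for the four /h/ vs ø comparisons quoted above. The function name is hypothetical, and the Wald interpretation is an assumption, not something stated in the chapter.

```python
def standard_error(estimate: float, z: float) -> float:
    """Back out the standard error, assuming z = estimate / SE (Wald statistic)."""
    if z == 0:
        raise ValueError("z must be non-zero")
    return abs(estimate / z)


# /h/ vs ø, new vs familiar stimuli: (estimate, z) pairs quoted in the text.
reported = {
    ("Control", "posttest 1"): (0.812, 4.834),
    ("Training", "posttest 1"): (0.533, 3.510),
    ("Control", "posttest 2"): (0.489, 3.171),
    ("Training", "posttest 2"): (0.582, 4.097),
}

implied_se = {key: standard_error(est, z) for key, (est, z) in reported.items()}
for (group, test), se in implied_se.items():
    print(f"{group}, {test}: implied SE = {se:.3f}")
```

The absolute value is taken so that comparisons reported with a negative z still yield a positive standard error.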
4.2.2 New speaker
Turning to the interactions involving New speaker, no statistically significant
effects were found for the interaction between Profile, Timing, Group and New
speaker (p > .05), nor for the interaction between Target contrast, Timing, Group
and New speaker (p > .05). The three-way interaction between Timing, Group
and New speaker was not statistically significant either (p > .05), indicating that
potential training effects were observed for familiar and new speakers alike,
regardless of target contrast or participant profile.
4.2.3 Noise
As for the training’s effects across different levels of noise, a statistically
significant effect was found for the interaction between Profile, Timing, Group and
Noise. More specifically, post-hoc tests revealed that, at university level only, the
training group’s scores increased in the no-noise and high-noise conditions from
pretest to posttest 1 (No noise: Est. = .722, z = 6.385, p < .001; SNR 0: Est. = .456,
z = 3.685, p = .005) and from pretest to posttest 2 (No noise: Est. = .683, z = 6.096,
p < .001; SNR 0: Est. = .517, z = 4.125, p < .001), while the university students’
performance did not differ in a statistically significant way between pre- and
posttests in the SNR 8 condition (p > .05). No statistically significant training
effects were observed for the control group or the secondary school participants in
any of the noise conditions (p > .05).
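To give a sense of the size of these effects, the estimates can be converted to odds ratios, on the assumption that they are differences in log-odds, as is usual for post-hoc contrasts from a logistic model. The sketch below applies this conversion to the university training group's estimates quoted above. The function name is hypothetical, and the direction of each contrast (posttest relative to pretest) is likewise assumed.

```python
import math


def odds_ratio(log_odds_difference: float) -> float:
    """Convert a difference in log-odds into an odds ratio."""
    return math.exp(log_odds_difference)


# University training group: estimates quoted in the text, keyed by
# (noise condition, posttest compared with the pretest).
estimates = {
    ("No noise", "posttest 1"): 0.722,
    ("SNR 0", "posttest 1"): 0.456,
    ("No noise", "posttest 2"): 0.683,
    ("SNR 0", "posttest 2"): 0.517,
}

for (condition, test), est in estimates.items():
    print(f"{condition}, pretest vs {test}: odds ratio = {odds_ratio(est):.2f}")
```

On this reading, an estimate of .722 corresponds to roughly a doubling of the odds of a correct response, whereas an estimate of 0 would leave the odds unchanged.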
For our second moderator effect, Target contrast, the interaction between Tar-
get contrast, Timing, Group and Noise was revealed to be statistically significant.
Post-hoc tests will be discussed per target contrast. For the contrast /ɑ/ vs /aː/, no
statistically significant gains were observed from the pretest to the posttest in any
of the noise conditions for the control and training groups (p > .05). Training ef-
fects for /x/ vs /k/ were limited to the training group in the highest noise condition
only, both for pretest versus posttest 1 (Est. = .628, z = 3.884, p = .005) and for
pretest versus posttest 2 (Est. = .613, z = 3.774, p = .008).
As for the contrast /h/ vs ø, statistically significant differences were
observed for the training group between pretest and posttest 1 in the no-noise
condition only (Est. = .581, z = 3.663, p = .012). The training group also showed a
statistically significant increase in scores for the /i/ vs /ɪ/ contrast, albeit only
in the no-noise condition, for posttest 1 (Est. = .536, z = 4.980, p < .001) and
posttest 2 (Est. = .495, z = 4.622, p < .001).
Finally, for the contrast /ə/ vs ø, statistically significant training effects were
observed in the high-noise condition only. Importantly, these effects were not only
observed for the training group, which scored higher in posttest 1 (Est. = .667, z = 3.397,
p = .031) and posttest 2 (Est. = .852, z = 4.168, p = .002), but also for the control
group in posttest 1 (Est. = .0361, z = 3.408, p = .030), albeit in a very limited way.
5 Discussion
In this study, we set out to investigate how robust HVPT effects are on the per-
ception of Dutch contrasts for French-speaking learners of Dutch in Belgium
and to what extent the robustness of HVPT is moderated by learner-related fac-
tors and by the type of language feature. Robustness was measured along three
dimensions: (1) the generalizability of training effects to novel tokens and novel
talkers, (2) the duration of the impact of training, and (3) the effect of training
on listening in non-optimal conditions, i.e., with background noise. The results
show a nuanced picture: they reveal considerable variability in the effective-
ness of HVPT along most robustness variables, which can to a large extent be
attributed to the moderating variables that were investigated. We briefly dis-
cuss the most important findings that lead to this conclusion.
First, contrary to our hypothesis, training effects were observed for some con-
trasts only. In other words, the target contrast itself was revealed to play an impor-
tant role in the effectiveness of HVPT, as gains were not observed for every single
contrast, nor did gains manifest themselves in the same way when improvements
were observed. As noted in the Introduction, the earliest studies on HVPT focused
on only one consonant contrast (e.g. Bradlow et al. 1997; Pisoni, Lively, and Logan
1994). Studies that did look into more contrasts also found different results for spe-
cific contrasts (e.g. Rato 2014, who found generalization effects for four out of six
vowels). Interestingly, in the present study, we observed variability both for con-
trasts on which participants scored relatively low in the pretest and those which
posed fewer problems in the pretest. For instance, the lowest pretest scores were
observed for /ɑ/ vs /aː/ and for /i/ vs /ɪ/, but participants’ performance only in-
creased in a statistically significant way for the /i/ vs /ɪ/ contrast. We do not readily
have an explanation for this. Further research into the exact perceptual mapping
of Dutch vowels onto French ones by L2 learners may be needed to help explain
this result. Participants were also often able to improve their performance on those
contrasts where pretest scores were relatively high (e.g. /x/ vs /k/ and /h/ vs ø),
albeit not for every noise condition.
Secondly, we found training effects only in the university group, not in
the secondary school group. Since in most training studies the participants are
university students (e.g. Bradlow et al. 1997; Nishi and Kewley-Port 2007; Wang
and Munro 2004), we cannot readily compare this observation with most earlier
studies. One exception is Shinohara and Iverson’s (2021) study, which com-
pared perceptual training effects of English /l/-/r/ in Japanese adults, adoles-
cents and children. In contrast to the current study, their results revealed
higher gains from perceptual training in the adolescent group than in the adult
group. Crucially, their study included a wider variety of training tasks targeting
a single contrast only. In the current study, the lack of clear training
effects in the secondary school group may point to a lack of robustness of HVPT
across learner profiles, but the conditions of the training, which was organized
entirely online, may also have affected the participants’ performance,
especially that of the secondary school pupils (see below). However, for
those contrasts where gains were observed, these generally applied to novel
and familiar stimuli (with the exception of two contrasts) and to novel and fa-
miliar speakers alike, as predicted in our hypotheses. This is in line with earlier
studies using the HVPT design, which generally report generalization across
stimuli and talkers. In general, the strongest training effects were observed for
the university group, across all three robustness indicators (generalizability to
novel stimuli and novel speakers, long-term effects and effects in quiet and
high-noise conditions), with the exception of the low-noise condition. This sug-
gests that, to the extent that HVPT effects can be observed, they appear to be
quite robust.
Thirdly, the generalizability of training effects to different noise conditions
was revealed to be one of the main sources of variability in the data, with training
effects for some target contrasts, such as /ə/ vs ø, only being observed in high-
noise conditions. This implies that, while the high pretest scores for this contrast
may not have characterized this as a high-priority target for pronunciation train-
ing, learners still experienced benefits in more adverse listening conditions.
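The noise conditions are labelled by signal-to-noise ratio (SNR 0 for high noise, SNR 8 for low noise). Assuming these values are in decibels, SNR 0 means speech and noise have equal power, while SNR 8 means the speech is about 6.3 times more powerful than the noise. The sketch below illustrates how a noise signal could be scaled to reach a target SNR before being added to speech. It is a generic illustration with hypothetical function names and synthetic signals, not the Praat script used in the study.

```python
import math


def power(samples):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in samples) / len(samples)


def noise_scale_for_snr(speech, noise, target_snr_db):
    """Factor to multiply the noise by so that the speech-to-noise power
    ratio equals target_snr_db (in decibels)."""
    target_ratio = 10 ** (target_snr_db / 10)
    return math.sqrt(power(speech) / (power(noise) * target_ratio))


def mix(speech, noise, target_snr_db):
    """Add noise to speech at the requested SNR."""
    k = noise_scale_for_snr(speech, noise, target_snr_db)
    return [s + k * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noise):
    """SNR in decibels of a speech signal relative to a noise signal."""
    return 10 * math.log10(power(speech) / power(noise))


# Synthetic stand-ins for a speech token and a noise recording.
speech = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [0.3 * math.sin(2 * math.pi * 37 * t / 200 + 1.0) for t in range(200)]

for target in (8.0, 0.0):
    k = noise_scale_for_snr(speech, noise, target)
    scaled = [k * n for n in noise]
    print(f"target {target} dB -> measured {measured_snr_db(speech, scaled):.2f} dB")
```

Scaling the noise rather than the speech keeps the level of the target word constant across conditions, which is one possible design choice among several.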
In sum, the results suggest that considering training effectiveness from
multiple angles yields a nuanced picture of its robustness. However, the results
should also be interpreted with caution. One limitation of the study was that it
was carried out in a context where the control over experimental conditions
was limited. As participants took part in the study at home, it was impossible to
verify, for instance, the absence of background noise or whether they followed
the instructions closely (e.g. wore headphones). This may not be problematic,
as the results can be taken to be representative of a particular type of online,
individual training, but it is important to bear in mind when interpreting the
results. In addition, we realized that the materials used in the HVPT training
would have been more familiar to the university students than to the secondary
school pupils. From their linguistics courses, the university students would
have been familiar with explanations of vowel duration, waveforms or
cross-sections of the articulators. The reliance on metalinguistic knowledge
typically used in HVPT studies, which as noted tend to focus on the university
population, may thus help to explain the absence of gains in the secondary school
group.
As a final point in the discussion, leading to suggestions for future re-
search, we would like to bring up the issue of the ecological validity of HVPT
training studies, including the study we are reporting on in this chapter. Two
properties of the current study in particular raise doubts about the usefulness
of the training outside of scientific research.
First, training effects, even when statistically significant, were generally
limited, and a number of contrasts posed relatively few problems to the
learners. The focus on the perception of isolated words may help to explain the
high scores for some contrasts, and more natural contexts and production tasks
may paint a different picture. This relates to the more general question of how large
gains should be for a training to be useful in a real learning context. Conceiv-
ably, the answer to that question is not straightforward, but depends on the
pronunciation targets set by learners and teachers, as well as on the amount of
time available to the learners.
Secondly, the participants undoubtedly experienced weariness from extended
remote learning, which may have impacted their motivation when participating in
the study. For some participants, this may have been compounded by the repeti-
tiveness of parts of the training and tests (esp. the forced-choice tasks and the noise
conditions), but also by occasional technical difficulties, which were reported more
often in the secondary school group. While an exploration of the effects of partici-
pant motivation on HVPT effectiveness (esp. if pronunciation instruction is to be
integrated in regular foreign language classrooms) is undoubtedly an interesting
avenue of research to pursue, it also means the results of the present study need to be
interpreted with the necessary caution.
We therefore echo Wang and Munro (2004) when we point out that we do
not claim to have developed a pronunciation training programme that is
suitable for teachers of Dutch as a Foreign Language, even if the design of the
training was directly inspired by existing pronunciation manuals for L2 Dutch. As
Wang and Munro (2004: 551) note, the development of such a software training
package would require collaboration with pedagogical specialists as well as
with technical experts on, for instance, user-friendly interfaces. Future projects
may well aim at such collaborations, given the limited materials that are now
available for Dutch pronunciation training for French learners.
6 Conclusion
On the whole, the current study makes a nuanced contribution to a growing
body of evidence revealing HVPT to be an effective paradigm for pronunciation
training. The focus on a variety of robustness indicators suggests that, where
training effects are found, learners are able to generalize these to new contexts
and new speakers.
Since one of the main factors found to affect the success of the training was
the participants’ profile, here operationalized mainly in terms of age and educa-
tional level, future research ought to consider how the design of HVPT can be
tailored to different learner profiles, including a focus on younger learners. In-
structed foreign language learning often starts in secondary schools, when
learners are aged between 12 and 18, or earlier. As such, it would be worthwhile
for future pronunciation training studies to shift the focus from university stu-
dents to younger children and adolescents.
Importantly, this study also set out to explore the effectiveness of pronuncia-
tion training in a hitherto underexplored setting, namely Dutch as a second lan-
guage in a French-speaking community. By focusing on a foreign-language-
learning context which is very widespread in French-speaking Belgium and to a
significant extent embedded in the secondary school curriculum, the study also
contributes empirical evidence which is hoped to support the development of
pedagogical materials in Dutch language education in French-speaking Belgium.
References
Aliaga-García, Cristina & Joan C. Mora. 2009. Assessing the effects of phonetic training on L2
sound perception and production. In Michael A. Watkins, Andreia S. Rauber & Barbara O.
Baptista (eds.), Recent Research in Second Language Phonetics/Phonology: Perception
and Production, 2–31. Newcastle upon Tyne, UK: Cambridge Scholars Publishing.
Alshangiti, Wafaa & Bronwen G. Evans. 2014. Investigating the domain-specificity of phonetic
training for second-language learning: Comparing the effects of production and
perception training on the acquisition of English vowels by Arabic learners of English.
In Susanne Fuchs, Martine Grice, Anne Hermes, Leonardo Lancia & Doris Mücke (eds.),
Proceedings of the 10th International Seminar on Speech Production, Cologne, Germany,
5–8 May 2014. https://www.researchgate.net/publication/262635724 (accessed
09 June 2021).
Anthony, Jason L. & David J. Francis. 2005. Development of phonological awareness. Current
Directions in Psychological Science 14(5). 255–259. https://doi.org/10.1111/j.0963-
7214.2005.00376.x (accessed 14 June 2021).
Archibald, John. 2021. Ease and difficulty in L2 phonology: A mini-review. Frontiers in
Communication 6(18). 626529.
Best, Catherine T. 1994. The emergence of native-language phonological influences in
infants: A perceptual assimilation model. In Judith C. Goodman & Howard C. Nusbaum
(eds.), The Development of Speech Perception: The Transition from Speech Sounds to
Spoken Words, 167–224. Cambridge, MA: The MIT Press.
Best, Catherine T. & Michael D. Tyler. 2007. Nonnative and second-language speech
perception: Commonalities and complementarities. In Murray J. Munro & Ocke-Schwen
Bohn (eds.), Language Experience in Second Language Speech Learning: In Honor of
James Emil Flege, 13–34. Amsterdam: John Benjamins.
program]. Version 6.1, retrieved from http://www.praat.org/ (accessed 14 June 2021).
Bradlow, Ann R., David B. Pisoni, Reiko Akahane-Yamada & Yoh’ichi Tohkura. 1997. Training
Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning
on speech production. Journal of the Acoustical Society of America 101(4). 2299–2310.
Cutler, Anne, Andrea Weber, Roel Smits & Nicole Cooper. 2004. Patterns of English phoneme
confusions by native and non-native listeners. Journal of the Acoustical Society of
America 116(6). 3668–3678.
Dörnyei, Zoltán. 2009. Individual differences: Interplay of learner characteristics and learning
environment. Language Learning 59(s1). 230–248.
Bates, Douglas, Martin Maechler, Ben Bolker & Steve Walker. 2015. Fitting linear mixed-effects
models using lme4. Journal of Statistical Software 67(1). 1–48. doi:10.18637/jss.v067.i01.
Ellis, Rod. 2015. Understanding Second Language Acquisition, 2nd edn. Oxford: Oxford
University Press.
Escudero, Paola, Ellen Simon & Holger Mitterer. 2008. The perception of English front vowels
by North Holland and Flemish listeners: Acoustic similarity predicts and explains
cross-linguistic and L2 perception. Journal of Phonetics 40(2). 280–288.
Flege, James E. 1995. Second-language speech learning: Theory, findings, and problems. In
Winifred Strange (ed.), Speech Perception and Linguistic Experience: Issues in
Cross-language Research, 229–273. Timonium, MD: York Press.
Flege, James E. & Ocke-Schwen Bohn. 2021. The revised Speech Learning Model (SLM-r). In
Garcia-Lecumberri, Maria Luisa, Martin Cooke & Anne Cutler. 2010. Non-native speech
perception in adverse conditions: a review. Speech Communication 52(11). 864–886.
Hamers, Josiane F. & Michel A. H. Blanc. 2000. Bilinguality and Bilingualism. Cambridge:
Cambridge University Press.
Hazan, Valerie, Anke Sennema, Midori Iba & Andrew Faulkner. 2005. Effect of audiovisual
perceptual training on the perception and production of consonants by Japanese learners
of English. Speech Communication 47(3). 360–378.
Hiligsmann, Philippe & Laurent Rasier. 2007. Uitspraakleer Nederlands voor Franstaligen
[Dutch pronunciation for French speakers]. Waterloo: Wolters Plantyn.
Housen, Alex & Hannelore Simoens. 2016. Introduction: Cognitive perspectives on difficulty
and complexity in L2 acquisition. Studies in Second Language Acquisition 38(2). 163–175.
Lancaster University. (n.d.) Dialang. https://dialangweb.lancaster.ac.uk (accessed
25 May 2021).
Lengeris, Angelos & Katerina Nicolaidis. 2015. Effect of phonetic training on the perception of
English consonants by Greek speakers in quiet and noise. Proceedings of Meetings on
Acoustics (POMA), 22. 060002.
Leong, Christine Xiang Ru, Jessica M. Price, Nicola J. Pitchford & Walter J. van Heuven. 2018.
High variability phonetic training in adaptive adverse conditions is rapid, effective, and
sustained. PLoS ONE 13(10). e0204888. https://doi.org/10.1371/journal.pone.0204888
Lima Jr., Ronaldo. 2019. A dynamic account of the development of English (L2) vowels by
Brazilian learners through communicative teaching and through explicit instruction, see
Chapter 6, this volume.
Logan, John S., Scott E. Lively & David B. Pisoni. 1991. Training Japanese listeners to identify
English /r/ and /l/: A first report. Journal of the Acoustical Society of America 89(2).
874–886.
Mattys, Sven, Matthew H. Davis, Ann R. Bradlow & Sophie K. Scott. 2012. Speech recognition
in adverse conditions: A review. Language and Cognitive Processes 27(7/8). 953–978.
McCloy, Daniel. 2013. Mix speech with noise [Praat script]. https://github.com/drammock
Mettewie, Laurence. 2015. Apprendre la langue de “l’Autre” en Belgique: la dimension
affective. Le Langage et l’Homme 50(2). 23–42.
Mettewie, Laurence. 2021. Wordt Nederlands een verplicht vak in Wallonië? [Will Dutch
become a compulsory subject in Wallonia?]. Neerlandia 124(1). 30–31. https://www.anv.
nl/tijdschrift/inhoudsopgaven/2020-1/wordt-nederlands-een-verplicht-vak-in-wallonie/
Moyer, Alene. 1999. Ultimate attainment in L2 phonology: The critical factors of age,
motivation, and instruction. Studies in Second Language Acquisition 21(1). 81–108.
Nishi, Kanae & Diane Kewley-Port. 2007. Training Japanese listeners to perceive American
English vowels: Influence of training sets. Journal of Speech, Language and Hearing
Research 50(6). 1496–1509.
Peltekov, Peter. 2020. The effectiveness of implicit and explicit instruction on German L2
learners’ pronunciation. Die Unterrichtspraxis/Teaching German 53(1). 1–22.
Pisoni, David B., Scott E. Lively & John S. Logan. 1994. Perceptual learning of nonnative
speech contrasts: Implications for theories of speech perception. In Judith C. Goodman &
Howard C. Nusbaum (eds.), The Development of Speech Perception: The Transition from
Speech Sounds to Spoken Words, 121–166. Cambridge, MA: The MIT Press.
Rato, Anabela. 2014. Effects of perceptual training on the identification of English vowels by
native speakers of European Portuguese. Concordia Working Papers in Applied
Linguistics 5. 529–546.
R Core Team. 2020. R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria. https://www.R-project.org/ (accessed
07 June 2021).
Lenth, Russell V. 2020. emmeans: Estimated Marginal Means, aka Least-Squares Means. R
package version 1.5.3. https://CRAN.R-project.org/package=emmeans (accessed
07 June 2021).
Saito, Kazuya & Luke Plonsky. 2019. Effects of second language pronunciation teaching
revisited: A proposed measurement framework and meta-analysis. Language Learning
69(3). 652–708.
research. Applied Psycholinguistics 39(1). 187–224.
Shinohara, Yasuaki & Paul Iverson. 2021. The effect of age on English /r/-/l/ perceptual
training outcomes for Japanese speakers. Journal of Phonetics 89. 101108. https://doi.
org/10.1016/j.wocn.2021.101108 (accessed 02 May 2022).
Thomson, Ron I. 2018. High Variability [Pronunciation] Training (HVPT): A proven technique
about which every language teacher and learner ought to know. Journal of Second
Wang, Xinchun & Murray J. Munro. 2004. Computer-based training for learning English vowel
contrasts. System 32(4). 539–552.
Williams, Daniel & Paola Escudero. 2014. Native and non-native speech perception. Acoustics
Australia 42(2). 79–83.
Pollianna Milan, Denise Cristina Kluge
Effects of perceptual training
in the perception and production of
heterotonics by Brazilian learners of Spanish
Abstract: In this longitudinal study, we investigated the effectiveness of
perceptual training in the perception and production of heterotonics by Brazilian
learners of Spanish. Twenty-six participants were divided into four groups:
those with less academic exposure to Spanish formed the ‘basic’ group, subdivided
into a basic group with training and one without; those with more academic
exposure to Spanish formed the ‘intermediate’ group, likewise subdivided into an
intermediate group with training and one without. All participants took a
pre-test, post-test, generalization test and delayed post-test (between 42 and 58
days after the training sessions), for both perception and production.
Participants who trained were compared with those who did not in order to
determine whether training improved test performance. In all the perception
tests, the participants had to identify the stressed syllable of the heterotonic
words and of the distractors. The results showed a positive effect of perceptual
training on the perception and production of heterotonics for the group that
trained and had less academic experience. As this study follows the principles of
Complex Systems, the analysis of the results needed to include not only
intergroup comparisons but also individual comparisons of the participants. In
the individual analyses, we concluded that the learners who benefited most from
training were those with less academic experience and more difficulties at the
beginning of the study, i.e., in the pre-tests.
Keywords: perceptual training, heterotonics, Brazilian learners of Spanish,
complex systems, high variability perceptual training
Pollianna Milan, Federal University of Paraná

Denise Cristina Kluge, Federal University of Rio de Janeiro
https://doi.org/10.1515/9783110736120-013
1 Introduction
The main objective of this longitudinal¹ study is to investigate the effects of
perceptual training on the development of heterotonics² of Spanish by Brazilian
speakers. In addition, we seek to analyze if such effects will occur both in per-
ception and in production and whether or not they will be long-lasting. To our
knowledge, there is no previous research on perceptual training of phonetic/
phonological features of Spanish with Brazilian learners, especially regarding
stress assignment, i.e., at the suprasegmental level. For this reason, this is an
original study intended to provide new insights into the field. We believe that a
language is developed according to the principles of Complex Systems; there-
fore, we propose a new approach to the analysis of the results, i.e., on a more
individual basis.
As perceptual training is based on learning a language through use and,
consequently, through repetition, we begin this chapter by reflecting on a
quote by Morin (1990: 112): “Combine the cause and the effect, and the effect
will return to the cause, by retroaction, and the product will also be the
producer.” [“Juntai a causa e o efeito, e o efeito voltará sobre a causa, por
retroação, o produto será também o produtor.”] This statement summarizes, from
our point of view, how a foreign language is used: when exposed to an unfamiliar
target language, learners acquire the status of producers of that language, and
are no longer mere observers (i.e., a product of it). In other words, they become
individuals who will use such language to communicate. The cause (learning a
language) and the effect (using the language learned) blend and complement
each other, because every time individuals use a language, they also develop it,
in a cyclical cause-effect and effect-cause relationship. Thus, depending on
how individuals use a language, categories are stored in their cognitive system;
this way, they can be reused whenever necessary (Bybee 2010). This also means
that the more this target language is used, the stronger this cause-and-effect
and effect-and-cause relationship becomes.
Another point to ponder is that a language is developed in particular contexts
that cannot be overlooked. Consequently, there is a dynamic interaction that
involves adaptations (also at a personal level) in the teaching-learning
process, as postulated by Larsen-Freeman and Cameron (2008: 34): “Every change
in a system is influenced by context. Thus, when a person walks across a field,
every moment of walking involves adaptation by the body to the context, in the
form of the ground surface and all that is seen or noticed.” Along these lines,
we argue that an individualized analysis of training data cannot be neglected,
as context – individual context included – matters.
¹ We consider this a longitudinal study due not only to the training sessions but
also to all the tests involved, which were administered over approximately four
months.
² Heterotonics are words from two similar languages, with similar or identical
spelling, but stress on a different syllable.
Therefore, our study is underpinned by the assumptions of Complex Sys-
tems, and also by Usage-based Phonology, to test our hypothesis that when Bra-
zilian learners of Spanish as a foreign language are frequently exposed to
heterotonic words, these learners will develop such words in their linguistic sys-
tem after hearing, perceiving and/or using them. Larsen-Freeman and Cameron
(2008) help to support our thesis by stating that when a new lexical item is used
by individuals, through interaction and adaptation, it establishes itself in the
language as something that will be active in their mind in the long run. Admit-
tedly, perceptual training is based on the repeated and frequent exposure to par-
ticular items – phonological ones, in the case of the present study, that is, items
from a more stable model of language. Still, this does not mean that the out-
comes of such language development will not be part of a Complex System, as
previously explained by Larsen-Freeman and Cameron (2008: 199): “(. . .) even
if a frozen or stabilized version of the language is used in a syllabus, grammar
book, and test, as soon as the language is ‘released’ into the classroom or into
the minds of learners it becomes dynamic” (emphasis added by the authors). In
addition, because heterotonics are words that are phonetically and semantically
similar, in a contrast between Spanish and Brazilian Portuguese, they will probably
be developed according to what Bybee (2010) calls analogy and similarity,
i.e., synonyms or near-synonyms are initially attracted to the same category,
which in itself is a problem for learners, as it can generate noise in communication.
Based on this assumption, exposure to correct input³ through perceptual
training can lead second language learners to create a specific category in their
internalized grammar for heterotonics, because training itself will provide the
correct input, which can make learners aware of the fact that these words are
stressed in a specific manner.
3 The term ‘input’ is used by linguists to designate learners’ exposure to the language that
they intend to develop. Research on foreign language development has been attempting to
explain how learners process the input that they receive (Rast 2011). However, according to Ellis
(1985), not all input is processed by speakers, either because they did not understand
part of it or because they did not pay attention to it.
For perceptual training to be successful, and for input to be developed, researchers
use several tools in an attempt to make training increasingly effective,
as will be explained in the Methodology. However, as pointed out by Rast
(2011), input processing depends on factors that cannot always be controlled,
e.g., accidental or subliminal perception. According to Henshaw (2011), another
factor comes into play in perceptual training, namely the issue of attention.
Moreover, Goldstone and Byrge (2005) argue that perception can be learned;
however, there is a much stronger relationship between experience and percep-
tion: individuals with different experiences, contexts and/or training may have
strikingly different perceptions even if exposed to the same sensory input. For
the authors, “this raises important issues about the ontology of sensory experi-
ence, the relationship between cognition and perception, and the possibility of a
theory-neutral perceptual ground for science” (Goldstone and Byrge 2005: 01).
Despite having addressed different questions, for example, whether our percep-
tions depend on our beliefs, the authors argue that equivalent training sessions
can equalize perceptual differences. The reason lies in the fact that, although
there are marked differences in the perceptual processes of individuals, owing to
their particular experiences, the process by which perceptual systems change
with experience is widely shared among individuals. Thus, perceptual training
sessions can be positive, at least if learners are attentive to what they need to
develop. That is the reason why perceptual training experiments are still quite
diverse in the way the training itself is carried out: researchers have been trying
to find a model that can enhance learning capability. Therefore, we will thor-
oughly explain our methodology, which proposes an analysis of results while fol-
lowing the assumptions of language development according to the Theory of
Complex Systems.
2 Methodology
Our study used a corpus of 115 heterotonic words,⁴ i.e., those that, in a comparison
between two languages (in this case, between Brazilian Portuguese⁵ and
Spanish), differ in the position of the stressed syllable. In some cases, for example,
the word is paroxytonic in Portuguese (as in atmosfera⁶) but proparoxytonic
in Spanish (as in atmósfera) [atmosphere]. There are, however, a large
number of words that fall under the following rule: both words are paroxytonic
and end in /ia/. The difference is that in Portuguese the vowel sequence /i-a/ is in
different syllables, forming a hiatus to which stress is assigned, whereas in Spanish
this vowel sequence is a diphthong, and stress falls on the previous syllable,
for example, in the Spanish words a-ne-mia [anaemia], bi-ga-mia [bigamy], fo-bia
[phobia] and or-to-pe-dia [orthopedics]. An exception to this group is the word po-lí-cia
[police], a paroxyton in Portuguese that ends in a diphthong, and po-li-cí-a, a paroxyton
in Spanish that ends in hiatus. The 115 heterotonics were distributed across the
tests (pre-test, post-test, delayed post-test and generalization) and the two training
sessions. The tests also had 30 distractors, selected among words that are usually
well known to speakers of Spanish as a second language, since they are
used in everyday life (also in Portuguese) and frequently appear in textbooks,
such as cultura [culture] and salida [exit].
4 When preparing the list of heterotonic words arising from the stress contrast between Brazilian
Portuguese and Spanish, we found 155 examples; however, the corpus was left with 115
items because 40 of them had to be discarded. One of the reasons was that some of the
words could be pronounced in more than one way, and one of these pronunciations
was the same as in Portuguese. For example, the Spanish word for penalty can be
stressed on either of two syllables: pénalti or penalti (the first stress pattern
also occurs in Brazilian Portuguese). For the list of heterotonics that were discarded and
the respective reasons, see Milan (2019) (the stressed syllable of each word was underlined for
easier identification).
5 In this chapter, all mentions of Portuguese refer exclusively to Brazilian Portuguese.
The corpus for the perception tests was recorded by eight speakers (a charac-
teristic of training methods with high variability) whose mother tongue was
Spanish: four of them were Mexican and their phrases were used in the pre-test,
post-test and delayed post-test, and in the two training sessions. Another four
speakers (two Hondurans and two Cubans) were recorded and their phrases were
used in the perception generalization test. The recordings with the speakers con-
sisted of reading aloud phrases that were displayed on a computer screen, e.g.,
“Yo dije atmósfera” [I said atmosphere]. To facilitate editing, we chose to insert
the words that we needed to create the perceptual tests in the carrier sentence
“Yo dije ______” [I said ______]. After the test items had been recorded by the
speakers, the perception tests were created in the TP software⁷ (Rauber et al.
2013) and validated by four speakers of Spanish as their mother tongue (other
than the speakers who had recorded the words) to find out if there was any mis-
take in the creation of the tests before they were administered.
6 All stressed syllables of the example words were underlined for easier identification during
reading.
7 The TP software, which was developed for the design and application of perceptual training
and testing, is available free of charge at <www.worken.com.br>.
All the learners who participated in this study spoke Brazilian Portuguese as
their mother tongue and studied Spanish as a foreign language. They were enrolled
in an undergraduate degree in Spanish at the Federal University of Paraná
and attended classes on a regular basis. The 26 participants attended two different
undergraduate courses: (i) 17 of them were attending the course ‘Spanish
Language 1’; they had had 90 hours of exposure to Spanish at the beginning of the
tests and 180 hours by the end of data collection. This group was
referred to as the basic group; (ii) nine of them were attending the course ‘Spanish
Language 3’; they had had 270 hours of exposure to Spanish at the beginning of the tests
and 360 hours by the end of data collection. This second group was
referred to as the intermediate group. The 14 informants (10 from the basic group
and four from the intermediate group) who participated in the perceptual training
sessions were randomly selected. Therefore, there were four groups: a basic group
with training (10 participants); a basic group without training (seven participants);
an intermediate group with training (four participants); and an intermediate group
without training (five participants). All informants signed an informed consent form
in which they confirmed their agreement to participate in the research. They were
aware of the fact that there would be no financial compensation⁸ and that they
would not be identified. Because this research focuses on Spanish word stress,
we checked whether the participants had had classes on Spanish word
stress placement before and/or during data collection. All learners had had an expository
lesson and done exercises on Spanish word stress assignment, which means that they
were expected to know how to pronounce heterotonics. In addition, the participants
who had been having classes for a longer time, i.e., those from the intermediate
group, had been taught the stress of 58 heterotonics that were used in the tests. In
other words, this group was more familiar with such heterotonics.
2.1 Procedures
Our perceptual training study followed the standard of testing that is common to
this type of research. Before starting the training sessions, all informants took the
production pre-test and then the perception pre-test, which respectively assessed
the pronunciation and the perception of heterotonics. The pre-tests were used to
assess whether the participants knew, produced and perceived the heterotonics
adequately before our intervention, and also to keep track, from
the beginning, of the heterotonics whose stressed syllable they failed to produce
and/or perceive according to expectations.
8 In Brazil, researchers are not allowed to offer any financial compensation to people who
participate in scientific/academic research.
Next, 14 of the informants underwent the two perceptual training
sessions (explained later in this section) for the purpose of comparison
with the other 12 informants, who did not receive perceptual training. The following
step was to administer the same tests again after training, which is the reason
why they are called post-tests (for both production and perception). Together with
the post-tests, the production and perception generalization tests, which contained
new heterotonics not yet seen in the other tests or in the training sessions,
were administered to all participants. Finally, between 42 and 58 days⁹ after the
last perceptual training session, the production and perception delayed post-tests
(identical to the pre-tests) were administered to find out if the informants had
retained, in the long term, what they may have learned in the training sessions. The
study was conducted¹⁰ between August 23 and November 24, 2017. In the analysis
of results, the data collected in each of the tests were compared.
For the production tests (the pre-test, the post-test and the delayed post-test
were the same), the informants were taken individually to a soundproof room for
recording, on scheduled days. They read carrier sentences inserted in PowerPoint
slides and displayed on a computer screen. These sentences contained heterotonics
and distractors and had the same format as the phrases read by the
speakers recorded for the perception tests (whose recordings were then edited, because
only the target words were used in the perception tests). In total, each participant
read 40 heterotonics and 20 distractors inserted in carrier sentences in each test.
After each production test, the participants took¹¹ the respective perception
test. On the day the perception tests were administered, the entire class was
taken to the computer lab and each participant took the test on an
individual computer. The perception tests (the pre-test, post-test and delayed
post-test were identical) contained the same 40 heterotonics and the same 20
distractors used in the production tests. The difference was that instead of
pronouncing the sentences, the participants listened (in the TP software) to the
four Mexican speakers pronouncing the target words in isolation, and they
were expected to click on the stressed syllable in each word that they heard.¹²
The participants were told that they could hear each word up to 10 times before
answering by clicking on the ‘repeat’ button.
9 Data collection took place on different days because not all groups of students were able to
take the tests in the same weeks, as recess and exams had already been scheduled on the
university calendar.
10 The research calendar can be seen in detail in Milan (2019).
11 On average, the perception tests were carried out 20 days after their respective production
tests.
12 Before starting the perception tests, the participants were instructed to answer the question
“what is the number of the strong (stressed) syllable of the word that you heard?”. There
were four answer options, from button one to button four. After that, the participants were
given an example: if they heard the word ‘árboles’ [trees], they should mentally divide that
word into syllables, ár-bo-les, and then click on the button corresponding to the stressed
syllable. They should bear in mind that, for this test, syllables had to be counted from the
beginning of the word, that is, the first syllable was ‘ár’, the second was ‘bo’ and the third was ‘les’.
The two training sessions (administered after the pre-tests) took place on
different days and involved only part of the participants. The participants who did
not take part in the training sessions received unrelated input. The training sets were
composed of 56 heterotonics that had not been included in any of the produc-
tion and perception tests: 29 were used in the first training session and 27, in
the second. Precisely because the sets were created for training purposes, they
contained no distractors. The two sessions were also set up in the TP software
but, unlike the perceptual tests, the training sessions provided an answer (im-
mediate feedback) to each choice of stressed syllable made by the participants.
This means that the software pointed out whether the chosen answer was cor-
rect or incorrect. Whenever it was incorrect, the software clearly indicated the
mistake and immediately showed what the correct syllable was. The partici-
pants were supposed to hear the stimulus again and then click on the correct
answer, as pointed out by the software, so that they could move on to the next
stimulus. The generalization tests (which contained 19 new heterotonics and 10
new distractors) were administered together with the post-tests, i.e., the production
generalization test was randomly combined with the production post-test.
The same applied to the perception generalization test
(in this case, with new speakers, two Hondurans and two Cubans).
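As a quick consistency check on the design described in this section, the item counts reported in the text can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary keys and variable names are ours, and the figures are those stated in the Methodology (40 test heterotonics, 29 and 27 training heterotonics, 19 generalization heterotonics, and 20 plus 10 distractors).

```python
# Item counts as reported in the Methodology (labels are illustrative only).
heterotonics = {
    "tests (pre, post, delayed post)": 40,
    "training session 1": 29,
    "training session 2": 27,
    "generalization tests": 19,
}
distractors = {
    "tests (pre, post, delayed post)": 20,
    "generalization tests": 10,
}

total_heterotonics = sum(heterotonics.values())
total_distractors = sum(distractors.values())

print(total_heterotonics)  # 115
print(total_distractors)   # 30
```

The totals agree with the 115-word corpus and the 30 distractors mentioned earlier in the Methodology.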
2.2 Data analysis
Although there is a general consensus on the requirements of traditional research
in terms of choosing parametric or non-parametric statistics and other conventions,
the method for analyzing research data under a Complex Systems approach is not
yet defined. Therefore, in this section, we present a proposal for analysis following
what Lowie (2017) suggested in his article on methodological issues: the development
of a second language in a longitudinal study, like this one, cannot be
considered identical for different individuals. Thus, the data collected
from the 26 participants distributed into four groups¹³ were evaluated from the
traditional perspective, in inter-group and intra-group statistical comparisons,¹⁴
and we will also discuss some individual issues within groups. According to Lima
Júnior (2016), a group may reveal trends of emergence of self-organization patterns
in the system, but we can only actually observe development when we look at the
individual level, because it is very idiosyncratic. For the author, “different
learners cannot be treated as being equal just because they share the same characteristic,
be it the level of knowledge determined by a placement test, the number
of academic semesters they studied the L2 [the second language] for, or the age
group” [“não é possível tratar aprendizes diversos como sendo iguais só porque
compartilham uma mesma característica, seja ela o nível que um teste determinou,
o número de semestres que estudaram a L2 [segunda língua] ou a faixa etária”]
(Lima Júnior 2016: 206).
13 The four groups are: those with less academic exposure to Spanish, called the ‘basic’
group, divided between the basic group with and without training; and those who had more
academic exposure to Spanish, called ‘intermediate’, divided between the intermediate group
with and without training.
14 The data were analyzed using non-parametric statistical tests, with a significance level of
p ≤ 0.05; in addition, for each group, we report the average percentage of correct answers and the
standard deviation. For the post hoc tests, we applied the Bonferroni correction
(p ≤ 0.008). Importantly, we only report the values of the tests that were statistically
significant.
15 Every participant whose rate of correct answers was the same as the average rate of the
group, or up to 10% above or below that value, was considered to be
within the average. The others were considered to be outliers.
To this end, we propose another way to observe the results, focusing on the
individual level. Initially, we will show the individual performance of the participants
in each group and we will highlight the learners who tended to perform
above or below the average level of the group.¹⁵ After that, we will analyze
how the participants performed in terms of the percentage variation of correct answers
between the pre-tests and the delayed post-tests in both production and perception.
Although research studies usually deal with percentage variation
only by subtracting the rate of correct answers in one test from that in the other, this
calculation seems inadequate. Amorin (2016), Boggiss et al. (2012), Matias
(2015), and Iezzi, Hazzan, and Degenszajn (2004) explain that percentage variation
should be calculated according to the formula ((y/x) − 1) × 100, where ‘x’ is the
initial value, in our case the pre-test, and ‘y’ is the final value, i.e., the delayed
post-test. In practice, if a learner correctly answers 10% of a test and then has
15% of correct answers on that same test in another assessment, it does not
mean that he or she increased the number of correct answers by 5% (a figure obtained only by
subtracting 10% from 15%); after all, 5% more than 10%, which is the first rate of
correct answers, is equal to 10.5%, rather than 15%. To measure how much this
learner increased his or her percentage of correct answers, using the formula
shown above, the value is 50%, because first the learner had 10% of correct
answers and, in a second assessment, he or she answered half as many items again
correctly, that is, 50% more. For this reason, our assessment of individual
results has focused mainly on learners who had more difficulty (whose rate of
correct answers was always below the average rate of the group) and on learners
who presented greater percentage variation from the pre-test to the delayed
post-test. We believe that, by calculating percentage variation, we can find out
if these learners with more difficulty were also the ones that presented the highest
percentage variation, that is, if they were the ones that possibly benefited
the most from the training sessions.
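The distinction between a simple difference and a percentage variation can be illustrated with a minimal Python sketch. The function names are ours and purely illustrative; the formula is read here as (final / initial − 1) × 100, which is the reading that reproduces the worked example in the text (10% in the first assessment, 15% in the second, a 50% increase).

```python
def percentage_variation(initial: float, final: float) -> float:
    """Relative change from `initial` to `final`, expressed in percent."""
    if initial == 0:
        # The ratio is undefined when the initial rate is zero.
        raise ValueError("percentage variation is undefined for an initial value of 0")
    return (final / initial - 1) * 100


def simple_difference(initial: float, final: float) -> float:
    """Plain subtraction of the two rates, in percentage points."""
    return final - initial


pre_test, delayed_post_test = 10.0, 15.0
print(round(percentage_variation(pre_test, delayed_post_test), 1))  # 50.0
print(simple_difference(pre_test, delayed_post_test))               # 5.0
```

Note that the relative measure is undefined for a learner whose initial rate is 0%, a case that would need separate treatment.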
In this study we determined the correct answers of the production tests by
ear, since an acoustic analysis to define the stressed syllables produced by
the informants, e.g., based on fundamental frequency, duration and intensity cues, was
not productive. When we were in doubt as to which syllable had received
stress, we showed those words to two other raters.¹⁶ We could not recognize, by
ear, stress placement in less than 1% of the total production. Thus, for the
production, the statistical analysis was based on the total of 3614 target words.
As for perception, the analysis was based on the total of 14456 responses: 160
responses for each of the 26 participants in the three tests (pre-test, post-test
and delayed post-test), and 76 responses for each of the 26 participants in the
generalization test. The rates of correct answers in the perception tests were
provided by the TP software itself, which, at the end of the tests, created a table
with the correct answers in each test and for each participant.
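The totals reported above can be reproduced from the design figures given in the Methodology. The sketch below is only a consistency check with illustrative variable names; it assumes (our inference, not stated explicitly in the text) that the 160 and 76 perception responses per participant correspond to each heterotonic being heard once from each of four speakers.

```python
participants = 26
speakers_per_perception_test = 4   # four recorded speakers per perception test
test_heterotonics = 40             # pre-test, post-test and delayed post-test
generalization_heterotonics = 19
repeated_tests = 3

# Perception: one response per heterotonic per speaker (assumption, see lead-in).
per_test = test_heterotonics * speakers_per_perception_test
per_generalization = generalization_heterotonics * speakers_per_perception_test
perception_total = participants * (repeated_tests * per_test + per_generalization)

# Production: each heterotonic is read once per test.
production_total = participants * (
    repeated_tests * test_heterotonics + generalization_heterotonics
)

print(per_test, per_generalization)  # 160 76
print(perception_total)              # 14456
print(production_total)              # 3614
```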
3 Results and discussion

First, we will report the results in the inter-group comparison, i.e., the percent-
age of correct answers among the groups and their possible significant differen-
ces, always addressing the correct answers of production first and then of
perception. After that, we will discuss the intra-group comparison, showing
how each group performed in each of the tests, including the generalization
tests. Finally, we will report the results of the individual analyses.
16 The two other raters, who judged the test productions that raised doubt, were
native speakers of Spanish: a Madrid native who was also a linguist and a Guatemalan who
was a post-graduate student at the Federal University of Paraná.
3.1 Inter-group results
Figure 1 shows the data (in percentage of correct answers and the respective
standard deviation) resulting from the three production tests (pre-test, post-test
and delayed post-test) of the four groups of informants. In the pre-test, the two
intermediate groups performed better than the two basic groups, showing that
they had knowledge about the pronunciation of heterotonics before the training
sessions, as expected. This is because, in addition to being at a more advanced
level of academic exposure to Spanish, they had had classes on heterotonics, as
explained in the Methodology section. However, the difference in the rate of correct
answers was only significant¹⁷ between the two groups that had had no
training: the rate of 30% of correct answers in the basic group without training
was significantly lower than the rate of 78% of correct answers in the intermedi-
ate group without training. In inferential terms, the other groups did not differ in
the percentage of correct answers, although we found that there was a difference
that needs to be considered throughout this research, since the basic group that
underwent training answered 47% of the items correctly in the pre-test while the
intermediate group that received training had a rate of 69% of correct answers.
[Figure 1 (bar chart). Values shown in the chart, as percentage of correct answers (SD):]
Group                            Pre-test     Post-test    Delayed post-test
Basic with training              47% (27%)    81% (15%)    89% (10%)
Basic without training           30% (11%)    55% (22%)    58% (25%)
Intermediate with training       69% (28%)    90% (12%)    96% (7%)
Intermediate without training    78% (16%)    90% (9%)     94% (9%)
Figure 1: Inter-group percentage of correct answers and standard deviation (SD) in the three
production tests. Source: The authors (2021).
In the post-test (central columns of Figure 1), the two intermediate groups
(even the one that did not participate in the training sessions) were equally correct
in 90% of the productions, although there was greater variability in the
group that had undergone training (standard deviation of 12%) in comparison
to the group that had not had training (standard deviation of 9%). There were
significant differences in the correct answers of the four groups;¹⁸ however, the
Mann-Whitney post hoc test did not show where they occurred. Based on the
analysis of the level of significance of the post hoc test, we can affirm that
there was a tendency for the p-value to approach significance when the basic
group without training, which had 55% of correct answers in the post-test, was
compared to the other three groups. This result indicates that this group may
have had significantly fewer correct answers than the other three.
17 Kruskal-Wallis test: χ² = 10.63, p = 0.014. Mann-Whitney test: U = 0.00, p = 0.004
(with the Bonferroni correction, the significance level considered was p ≤ 0.008).
18 Kruskal-Wallis test: χ² = 10.85, p = 0.013.
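The corrected threshold used in the post hoc comparisons can be illustrated with a short Python sketch. It assumes (our reading, not stated explicitly in the chapter) that the value of p ≤ 0.008 results from dividing the 0.05 significance level by the six possible pairwise comparisons among the four groups; the names below are illustrative.

```python
from itertools import combinations

groups = [
    "basic with training",
    "basic without training",
    "intermediate with training",
    "intermediate without training",
]
alpha = 0.05

# Every unordered pair of groups is one post hoc comparison.
pairs = list(combinations(groups, 2))
corrected_alpha = alpha / len(pairs)


def is_significant(p_value: float, threshold: float = corrected_alpha) -> bool:
    """True when a post hoc p-value survives the Bonferroni-corrected threshold."""
    return p_value <= threshold


print(len(pairs))                 # 6
print(round(corrected_alpha, 4))  # 0.0083
print(is_significant(0.004))      # True
print(is_significant(0.013))      # False
```

Under this reading, a pairwise comparison can fall short of the corrected threshold even when the omnibus test across the four groups is significant at 0.05, which is one way to understand why the post hoc tests did not always locate the difference.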
The percentage of correct answers of the groups in the delayed post-test (last
four columns of Figure 1) followed the trend of the post-test, with a small im-
provement in the percentage of correct answers for all groups. This result shows
that, if heterotonics were developed and/or improved during training, such
knowledge was retained in the long term.¹⁹ This time, the rate of correct answers
of the basic group with training (89%) was significantly²⁰ higher than that of the
basic group without training (58%). The statistical results indicate a possible pos-
itive effect of perceptual training for the group of participants at the basic level.
Next, we will analyze the performance of the groups as regards perception. The
four groups fared better in perception than in production. In the perception pre-
test, the basic group without training had the lowest percentage of correct an-
swers (65%), as shown in the first four columns of Figure 2. Still, it was a high
percentage when compared to the correct answers in production, since this same
group had correctly answered less than half of the items in the production pre-
test (30%). When comparing correct answers between the four groups in the per-
ception pre-test, the Kruskal-Wallis test pointed out that there were no significant
differences. However, it should be noted that, in descriptive terms, there was a
15-percentage-point difference in the rate of correct answers between the two basic groups.
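[Figure 2 (bar chart). Values shown in the chart, as percentage of correct answers (SD):]
Group                            Pre-test     Post-test    Delayed post-test
Basic with training              80% (20%)    85% (14%)    86% (14%)
Basic without training           65% (20%)    60% (25%)    61% (25%)
Intermediate with training       88% (15%)    91% (13%)    95% (8%)
Intermediate without training    87% (15%)    93% (7%)     94% (9%)
Figure 2: Inter-group percentage of correct answers and standard deviation (SD) in the three
perception tests. Source: The authors (2021).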
In the perception post-test, the rate of correct answers remained high and similar
to that of the pre-test. However, in the basic group without training,
the percentage of correct answers decreased from 65% in the pre-test to 60% in the
post-test. In the comparison of the four groups in the post-test, there was again no
significant difference in the number of correct answers among the groups.
19 Between 42 and 58 days after the last training session.
20 Kruskal-Wallis test: χ² = 12.79, p = 0.005. Mann-Whitney test: U = 6.50, p = 0.003
between the basic group with and without training.
The correct answers in the perception delayed post-test followed the trend
of the other tests, i.e., it was easier for the learners to perceive the stressed
syllables of the heterotonics than to produce them properly. Although there were
significant differences²¹ in the percentage of correct answers among the four
groups in the delayed post-test, the Mann-Whitney post hoc test did not show
where these differences occurred. However, once again,
there was a tendency for the p-value to approach significance when the basic
group without training (61% of correct answers) was compared to the other
three groups, whose rates of correct answers were above 80%.
In the comparisons among the four groups of this study, both in the production
and in the perception of heterotonics, the group that showed more difficulty
was the basic group without training. When comparing the basic group with
training to the two intermediate groups, the former always had a lower percentage
of correct answers than the others, which shows that greater academic
exposure to Spanish and explicit instruction on heterotonics were aspects
that influenced the results of this study. To further investigate the topic, we looked at
the performance of each group in the tests and particularly in the generalization
test, in the intra-group analysis. In addition, after discussing the intra-group
analysis for both production and perception, we will focus on the individual
analysis, highlighting the participants who always scored above or below the
group average as well as the individual percentage variation among the tests.
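The criterion used to single out participants who scored above or below the group average (described in the Data analysis section as a band of 10% around the group mean) can be sketched as follows. This is a minimal illustration under our own assumptions: the band is interpreted as 10 percentage points, the function name is hypothetical, and the scores are invented for the example rather than taken from the study.

```python
from statistics import mean


def classify(scores: dict[str, float], band: float = 10.0) -> dict[str, str]:
    """Label each participant relative to the group mean (scores in percent)."""
    group_mean = mean(scores.values())
    labels = {}
    for participant, score in scores.items():
        if score > group_mean + band:
            labels[participant] = "above"
        elif score < group_mean - band:
            labels[participant] = "below"
        else:
            labels[participant] = "within"
    return labels


# Hypothetical scores, not data from the study.
example = {"P1": 30.0, "P2": 50.0, "P3": 55.0, "P4": 65.0}
print(classify(example))
# {'P1': 'below', 'P2': 'within', 'P3': 'within', 'P4': 'above'}
```

If the band were instead meant as a relative 10% of the group mean, the thresholds would become `group_mean * 1.1` and `group_mean * 0.9`.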
3.2 Intra-group results in production
We regrouped the data to observe the performance of each group in each of the
three production tests (pre-test, post-test and delayed post-test); for this compar-
ison, we added the data from the production generalization test. Figure 3 shows
that the informants from the four groups, even those who did not participate in
the training sessions, improved from one test to the next, which may suggest
that exposure to the tests alone favored this outcome.
21 Kruskal-Wallis test: χ² = 9.32, p = 0.025.
[Figure 3: bar chart of the four groups, with bars for the Pre-test, Post-test, Delayed post-test
and Generalization test]
Figure 3: Intra-group percentage of correct answers and standard deviation (SD) in the four
production tests. Source: The authors (2021).
The basic group with training correctly produced 47% of the heterotonics
before the training sessions. After having been trained, they increased the rate
of correct answers to 81% in the post-test. Also, the variability of the group’s
responses decreased, because the standard deviation dropped
from 27% to 15%. This same group retained the knowledge that they had developed,
as shown by the results of the delayed post-test, in which the rate of correct
answers increased to 89%. The percentage of correct answers is significantly²²
different when comparing the three tests for this group. This means that there was a
significant increase in the rate of correct answers from the pre-test to the post-test
and then to the delayed post-test; therefore, the positive effect of perceptual training has been
confirmed. In the generalization test, the percentage of correct answers was 67%.
This value was significantly²³ lower than the rate of correct answers in the post-test
and delayed post-test. Thus, it can be inferred that the basic group with training
did not generalize what they had learned when their knowledge was tested on
heterotonics not yet seen in the other tests. However, it is worth mentioning that, in
descriptive terms, the rate of correct answers rose by 20 percentage points from the pre-test
(47%) to the generalization test (67%). In this type of analysis, this result
may indicate a possible generalization to heterotonics that were unknown.
22 Friedman test: χ² = 19.15, p < 0.001. Wilcoxon post hoc test: Z = −2.80, p = 0.005
in the comparison between the pre-test and the post-test; Z = −2.53, p = 0.011
in the comparison between the post-test and the delayed post-test; and Z = −2.80,
p = 0.005 in the comparison between the pre-test and the delayed post-test.
23 Friedman test: χ² = 27.00, p < 0.001. Wilcoxon post hoc test: Z = −2.80, p = 0.005
in the comparison between the post-test and the generalization test; Z = −2.81, p = 0.005
in the comparison between the delayed post-test and the generalization test.
The basic group without training improved from the pre-test
(30%) to the post-test (55%) and also to the delayed post-test (58%), which
again indicates that exposure to the tests alone (without the training sessions)
may also lead to an improvement in the production of heterotonics. There
was a significant difference²⁴ when comparing the rate of correct answers
among the three tests in this group, but the Wilcoxon post hoc test did not show
where it occurred. However, when the rate of correct answers of the pre-test
was compared with that of the other two tests, the p-value tended to approach
significance, which may indicate that they actually answered fewer items correctly
in the pre-test than in the others. Notably, this group scored below the
average of the other three groups in this study: their rate in the delayed post-test
was 58%. In the generalization test, this group scored 36%, an outcome
that was more similar to that of the pre-test than to those of the post-test and the
delayed post-test. However, the p-value was not significant and was similar in all
comparisons with the generalization test.
24 Friedman test: χ² = 8.07, p = 0.018.
The intermediate group with training also showed an increase in the percentage
of correct answers from the pre-test (69%) to the post-test (90%) and the
delayed post-test (96%). Although the rates of correct answers in the three
tests differed significantly,[25] Wilcoxon’s Post Hoc did not show where the
differences occurred. We found that the p-value approached significance more
closely when the pre-test was compared to the other two tests. Thus,
descriptively, it can be stated that this group had fewer correct answers in the
pre-test and showed a small difference (only 6 percentage points) in performance
from the post-test to the delayed post-test. As these results repeatedly
suggest, a limitation of this research is the small number of participants (the
intermediate group with training, in particular, had only four). For this
reason, the statistical tests often did not point out where the difference was
in the percentage of correct answers. This also occurred in the generalization
test, in which the rate of 82% of correct answers was significantly[26]
different from the rates of the other three tests; however, Wilcoxon’s Post Hoc
did not show where these differences were. Descriptively, we can affirm that the
rate of correct answers in the generalization test was more similar to that of
the post-test and the delayed post-test, which may indicate a positive
generalization of the intermediate group with training for the new heterotonics.
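The link between group size and the inconclusive post hoc tests can be made concrete with a minimal sketch. It assumes that the Z values in the notes come from the usual normal approximation of the Wilcoxon signed-rank test without continuity correction, and it considers only the most extreme possible outcome (every participant changes in the same direction, with no ties or zero differences). The function name is hypothetical, and whether any correction for multiple comparisons was applied is not stated in the text.

```python
import math


def most_extreme_wilcoxon(n):
    """Return (Z, asymptotic two-sided p, exact two-sided p) for the most
    extreme signed-rank outcome with n paired scores.

    Assumption: all n differences share the same sign and there are no ties
    or zero differences, so the smaller rank sum is W = 0.
    """
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (0 - mean_w) / sd_w  # normal approximation, no continuity correction
    p_asymptotic = math.erfc(abs(z) / math.sqrt(2))
    p_exact = 2 / 2 ** n  # probability of W = 0 in either direction
    return z, p_asymptotic, p_exact


# Group sizes as given in the chapter (participants 1-10, 11-17, 18-21, 22-26).
group_sizes = {
    "basic, with training": 10,
    "basic, without training": 7,
    "intermediate, with training": 4,
    "intermediate, without training": 5,
}

for group, n in group_sizes.items():
    z, p_asym, p_exact = most_extreme_wilcoxon(n)
    print(f"{group:32s} n={n:2d}  Z={z:6.2f}  p(asymptotic)={p_asym:.3f}  p(exact)={p_exact:.3f}")
```

Under these assumptions, ten participants who all move in the same direction give Z of about −2.80, the value reported for the basic group with training, whereas four participants cannot reach p < 0.05 even in the most favorable case. This is consistent with the limitation described above.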
The intermediate group without training, represented in the last three columns
of Figure 3, also showed improvements from one test to the other. Although the
rate of correct answers was significantly[27] different among the three tests,
Wilcoxon’s Post Hoc did not show where the differences were. This also happened
when the generalization test was compared to the other three tests.
Descriptively, the p-value tended to approach significance when the rate of
correct answers in the pre-test (78%) was compared to those of the post-test
(90%), the delayed post-test (94%) and also the generalization test (83%). This
finding suggests that there was an improvement in subsequent tests. However,
this group, as well as the intermediate one with training, participated in this
study with considerable previous knowledge of the production of heterotonics.
Before we present the intra-group results of perception, we will report how each
individual fared in the production of heterotonics.
[25] The value of the Friedman test was χ² = 7.60, p = 0.022.
[26] The value of the Friedman test was χ² = 11.77, p = 0.008.
[27] The value of the Friedman test was χ² = 14.02, p = 0.003.
3.2.1 Individual production analysis
Figure 4 shows the percentage of correct answers for each of the 26 participants in
this study in the pre-test, post-test and delayed post-test. It should be noted that
(i) numbers 1 to 10 represent the informants of the basic group with training; (ii) 11 to
17, the basic group without training; (iii) 18 to 21, the intermediate group with
training; and (iv) 22 to 26, the intermediate group without training.

[Figure 4: line chart; y-axis: percentage of correct answers (10% to 110%); x-axis: participants 1 to 26]
Figure 4: Percentage of correct answers by each participant in the three production tests.
Source: The authors (2021).
What this figure visually shows is that there was, especially in the pre-test (ligh-
ter gray line), a high variability of responses from individuals who belonged to
the same group. This result confirms the hypothesis that although these inform-
ants had been placed in the same class of Spanish as a foreign language at uni-
versity, according to their level of knowledge, they performed differently. When
looking at the correct answers of participants 1 and 2, for example, we found that
the first one, in the pre-test, had a rate of 20% while the second had a rate of
68% in that same test. This discrepancy in the percentage of individual correct
answers in the same group was also found in the intermediate levels. In the pre-
test, informant 19 answered 95% of the items correctly, while informant 21, from
the same group, had a rate of 30%. Importantly, Figure 4 also shows the downward
curve of correct answers formed by numbers 11 to 17, which refer to the basic
group without training. This group, as shown in the results, consistently had
fewer correct answers than the other three groups. However, that does not mean
that every individual in it performed worse; on the contrary, some informants
in the basic group without training performed similarly to more experienced
learners. For example, informant 17 had 90% of correct answers, i.e., a score
that approached the rate of correct answers of both the basic group with
training and the two intermediate groups. In summary, what we found was that
individuals 4, 9 and 17 always scored above the average of their group and that
individuals 5, 10, 16, 21 and 26 always scored below the average of their
respective groups in the production tests. Next, we will discuss how the
learners fared in the percentage variation in production.
For the sake of space, we will only report the results of the three informants
who most increased their rate of correct answers between the pre-test and the
delayed post-test, namely learners 5 (467%), 1 (290%) and 10 (289%), and of the
three learners who had the lowest rate of increase between the same tests:
informants 23 (6%), 19 (5%) and 16 (−17%). This analysis is particularly
interesting: participant 5, who was always below the average of the basic group
with training in the individual analysis, was precisely the one who improved the
most from the pre-test to the delayed post-test (the rate of correct answers
increased by 467%, i.e., from 15% to 85%). The same occurred with informants 10
and 21, who had, respectively, 289% and 183% of percentage variation between the
same tests and who, in the previous individual analysis, tended to lower the
average of their group. This indicates that these learners were the ones that
most benefited from perceptual training, despite their difficulty in the tests
that preceded the training itself.
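The percentage variation reported above can be reproduced with a short sketch, which also separates it from a difference in percentage points. The function names are hypothetical; the figures are those given in the text for participant 5 (15% in the pre-test, 85% in the delayed post-test) and for a learner who starts at 90%.

```python
def percentage_points(pre, post):
    """Absolute difference between two rates, in percentage points."""
    return post - pre


def percentage_variation(pre, post):
    """Relative change from pre to post, expressed as a percentage of pre."""
    if pre == 0:
        raise ValueError("percentage variation is undefined for a pre-test rate of 0")
    return (post - pre) / pre * 100


# Participant 5, production: 15% in the pre-test, 85% in the delayed post-test.
print(percentage_points(15, 85))            # 70
print(round(percentage_variation(15, 85)))  # 467

# Ceiling effect: a learner who starts at 90% can vary by at most about 11%.
print(round(percentage_variation(90, 100), 1))  # 11.1
```

The last line illustrates why learners who start near the ceiling can only show a small percentage variation, whatever their actual performance.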
On the other hand, the same cannot be said about informant 16, who belongs to
the basic group without training. This participant always scored right below
the group average and was one of the participants with the lowest rates of
correct answers in all tests and in all groups. Several factors may explain
this; for example, this informant did not participate in the training sessions
(unlike individuals 5, 10 and 21), so the production tests may not have made
sense to this person. Individuals 23 and 24, who did not participate in the
training sessions either, are also among those who had the smallest increase in
the percentage of correct answers (6% and 8%, respectively). In their case, they
had already given, respectively, 90% and 93% of adequate responses in the
production pre-test, i.e., they had answered nearly the whole test correctly.
Therefore, there was little room for further improvement.
This type of comparison proved to be relevant in our study because it
showed that learners with more difficulty benefited more from perceptual
training, including those from the groups with more academic exposure. For
example, informant 21, from the intermediate group with training, had more
difficulty at the beginning of the tests, but was one of the participants that
benefited the most from perceptual training. Next, we will report the intra-
group analysis of the perception tests.
3.3 Intra-group results in perception
Unlike the results in production, all participants showed a good performance in
perception even before the training sessions. The basic group with training,
represented in the first three columns of Figure 5, scored 80% or more in the
three perception tests. There were significant differences[28] among these rates
of correct answers, but Wilcoxon’s Post Hoc test did not show where they
occurred.[29] The p-value, however, was closer to significance when the pre-test
(80% of correct answers) was compared to the other two tests. This result may
indicate that the pre-test had fewer correct answers than the other two tests
performed after the training sessions. When the three tests (pre-test, post-test
and delayed post-test) were compared to the generalization test, there were
significant differences[30] between them. In this group, the rate of correct
answers was significantly[31] higher in the generalization test (81%) than in
the pre-test (80%); however, this rate was significantly[32] lower in the
generalization test than in the post-test (85%) and in the delayed post-test
(86%). This shows that the basic group with training performed better in
generalization than in the test that preceded training; however, the learners in
this group did not perform as well on the new items as they had in the post-test
and delayed post-test.
[28] The value of the Friedman test was χ² = 10.47, p = 0.005.
[29] As mentioned above, this is a limitation of our study because it has a small number of
participants, which weakened the statistical power of the Post Hoc tests.
[30] The value of the Friedman test was χ² = 24.46, p < 0.001.
[31] The value of Wilcoxon’s Post Hoc test was Z = −2.80, p = 0.005 in the comparison
between the pre-test and the generalization test; Z = −2.81, p = 0.005.
[32] The value of Wilcoxon’s Post Hoc test was Z = −2.81, p = 0.005 in the comparison between
the post-test and the generalization test; and Z = −2.80, p = 0.005 in the comparison between
the delayed post-test and the generalization test.
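As a reading aid, the p-values in the notes can be recovered from the reported test statistics. The sketch below assumes that the Friedman statistic is referred to a chi-square distribution with k − 1 degrees of freedom (k = 3 tests, or k = 4 when the generalization test is included) and that the Wilcoxon Z values are two-sided. The helper names are hypothetical, and only the closed forms for 2 and 3 degrees of freedom are implemented.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail probability of a chi-square variable (df = 2 or 3 only)."""
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("this sketch only handles df = 2 or df = 3")


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal Z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# Three perception tests (df = 2): reported chi-square = 10.47, p = 0.005.
print(f"{chi2_upper_tail(10.47, 2):.3f}")

# Four tests, generalization included (df = 3): reported chi-square = 24.46.
print(chi2_upper_tail(24.46, 3) < 0.001)

# Post hoc comparisons: reported Z = -2.80 and Z = -2.81, both p = 0.005.
print(f"{two_sided_p_from_z(-2.80):.3f}", f"{two_sided_p_from_z(-2.81):.3f}")
```

A value reported as p = 0.000 by statistical software corresponds to p < 0.001 in this computation.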
[Figure 5: bar chart; bars per group: pre-test, post-test, delayed post-test, generalization test; bar labels: mean percentage of correct answers and SD]
Figure 5: Intra-group percentage of correct answers and standard deviation (SD) in the four
perception tests. Source: The authors (2021).
For the first time in this study, when comparing the tests, there was a
reduction in the percentage of correct answers between the pre-test and the
post-test for one of the groups (the basic group without training): it dropped
from 65% to 60%, and then, in the delayed post-test, it increased by only one
percentage point, as shown in Figure 5. These rates did not differ
significantly; however, there were significant differences[33] when comparing
the correct answers of these three tests to the generalization test (63%).
Nonetheless, the Post Hoc test did not show where they occurred.
The intermediate group with training did not present significant differences in
correct answers in the three perception tests, but when the generalization test
was added to the comparisons, there were significant differences[34] in the rate
of correct answers in the four tests, although, again, we could not determine
where they occurred. It is likely that they occurred between the generalization
test (95%), the post-test (91%) and the pre-test (88%), since the delayed
post-test had an average rate of correct answers that was equal to that of the
generalization test. If we analyze it in this way, we can affirm that the
training sessions were positive since this group had better performance in the
delayed post-test and in the generalization test than in the pre-test and the
post-test.
The intermediate group without training had a similar outcome to that of the
intermediate group with training. No differences were found in correct answers
among the three main tests, but in comparison to the generalization test, there
were inferential differences,[35] although the Post Hoc test did not show where
they occurred. The p-value tended to approach significance when the
generalization test was compared to the other three tests. Such a result may
indicate, just by looking at the descriptive data, that the rate of correct
answers for generalization (91%) was higher than that of the pre-test, but lower
than that of the post-test (93%) and the delayed post-test (94%). Although this
group had not been trained, it improved its performance in the perception tests,
similarly to the group at the same level of academic exposure that had been
trained. This outcome may suggest that exposure to the tests alone (without the
training sessions) already helped them perceive the stressed syllables of
Spanish in the heterotonics. Below are the perception results of each informant.
[33] The value of the Friedman test was χ² = 13.80, p = 0.003.
[34] The value of the Friedman test was χ² = 9.07, p = 0.028.
[35] The value of the Friedman test was χ² = 12.19, p = 0.007.
3.3.1 Individual analysis of perception
In the individual comparison, Figure 6 shows that the responses tended to be
more similar in the three tests (pre-test, post-test and delayed post-test) because
the curves almost overlap, unlike those of the production tests. Even so, in the
perception tests, there were informants whose performance differed from that
of their peers from the same group.

[Figure 6: line chart; y-axis: percentage of correct answers (10% to 110%); x-axis: participants 1 to 26]
Figure 6: Percentage of correct answers by each participant in the three perception tests.
Source: The authors (2021).
In the individual performance of perception, in the basic group with training, in-
formants 2, 6 and 9 always had a rate of correct answers above that of the group
average, but only the rate of informant 9 was above average both in production
and in perception. Informants 2 and 6, who did not score above average in pro-
duction, had an easier time in perception, which shows that those who perceive
heterotonics will not always produce them properly. The opposite also hap-
pened: informant 1, whose rate of correct answers was below the average in per-
ception, followed the average of the group in production, i.e., perceiving a
phonological aspect inappropriately does not always result in inappropriate pro-
duction. In summary, for perception, the informants who scored above the aver-
ages of their groups were 2, 6, 9, 11 and 17, while those who scored below the
average were 1, 16 and 21.
In the analysis of percentage variation, the three informants who most increased
the rate of correct answers between the pre-test and the delayed post-test for
perception were 1 (74%), 22 (45%) and 10 (34%), and those whose rate increased
the least were 13 (−7%), 3 (−9%) and 15 (−56%). Notably, participants 1, 16 and
21 were the ones whose rate of correct answers was always below the average of
their respective groups in perception. However, when we analyzed percentage
variation, two of them were among those that most increased the percentage of
correct answers from the pre-test to the delayed post-test. Participant 1 was
the one who benefited the most in perception, increasing the percentage of
correct answers between these tests by 74%. As this informant was part of the
basic group with training, the results for perception suggest that this learner
benefited from the training sessions. The same happened to learner 21, who
ranked fourth among those who benefited the most, as the rate of correct answers
between the pre-test and the delayed post-test increased by 26%.
Participant 16, on the other hand, who was from the basic group without
training and had an average rate of correct answers below that of the group,
was not among those who most increased the percentage of correct answers, as
there was an increase of only 3% from the pre-test to the delayed post-test.
That is, without training and only with exposure to tests, this learner barely
improved from one test to the next. The participants who least increased the
percentage of correct answers from the pre-test to the delayed post-test fit into
one of three situations: (i) those who already had a high average of correct answers
and, therefore, could benefit from the study only to a small extent, which is the
case of participants 2, 6, 9, 11, 17, 20 and 24; (ii) the ones who trained (partici-
pant 3) but who did not benefit from the training session, because they had a
smaller number of correct answers at the end than at the beginning of the
study; (iii) and participants such as 13 and 15, whose answers were also less
accurate in the last test than in the first, and who were part of the basic group
without training, i.e., the fact that they had not been trained may have led to a
worse performance from one test to the other.
This type of analysis showed us that a more individualized approach allows
us to better understand the impact of perceptual training according to the pro-
file of each participant. This understanding is in line with the statement by
Larsen-Freeman (2018), which summarizes why it is so important to look at in-
dividual development, as students not only start from different points when
they first engage in a task, but they also make their own developmental path.
4 Conclusions
The inter-group analysis showed that there were positive effects of perceptual
training of heterotonics in Spanish as contrasted to Brazilian Portuguese for the
basic group, i.e., the one which had less academic exposure. The basic group
with training had a significantly higher average of correct answers than the
basic group without training. This was found, for example, in the comparison
of correct answers in the delayed production post-tests, in which the basic group
with training (average: 89%) had a significantly higher rate of correct answers
than the basic group without training (average: 58%). This inferential difference
was not found in the two groups with more academic exposure: there were no
inferential differences at any point in this study when comparing the
intermediate groups with and without training. In the two intermediate groups,
no positive effect of training was detected, arguably because both groups were
already familiar with the heterotonics, as previously mentioned.
In the intra-group analysis, again the basic group with training had signifi-
cant differences in the rate of correct answers among the tests performed, both
in production and in perception. However, the members of the group were not
able to generalize, in inferential terms, what they had learned from being ex-
posed to new heterotonics in the generalization production test. In perception,
on the other hand, this same group was able to make this generalization for
heterotonics not seen in the other tests and also for new speakers (no longer
Mexicans, but Cubans and Hondurans). These results reinforced the positive ef-
fect of training for the basic group.
In the individual analysis, however, we found that having more or less academic
exposure to Spanish was not necessarily a factor that determined the effect of
training. This is illustrated by participant 21, from the intermediate group
with training, who benefited from training both in production and in perception.
From the production pre-test to the delayed post-test, this learner increased the
number of correct answers by 183%; in perception, this increase was 26% (the
fourth highest rate among the 26 participants in the perception test). When we
first observed which participants tended to decrease the group average and then
analyzed how these participants performed in the percentage variation between
tests, we realized that the individuals with more difficulty were exactly those
who benefited the most from training. In this study, this pattern held for
learners with more difficulty who belonged not only to the basic group with
training but also to the intermediate groups with and without training. This
means that even with more academic exposure to Spanish, those who still had
difficulty with the production and the perception of heterotonics benefited from
training, a fact that had been masked in the group analysis. When there was less
academic experience and no perceptual training, that is, for learners from the
basic group who had difficulty and did not receive support to overcome it, the
opposite occurred: the informants who answered the tests more easily were the
ones who tended to increase the percentage of correct answers from one test to
another.
According to Complex Systems Theory, this result shows that, basically, to
develop a language and/or learn something new, learners need to try different
ways to start and they ultimately feel satisfied with one form or another only
after enough interaction. Also, this attempt occurs in different ways for each
individual, even if that student belongs to the same group of learners and has
been exposed to the same number of hours of the language to be developed.
This reinforces that a learner’s performance cannot be generalized to the group,
nor can the group’s results be generalized to a particular individual (Lowie and
Verspoor 2015).
References
Amorin, Vitor. 2016. O Ensino de Matemática Financeira: Do Livro Didático ao Mundo Real
[Teaching financial mathematics: from the textbook to the real world]. Rio de Janeiro:
Sociedade Brasileira de Matemática.
Boggiss, George Joseph, Luiz Geraldo Mendonça, Luiz Alfredo Gaspar & Marcos Heringer.
2012. Matemática Financeira [Financial mathematics]. Rio de Janeiro: Editora FGV.
Bybee, Joan. 2010. Language, Usage and Cognition. Cambridge: Cambridge University Press.
Ellis, Rod. 1985. Understanding Second Language Acquisition. Oxford: Oxford University
Press.
Goldstone, Robert & Lisa Byrge. 2005. Perceptual learning. In Mohan Matthen (ed.), The
Oxford Handbook of Philosophy of Perception, 1–16. New York: Oxford University Press.
Henshaw, Florencia. 2011. Effects of feedback timing in SLA: A computer-assisted study on the
Spanish subjunctive. In Cristina Sanz & Ronald Leow (eds.), Implicit and Explicit
Language Learning: Conditions, Processes, and Knowledge in SLA and Bilingualism,
85–99. Washington: Georgetown University Press.
Iezzi, Gelson, Samuel Hazzan & David Degenszajn. 2004. Fundamentos de Matemática
Elementar: Matemática Comercial, Financeira, Estatística [Fundamentals of elementary
mathematics: business, financial, and statistical mathematics]. São Paulo: Editora Atual.
Larsen-Freeman, Diane. 2018. Task repetition or task iteration. In Martin Bygate (ed.),
Learning Language through Task Repetition, 311–330. Amsterdam: John Benjamins
Publishing Company.
Larsen-Freeman, Diane & Lynne Cameron. 2008. Complex Systems and Applied Linguistics.
Oxford: Oxford University Press.
Lima Júnior, Ronaldo Mangueira. 2016. A necessidade de dados individuais e longitudinais
para análise do desenvolvimento fonológico de L2 como sistema complexo [The need of
individual and longitudinal data to analyse L2 phonological development as a complex
system]. ReVEL 27(14). 203–225.
Lowie, Wander. 2017. Lost in state space? Methodological considerations in Complex Dynamic
Theory approaches to second language development research. In Lourdes Ortega &
ZhaoHong Han (eds.), Complexity Theory and Language Development. In celebration of
Diane Larsen-Freeman, 123–141. Amsterdam: John Benjamins Publishing Company.
Lowie, Wander & Marjolijn Verspoor. 2015. Variability and variation in second language
acquisition orders: A dynamic reevaluation. Language Learning 65(1). 63–88.
Matias, Rogério. 2015. Cálculo Financeiro: Teoria e Prática [Financial accounting: theory and
practice]. Lisboa: Escolar Editora.
Milan, Pollianna. 2019. Efeitos do treinamento perceptual na percepção e produção dos
heterotônicos por aprendizes brasileiros de espanhol [Effects of perceptual training on
the perception and production of heterotonics by Brazilian learners of Spanish]. Curitiba,
PR: Universidade Federal do Paraná dissertation.
Morin, Edgar. 1990. Introdução ao Pensamento Complexo [Introduction to complex thought].
Translated by Dulce Matos. Lisboa: Instituto Piaget. Epistemologia e Sociedade.
Rast, Rebekah. 2011. Input processing principles: A contribution from first-exposure data. In
Cristina Sanz & Ronald Leow (eds.), Implicit and Explicit Language Learning: Conditions,
Processes, and Knowledge in SLA and Bilingualism, 129–144. Washington: Georgetown
University Press.
Rauber, Andrea Schurt, Anabela Rato, Giane Rodrigues dos Santos, Denise Cristina Kluge &
Marcos Figueiredo. 2013. TP: Testes de Percepção e Treinamento Perceptual com Feedback
Imediato – Versão 3.1 [TP: perception tests and perceptual training with immediate
feedback – version 3.1]. http://www.worken.com.br/tp/tp_instala.html (accessed
5 May 2021).
Assessing the robustness of L2 perceptual
training: A closer look at generalization
and retention of learning
Abstract: It is widely acknowledged that second language (L2) speech acquisi-
tion is often challenging to adult learners. Certain non-native speech sounds
tend to be more difficult to perceive and to produce accurately than others, even
after years of experience with the L2. Adult L2 learners are therefore frequently
characterized as having not only foreign accent but also accented perception
(Strange 1995). Over the last four decades, numerous studies on L2 speech learn-
ing have applied training programs to improve the perception and production
abilities of L2 learners and thus reduce degree of accentedness with a focus on
nativeness or intelligibility (Sakai and Moorman 2018; Thomson and Derwing
2015). However, training studies with different language pairings have yielded
complex findings due to the interplay of subject, task, and stimulus variables
and assessment procedures (Bohn 2000; Thomson and Derwing 2015). To evalu-
ate the efficacy of a training study, Logan and Pruitt (1995) propose that both
generalization and retention of learning need to be examined. This paper pro-
vides a systematic review of 27 perceptual training studies conducted over the
last 40 years which include the testing of generalization and/or retention of L2
speech learning. It overviews the use of these measures and examines how effec-
tive perceptual training is in promoting robust L2 speech learning. The review
also discusses the benefits and challenges of using these learning robustness
evaluation methods. The limitations of the qualitative review are presented as
well as suggestions for future research.
Keywords: Perceptual training, generalization, retention, L2 speech learning
Acknowledgments: The authors wish to thank Owen Ward for assistance with gathering litera-
ture for this review.
Anabela Rato, University of Toronto

Diana Oliveira, University of Minho
https://doi.org/10.1515/9783110736120-014
1 Introduction
Research on second language perceptual training has contributed to the under-
standing of three major processes involved in speech learning: perceptual plas-
ticity, modality transfer, and robustness of learning.
Over the last 40 years (i.e., since Pisoni et al. 1982; McClaskey, Pisoni, and
Carrell 1983; Strange and Dittmann 1984), phonetic training studies have shown
that speech perception remains malleable over the life span, with perceptual
reattunement of already formed phonemic categories and the establishment of new
L2 categories possible at any age studied so far. Pertinently, Bohn (2018)
points to an age gap in cross-language research on perceptual plasticity, which
has not tested learning mechanisms and processes in older adults (over the age
of 40).
Notwithstanding the testing of perceptual learning in younger adults, the
findings of training studies provide evidence of the plasticity of the perceptual
system that makes L2 speech learning possible in adulthood, as assumed by
the two most widely cited theoretical models of non-native speech learning
(Flege’s Speech Learning Model (SLM) 1995, and the revised Flege and Bohn’s
Model (SLM-r) 2021; and Best’s Perceptual Assimilation Model (PAM) 1995, and
Best and Tyler’s PAM-L2 2007).
Phonetic training research has also contributed to the discussion on the
interaction between the two speech modalities by examining the relation between
perception and production performance, viz. by assessing the transfer of effects
of perceptual training to production and vice versa. Specifically, the findings of a
meta-analysis of 18 perceptual training studies that tested for effects in produc-
tion (Sakai and Moorman 2018) indicate that the two speech modalities are con-
nected. The findings showed that perceptual training leads to medium-sized
gains in perception and small improvements in production. The results of a cor-
relational analysis suggested a non-significant small to medium relationship
between perception and production gains. More recent studies have examined
the modality transfer effect of both perception and production training to deter-
mine which training type transfers most effectively to the other modality. Stud-
ies such as those by Aliaga-Garcia (2017), Herd, Jongman, and Sereno (2013),
and Sakai (2016) have shown that both perception and production training
gains may transfer to opposite modalities. Aliaga-Garcia (2017) and Herd, Jong-
man, and Sereno’s (2013) findings suggest that both training types transfer to
the other modality, but how well perception or production transfers depends
on the relationship between the sounds being trained. Sakai (2016), however,
found that perception-only training led to large gains in perception but to no
significant improvements in production and production-only training led to
variable results for production, and medium-sized improvements in perception.
Despite the conflicting results which might be explained by methodological dif-
ferences pertaining to length of training and type of training tasks, and by the
complex interaction of the two speech modalities, phonetic training studies
have shown evidence that training in one modality impacts both speech per-
ception and production.
Second language speech learning resulting from perceptual training programs
has also been reported to generalize to new conditions such as to untrained stim-
uli, talkers, and tasks, and to be maintained over time. Generalization and reten-
tion are thus two measures of robustness of learning that perceptual training
studies have tested over the last decades. The findings have allowed the identifica-
tion of some of the optimal training conditions which seem to lead to robust L2
speech learning, namely those that seem to reproduce the learning conditions that
learners would be exposed to in a natural L2 environment, whose paradigm is
known as high variability phonetic training. However, to the best of our knowl-
edge, a systematic review of the training studies that have tested for generalization
and/or retention of perceptual learning has not yet been conducted. Therefore, to
better understand the mechanisms that underlie successful L2 speech learning
and specifically to identify the training variables that lead to robustness of learn-
ing, we aim to take a closer look at the assessment of generalization and retention
in L2 perceptual training research.
1.1 Assessing the robustness of L2 speech learning in cross-language perceptual training
In their review of methodological issues in perceptual training, Logan and
Pruitt (1995) argue that, of the criteria by which different training procedures
can be assessed, generalization is a more important parameter than efficiency.
The rationale is that training paradigms will be of limited use if they do not pro-
mote generalization, whereas those that are inefficient but promote generaliza-
tion may still be considered successful (Logan and Pruitt 1995: 374). Efficiency
is defined as the improvement observed between pretest and posttest and gen-
eralization as the ability to transfer learning achieved during training to new
testing dimensions, operationalized in multiple ways, including changes in
stimuli, phonetic context, talker, and task. If generalization is observed in at
least one of the aforementioned dimensions, it can be argued that, to a certain
degree, robust L2 speech learning has occurred. Generalization tests are thus
procedures that address the ecological validity of training studies. However, in
a review of L2 pronunciation instruction research, Thomson and Derwing (2015)
note that findings reporting improvement are based on discrete pronunciation
features, which highlights the lack of evaluation procedures which examine
these features in more ecologically valid contexts, such as spontaneous speech.
Additionally, to fully evaluate the extent of the impact of training on ro-
bustness of learning, another criterion should be tested, namely the lasting ef-
fect of the training intervention. This can be assessed in delayed posttests by
examining the long-term modification of the learners’ speech performance, i.e.,
whether there was retention of learning. Retention of learning over time occurs
when the (phonological and phonetic) information is stored in long-term mem-
ory in such a way that it can be readily retrieved after the training is over. The
long-term effects of training are thus examined to further assess whether L2
phonetic categories have been readjusted or established in the learners’ mind.
However, Sakai and Moorman (2018) report that fewer than half (7 out of 30)
of the perception training studies in their meta-analysis include long-term
posttests.
Generalization and retention are therefore two measures tested in training
studies to determine the success of a given intervention program.
In a perceptual training study, L2 speech learning is assessed with the ad-
ministration of the same test(s) before and after the intervention, following a
pretest-intervention-posttest experimental design. Retention of learning is
evaluated at a later point in time, after the posttest, in what is termed a
delayed posttest (see Fig. 1). Generalization tests can be included in the
posttest immediately after training and in the delayed posttest, either as a
separate test or merged with the posttest(s).
[Figure 1 diagram: Pretest → Training → Posttest → Delayed posttest(s)]
Figure 1: Perceptual training experimental design.
1.1.1 Generalization of learning
In phonetic training research, generalization of learning can be operationalized
and measured in different ways (Logan and Pruitt 1995). The learners’ ability to
transfer learning is assessed with tasks that include a new testing condition, or
a combination of new conditions, not included during training, such as transfer
(1) to stimuli produced by new talkers; (2) to novel productions by the same
talker; (3) to both, i.e., untrained productions by new talkers; (4) to new
phonetic contexts (i.e., to items in which the contrasting target segments or
suprasegments are embedded in phonetic contexts not presented in training,
such as adjacent sounds and word or sentential contexts); (5) to new segments
or suprasegments (i.e., to stimuli containing novel phonetic categories that
share acoustic features with the training stimuli, e.g., a sound contrast at one
place of articulation generalized to the same contrast at a new place of
articulation); (6) to a new type or quality of stimuli (synthetic to natural, and
pseudoword to real word, and vice versa); and (7) to new tasks (Logan and Pruitt
1995; Sakai and Moorman 2018).
There are two main forms of assessing generalization of learning conducted
either separately or in combination. The first is to assess transfer of learning by
comparing the learners’ performance between a pretest/posttest and a
generalization test comprising speech items that are truly novel (i.e., that
have been neither tested nor trained before). This can be problematic since
stimuli in the two tests are not comparable: if the tests are not matched on at
least one feature, the comparison may be hindered. The second is assessing
transfer of learning
with stimuli that were not included in the training, but had been presented be-
fore in the pretest. Although this procedure allows the comparison of percep-
tual performance in trained and untrained items, it raises the question of
whether stimuli presented for a second time can be termed novel (Logan and
Pruitt 1995: 371), which may impair the generalization assessment. In regard to
this, Sakai and Moorman (2018) recommend that training studies use both a pri-
mary posttest that does not change any feature from the training and secondary
tests that generalize features to make results more comparable across studies.
1.1.2 Retention of learning
Retention of learning is measured by comparing the learners’ performance at a
minimum of two moments after training, though more moments can be included.
The aim is to assess whether the training program has lasting effects in the
learners’ L2 phonological system with the establishment of new perceptual cat-
egories for L2 phonemes or with the readjustment of existing categories to ac-
commodate new L2 sounds. The learners’ L2 perceptual abilities after training
are thus compared between an immediate posttest and a delayed posttest
though some studies also include a comparison between their baseline
performance (pretest) and the delayed posttest. The results can be interpreted
as follows, provided that posttest perceptual performance was significantly more
accurate than the pretest results: (1) no significant difference between posttest
and delayed posttest indicates retention of the learning attained during
training; (2) a significant positive difference (i.e., improvement) between
posttest and delayed posttest shows that learning continued after training
was over; (3) a significant negative difference (i.e., poorer performance) is
indicative of no long-term effect.
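The three-way interpretation above can be written as a small decision rule. The Python sketch below is illustrative only: the function name, its parameters, and the outcome labels are hypothetical, and it assumes that the significance decisions have already been made with whatever statistical test a given study uses.

```python
from enum import Enum


class RetentionOutcome(Enum):
    RETAINED = "learning retained"
    CONTINUED = "learning continued after training"
    NOT_RETAINED = "no long-term effect"
    NOT_INTERPRETABLE = "no significant posttest gain to retain"


def interpret_retention(posttest_gain_significant: bool,
                        difference_significant: bool,
                        delayed_minus_posttest: float) -> RetentionOutcome:
    """Classify a posttest vs. delayed-posttest comparison.

    posttest_gain_significant: posttest significantly more accurate than pretest.
    difference_significant: posttest and delayed posttest differ significantly.
    delayed_minus_posttest: delayed posttest score minus posttest score.
    """
    # The interpretation only applies if training produced a gain at posttest.
    if not posttest_gain_significant:
        return RetentionOutcome.NOT_INTERPRETABLE
    # (1) No significant change after training: the gain was retained.
    if not difference_significant:
        return RetentionOutcome.RETAINED
    # (2) Significant improvement: learning continued after training.
    if delayed_minus_posttest > 0:
        return RetentionOutcome.CONTINUED
    # (3) Significant decline: no long-term effect.
    return RetentionOutcome.NOT_RETAINED
```

Encoding the precondition as its own outcome keeps a null posttest result from being misread as successful retention.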
1.2 The present study
The present systematic review includes perceptual training studies that have
tested for robustness of speech learning by administering testing tasks to assess
generalization and/or retention. Despite the four decades of research on L2 speech
learning, the findings about the efficacy of training in promoting generalization of
learning and long-term modification of learners’ perceptual performance are
somewhat scattered. Therefore, this review aims to provide a succinct overview of
perceptual training studies to answer the following research questions:
1. How often are both measures of generalization and retention of learning
adopted in perceptual training studies of L2 speech?
2. How effective are L2 perceptual training studies in promoting robust speech
learning?
In sum, the goal is to assess the carryover and long-term effects of perceptual
training on L2 speech learning in a population of adult L2 learners, by address-
ing the PICO components (i.e., Population, Intervention, Comparison(s) and
Outcome) of a systematic review (Higgins et al. 2019).
In section 2, we describe the method, including the literature search and the
coding, and tabulate information from each study included in the review. Sec-
tion 3 comprises the descriptive results concerning participant demographics,
scope of training (i.e., target segmental or suprasegmental structures), perceptual
training features, assessment of generalization and testing of retention of learn-
ing. We then report trends in the data, but do not conduct any statistical meta-
analysis.
2 Method
2.1 Literature search
For the purpose of this review, we included experimental research that met the
following six eligibility criteria: studies that (1) were published between 1980
and 2020 in peer-reviewed journals; (2) were written in English; (3) implemented
a perceptual training experiment; (4) targeted adult participants; (5) tested
phonological features, segments or suprasegments in a second or foreign language
for the participants; and (6) included tasks that assessed generalization and/or
retention of speech learning.
We ran a comprehensive keyword search of four relevant academic databases in
the fields of applied linguistics and education (ERIC, LLBA, MLA, and PsycInfo)
for studies published in peer-reviewed journals between January 1, 1980 and
December 31, 2020. Other types of research output, such as doctoral
dissertations, master’s theses, and conference proceedings, were not included.
Keywords used to search for studies were a combination of terms from three
conceptual groups (topic, intervention and outcomes): (a) “second language” or
“non native language” or “foreign language” or “third language”, and
(b) “phonetic training” or “percept* training” or “variability training” or HVPT
or “speech training” or “audi* training” or “pronunciation training” or
“computer based training” or “web based training” or “segmental training” or
“suprasegmental training” or “laboratory training” or “phonological learning” or
“speech learning” or “sound learning” or “percept* learning”, and
(c) “generalization” or “transfer” or “untrained” or “nontrained” or
“non-trained” or “new” or “novel” or “retention” or “delayed” or “robust*”.
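The search logic described above (terms joined by "or" within each conceptual group, and the three groups joined by "and") can be sketched as follows. This is a minimal Python illustration, not the query actually submitted to any database: the helper names are hypothetical, the term lists are shortened subsets of those given above, and the generic AND/OR syntax is an assumption, since each database has its own query language.

```python
def or_group(terms):
    """Join the terms of one conceptual group with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(groups):
    """Join the conceptual groups with AND so that a hit must match every group."""
    return " AND ".join(or_group(g) for g in groups)


# Shortened, illustrative subsets of the three groups listed in the text.
topic = ["second language", "foreign language", "third language"]
intervention = ["phonetic training", "percept* training", "HVPT"]
outcomes = ["generalization", "transfer", "retention", "robust*"]

print(build_query([topic, intervention, outcomes]))
```

Keeping each group as a separate list makes it easy to report the full search string alongside the review, or to rerun the search with an added term.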
In the literature search, 332 papers were retrieved. Exact duplicates (n=93)
were excluded at this point, as well as any other type of publication (n=26).
After reading the titles and abstracts, all studies irrelevant to the present
review were discarded (n=186), which left 27 journal publications. Of these 27
papers, one publication could not be obtained.1 The remaining articles (n=26)
were read in full. One of these contained two independent experiments, each of
which fully met the eligibility criteria. As such, for quantitative analysis
purposes, this paper was taken to represent two different studies, as combined
data analysis was not possible. Thus, the data presented below refer to a total
of 27 studies from 26 peer-reviewed papers (Table 1; asterisked in the
references).
1 Due to COVID-19 restrictions, the university library’s scan-and-deliver service
was suspended from March 2020, and Morosan and Jamieson’s (1989) study, which is
only available in print, could not be obtained.
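The screening counts reported above can be checked with a few lines of arithmetic. The Python sketch below uses only the figures given in the text; the function and variable names are hypothetical.

```python
def screening_flow(retrieved, exclusions):
    """Return (reason, excluded, remaining) for each screening step in order."""
    remaining = retrieved
    steps = []
    for reason, n in exclusions:
        remaining -= n
        steps.append((reason, n, remaining))
    return steps


# Figures as reported in the literature search description.
EXCLUSIONS = [
    ("exact duplicates", 93),
    ("other publication types", 26),
    ("irrelevant after title/abstract screening", 186),
    ("full text could not be obtained", 1),
]

steps = screening_flow(332, EXCLUSIONS)
papers_read = steps[-1][2]
# One paper reported two independent experiments and is counted as two studies.
studies = papers_read + 1

for reason, n, remaining in steps:
    print(f"{reason}: -{n}, {remaining} remaining")
print(f"papers read in full: {papers_read}; studies analyzed: {studies}")
```

The distinction between papers (26) and studies (27) matters for every percentage reported later, since those are computed over the 27 studies.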
Table 1: Summary of studies assessing generalization and/or retention of learning, 1982–2020.
Study | L1 | L2 | Sample size | N° exp. groups | Target structures | Generalization test | Retention test
Bradlow et al. () | Japanese | English | | | liquids | yes | yes
Bradlow et al. () | Japanese | English | | | liquids | yes | no
Burnham () | English | Arabic | | | fricatives | yes | no
Cebrian and Carlet () | Catalan-Spanish | English | | | vowels, stops, fricatives | yes | no
Cheng et al. () | Mandarin | English | | | vowels | yes | no
Fouz-González and Mompean () | Spanish | English | | | vowels | yes | yes
Fuhrmeister and Myers () | English | Hindi | | | stops | no | yes
Fuhrmeister, Schlemmer and Myers () | English | Hindi | | | stops | yes | yes
Godfroid, Lin and Ryu () | English | Mandarin | | | tone | yes | yes
Hardison () [Experiment ] | Japanese | American English | | | liquids | yes | no
Hardison () [Experiment ] | Korean | American English | | | liquids | yes | no
Huensch and Tremblay () | Korean | English | | | stops | yes | no
Iverson and Evans () | Spanish, German | English | | | vowels | yes | yes
Lee and Lyster () | Korean | English | | | vowels | yes | yes
Lively et al. () | Japanese | English | | | liquids | yes | yes
McCrocklin () | Mandarin and other languages | English | | | vowels | no | yes
Motohashi-Saigo and Hardison () | English | Japanese | | | stops and fricatives | yes | no
Nishi and Kewley-Port () | Japanese | English | | | vowels | yes | yes
Okuno and Hardison () | English | Japanese | | | vowels | yes | no
Pruitt, Jenkins and Strange () | English, Japanese, Hindi | Hindi | | | stops | yes | no
Shport () | English | Japanese | | | stress (pitch) | yes | no
Strange and Dittmann () | Japanese | English | | | liquids | yes | no
Thomson () | Mandarin | English | | | vowels | yes | yes
Vlahou, Seitz, and Kopčo () | English | Hindi | | | stops | yes | no
Wang and Munro () | Mandarin, Cantonese | English | | | vowels | yes | yes
Wang () | Hmong, Japanese, English | Mandarin | | | tone | yes | no
Wang et al. () | English | Mandarin | | | tone | yes | yes

2.2 Coding
The coding of the studies consisted of two phases. First, a preliminary set of
variables was identified related to participants, target structure, perceptual
training and assessment of training, specifically generalization and retention
of speech learning. A coding scheme was then developed and piloted on a sample
of papers from the 27 studies. Both researchers coded all studies to ensure that
relevant information was not missed. The coding was discussed among the authors
of the study, unclear codes were revised, and discrepancies were resolved by
agreement. The codes for each category of the coding scheme for this literature
review synthesis are presented in Table 2.
3 Results of the review

3.1 Participants
All studies focused on adult participants: in 26 papers, adults were explicitly
characterized as being over 18 years old, and one study (Shport 2016) did not
contain any information on the age of the participants, though it could be
inferred that they were adult L2 learners by reference to their academic
achievement. Testing adult learners of a second language was defined as an
inclusion criterion, so one would not expect to find studies testing children or
adolescents at this stage. Still, we had hoped to obtain more detailed
information on the age ranges and means. Only one study provided both
statistical measures, which makes it impossible to present a quantitative
summary for this parameter. Nonetheless, by examining the age range of the 15
studies which report these data, we can observe
that only one of these studies (Fuhrmeister and Myers 2020) recruited older adult
participants, over the age of 40 (range 18–59, mean=38). Regarding the linguistic
background of the participants, a third of the studies focused on native English
speakers learning an L2 and six other L1s could be found, with the following
count: English – 9 studies; Japanese – 6 studies; Mandarin and/or other Chinese
languages – 4 studies; Korean – 3 studies; three or more L1s – 3 studies;
Spanish – 1 study; and Catalan-Spanish – 1 study. As for the participants’ L2, that is,
the target language (TL) in the 27 studies that constitute the corpus of this analy-
sis, in 16 experiments English was the target language, four studies focused on
Hindi as the L2 (e.g., Pruitt, Jenkins, and Strange 2006), Japanese was the lan-
guage being learnt by the participants in three studies (e.g., Motohashi-Saigo
and Hardison 2009), Mandarin or other Chinese language(s) was covered in
Table 2: Coding scheme.
Variables | Codes

Participants
Age |
First language |
Target language |
Learning context | L2 context; FL context
Target language proficiency | naïve; beginner; intermediate; advanced

Study
Sample size |
N° of groups |
Type of target structure | segment; suprasegment; feature
Target structure | stops; fricatives; liquids; vowels; stress; tone; syllable
N° of target structures |

Training
Training paradigm | HVPT; LVPT; both
Tasks | ID; DISC; both
Number of sessions |
Length of session |
Length of training |
N° of tokens per session |
Type of feedback | trial-by-trial; task-by-task; session-by-session; no feedback
Instruction | yes; no
Training setting | laboratory; classroom; home; other

Stimuli
Quality | natural; synthesized; both
Type | real words; pseudowords; both
Presentation | visual-only; audio-only; audiovisual; audio and audiovisual; visual, audio, audiovisual

Retention
Control group | yes; no
Tasks | ID; DISC; both
Comparison | pretest and delayed posttest; posttest and delayed posttest; both
Retention of learning | yes; no
Ret in all conditions | yes; no
Ret of generalization | yes; no; n/a
Time after posttest |

Generalization
Control group | yes; no
Tasks | ID; DISC; both
N° of gen tests |
Gen to new tasks | yes; no; n/a
Gen to new stimuli | yes; no; n/a
Gen to new talkers | yes; no; n/a
Gen to new contexts | yes; no; n/a
Gen to other conditions | yes; no; n/a
Gen in all conditions | yes; no; n/a
three studies (e.g., Wang 2013), and, finally, one paper investigated L2 Arabic
(Burnham 2013).
Regarding the learning context, ten studies (37%) focused on participants
who learnt the TL in a foreign language context (i.e., a classroom context in an
environment where the TL is not the societal language) (e.g., Okuno and Hardi-
son 2016; Wang et al. 1999), ten experiments recruited learners who acquired it
in a second language setting (i.e., a naturalistic target language environment)
(e.g. Nishi and Kewley-Port 2007; Wang and Munro 2004), one study included
participants who learned the TL in both contexts (Iverson and Evans 2009), and
six studies did not provide specific information about the language learning
environment.
Proficiency in the L2 ranged from initial to advanced levels and some stud-
ies focused on participants with little or no knowledge of the target language,
with the following distribution: naïve listeners – 6 studies; beginners – 5 stud-
ies; intermediate learners – 6 studies; advanced learners – 3 studies. Three pa-
pers did not mention their participants’ proficiency and four studies tested
learners with several proficiency levels (e.g., Lee and Lyster 2016; Okuno and
Hardison 2016). The approaches followed to assess proficiency are institutional,
in which grouping is based on the participants’ curricular or course levels (e.g.,
Burnham 2013; Cebrian and Carlet 2014; Fouz-González and Mompean 2020),
or based on length of language experience in FL and/or L2 settings (e.g., Cheng
et al. 2019; Okuno and Hardison 2016). Some of them are also impressionistic,
in which grouping is based on subjective self-assessment descriptors (e.g., Lee
and Lyster 2016). Few studies used standardized language proficiency tests
(e.g., Huensch and Tremblay 2015; Iverson and Evans 2009).
Sample sizes varied greatly (mean=48, SD=58): the largest share of studies (44%)
recruited 20 to 40 participants (e.g., Shport 2016; Wang 2013); 37% of the
experiments had a sample size ranging between 40 and 303 L2 learners (e.g.,
Fouz-González and Mompean 2020; Lee and Lyster 2016); and 19% tested samples
of fewer than 20 participants (e.g., Strange and Dittmann 1984; Wang et al.
1999). These descriptive statistics on sample size should take into consideration
the number of groups in the experimental design. For example, the study
which recruited over 300 participants had five experimental groups, with an av-
erage sample size of 51 participants per group, and one control group (n=50)
(Godfroid, Lin and Ryu 2017). On average, studies reported findings with 15 par-
ticipants per group (including the control group, when there was one), ranging
from 3.2 to 50.5 (median=12).
3.2 Target structure
The vast majority of the studies (82%) tested L2 segments whereas only 15% tar-
geted suprasegments, and 3% examined syllable structure (viz. codas, Huensch
and Tremblay 2015). Among the 22 papers that focused on L2 phonemic catego-
ries, 36% examined vowels (e.g., Iverson and Evans 2009; Wang and Munro
2004), 27% tested liquids (e.g., Lively et al. 1994; Bradlow et al. 1997), and 18%
were dedicated to stops (e.g., Pruitt et al. 2006; Vlahou, Seitz, and Kopčo 2019).
Fricatives were the target category in only one study (Burnham 2013) and two
experiments tested several phonemes (e.g. Cebrian and Carlet 2014). As for
suprasegments, 75% of the four publications analyzed tonal structures (e.g.,
Wang et al. 1999) and a single study focused on pitch-accent patterns (Shport
2016). Irrespective of target structure type, 48% of all papers tested one to two
segments/suprasegments (e.g., Hardison 2003; McCrocklin 2012), 26% of the ex-
periments trained their participants on three to four structures (e.g., Lee and
Lyster 2016; Wang 2013) and seven articles (26%) implemented training on five
or more target phonological units (e.g., Iverson and Evans 2009 targeted 14 En-
glish vowels; Nishi and Kewley-Port 2007 compared training with a large set of
9 vowels and a subset of 3 English vowels). Stimuli containing the structures of
interest were naturally produced in most cases (85% of all 27 studies) and were
embedded in real words in 73% of the 22 experiments that provided information
on the stimulus type.
3.3 Phonetic training
The majority of studies (70%) opted for the audio-only modality in their training
programs, with only 7% adopting the audiovisual format exclusively (e.g., Cheng
et al. 2019). The remaining experiments (22%) used both audio and audiovisual
stimuli presentation (e.g., Hardison 2003; Okuno and Hardison 2016). The pre-
dominant training paradigm (70%) was the high-variability phonetic training.
Among these 19 papers, two studies did not explicitly categorize their training
program, but they adopted variability of some sort (talkers’ voice and/or phonetic
context) and were, therefore, classified as HVPT experiments by the authors of
the present review (Okuno and Hardison 2016; Wang 2013). Identification tasks
were the preferred training task and were used in 85% of the studies. The remain-
ing four training experiments either used discrimination training tasks exclu-
sively (McCrocklin 2012; Strange and Dittmann 1984) or combined discrimination
and identification procedures (Cebrian and Carlet 2014; Fuhrmeister and Myers
2020). A similar trend was found for the test type before and after training: in
81% of the experiments, identification tasks were used to measure improvement
from training, and 19% of the programs combined identification and
discrimination tasks to assess the effect of training (e.g., Fuhrmeister and
Myers 2020; Strange and Dittmann 1984). No study used one type of training task
and assessed immediate outcomes of the instructional program exclusively with a
different task. The number of training sessions varied greatly, ranging from a
single session (e.g., Fuhrmeister, Schlemmer and Myers 2020) to 45 sessions
(Bradlow et al. 1997, 1999) in the 26 papers which reported this information
(mean=11; median=8). Six studies did not mention the duration of each session,
and in the remaining 21 papers each session
lasted on average 41 minutes (range=13–90 min; median=40 min) in training pro-
grams with a mean total duration of approximately six hours (range=26 min–
1125 min; median = 320 min = approx. 5.3 hours). The time span ranged from
one day (2 out of the 25 papers that reported this metric; Fuhrmeister and Myers
2020; Fuhrmeister, Schlemmer and Myers 2020) to two months (1 study; Wang
and Munro 2004). Regarding the number of tokens per training session, consider-
ing the 25 papers that reported this information, trainees heard an average of 241
tokens in each session (range=20–700 tokens; median=180). Training sessions
took place mainly in a laboratory (85%), and 93% of the experiments provided
immediate trial-by-trial feedback. Fouz-González and Mompean (2020) also
reported providing cumulative (i.e., session-by-session) feedback. Seven training
programs included instructions, mostly in the form of brief phonetic descriptions
of the characteristics of the target segments (e.g. Pruitt, Jenkins, and Strange
2006; Cebrian and Carlet 2014) or suprasegments (Godfroid, Lin and Ryu 2017).
With respect to the number of experimental groups, most studies tested one or
two groups (1 group = 37%; 2 groups = 48%). Only four out of 27 experiments
(15%) did not have a control group (e.g., McCrocklin 2012), but all four had
at least two experimental groups.
3.4 Assessment of robustness of learning
From the pool of 27 studies, fewer than half included both measures of
robustness of learning (generalization and retention). Eleven studies included
generalization and retention tests after training (e.g., Iverson and Evans 2009; Nishi
and Kewley-Port 2007), 14 experiments included only the testing of generaliza-
tion of improvement achieved during training (e.g., Cebrian and Carlet 2014;
Wang 2013) and two assessed the long-term effects of training exclusively
(Fuhrmeister and Myers 2020; McCrocklin 2012).
3.4.1 Assessment of generalization
Twenty-five out of the 27 studies included in this analysis (93%) tested generaliza-
tion of the learning obtained via perceptual training to untrained conditions (stim-
uli, phonetic context, talker, task). As such, all percentages presented below,
referring to generalization measures, will consider n=25, unless otherwise stated.
Ninety-two percent of the experiments used identification tasks to assess transfer
of learning. Generalization of learning to new stimuli was tested by 84% of the 25
studies (e.g., Hardison 2003; Cebrian and Carlet 2014) and 92% investigated trans-
fer of improvement to the perception of untrained talkers (i.e., new voices) (e.g.,
Cheng et al. 2019; Okuno and Hardison 2016), and six out of 25 papers (24%) dealt
with generalization to new phonetic contexts (e.g., Thomson 2012; Wang 2013). All
studies found evidence of generalization of learning, but only 68% reported that
effect for all conditions tested. For example, whereas Godfroid, Lin and Ryu
(2017) reported transfer of perceptual learning to untrained tasks, stimuli and
talkers, Shport (2016) found evidence of generalization to new stimuli but not to
novel voices and Lee and Lyster (2016) observed the opposite trend, i.e., transfer
to novel talkers but not to untrained stimuli. Additionally, 18 out of the 21 experi-
ments (86%) that assessed generalization to untrained tokens reported transfer
in that condition (e.g., Bradlow et al. 1997, 1999). Approximately the same
percentage of studies (87%) reported evidence of transfer of learning to novel talkers
(e.g., Cebrian and Carlet 2014), out of the 23 papers that tested generalization in
this condition. Three studies in which the training program used tasks different
from the tests also reported carryover effects of training. For example, Strange
and Dittmann (1984) reported that improvement in AX discrimination tasks gen-
eralized to categorical perception identification tasks of the same synthetic stim-
uli. Five of the six studies that investigated transfer of perceptual learning to new
phonetic contexts observed generalization. One study (Thomson 2012) reported
mixed findings, with transfer of vowel perception to only one of the three new
contexts examined. Another experimental condition investigated by one of the
studies was speech perception in rooms with different acoustics (Vlahou, Seitz,
and Kopčo 2019), which reported that one of the experimental groups (trained in
multiple-room reverberant environments) generalized improvement to an un-
trained room.
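Because the denominator changes from one statement to the next in this section (27 studies overall, 25 that tested generalization, 21 that tested untrained tokens), it can help to recompute each share from its count. The Python sketch below does this for the counts stated explicitly above; the helper name is hypothetical.

```python
def share(count, total):
    """Percentage of `count` out of `total`, rounded to the nearest integer."""
    return round(100 * count / total)


# (description, count, denominator) as stated in the text.
REPORTED = [
    ("studies that tested generalization", 25, 27),
    ("studies that tested new phonetic contexts", 6, 25),
    ("transfer to untrained tokens, among those testing it", 18, 21),
]

for description, count, total in REPORTED:
    print(f"{description}: {count}/{total} = {share(count, total)}%")
```

Reporting the count and denominator next to each percentage makes such checks possible for readers as well.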
3.4.2 Assessment of retention
Thirteen out of the 27 papers considered in this review (48%) tested retention of
learning. Thus, the information presented below, referring to retention measures,
will consider n=13. Seventy-seven per cent of these experiments used identifica-
tion tasks to assess performance some time after training was completed (e.g.,
Iverson and Evans 2009; Thomson 2012) and 85% tested retention using the
same type of task as in training. Thirty-one per cent of the experiments compared
performance in the delayed posttest with scores in the immediate posttest (e.g.,
Lee and Lyster 2016; McCrocklin 2012); 46% used pretest accuracy as the baseline
for comparison (e.g., Godfroid, Lin and Ryu 2017), two studies (15%) provided a
comparison between performance in the retention test and scores in both the pre-
test and the posttest (e.g., Iverson and Evans 2009), and one study did not in-
clude this information. The delayed posttest assessing retention of learning took
place between less than a day (e.g., Fuhrmeister, Schlemmer and Myers 2020)
and six months after training (Wang et al. 1999). Most studies (54%) tested re-
tention no longer than a month after the last training session (e.g., Godfroid,
Lin and Ryu 2017; Lee and Lyster 2016), in four experiments (31%) the delayed
posttest occurred three months after training was over (e.g. Nishi and Kewley-
Port 2007; Wang and Munro 2004), in one study four months afterwards
(Iverson and Evans 2009) and in another one six months after training completion
(Wang et al. 1999). Only two studies measured retention of learning at two
subsequent points after the posttest. For example, Lively et al. (1994) tested
retention 3 and 6 months after training was over. All 13 papers found evidence
of retention of learning. However, in only four experiments (31%) was learning
retained in all conditions tested (e.g., Fouz-González and Mompean 2020). Eight of the 13
studies considered in this section provided information on the retention of
generalized learning and seven (87.5%) observed that generalization effects
were retained up to the moment of the delayed posttest (e.g., Bradlow et al.
1999; Lively et al. 1994).
4 Discussion
4.1 Participants
Although the body of research on perceptual training conducted over the last
decades has helped support the claim of life-long perceptual plasticity,
i.e., that L2 speech learning is possible at all ages, there is a gap in the
learners’ age groups (Bohn 2018). Specifically, training studies have not
investigated perceptual learning in groups of mature adults. Due to the lack of
standard data reporting practices, it was not possible to calculate the average
age of L2 participants. However, a close examination of the measures provided
(e.g., the age range in 13 studies) allowed us to notice that only one study
included L2 learners older than 40. Future research should cover wider age
ranges that include older learners to further test the claim of life-long
perceptual learning mechanisms (Derwing et al. 2014; Bohn 2018).
Expectedly, English was the target language of most studies. This trend,
also observed by previous reviews (e.g., Lee, Jang and Plonsky 2015; Thomson
and Derwing 2015; Sakai and Moorman 2018), is explained not only by the status
of English as an international language but also by the restriction of this
review to studies written in English. It also reinforces the need to
include languages other than English so that the study of other language-
pairings can further the examination of cross-linguistic influence and other lan-
guage-specific patterns in L2 speech learning. With the exception of one study
(McCrocklin 2012), training research controlled for the learners’ L1, which al-
lows the analysis of L1 and L2 interaction. Ten of the studies were conducted in
a foreign language context, in which learners are exposed to the target lan-
guage in a formal classroom setting, and another 10 in a second language envi-
ronment, in which naturalistic exposure to the target language may also occur
outside of the classroom. The different learning contexts may imply differences
in target language input (i.e., amount of exposure) and output (i.e. frequency of
use), which need to be accounted for, in particular during the training program
and during the time elapsing between the immediate posttest and the delayed
posttest(s). This is of particular relevance for interpreting findings on the
retention of improvement achieved during training and for understanding whether
there was a change in any of the external factors pertaining to language
experience, such as amount of TL input and use and context of learning, or in
affective variables such as motivation. In justifying the decision not to collect
data as late as six months after training was over, Pereira (2014: 186) explains the risk of
biased findings: “any testing 6 months later would carry the risk of confound-
ing the results if students had not continued having the same amount of input
because they had either dropped out or failed a module taught in English re-
sulting in having less English input for some months”.
Regarding language proficiency levels, the reviewed research included participants
ranging from those with little or no knowledge of the TL to advanced learners, and only
three studies did not report any proficiency indicator. However, as described,
training studies use a variety of measures that range from institutional to
impressionistic practices. This range of proficiency measures shows a general lack of
standardization in the reporting of participant language proficiency levels, as
previously noticed for L2 studies (Thomas 1994, 2006). The sample sizes ranged
from 8 to 303 participants, with the largest share of studies (44%) involving
20–40 learners. However, the average sample size was 15 participants per
group (ranging from 3 to 51). The difficulty of recruiting and retaining participants
in a longitudinal study, which involves not only testing at different times
(two or more, if delayed posttest(s) are included) but also training over several
sessions, is acknowledged by researchers in the field of phonetic training studies.
Participant attrition, in particular, is often reported (e.g., Fouz-González and
Mompean 2020; Lively et al. 1994), and the resulting smaller sample size is
frequently recognized as a limitation of training studies. However, as also
recommended by authors of previous reviews (e.g., Sakai and Moorman 2018), an effort
must be made to increase sample sizes to conduct robust statistical analyses and
be able to generalize the findings. For example, to motivate participants to
complete all the phases of the training study, incremental compensation in the form
of participation fees or course credit could be provided.
4.2 Target structure
The underrepresentation of suprasegmentals in L2 speech learning research
(Thomson and Derwing 2015) is also observed in this dataset, with the vast
majority of training studies focusing on segmental categories such as vowels, liquids
and stops. To contribute meaningfully to the understanding of L2 speech
learning, the scope of training should include all aspects of speech, including its
suprasegmental features.
Half of the studies dealt with one or two segments/suprasegments, while
a third included several (>5) phonemic structures. For example, Nishi and
Kewley-Port’s (2007) findings of greater improvement for the experimental
group trained with a fullset of vowels than for the subset group, and the
reported retention and generalization of learning for the fullset group, seem
to suggest that a larger scope in perceptual training may lead to more robust
learning. However, more data is needed to draw such conclusions. Another
trend observed was the embedding of the target structures in naturally produced
lexical items in most experiments, which seems indicative of a concern
with the ecological validity of L2 speech training research.
4.3 Phonetic training
To reproduce the conditions that learners encounter naturalistically in a new
language environment, several training studies adopted a high variability training
paradigm in which participants are exposed to a wide range of variability in
speech (talkers, stimuli, contexts). Most studies have opted for an audio-only
stimuli presentation, while fewer have included an audiovisual presentation
mode or a combination of both (e.g., Hardison 2003; Okuno and Hardison
2016). Given that speech perception involves the integration of both auditory
information and visual cues, the preferred selection of audio stimuli does not
emulate the multimodal nature of speech processing (Rosenblum 2005, 2008).
Based on the information provided, we observed that the duration of training is
not related to its scope, i.e., there is no positive association between the number of
trained structures and the length of training. For example, the training with the
longest duration (45 sessions of 20–30 minutes) (Bradlow et al. 1999) focused
on two target liquid sounds, and one of the experiments with the shortest length
(8 sessions of 15–20 minutes) (Thomson 2012) included ten vowels. This finding
contradicts Thomson and Derwing’s (2015: 11) conclusion that
“the amount of pronunciation-specific input learners access is related to scope
of instruction”. As previously noted, some of the features of the reviewed training
studies seem to suggest a preference for reproducing naturalistic language
learning environments; however, other aspects are more characteristic
of artificial settings, as in the vast majority of the experiments,
which took place in the laboratory.
Though instructional variations have little effect on listeners’ performance
in identification and discrimination perception tasks (Beddor and Gottfried
1995), only a few studies provided brief articulatory descriptions of the target
segments (e.g., Cebrian and Carlet 2014; Fouz-González and Mompean 2020). It
is worth noting that, despite providing this information before the
training session, these studies isolated the speech modality, that is, no
opportunities were given in the perceptual training for production (i.e., repetition) of
the target sounds.
4.4 Assessment of robustness of learning
The testing of generalization is a valuable learning robustness measure that
addresses the ecological validity of training studies. All studies that tested for it
found evidence of generalization of learning, and 17 of the 25 experiments reported
a carryover effect of training for all conditions tested. However, the extent of
generalization was not consistent across all training conditions and target L2
segments/suprasegments. It may depend on the type of stimuli (e.g., Strange and Dittmann 1984
reported no generalization of learning with synthetic stimuli to real words), on
the type of phonetic context (e.g., Thomson 2012 reported transfer of vowel perception
to only one of the three new consonantal contexts tested), and on the target
phonological units (e.g., Cebrian and Carlet 2014 attested generalization of learning
for the target stops and fricative consonants but not for the labiodental /v/).
Sakai and Moorman (2018) reported that only 7 out of 30 perception studies
in their meta-analysis included the testing of retention of L2 speech learning,
and thus we were expecting to find the same trend. However, although this review
confirmed that few training studies include an assessment of long-term effects,
the proportion is higher, with 13 of 27 studies
including the testing of retention of improvement. The assessment of generalization
of learning is more frequent, being reported in 25 studies. Two possible
reasons may explain the lower number of perceptual studies that include
delayed posttest(s). On the one hand, it is challenging to retain participants
over an extended period of time that can range from one week to six months (or
longer) after training. For example, several studies include participants who
are undergraduate or graduate students (e.g., Burnham 2013; Motohashi-Saigo
and Hardison 2009) and who may no longer be available after a certain period of
time, particularly if the study timeline does not coincide with the academic
calendar. On the other hand, there is a methodological concern with the
control of the amount of TL input that participants are exposed to outside of the
experiment in the time gap between the posttest and the delayed posttest(s). The
thirteen perceptual training studies that included delayed posttests found
evidence of retention of learning. However, only four experiments reported positive
long-term effects in all conditions tested (e.g., Fouz-González and Mompean
2020). The other studies reported partial or mixed findings. For example, in Nishi
and Kewley-Port’s (2007) study, only the experimental group trained with the
fullset of target segments retained learning after three months in the
generalization to new voices and real words, and no effect was observed in the subset group
of trainees. Generalization effects were retained up to the moment of the delayed
posttest in seven studies. Lively et al.’s (1994) findings show the same
generalization tendency (transfer of learning to new words produced by a familiar talker to a
greater extent than generalization to a new talker) three and six months after
training was over. Further research that includes the assessment of generalization
in the delayed posttest(s) could provide meaningful information regarding L2
speech learning development.
Although the analysis of transfer of perceptual improvement to production was
beyond the scope of this review (see Sakai and Moorman’s 2018 meta-analytic
review), we observed that fewer than a third (22%) of the 27 studies assessed the
relation between the two speech modalities, and only one study (Bradlow et al.
1999) included the three measures of robustness of learning.
To understand the three major processes involved in second language
speech learning – perceptual plasticity, modality transfer, and robustness of
learning – an ideal perceptual training study should include participants with a
range of ages that includes older learners (>40 years old) to examine perceptual
malleability over the life span, the testing of carryover effects of improvement
in perception to production, the assessment of transfer of speech learning to
new experimental conditions, and the evaluation of whether the training
intervention produced long-term effects in the establishment of new phonemic
segments and/or suprasegments in the learners’ L2 phonological system. In line
with other researchers (e.g., Thomson and Derwing 2015; Sakai and Moorman
2018), we also suggest that empirical research should include a large sample to
conduct statistical analyses and calculate effect sizes, adopt standard
quantitative data reporting procedures, involve the participation of a control group, and
provide enough information about the study to allow replication.
5 Conclusion
To examine the use of measures of robustness of L2 speech learning, 27 studies
were gathered for this literature review. Fewer than half of the studies (n=11)
included both generalization and retention testing. Fourteen experiments tested
for generalization of learning exclusively and two assessed retention of
learning only. Transfer of learning to new experimental conditions is thus more
frequently tested than the long-term effects of training, which highlights the need
to further investigate the effects of phonetic training programs, including
delayed testing at more than one point in time after training is over. This
would require, nonetheless, a thorough account of potential changes in the
participants’ learning experience and context in the time elapsing between the posttest
and the delayed posttests.
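The counts reported in this review can be cross-checked with simple arithmetic. The following Python sketch is illustrative only: the variable and function names are ours, and the figures are those stated in the text (including the 7 out of 30 studies reported by Sakai and Moorman 2018).

```python
# Counts as reported in this review (27 perceptual training studies).
both = 11                  # tested generalization and retention
generalization_only = 14   # tested generalization exclusively
retention_only = 2         # tested retention only

total = both + generalization_only + retention_only
tested_generalization = both + generalization_only
tested_retention = both + retention_only


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(total, tested_generalization, tested_retention)
print(share(tested_retention, total))      # retention testing, this review
print(share(7, 30))                        # retention testing, Sakai and Moorman (2018)
print(share(17, tested_generalization))    # generalization in all conditions
print(share(4, tested_retention))          # retention in all conditions
```

The three categories add up to the 27 studies reviewed, and the derived totals match the 25 studies testing generalization and the 13 testing retention mentioned above.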
The findings of the present narrative review show that all studies that
tested for carryover effects of training found evidence of generalization of
improvement, and most of the experiments (17 out of 25) reported transfer of
learning for all conditions tested. The same trend was observed for the testing
of retention, with all studies that tested for retention reporting positive lasting
effects of perceptual training. However, fewer than half of the experiments (4 out
of 13) reported retention of improvement in all conditions.
In order to be able to conduct an exhaustive literature search and uphold
the quality and validity of perceptual training research, we decided to only
retrieve peer-reviewed journal publications. However, this decision may have
impacted the results of our review, which seems to reflect a publication bias.
Thornton and Lee (2000) explain that this occurs when studies reporting
equivocal or negative findings, which are nonetheless important, are
systematically excluded from publication, leaving room for bias. In the context
of a meta-analysis or systematic review, the results may indicate that they were
based on a selection of studies with positive (i.e., significant) findings, which
have a higher likelihood of being published and identified through searches.
Future research should therefore include dissertations, theses and conference
proceedings that may provide more representative findings (including
non-significant results) on the assessment of robustness of learning in speech
perception training studies. Grey literature (e.g., unpublished reports) may also
provide data that is less biased; however, as Sakai and Moorman (2018) note,
an electronic search for such sources may retrieve a less comprehensive dataset.
This narrative review provided a presentation and synthesis of the
characteristics and findings of perceptual training studies which met the pre-defined
eligibility criteria, specifically research which included the testing of
generalization and/or retention of learning. Hence, a summary of information was
provided regarding the participants’ demographics and linguistic background,
the study’s sample size, and the training. Empirical evidence on the assessment
of generalization and the testing of long-term effects of training was also
summarized to identify trends in the data. However, this review did not calculate
the strength and consistency of such evidence of perceptual training impact on
the two aforementioned measures. We noted that some of the studies did not
provide standard measures (e.g., means and standard deviations for all the
measures of interest) that are necessary to conduct a statistical meta-analysis.
Considering the aforementioned limitations, the next step is, therefore, to
compare this review with a meta-analysis of (some of) the same studies and
include dissertations, theses, and studies found in conference proceedings.
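To illustrate why means and standard deviations are needed for a meta-analysis, the sketch below computes one commonly used effect size, Cohen's d with a pooled standard deviation, for a trained group compared with a control group. This is a minimal sketch and not the procedure of any reviewed study: the function name is ours, and the scores in the usage example are hypothetical.

```python
import math


def cohens_d(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)


# Hypothetical posttest identification scores (percent correct),
# with 15 participants per group.
d = cohens_d(mean_t=80.0, sd_t=10.0, n_t=15, mean_c=70.0, sd_c=10.0, n_c=15)
print(round(d, 2))
```

Without the group means, standard deviations, and sample sizes, an effect size of this kind cannot be computed from a published report, which is why incomplete reporting limits what a meta-analysis can include.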
Funding
This work was supported by the Victoria College Research Award (Fall 2020),
University of Toronto.
References
Aliaga-Garcia, Cristina. 2017. The effect of auditory and articulatory phonetic training on the
perception and production of L2 vowels by Catalan-Spanish learners of English.
Barcelona: Universitat de Barcelona dissertation.
Beddor, Patrice & Terry Gottfried. 1995. Methodological issues in cross-language speech
perception research with adults. In Winifred Strange (ed.), Speech Perception and
Linguistic Experience: Issues in Cross-Language Research, 207–232. Timonium, MD: York
Press.
Best, Catherine. 1995. A Direct Realist view of cross-language speech perception. In Winifred
Strange (ed.), Speech Perception and Linguistic Experience: Issues in Cross-Language
Best, Catherine & Michael Tyler. 2007. Nonnative and second-language speech perception:
Commonalities and complementarities. In Ocke-Schwen Bohn and Murray Munro (eds.),
Language Experience in Second Language Speech Learning – In honor of James Emil
Flege, 13–34. Amsterdam: John Benjamins Publishing Company.
Bohn, Ocke-Schwen. 2000. Linguistic relativity in speech perception: An overview of the
influence of language experience on the perception of speech sounds from infancy to
adulthood. In Susanne Niemeier & René Dirven (eds.), Evidence for Linguistic Relativity,
1–28. Amsterdam: John Benjamins Publishing Company.
Bohn, Ocke-Schwen. 2018. Cross-language and second language speech perception. In
Eva M. Fernández & Helen Smith Cairns (eds.), The Handbook of Psycholinguistics,
213–239. New Jersey, USA: Wiley.
Bradlow, Ann R., Reiko Akahane-Yamada, David B. Pisoni & Yoh’ichi Tohkura. 1999. Training
Japanese listeners to identify English /r/ and /l/: Long-term retention of learning in
perception and production. Perception and Psychophysics 61(5). 977–985.
https://doi.org/10.3758/BF03206911
Bradlow, Ann R., David B. Pisoni, Reiko Akahane-Yamada & Yoh’ichi Tohkura. 1997. Training
Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning
on speech production. Journal of the Acoustical Society of America 101(4). 2299–2310.
Burnham, Kevin R. 2013. Phonetic Training in the Foreign Language Curriculum. Applied
Language Learning 23–24. 63–74.
Cebrian, Juli & Angélica Carlet. 2014. Second-language learners’ identification of target-
language phonemes: A short-term phonetic training study. Canadian Modern Language
Review 70(4). 474–499. https://doi.org/10.3138/cmlr.2318.
Cheng, Bing, Xiaojuan Zhang, Siying Fan & Yang Zhang. 2019. The role of temporal acoustic
exaggeration in High Variability Phonetic Training: A behavioral and ERP study. Frontiers
in Psychology 10. 1178. https://doi.org/10.3389/fpsyg.2019.01178.
Derwing, Tracey M., Murray J. Munro, Jennifer A. Foote, Erin Waugh & Jason Fleming. 2014.
training study. Language Learning 64(3). 526–548.
Flege, James. 1995. Second language speech learning: Theory, findings and problems. In
Flege, James & Ocke-Schwen Bohn. 2021. The Revised Speech Learning Model (SLM-r). In
Fouz-González, Jonás & Jose A Mompean. 2020. Exploring the potential of phonetic symbols
and keywords as labels for perceptual training. Studies in Second Language Acquisition
43(2). 1–32. https://doi.org/10.1017/S0272263120000455
Fuhrmeister, Pamela & Emily B. Myers. 2020. Desirable and undesirable difficulties:
Influences of variability, training schedule, and aptitude on nonnative phonetic learning.
Attention, Perception, and Psychophysics 82(4). 2049–2065.
https://doi.org/10.3758/s13414-019-01925-y
Fuhrmeister, Pamela, Brianna Schlemmer & Emily B. Myers. 2020. Adults show initial
advantages over children in learning difficult nonnative speech sounds. Journal of
Speech, Language, and Hearing Research 63(8). 2667–2679.
https://doi.org/10.1044/2020_JSLHR-19-00358
Godfroid, Aline, Chin-Hsi Lin & Catherine Ryu. 2017. Hearing and Seeing Tone Through Color:
An Efficacy Study of Web-Based, Multimodal Chinese Tone Perception Training. Language
Learning 67(4). 819–857. https://doi.org/10.1111/lang.12246
Hardison, Debra M. 2003. Acquisition of second-language speech: Effects of visual cues,
context, and talker variability. Applied Psycholinguistics 24(4). 495–522.
https://doi.org/10.1017/S0142716403000250
Herd, Wendy, Allard Jongman & Joan A. Sereno. 2013. Perceptual and production training of
intervocalic /d, ɾ, r/ in American English learners of Spanish. The Journal of the Acoustical
Society of America 133(6). 4247–4255. https://doi.org/10.1121/1.4802902
Higgins, Julian P. T., James Thomas, Jacqueline Chandler, Miranda Cumpston, Tianjing Li,
Matthew J. Page & Vivian A. Welch (eds.). 2019. Cochrane Handbook for Systematic
Reviews of Interventions, 2nd edn. Chichester (UK): John Wiley & Sons.
Huensch, Amanda & Annie Tremblay. 2015. Effects of perceptual phonetic training on the
perception and production of second language syllable structure. Journal of Phonetics 52.
105–120. https://doi.org/10.1016/j.wocn.2015.06.007
Iverson, Paul & Bronwen G. Evans. 2009. Learning English vowels with different first-language
vowel systems II: Auditory training for native Spanish and German speakers. The Journal
of the Acoustical Society of America 126(2). 866–877. https://doi.org/10.1121/1.3148196
Jamieson, Donald G. & David E. Morosan. 1989. Training new, nonnative speech contrasts: A
comparison of the prototype and perceptual fading techniques. Canadian Journal of
Psychology – Revue Canadienne de Psychologie 43(1). 88–96.
https://doi.org/10.1037/h0084209
Lee, Andrew H. & Roy Lyster. 2016. Effects of different types of corrective feedback on
receptive skills in a second language: A speech perception training study. Language
Learning 66(4). 809–833. https://doi.org/10.1111/lang.12167
Lively, Scott E., David B. Pisoni, Reiko Akahane-Yamada, Yoh’ichi Tohkura & Tsuneo Yamada.
1994. Training Japanese listeners to identify English /r/ and /l/. III. Long‐term retention
of new phonetic categories. The Journal of the Acoustical Society of America 96(4).
2076–2087. https://doi.org/10.1121/1.410149
Logan, John S. & John Pruitt. 1995. Methodological issues in training listeners to perceive non-
native phonemes. In Winifred Strange (ed.), Speech Perception and Linguistic Experience:
Issues in Cross-Language Research, 351–378. Timonium, MD: York Press.
McClaskey, Cynthia, David B. Pisoni & Thomas Carrell. 1983. Transfer of training of a new
linguistic contrast in voicing. Perception and Psychophysics 34(4). 323–330.
McCrocklin, Shannon. 2012. Effect of Audio vs. Video on Aural Discrimination of Vowels.
Teaching English as a Second or Foreign Language – The Electronic Journal for English as
a Second Language (TESL-EJL) 16(2). 1–16.
Motohashi-Saigo, Miki & Debra M. Hardison. 2009. Acquisition of L2 Japanese geminates:
Training with waveform displays. Language Learning & Technology 13(2). 29–47.
Nishi, Kanae & Diane Kewley-Port. 2007. Training Japanese listeners to perceive American
English vowels: Influence of training sets. Journal of Speech, Language, and Hearing
Research 50(6). 1496–1509. https://doi.org/10.1044/1092-4388(2007/103)
Okuno, Tomoko & Debra M. Hardison. 2016. Perception-production link in L2 Japanese vowel
duration: Training with technology. Language Learning & Technology 20(2). 61–80.
Pereira, Yasna I. 2014. Perception and production of English vowels by Chilean learners of
English: Effect of auditory and visual modalities on phonetic training. London: University
College London dissertation.
Pisoni, David B., Richard N. Aslin, Alan J. Percy & Beth L. Hennessy. 1982. Some effects of
laboratory training on identification and discrimination of voicing contrasts in stop
consonants. Journal of Experimental Psychology: Human Perception and Performance
8(2). 297–314. https://doi.org/10.1037/0096-1523.8.2.297
Pruitt, John S., James J. Jenkins & Winifred Strange. 2006. Training the perception of Hindi
dental and retroflex stops by native speakers of American English and Japanese. The
Journal of the Acoustical Society of America 119(3). 1684–1696.
https://doi.org/10.1121/1.2161427
Rosenblum, Lawrence D. 2005. The primacy of multimodal speech perception. In David
B. Pisoni & Robert E. Remez (eds.), The Handbook of speech perception, 51–78. Malden,
MA: Blackwell.
Rosenblum, Lawrence D. 2008. Speech perception as a multimodal phenomenon. Current
Directions in Psychological Science 17(6). 405–409. https://doi.org/10.1111/j.1467-8721.2008.00615.x
Sakai, Mari. 2016. (Dis)connecting perception and production: Training adult speakers of
Spanish on the English /i/-/ɪ/ distinction. Washington DC: Georgetown University
dissertation. https://repository.library.georgetown.edu/handle/10822/1042879
research. Applied Psycholinguistics 39(1). 187–224.
Shport, Irina A. 2016. Training English listeners to identify pitch-accent patterns in Tokyo
Japanese. Studies in Second Language Acquisition 38(4). 739–769.
https://doi.org/10.1017/S027226311500039X
Strange, Winifred. 1995. Cross-language studies of speech perception: A historical review. In
Strange, Winifred & Sibylla Dittmann. 1984. Effects of discrimination training on the
perception of /r-l/ by Japanese adults learning English. Perception and Psychophysics
36(2). 131–145. https://doi.org/10.3758/BF03202673
Thomas, Margaret. 1994. Assessment of L2 proficiency in second language acquisition
research. Language Learning 44(2). 307–336. https://doi.org/10.1111/j.1467-1770.1994.tb01104.x.
Thomas, Margaret. 2006. Research synthesis and historiography: The case of assessment
of second language proficiency. In John M. Norris & Lourdes Ortega (eds.), Synthesizing
Research on Language Learning and Teaching, 3–50. Amsterdam: John Benjamins.
Thomson, Ron I. 2012. Improving L2 listeners’ perception of English vowels: A computer-
mediated approach. Language Learning 62(4). 1231–1258.
https://doi.org/10.1111/j.1467-9922.2012.00724.x
A narrative review. Applied Linguistics 36(3). 326–344.
https://doi.org/10.1093/applin/amu076
Thornton, Alison & Peter Lee. 2000. Publication bias in meta-analysis: Its causes and
consequences. Journal of Clinical Epidemiology 53(2). 207–216.
Vlahou, Eleni, Aaron R. Seitz & Norbert Kopčo. 2019. Nonnative implicit phonetic training in
multiple reverberant environments. Attention, Perception, and Psychophysics 81(4).
935–947. https://doi.org/10.3758/s13414-019-01680-0
Wang, Xinchun. 2013. Perception of Mandarin tones: The effect of L1 background and training.
The Modern Language Journal 97(1). 144–160.
https://doi.org/10.1111/j.1540-4781.2013.01386.x
Wang, Xinchun & Murray J. Munro. 2004. Computer-based training for learning English
vowel contrasts. System 32(4). 539–552. https://doi.org/10.1016/j.system.2004.09.011
Wang, Yue, Michelle M. Spence, Allard Jongman & Joan A. Sereno. 1999. Training American
listeners to perceive Mandarin tones. The Journal of the Acoustical Society of America
106(6). 3649–3658. https://doi.org/10.1121/1.428217
Conclusion
Tracey M. Derwing
An overview of pronunciation teaching
and training
Abstract: This volume has highlighted many themes that have emerged in the
last two decades with regard to second language (L2) pronunciation research
and practice. Some of these themes are tied to new directions in applied lin-
guistics generally, for example, Complex Dynamic Systems theory (de Bot,
Lowie, and Verspoor 2007), while others are related to refined directions in
phonetics research (Flege and Bohn 2021), and still others are based on the bur-
geoning numbers of new studies in pronunciation pedagogy. What was once a
neglected area of second language acquisition is now brimming with studies on
all fronts.
Keywords: L2 pronunciation, pronunciation teaching, pronunciation research,
intelligibility, comprehensibility
1 The goal of L2 pronunciation instruction

What has also changed in our field, in addition to increased research, is the
attitude toward what matters in pronunciation instruction. Most researchers
and practitioners agree that it is unrealistic and inappropriate to aim to
eliminate an L2 accent. Not only are we biologically programmed to home in on
the phonemes in our first language (L1) at a very early age (Werker and Tees
2002), making it more difficult to recognize similar sounds in another language at
a later age, but there are also issues of identity involved. As long as individuals
can make themselves understood without the listener having to expend a great
deal of effort, it is perfectly natural and positive to speak with an accent (it
should be noted that listeners have a responsibility to engage and adapt to
accents as well, as Derwing, Rossiter, and Munro 2002 pointed out). In fact, all L1
speakers have a dialect accent that differs from other dialects of their mother
tongue, so it is not a big leap to make the analogy with L2 accents. What really
matters is whether people are successful communicators; are they intelligible
(does the listener understand the message intended by the speaker)? and are
they comprehensible (are they easy to understand)? If the answer to both these
questions is yes, then unlike other aspects of language instruction, pronunciation
teaching is not necessary. While vocabulary, grammar and conventions of writing
may require considerable instruction in the L2 classroom, pronunciation instruction
is only necessary where intelligibility and/or comprehensibility are compromised.
Tracey M. Derwing, University of Alberta
https://doi.org/10.1515/9783110736120-015
It is abundantly clear that even though there are some shared elements in
an L2 accent (otherwise we would not be able to identify a particular accent
when we hear it), there are also far more individual differences than previously
thought (Derwing and Munro 2015). Thus, even though in a given classroom
some students may not require any instruction at all, others may need consider-
able help from their teachers. Such variation among students requires more of
an instructor; not only should all students’ pronunciation be evaluated, but for
those who need assistance, individualized programming is necessary, requiring
significant expertise on the part of the teacher. However, many research studies
in the last ten years have indicated that a majority of language teachers feel ill-
prepared to incorporate pronunciation instruction into their classrooms (e.g.,
Foote, Holtby, and Derwing 2011; Huensch 2019). This is an ongoing challenge
in many countries (although the enthusiasm for pronunciation teaching in Ar-
gentina and Brazil is palpable!). Professional development opportunities for
language teachers in the area of pronunciation are lacking in many locales, al-
though there is no shortage of both quackery and well-intentioned but flawed
advice available on the internet (Derwing et al. 2014a). A novel approach to pro-
viding more teachers with assistance in developing pronunciation instruction
skills is detailed in Kochem et al. (this volume), but overall, it is the need to
match the upsurge in research with an uptick in teacher preparation and in-
struction that is most pressing.
There is no question that many students themselves are eager to gain assis-
tance with pronunciation; early research suggested that the context in which
they live often influences the extent to which they profess to want to sound like
a native speaker (Timmis 2002). As Subtirelu (2013) has pointed out, however,
what learners want is often a much more complex matter than was previously
thought. It is also the case that most people who learned a language after child-
hood are likely to have difficulty ever sounding like a native speaker (Flege,
Munro, and MacKay 1995); thus, setting nativeness as a goal is setting most
learners up for failure. That said, it is evident from many studies that pronunci-
ation instruction can help learners change their productions to become more
intelligible and more comprehensible with a relatively short intervention, even
if they have been immersed in an L2 environment for a very long time (Derwing
et al. 2014b; Thomson and Derwing 2015).
2 Theoretical and methodological concerns

2.1 The role of identity
Many learners of a second language, especially in immigrant contexts, wish
they could sound like a native speaker (Derwing 2003). Some believe that it
would change the way people treat them, and that they would feel less ‘oth-
ered’ and more a part of their local community. Marx (2002: 272) detailed her
efforts to achieve a native-like German accent when she was attending univer-
sity in Germany “to be judged as a competent member of the C2 [second cul-
ture]”. Marx went out of her way to establish a German identity, not only by
changing her accent and adopting dialectal forms from the region in which she
was living, but also by changing her appearance to conform with typical Ger-
man dress codes. When Piller (2002) interviewed 73 English-German couples,
she discovered that at least a third of the respondents indicated that they could
pass as a native speaker of their spouse’s L1 in short service encounters. All of
these individuals had massive exposure to their L2, and the languages involved
are fairly closely related. Piller points out that in each case, “personal motiva-
tion, choice and agency” (Piller 2002: 201) were key factors in being able to
pass as a native speaker, and that these L2 speakers would feel pride if they
were successful in being judged to be locals. However, Piller’s participants also
indicated that passing as a native speaker for more than an initial encounter
was not a goal, partly because eventually, as one of her interviewees stated,
“some reference to something every German knows will come up, and I won’t
understand, and they’ll think I am stupid” (Piller 2002: 195). Piller suggests
that highly proficient L2 speakers have to weigh the pros and cons of interac-
tions to ensure that their core identities are protected.
Other researchers have also confirmed that some L2 speakers can achieve
native-like pronunciation (Bongaerts et al. 1997) even if they learned their L2
after childhood, if they are highly motivated and able to obtain significant expo-
sure in an L2 that is closely related to the L1, but in a related study, Bongaerts,
Mennen, and van der Slik (2000) showed that language typology matters. If,
past childhood, an individual learns an L2 that is typologically distant from
the L1, then regardless of degree of exposure, motivation or other factors, it
will be almost impossible to sound like a native speaker (e.g., a Vietnamese
speaker who acquires a language such as English or Dutch).
Persons who live in contexts where their own L1 is the majority language
and the L2 is an addition to their linguistic repertoires are less likely to want to
sound like a native speaker. They are less likely to be judged in a negative way
for having accented speech, and in many cases, they are proud of their L1 and
want to maintain traces of it in their L2 productions. For instance, Gatbonton,
Trofimovich, and Magid (2005) examined the responses of Quebec learners of
English to their peers’ accents in English; the listeners judged the degree to
which each speaker was affiliated with their ethnic group. The authors
determined that the raters associated degree of accent with ethnic affiliation.
A significant exception to the generally accepted premise that intelligibility
and comprehensibility are sufficient goals is in some Indigenous settings in
which native-like targets are warranted. As Bird (2020) has shown in a language
revitalization project, it can be important to learners to produce speech that ap-
proximates native speaker accents as closely as possible, in part as a rejection
of the colonial language that is their mother tongue. In a fascinating study of a
Coast Salish /t’/ (an ejective /t/), learners produced a stronger ejective than na-
tive-speaking elders, as a way of ensuring a distancing from English /t/. Bird
reports that learners want to honour their elders by producing language as
close to native as possible, and are willing to try techniques that are far re-
moved from practice in most pronunciation classrooms, such as ultrasound, to
perfect their pronunciation. In the case of the /t’/, the fact that learners produce
it more strongly than their ancestors did is offset by the fact that it is less En-
glish-like than a weaker ejective (and, according to Flege’s Speech Learning
Model (1995), it is probably easier to perceive and produce because it is more
different from the L1 /t/ than a weak ejective).
In school settings where children and youth study foreign languages, iden-
tity can be threatened by sounding ‘too good.’ Kissau (2006: 416), in a study of
500 boys learning French in Canada, found that “male students were averse to
studying French for its ‘sissy’ associations”.
For teachers, it is important to understand their students’ own feelings re-
garding pronunciation and what their personal goals are. Identity is a tricky
issue, but if learners have difficulty making themselves understood, then their
core identity is definitely at risk. The ability to express oneself comfortably, with-
out being asked to repeat, is crucial to one’s self-confidence and willingness to
communicate. LeVelle and Levis (2014: 111) recommend that, as a place to begin,
both teachers and learners should realize that altering one’s pronunciation may
involve “interacting outside of the comfort zone”. Further, they suggest having
frank discussions about accent discrimination and what can realistically be ex-
pected from pronunciation instruction.
2.2 Setting targets
What should the focus of pronunciation instruction be? We have already
established that aside from Indigenous contexts where target-like productions
matter, the goal should be intelligibility and/or comprehensibility, rather than
nativelikeness. But what does that mean for a teacher? First, a teacher has to be
able to conduct an initial assessment of each person’s pronunciation to determine where
their needs lie. Such an assessment does not have to be complex; a relatively short
recording (2–3 minutes) of spontaneously produced speech is usually enough to
gain a sense of where learners have difficulties. Asking learners to respond to
questions such as ‘What is your favourite meal?’ or ‘Tell me about your weekend’
will usually give an instructor enough information to analyze their speech. It may
be easier for teachers in the same program to take on the task of listening to the
recordings together. Derwing (in press) advises this cycle: listen “once for overall
comprehensibility, once for word stress, once for sentence stress, once for vowels,
and finally consonants.” Regardless of whether the class members come from
multilingual backgrounds or all share the same L1, there will be differences in
performance. This assessment will allow the teacher to pinpoint areas where each
learner struggles.
Once the initial assessment has been carried out, the teacher should look for
problems that are common to several members of the class. These are aspects
that can be covered in class. It is strongly recommended that teachers try to em-
ploy techniques that have been shown through research to be effective. Both the
instructor and the students can have more confidence in what they are doing if
they know that the approaches they take have been validated using rigorous re-
search methods. Unethical and completely inappropriate techniques are offered
both online and face-to-face by hucksters and well-intentioned but ill-informed
people (Derwing and Munro 2015). Sticking to techniques that have been for-
mally assessed by researchers is a better way to go. All of the suggestions below
have been tested by researchers and have been shown to have significant benefi-
cial effects on learners’ productions.
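As a rough illustration of looking for problems that several class members share, the tally below counts how many learners were noted as having difficulty with each feature. It is only a sketch: the learner labels, the feature names, and the `shared_difficulties` helper are hypothetical, and the notes stand in for whatever a teacher records while listening to the initial recordings.

```python
from collections import Counter


def shared_difficulties(assessments, min_learners=2):
    """Return (feature, learner_count) pairs noted for at least
    `min_learners` learners, most widely shared first."""
    counts = Counter()
    for features in assessments.values():
        # Count each feature once per learner, even if it was noted twice.
        counts.update(set(features))
    return [(feature, n) for feature, n in counts.most_common() if n >= min_learners]


# Hypothetical notes from listening to each learner's short recording.
notes = {
    "Learner A": ["word stress", "vowels"],
    "Learner B": ["word stress", "sentence stress"],
    "Learner C": ["word stress", "vowels", "consonants"],
}

print(shared_difficulties(notes))
# [('word stress', 3), ('vowels', 2)]
```

Features that fall below the threshold are not discarded as unimportant; they are simply candidates for individual rather than whole-class attention.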
2.3 Research supported approaches to pronunciation instruction
Some comprehensibility issues will be related to segmentals (vowels and conso-
nants). When deciding which segmentals to cover in class, functional load (FL)
should be taken into consideration as it has been shown that high FL errors
contribute more to a lack of comprehensibility than low FL errors (Kang and
Moran 2014; Munro and Derwing 2006). Some segmentals do much more ‘work’
in the language because they distinguish many minimal pairs. For instance, /p/
and /b/ in English have an extensive number of minimal pairs in word initial
position (‘pat’ vs. ‘bat’), as well as some in medial position (‘sopping’ vs.
‘sobbing’) and several in final position (‘rip’ vs. ‘rib’). These two consonants are
considered to be high FL, whereas consonants such as the interdentals, /θ/
and /ð/, are low FL even though when speakers substitute /s/ or /t/ for the for-
mer and /z/ or /d/ for the latter, the substitution is noticeable. However, in
most instances, a mispronunciation of either interdental has little or no conse-
quence for either intelligibility or comprehensibility. Once teachers have re-
viewed the high FL segmentals that learners seem to have difficulty with, they
should determine whether the problem is with perception, production, or both.
Administering a simple perception test will show the teacher whether the learn-
ers can discriminate between two segments. Some preliminary explanations
about segmental production can happen in class, but students can also be re-
ferred to technological aids to help them work on their perception at home.
Tools such as englishaccentcoach.com (Thomson 2022) can give learners an op-
portunity to easily focus on sounds that present them with difficulties. Research
has repeatedly shown, in more than 30 studies, that High Variability Phonetic
Training (HVPT), which is essentially what englishaccentcoach.com offers, has
a positive impact on perception, and in some cases, leads to improved produc-
tion as well (Thomson 2018).
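The idea that some contrasts do more ‘work’ than others can be made concrete by counting minimal pairs, which is one rough proxy for functional load. The sketch below does this for a toy lexicon. The word list, the simplified transcriptions, and the `minimal_pairs` helper are assumptions made for illustration; a realistic estimate would need a full pronunciation dictionary and would probably also weigh word frequency.

```python
def minimal_pairs(lexicon, a, b):
    """Find word pairs whose transcriptions are identical except that one
    has phoneme `a` where the other has phoneme `b`."""
    # Map each transcription to a word (homophones collapse to one entry).
    by_form = {tuple(phones): word for word, phones in lexicon.items()}
    pairs = []
    for phones, word in by_form.items():
        for i, phone in enumerate(phones):
            if phone == a:
                swapped = phones[:i] + (b,) + phones[i + 1:]
                if swapped in by_form:
                    pairs.append((word, by_form[swapped]))
    return pairs


# Toy lexicon with simplified, hypothetical transcriptions.
lexicon = {
    "pat": ["p", "æ", "t"],
    "bat": ["b", "æ", "t"],
    "rip": ["r", "ɪ", "p"],
    "rib": ["r", "ɪ", "b"],
    "sopping": ["s", "ɑ", "p", "ɪ", "ŋ"],
    "sobbing": ["s", "ɑ", "b", "ɪ", "ŋ"],
    "think": ["θ", "ɪ", "ŋ", "k"],
    "sink": ["s", "ɪ", "ŋ", "k"],
}

print(minimal_pairs(lexicon, "p", "b"))
# [('pat', 'bat'), ('rip', 'rib'), ('sopping', 'sobbing')]
print(minimal_pairs(lexicon, "θ", "s"))
# [('think', 'sink')]
```

The counts here reflect only which words were put into the toy lexicon, so they say nothing about the actual functional load of these contrasts in English.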
It is often the case that shared problems with pronunciation are supraseg-
mental in nature, in which case the teacher can access resources to support the
learners in class. Numerous techniques can help with overall global improve-
ments, such as shadowing and mirroring. Meyers (nd) advocates inviting stu-
dents to choose a proficient and easy to understand speaker as a model and
have them break down what it is the model does with his/her voice, body move-
ments and gestures. She suggests asking students first to consider what the
speaker’s intended purpose is in a given communication and then further ana-
lyze the speech from there. After a period of 2–3 weeks students can make a
final video in which they mirror their model for their fellow students who can
provide feedback (for more information, see pronunciationforteachers.com
under the ‘Teaching’ tab). Another technique that focuses on suprasegmentals
is shadowing, which involves repeating a sample of speech at a very short
delay. This can also be done in class (often elements of sitcoms can be acted
out this way). Foote and McDonough (2017) have demonstrated, however, that
shadowing also lends itself to homework using recorded dialogues. In their
study, both comprehensibility and fluency were significantly enhanced as a re-
sult of shadowing.
A somewhat surprising technique, having students speak their L1 while imitating
the accent of speakers of the L2 they wish to learn, is effective in enhancing their pronunciation
of that L2 (Rojczyk 2015). Having students speak in their L1 takes away any pres-
sure to find suitable vocabulary or grammar, and allows them to focus on those
aspects of the L2 that they notice when a speaker of the L2 uses the students’ L1.
Rojczyk asked ten Polish students of English to imitate an English accent while
speaking in Polish. The researcher was particularly interested in the voice onset
times of stops, and determined that imitating an English accent resulted in signif-
icantly more English-like voice onset times. A similar study was conducted in
Barcelona by Everitt (2015). She compared three groups of Spanish/Catalan learn-
ers of English. One group received standard pronunciation instruction in English,
a second group spoke in their L1 with an English accent and a third group served
as a control. The researcher hypothesized that the second group would be able to
produce far more output than the other groups because there were no limitations
on their lexical and grammatical knowledge. A post test revealed that both
groups who received an intervention had superior perception and production in
English to the control group, but that the L1 imitation group performed better
than the group who had traditional instruction. There is a caveat with this tech-
nique in that some learners may refuse to imitate another accent. In some cul-
tures, imitation is viewed as disrespectful whereas in others it is interpreted as a
bit of fun.
Many researchers and practitioners consider fluency, or the flow of language
in the absence of disruptive pauses, repetitions and repairs, to be a component
of pronunciation. One global measure of fluency is speech rate (typically sylla-
bles per second). Munro and Derwing (1998: 165) conducted a study in which
they asked Mandarin speakers to read a passage at a normal, comfortable pace,
and then to read the same passage at a rate “half as fast as normal”. In fact, the
speakers actually slowed their speech to a rate that was approximately 75% of
normal, but it was clearly slower than the initial passages. These passages were
then randomized and played to listeners who rated them for comprehensibility.
The speakers were rated as slightly less comprehensible in the slowed condition.
In a second experiment, the authors took the same normally-produced passages
and both slowed them and sped them up by 10 percent using computer software
without interfering with pitch. Slowing the Mandarin speakers’ speech rate had a
negative effect on listeners’ perception of their speech. Many researchers have
investigated fluency since then, with the consensus that fluent speech is easier
for listeners to follow. Several recent studies have involved approaches to en-
hancing fluency that lend themselves well to both general second language and
pronunciation-specific classrooms.
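Speech rate in syllables per second, and the comparison of a slowed reading with a normal one, can be worked through as below. This is a minimal sketch: the syllable count and the durations are invented for illustration and are not data from Munro and Derwing (1998), and the function names are hypothetical.

```python
def speech_rate(syllables, seconds):
    """Speech rate in syllables per second."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return syllables / seconds


def relative_rate(rate, baseline):
    """Express `rate` as a proportion of `baseline`."""
    return rate / baseline


# Hypothetical passage of 240 syllables read twice by the same speaker.
normal = speech_rate(240, 60.0)  # comfortable pace
slowed = speech_rate(240, 80.0)  # deliberately slowed reading

print(normal)                         # 4.0
print(slowed)                         # 3.0
print(relative_rate(slowed, normal))  # 0.75
```

In this invented case the slowed reading runs at 75% of the normal rate, which is well short of the “half as fast” that the instruction would imply.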
Galante and Thomson (2017) introduced drama activities into a language
class, while a comparable control group undertook normal communicative
classroom presentations. Pre and post instruction rating tests determined that
the drama group made significant gains in fluency whereas the control group
did not change. Both groups showed improvement in comprehensibility, but
the drama group appeared to benefit more on this speech dimension as well.
Other researchers have enhanced fluency using games. Grimshaw and Car-
doso (2018) invented a game called Spaceteam ESL that requires language
learners to interact; furthermore, to compete successfully in the game, speakers
must communicate their message clearly. The researchers compared the performance
of players with that of a control group who performed non-gaming
tasks for the same period of time. The experimental group showed significant
gains in fluency ratings, outperforming the control group. In addition, they
were less anxious while communicating in their L2. The authors interviewed
the participants, and as one put it, “the game was more fun than English class,
we can learn when you play, so it’s fun” (Grimshaw and Cardoso 2018: 170).
The authors argue that the motivational aspect of the game is of real benefit, in
addition to the positive changes in the learners’ fluency.
Derwing, Waugh, and Munro (2021) posited that if learners’ productions meet
the expectations of listeners, they will be perceived as easier to understand. They
conducted a study in which an experimental group was taught basic pragmatics
(no overt pronunciation instruction was used in this course); the learners were
taught the typical formats of speech acts such as requests, compliments, polite re-
fusals and apologies, and they also focused on formulaic sequences. The learners
enacted scenarios with the researchers at the outset of the study, and again at the
end. Four of the scenarios (for each of the speakers) were later played to listeners,
who rated them for social appropriateness and comprehensibility. All of the sce-
narios showed significant improvement in social appropriateness, indicating that
the learners had acquired aspects of the pragmatic sequences involved, and three
of the four scenarios showed significant improvement in comprehensibility,
suggesting that meeting listeners’ expectations does indeed facilitate speakers’
comprehensibility. So much of language is predictable that if an L2 learner follows
those predictable patterns, listeners will be better able to understand them. The
one scenario where comprehensibility did not improve was also the most challeng-
ing; the learners were required to remind their employer that they had been prom-
ised a raise after three months and it was now four months – was the raise still
forthcoming?
The pragmatics study cited above points to the intersection of pronunciation
and other aspects of language learning. As early as 1982, Varonis and Gass noted
the connection between pronunciation and grammar, such that the more ungram-
matical utterances were, the more accented listeners perceived them to be. In a
complementary study, Ruivivar and Collins (2019) found that the more heavily
accented the speaker is, the more ungrammatical they are perceived to be. These studies
point to the interplay among different aspects of language learning, but also to
the importance of including pronunciation while focusing on other aspects.
2.4 Feedback
Language students often profess to want more feedback than their teachers give
them, and some teachers are reluctant to provide negative feedback because
they worry that they will hurt their students’ feelings. It is true that if teachers
were to correct every aspect of a learner’s pronunciation that differs from a tar-
get, in some cases the amount of feedback would be overwhelming. This is
where the intelligibility/comprehensibility rubric comes in. Pronunciations that
do not interfere with understanding are not important and feedback is unneces-
sary. But features of speech that cause difficulty for listeners warrant explicit
feedback from the teacher. However, the teacher is not the only person in the
classroom who can provide useful feedback, and indeed, learners should be
helping each other with their productions. Martin and Sippel (2021) conducted
an innovative study in which four groups of first year learners of German partici-
pated (most of the learners were monolingual English speakers). One group was
a control; the other groups all received some pronunciation instruction, follow-
ing which one group experienced feedback from an instructor, another group
provided feedback to their classmates, and the last group received feedback from
their peers. The authors chose both segmental and suprasegmental targets as the
objects of instruction; the German phoneme /ts/ is often problematic because it
is written as <z>, leading many learners to mispronounce it. Word stress in
English-German cognates was the suprasegmental focus of instruction. A pretest
was administered in Week 1 of the study, and the peer feedback givers and re-
ceivers also had some explicit instruction on the nature of corrective feedback. In
Week 2 the learners in the experimental groups focused on the pronunciation of
the two targets, and in Weeks 3 and 4 they made recordings. Weeks 4 and 5 were
used for feedback on the recordings (the receivers of feedback were able to then
re-record) and a post-test was administered in Week 6. The pre and post tests
both consisted of individual words and sentences. Five native German speakers
rated the productions from all four groups for comprehensibility. All three inter-
vention groups improved significantly compared to the control group, but inter-
estingly, the group that outperformed the others was the group who provided
feedback to their peers, followed by the group who received feedback from a
teacher. The students who received feedback from their peers were in third place
but still well ahead of the control group. The authors point out that the provision
of feedback in classrooms does not have to be left solely to the instructor, and
that by being put in the position of having to provide feedback, students’ phono-
logical awareness regarding their own productions is raised.
3 Going forward
Clearly, the study of L2 pronunciation has come a long way in the last twenty
years, most especially in terms of empirical research. In previous eras, many
astute insights were made by expert practitioners, such as David Abercrombie
(1949), who maintained that most learners need only comfortable intelligibility
as their goal. These insights were all but lost, however, when new approaches
to L2 instruction became popular, such as Communicative Language Teaching.
Had there been a solid body of research, rather than personal observations,
perhaps pronunciation would not have sunk into such a state of obscurity for
so long. We can hope that the current revival of interest in pronunciation is
maintained for years to come. There are a lot of questions to be addressed!
The radical expansion in the ownership and use of smartphones and other
technology suggests that far more attention could be paid to digital gaming and
other forms of pronunciation apps. A quick Google search for “apps for learning
English pronunciation” turned up an astonishing eighty million results. But how
many of these apps were developed with pronunciation experts at the helm?
Very few indeed. I have watched one researcher, Ron Thomson, take an idea
from beginning to (well, there is no end). His doctoral dissertation was an early
version of englishaccentcoach.com, but he required considerable funding and
technical assistance to expand that program into a platform that could be used
by learners from all over the world. Fortunately, he was able to secure funding
from two government departments, but the app now needs expensive updating
and the kind of technological expertise that a fulltime professor of applied lin-
guistics does not have. This is a priority for him, so he has obtained more fund-
ing, but what I conclude from seeing how many hours over the years go into a
project like this, is that we need far more collaboration with scholars from other
fields. Researchers in the Netherlands have developed an Automatic Speech Rec-
ognition (ASR) program designed to help L2 learners of Dutch, but unlike most
ASR programs, the developers took into account the most frequent and problem-
atic errors identified by teachers of Dutch (O’Brien et al. 2018). This program is
embedded in electronic language courseware available to Dutch learners. It is in-
credibly advanced compared to programs for English pronunciation but it was
developed by a team of linguists, applied linguists, engineers and computing sci-
entists. To make true progress, we need more collaboration across the board.
Another type of collaboration that currently happens somewhat erratically
is that among teachers, administrators and researchers. When each of these
groups understands what the others do and what constraints they face,
a synergy can emerge which allows for insights all
around, and better cooperation, which in the long run may benefit everyone,
especially language learners. Everyone involved in the provision of language
instruction lives a busy life, as do academics. But it is worth the time to build
trust across these areas. Academics who want to conduct research in a given
program should be asking themselves, what do I have to offer to the adminis-
trators and the teachers (and possibly the students, although often research
findings are more likely to benefit students who follow, but not the ones who
participate in the study itself)? Is there anything I can do to make the whole
research process easier for those whose classes may be disrupted?
We are at the point now where research can offer more sophisticated re-
sults. For instance, we are beginning to see the inclusion of more delayed post
tests to determine whether a given intervention has had a lasting influence.
More delayed post tests would be most welcome. Another consideration, which
might be difficult to implement but which would help us better understand the relative
value of various activities, is to conduct studies with the same students more
than once. Although in nascent stages, we are now seeing replications and ex-
tensions of earlier studies, which are very useful. Furthermore, more work
needs to be conducted with learners of languages other than English. Some re-
searchers have begun to investigate the learning of other languages, but more
studies are definitely needed.
Another area that deserves attention is how we can create more opportuni-
ties for real communication for students who are in the early stages of learning
a second language. Bueno Alastuey (2010) explored the value of what were
essentially conversations with several L1 speakers over the internet as opposed to
practice with a single speaker in the classroom. The students who conversed
with others spent more time on task (the author speculates that this might
partly have been due to a lack of visual cues), enjoyed the tasks more, and
improved their pronunciation to a greater extent (as well as their general speaking
skills). Not only that, but they reported that their anxiety levels caused by
speaking in their L2 were reduced because their interactions were successful
and somewhat anonymous. Many L2 speakers in immigrant-receiving countries
complain that they have no one to interact with (Derwing, Munro, and Thom-
son 2008). For learners in other contexts the access to interlocutors can be even
more difficult. We as a community should be able to identify and implement
better opportunities.
For teachers of adult learners in particular, it is difficult to know what their
students will need in their own environment. We need more information about
language in the workplace and what would help people in their own lived con-
texts. Some preliminary studies exist (e.g., Dahm and Yates 2013, who explored
the needs of L2 doctors; Derwing et al. 2014b, who examined the needs of em-
ployees in a window factory) but we need far more research to inform instruc-
tors – and more instructors in the workplace itself. Yates (2022: 362) makes the
case that globalization and technology have made changes to workplace commu-
nication come at a rapid pace. She argues that researchers need to keep up, and
that “collaborative, integrated research-to-practice initiatives could include ex-
ploration of the content, design, and delivery of programs for both learners and
their co-workers. For researchers, this connection with practitioner-collaborators
can offer good insights into learner challenges, access to new research sites and
a sense of being able to make a real difference”.
Talking to students about their own perceptions is enlightening. We some-
times see them as ‘subjects’ in our experiments, but conversations with them
about their lived experiences can offer a jumping-off point for whole new direc-
tions in L2 pronunciation research (Derwing 2003). They are, after all, the whole
reason we are in our field.
I have no doubt that incremental studies will continue to contribute to our
knowledge, but it is always useful to step back and look at the bigger picture.
We have come a long way, but we still have a long way to go.
References
Abercrombie, David. 1949. Teaching pronunciation. ELT Journal 3(5). 113–122.
Bird, Sonya. 2020. Pronunciation among adult Indigenous language learners: The case of
SENĆOTEN /t’/. Journal of Second Language Pronunciation 6(2). 148–179.
Bongaerts, Theo, Susan Mennen & Frans van der Slik. 2000. Authenticity of pronunciation in
naturalistic second language acquisition: The case of very advanced late learners of
Dutch as a second language. Studia Linguistica 54(2). 298–308.
Bongaerts, Theo, Chantal van Summeren, Brigitte Planken & Erik Schills. 1997. Age and
ultimate attainment in the pronunciation of a foreign language. Studies in Second
Bueno Alastuey, Maria Camino. 2010. Synchronous voice computer-mediated communication:
Effects on pronunciation. Calico Journal 28(1). 1–20.
Dahm, Maria & Lynda Yates. 2013. English for the workplace: Doing patient-centred care in
medical communication. TESL Canada 30 [special issue 7]. 21–33.
De Bot, Kees, Wander Lowie & Marjolijn Verspoor. 2007. A dynamic systems theory approach
to second language acquisition. Bilingualism: Language and Cognition 10(1). 7–21.
Derwing, Tracey M. 2003. What do ESL students say about their accents? Canadian Modern
Derwing, Tracey M. (in press). Lessons learned from teaching teachers to teach pronunciation.
In Veronica Sardegna & Anna Jarosz (eds.), English pronunciation teaching: Theory,
practice and research findings. Bristol: Multilingual Matters.
Derwing, Tracey M., Helen Fraser, Okim Kang & Ronald I. Thomson. 2014a. L2 accent and
ethics: Issues that merit attention. In Ahmar Mahboob & Leslie Barratt (eds.), Englishes
in multilingual contexts, 63–80. Berlin: Springer.
Derwing, Tracey M., Murray J. Munro, Jennifer A. Foote, Erin Waugh & Jason Fleming. 2014b.
training study. Language Learning 64(3). 526–548.
Derwing, Tracey M., Murray J. Munro & Ronald I. Thomson. 2008. A longitudinal study of ESL
learners’ fluency and comprehensibility development. Applied Linguistics 29(3). 359–380.
Derwing, Tracey M., Marian J. Rossiter & Murray J. Munro. 2002. Teaching native speakers to
listen to foreign-accented speech. Journal of Multilingualism and Multicultural
Development 23(4). 245–259.
Derwing, Tracey M., Erin Waugh & Murray J. Munro. 2021. Pragmatically speaking: Preparing
adult ESL students for the workplace. Applied Pragmatics 3(2). 107–135.
Everitt, Charlotte. 2015. Accent imitation on the L1 as a task to improve L2 pronunciation.
Barcelona: Universitat de Barcelona thesis.
Flege, James E. 1995. Second language speech learning: Theory, findings and problems. In
language Research, 233–277. Timonium (Maryland): York Press.
Flege, James E. & Ocke-Schwen Bohn. 2021. The Revised Speech Learning Model (SLM-r). In
Flege, James E., Murray J. Munro & Ian R. A. MacKay. 1995. Factors affecting strength of
perceived foreign accent in a second language. Journal of the Acoustical Society of
America 97(5). 3125–3134.
Foote, Jennifer A., Amy Holtby & Tracey M. Derwing. 2011. Survey of pronunciation teaching in
adult ESL programs in Canada, 2010. TESL Canada Journal 29(1). 1–22.
Foote, Jennifer A. & Kim McDonough. 2017. Using shadowing with mobile technology to
improve L2 pronunciation. Journal of Second Language Pronunciation 3(1). 34–56.
Galante, Angelica & Ron I. Thomson. 2017. The effectiveness of drama as an instructional
approach for the development of second language oral fluency, comprehensibility, and
accentedness. TESOL Quarterly 51(1). 115–142.
Gatbonton, Elizabeth, Pavel Trofimovich & Michael Magid. 2005. Learners’ ethnic group
affiliation and L2 pronunciation accuracy: A sociolinguistic investigation. TESOL Quarterly
39(3). 489–511.
Grimshaw, Jennica & Walcir Cardoso. 2018. Activate space rats! Fluency development in a
mobile game-assisted environment. Language Learning & Technology 22(3). 159–175.
Huensch, Amanda. 2019. Pronunciation in foreign language classrooms: Instructors’ training,
classroom practices, and beliefs. Language Teaching Research 23(6). 745–764.
Kang, Okim & Meghan Moran. 2014. Functional loads of pronunciation features in nonnative
speakers’ oral assessment. TESOL Quarterly 48(1). 176–187.
Kissau, Scott. 2006. Gender differences in motivation to learn French. The Canadian Modern
LeVelle, Kimberly & John Levis. 2014. Understanding the impact of social factors on L2
pronunciation: Insights from learners. In John M. Levis & Alene Moyer (eds.), Social
Dynamics in Second Language Accent, 97–118. Berlin: de Gruyter.
Martin, Ines A. & Lieselotte Sippel. 2021. Is giving better than receiving? The effects of peer
and teacher feedback on L2 pronunciation skills. Journal of Second Language
Pronunciation 7(1). 62–88.
Marx, Nicole. 2002. Never quite a ‘native speaker’: accent and identity in the L2 – and the L1.
Canadian Modern Language Review 59(2). 264–281.
Meyers, Colleen. nd. Mirroring. https://Pronunciationforteachers.com (accessed 12 June 2022).
Munro, Murray J. & Tracey M. Derwing. 1998. The effects of speech rate on the comprehensibility
of native and foreign accented speech. Language Learning 48(2). 159–182.
Munro, Murray J. & Tracey M. Derwing. 2006. The functional load principle in ESL
pronunciation instruction: An exploratory study. System 34(4). 520–531.
O’Brien, Mary G., Tracey M. Derwing, Catia Cucchiarini, Deborah M. Hardison, Hans Mixdorff,
Ronald I. Thomson, Helmut Strik, John M. Levis, Murray J. Munro, Jennifer A. Foote &
Greta M. Levis. 2018. Directions for the future of technology in pronunciation research
and teaching. Journal of Second Language Pronunciation 4(2). 182–206.
Piller, Ingrid. 2002. Passing for a native speaker: Identity and success in second language
learning. Journal of Sociolinguistics 6(2). 179–206.
Rojczyk, Arkadiusz. 2015. Using FL accent imitation in L1 in foreign-language speech research.
In Ewa Waniek-Klimczak & Miroslaw Pawlak (eds.), Teaching and Researching the
Pronunciation of English, 223–233. Cham, Switzerland: Springer.
Ruivivar, June & Laura Collins. 2019. Nonnative accent and the perceived grammaticality of spoken
grammar forms. Journal of Second Language Pronunciation 5(2). 269–293.
Subtirelu, Nicholas. 2013. What (do) learners want (?): A re-examination of the issue of learner
preferences regarding the use of ‘native’ speaker norms in English language teaching.
Language Awareness 22(3). 270–291.
Thomson, Ronald I. 2018. High Variability [Pronunciation] Training (HVPT): A proven technique
about which every language teacher and learner ought to know. Journal of Second
Thomson, Ronald I. 2022. English accent coach [online game]. Retrieved from
www.englishaccentcoach.com (accessed 5 December 2021).
Thomson, Ronald I. & Tracey M. Derwing. 2015. The effectiveness of L2 pronunciation
instruction: A narrative review. Applied Linguistics 36(3). 326–344.
Timmis, Ivor. 2002. Native speaker norms and International English: A classroom view. ELT
Journal 56(3). 240–249.
Varonis, Evangeline & Susan Gass. 1982. The comprehensibility of nonnative speech. Studies
in Second Language Acquisition 4(2). 114–146.
Werker, Janet F. & Richard C. Tees. 2002. Cross-language speech perception: Evidence for
perceptual reorganization during the first year of life. Infant Behavior and Development
25(1). 121–133.
Yates, Lynda. 2022. Workplace communication. In Tracey M. Derwing, Murray J. Munro &
Ronald I. Thomson (eds.), The Routledge Handbook of Second Language Acquisition and
Speaking, 359–371. Abingdon, UK: Routledge.
Index
Accented speech 85, 87–103, 141, 209, 256, 401
Accentedness 2, 4, 85–95, 99–103, 108, 369
Acoustic-orthography interface 41
Argentinian speaker 4, 85–86, 89, 91, 93–95, 97, 99
Aspiration 19, 21, 265, 269, 276
Assessment 4, 85, 90, 92, 95, 96, 99, 102, 103, 108, 204, 205, 211–214, 217–220, 251, 279, 288, 294, 297, 300, 302, 305, 353, 354, 369, 371, 373, 374, 378, 382, 384, 385, 389, 390, 392
Automatic speech recognition (ASR) 6, 287, 290, 294, 408

Belgian speaker 4, 85–86, 89, 91, 93–95, 97, 315, 321–322, 324, 338, 341
Body functions 200–203, 204–209, 212
Brazilian learner 3–4, 7, 148, 150–151, 162, 345–347
Brazilian listener 114–116, 137
Brazilian Portuguese 13–14, 24–25, 36, 107, 109–110, 120, 141, 349–350
Brazilian speaker 320
Brazilian teacher 5

Carryover effect 374, 385, 389, 391
Case study 6, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35
Chinese speaker 4, 85–86, 89, 91, 93–95, 97, 99
Classroom-based study 88
Collaboration 211–212, 221, 341, 408–409
Common European Framework of Reference (CEFR) 89, 261, 324
Complex System 210, 345–348, 352, 367
Comprehensibility. See also Perceived L2 comprehensibility 2–5, 34–35, 85–103, 107–119, 128–141, 170–172, 205–206, 209, 218, 220, 288, 291–292, 399–407
Computer-assisted pronunciation training (CAPT) 290
Consensus building 174
Consonants 4–6, 13, 15–17, 28, 30–31, 35, 41, 44–46, 48–59, 64–78, 90, 95, 140, 171, 179–180, 209, 214, 218, 235, 249–256, 259, 267–270, 276–278, 316, 319, 326, 328–329, 390, 403–404
Contextual factors 180, 200–203, 206, 208, 213
Corrective feedback 6, 218, 287–288, 290, 293, 295, 297–301, 303–304, 306–307, 407

Discrimination task 41, 50, 52–55, 58–59, 62, 64–69, 72
Dutch. See also Protocol of Dutch as L2 7, 212, 214, 216–219, 220, 233, 293, 315, 320–341, 401, 408
Dutch Association 202
Dynamic system 7–8, 14, 110, 147–150, 161, 164
Dynamic System Theory 108–109, 210, 220, 399

Effect of task 41–79
English /h/ 229, 231, 233–235, 237, 239, 241, 243–245, 247
English consonant 45, 65, 180, 254, 259, 269, 319
English learners 13, 113, 230, 233
English vowel 4–5, 49, 65, 147–150, 153, 162, 164, 253, 317–318, 383
English-speaking consultation (ESC) 5, 168
Exemplar Model 4, 14, 17–18, 20–21, 31–32
Explicit instruction 72, 147, 149, 162, 163, 167–170, 173, 183, 191, 233, 257, 259, 289, 292, 294, 305–306, 407

Fluency 35, 85–90, 95–103, 139, 175–176, 179, 189, 404–406
Fossilization 8, 249, 255
French learner 234, 315, 326–327, 341
French pronunciation 287, 298, 302–303, 307
https://doi.org/10.1515/9783110736120-016

Generalizability 7, 242, 315, 317, 321–323, 338–339
Generalization test 317–318, 345, 349, 350–352, 354, 357–359, 362–363, 371–373, 376
Graduate students 50, 147, 167–168, 170–173, 177, 189, 390
Grapheme-to-phoneme correspondence (GPC) 229, 231
Greek 4, 41, 44–45, 50, 71, 319

Haitian speaker 4, 107, 109, 113, 115, 117–119
Heterotonic words 7, 345–367
High variability 3, 7, 315–341, 345, 349, 360, 371, 383, 388, 404

ICF 5, 8, 198–221
ICF model 5, 8, 197–221
Identification task. See also Phoneme identification task 41, 50, 52–55, 58–59, 62, 64, 66–70, 73, 75, 315, 317, 323, 329–330, 332–333
Immediate feedback 289, 294–295, 298–301, 352
Improvement rate 265–271, 277
Information and Communication Technology (ICT) 6, 249–279
Information communication technology (ICT) training 250–252, 258, 262–268, 272, 275–277
Intelligibility 2–5, 7, 34–36, 72, 85–87, 107–141, 167, 171, 179, 197, 199, 203–220, 250, 254, 276, 288, 289, 292, 369, 399–408
Intelligible. See also Intelligibility, Comprehensibility, Accentedness 2, 5, 86–87, 110–111, 119, 140, 169, 170–172, 179, 181, 182, 212–213, 217–218, 287, 291, 399–400
International graduate students 167, 170–173

Japanese language 250, 251–253, 254, 267–274, 275, 278
Japanese learner 249, 253, 275
Japanese phonemes 250, 254, 276
Japanese speaker 6, 85, 92, 95, 96, 98–100, 103, 172, 251, 254–260, 263

L2 acquisition. See also Second language acquisition 1, 3, 5, 7–8, 44, 51, 71, 211, 231, 287
L2 development. See also Second language development 3, 141, 149, 161, 164, 210
L2 learning 1–2, 4, 18, 22, 43, 71, 150, 169, 197–199, 202–203, 207, 211, 255
L2 phonology 17–18, 20, 31, 41, 68, 231, 320
L2 pronunciation instruction 1, 2, 181, 211, 371, 399
L2 pronunciation teaching 2–3, 5, 8, 35, 41, 49, 85, 87, 108, 109–110, 141, 290
L2 speech perception 2, 4, 41, 43, 48–49, 51, 101, 103
L2 speech perception and production 43
L2 teaching 3–7, 141
Language teacher training 167
Learner autonomy 287, 290
Learner profile 321, 322, 330, 333, 335, 339, 341
Lexicogrammar 85–86, 95–96, 98, 101, 103, 175, 177, 179
Linguistic factor 4, 44–45, 85–86, 88–90, 95, 100–103, 204
Longitudinal study 4–5, 107–109, 112, 120, 123, 128, 132, 137, 140–141, 147, 149, 150–151, 164, 345–346, 352, 388
Long-term effect 7, 317, 321, 323, 339, 372, 374, 384, 390–392

Meta-analysis 2, 370, 372, 390, 392

Native Language Magnet Model 1, 4, 41, 68
Needs analysis 174–175, 177, 189, 191
New sound 1, 43, 69, 255
Non-native contrasts 41, 315–316

Oral communication 140, 167–175, 177, 179, 181, 189–191, 295
Orthography 3, 13–18, 24, 28–30, 35–36, 41, 229, 231–235, 241–245

Participation 50, 172–173, 199–209, 211–215, 325, 331, 388, 391
Pedagogical implications 4–5, 71, 85, 101, 243, 277, 303
Perceived L2 comprehensibility. See also Comprehensibility 88
Perceptual Assimilation Model-L2 (PAM-L2) 1, 370
Perceptual learning 370–371, 385–387
Perceptual plasticity 370, 386, 390
Perceptual training 7, 315–317, 321–339, 345–367, 369–392
Phoneme identification task. See also Identification task 41, 50, 52–55, 58–59, 62, 64–70
Phonetic training 1, 3, 7, 315–341, 370–375, 383, 388, 391, 404
Phonetic variability 13
Phonological encoding 234
Plural formation 3, 13–14, 17, 19–36
Poland speaker 4, 85–86, 89, 91, 93–95, 97, 99
Post-test 265, 291–292, 317–319, 327, 330–338, 345, 352, 357, 359, 363, 371–374, 380, 386–388, 390–391
Pre-test 269, 292, 317, 324, 327, 330, 331–339, 356, 360–361, 371–373, 380, 386, 407
Production training 1, 370
Pronunciation instruction 1–3, 5, 71, 150, 161, 163–164, 167, 169–170, 178–181, 183, 191, 199, 209–211, 257, 290, 292, 306, 340, 371, 399, 400, 402–403, 405–407
Pronunciation research 3, 102, 209, 229, 399, 410
Pronunciation training 3, 6, 71, 169, 177, 180, 183, 187, 197, 219, 255, 257, 258, 280, 287–307
Protocol of Dutch as L2. See also Dutch and Dutch Association 212

Retention 7, 315, 317, 320–323, 365, 369, 371–380, 384–387, 390–392
Revised Speech Learning Model 1, 4, 14, 20, 22, 43
Robustness 7, 17, 31, 315, 317, 319, 321–341

Second language acquisition. See also L2 acquisition 250, 287, 307, 399
Second language development. See also L2 development 1, 147
Segment 19, 21, 71, 100, 152, 186, 229, 231, 243
Self-video 249, 251, 262, 264–265, 268–269, 271–278
Sound system 1, 43, 48, 69, 70, 257, 277
Spanish language 109, 232–233, 292, 321, 349–350, 355, 357, 360, 364, 366
Spanish learners 7, 258, 345–366, 405
Spanish speakers 4, 85–86, 88–89, 91, 378
Speech and Language Therapists 202
Speech rate 85–86, 90, 95–101, 180, 189, 206, 217, 405
Speech recognition 6, 287, 290–291, 294, 297, 299, 318, 408

Task effect. See also Effect of task 41, 49, 64, 68–69
Teacher-training 167, 216, 250, 252
Technological Pedagogical Content Knowledge (TPACK) 5, 8, 167, 169, 183–184, 186–187, 191
Text-to-speech (TTS) 290, 297
Transfer 15–17, 28, 31, 68, 85, 176, 204, 207, 220, 229–230, 242, 295, 370–373, 375, 385, 389–391

Word dictation task 41, 51–52
Word frequency 41, 44–45, 49, 67–68
Word learning 229, 231–234, 237–244
Word length 4, 41, 43–79
Word-picture matching task 6, 229, 233