
Applied Linguistics 2018: 1–29

doi:10.1093/applin/amy053

Downloaded from https://academic.oup.com/applij/advance-article-abstract/doi/10.1093/applin/amy053/5259329 by Iowa State University user on 02 January 2019


Which Features of Accent affect
Understanding? Exploring the Intelligibility
Threshold of Diverse Accent Varieties
OKIM KANG,¹ RON I. THOMSON,² and MEGHAN MORAN¹
¹Department of English, Northern Arizona University, and ²Applied Linguistics, Brock University

E-mail: okim.kang@nau.edu

With the ascendency of English as a global lingua franca, a clearer understanding of what constitutes intelligible speech is needed. However, research systematically investigating the threshold of intelligibility has been very limited. In this
article, we provide a brief summary of the literature as it pertains to intelligible
and comprehensible speech, and then report on an exploratory study seeking to
determine what specific features of accented speech make it difficult for global
listeners to process. Eighteen speakers representing six English varieties were
recruited to provide speech stimuli for two English listening tests. Sixty listeners
from the same six English varieties took part in the listening tests, and their
scores were then assessed against measurable segmental, prosodic, and fluency
features found in the speech samples. Results indicate that it is possible to iden-
tify particular features of English speech varieties that are most likely to lead to a
breakdown in communication, and that the number of such features present in
a particular speaker’s speech can predict intelligibility.

As English has become increasingly dominant as the primary language of global communication, users who speak those varieties historically considered to be standard (British, American, etc.) are now in the minority
(Li 2009). In fact, English is being appropriated in various ways by the com-
munities and nations who are increasingly using it. The World Englishes para-
digm,1 as discussed by Kachru (1992) and Nelson (2011), has been influential
in establishing that many widely established English varieties spoken around
the globe are legitimate means of communication. Furthermore, in some con-
texts, so-called non-standard varieties may actually be preferable to standard
varieties. Examples of this would include English varieties spoken in India and
Nigeria as official languages (or even first languages) of a subset of the popu-
lation, or English varieties used as a lingua franca across state boundaries, such
as when Koreans and Mexicans are engaged in commercial trade with each
other. Given the wide range of contexts in which English is spoken, one can
argue that imposing traditionally standard varieties (e.g. British and American

© The Author(s) (2018). Published by Oxford University Press. All rights reserved.
For permissions, please email: journals.permissions@oup.com
English) in English as an International Language (EIL) contexts is inappropri-
ate, since other varieties are already established as effective and more appro-
priate modes of communication (Hamp-Lyons and Davies 2008).
Given this reality, the notion that one particular variety should be the stand-
ard-bearer for English is increasingly unjustified. In response, some applied
linguists have begun promoting EIL as the goal to which most speakers
ought to aspire (Jenkins 2006). EIL is not so much a standard, as it is a nego-
tiation of form and meaning in interaction. An EIL approach recognizes that
the English language belongs to all who use it, rather than to speakers of
traditionally privileged varieties alone. In the realm of English pronunciation,
advocates of EIL emphasize mutual intelligibility over mastery of a particular
native accent (Yano 2001; Jenkins 2006). However, what constitutes mutual
intelligibility varies from context to context. In addition, even the features that
comprise the intelligibility of specific accented varieties (on the part of the
speaker alone) have not yet been fully established.
Despite a growing recognition that the pronunciation of many international
English varieties is intelligible to a wide range of listeners, many EIL speakers
continue to believe that they should aim for pronunciation associated with
historically prestigious varieties (e.g. American and British). In part, this belief
may be reinforced by the absence or near absence of so-called non-standard
varieties of English in teaching materials (e.g. coursebooks) as well as on tests
of English proficiency, which privilege British or American English varieties.
While an emphasis on these varieties may have had merit several generations
ago, the current reality of EIL, even within traditionally majority-English-
speaking countries, compromises the ecological validity of using only these
varieties in both pedagogy and assessment. Therefore, many scholars in
recent years have suggested the inclusion of accented varieties of English in
high-stakes English assessments (Taylor 2006; Abeywickrama 2013; Ockey and
French 2016; Ockey et al. 2016).
However, there are legitimate concerns with incorporating a variety of
English accents into the assessment of L2 listening. First, including accents
with which listeners are unfamiliar may actually be seen as lacking in ecolo-
gical validity, unless the test-taker aspires to understand such accents because
of some utilitarian purpose. Nelson (2011) argues that intelligibility of lan-
guage is contextually determined, and that there must be some communica-
tive usefulness associated with understanding a particular variety. Otherwise,
including a range of English varieties on a test of English proficiency may be
nothing more than an ideologically motivated enterprise. For example, a
Brazilian learner of English who wishes to remain in Brazil and is taking an
English exam only for employment purposes may have no need to understand
Japanese-accented English. However, tests such as the Test of English as a
Foreign Language (TOEFL) and International English Language Testing
System, which are marketed as international rather than local, could
strengthen their validity by incorporating different varieties of English into
their listening sections.
Assuming that the use of English varieties does, in fact, fit with the test pur-
pose, a second concern is encountered: incorporating a variety of English accents
may disadvantage certain test-takers who have not had substantial experience
with those accents (Taylor and Geranpayeh 2011). Ockey and French (2016: 19)
go so far as to say that unless carefully validated, the use of a range of unfamiliar
accents for listening comprehension assessment is ‘professionally irresponsible’.
Even if items produced by particular accents are validated (i.e. found to be free
of bias), it is a challenge to determine which accents or English varieties should
be represented; incorporating every type of accent available in the target lan-
guage use domain is impossible from a practical standpoint (Elder and Harding
2008). Rather, the types of accents used should reflect what real-world listeners
might encounter in the context of EIL.
Further, even within each variety of English, individual speakers may be
more or less intelligible to EIL listeners. That is, we cannot assume that all
speakers of a particular English variety speak in a way that is sufficiently in-
telligible to all proficient listeners. Some listeners may process speech produced
by less intelligible speakers with less effort due to previous exposure to that
particular variety and its range of speakers (Gass and Varonis 1984). What is
needed are input texts that are equally intelligible to all listeners, regardless of
their previous exposure to particular varieties. Using mock TOEFL listening
comprehension tests as a test case, Ockey and French (2016) demonstrated
that when listening passages were spoken by some speakers with varied but
highly intelligible English accents (with a strength of accent below 2.0 on their
scale), their inclusion did not significantly affect comprehension scores.
However, for speakers with mild accents (i.e. 2.0–2.6 on the strength of
accent scale), listening comprehension scores were negatively affected.
It is important to keep in mind that Ockey and French’s (2016) study
focused on English varieties from English-majority countries, which can be
assumed to be more familiar to a greater number of listeners, and therefore
easier to understand. The process by which stimulus talkers from a wider range
of English varieties are selected for inclusion is far more complex. It requires an
empirically based approach to determining an intelligibility threshold for any
English variety used in a test of listening comprehension, whether those vari-
eties are familiar to listeners or not.

1. INTELLIGIBILITY, COMPREHENSIBILITY, AND ACCENTEDNESS
Following Munro and Derwing (1995) and Derwing and Munro (1997), three
important constructs have come to dominate the pronunciation literature: in-
telligibility, comprehensibility, and accentedness. According to Derwing and
Munro, intelligibility refers to the extent to which the speaker’s intended ut-
terance is actually understood by a listener (often measured via transcription),
whereas comprehensibility pertains to the degree of difficulty the listener
experiences in attempting to understand an utterance. While Munro and
Derwing (1995) found the first two constructs to be quite highly intercorre-
lated, accentedness (meaning the extent to which an L2 learner’s speech is
perceived to differ from a particular standard) was found to be only moderately
correlated with comprehensibility and weakly correlated with intelligibility.
That is, a given speaker could have a reportedly strong foreign accent, yet
still be highly intelligible. Thus, the presence of a foreign accent does not ne-
cessarily imply reduced intelligibility (Harding 2011). A consequence of these
facts is that establishing what constitutes accented but highly intelligible
English speech can provide a means for incorporating a variety of English
accents into high-stakes tests of English listening proficiency.
Establishing intelligibility is a complex endeavor, since it is partially influ-
enced by factors beyond a given speaker’s control. Specifically, intelligibility
(and by extension comprehensibility) has an interactional dimension (Smith
and Nelson 1985), with listeners playing a role (Fayer and Krasinski 1987). For
example, intelligibility and comprehensibility can be affected by listeners’ at-
titudes, expectations, and stereotypes that they may associate with particular
accents (Rubin 1992; Kang and Rubin 2009; Lippi-Green 2012).

2. ESTABLISHING AN EMPIRICALLY MOTIVATED THRESHOLD OF INTELLIGIBILITY
Previously, the threshold of intelligibility has been defined as the lowest re-
quirement for efficiently conveying a message from a native listener’s stand-
point (Gimson 1980). Extending this to the EIL context, the threshold of
intelligibility can be viewed as the point at which speech is considered just
good enough for successful communication between particular speakers and
their interlocutors. Unlike comprehensibility and accentedness, which are im-
pressionistic judgments on the part of listeners, intelligibility is the extent to
which a listener can correctly transcribe words that they hear. In this study, in
particular, intelligibility is operationalized and defined even more narrowly in
phonological and perceptual terms.
To date, definitions of what constitutes intelligibility are rather opaque.
What counts as successful communication depends on many factors, including
context, the interlocutors involved in the communication, prior experience of
listeners with particular accents, etc. Thus, what comprises a threshold of in-
telligibility must be operationalized in context specific ways. For example, a
high-stakes English test such as the TOEFL requires speakers who deliver the
listening section of the test to hold a degree of general intelligibility which is
sufficiently high to reduce any bias effects. The current study seeks to establish
a threshold of intelligibility across a variety of English accents in the context of
a TOEFL-type monologic listening comprehension test. Given the nascent
nature of this field of inquiry, our study is quite exploratory in nature. In it,
we specifically aim to identify which phonological features of six English
varieties collectively contribute to establishing a threshold of acceptable intel-
ligibility for high-stakes test-takers, regardless of what accent is used in such
tests.

3. FEATURES THAT AFFECT INTELLIGIBILITY AND LISTENING COMPREHENSION
While what constitutes intelligibility remains under-defined, existing literature
provides a useful starting point for considering which features contribute to
speech that is just good enough. In the realm of segmentals (i.e. vowels and
consonants), Catford (1987) argues that all ‘phonemic contrasts carry a certain
functional load, based upon the number of pairs of words in the lexicon that
serves to keep [the contrast] distinct’ (p. 88). This research led to a list of all
English phonemes organized to reflect the frequency with which each phoneme
contrasts with other phonemes. This, in turn, allowed for each phoneme to be
ranked from high to low on a functional load continuum. Brown (1991) proposes
a more fine-tuned functional load hierarchy, which takes into account other
factors, such as whether two words that contrast by a single phoneme are from
the same part of speech and whether members of a contrast are similar sounding.
Munro and Derwing (2006) sought to empirically validate the importance of
functional load by examining the relative contributions of high versus low func-
tional load phonemes to the comprehensibility of foreign accented speech. They
found that speakers’ substitution of certain high functional load consonants with contrasting phonemes (e.g. /ʃ/→/s/, /n/→/l/, /s/→/ʃ/, /d/→/z/) caused listeners to rate speakers’ comprehensibility much lower than in cases where substitutions comprised low functional load consonants (e.g. the /ð/–/d/ opposition; /θ/→/f/).
Likewise, Jenkins (2003) based her description of what constitutes a Lingua
Franca Core for EIL on empirical research, also having found that some phon-
emes are relatively less important to successful communication than are
others. For example, despite their relatively high frequency of occurrence,
when substituted with other contrasting phonemes, voiced and voiceless inter-
dental fricatives rarely cause confusion on the part of the listener. This is
consistent with Catford’s (1987) position that functional load is not defined
simply by frequency of occurrence for particular phonemes, but rather, the
frequency with which they contrast with other phonemes to form distinct
words (i.e. minimal pairs).
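The notion that functional load depends on how many minimal pairs a contrast keeps distinct can be made concrete with a short sketch. The lexicon and transcriptions below are toy, hypothetical data; a real ranking would require a full phonemic dictionary.

```python
from collections import Counter
from itertools import combinations

def minimal_pair_load(lexicon):
    """Count, for each phoneme contrast, the number of minimal pairs it
    distinguishes in `lexicon` (word -> tuple of phonemes). More minimal
    pairs suggests a higher functional load for that contrast."""
    load = Counter()
    for (w1, p1), (w2, p2) in combinations(lexicon.items(), 2):
        if len(p1) != len(p2):
            continue
        diffs = [(a, b) for a, b in zip(p1, p2) if a != b]
        if len(diffs) == 1:  # exactly one segment differs: a minimal pair
            load[tuple(sorted(diffs[0]))] += 1
    return load

# Toy lexicon with hypothetical phonemic transcriptions
lexicon = {
    "sip":  ("s", "ɪ", "p"),
    "ship": ("ʃ", "ɪ", "p"),
    "sin":  ("s", "ɪ", "n"),
    "shin": ("ʃ", "ɪ", "n"),
    "thin": ("θ", "ɪ", "n"),
    "fin":  ("f", "ɪ", "n"),
}
load = minimal_pair_load(lexicon)
# /s/–/ʃ/ is kept distinct by two pairs here; /θ/–/f/ by only one
```

Sorting each contrast makes /s/→/ʃ/ and /ʃ/→/s/ count toward the same entry, mirroring Catford’s (1987) treatment of a contrast as undirected.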
In addition to literature on the contribution of segmentals to listeners’ ability to
understand speech produced in accents other than their own, an increasing
number of studies have addressed the importance of non-native English speakers’
(NNES) prosody (suprasegmentals) in listeners’ judgments of comprehensibility
and oral proficiency (Kang et al. 2010). For example, at the level of phrases and
sentences, incorrect nuclear stress (i.e. emphasizing the ‘wrong’ words) can affect
listeners’ comprehension of content (Field 2005). Similarly, the use of stress to
emphasize every word, regardless of its function or semantic importance, causes
difficulty for listeners (Wennerstrom 2000; Kang 2010). Poor intonational structure (e.g. narrow pitch range in Kang 2010) and disturbed prosodic composition can also considerably affect native-speaker listeners’ perceptions
(Pickering 2001). For example, the intonational characteristics of English pro-
duced by many East Asian speakers can cause US listeners to lose concentration
or to misunderstand the speaker’s intent (Kang 2012). In particular, how a
speaker applies rising, falling, or level pitch on the focused word of a tone unit
can affect both perceived information structure and social cues in L2 discourse.
Finally, numerous studies have investigated the relationship between fluency
and the comprehensibility of speech. Tavakoli and Skehan (2005) suggest that fluency can be characterized along three separate dimensions: (i) speed and density per time unit (speaking rate), (ii) breakdown fluency (number and length of
pauses), and (iii) repair fluency. Furthermore, research indicates that speech rate
(Kormos and Denes 2004), breakdown fluency (Derwing et al. 2004; Kang et al.
2010), and repair fluency (Iwashita et al. 2008) are all associated with listeners’
comprehension of speech as well as with assessments of speakers’ oral profi-
ciency. Thomson (2015) reports a strong correlation between oral fluency ratings
and comprehensibility ratings, although he highlights the fact that in rare cases, a
speaker can be fluent, yet not very easy to understand.
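The three fluency dimensions can be operationalized from a time-aligned transcript. The sketch below is a minimal illustration under assumed inputs: each word is a hypothetical (word, start, end, is_repair) tuple, and the 0.25-second pause threshold is an assumption, not a value from the studies cited.

```python
def fluency_profile(words, total_secs, pause_threshold=0.25):
    """Sketch of the three fluency dimensions above, computed from a
    hypothetical time-aligned transcript: a list of
    (word, start_sec, end_sec, is_repair) tuples. A pause is any gap
    between consecutive words longer than `pause_threshold` seconds."""
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(words, words[1:])]
    pauses = [g for g in gaps if g > pause_threshold]
    return {
        "speed": len(words) / total_secs,          # speed fluency (words/sec)
        "n_pauses": len(pauses),                   # breakdown fluency: count
        "mean_pause": sum(pauses) / len(pauses) if pauses else 0.0,
        "n_repairs": sum(w[3] for w in words),     # repair fluency (restarts)
    }

# Hypothetical alignment: one restart ("the the") and a 0.8 s silent pause
sample = [("the", 0.0, 0.2, 0), ("the", 0.5, 0.7, 1),
          ("cat", 0.8, 1.1, 0), ("sat", 1.9, 2.2, 0)]
profile = fluency_profile(sample, total_secs=2.2)
```

A profile like this keeps speed, breakdown, and repair measures separate, which matters because, as Thomson (2015) notes, a speaker can score well on one dimension while remaining hard to understand.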
While the literature described above reveals important variables that con-
tribute to the intelligibility of speech, we know little about how these features
interact with specific English accents and specific features of those accents.

4. RESEARCH QUESTIONS
The current study is guided by the following research questions:
1 To what extent do unique phonological features affect listeners’ compre-
hension scores and intelligibility scores?
2 How might these phonological features be used to establish a threshold of
intelligibility for a variety of English accents?
Phonological features were operationalized using both segmental and supra-
segmental measures. We aimed to identify a threshold of intelligibility for
TOEFL-type listening comprehension passages produced with varied accents
by examining which features of speech typically associated with intelligibility
contribute to the comprehension of test passages by typical TOEFL test-takers.

5. METHOD
5.1 Creation of test materials
5.1.1 Speakers
Eighteen speakers, three from each of six distinct groups, were recruited. Two
groups represent traditionally standard English accents, General American
(GA) and British Received Pronunciation (RP); two groups represent
traditionally non-standard English accents spoken in contexts where English is
an official language, but not the first language of the speakers, South African
Afrikaans (SA) and Indian Hindi (IN); and two groups represent English ac-
cents spoken by EIL speakers where English is not an official language, but
used for international communication, Mexican Spanish (SP) and Chinese
Mandarin (CH). One female and two male speakers per country were included.
The speakers’ L1s and geographical origins were controlled to ensure a meas-
ure of homogeneity of dialect.
An iterative pre-screening process was followed to arrive at our final set of test
speakers. First, we recorded potential speakers reading TOEFL listening compre-
hension test passages (but not the task to be used in the study). Then, following
Major et al.’s (2002) recommendations, potential speakers were informally as-
sessed by the research team to ensure that they (i) sounded conversational as a
lecturer, (ii) exhibited characteristics of pronunciation typical of that L1 speech
community, (iii) handled the terminology of TOEFL lecture scripts fluently so as
to appear to be experts in the field represented by TOEFL listening passages, and
(iv) had the mature voice quality and pitch of a professional or academic
speaker. Speakers from traditionally non-standard English varieties, who were
highly proficient in English, but who still retained phonological features that
were distinct from those found in historically ‘standard’ varieties (e.g. GA and
RP), were identified. The speakers who were ultimately selected included those
who were highly intelligible/comprehensible, yet still accented (Harding 2011),
as well as those with varying degrees of perceived comprehensibility. These determinations were made by eight raters with graduate degrees in Applied Linguistics and
backgrounds in phonology and pronunciation. These raters provided a scalar
rating of the speakers’ sample lecture recordings, using a five-point scale with
‘easy to understand’ and ‘difficult to understand’ as end points. Using mean
rating scores, we selected three speakers from each traditionally non-standard
variety to represent low (4–5 out of 5), mid (3 out of 5), and high (1–2 out of 5)
degrees of comprehensibility, which, as noted earlier, is highly correlated with
measures of intelligibility.
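The band assignment from the expert ratings can be sketched as a small helper. The paper reports integer bands (high = 1–2, mid = 3, low = 4–5); the cut-offs used here for non-integer means are an assumption.

```python
def comprehensibility_band(ratings):
    """Map one speaker's ratings from the eight expert raters (5-point
    scale: 1 = easy to understand, 5 = difficult) to the low/mid/high
    comprehensibility bands used for speaker selection. Cut-offs for
    non-integer means are assumed, not taken from the paper."""
    mean = sum(ratings) / len(ratings)
    if mean <= 2.5:
        return "high"   # rated 1-2 out of 5: easy to understand
    if mean < 3.5:
        return "mid"    # rated around 3 out of 5
    return "low"        # rated 4-5 out of 5: difficult to understand
```

For example, a speaker rated mostly 1s and 2s lands in the high-comprehensibility band, which the study then pairs with a mid and a low speaker from the same variety.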

5.1.2 Speaking tasks


The 18 selected speakers were asked to record themselves reading two text
tasks: (i) a randomly assigned listening stimulus passage (3–5 min) from the
TOEFL iBT listening texts of academic lectures and (ii) a series of 90 nonsense
sentences. The nonsense sentences were adopted from previous research on L1
intelligibility of individuals with hearing difficulties or disorders (Nye and
Gaitenby 1974; Picheny et al. 1985).

5.1.3 TOEFL listening passages


Twenty-three TOEFL passages were provided by the Educational Testing
Service. The passages corresponded to the 3–5 min monologic lecture portion
of the TOEFL. The subsequent questions are intended to ‘assess test takers’
ability to: understand main ideas or important details; … understand the organization of the information presented; understand relationships between the
ideas presented; and make inferences or connections among pieces of infor-
mation’ (TOEFL iBT Test Framework and Test Development 2010). Questions
are mostly multiple choice with one correct answer, although some questions
have more than one answer and therefore allow for partial credit. On the
survey, prior to each recording, there was a sentence of introduction such
as, ‘Listen to part of a talk in a United States history class’. Other lecture
topics included science, world history, astronomy, literature, language, and
the arts.
A screening and validation process was independently conducted by the
researchers to finalize a set of 18 of the 23 lecture scripts and related questions
that had similar ranges in item difficulty and discrimination. Forty-five partici-
pants (35 TOEFL preparatory students and 10 graduate students) took the 23
listening tests recorded by a GA-accented speaker to determine the difficulty of
the test items and the familiarity of the passages. Classical item analysis was
initially computed on participant responses to obtain index values of item
difficulty and discrimination. Many-Facet Rasch Measurement (MFRM) was
also performed for an additional validation process. The items matched by both
of these methods were finally selected for the study.
Based on the results of item difficulty analyses and passage familiarity scores, we removed five passages from further use because their difficulty levels deviated substantially from the rest. The items finally selected for the current study ranged
between 0.63 (the most difficult) and 0.89 (the easiest). Note that classical item
analysis statistics are sample-dependent; accordingly, the passages selected are
relative to the population of the current study. Infit values from MFRM ana-
lyses were also considered for this screening process. Using MFRM’s fit statis-
tics, we could eliminate a problematic or misfitting element. For practical
purposes, a reasonable range of Infit values between 0.5 and 1.5 has been
recommended (Lunz and Stahl 1990). Accordingly, the current study excluded two passages with logit values of −1.05 and 0.36: the former item (−1.05) was considerably easier than the others, and the latter item (0.36) was considerably more difficult.
Another passage was removed because it contained an item with a high Infit value (1.96), even though it was not one of the easiest items. In general, the Infit has an
expected value of 1.0, with a standard error of 0. Researchers (Linacre and
Wright 2002) suggest that for mean squares to be useful for practical purposes,
a reasonable range of Infit values should be between 0.5 and 1.5. Values close
to or above 2.0 are considered distorting (Myford and Wolfe 2004). We then deleted two more passages, which contained the next most difficult item (logit = 0.3) and the next easiest item (logit = −0.8), to keep the average difficulty logits similar across the passages.
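The classical item statistics and the screening rule above can be sketched briefly. This is an illustrative implementation under assumptions: discrimination is computed here as the item's correlation with the rest score (total minus the item), one common corrected variant, and the combined difficulty/Infit filter is inferred from the ranges reported in the text.

```python
def _pearson(x, y):
    """Pearson correlation, used below as the discrimination index."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def item_stats(responses):
    """Classical item analysis: `responses` holds one row of 0/1 item
    scores per test-taker. Difficulty is the proportion correct;
    discrimination is the item's correlation with the rest score."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    out = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [t - i for t, i in zip(totals, item)]
        out.append({"difficulty": sum(item) / len(item),
                    "discrimination": _pearson(item, rest)})
    return out

def keep_item(difficulty, infit,
              diff_range=(0.63, 0.89), infit_range=(0.5, 1.5)):
    """Assumed screening rule: retain items whose classical difficulty
    and Rasch Infit both fall within the ranges reported above."""
    return (diff_range[0] <= difficulty <= diff_range[1]
            and infit_range[0] <= infit <= infit_range[1])
```

In this scheme a passage containing the misfitting item (Infit 1.96) would be flagged by `keep_item` regardless of its difficulty, matching the exclusion described above.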
A few items remained relatively difficult, though not markedly different from the rest; overall, most items clustered around the mean (0). As for
the test of unidimensionality, the minimum and maximum of the mean square
values (MNSQ: 0.58–1.96) were slightly wider than the acceptable range suggested by McNamara (1996) (MNSQ: 0.75–1.3; ZSTD: −2 to +2). However,
item reliability was relatively high (0.88). Moreover, a brief vocabulary analysis of all 18 passages showed that every passage fell within a type-token ratio range of 0.41–0.49, 75–81 per cent coverage by the most common 1,000 words, and 3.7–4.4 per cent academic word list items. The selected passages also had familiarity scores ranging from 5.17 to 5.92 out of 7 (1 = very familiar and 7 = not familiar at all). Passages and items were then evenly and
systematically distributed across speakers of the six L1s, according to topics,
item difficulty for each testlet, and other task features.
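The vocabulary checks reported above (type-token ratio, common-word coverage, academic word list percentage) can be sketched as follows. The tokenization is deliberately naive, and the two word lists are assumed inputs (e.g. sets loaded from standard frequency and AWL lists).

```python
def vocab_profile(text, common_1000, awl):
    """Sketch of the passage vocabulary checks described above:
    type-token ratio, coverage by a most-common-1,000-words set, and
    academic word list (AWL) percentage. Both word sets are assumed
    inputs; tokenization here is a simple whitespace/punctuation split."""
    tokens = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    ttr = len(set(tokens)) / len(tokens)            # type-token ratio
    common_pct = 100 * sum(t in common_1000 for t in tokens) / len(tokens)
    awl_pct = 100 * sum(t in awl for t in tokens) / len(tokens)
    return ttr, common_pct, awl_pct
```

Running such a profile over each candidate passage makes it straightforward to confirm that all passages fall inside narrow bands like the 0.41–0.49 TTR range reported above.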
The final set of official TOEFL-type materials in the first task was controlled
for style (a professor’s monologic speech), length (passages of 500–800 words),
type and number of questions (6), and content (topics deemed appropriate for
university level students). All the listening passages assigned to each speaker
were similar in style but different in topic; that is, speakers recorded different
listening passages. To prevent a speaker’s speech rate from confounding intelligibility (Derwing and Munro 2001), we also ensured that final
speech files fell within the range of 2.2–2.8 words (approximately 3.2–3.6
syllables) per second by having speakers re-record them if they did not hit
this target on the first pass (which most of them did). In Derwing and
Munro’s (2001) study, speech rate was found to be integral to intelligibility; in the current study, we intentionally controlled for speech rate to
neutralize a ‘nuisance’ variable. While speech rate is difficult to control in
authentic conversational situations, it can easily be controlled for in a listening
passage lecture.
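The re-record criterion above reduces to a simple band check. The helper below assumes word and syllable counts are taken from the passage script and the duration from the recording; the two ranges are those reported in the text.

```python
def rate_ok(n_words, n_syllables, duration_secs,
            wps_range=(2.2, 2.8), sps_range=(3.2, 3.6)):
    """Check a recording against the target band above (2.2-2.8 words/sec,
    roughly 3.2-3.6 syllables/sec); recordings outside the band were
    re-recorded. Counts are assumed to come from the passage script."""
    wps = n_words / duration_secs
    sps = n_syllables / duration_secs
    return (wps_range[0] <= wps <= wps_range[1]
            and sps_range[0] <= sps <= sps_range[1])
```

For instance, a 600-word, 840-syllable passage read in 250 seconds (2.4 words/sec) passes, while the same passage read in 150 seconds does not.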
The final 18 speakers’ recordings were also evaluated by 48 novice listeners
who provided scalar judgments of both strength of accent and comprehensi-
bility. These raters included a mixture of undergraduate students (19), teachers
(8), and graduate students (21). For each speaker, listeners were asked to
complete two 7-point Likert scale items to reflect accent and comprehensibil-
ity: 1 = ‘no accent’/7 = ‘heavy accent’, and 1 = ‘easy to understand’/7 = ‘difficult
to understand’. Mean accent ratings for the three speakers from each country were as follows: 1.21, 1.27, and 1.33 for American speakers; 2.33, 2.56, and
2.62 for British speakers; 3.44, 4.29, and 5.11 for Indian speakers; 3.68, 5.27,
and 5.53 for South African speakers; 3.60, 5.60, and 5.84 for Mexican speak-
ers; and 2.04, 4.13, and 5.67 for Chinese speakers.
Although the scales for the preliminary rating and the 48 novice listeners
were different, the comprehensibility ratings assigned by the novice raters
roughly matched those given by the expert raters. Comprehensibility ratings
showed a very similar pattern in that while GA and RP speakers did not dem-
onstrate variance in their comprehensibility, the speakers of other English
accents did. High-comprehensibility speakers ranged from 1 to 2, mid-compre-
hensibility speakers ranged from 3 to low-4, and low-comprehensibility speak-
ers ranged from mid-4 to 5. As a result, GA and RP speaker recordings were
treated as equivalent for the relative degree of comprehensibility, while
recordings from the speakers of the other English accents could be labeled low,
mid, and high in terms of their relative comprehensibility. The mean differ-
ences of both the comprehensibility and accentedness scores among these
three levels were statistically significant (p < .05).

5.1.4 Nonsense sentences


A measure of intelligibility using nonsense sentences is one of the most
common methods used in the field of speech language pathology to investigate
the clarity of speech, but it has been underutilized in the field of applied lin-
guistics. It was one of five measurement techniques used as part of the larger
project from which this study stems (Kang et al. 2018a), which explored how
best to operationalize intelligibility. We selected this particular approach to
examine the intelligibility of different varieties of Englishes because it was
most significantly associated with listeners’ comprehension scores in that
larger study (Kang et al. 2018a).
Specific nonsense sentence items were chosen from the test banks used in
Nye and Gaitenby (1974) and Picheny et al. (1985). The sentences were se-
mantically meaningless while being syntactically normal, and they were composed of high-frequency monosyllabic English words. Sentences taken from
Nye and Gaitenby (1974) all followed the pattern ‘The (adjective) (noun)
(verb, past tense) the (noun).’; those from Picheny et al. were slightly more
grammatically complex but had four distinct content words (e.g. The tall kiss
can draw with an oak.). Therefore, listeners were asked to fill out four missing
content words per sentence. The list included 72 sentences (4 sentences × 18 speakers). Four sentences from each speaker were selected and randomized for use in the evaluations. Accordingly, the possible range of this intelligibility measure for each speaker is 0 to 16 (4 missing words × 4 sentences).
The sentences were presented auditorily on the survey with four blanks
beneath each recording. Participants typed each of the four words they heard into the corresponding blank. During manual coding, incorrect spellings counted as correct as long as they were interpretable.
Likewise, homophones were accepted because of the lack of semantic context
of the sentences.
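A sketch of how this scoring scheme might be automated is shown below. The homophone list, the fuzzy-match threshold, and the helper names are our own illustrative assumptions; the authors scored the transcriptions manually.

```python
from difflib import SequenceMatcher

# Hypothetical homophone map; in the study, homophones were accepted
# because the sentences carried no semantic context.
HOMOPHONES = {"oak": {"oke"}, "drew": {"drue"}}

def word_correct(target: str, response: str) -> bool:
    """Accept exact matches, listed homophones, and interpretable
    misspellings (similarity threshold is an assumption)."""
    target, response = target.lower().strip(), response.lower().strip()
    if response == target or response in HOMOPHONES.get(target, set()):
        return True
    return SequenceMatcher(None, target, response).ratio() >= 0.8

def score_speaker(keys, transcripts) -> int:
    """Sum correct words over 4 sentences x 4 content words (0-16)."""
    return sum(word_correct(t, r)
               for key, resp in zip(keys, transcripts)
               for t, r in zip(key, resp))
```

For example, a listener who typed 'kis' for 'kiss' and 'oke' for 'oak' would still be credited for both words under this rule.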
All recorded files were converted to .wav format using the program Freemake Audio Converter and screened for sound quality and volume. They were also edited visually and aurally from a waveform display using Praat. Any vocal dysfluencies, such as throat clearings or restarts, were deleted. The edited sound files were then embedded, after randomization, into several surveys using the assessment tool SurveyGizmo.


5.2 Listeners
Sixty listeners from the same six countries represented in our listening materials (the TOEFL monologic lectures) were recruited to take the TOEFL listening and intelligibility tests we had created: 10 American (three males and seven females), 10 British (four males and six females), 10 Indian (five males and five females), 10 non-Anglophone South African (six males and four females), 10 Chinese (three males and seven females), and 10 Spanish (five males and five females). Listeners were senior undergraduate or early-stage graduate students. All NNES listeners were highly proficient in English; that is, they had received TOEFL iBT scores of 100 or higher. Their ages ranged from 18 to 32.
All listeners were asked to complete a short demographic survey and a short diagnostic test. The diagnostic test consisted of a one-passage listening test with six questions derived from a currently available TOEFL iBT practice test and produced by a standard GA speaker. Listeners who made no more than one error on this practice section were invited to complete the study.

5.3 Data collection procedures


This project focuses on one listening comprehension test and one intelligibility measure: transcription of four missing content words in semantically nonsensical sentences. Listeners were required to take the tests on two different days: the listening comprehension test on Day 1 (18 passages in total, approximately 2 h) and the nonsense sentence intelligibility test on Day 2 (30 min). They were instructed to complete each day's test in one sitting. Listeners were allowed to take notes for the comprehension test, but not for the intelligibility measure.
Each listening session was highly controlled and supervised; listeners completed each test session in an approved computer lab, using headsets to ensure sound fidelity. Five links were created that were identical except for the order of listening passages and nonsense sentences. The links were distributed randomly to the participants such that approximately 20 per cent of the listeners received Link 1, another 20 per cent received Link 2, and so on. This controlled for order effects as well as listener fatigue. All listeners were compensated for their participation.

5.4 Phonetic and phonological analysis


Recorded speech was subjected to phonetic and phonological analyses for seg-
mental, prosodic, and fluency features. The features of interest are known to
be highly correlated with native English speakers’ (NES) and NNESs’ commu-
nicative success (Anderson-Hsieh and Venkatagiri 1994; Pickering 2001;
Kormos and Denes 2004). Since the pronunciations of other varieties of English are not considered 'errors' in this study, we use the term 'divergence' to refer to any difference relative to the American English standard typically associated with the TOEFL. The GA English used as the basis for comparison was that described in Avery and Ehrlich (2008).
A trained phonetician identified all divergences in the TOEFL lecture recordings. Inter-coder reliability of .93 or above, measured through intraclass correlation coefficients, was achieved between the trained phonetician and two other coders who analyzed a subset of 10 per cent of the speech data. All pronunciation categories examined are provided in Table 1. Although the final two features (i.e. absence of postvocalic /r/ and absence of flap) should not be regarded as production 'errors', being characteristic of certain varieties of English, they can nonetheless cause breakdowns in intelligibility and were thus noted. If a word containing a divergence from American English was repeated (with that same divergence), each occurrence was tallied separately.
We further analyzed certain segmental divergences (i.e. consonant and vowel substitutions) in terms of the functional load principle, classifying them as high versus low functional load substitutions, following Catford (1987) and Brown (1991). Rankings from 51 to 100 per cent were treated as high functional load substitutions (e.g. 'pit' versus 'bit') and those from 0 to 50 per cent as low functional load substitutions (e.g. 'they' versus 'dey'). The overall segmental divergence measure was calculated as the total number of segmental divergences divided by the total number of syllables articulated. This assessment was completed for both the listening passages and the sentences.
Additional information regarding the phonetic and phonological analyses
can be found in Online Supplementary Materials and also in Kang et al.
(2018a).
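The two quantities described above, the syllable-normalized divergence rate and the functional load classification, can be sketched as follows (function names are ours; the 50 per cent cutoff follows Catford 1987 and Brown 1991):

```python
def divergence_rate(n_divergences: int, n_syllables: int) -> float:
    """Normalized segmental divergence: total divergences divided by
    total syllables articulated in the passage."""
    if n_syllables == 0:
        raise ValueError("passage must contain at least one syllable")
    return n_divergences / n_syllables

def functional_load_class(percentile_rank: float) -> str:
    """Classify a substitution by its functional load ranking:
    51-100 per cent = high, 0-50 per cent = low."""
    return "high" if percentile_rank > 50 else "low"

# e.g. 6 divergences across 240 articulated syllables -> rate of 0.025
print(divergence_rate(6, 240))   # 0.025
print(functional_load_class(90)) # 'high' (e.g. 'pit' vs. 'bit')
```

Normalizing by syllables articulated, rather than by raw word count, keeps the measure comparable across passages of different lengths and speech rates.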

5.5 Data analysis


During our initial analyses of the data (i.e. linear mixed-effects models (LMEMs) and analyses of variance (ANOVAs)), we found that the American and British listeners obtained significantly different TOEFL lecture comprehension scores than the other listeners (p < .001) (see Kang et al. (2018b) for more details). Accordingly, the American and British groups were excluded from the primary analysis of this study, as they do not represent target TOEFL test-takers. The subsequent analyses were conducted using a listener group comprising the four remaining countries: South African, Indian, Spanish, and Chinese listeners.
Several analyses were conducted to answer the first research question regarding the effect of the phonological features of speech on listeners' comprehension scores. To facilitate interpretation of the relationship between each of the phonological features and the listeners' comprehension scores, we first categorized the pronunciation features into three groups: (i) segmentals, (ii) prosody, and (iii) fluency. The segmental features include vowels and consonants, while prosodic features involve stress, intonation, and rhythm.


Table 1: Phonological and phonetic analyses of 18 speakers' listening comprehension passages

Segmental features: consonant deletion, vowel deletion, syllable reduction, consonant cluster simplification, high functional load vowel divergence, low functional load vowel divergence, high functional load consonant divergence, low functional load consonant divergence, linking divergence, consonant insertion, vowel insertion, dark /l/, absence of postvocalic /r/, absence of flap

Prosodic features: space, pace, number of prominent syllables, word stress divergence, level tone choices, falling tone choices, rising tone choices, number of tone units, rhythm

Fluency features: articulation rate, MLR, phonation time ratio, number of silent pauses, mean length of pauses, unexpected pause ratio, expected pause ratio

Fluency features refer to articulation rate, mean length of run, pauses, and hesitation markers.
Initially, a principal component analysis (PCA) was conducted as part of the dimension reduction process so that the number of normed phonological variables could be reduced for the final analysis, as in Kang et al. (2018a). Pearson correlation coefficients were computed to ensure that the final set of selected phonological features were not highly correlated with each other. LMEMs were then performed to examine how the overall phonological features affected the listening test scores. LMEM was chosen as the primary analysis because of its power and flexibility in including both random and fixed effects (Faraway 2005). For this mixed model, we treated the reduced phonological features as covariates and speakers and listeners as random effects. Listening comprehension scores and intelligibility scores served as dependent variables. These variables were also checked for normal distribution. The skewness and kurtosis test results ranged from −1.4 to 2.4 with SDs of 0.14–0.74. Given that acceptable values for psychometric purposes are between −2 and +2 (George and Mallery 2010), these variables were considered to be normally distributed.
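An LMEM with crossed random effects for speakers and listeners can be sketched in Python with statsmodels. This is a minimal illustration, not the authors' analysis: the data below are synthetic and the column names are our own; crossed (non-nested) random effects are expressed through variance components within a single dummy group.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: 20 listeners x 18 speakers, one
# comprehension score per listener-speaker pair (built with a known
# negative slope so the model has something to recover).
rng = np.random.default_rng(1)
rows = []
for listener in range(20):
    for speaker in range(18):
        seg = rng.uniform(0, 0.1)                    # divergence rate
        score = 5.5 - 10 * seg + rng.normal(0, 0.5)  # true slope = -10
        rows.append({"listener": listener, "speaker": speaker,
                     "seg_divergence": seg, "comprehension": score})
df = pd.DataFrame(rows)
df["one"] = 1  # constant group so both factors enter as crossed effects

# Crossed random effects for listener and speaker via variance components
model = smf.mixedlm("comprehension ~ seg_divergence", df, groups="one",
                    vc_formula={"listener": "0 + C(listener)",
                                "speaker": "0 + C(speaker)"})
result = model.fit()
print(result.params["seg_divergence"])  # estimated slope (negative)
```

In the study itself, all five clustered phonological predictors entered the fixed part of the model together, with listeners and speakers as random effects.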


Our preliminary analyses (i.e. LMEMs and ANOVAs) revealed no statistically significant difference in listeners' comprehension scores between the six GA and RP speakers and the four most highly intelligible speakers (one speaker representing each of the South African, Indian, Spanish, and Chinese accents) (p > .54), based on pairwise comparisons of the average listening comprehension scores for all of these groups. Accordingly, pronunciation features of these 10 highly intelligible speakers were used to describe a cutoff point for a speaker's intelligibility threshold. Descriptive statistics were used to tentatively establish this threshold baseline.

6. RESULTS
First, we conducted PCA analyses to reduce the number of phonological variables. We provide a summary of the results below; more detailed information regarding this preliminary step of our analysis can be found in Kang et al. (2018a).
The PCA was computed three times, independently for each of the three categories (segmental, prosodic, and fluency features), to maintain the transparent nature of each category, which enhanced the interpretation of the composite variables. In the first computation, the principal component consisted of the following six features: consonant deletions, syllable reductions, consonant cluster divergence, high functional load vowel substitutions, low functional load vowel substitutions, and high functional load consonant substitutions. All six features extracted had positive coefficients; accordingly, we combined them into one super-feature called 'consonant and vowel divergence'. In the second computation, pace, word stress divergence, and falling tone choices patterned together, all with negative component loadings, whereas number of tone units, rising tone choices, and rhythm loaded in the opposite direction, with positive coefficients, while still correlating strongly with Principal Component 1. Accordingly, we labeled the first three features (pace, word stress divergence, and falling tone choices) 'impeding prosodic markers' and the other three features (number of tone units, rising tone choices, and rhythm) 'enhancing prosodic markers'.
In the third computation, articulation rate, number of silent pauses, and unexpected pause ratio showed a positive relationship with Principal Component 1, whereas mean length of run (MLR) and expected pause ratio showed a negative association. As a result, these three features were grouped and labeled 'impeding fluency markers', as increases in these variables hindered the listeners' understanding of the lectures. The other two features (i.e. MLR and expected pause ratio) were composited and labeled 'enhancing fluency markers' because increases in them would enhance the listeners' listening comprehension.
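The sign-based grouping described above, extracting Principal Component 1 within one feature category and splitting features by the sign of their loadings, can be sketched with plain NumPy (the data and variable names here are illustrative, not the study's):

```python
import numpy as np

def first_component_loadings(X: np.ndarray) -> np.ndarray:
    """PC1 loadings for one feature category (rows = speakers,
    columns = features), after standardizing each column."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Principal axes are the right singular vectors of the centered data
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return vt[0]

# Toy data: 18 speakers x 6 fluency-like features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(18, 6))
loads = first_component_loadings(X)

# Features sharing a loading sign are composited into one marker set
impeding = np.where(loads > 0)[0]   # e.g. positively loading features
enhancing = np.where(loads < 0)[0]  # e.g. negatively loading features
```

Running the PCA separately per category, rather than pooling all features, is what keeps each resulting composite interpretable as purely segmental, prosodic, or fluency-based.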
In sum, for the LMEM analysis, the following five independent variables were created as predictors for the response variables: (i) vowel and consonant divergence, (ii) impeding prosody markers, (iii) enhancing prosody markers, (iv) impeding fluency markers, and (v) enhancing fluency markers. Table 2 summarizes these five categories with their constituent phonological features.

Table 2: Summary of phonological features for the five clustered variables

Vowel and consonant divergence: consonant deletion, syllable reduction, consonant cluster divergence, high functional load vowel substitutions, low functional load vowel substitutions, and high functional load consonant substitutions
Impeding prosody markers: pace, word stress divergence, and falling tone choices
Enhancing prosody markers: number of tone units, rising tone choices, and rhythm
Impeding fluency markers: articulation rate, number of silent pauses, and unexpected pause ratio
Enhancing fluency markers: MLR and expected pause ratio

6.1 Effect of phonological features on listener comprehension and intelligibility scores
The Pearson correlations of the five clustered variables confirmed that all vari-
ables were relatively independent of each other (.325 or lower). Consequently,
we computed the linear mixed-effects models using each listener and speaker
as random effects and all five phonological variables as covariates for both
listening comprehension and intelligibility as dependent variables. Table 3
below reports estimates of main effects for each of the phonological parameters
for each of the two models. The correlation estimates are the slopes for the effects of the predictors on the outcome variables, and t-values are the estimates divided by their standard errors.
cing prosody, and impeding fluency significantly impacted the listeners’ com-
prehension score (p < .05). Using Nakagawa and Schielzeth’s (2013) suggested
formula, conditional R2 = 0.31 was calculated with the fixed and random effect
variance included. That is, approximately 31 per cent of variance in listening
comprehension scores was explained by phonological variables selected in the
model when both random (listeners and speakers) and fixed factors were
considered.
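Nakagawa and Schielzeth's (2013) conditional R-squared can be written as a small helper. The variance values below are illustrative numbers chosen only to reproduce the reported R2 of approximately 0.31, not values from the study.

```python
def conditional_r2(var_fixed: float, var_random: float,
                   var_residual: float) -> float:
    """Nakagawa and Schielzeth's (2013) conditional R-squared:
    variance explained by fixed plus random effects over total variance,
    R2_c = (s2_f + s2_r) / (s2_f + s2_r + s2_e)."""
    total = var_fixed + var_random + var_residual
    return (var_fixed + var_random) / total

# Illustrative variances that happen to yield the reported R2 of ~0.31
print(round(conditional_r2(0.20, 0.11, 0.69), 2))  # 0.31
```

Unlike marginal R2, which counts only the fixed-effects variance, the conditional form also credits the listener and speaker random effects, which is why it is the appropriate summary for this model.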
Vowel and consonant divergence was a significant predictor of listening comprehension, with an effect size of d = 0.15, indicating a small effect. Its correlation estimate was negative, indicating an inverse relationship with the comprehension scores: as the listeners' performance increased, the speech samples had fewer vowel and consonant divergences


Table 3: Estimates of main effects for the five selected phonological features on the listening comprehension test and intelligibility measure

Each row gives, for the listening comprehension test and then the intelligibility measure: Corr. est. / S.E. / t / Sig. (df = 714 for all parameters).

Intercept: 5.505 / 0.607 / 9.065 / .000 | 11.073 / 1.438 / 7.698 / .000
Vowel/consonant divergence: −10.2 / 1.630 / −6.255 / .000 | −49.801 / 3.862 / −12.895 / .000
Impeding prosody: 0.218 / 0.007 / 1.643 / .101 | −1.761 / 0.170 / −5.590 / .000
Enhancing prosody: 0.034 / 0.030 / 2.447 / .015 | 0.002 / 0.071 / 0.084 / .933
Impeding fluency: −0.018 / 0.013 / −2.638 / .008 | 0.028 / 0.315 / 1.701 / .089
Enhancing fluency: 0.0321 / 0.014 / 1.088 / .277 | 0.444 / 0.033 / 6.214 / .000

Notes. Corr. est. = correlation estimate; S.E. = standard error. Significance at p < .05. Listening comprehension scale = 0–6; intelligibility scale = 0–16.

particularly regarding consonant cluster divergence, syllable or consonant deletions, or high functional load vowel and consonant substitutions. Enhancing prosody markers predicted the listening test score at a significant level with a small effect size of d = 0.03. These enhancing markers were directly proportional to the comprehension test scores: the listeners' performance improved significantly when the number of tone units and rising tones increased in the speech. The speaker's rhythmic ability, as part of enhancing prosody, was also a significant predictor of the listening test. That is, when the speaker made a large contrast between stressed and unstressed syllables, which is how rhythm was measured in this study, the listeners were able to comprehend the lecture better than when the speaker did not attend to this contrast. In addition, impeding fluency markers significantly predicted the response variable (d = 0.04), a small effect with a negative relationship to the listening test scores. In other words, the listeners' performance decreased when listening to speech that was fast or contained many pauses, particularly in unexpected pause locations.
Three out of five phonological features made significant contributions to predicting the intelligibility scores. The vowel/consonant divergence variable was the strongest, with an effect size of d = 0.21, followed by the enhancing fluency markers with an effect size of d = 0.16. The impeding prosody markers also significantly predicted the measure of intelligibility, with an effect size of d = 0.09; the coefficient estimate was negative, indicating an inversely proportional relation with intelligibility scores. The enhancing prosody and impeding fluency markers did not exert any significant effect on the nonsense sentence measure. The conditional R2 for this model was 0.69 with both the fixed and random effect variance included. Note that even though the selected phonological variables reported above showed statistical significance (p < .01), their effect sizes were small (0.03–0.21) and may not reflect meaningful effects.

6.2 Threshold of intelligibility for high-stakes listening comprehension test
We then wanted to determine the point at which speakers were equally in-
telligible, despite differences in first language; in other words, we sought a
threshold of intelligibility. It is important to remember, however, that our
findings are mostly exploratory and descriptive in nature due to a small
sample of speakers (18).
A bar graph was created to examine the distribution of the 18 speakers’
intelligibility scores operationalized by the nonsense statement method (see
Figure 1). The speakers are displayed on the horizontal axis. The Y-axis rep-
resents the intelligibility scores ranging from 0 to 16.
Table 4 below also shows the descriptive statistics for the intelligibility scores assigned to all 18 speakers. We used GA and RP Englishes as the baseline comparison for the following two reasons. First, in the range of 0–16 (i.e. 4 missing words per sentence × 4 sentences per speaker = 16), the GA and RP speakers all scored 11.35 or higher (i.e. the mean of British Speaker #2 = 11.35) on the intelligibility measure. This means that roughly three out of four missing words must be accurately transcribed in a decontextualized sentence if a speaker is to be considered highly intelligible. We can therefore cautiously argue that, when intelligibility is operationalized via nonsense sentences on a 0–16 scale, any speaker considered for use in the listening test should score approximately 11.35 or higher; that is, approximately 71 per cent accuracy (11.35/16) is required on this particular intelligibility measure.
The most highly intelligible South African (SA), Indian (IN), Spanish (SP), and Chinese (CH) English speakers fell securely within this range of intelligibility scores: SA3 = 11.7, IN3 = 13.28, SP3 = 12.9, and CH3 = 12.01, respectively. Interestingly, on this nonsense measure, no speaker received a full average score of 16, presumably due to the complexity of the task. In other words, it may have been too cognitively challenging for listeners to achieve perfect scores, regardless of speakers' intelligibility.
Table 5 provides descriptive statistics of listeners’ comprehension scores
among the six GA and RP speakers and the four most highly intelligible speak-
ers from each of the four countries mentioned above. No statistical difference
was found in pairwise comparisons of all of these groups for the average lis-
tening comprehension scores (i.e. F = 0.811, p = .542). Accordingly, these 10
speakers were used to explore the threshold of a speaker’s intelligibility in
the next section. Note that additional scatter plot distributions of all 18 speakers confirmed that speakers with high listening comprehension scores (i.e. 5 or above out of 6) also corresponded to those with high intelligibility scores (i.e. approximately 12.5 or above out of 16).



Figure 1: Distribution of intelligibility scores of nonsense sentences for 18 speakers.
Note. For each traditionally non-standard variety, the first speaker in
each group has the lowest intelligibility, while the third speaker in each
group has the highest intelligibility. Intelligibility scale = 0–16.

The phonological characteristics of the six GA and RP speakers and the four most intelligible South African, Indian, Spanish, and Chinese speakers are described in Table 6. The characteristics listed include prominence and tone selection, as these have been found to have direct and consequential impacts on listener perception and comprehension, as well as segmental features that deviated from Standard American English yet did not significantly diminish intelligibility. This information suggests that highly intelligible speakers whose speech manifests some divergence from GA English norms can be used for recording lectures in assessment contexts. In other words, some (particular) divergences from GA English do not decrease the general intelligibility of highly intelligible speakers, indicating that these speakers' renditions of listening comprehension passages in high-stakes English tests, such as the TOEFL, would not be a source of test bias. The phonological features in Table 6 can be considered those most likely to be associated with highly intelligible speakers.
Based on the frequency of vowel and consonant divergences in each highly intelligible speaker's recording of the entire passage, we calculated divergence rates. The phonological features above demonstrate that divergence rates for vowels in content words for intelligible speakers are below 4.1 per cent (e.g. fewer than 5 divergences out of 120 content words) and divergence rates for consonants in content words are below 2.5 per cent (e.g. fewer than 3 divergences out of 120 content words). We could then look at


Table 4: Descriptive statistics for the intelligibility scores of all 18 speakers
95 per cent confidence interval for mean

Mean SD Lower bound Upper bound Minimum Maximum

GA1 12.516 2.683 11.823 13.210 6.00 16.00


GA2 11.466 3.055 10.677 12.256 5.00 15.00
GA3 13.166 1.786 12.705 13.628 8.00 16.00
RP1 12.616 2.687 11.922 13.311 6.00 16.00
RP2 11.350 3.318 10.498 12.207 3.00 16.00
RP3 13.616 2.307 13.020 14.212 8.00 16.00
SA1 8.750 2.159 8.192 9.308 3.00 13.00
SA2 8.616 2.565 7.954 9.279 3.00 14.00
SA3 11.700 2.513 11.050 12.349 5.00 16.00
IN1 9.883 3.425 8.998 10.768 3.00 16.00
IN2 10.900 2.710 10.199 11.600 4.00 15.00
IN3 13.283 2.917 12.529 14.037 6.00 16.00
SP1 10.033 1.765 9.577 10.489 7.00 13.00
SP2 9.100 1.612 8.683 9.516 6.00 12.00
SP3 12.916 1.441 12.544 13.289 8.00 15.00
CH1 10.116 1.439 9.744 10.488 6.00 13.00
CH2 10.333 1.653 9.906 10.760 5.00 13.00
CH3 12.016 1.722 11.571 12.461 9.00 15.00

Table 5: Descriptive statistics for the listening comprehension scores of 10


highly intelligible speakers
95 per cent confidence interval for mean

Mean SD Lower bound Upper bound Minimum Maximum

GA 5.405 0.809 5.287 5.525 2.00 6.00


RP 5.372 0.845 5.253 5.491 2.00 6.00
SA 5.400 0.763 5.194 5.606 3.00 6.00
IN 5.283 0.975 5.077 5.489 2.00 6.00
SP 5.566 0.592 5.361 5.773 4.00 6.00
CH 5.366 0.780 5.161 5.573 3.00 6.00


Table 6: Descriptions of the phonological features of individual speakers rated highly intelligible

American 1: Pace = 2.72
Falling tone = 61 per cent; level tone = 0 per cent; rising tone = 39 per cent

American 2: Pace = 2.37
Falling tone = 39 per cent; level tone = 2 per cent; rising tone = 59 per cent

American 3: Pace = 2.48
Falling tone = 59 per cent; level tone = 6 per cent; rising tone = 35 per cent

British 1: Pace = 2.53
Falling tone = 51 per cent; level tone = 0 per cent; rising tone = 49 per cent
/O/ → /ow/ (1); absence of flap (various times); absence of postvocalic /r/ (various times); /A/ → /O/ (3); /æ/ → /A/ (various times); addition of /h/ (1); minor word stress (1)

British 2: Pace = 2.15
Falling tone = 54 per cent; level tone = 2 per cent; rising tone = 44 per cent
Word stress (1); /æ/ → /A/ (various times)

British 3: Pace = 2.04
Falling tone = 48 per cent; level tone = 4 per cent; rising tone = 28 per cent
Absence of postvocalic /r/ (various times); /O/ → /ow/ (1); absence of flap (various times); /A/ → /O/ (1)

South African 3: Pace = 2.41
Falling tone = 52 per cent; level tone = 12 per cent; rising tone = 36 per cent
Word stress (2); absence of flap (various times); absence of postvocalic /r/ (various times); trilled /r/ (various times)

Indian 3: Pace = 2.81
Falling tone = 63 per cent; level tone = 0 per cent; rising tone = 37 per cent
/s/ → /z/ (1); /w/ → /v/ (1); word stress (1); absence of flap (various times)

Spanish 3: Pace = 2.30
Falling tone = 57 per cent; level tone = 3 per cent; rising tone = 40 per cent
/I/ → /iy/ (1); /z/ → /s/ (2); /ô/ → /d/ (1); /A/ → /ow/ (1)

Chinese 3: Pace = 2.23
Falling tone = 52 per cent; level tone = 0 per cent; rising tone = 44 per cent
/I/ → /iy/ (1); /"/ → /iy/ (1)

Note. Pace = the average number of prominent syllables per run. The number or percentage after each feature signifies number of occurrences.

the maximum possible divergence cases for each phonological category. For example, to set a divergence rate threshold for vowels, we examined vowel divergence occurrences across all 10 speakers. British Speaker #1 had five divergences, Spanish Speaker #3 two divergences, Chinese Speaker #3 two divergences, and so forth. The five vowel divergences found for the British speaker were the highest (most frequent) value among all speakers.
Speakers with divergence rates for vowels in content words below 4.1 per cent, divergence rates for consonants in content words below 2.5 per cent, divergence rates for lexical stress below 1.6 per cent (i.e. no more than 2 stress divergences out of 120 content words), and a pace (number of prominent syllables per run) of no more than 2.72 can be considered intelligible, particularly in the assessment context of the current study. For vowels and consonants, divergences involving high functional load contrasts should not exceed 1.6 per cent (i.e. 2 out of 120 words). For tone choices, falling tones should comprise no more than 63 per cent of tone choices, and rising tones at least 37 per cent.
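The tentative screening criteria above can be expressed as a simple rate check. This is a sketch of the thresholds reported in this section (the function name and count-based inputs are ours), not a validated selection instrument.

```python
# Tentative intelligibility screening sketch: divergence counts over
# 120 content words, thresholds taken from the section above.
def meets_threshold(vowel_div: int, consonant_div: int, stress_div: int,
                    high_fl_div: int, content_words: int = 120) -> bool:
    return (vowel_div / content_words < 0.041        # vowels < 4.1%
            and consonant_div / content_words < 0.025  # consonants < 2.5%
            and stress_div / content_words < 0.016     # stress < 1.6%
            and high_fl_div / content_words < 0.016)   # high FL < 1.6%

# e.g. 4 vowel, 2 consonant, 1 stress, and 1 high-FL divergence
print(meets_threshold(4, 2, 1, 1))  # True
```

A speaker exceeding any one of these rates, for instance 10 vowel divergences in 120 content words, would fall outside the tentative threshold.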


7. DISCUSSION
We examined the extent to which variability in three clusters of phonetic/
phonological features commonly associated with intelligible and comprehen-
sible speech might predict listeners’ comprehension scores on the TOEFL lis-
tening test and the intelligibility test scores. These clusters were defined as
segmental, prosodic, and fluency features. Among segmental features, diver-
gences in high functional load vowels, along with divergences in both high and
low functional load consonants, consonant deletion, syllable reduction, and
consonant cluster divergence were predictors of both listener comprehension
and intelligibility scores. These results not only support previous research but also provide experimental evidence validating Catford's (1987) and Brown's (1991) claims regarding functional load.
Beyond supporting the notion of functional load, our results also agree with
Bent, Bradlow, and Smith (2007), who found that many consonant and vowel
errors played a role in intelligibility. Probing individual consonants in more detail, our results are also similar to those of Magen (1998), who found that some consonant divergences matter to intelligibility, again possibly related to functional load. With respect to consonant divergence, it is also worth noting that
its impact on intelligibility differs depending on the consonant error’s position
in a word (Bent et al. 2007); word-initial consonant divergence is more detri-
mental to intelligibility than is consonant divergence in other positions. While
our data did not lend themselves to a similar analysis, it is advisable to consider this constraint when determining the severity of particular consonant divergences in the development of test listening passages spoken in varied English accents.
We examined selected suprasegmental features associated with intonation,
stress, and rhythm. Of these, the appropriate use of rising tones, tone units, and rhythm, clustered as enhancing prosody markers, were found to be strong predictors of our participants' listening comprehension scores. At the same
time, lexical stress and the inappropriate use of falling tone significantly pre-
dicted intelligibility scores. These findings are highly consistent with previous
studies. For example, Wennerstrom (2000) and Kang (2010) have convin-
cingly demonstrated that accurate lexical stress is particularly consequential
for intelligibility, relative to other suprasegmental error types. Furthermore,
divergence in lexical stress assignment has been shown to cause listeners
difficulty in comprehension of oral texts (Hahn 2004). With respect to the
appropriate use of rising tone, Kang (2012) argues that patterns associated with
particular English accents can cause listeners from other groups to have diffi-
culty paying attention, or even lead to listeners’ misunderstanding of speakers’
intent. More specific to language assessment, Kang (2010) found that when
there is a mismatch between listener expectations and speech rhythm, the
speaker’s comprehensibility declines, along with listener perceptions of the
speaker’s oral proficiency.
The temporal fluency measures we calculated from the listening samples
were also found to significantly predict listener comprehension and
intelligibility scores, but the pattern of these relationships was mixed, and their
practical importance may therefore be limited. Impeding fluency significantly predicted listener
comprehension, but did not predict intelligibility scores strongly. On the
other hand, enhancing fluency predicted intelligibility scores significantly,
but not listener comprehension. These mixed findings in the relationship be-
tween temporal measures and comprehension and intelligibility should not be
taken to imply that temporal aspects of speech are unimportant. Rather, the
nature of our stimuli was such that temporal features across speech samples
were homogeneous, idealized, and controlled. All of our speakers who provided
recordings for the listening passages had been given the opportunity to re-
hearse the speech samples ahead of time and were asked to re-record them
in any cases where the speech rate did not fall within the normally expected
TOEFL range. This strict control might serve as a methodological limitation if
we were investigating intelligibility in a natural context, precluding a more
detailed analysis of the role of temporal measures in listening comprehension.
However, it is justifiable for our target context, in which the same measures (i.e.
rehearsal, control of speech rate, etc.) would likely be applied to prospective
lecture readers. The results confirm that when listening passages are
spoken with unfamiliar accents, as long as the normal speech rate expected for
the listening test is controlled, test developers can be confident that any loss of
comprehension for a particular speech sample is attributable to segmental and/
or suprasegmental divergence, and not temporal features.
The highly intelligible speakers discussed in Table 6 above rarely showed
divergences involving consonant deletion, syllable reduction, or consonant
cluster errors. Their speech did, however, include certain segmental divergences
involving high functional load vowels, low functional load vowels, and high
functional load consonants, although such divergences were sparse (i.e.
occurring only once or twice across an entire speech sample).
Therefore, we can cautiously state that speakers who are recruited to produce
spoken materials for a high-stakes listening comprehension test should dem-
onstrate very few segmental divergences in content words over the entire
high-stakes listening passage. For example, TOEFL passages comprise approxi-
mately 120 content words. The highly intelligible speakers in our study who
we believe would be acceptable candidates for producing spoken test materials
produced two to five segmental divergences within high functional load
vowels and consonants. This corresponds to divergence rates of roughly 1.6 to 4
per cent. We recommend discounting segmental divergence in function words,
as such divergence does not lead to a decrease in passage comprehensibility.
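As a rough illustration of this arithmetic (the function and example counts below are ours, not part of the study's coding instrument), the divergence rate for a TOEFL-style passage of approximately 120 content words can be computed as follows:

```python
# Illustrative sketch only: reproduces the divergence-rate calculation
# described above. The 120-word figure comes from the text; the function
# name and example counts are our own.

def divergence_rate(divergences: int, content_words: int = 120) -> float:
    """Return segmental divergences as a percentage of content words."""
    return 100.0 * divergences / content_words

# The most intelligible speakers produced two to five high functional
# load divergences over roughly 120 content words:
print(round(divergence_rate(2), 1))  # 1.7
print(round(divergence_rate(5), 1))  # 4.2
```

Under the third recommendation above, divergences occurring in function words would simply be excluded from the count before this rate is computed.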
Ultimately, our attempt to identify which segmental, suprasegmental, and
temporal features are most deleterious to listener comprehension and intelligi-
bility scores was complicated by the large number of interrelations between
features typically associated with each category of divergence. It seems unlikely
that any particular divergence type impacts intelligibility and comprehension
scores in isolation, but rather, the nature of the divergence, combined with
where it occurs in relation to the proposition of an utterance, among other
listener variables, interact in determining the severity of particular divergence,
and the overall comprehensibility of a particular listening passage. Despite these
obvious complexities, by identifying those features that are most likely to impact
comprehension of listening passages spoken by a variety of English accents, our
results provide a reasoned basis for excluding particular speakers as candidates
from whom to obtain assessment-related speech samples due to low intelligibil-
ity. The current findings inform future research on phonological divergence of
highly intelligible speech and their relationship with listening performance.
Overall, test development can be informed by these results. Based on our
findings, we would recommend that if test developers are to include English
varieties beyond American and British, they should follow several guidelines.
First, in terms of phonological features in listening passages themselves, even
though generalizations should be made very carefully, highly intelligible
speakers (see Table 5) are characterized as rarely exhibiting consonant dele-
tion, inappropriate syllable reduction, and divergence in the pronunciation of
consonant clusters relative to American English norms. The characteristics of
these speakers appear analogous to those of the speakers who showed very weak
strength of accent in Ockey and French (2016). Speakers who do exhibit these
segmental divergences should be avoided in English assessments.
Second, the speech of highly intelligible speakers can include certain segmental
divergences involving high functional load vowels, low functional load vowels,
and high functional load consonants. However, the occurrence of such divergence
is very limited (i.e. once or twice through an entire TOEFL-type listening text).
Therefore, speakers likewise should exhibit few divergences of this nature.
Last, we recommend treating segmental divergence in function words, re-
gardless of the segment’s functional load, as less egregious, since divergence in
function words is not expected to lead to a decrease in passage comprehensi-
bility. Therefore, when developing tests, a speaker should not be automatically
rejected if their speech contains some divergence. However, the contextualized
speech (i.e. the passage read by the target speaker) must be carefully analyzed
for the types and places of divergence. Test developers should also attempt to
avoid errors that occur in a piece of information that is critical for an item,
because even one error in an otherwise accurate reading could affect a lis-
tener’s answer choice if that error involves a crucial word or phrase. While this
may be more time-consuming and not always predictable with complete ac-
curacy, it is a necessary step to be able to increase ecological validity of large-
scale English tests while simultaneously avoiding bias based on speaker accent.
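The three guidelines above might be operationalized as a simple screening sketch, assuming a passage has already been coded for each divergence type. All field names and the cut-off value here are our own illustrative choices, not the study's instrument:

```python
# Hypothetical speaker-screening sketch based on the three guidelines.
# Field names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PassageCoding:
    # Counts of each divergence type observed in a recorded passage.
    consonant_deletions: int
    syllable_reductions: int
    cluster_divergences: int
    high_fl_content_divergences: int  # high functional load segments in content words
    function_word_divergences: int    # discounted under the third guideline

def passes_screening(coding: PassageCoding, max_high_fl: int = 2) -> bool:
    """Guideline 1: reject speakers showing any of the three disqualifying types.
    Guideline 2: allow only a couple of high functional load divergences.
    Guideline 3: divergences in function words are not counted at all."""
    if (coding.consonant_deletions
            or coding.syllable_reductions
            or coding.cluster_divergences):
        return False
    return coding.high_fl_content_divergences <= max_high_fl

print(passes_screening(PassageCoding(0, 0, 0, 2, 4)))  # True
print(passes_screening(PassageCoding(1, 0, 0, 0, 0)))  # False
```

A real screening procedure would, as noted above, additionally weight divergences by their position relative to item-critical words, which a simple count cannot capture.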

8. CONCLUSION
We believe that the most significant impact of the present study is that it
provides a rational, systematic approach to selecting speakers with a variety
of English accents for inclusion as model speakers in English listening tests. The
resulting approach to delineating a threshold of intelligibility, while prelimin-
ary, has the potential to take much of the guesswork out of selecting speakers
of what have been traditionally considered non-standard English accents to
provide speech samples that are most likely to meet test requirements. In this
study, despite speaking with noticeably non-standard accents, the four most
intelligible speakers with South African, Indian, Spanish, and Chinese accents
produced speech samples that did not negatively impact listening test scores.
Our findings can be further utilized by language teachers to enhance English
learners' communicative success. Despite the limited number of speakers used in
this study, teachers can make informed decisions about their pronunciation
instruction: by identifying which features of speech are most likely to affect
intelligibility and comprehensibility for listeners (i.e. Table 5), teachers and
learners can better prioritize where to focus pronunciation instruction.
However, it must be noted that this was a script-reading task, which is inher-
ently different from spontaneous communication. To influence pronunciation
pedagogy, further research should be conducted on features of accented speech
that affect intelligibility and comprehensibility of conversational English.
Finally, we acknowledge several limitations to our study. The threshold of
intelligibility that we have tentatively attempted to establish in this study
should be interpreted with caution given the limited number of speakers, lis-
teners, and English varieties represented. Future research is needed to validate
the current findings with larger samples representing more varied populations.
It should also be noted that accent, the focus of the current study, is only
one dimension of difference across international varieties of English;
vocabulary and grammar, had they not been controlled, could also have
impacted test scores. In addition, the current study was limited to listeners
who were already highly proficient in English, and skilled at taking the high-stakes test in
particular. Additional research is called for to examine test-takers who repre-
sent a wider range of proficiency levels and testing experience.

NOTE
1 It is important to note that work within the World Englishes paradigm
includes attention to variation in lexis, grammar, and pragmatics, in addition
to pronunciation. For the purposes of the present discussion, we are only
concerned with differences in pronunciation, not with these other dimensions
of difference.

FUNDING
This research was funded by the Educational Testing Service (ETS) under a
Committee of Examiners and the TOEFL research grant. ETS does not discount
or endorse the methodology, results, implications, or opinions presented by
the researchers.

Conflict of interest statement. None declared.


REFERENCES
Abeywickrama, P. 2013. 'Why not non-native varieties of English as listening comprehension test input?,' RELC Journal 44: 59–74.
Anderson-Hsieh, J. and H. Venkatagiri. 1994. 'Syllable duration and pausing in the speech of Chinese ESL speakers,' TESOL Quarterly 28: 808–14.
Avery, P. and S. Ehrlich. 2008. Teaching American English Pronunciation. Oxford University Press.
Bent, T., A. R. Bradlow, and B. L. Smith. 2007. 'Segmental errors in different word positions and their effects on intelligibility of non-native speech: All's well that begins well' in O. S. Bohn and M. J. Munro (eds): Second-Language Speech Learning: The Role of Language Experience in Speech Perception and Production: A Festschrift in Honour of James E. Flege. John Benjamins.
Brown, A. 1991. 'Functional load and the teaching of pronunciation' in A. Brown (ed.): Teaching English Pronunciation: A Book of Readings. Routledge.
Catford, J. C. 1987. 'Phonetics and the teaching of pronunciation: A systemic description of English phonology' in J. Morley (ed.): Current Perspectives on Pronunciation: Practices Anchored in Theory. TESOL.
Derwing, T. and M. J. Munro. 2001. 'What speaking rates do non-native listeners prefer?,' Applied Linguistics 22: 324–37.
Derwing, T. M. and M. J. Munro. 1997. 'Accent, intelligibility, and comprehensibility: Evidence from four L1s,' Studies in Second Language Acquisition 19: 1–16.
Derwing, T. M., M. J. Rossiter, M. J. Munro, and R. I. Thomson. 2004. 'Second language fluency: Judgments on different tasks,' Language Learning 54: 655–79.
Elder, C. and L. Harding. 2008. 'Language testing and English as an international language: Constraints and contributions,' Australian Review of Applied Linguistics 31: 34.1–11.
Faraway, J. J. 2005. Linear Models in R (Texts in Statistical Science). Chapman and Hall/CRC.
Fayer, J. M. and E. Krasinski. 1987. 'Native and nonnative judgments of intelligibility and irritation,' Language Learning 37: 313–26.
Field, J. 2005. 'Intelligibility and the listener: The role of lexical stress,' TESOL Quarterly 39: 399–423.
Gass, S. M. and E. M. Varonis. 1984. 'The effect of familiarity on the comprehensibility of nonnative speech,' Language Learning 34: 65–89.
George, D. and M. Mallery. 2010. SPSS for Windows Step by Step: A Simple Guide and Reference, 17.0 Update. Pearson.
Gimson, A. C. 1980. An Introduction to the Pronunciation of English, 3rd edn. Edward Arnold.
Hahn, L. D. 2004. 'Primary stress and intelligibility: Research to motivate the teaching of suprasegmentals,' TESOL Quarterly 38: 201–23.
Hamp-Lyons, L. and A. Davies. 2008. 'The English of English tests: Bias revisited,' World Englishes 27: 27–39.
Harding, L. 2011. Accent and Listening Assessment: A Validation Study of the Use of Speakers with L2 Accents on an Academic English Listening Test. Peter Lang.
Iwashita, N., A. Brown, T. McNamara, and S. O'Hagan. 2008. 'Assessed levels of second language speaking proficiency: How distinct?,' Applied Linguistics 29: 24–49.
Jenkins, J. 2003. World Englishes: A Resource Book for Students. Routledge.
Jenkins, J. 2006. 'The spread of EIL: A testing time for testers,' English Language Teaching Journal 60: 42–50.
Kachru, B. B. 1992. The Other Tongue: English across Cultures. University of Illinois Press.
Kang, O. 2010. 'Relative salience of suprasegmental features on judgments of L2 comprehensibility and accentedness,' System 38: 301–15.
Kang, O. 2012. 'Impact of rater characteristics on ratings of international teaching assistants' oral performance,' Language Assessment Quarterly 9: 1–21.
Kang, O. and D. Rubin. 2009. 'Reverse linguistic stereotyping: Measuring the effect of listener expectations on speech evaluation,' Journal of Language and Social Psychology 28: 441–56.
Kang, O., D. Rubin, and L. Pickering. 2010. 'Suprasegmental measures of accentedness and judgments of language learner proficiency in oral English,' The Modern Language Journal 94: 554–66.
Kang, O., R. I. Thomson, and M. Moran. 2018a. 'Empirical approaches to measuring the intelligibility of different varieties of English in predicting listener comprehension,' Language Learning 68: 115–46.
Kang, O., R. Thomson, and M. Moran. 2018b. 'The effects of international accents and shared first language on listening comprehension tests,' TESOL Quarterly. doi: 10.1002/tesq.463.
Kormos, J. and M. Denes. 2004. 'Exploring measures and perceptions of fluency in the speech of second language learners,' System 32: 145–64.
Li, D. C. S. 2009. 'Researching non-native speakers' views toward intelligibility and identity: Bridging the gap between moral high grounds and down-to-earth concerns' in F. Sharifian (ed.): English as an International Language: Perspectives and Pedagogical Issues. Multilingual Matters.
Linacre, J. M. and B. D. Wright. 2002. 'Construction of measures from many-facet data,' Journal of Applied Measurement 3: 484–509.
Lippi-Green, R. 2012. English with an Accent: Language, Ideology and Discrimination in the United States. Routledge.
Lunz, M. E. and J. A. Stahl. 1990. 'Judge consistency and severity across grading periods,' Evaluation and the Health Professions 13: 425–44.
Magen, H. 1998. 'The perception of foreign-accented speech,' Journal of Phonetics 26: 381–400.
Major, R. C., S. F. Fitzmaurice, F. Bunta, and C. Balasubramanian. 2002. 'The effects of nonnative accents on listening comprehension: Implications for ESL assessment,' TESOL Quarterly 36: 173–90.
McNamara, T. 1996. Measuring Second Language Performance. Addison Wesley Longman.
Munro, M. and T. Derwing. 1995. 'Processing time, accent, and comprehensibility in the perception of native and foreign-accented speech,' Language and Speech 38: 289–306.
Munro, M. J. and T. M. Derwing. 2006. 'The functional load principle in ESL pronunciation instruction: An exploratory study,' System 34: 520–31.
Myford, C. M. and E. W. Wolfe. 2004. 'Detecting and measuring rater effects using many-facet Rasch measurement: Part II,' Journal of Applied Measurement 5: 189–227.
Nakagawa, S. and H. Schielzeth. 2013. 'A general and simple method for obtaining R2 from generalized linear mixed-effects models,' Methods in Ecology and Evolution 4: 133–42.
Nelson, C. L. 2011. Intelligibility in World Englishes: Theory and Application. Routledge.
Nye, P. W. and J. H. Gaitenby. 1974. 'The intelligibility of synthetic, monosyllabic words in short, syntactically normal sentences,' Haskins Laboratories Status Report on Speech Research SR-37/38, pp. 169–90.
Ockey, G. J. and R. French. 2016. 'From one to multiple accents on a test of L2 listening comprehension,' Applied Linguistics 37: 693–715.
Ockey, G. J., S. Papageorgiou, and R. French. 2016. 'Effects of strength of accent on an L2 interactive lecture listening comprehension test,' International Journal of Listening 30: 84–98.
Picheny, M. A., N. I. Durlach, and L. D. Braida. 1985. 'Speaking clearly for the hard of hearing: Intelligibility differences between clear and conversational speech,' Journal of Speech and Hearing Research 28: 96–103.
Pickering, L. 2001. 'The role of tone choice in improving ITA communication in the classroom,' TESOL Quarterly 35: 233–55.
Rubin, D. L. 1992. 'Nonlanguage factors affecting undergraduates' judgments of non-native English-speaking teaching assistants,' Research in Higher Education 33: 511–31.
Smith, L. and C. Nelson. 1985. 'International intelligibility of English: Directions and resources,' World Englishes 4: 333–42.
Tavakoli, P. and P. Skehan. 2005. 'Strategic planning, task structure, and performance testing' in R. Ellis (ed.): Planning and Task Performance in a Second Language. John Benjamins.
Taylor, L. 2006. 'The changing landscape of English: Implications for language assessment,' English Language Teaching Journal 60: 51–60.
Taylor, L. and A. Geranpayeh. 2011. 'Assessing listening for academic purposes: Defining and operationalizing the test construct,' Journal of English for Academic Purposes 10: 89–101.
Thomson, R. I. 2015. 'Fluency' in M. Reed and J. Levis (eds): The Handbook of English Pronunciation. Wiley.
TOEFL iBT Test Framework and Test Development. 2010. ETS TOEFL. https://www.ets.org/s/toefl/pdf/toefl_ibt_research_insight.pdf.
Wennerstrom, A. 2000. 'The role of intonation in second language fluency' in H. Riggenbach (ed.): Perspectives on Fluency. University of Michigan.
Yano, Y. 2001. 'World Englishes in 2000 and beyond,' World Englishes 20: 119–31.
NOTES ON CONTRIBUTORS
Okim Kang is an Associate Professor in the program of applied linguistics at Northern
Arizona University, Flagstaff, AZ, USA. Her research interests are speech production
and perception, L2 pronunciation and intelligibility, L2 oral assessment and testing,
automated scoring and speech recognition, world Englishes, and language attitude.
She has recently published two edited books: The Routledge Handbook of Contemporary
English Pronunciation and Assessment in Second Language Pronunciation. Address for corres-
pondence: Okim Kang, Department of English, Northern Arizona University, Liberal Arts
Building 18, Room 140, Flagstaff, AZ 86011-6032, USA. <okim.kang@nau.edu>.

Ron I. Thomson is a Professor of Applied Linguistics at Brock University. His research
focuses on the development of L2 pronunciation and oral fluency. He is also interested
in how computer-mediated instruction can facilitate easier and more rapid develop-
ment of L2 speech perception and production, and has developed
www.englishaccentcoach.com, a freely available High Variability Pronunciation
Training (HVPT) app for L2
English learners.

Meghan Moran is an Instructor in the English Department at Northern Arizona
University, Flagstaff, AZ, USA. Her research interests include speech production and
perception, L2 pronunciation and intelligibility, language planning and policy, lan-
guage education policy, and linguistic discrimination. Meghan has recently co-authored
studies with Okim Kang and Ron I. Thomson on second language intelligibility and the
inclusion of accented varieties of English in high-stakes assessment, which can be found
in TESOL Quarterly and Language Learning.
