To appear in: Wayland, R., ed., Second language speech learning.

Cambridge, UK: Cambridge University Press. ISBN 9781108840637

Chapter 1

The revised Speech Learning Model (SLM-r)


James E. Flege and Ocke-Schwen Bohn

Here we present the revised Speech Learning Model (SLM-r), an individual differences
model which aims to account for how phonetic systems reorganize over the life span
in response to the phonetic input received during naturalistic second language (L2)
learning. We first review research leading to the formulation of the Speech Learning
Model, or SLM (Flege, 1995), before presenting a synthesis of that model and then its
revision. The SLM-r proposes that the mechanisms and processes needed for native
language (L1) acquisition remain accessible for use in L2 learning, without change or
exception, across the life span. By hypothesis, the formation or non-formation of new
phonetic categories for L2 sounds will depend on the precision of L1 categories at the
time L2 learning begins, the perceived phonetic dissimilarity of an L2 sound from the
closest L1 sound, and the quantity and quality of L2 input that has been received.
According to the SLM-r, the phonetic categories making up the L1 and L2 phonetic
subsystems interact with one another dynamically and are updated whenever the
statistical properties of the input distributions defining L1, L2, and composite L1-L2
categories (diaphones) change.

Like its predecessor, the revised Speech Learning Model (SLM-r) focuses on the learning of L2
vowels and consonants (or “sounds”, for short) across the life span. To define the context in which
the original SLM (Flege, 1995) developed, we begin by presenting some key studies carried out
before 1995. After summarizing the SLM with clarification of some key points, we present the
SLM-r.

The primary aim of the SLM-r differs from that of its predecessor, which was to “account for age-
related limits on the ability to produce L2 vowels and consonants in a native-like fashion” (Flege,
1995, p. 237). The SLM focused on differences between groups of individuals who began learning
The SLM revised, 2

an L2 before vs. after the close of a supposed critical period (CP) for speech learning (Lenneberg,
1967). Closure of the CP was regarded as an undesired consequence of normal neurocognitive
maturation that arose from diminished cerebral plasticity and a reduced ability to exploit L2 speech
input. The SLM-r offers an account for differences between “early” and “late” learners, but its
primary aim is to provide a better understanding of how the phonetic systems of individuals
reorganize over the life span in response to the phonetic input received during naturalistic L2
learning.

1. Work prior to 1995

A phonemic level of analysis dominated early L2 research. Bloomfield (1933, p. 79) posited that
because monolinguals learn to respond only to distinctive features, they can “ignore the rest of the
gross acoustic mass that reaches [their] ears”. Hockett (1958, p. 24) defined the phonological
system of a language as “not so much a set of sounds as ... a network of differences between
sounds”. Trubetskoy (1939) proposed that the phonological system of the native language (L1)
acts like a “sieve” that passes only phonic information in the production of L2 words that is needed
to distinguish words found in the L1. This approach shifted attention away from the language-
specific phonetic details of the L1 to which children attune slowly during infancy and childhood
and it implied that such details might be inaccessible to individuals who learn the same language
as an L2. One dissenting voice was that of Brière (1966), who maintained that the relative ease or
difficulty of learning specific L2 sounds could only be predicted through “exhaustive” analyses of
phonetic details (p. 795).

The aim of the Contrastive Analysis (CA) approach was to identify learning problems that would
need to be addressed through instruction in the foreign language classroom. Its general prediction
was that L2 phonemes that do not have a counterpart in the L1 would be difficult to learn whereas
those having an L1 counterpart would be relatively easy to learn. The CA approach assumed that
pronunciation errors observed in L2 speech were the result of faulty articulation (i.e., production),
not the result of incorrect targets resulting from faulty perception. Just as importantly, the CA
approach ignored the fact that the “same” sound found in two languages may differ greatly at the
phonetic level.

Another problem for the CA approach was that allophonic distributions of the “same” phonemes
found in two languages often differ cross-linguistically, making point-by-point comparisons of
phonemes difficult or meaningless (Kohler, 1981). The phonemes in a contrastive analysis were
defined primarily in terms of a static articulatory description of a single canonical variant. This
ignored the fact that an important part of L1 acquisition is the integration of conditioned variants
of a phoneme (Gupta & Dell, 1999; Song, Shattuck-Hufnagel, & Demuth, 2015). In addition, the
CA approach tacitly assumed that L2 learners make errors even after having received adequate
input and that knowing how the L1 is learned is irrelevant for an understanding of L2 speech
learning.

The one-time, one-size-fits-all CA approach soon fell from favor. As Lado had already noted in
1957, not all individual speakers of a single L1 make the same errors when speaking the same L2.
Flege and Port (1981) showed that the distinctive features needed to distinguish L1 phonemes
cannot be recombined freely to produce an L2 sound that is not present in the L1. Most importantly,
as noted in 1953 by Weinreich (pp. 83-110), the nature and extent of the mutual “phonic
interference” between the sounds in a bilingual’s two languages depend, in addition to
phonological differences, on factors such as language dominance, demography (e.g., ethnicity,
gender, age, etc.), years of L2 use, and the domains in which the L1 and L2 are used (see also
Grosjean, 1998).

In the 1970s research began examining purely phonetic aspects of L2 segmental production and
perception. Much of this early work focused on the voice onset time (VOT) dimension in the
production and perception of word-initial English stops by native Spanish speakers. For example,
Elman, Diehl, and Buchwald (1977) examined the identification of naturally produced consonant-
vowel syllables initiated by stops having VOT values in the “lead”, “short-lag”, or “long-lag”
ranges. Spanish and English monolinguals labeled stops having short-lag VOT as /p/ and /b/,
respectively. The Spanish-English bilinguals who participated were asked to label the same stimuli
in two “language sets” intended to induce a Spanish or English perceptual processing mode. The
effect of the language set manipulation was small for most participants, but five of the 31 bilingual
participants, referred to as “strong” bilinguals, were far more likely to identify short-lag stops as
/b/ in the English set than in the Spanish set. These five participants seem to have been Early
learners (R. Diehl, personal communication, June 3, 1990).

Many accepted the hypothesis by Lenneberg (1967) that a critical period (CP) for speech
learning closes at about the age of 13 years as the result of normal neurological maturation.
Lenneberg (1967, p. 176) also suggested that following the close of a CP, L2 learners cannot make
“automatic use” of L2 input from “mere exposure” to the input as children do when learning their
L1.

To evaluate the “automatic use” hypothesis, Flege and Hammond (1982) recruited native English
(NE) university students in Florida. All of them had been previously exposed to Spanish-accented
English and, in addition, were taking a Spanish class taught by a native Spanish speaker who spoke
English with a strong Spanish accent. The students were asked to read English sentences
containing two variable test words (The X is on the Y) with a feigned Spanish accent. The amount
of prior exposure to Spanish-accented English the students had received was estimated by counting
the number of expected “Spanish accent substitutions” (e.g., [vel] or [veɪl] for bail, [big] for big)
they produced in the English test words. VOT was measured in additional test words beginning in
/t/.

Members of both the Higher- and Lower-exposure groups shortened VOT in the direction of
Spanish, but only the Higher-exposure group produced significantly shorter VOT values than did
the members of a control group who read the sentences without special instruction. Importantly,
the students who produced significantly shortened VOT values when speaking English with a
feigned Spanish accent did not accomplish this by using a short-lag English /d/ to produce the /t/-
initial test words.

Flege and Hammond (1982) concluded that monolingual adults are able to access cross-language
phonetic differences through mere exposure to speech after the supposed closure of a CP for speech
learning. The results indicated that NE monolinguals with substantial exposure to
Spanish-accented English could not only detect phonetic differences between standard and
Spanish-accented English but could also store that information in long-term memory and use it to guide
production (see also Reiterer, Hu, Sumathi, & Singh, 2013).

Williams (1977) found that the “phoneme boundary” between stops such as /b/ and /p/ occurred at
significantly longer VOT values for adult English than Spanish monolinguals. Flege and Eefting
(1986) reported that this also held true for monolingual children. They also reported that, within
languages, phoneme boundaries occurred at longer average VOT values for adults than for
8- to 9-year-old children. Indeed, the phoneme boundaries of NE 17-year-olds occurred at significantly
shorter VOT values than those of NE adults, suggesting that attunement to L1 phonetic-level
details may continue into the late teenage years.

Not surprisingly, Flege and Eefting (1986) observed cross-language production differences that
mirrored the above-mentioned perception differences in phoneme boundaries. Both monolingual
NE adults and children produced /p t k/ with longer VOT values than age-matched native Spanish
(NS) monolinguals and, within both languages, adults produced longer VOT values than children
did.

Flege (1991) compared VOT in stops produced by groups of NS speakers differing in age of arrival
in the United States (means=2 vs. 20 years). These Early and Late learners also differed in percent
English use (means=82% vs. 66%). The Early learners produced English stops with native-like
VOT values, both individually and as a group. The average values obtained for Late learners, on
the other hand, were intermediate to the values observed for Spanish and English monolinguals.

The results of a speech imitation study (Flege & Eefting, 1988) suggested that NS Early learners
of English can form new long-lag phonetic categories for English /p t k/. This finding led Flege
(1991) to suggest that the accurate production of VOT in English stops by NS Early but not Late
learners arose from the inability of the Late learners to form new phonetic categories. Had this
been true, it would have provided a solid empirical basis for the CP proposed by Lenneberg (1967).
However, the interpretation suggested by Flege (1991) was problematic for two reasons. First, the
VOT values produced by individual NS Late learners ranged from Spanish-like to English-like. If
a CP exists, it should affect everyone in much the same way. Second, the results for the Late
learners may have reflected learning in progress rather than the performance that might have been
evident had they received as much L2 phonetic input as monolingual NE children need to achieve
an adult-like production of VOT.

For this chapter, we estimated years of full-time equivalent (FTE) English input that had been
received by the NS participants in Flege (1991). These values were calculated by multiplying years
of residence in the United States by proportion of English use (self-reported by each participant as
a percentage). The mean estimated FTE years of English input was much higher for the Early than
Late learners (means=17.2 vs. 9.2 FTE years). Thus, if category formation is a slow process
requiring input that gradually accumulates over many years of daily use, the Early-Late difference
observed by Flege (1991) might simply have been the result of input differences, not the loss of
capacity by the Late learners to form new phonetic categories.
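
The FTE computation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis script used for this chapter: the function name `fte_years`, its parameter names, and the two example participants are hypothetical, and the values shown are not data from Flege (1991).

```python
def fte_years(years_of_residence: float, percent_l2_use: float) -> float:
    """Return full-time equivalent (FTE) years of L2 input.

    FTE years = years of residence x proportion of L2 use, where the
    proportion is the self-reported percentage divided by 100.
    """
    if years_of_residence < 0:
        raise ValueError("years_of_residence must be non-negative")
    if not 0 <= percent_l2_use <= 100:
        raise ValueError("percent_l2_use must lie between 0 and 100")
    return years_of_residence * percent_l2_use / 100.0


# Hypothetical participants (illustrative values only, not data from Flege, 1991).
# Both have the same length of residence but report different L2 use.
participant_a = fte_years(years_of_residence=20.0, percent_l2_use=80.0)
participant_b = fte_years(years_of_residence=20.0, percent_l2_use=40.0)

print(participant_a)  # 16.0
print(participant_b)  # 8.0
```

In this hypothetical case, two learners with identical years of residence differ by a factor of two in estimated input, which is the kind of difference that residence alone would conceal.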

FTE years of L2 input may be a somewhat better estimate of quantity of L2 input than length of residence (LOR) alone,
but it says nothing regarding the quality of L2 input. Early learners acculturate more rapidly
following immigration than Late learners do (Cheung, Chudek, & Heine, 2011; Jia & Aaronson,
2003). Acculturation involves the creation of social contact with native speakers of the target L2.
This means that the NS Late learners tested by Flege (1991) were likely to have been exposed
more often to Spanish-accented English than the Early learners were, and so they may have been
exposed to shorter VOT values in English words overall than were the Early learners and NE
monolinguals.

The effect of foreign-accented input was observed in research examining NS Early learners who
learned English in an environment where Spanish-accented English was the rule rather than the
exception. The mean VOT values obtained by Flege and Eefting (1987) for Early learners in Puerto
Rico were intermediate in value, in both production and perception, to values obtained for English
and Spanish monolinguals, and so were similar to the values obtained for NS Late learners of
English in Texas. The difference between the Early learners tested in Puerto Rico and Texas
suggested that the quality of L2 input may matter more than the age of first exposure to an L2.

Research in the period we are considering also showed that the magnitude of cross-language
phonetic differences matters. Flege (1987) examined the production of French vowels by NE
speakers who had lived in France for an average of 10 years. Unlike French, English has no /y/
and its /u/ differs acoustically from the /u/ of French. The three vowels of interest (French /y/ and
/u/, English /u/) differ primarily in F2 frequency, and NE speakers generally hear the French /y/ as
their English /u/ (Levy, 2009a). Flege (1987) hypothesized that NE speakers would be able to
produce the “new” French vowel, /y/, more accurately than the “similar” French /u/. In fact, the
difference between the NE speakers and French monolinguals in terms of the critical acoustic
phonetic dimension, F2 frequency, was non-significant for /y/ but significant for /u/.

Flege (1992) further evaluated the new-similar distinction by examining the production of English
vowels by native Dutch (ND) adults. The English vowel in hit (/ɪ/) was classified as “identical” to
a Dutch vowel based on previously published acoustic data and on reports that the auditory
differences between English /ɪ/ and the closest Dutch vowel are likely to go undetected by native
Dutch-speaking listeners. The English vowel in hat (/æ/) was classified as “new” because it
occupies a portion of vowel space not exploited by Dutch and because earlier research suggested
that /æ/ is learnable. The vowels in heat, hoot, hot and hut (/i/, /u/, /ɑ/, /ʌ/) were each categorized
as “similar” to a Dutch vowel.

The results obtained by Flege (1992) for the “new” vowel in hat supported the view that ND Late
learners can form new phonetic categories for certain L2 vowels. However, the results obtained
for English vowels classified as “similar” to a Dutch vowel did not support the hypothesis that
native vs. non-native differences persist for L2 vowels that are similar but not identical to an L1
vowel. Two English vowels classified as similar were produced quite well but two others were
produced poorly. Flege (1992, p. 162) concluded that no principled method existed for
distinguishing “new” from “similar” L2 sounds and so the trichotomy “new-similar-identical” was
not included in the SLM (Flege, 1995).

Flege, Munro, and Skelton (1992) evaluated the effect of L2 experience by recruiting two groups
each of native Mandarin (NM) and Spanish (NS) speakers. All had begun learning English as
adults, but the same-language groups differed according to LOR in the United States (Mandarin
means=0.9 vs. 5.5 years; Spanish means=0.4 vs. 9.0 years). The study focused on the production
of word-final English /t/ and /d/ because these stops are not found in the final position of Mandarin
and Spanish words. The authors hypothesized that the non-natives with a relatively long residence
in the United States would treat the word-final stops as “new” sounds and so produce them
accurately.

NE-speaking listeners were more successful overall in identifying the non-native speakers’
productions of /t/ than /d/ (means=82% vs. 65% correct). Acoustic analyses revealed that the NM
and NS speakers produced smaller acoustic phonetic differences between /t/ and /d/ (longer vowels
before /d/, higher F1 offset frequency for /d/, more closure voicing in /d/, longer closure for /t/)
than the NE speakers did. Stops produced by both “experienced” and “inexperienced” non-natives
were significantly less intelligible (means=68% for both groups) than stops produced by the NE
speakers. Within languages, the LOR-defined groups did not differ significantly. Of the 40 NM
and NS speakers tested, just six produced word-final stops that were as intelligible as the stops
produced by the NE speakers.

One possible explanation for the frequent errors in non-native speakers’ final stop productions that
were observed by Flege et al. (1992) is that adult learners of an L2 lack the capacity to learn new
forms of speech. Alternatively, the errors may have been the result of inadequate input.
Monolingual NE children need approximately five years of full-time English input in order to
produce /t/ and /d/ accurately in word-final position (e.g., Smith, 1979). The non-native speakers
designated as “experienced” had an average of just 4.2 FTE years of English input and were likely
to have often heard other non-natives produce the word-final English stops inaccurately.

The same two explanations might be applied to the findings of Flege and Davidian (1984) who
used a picture-naming task to elicit the production of /p t k/ and /b d ɡ/ in the final position of
English words. The participants were immigrants from China and Mexico (12 each) who had all
learned English as adults and had lived in Chicago for 4.2 years on average (range=0.2 to 7.5
years). Unlike members of the NE comparison group (n=12), these Late learners omitted
(means=2.3% vs. 3.4%), devoiced (means=29.5% vs. 43.0%) and spirantized (means=0.8% vs. 19.3%)
the word-final English stops. The differing frequency of error types observed for the two L1 groups
was readily understandable with reference to the inventory of word-final obstruents found in their
L1s, but overall, they produced only about half of the stops without error. All 24 participants were
enrolled in English as a Second Language classes at a local community college where they
certainly heard one another, and other immigrants outside the classroom, producing final English
stops with the same errors. At least some of them may have learned to accurately produce the
wrong phonetic “models”.

In summary, L2 speech research carried out prior to 1995 gradually began to focus on a phonetic
rather than a phonemic level of analysis. Language-specific phonetic differences between the L1
and L2 became the focus of speech production and perception research. The existing research made
clear that: (1) the L1 phonetic system “interferes” with L2 speech learning; (2) some L2 sounds
are learned better than others; (3) L2 sounds without an L1 counterpart might be learned more
effectively than those with an L1 counterpart; and (4) the quantity and quality of L2 input
that L2 learners receive may exert an important influence on phonetic-level learning. It appeared
possible that Early learners generally produce and perceive L2 sounds more effectively than Late
learners do because they, but not Late learners, might be able to form new phonetic categories for
L2 sounds. This inference was at odds, however, with evidence that Late learners can gain access
to L1-L2 phonetic differences, store the detected differences in long-term memory, and then use
the stored perceptual representations to guide articulation.

2. The Speech Learning Model (SLM)

Flege (1995) observed that at a time when “children's sensorimotor abilities are generally
improving, they seem to lose their ability to learn the vowels and consonants of an L2” (1995, p.
234). We now know that earlier is generally better than later for L2 acquisition, but only in the
long run. Adults outperform children in the early stages of naturalistic L2 acquisition, but adult-
child differences tend to recede over time until Early learners outperform Late learners (e.g., Jia,
Strange, Wu, Collado, & Guan, 2006; Snow & Hoefnagel-Höhle, 1979).

DeKeyser and Larson-Hall (2005) attributed the age-performance “cross-over” to age-related
cognitive changes. If applied to L2 speech learning, their hypothesis would mean that children
learn L1 speech implicitly through massive exposure to the sounds making up the L1 phonetic
inventory. Also by hypothesis, the efficacy of implicit learning mechanisms would be reduced
following the close of a critical period because L2 learners lose the ability to make “automatic”
use of input from “mere exposure” to the sounds making up the L2 phonetic inventory (Lenneberg,
1967, p. 176).

The ability to make effective use of ambient language phonetic input is the acknowledged
prerequisite for L1 speech acquisition (e.g., Kuhl, 2000). According to a “cognitive change”
hypothesis (DeKeyser & Larson-Hall, 2005), Late learners fare well in early stages of L2 learning
through use of explicit learning mechanisms, but such mechanisms are not well suited for the slow
process of attunement to the language-specific details defining L2 sounds and their differences
from L1 sounds. Early learners, on the other hand, learn L2 phonetic details well but slowly via
implicit learning mechanisms.

The SLM provided a way to understand the cross-over paradox without positing a loss of neural
plasticity or a change in the cognitive mechanisms needed for speech learning. As mentioned
earlier, research has shown (e.g., Flege & Hammond, 1982; see also Reiterer et al., 2013) that even
Late learners can gain access to the language-specific details defining L2 sounds. The SLM
proposed that L2 phonetic input is accessible and that L2 learners of all ages exploit the same
mechanisms and processes they used earlier for L1 speech learning, including the ability to create
new phonetic categories for certain L2 sounds based on the experienced distributions of tokens
defining those L2 sounds.

The SLM focused on the development of language-specific phonetic categories and the phonetic
realization rules used to implement those categories motorically. The model assumed a generic
three-level perception-production framework, illustrated in Figure 1, that envisages a flow of
information from a sensory-motor level to a phonetic category level to lexico-phonological
representations (see, e.g., Evans & Davis, 2015).

A pre-categorical, auditory level of processing is evident only in specific perceptual testing
conditions and is imperceptible to listeners (e.g., Werker & Logan, 1985), whereas the distinction
between the phonetic category and lexico-phonological levels is more readily evident. For
example, listeners can “hear” (i.e., perceive) a sound in the speech stream even when the sound
has been replaced by silence or noise, thereby removing any phonetic-level information (e.g.,
Samuel, 1981). Evidence that sounds are categorized at a phonetic level is provided by the fact
that monolingual listeners can recognize unfamiliar names heard for the first time.

Phonetic categories play two crucial roles. First, they define the articulatory goals used by
language-specific phonetic realization “rules” in producing speech (but see Best, 1995, for a different
perspective). More specifically, the realization rules “specify the amplitude and duration of
muscular contractions that position the speech articulators in space and time” (Flege, 1992, p. 165).
Second, phonetic categories are used to access segment-sized units of speech that, in turn, are used
to identify word candidates during lexical access.

Listeners are usually not consciously aware of phonetic categories as they process speech because
phonetic-level changes do not change meaning. However, language-specific phonetic categories
are sufficiently rich in detail that they permit the detection of a fluent speaker as non-native in as
little as 30 ms (Flege, 1998). Moreover, phonetic-level differences can be detected when listeners
focus attention on such differences (Best & Tyler, 2007; Pisoni, Aslin, Perey, & Hennessy, 1982).

2.1 Cross-language mapping

The SLM focused on sequential bilinguals who already possess a functioning phonetic system
when first exposed to an L2. For such individuals, L2 phonetic learning is influenced importantly
by the perceived relationships between the sounds making up the L2 phonetic inventory and those present in
the L1 phonetic inventory. The SLM proposed that L1 and L2 sounds are perceptually linked to
one another through a cognitive process, “interlingual identification”, which operates
automatically and subconsciously. When first exposed to the L2, learners interpret the “full range”
(Flege, 1995, p. 241) of L2 sounds they encounter on the phonetic surface of the L2 as being
instances, some better than others, of existing L1 phonetic categories.

The SLM did not specify how much L2 input learners will need in order to establish stable patterns
of interlingual identification. We speculate that the amount of exposure needed to do so may
depend on the complexity of various L2 sounds, operationalized by the sounds’ frequency of
occurrence in the world’s languages and the time needed by monolingual children to learn them.

2.2 Position-sensitive allophones

According to the SLM, the mapping of L2 to L1 sounds occurs at the level of position-sensitive
allophones, not phonemes. This design feature was based on the observation (Kohler, 1981) that
allophonic distributions of phonemes vary across languages and, within a single language,
allophones may differ greatly in their articulatory and acoustic specification. Moreover, the
relative importance of multiple acoustic cues to the categorization of a sound may differ according
to position (see, e.g., Dmitrieva, 2019, for English word-medial vs. word-final stops).

Research has shown that learning one position-sensitive allophone of an L2 phoneme does not
guarantee success in producing and perceiving other allophones of the same L2 phoneme (e.g.,
Mitterer, Reinisch, & McQueen, 2018; Mochizuki, 1981; Pisoni, Lively, & Logan, 1984; Rochet,
1995; Strange, 1992; Takagi, 1993). Iverson, Hazan and Bannister (2005) found that training native Japanese (NJ)
speakers to identify /r/ and /l/ in initial position increased identification accuracy for liquids in that
position but not for medial liquids or liquids in initial clusters. To take another example, the
presence of voiced and voiceless consonants in word-initial position in the L1 does not permit
learners to produce the “same” consonants in the word-final position of L2 words if such sounds
do not also appear in the final position of L1 words (e.g., Flege et al., 1992; Flege & Davidian,
1984; Flege & Wang, 1989).

2.3 Age of first exposure

The L1 phonetic system necessarily “interferes” with the learning of L2 sounds because sounds
encountered on the phonetic surface of an L2 “map onto”, that is, are perceptually linked to one or
more L1 sounds. The perceptual links created for various sound pairings may vary in rapidity and
consistency and may evolve as learners gain experience in the L2. Flege (1995, p. 263) suggested
that L2 learners may only “gradually discern” the existence of phonetic differences between an
L2 sound and the closest L1 sound and that when this happens “a phonetic category representation
may be established for the new L2 sound which is independent of representations established
previously for L1 sounds”.

By hypothesis, the likelihood of cross-language phonetic differences being discerned between
pairs of L1 and L2 sounds decreases as a function of the age of first exposure to the L2. This
change was attributed to increasing use of “higher order invariants” as age of first exposure to the
L2 increases. This was expected to make it increasingly difficult for L2 learners to “pick up”
detailed phonetic-level information regarding L2 speech sounds (Flege, 1995, p. 266). The SLM
age hypothesis was meant as an alternative to the Critical Period hypothesis, which claimed that
age-related effects in L2 speech learning arise from a reduction of neurocognitive plasticity.

2.4 L2 experience

The observation that L2 learners “gradually discern” L1-L2 phonetic differences implied changes
over time as a function of L2 experience. The term “experience” has been used in diverse ways.
In some early research, it was used to differentiate groups of individuals who had studied an L2 in
school from groups of participants who had not (e.g., Gottfried, 1984). A difference in schooling,
of course, might be expected to co-occur with differences in metalinguistic awareness (e.g., Levy
& Strange, 2008, p. 151). Flege (1995) used the term “experience” with reference to conversational
experience as was typical for L2 research at the time. More specifically for the SLM, the term
experience was meant to indicate the cumulative speech input learners have received while
communicating verbally in the L2, usually in face-to-face conversations.

Many researchers (including us) have used the variable “length of residence” (LOR) to index L2
experience because it can be readily obtained from a written questionnaire. Researchers have
reasonably supposed that, for example, a German who had lived in the United States for 10 years
would have heard and spoken English far more than a German who had lived there for just 1 year. It
later became evident that LOR can provide a misleading index of quantity of phonetic input
because it specifies only an interval of time, not what occurred during that interval. Not all
immigrants begin using their L2 immediately upon arriving in the host country (e.g., Flege, Munro,
& MacKay, 1995a, Table I) or use their L2 on a regular basis even after years of residence there
(Moyer, 2009, p. 162).

The results of Flege and Liu (2001) suggested that LOR may provide a useful estimate of quantity
of L2 input only for immigrants who have both the opportunity and the need to use their L2
regularly. Self-estimated percentage use of an L2 usually increases as LOR increases but the
relationship between the two is non-linear. Important individual exceptions exist due to the
circumstances of everyday life, for example, a rapid increase in L2 use when an immigrant marries
a native speaker of the L2 or a decrease in L2 use following marriage to a fellow L1 speaker.

Even more importantly, the LOR variable provides no insight into the quality of L2 input. The
importance of quality of input on speech perception can be seen in research with monolingual
children. Many children who learn English as an L1 hear a single dialect of English. Such children
recognize words in their native language less efficiently when hearing an unfamiliar dialect of their
L1 or foreign-accented English (e.g., Bent, 2014; Buckler, Oczak-Arsic, Siddiqui, & Johnson,
2017).

Perceiving speech optimally requires adapting phonetic categories and their real time use in
recognizing and producing words to what has been heard and seen in the past, even the recent past.
It is unknown at present how much L2 input is needed to form phonetic categories in an L2 and
optimally adapt them to everyday use. This may depend, at least in part, on the uniformity of the
L2 speech input that is received.

At least some children learning an L1 are exposed to a single variety of their native language but
learners of an L2 rarely if ever get uniform input. It is usually impossible for L2 learners, at least
immigrants to a predominantly L2-speaking country, to avoid using their L2 in “mixed
conversations”. A mixed conversation is one in which L2 learners converse with one or more
monolingual native speakers of the target L2 and at least one other non-native speaker. The L2
must be used by all participants due to the presence of the monolingual native speaker(s). In such
a context an L2 learner is likely to hear L2 spoken with a foreign accent, often their own kind of
foreign accent, by the other non-native speaker(s) present.

2.5 Categories, not contrasts

The SLM focused on individual sounds in the L1 and L2 phonetic subsystems of L2 learners rather
than on contrasts between pairs of sounds. The focus on individual, position-sensitive allophones
was based in part on the assumption that listeners match the properties of an incoming sound (e.g.,
the [θ] in think) to a representation stored in long-term memory because in real time speech
processing it takes too long to eliminate alternative candidates (e.g., “not [f]”, “not [s]”, “not [v]”,
etc.). The categorization of speech sounds is considered the basis of speech perception in
monolinguals (Holt & Lotto, 2010) and, in our view, this holds true for the perception of L2 sounds
given that L1 and L2 sounds are perceptually linked via the mechanism of interlingual
identification.

Categorization and identification are not the same thing (Nosofsky, 1986; Smits, Sereno, &
Jongman, 2006). A stimulus sound is categorized by computing its relative distance along multiple
dimensions to multiple long-term memory representations, requiring generalizations across
discriminably distinct tokens within categories. The identification of a sound requires that a
decision be made regarding a sound’s unique identity and requires discrimination between
categories.
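
The distance-based view of categorization described above can be illustrated with a minimal sketch. It is not a model taken from this chapter: the prototype values, cue weights, scaling constants, and the sensitivity parameter are all hypothetical, and prototypes stand in for stored representations as a simplification. The sketch computes a weighted distance from a token to each stored representation and converts the distances into response probabilities with a relative-similarity rule.

```python
import math

# Hypothetical prototypes in a 2-D cue space: (F1 in Hz, duration in ms).
# Values are illustrative only, not measurements reported in the chapter.
PROTOTYPES = {
    "i": (300.0, 120.0),
    "I": (430.0, 80.0),
}

# Assumed cue weights: the spectral cue counts more than duration.
WEIGHTS = (0.8, 0.2)
# Assumed scaling constants so that Hz and ms are roughly comparable.
SCALES = (100.0, 40.0)


def weighted_distance(token, prototype):
    """Weighted Euclidean distance between a token and a prototype."""
    return math.sqrt(sum(
        w * ((t - p) / s) ** 2
        for t, p, w, s in zip(token, prototype, WEIGHTS, SCALES)
    ))


def categorize(token, sensitivity=2.0):
    """Return (best_label, probabilities) for a token.

    Similarity decays exponentially with distance; each category's
    probability is its similarity divided by the summed similarity.
    """
    sims = {
        label: math.exp(-sensitivity * weighted_distance(token, proto))
        for label, proto in PROTOTYPES.items()
    }
    total = sum(sims.values())
    probs = {label: s / total for label, s in sims.items()}
    return max(probs, key=probs.get), probs


if __name__ == "__main__":
    label, probs = categorize((310.0, 110.0))
    print(label, probs)
```

Note that the token is compared with every stored representation at once; no pairwise "X versus not-X" decision is made, which is the distinction the text draws between categorization and identification.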

L2 and cross-language research has demonstrated the methodological importance of the distinction
between categorization and identification tasks. For example, Bohn and Flege (1993) assessed how
Spanish monolinguals perceived English stops in a two-alternative forced-choice identification
task. The stimuli were multiple natural tokens of Spanish stops (prevoiced /d/, short-lag /t/)
and English stops (short-lag /d/, long-lag /t/). The Spanish monolinguals consistently identified the
long-lag English /t/ tokens as “t” even though they surely did not have an English /t/ category.
Instead, they appear to have made use of an “X-not-X” strategy. Given the need to choose one of
two response alternatives, “d” or “t”, they selected the “t” response for long-lag stimuli because
these stimuli were clearly not instances of their Spanish /d/ category.

Many researchers have used N-alternative forced-choice tests, attempting to offer every reasonably
possible response alternative. For example, MacKay, Meador and Flege (2010) examined the
perception of English consonants by native Italian speakers. In addition to a written label for the
target sounds (e.g., “s” for word-initial /s/ tokens), four other response alternatives, selected on
the basis of confusion matrices from earlier research, were offered. While this reduces the problem, it does not
guarantee that the choices include what every individual listener perceives. Moreover, the
proliferation of response alternatives may be a source of confusion, especially for non-natives
whose spelling-to-sound correspondences are not native-like.

The methodology used to assess perception is crucial for the evaluation of how “native like” L2
learners are judged to be. Iverson and Evans (2007, p. 2852) noted, for example, that a two-
alternative forced-choice (2AFC) test can reflect perceptual sensitivity as much as categorization.
Díaz, Mitterer, Broersma, and Sebastián-Gallés (2012) evaluated L2 speech perception by
determining the percentage of L2 learners who obtained scores falling within the native-speaker
range. More of them met this criterion when L2 perception was tested using a categorization task
than in an identification task or a lexical decision task. For these authors, a categorization task provides
information regarding an “acoustic phonetic” level of analysis whereas the latter two tasks “involve
lexical processes” (p. 680).

The responses obtained in a 2AFC test are often used to compute phoneme “boundaries”.
Escudero, Sisinni, and Grimaldi (2014, p. 1583) proposed that in order to perceive L2 vowels more
accurately, L2 learners may in certain instances need to “shift the boundary” between L1
categories. As we see it, a shift in boundary locations, if observed, is epiphenomenal, the result of
learning-induced changes in the phonetic categories themselves. Boundary locations may vary as
a function of experimental design. For example, Benders, Escudero, and Sjerps (2012) showed
that phonetic context and stimulus range effects were smaller when listeners were offered five
rather than just two response alternatives.
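
To make concrete what computing a boundary from 2AFC responses can involve, here is a minimal sketch under stated assumptions. The function name, the stimulus steps, and the response proportions are hypothetical, and linear interpolation is only one simple way to locate the 50% crossover (fitting a logistic or probit function is a common alternative). The sketch also shows why a boundary is a derived quantity: it moves whenever the underlying response proportions move.

```python
def boundary_5050(steps, prop_t):
    """Estimate the 50% crossover point by linear interpolation.

    steps  : stimulus values in ascending order (e.g., VOT in ms)
    prop_t : proportion of "t" responses at each step
    Returns the interpolated stimulus value, or None if the
    proportions never cross 0.5.
    """
    for i in range(len(steps) - 1):
        x0, x1 = steps[i], steps[i + 1]
        p0, p1 = prop_t[i], prop_t[i + 1]
        # A crossing occurs when 0.5 lies between p0 and p1.
        if p0 != p1 and (p0 - 0.5) * (p1 - 0.5) <= 0:
            return x0 + (0.5 - p0) * (x1 - x0) / (p1 - p0)
    return None


if __name__ == "__main__":
    # Hypothetical identification data along a VOT continuum.
    vot_ms = [0, 10, 20, 30, 40]
    before = [0.0, 0.1, 0.4, 0.9, 1.0]
    after = [0.0, 0.0, 0.1, 0.6, 1.0]
    print(boundary_5050(vot_ms, before))
    print(boundary_5050(vot_ms, after))
```

In this sketch nothing called a "boundary" is stored or adjusted; the estimate simply follows from the response proportions, in line with the view that boundary shifts are epiphenomenal.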

The perception of L2 sounds has often been evaluated by examining how accurately pairs of L2
sounds can be discriminated. Best and Tyler (2007) summarize research showing that
monolinguals tested in a laboratory setting usually discriminate two foreign sounds better if the
two sounds map onto distinct L1 sounds (a 2-to-2 mapping pattern) than if they map onto a single
L1 sound (a 2-to-1 mapping pattern). If the same monolinguals were later to learn the language
from which the foreign sounds examined in a laboratory experiment were drawn, discrimination
of the two foreign (now L2) sounds might be expected to improve over time. This could be
attributed to changes in cross-language mapping patterns (Best & Tyler, 2007; Tyler, 2019).
Within the SLM framework, changes in cross-language mapping are important because such
changes may lead to the formation or modification of phonetic categories. On this view, it is the
use of distinct phonetic categories that results in improved discrimination. For pairs of L2 sounds
that previously exhibited a 2-to-1 mapping pattern, this will require the formation of a new
phonetic category.

2.6 L1 phonetic development is slow

The SLM proposed that the mechanisms and processes used to establish the elements making up
the L1 phonetic system, including the ability to form phonetic categories, remain intact and
available for L2 learning.

Developmental research indicates that L1 phonetic categories develop slowly. Infants begin
attuning to the phonetic categories of what will become their L1 even before they have a lexicon,
perceptually grouping sets of acoustically similar sounds into “equivalence classes” (Kuhl, 1983).
The language-specific phonetic categories that guide production and perception evolve from
equivalence classes and continue to develop long after children have established a phonemic
inventory for their L1 in their first few years of life (see e.g., Hazan & Barrett, 2000; Lee,
Potamianos, & Narayanan, 1999).

Phonetic categories have traditionally been described as points in an n-dimensional perceptual
space. Children’s attunement to the phonetic categories of their L1 is based on long-term exposure
to statistically defined distributions of sound tokens. Each sound
token is processed as an instance of a category, leaving a trace in episodic memory (e.g., Hintzman,
1986).

The effect of L1 phonetic category development can be seen in research examining children’s
ability to categorize L1 sounds, which continues to improve at least until the age of 15 years (see,
e.g., Johnson, 2000; Markham & Hazan, 2004; Neuman & Hochberg, 1983). As monolingual
children are exposed to an ever wider range of distributions of variant realizations of an L1
category, that is, more broadly tuned distributions, they become better able to recognize words
spoken in an unfamiliar L1 dialect (Nathan, Wells, & Donlan, 1998; Bent, 2018) and to recognize
L1 words spoken with a foreign accent (Bent, 2014; Bent & Holt, 2018). The end point of L1
phonetic category development has not yet been established but it surely extends beyond the age
of 7 years (Bent, 2014; Newton & Ridgway, 2015). Indeed, there is evidence that the fine-tuning
for L1 categories extends over the entire life span (e.g., Harrington, Palethorpe, & Watson, 2000).

As children mature, they gradually produce L1 sounds with less variability and reduce the overlap
in their productions of adjacent L1 categories (Lee et al., 1999). The L1 phonetic categories that
monolingual children develop are multi-dimensional, cue-weighted representations of sound
classes residing in long-term memory. Each category is mediated by a narrow range of “best
exemplars” (or “prototypes”) that specify the ideal weighting of a set of independent and
continuously varying properties (perceptual cues).

Prototypes define for listeners how realizations of a category ought to sound when produced by
themselves (self-hearing) and by others. They provide a reference point that listeners can use when
asked, in a laboratory experiment, to rate the members of an array of stimuli for category goodness
(e.g., Miller, 1994; Smits et al., 2006). The use of prototypes enables listeners to reliably choose the
“best exemplar” of a particular vowel category from an array of stimuli (e.g., Iverson & Evans,
2007, 2009; Johnson, Flemming, & Wright, 1993) and to detect “distortion” or “foreign accent”
in productions of a specific sound they have been asked to evaluate auditorily (Flege, 1992, p. 170
ff; Lengeris & Hazan, 2010).

Phonetic category prototypes also play a role in the categorization of speech sounds. Iverson and
Evans (2007) found, for example, that non-native speakers’ accuracy in categorizing naturally
produced English vowels varied as a function of how closely their perceptual prototypes for
English vowels resembled those of NE speakers.

2.7 L2 phonetic category formation

The SLM proposed that L2 learners of any age, like infants exposed to what will become their L1,
form auditory equivalence classes derived from statistical properties of the input distributions to
which they have been exposed while using the L2 (e.g., Anderson, Morgan, & White, 2003; Maye,
Werker & Gerken, 2002). In L1 monolinguals, the equivalence classes evolve into language-
specific phonetic categories without interference from another phonetic system. For individuals
learning an L2, on the other hand, the formation and elaboration of new phonetic categories entails
disrupting L2-to-L1 perceptual links as cross-language phonetic differences are discovered
(“discerned”).

The specification of how multiple cues are weighted for an L2 phonetic category is language-
specific and so it must be learned. For example, NE-speaking listeners use both spectral cues (i.e.,
formant frequencies) and duration to categorize English vowels (Flege, Bohn, & Yang, 1997).
However, the frequency (spectral) cues are more important for NE-speaking listeners than the
temporal cues are because temporal cues are not reliably present, or are substantially reduced,
when English is spoken rapidly. In Swedish, on the other hand, duration is a more important cue
to the categorization of certain vowels than are spectral cues (McAllister, Flege, & Piske, 2003).

L2 category formation is understood less well than L1 category formation, but it seems reasonable
to think that L2 category formation takes at least as long as L1 category formation does. This is
because the distributions of sounds defining each L2 category are likely to be less uniform than
the distributions encountered by monolingual children. L2 learners, especially adults, are likely to
be exposed to diverse dialects of the target L2 as well as to multiple foreign-accented renditions
of the target L2 (Bohn & Bundgaard-Nielsen, 2009).

2.8 Factors determining L2 category formation

According to the SLM, L2 learners of all ages retain the capacity to form new phonetic categories
but will not do so for all L2 sounds differing auditorily from the closest L1 sound. By hypothesis,
a new phonetic category will be formed for an L2 sound when learners discover (discern) phonetic
differences between the L2 sound and the L1 sound(s) that is (are) closest in phonetic space to it.
The SLM proposed that discerning L1-L2 phonetic differences, and thus the likelihood of a new
category being formed for an L2 sound, depends on two factors. First, the greater the degree of perceived
cross-language phonetic dissimilarity between an L2 sound and the closest L1 sound(s),
the easier it will be for L2 learners to discern cross-language phonetic differences. Second, the
older L2 learners are when they are first exposed to an L2, the less likely they will be to discern
cross-language phonetic differences.

No consensus existed in 1995 regarding the best way to quantify degree of perceived cross-
language dissimilarity, nor did an objective criterion exist for how great a perceived cross-
language phonetic difference must be in order to initiate the process of category formation. Valid
and reliable measures of cross-language dissimilarity are, of course, essential for testing SLM
predictions. We will return to this issue again later in the chapter.

2.9 Few if any perfect learners

According to the SLM, the categories that L2 learners form for certain L2 sounds will likely never
be identical to those of native speakers but this, in itself, does not demonstrate a loss or diminution
of the capacity for learning speech. New L2 phonetic categories are expected to differ from
monolinguals’ if L2 learners have received less phonetic input than monolingual children need to
reach adult-like levels of performance, or if the input distributions upon which L2 learners base
their L2 categories differ from the distributions to which monolingual native speakers have been
exposed. The latter is expected for virtually all L2 learners, especially those exposed to multiple
dialects of the L2 and to foreign-accented renditions of the L2 by other non-native speakers.

The SLM proposed that a new L2 phonetic category formed for an L2 sound might also differ from
the phonetic categories of monolingual native speakers if the relative importance of the multiple
features defining an L2 sound, as spoken by native speakers, differed from the relative importance
of the same features in corresponding L1 sounds, or if the L2 sound was, at least in part, defined
by some feature “not exploited in the L1” (Flege, 1995, pp. 241-243).

Finally, L2 learners might differ from monolingual native speakers because of interactions
between the L1 and L2 phonetic subsystems. The SLM proposed that such interactions occur
because L1 and L2 sounds exist in a common “phonological space”. In retrospect, we recognize
that use of this term was a misnomer and will instead use the term “common phonetic space”.

The categories making up a monolingual’s phonetic system tend to occupy positions in phonetic
space that augment correct categorization. Operation of this language universal increases
inter-category distances in phonetic space (see, e.g., Lindblom, 1990). According to the SLM (Flege, 1995, p.
242), the elements making up the phonetic subsystems of bilinguals self-organize in the same way
as do the sounds of languages or dialects. As a result, a bilingual’s new L2 category might “deflect
away” from a category in the L1 phonetic subsystem to augment inter-category distances in the
common L1-L2 phonetic space of bilinguals.

2.10 L2 effects on L1 categories

L2 speech research prior to 1995 focused exclusively on the L2 but it is now clear that
understanding how L2 sounds are learned also requires an examination of how L1 sounds are
produced and perceived. One well known finding that compels this approach pertains to global
foreign accent. The strength of an L2-inspired foreign accent in the L1 varies inversely as a
function of the strength of an L1-inspired foreign accent in the L2 (Yeni-Komshian, Flege, & Liu, 2000; see also
Flege, 2007).

The SLM proposed a mechanism that might account, at least in part, for L2-on-L1 effects. As
mentioned, learners do not form new phonetic categories for all L2 sounds. Some L2 sounds are
so similar to an L1 sound that an L1-for-L2 substitution would go unnoticed by monolingual
speakers of the target L2 (Flege, 1992).

What about L2 sounds for which a new category is not formed? The SLM proposed that
“composite” (compromise) L1-L2 categories may develop. The perceptual link between an L2
sound and the closest L1 sound remains intact, and a composite L1-L2 category develops that is
based on the combined distribution of sounds defining the L1 and L2 categories.

The SLM predicted that when a composite L1-L2 category develops, the L1 category may shift
in the direction of the L2 category (e.g., MacKay, Flege, Piske, & Schirru, 2001). The magnitude
of the shift, and whether it will be auditorily detectable by monolingual speakers of the L1,
depends on the nature of the combined distributions. Specifically, the magnitude of the shift in a
bilingual is expected to vary as a function of how much the L1 and L2 have been used
cumulatively by the bilingual over the course of his or her life, how much the L1 and L2 have
been used recently, and how dissimilar a bilingual perceives pairs of L1 and L2 sounds to be.
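
The idea that a composite category reflects the combined L1 and L2 input distributions can be illustrated numerically. The sketch below is a deliberate simplification and an assumption on our part: the SLM does not specify a formula, the function name and VOT values are hypothetical, and only the relative share of L1 versus L2 tokens is modeled (recency of use and perceived dissimilarity are left out). It treats the composite category's central tendency as the mean of the pooled L1 and L2 tokens.

```python
def composite_mean(l1_mean, l2_mean, l1_share):
    """Mean of a pooled L1 + L2 token distribution.

    l1_mean, l2_mean : central values of the L1 and L2 distributions
                       (e.g., mean VOT in ms)
    l1_share         : assumed proportion of pooled tokens that come
                       from the L1 (between 0 and 1)
    """
    if not 0.0 <= l1_share <= 1.0:
        raise ValueError("l1_share must lie between 0 and 1")
    return l1_share * l1_mean + (1.0 - l1_share) * l2_mean


if __name__ == "__main__":
    # Hypothetical values: a short-lag L1 /t/ (20 ms) and a
    # long-lag L2 /t/ (80 ms).
    for share in (1.0, 0.75, 0.25):
        print(share, composite_mean(20.0, 80.0, share))
```

Under these assumptions, the more the pooled input is dominated by L2 tokens, the further the composite value moves away from the original L1 value, which corresponds to the predicted shift of the L1 category in the direction of the L2 category.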

2.11 Perception before production

The fact that L2 learners often speak with a foreign accent (e.g., Flege, 1984) and produce errors
in specific vowels and consonants (e.g., Flege & Munro, 1994) gave rise to the widespread belief
that L2 production errors arise because of an age-related reduction in ability to learn new forms of
articulation. The SLM challenged this view, proposing that production errors often have a
perceptual basis. For the SLM, the accurate perception of an L2 sound is a necessary but not
sufficient condition for its accurate production.

The SLM proposed that perceptual phonetic categories formed for L2 sounds and the realization
rules used to motorically implement them will “align”, as in L1 acquisition. By hypothesis, the
production of an L2 sound will “eventually correspond” to the properties specified in its phonetic
category representation (Flege, 1995, p. 239). The SLM did not provide an estimate for how long
an alignment of the perceptual information and that encoded in motoric representations used to
realize (produce) a phonetic category will take.

3. The revised Speech Learning Model (SLM-r)

The SLM-r aims to account for how phonetic systems reorganize over the life span in response to
the phonetic input received during naturalistic L2 learning. Some aspects of the original SLM
(Flege, 1995) have been carried forward to the SLM-r without change but other aspects are new.
For example, the SLM-r continues to focus on sequential learning of an L2 following establishment
of an L1 phonetic system rather than on the simultaneous learning of two languages in infancy and
early childhood (see, e.g., Werker & Byers-Heinlein, 2008, for the latter), but the SLM “age
hypothesis” has been replaced by a new hypothesis which, if correct, may help explain age-related
effects on L2 speech learning. The SLM was radical in its simplicity and this is even more the case
for the SLM-r. If one needed a two-word summary of the SLM-r approach those two words would
be that there is “no change” in how an L1 and an L2 are learned.

The core premises of the SLM-r are that: (1) the phonetic categories which are used in word
recognition and to define the targets of speech production are based on statistical input
distributions; (2) L2 learners of any age make use of the same mechanisms and processes to learn
L2 speech that children exploit when learning their L1; and (3) native vs. non-native differences
in L2 production and perception are ubiquitous not because humans lose the capacity to learn
speech at a certain stage of typical neuro-cognitive development but because applying the
mechanisms and processes that functioned “perfectly” in L1 acquisition to the sounds of an L2
does not yield the same results. A difference in L1 and L2 learning outcomes necessarily arises
because:

a) L1 sounds initially “substitute” for L2 sounds because the L2 sounds are automatically linked to
sounds in the L1 phonetic inventory.
b) Pre-existing L1 phonetic categories interfere with, and sometimes block, the formation of
new phonetic categories for L2 sounds.
c) The learning of L2 sounds is based on input that differs from the input that monolingual
native speakers of the target L2 receive when learning the same sounds.

The SLM-r shares the view of other theoretical models (e.g., Best & Tyler, 2007) that L2 speech
learning is profoundly shaped by perceptual biases induced by the L1 phonetic system. The SLM-
r has yet to be evaluated empirically. We think, however, that if furnished with adequate empirical
data, the SLM-r will be able to provide an account of how these biases change as a function of
exposure to L2 sounds.

We acknowledge, however, that not all perceptual biases which L2 learners bring to the task of L2
learning, and which change as an L2 is learned, can be attributed to the L1 or L2. An interesting
avenue for future research not developed in this chapter is the interaction of language-specific and
universal perceptual biases. To paraphrase Nam and Polka (2016), “the phonetic landscape … is
an uneven terrain”, with certain classes of sounds having a special status for all language learners
irrespective of previous language experience. For example, research inspired by the Natural
Referent Vowel framework of Polka and Bohn (2003; 2011) has shown that vowels which are
peripheral in the acoustic/articulatory vowel space have a special status, and there is some evidence
from recent L1 and L2 research that consonants with an alveolar place of articulation and stop
consonants in general have a special status for both L1 and L2 learners (Bohn, 2020).

3.1 Focus of the SLM-r

The focus of the SLM-r has changed in two important respects.

3.1.1. Early vs. Late learners

The SLM-r no longer focuses on differences between Early and Late learners. This is because
research since 1995 has shown that the critical period (CP) hypothesis proposed by Lenneberg
(1967) for L2 speech learning does not offer a plausible explanation for the age-related effects
routinely seen in L2 speech learning. Our reasoning is as follows.

First, differences between Late learners and native speakers of the target L2 cannot be attributed
to a loss of neural plasticity by the Late learners. We now know that the adult brain retains
considerable plasticity for processes relevant to L2 speech production and perception (e.g., Callan
et al., 2003; Callan et al., 2004; Ylinen et al., 2010; Zhang & Wang, 2007).

Second, the CP hypothesis was based on an evaluation of foreign-accented L2 production that was
misleading and incomplete. To be sure, immigrants who arrive in a predominantly L2-speaking
country after puberty usually speak their L2 with stronger foreign accents than those who arrived
earlier. However, many Early (“pre-critical period”) learners speak their L2 with a detectable
foreign accent even after decades of primary L2 use, and the strength of their foreign accents will
vary at least in part as a function of language use patterns (Flege, 2019). Further, Late learners’
foreign accents grow stronger following the supposed closure of a CP. As well, many immigrants
who are observed to speak their L2 with a foreign accent either have not yet received enough L2
input, or received too much foreign-accented L2 input (or both) to have reached their full potential
in L2 pronunciation and perception.

Third, Lenneberg (1967) believed that foreign languages have to be “taught and learned through a
conscious and labored effort” if L2 learning begins following closure of a CP (1967, p. 176). The
subjective impression of relatively great effort, presumably in comparison to children learning
their L1, is surely true for those attempting to learn an L2 in a foreign language classroom. For
those learning an L2 by immersion (e.g., immigrants) the sensation of “effort” can probably be
attributed to the fact that L2 learners, especially Late learners, usually have smaller lexicons than
native speakers and deploy phonetic categories that are not optimally tuned to the L2 speech
sounds they hear when attempting to access L2 words (Song & Iverson, 2018).

Finally, the CP hypothesis rested on the assumption that L2 learners can no longer gain “automatic
access” to the language-specific phonetic properties of L2 sounds from “mere exposure” to the L2
following the closure of a CP (Lenneberg, 1967, p. 176). In fact, Late learners can and do gain
access to the phonetic details defining L2 sounds without special tutoring or using cognitive
processes not previously exploited for L1 acquisition (e.g., de Leeuw & Celata, 2019; Flege &
Hammond, 1982).

3.1.2 Not just “end state” learners

The SLM-r no longer focuses on individuals who are highly experienced in the L2. We now
recognize that it is virtually impossible for L2 learners to produce and perceive an L2 sound exactly
like mature monolingual native speakers of the target L2. That being the case, it is no longer of
theoretical interest to determine if the L2 performance of a particular learner is or is not
indistinguishable from that of L2 native speakers.

Most individuals who participate in L2 research have typically, perhaps inevitably, received
different input than the members of a presumably representative native-speaker comparison group
(Schmidtke, 2016). Input differences may well lead to subtle native vs. non-native differences,
even in highly proficient and experienced L2 learners and even for L2 sounds that should be easy
(Broersma, 2005). The simple fact of being bilingual may prevent the so-called “mastery” of L2
sounds (see Hopp & Schmid, 2013, for discussion).

Early learners have been considered by many to be rapid and perfect learners of L2 speech when,
in fact, Early learners often differ from native speakers when examined closely. For example,
Højen and Flege (2006) tested native Spanish (NS) adults who had learned English as children on
the discrimination of three especially difficult pairs of English vowels. As expected, NS
monolinguals discriminated the three pairs at near-chance levels. The Early learners obtained
substantially higher scores than the NS monolinguals did but, as a group, they differed significantly
from NE speakers for two of the three pairs of English vowels. Other research suggests that the
magnitude of differences between Early learners and native speakers depends, at least in part, on
differences in the relative frequency of L1 and L2 use (e.g., Bosch & Ramon-Casas, 2011; Flege,
2019; Mora, Keidel, & Flege, 2010, 2015).

Persistent differences between native and non-native speakers can be seen for some L2 learners in
research examining comprehension. Non-native speakers are less successful in recognizing L2
words than are native speakers, especially in non-ideal listening conditions. This is due, at least in
part, to non-natives’ use of phonetic categories that differ from the phonetic categories deployed
by native speakers (e.g., Garcia Lecumberri, Cooke, & Cutler, 2011; Imai, Walley, & Flege, 2005;
Jongman & Wade, 2007). Such differences are evident even in some Early learners who speak their
L2 without an obvious foreign accent (Rogers et al., 2006), reflecting the fact that immigrants who
are tested in L2 research are likely to have had substantially less exposure to L2 words than age-
matched monolingual native speakers (Schmidtke, 2016) regardless of when they began learning
their L2.

As we now see it, the earlier SLM focus on “end state” learning was mistaken because it is
necessary to examine early stages of L2 speech development in order to understand the process of
L2 phonetic category formation. The earlier focus on highly experienced L2 learners assumed that,
at some point, L2 speech learning reaches an asymptote or “ultimate” level of attainment. Even
though the notion that L2 competence and performance fossilize is widely accepted (Han & Odlin,
2006) it has never been tested for L2 speech learning as far as we know.

That being the case, we have decided to report here the results of unpublished research carried out
in Canada. In 1992, Murray Munro recorded a total of 24 NE speakers and 240 native Italian (NI)
speakers who had lived in Canada for 15 to 44 years (mean=32.5 years). The data obtained in 1992
were reported by Flege, Munro and MacKay (1995b), who measured VOT in English words
beginning in /p t k/ produced by the 264 participants. In 2003, Jim Flege and Ian MacKay re-
recorded 20 NE and 149 NI speakers from the 1992 sample to determine if the NI participants had
learned to produce English stops more accurately over the 10.5-year interval between the
recordings. All participants were recorded in the same location using identical procedures, speech
materials, and equipment; only the testers differed.

As expected, the VOT values obtained in 1992 and 2003 for NE speakers were much the same. As
can be seen in Figure 2(a), this also held true for most but not all of the NI speakers. An inspection
of the scatterplot revealed that 20 NI participants produced English stops with longer VOT values
in 2003 than in 1992 whereas 20 others showed the opposite pattern. When these 40 NI speakers
were removed, the 1992-2003 VOT correlation obtained for the NI speakers increased from
r(147)=.82 to r(107)=.95.

The VOT values of the NI speakers who increased (n=20) or decreased (n=20) VOT over time
were compared to the values obtained from the 20 NE speakers. The values obtained for the 60
participants, when submitted to a (3) Group x (2) Time ANOVA with repeated measures on Time
(1992 vs. 2003), yielded a significant interaction, F(2,57)=96.6, p<.01. This was because, as can
be seen in Figure 2(b), the effect of Time was non-significant for the NE speakers but significant
in opposite directions for the two NI groups (p<.01). The two NI groups did not differ significantly
in LOR. However, those who increased VOT in 2003 compared to 1992 reported using English
significantly more frequently in 2003 than those who decreased VOT (means=76.5% vs. 63.2%,
F(1,38)=4.2, p<.05) even though the two groups did not differ significantly in self-reported percent
English use in 1992 (p>.10).
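
As a reading aid, the degrees of freedom reported above can be checked against the sample sizes given in the text. The helper names below are hypothetical. The formulas are the standard ones for a Pearson correlation, for the interaction in a mixed design with one between-subjects and one within-subjects factor, and for a one-way between-groups comparison.

```python
def pearson_df(n):
    """Degrees of freedom for a Pearson correlation on n pairs."""
    return n - 2


def mixed_interaction_df(n_participants, n_groups, n_times):
    """(effect df, error df) for the Group x Time interaction in a
    mixed ANOVA with repeated measures on Time."""
    effect = (n_groups - 1) * (n_times - 1)
    error = (n_participants - n_groups) * (n_times - 1)
    return effect, error


def oneway_df(n_participants, n_groups):
    """(effect df, error df) for a one-way between-groups ANOVA."""
    return n_groups - 1, n_participants - n_groups


if __name__ == "__main__":
    print(pearson_df(149))                # all re-recorded NI speakers
    print(pearson_df(149 - 40))           # after removing 40 NI speakers
    print(mixed_interaction_df(60, 3, 2)) # 3 groups of 20, 2 test times
    print(oneway_df(40, 2))               # the two NI subgroups
```

The values returned match the reported r(147), r(107), F(2,57), and F(1,38), so the statistics are consistent with the stated group sizes.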

These results suggest that the language use patterns of immigrants usually do stabilize, but this
does not place an upper limit on the human capacity for learning speech when phonetic input
changes. The change in percentage English use by long-time NI immigrants in Canada was likely
to have been the result of important life changes such as remarriage, a job change, relocation to a
new neighborhood or some combination of life changes. It is also possible, of course, that the
changes in self-reported L2 use were accompanied by changes in how often the NI speakers heard
English pronounced with an Italian foreign accent.

How much time and input are needed to induce VOT production changes in the L2? The results of
Sancier and Fowler (1997) suggested that two months of input may suffice. These authors
examined a bilingual who spent alternating periods in the United States and Brazil. She produced
shorter VOT values in Portuguese than English stops, and her VOT values in both languages were
shorter following a “several months stay” in Brazil than a comparable stay in the United States.

The SLM-r interpretation of these results, to be elaborated below, is that the Late learner studied
by Sancier and Fowler (1997) developed a composite L1-L2 phonetic category for perceptually
linked Portuguese and English voiceless stops. Her composite phonetic categories for voiceless
stops, which specified the articulatory goals for the production of stops in English and Portuguese,
were updated regularly to reflect recent input.

3.2 SLM-r hypotheses

3.2.1 Perception and production co-evolve

The SLM proposed that the accuracy of L2 segmental perception places an upper limit on the
accuracy with which L2 sounds are produced. This hypothesis has been replaced by the hypothesis
that L2 segmental production and perception co-evolve without precedence. The SLM-r “co-
evolution” hypothesis arises from the observation of inconsistencies in L2 research and from
evidence that a strong bi-directional connection exists between production and perception. This
new evidence requires adding an arrow connecting phonetic-level production to perception in
Figure 1.

Mitterer, Reinisch and McQueen (2018) observed that from the standpoint of spoken word
recognition, there is no need to assume that production and perception must be very similar. These
authors noted, for example, that native Dutch (ND) speakers differ in the extent to which they
produce Dutch /r/ as an approximant in post-vocalic position and also in terms of what kind of
trilled /r/ they use in pre-vocalic position. They observed that even though all ND speakers are
able to recognize both [rot] and [ʀot] variants of the Dutch word for red, a given speaker is
“unlikely to use both variants” (2018, p. 90) in production.

The earlier SLM “upper limit” hypothesis was supported by several kinds of evidence. First,
infants show an effect of ambient language input on perception before showing ambient language
effects on production (Kuhl, 2000) and at least some of children’s segmental production errors can
be attributed to an inability to discriminate a sound produced in error from the correct target sound
(Eilers & Oller, 1976). Second, non-native speakers can scale overall degree of perceived foreign
accent in their L2 much like native speakers even though they themselves speak their L2 with a
strong accent (Flege, 1988; MacKay, Flege, & Imai, 2006). Third, perceptual training leads to an
improved production of both consonants and vowels (e.g., Bradlow et al., 1999; Lengeris & Hazan,
2010) in the absence of explicit training on production.

The primary source of support for the SLM upper limit hypothesis, however, was the observation
of significant positive correlations between measures of segmental production and perception
accuracy (Flege, 1999). The strength of such correlations seemed to vary according to the
commensurability of the production and perception measures (e.g., Flege, 1999; Baker &
Trofimovich, 2006; Kim & Clayards, 2019). However, several observations raised doubts about
the correlational evidence. First, the presence of near-mergers, that is, the systematic production
of differences that cannot be readily perceived (e.g., Labov, 1994), indicates that production and
perception are not completely symmetrical. Second, some studies failed to yield significant correlations
or even yielded inverse correlations (e.g., Darcy & Krüger, 2012; Peperkamp & Bouchon, 2011; Sheldon
& Strange, 1982). Most importantly, the observation of significant positive production-perception
correlations did not demonstrate causality. The correlations could just as easily be interpreted to
mean that production accuracy places an upper limit on perceptual accuracy as the reverse (see,
e.g., Best, 1995).

The evidence now in hand suggests that a strong bi-directional connection exists between
production and perception. It is important to recognize, of course, that the correspondence between
the two is never perfect, even in monolinguals. For example, Shultz, Francis, and Llanos (2012)
examined NE speakers’ use of VOT and F0 onset frequency in the production and perception of
words beginning in /b/ and /p/. Although participants made greater use of VOT than F0 onset
frequency in perception, a significant inverse relation was observed between the two dimensions
in production, leading the authors to conclude that the goals for “efficient” production and
perception differ (2012, p. EL99).

Johnson, Flemming, and Wright (1993) observed what they called a “hyperspace” effect. These
authors asked NE listeners to select what they considered to be the best examples of various English
vowel categories from a two-dimensional array of vowel stimuli differing in F1 and F2
frequencies. The F1 and F2 values in their production of English vowels were also analyzed
acoustically. The NE participants tended to choose, as best exemplars of a vowel, stimuli having
more peripheral frequency values than they themselves produced in the same vowel (see also
Frieda, Walley, Flege, & Sloane, 2000; Newman, 2003). Here production and perception measures
were correlated but differed in absolute value.

It is plausible that although the targets for the articulation of speech sounds are defined by
perceptual representations (e.g., Tourville & Guenther, 2011), a bidirectional and co-equal link
between the two is actively maintained (Chao et al., 2019; Perkell, Guenther et al., 2004; Perkell,
Matthies et al., 2004). This is consistent with the observation (Reiterer et al., 2013, p. 9) that the
regulation of motor and sensory processes used in speech production and perception is localized
in “partly overlapping, heavily interconnected brain areas”.

Guenther, Hampson, and Johnson (1998) noted that brain areas specialized for speech production
are active during speech perception, and vice-versa. These authors hypothesized that “auditory
target” regions which develop during L1 acquisition guide articulation in space and time. In their
view, articulatory gestures are planned as “trajectories in auditory perceptual space” that map onto
“articulator movements”. This coupling permits auditory goals to be achieved via “motor
equivalent” gestures in which “constriction locations and degrees” may vary (1998, p. 611). This
capacity enables individual monolingual NE speakers, for example, to produce /r/ to the
satisfaction of other NE-speaking listeners using very different articulatory gestures (e.g., Mielke,
Baker, & Archangeli, 2016; Westbury, Hashi, & Lindstrom, 1998).

Evidence for the existence of bi-directional links has been provided by the results of perturbation
studies. Houde and Jordan (1998) altered the vocal output of adult NE monolinguals as they spoke
so that the output differed from what they intended to say. Most participants managed to adapt
their articulation so that their vocal output again corresponded to what they intended to say. When
the auditory distortion was removed, the participants returned to their normal mode of production.
Nasir and Ostry (2009) obtained a corollary finding for production. Most NE participants were
able to compensate for unexpected perturbations of jaw position while producing /æ/. The better
the compensation, the more the participants’ identifications of stimuli in an /ɛ/-/æ/ continuum were
observed to shift before vs. after the perturbations were administered.

The linkage between production and perception appears to be uniquely human. Schulze, Vargha-
Khadem and Mishkin (2012) found that humans, unlike monkeys, are extremely good at storing
lasting memories of speech sounds in long-term memory. They attributed this capacity to the
evolution of robust and rapid links between the auditory system, localized in the posterior temporal
region, and an oromotor sensory system in the ventrolateral frontal region of the human cortex.

Schulze et al. (2012) examined the mimicry of words, non-words and environmental sounds by 36
normal young adults. They also examined participants’ auditory recognition memory for the same
speech and non-speech stimuli. The participants were often unable to recognize an auditory
stimulus they could not reproduce (mimic) or label. The authors hypothesized that a representation
in long-term memory cannot be created unless a novel speech sound is “pronounceable”, that is,
likely to “activate the speech production system automatically and subvocally” (2012, p. 7123).

3.2.2 L2 input

The SLM proposed that L2 learners gradually “discern” L1-L2 phonetic differences as they gain
experience using the L2 in daily life, and that the accumulation of detailed phonetic information
with increasing exposure to statistically defined input distributions for L2 sounds will lead to the
formation of new phonetic categories for certain L2 sounds. The SLM did not provide a method
for measuring how phonetic information accumulates, nor how much phonetic input is needed to
precipitate the formation of new L2 phonetic categories. The model simply pointed to years of L2
use as a metric of the quantity of L2 input. As mentioned, however, immigrants’ length of residence
(LOR) in a predominantly L2-speaking environment is problematic because it does not vary
linearly with the phonetic input that L2 learners receive and because it provides no insight into the
quality of L2 input that has been received.

It is universally accepted that infants and pre-literate children attune to the phonetic categories of
the ambient language through “exposure to a massive amount of distributional information” (Aslin,
2014, p. 2; see also Kuhl et al., 2005). For the SLM-r, input is also crucial for the formation of
language-specific L2 phonetic categories and composite L1-L2 phonetic categories. The SLM-r
defines phonetic input as the sensory stimulation associated with L2 speech sounds that are heard
and seen during the production by others of L2 utterances in meaningful conversations.

Input, which has both quantitative and qualitative dimensions, has proven difficult to measure. For
now, we simply observe that full-time equivalent (FTE) years of L2 input provides a somewhat
better estimate of input than LOR does. Years of FTE input is calculated by multiplying LOR by
the proportion of L2 use (derived from questionnaire estimates of percentage L2 use). Consider,
for example, two immigrants who have both lived in an L2-speaking country for 20 years but
report using their L2 with unequal frequencies (90% vs. 30% of the time). The former has 18.0
FTE years of English input, the latter just 6.0 years. Such a difference is likely to be crucial
inasmuch as the former, but not the latter, immigrant has probably received as much input as
monolingual children need to reach adult-like performance levels for certain L2 sounds.
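
The FTE calculation just described can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `fte_years` and its parameter names are our own hypothetical choices, and the input values are the two hypothetical immigrants mentioned above, not data from any study.

```python
def fte_years(lor_years: float, pct_l2_use: float) -> float:
    """Return full-time equivalent (FTE) years of L2 input.

    FTE years = LOR (in years) x proportion of L2 use, where the proportion
    is the self-reported percentage of L2 use divided by 100.
    """
    if lor_years < 0:
        raise ValueError("LOR must be non-negative")
    if not 0 <= pct_l2_use <= 100:
        raise ValueError("percentage L2 use must lie between 0 and 100")
    return lor_years * pct_l2_use / 100


# The two hypothetical immigrants from the text: same LOR, unequal L2 use.
for pct in (90, 30):
    print(f"LOR = 20 years, {pct}% L2 use: {fte_years(20, pct):.1f} FTE years")
```

Note that this estimate inherits any error in the questionnaire-based percentage and says nothing about the quality of the input received.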

Quality of input has been largely ignored in L2 speech research even though it may well determine
the extent to which L2 learners differ from native speakers. As mentioned earlier, native Spanish
(NS) speakers who learned English in childhood but often heard Spanish-accented English were
found to produce English /p t k/ with VOT values that were too short for English (as spoken by
most monolinguals), thereby resembling NS speakers who learned English as adults in a place
where Spanish-accented English was not prevalent.

More fine-grained measures of the quantity and quality of input are clearly needed. Promising new
methods for obtaining better measures of both are presented in the Supplementary Materials
associated with this chapter. In addition to obtaining accurate input measures, it is important to note
that the context in which input is assessed may also matter. For example, the time of day when L2
input is received may influence how well the input is consolidated and thus indirectly influence
speech learning (Earle & Myers, 2015).

3.2.3 Perceived cross-language dissimilarity

The SLM-r maintains the earlier SLM hypotheses that learners subconsciously and automatically
relate L2 sounds to L1 phonetic categories, and that the greater the perceived phonetic
dissimilarity of realizations of an L2 phonetic category from the realizations defining an L1
category, the more likely a new phonetic category will be formed for the L2 sound.

As far as we know, the consistency of L2-to-L1 mapping patterns has not been studied
longitudinally. It is probably the case, however, that mapping patterns stabilize as L2 phonetic
input is received. Iverson and Evans (2009) examined cross-language mapping patterns before and
after five vowel training sessions. The data they present (Table I, p. 871) indicate that non-native
participants were more consistent in their labeling of English vowels in terms of L1 categories in
20 of 23 possible instances at the second than at the first time of observation.

Cross-language mapping patterns may vary when phonetic contexts alter the realization of an L2
sound in a language-dependent manner (Levy & Strange, 2008, p. 153), for example, English
vowels spoken in different consonantal contexts (Bohn & Steinlen, 2003; Levy & Law, 2010).
Levy (2009b, p. 2680) found that NE-speaking listeners perceived the French vowel /y/ as “most
similar” to the American English vowel /u/ more often when the French vowel occurred in an
alveolar than bilabial context (see also Levy, 2009a).

It remains to be determined how best to measure cross-language phonetic dissimilarity. The
importance of doing so is widely accepted but a standard measurement procedure has not yet
emerged (for discussions see Bohn, 2002; Strange, 2007). Cross-language dissimilarity must be
assessed perceptually rather than acoustically because acoustic measures sometimes diverge from
what listeners perceive (e.g., Levy & Strange, 2008, p. 153; Johnson et al., 1993). The most
common procedure used in L2 research is to obtain two judgments of a single stimulus (e.g.,
Iverson & Evans, 2009; Strange, Bohn, Nishi, & Trent, 2005). Tokens of an L2 sound are randomly
presented for classification (labeling) in terms of L1 categories in an N-alternative forced-choice
format. After labeling a token, listeners then rate it for degree of perceived dissimilarity from the
L1 category just used to label the token.

Many researchers have integrated labeling and rating data in an attempt to provide a metric of
perceived L1-L2 phonetic “distance”, and thus to determine which L1 sound is closest in phonetic
space to a target L2 sound. For example, Iverson and Evans (2007) multiplied the proportions of
trials in which various L1 vowels were used to label an English vowel by the average rating obtained
for the vowel on a continuous scale ranging from “close” to “far away”. The authors noted that, in
a research design intended to compare groups of learners, this metric was “poor at predicting”
whether various English vowels had been “learned” or not learned (2007, p. 2852). Cebrian (2006)
used a similar technique to assess the perceived phonetic distance between L1 (Catalan) and L2
(English) vowels, finding little difference in the measures obtained for a group tested in Spain
and those of native Catalan participants who were long-time residents of Canada.
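
One reading of the proportion-by-rating metric described above can be sketched as follows. This is an illustrative approximation, not the procedure actually used in the studies cited: the function name, the data layout, the numeric rating scale (higher values taken to mean “far away”), and the example trials are all hypothetical.

```python
from collections import defaultdict


def label_rating_scores(trials):
    """Combine labeling and rating data for one L2 vowel.

    trials: list of (l1_label, rating) pairs, one pair per presented token.
    Returns, for each L1 label, the proportion of trials on which that label
    was chosen multiplied by the mean rating given on those trials.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    ratings_by_label = defaultdict(list)
    for label, rating in trials:
        ratings_by_label[label].append(rating)
    n_trials = len(trials)
    return {
        label: (len(ratings) / n_trials) * (sum(ratings) / len(ratings))
        for label, ratings in ratings_by_label.items()
    }


# Hypothetical responses by one listener to tokens of a single English vowel.
example_trials = [("e", 2.0), ("e", 3.0), ("e", 1.0), ("i", 5.0)]
print(label_rating_scores(example_trials))
```

Because each listener may label a different subset of tokens with a given L1 category, the mean ratings entering this product rest on varying numbers of tokens across listeners.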

Both findings just mentioned appear to contradict the SLM-r proposal that as L2 learners gain
experience in the L2 they will become better able to discern L1-L2 phonetic differences which
will, in turn, increase the likelihood of a new L2 category being formed. We suspect that results
that are congruent with SLM-r predictions would be obtained if better measures of the perceived
phonetic dissimilarity of an L2 sound from the closest L1 sound were obtained.

The label-then-rate technique is an example of what Tulving (1981) called an “ecphoric” task
inasmuch as a physically presented stimulus must be compared to information stored in episodic
memory. As we see it, this technique for assessing L1-L2 phonetic distance is problematic for
several reasons. The mean L1-L2 dissimilarity ratings calculated for various members of a group
will necessarily be based on varying subsets of the L2 tokens that have been presented. This is
because individuals may map L2 sounds onto L1 categories in differing ways. The process of
classification requires participants to access information stored in long-term memory before rating
a token for dissimilarity. Filling the interval between the classification and rating responses may
influence the ratings, perhaps in diverse ways for various individual participants. Finally, this
method can only be used with participants who are literate and can confidently use the labels
provided by the experimenter.

An alternative method for assessing perceived dissimilarity is what Tulving (1981) would call a
“perceptual” similarity task. Flege (2005a) recommended assessing perceived cross-language
phonetic dissimilarity by presenting, in a single trial, pairs of L1 and L2 sounds for ratings using
an equal-appearing interval scale. For this technique to be used effectively, the L1 and L2 sounds
under evaluation must be represented by tokens produced by multiple monolingual native speakers
of the learners’ L1 dialect, and multiple native speakers of the L2 dialect or variety being learned.
As well, the L1 and L2 sounds under investigation should be represented by tokens representing a
wide range of variants in a specific phonetic context (e.g., Lengeris, 2009, p. 141) rather than “best
exemplars”. From the perspective of the SLM-r, dissimilarity ratings must be obtained at an early
stage of L2 learning if they are to serve as a predictor of whether a new category will eventually
be formed. This is because the rated dissimilarity of L1-L2 sound pairs is likely to increase when
a new category is formed for the L2 sound (Flege, Munro, & Fox, 1994, Figure 5; see also Bohn
& Ellegaard, 2019).

3.2.4 The category precision hypothesis

According to Flege (1992), language-specific phonetic categories are characterized by a narrow
range of “good” exemplars which is centered on a perceptual “tolerance region”. Tokens
falling slightly outside the tolerance region may be heard as intended but will nonetheless be
judged to be distorted or foreign-accented instances of the category (see Flege, Takagi, & Mann,
1995, Figure 4). It is plausible that the good exemplars in this narrow range (1) are found
near the center of gravity of the statistical distribution of tokens of a category that an
individual has encountered (Chao, Ochoa, & Daliri, 2019), (2) define the core acoustic properties
of a category and their weighting in a way that maximizes categorization accuracy (Holt & Lotto,
2006), and (3) are deployed by listeners as a collective referent when they consciously rate the
accuracy of production of various tokens of a phonetic category in a laboratory experiment (e.g.,
Miller, 1994) and when they subconsciously perceive degree of phonetic dissimilarity of a pair of
L1 and L2 sounds in ordinary conversations.

As monolingual children mature, their L1 phonetic categories develop. The SLM proposed that L1
phonetic category development may impact L2 speech learning. More specifically, the
development of L1 phonetic categories may make it progressively less likely for children and
adolescents to discern cross-language phonetic differences and thus to form new phonetic
categories (Flege, 1995, p. 266).

The SLM hypothesis made explicit reference to the chronological age at the time of first exposure
to an L2. As we now see it, the SLM “age” hypothesis was problematic because it lacked
specificity and because it was not possible to dissociate the state of development of learners’ L1
phonetic categories from their overall state of neurocognitive development at the time of first
exposure to an L2. Consider, for example, two hypothetical participants, A and B, who are 38 years
and 46 years of age when tested but were 8 and 16 years of age when first exposed to their L2.
Participant A will surely produce and perceive L2 sounds more accurately than participant B after
30 years of L2 use. Such a difference could be attributed to a putative difference in neurocognitive
maturation between the two participants when they were first exposed to their L2 (Lenneberg,
1967) or to a putative difference in the state of development of their L1 phonetic categories (Flege,
1995).

The SLM-r has replaced the “age” hypothesis with the “L1 category precision” hypothesis.
According to the category precision hypothesis, the more precisely defined L1 categories are at
the time of first exposure to an L2, the more readily the phonetic difference between an L1 sound
and the closest L2 sound will be discerned and a new phonetic category formed for the L2 sound.

The SLM-r operationalizes category precision as the variability of acoustic dimensions measured
in multiple productions of a phonetic category. It should be noted, of course, that variability in the
realization of phonetic categories that are adjacent in phonetic space will be related to the
magnitude of inter-category distances in phonetic space. The source(s) of inter-subject differences
in category precision is (are) unknown at present. However, the SLM-r regards category precision
as an endogenous factor that is potentially linked to individual differences in auditory acuity, early-
stage (pre-categorical) auditory processing, and auditory working memory.
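
As a minimal sketch of this operationalization, the variability of each acoustic dimension across repeated productions of one category can be summarized by its standard deviation. The function name, the choice of the sample SD as the variability statistic, and the formant values below are assumptions made for illustration only.

```python
from statistics import stdev


def production_variability(tokens):
    """Return the sample SD of each acoustic dimension across tokens.

    tokens: list of dicts, one per production of the same category, each
    mapping a dimension name (e.g. "F1") to a measured value in Hz.
    Smaller SDs are read here as greater category precision.
    """
    if len(tokens) < 2:
        raise ValueError("at least two productions are needed to estimate an SD")
    dimensions = tokens[0].keys()
    return {dim: stdev(tok[dim] for tok in tokens) for dim in dimensions}


# Hypothetical formant measurements (Hz) for three tokens of one vowel.
speaker_tokens = [
    {"F1": 500, "F2": 1500},
    {"F1": 510, "F2": 1540},
    {"F1": 490, "F2": 1460},
]
print(production_variability(speaker_tokens))
```

A speaker with smaller SDs on the same dimensions would count as more precise under this sketch; in practice the values might first be normalized or converted to a perceptual scale.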

Cross-language phonetic research has focused on language-specific differences in the phonetic
categories found in various languages. Importantly, the phonetic categories developed by
individual monolingual speakers of a single language can differ as well. Consider, for example,
the production and perception of word-initial tokens of English /p t k/. All NE adults produce these
stops with long-lag VOT values in word-initial position, but the exact values that individuals
typically produce vary substantially (Theodore, Miller, & DeSteno, 2009) even in the absence of
dialect differences (Docherty, Watt, Llamas, Hall, & Nycz, 2011). Similarly, individual
differences exist in adults’ production of L1 vowels and these differences remain stable over time
(Heald & Nusbaum, 2015).

As monolingual children mature, their production of L1 sounds generally becomes less variable
(e.g., Kent & Forner, 1980). For example, Lee et al. (1999) observed that the normalized variability
of vowel formant frequencies continues to decrease in children until at least 14 years of age. This
developmental change in production is accompanied by changes in vowel perception, specifically,
an increase in the steepness of slopes in identification functions (Hazan & Barrett, 2000; Walley
& Flege, 1999).

Importantly, however, individual differences in production variability continue to exist in
adulthood. For example, variability in the production of VOT in English stops by NE children
generally decreases until about the age of 12 years, but even NE adults may show differing degrees
of variability in VOT production. Heald and Nusbaum (2015) observed variability in formant
frequency values in vowels produced by NE adults. We analyzed the standard deviations
associated with the means of 63 formant frequency values (9 data samples, 7 English vowels)
obtained from five NE females (Heald & Nusbaum, 2015, Tables S6 to S8). We found that one of the
women, participant 3, produced vowel formant frequencies with significantly smaller SDs
(Bonferroni adjusted p < .001) than did the remaining four female participants. This held true for
F1 frequencies (mean SD=22.1 vs. 24.1 to 35.0), for F2 (mean SD=58.1 vs. 87.8 to 155.7), and for
F3 (mean SD=95.0 vs. 136.0 to 158.4). Chao et al. (2019) also found that within-category vowel
production variability differed substantially among NE adults and was strongly related to their
category boundaries in a vowel identification task.

Perkell, Guenther et al. (2004; see also Perkell, Matthies et al., 2004) examined the perception and
production of English vowels by NE-speaking adults. The participants whose productions were
more precise, that is, showed relatively little within-vowel variability and relatively large between-
vowel distances, showed finer discrimination abilities. The authors suggested that a relatively great
auditory sensitivity is associated with a relatively narrow target region in the realizations of vowel
categories and that this, in turn, is associated with relatively great precision in producing a vowel.
Similar results were obtained by Franken, Acheson, McQueen, Eisner, and Hagoort (2017), who
examined the production and discrimination of Dutch vowels by 40 ND adults. Vowel production
precision was defined as relatively little within-category variability and relatively great between-
category distances. Once again, vowel category precision was associated with relatively great
auditory sensitivity.
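
The two ingredients of production precision named above, within-category variability and between-category distance, might be computed along the following lines. This sketch does not reproduce the measures used in the studies cited: the use of Euclidean distance in a raw F1-F2 space, the function names, and the token values are all assumptions.

```python
from itertools import combinations
from math import dist
from statistics import mean


def centroid(tokens):
    """Mean (F1, F2) position of a set of tokens."""
    return tuple(mean(axis) for axis in zip(*tokens))


def within_category_spread(tokens):
    """Mean Euclidean distance of tokens from their own category centroid."""
    center = centroid(tokens)
    return mean(dist(token, center) for token in tokens)


def between_category_distance(categories):
    """Mean Euclidean distance between all pairs of category centroids."""
    centers = [centroid(tokens) for tokens in categories.values()]
    return mean(dist(a, b) for a, b in combinations(centers, 2))


# Hypothetical (F1, F2) tokens in Hz for two vowels from one speaker.
vowels = {
    "i": [(300, 2300), (310, 2300), (290, 2300)],
    "a": [(700, 1200), (700, 1220), (700, 1180)],
}
for name, tokens in vowels.items():
    print(name, round(within_category_spread(tokens), 2))
print("between:", round(between_category_distance(vowels), 2))
```

Under this sketch, a more precise speaker shows small within-category spread relative to the distance between category centroids.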

Lengeris and Hazan (2010) found that individual differences in category precision that were
observed in the L1 were also evident in the L2. The authors indexed individual differences in
perceptual precision by analyzing the slopes of bilinguals’ identification functions. Those who
were most consistent (precise) when identifying L1 (Greek) vowels were also most consistent
when identifying L2 (English) vowels.

Previous research provides some support for the SLM-r category precision hypothesis, which will
need to be evaluated in future research. Baker, Trofimovich, Flege, Mack, and Halter (2008)
examined the interlingual identification of English and Korean vowels. Native Korean (NK) adults
were more likely than NK children to identify English vowels in terms of a single Korean category.
NK children identified an English vowel with a wider variety of Korean categories. The authors
did not assess category precision, but it is likely that the adults’ categories were generally more
precise than those of the children.

Kartushina and Frauenfelder (2013) provided more direct evidence that L1 category precision
affects L2 speech learning. These authors examined the production and perception of French
vowels by native Spanish (NS) adolescents who had studied French at school for about four years.
French /e/ and /ɛ/ occupy a portion of acoustic vowel space where Spanish has just one vowel, /e/.
Acoustic analyses showed more overlap in F1 and F2 values between the students’ Spanish /e/
productions and native French speakers’ productions of French /ɛ/ than between the students’
Spanish /e/ productions and French /e/.

Kartushina and Frauenfelder (2013) reported that the students whose Spanish /e/ productions were
closer in an F1-F2 acoustic space to French /e/ were better able to identify French /e/ in a five-
alternative forced-choice test than the students whose Spanish /e/ productions were more distant
from French /e/. Students whose Spanish /e/ productions showed a relatively “compact”
distribution, that is, relatively little token-to-token variability in the F1-F2 vowel space (greater
“precision” in SLM-r terminology) were more accurate in identifying French /ɛ/ than students
whose Spanish /e/ productions showed greater token-to-token variability (less precision). The
authors hypothesized that the students who showed relatively little token-to-token variability in
L1 vowel production may have been better able to discern phonetic differences between the
Spanish and French vowels.

The results obtained by Kartushina, Hervais-Adelman, Frauenfelder, & Golestani (2016)
suggested that an influence of L1 category precision may be evident even in the earliest stages of
L2 speech learning. These authors examined the production of Danish /ɔ/ and Russian /ɨ/ by 20
native French (NF) speakers who had no prior exposure to Danish or Russian. The NF participants
were asked to repeat multiple natural tokens of the foreign vowels as accurately as possible both
before and after articulatory training on the vowels had been administered. The accuracy with
which the foreign vowels were produced before and after training was assessed, as was the
precision (token-to-token variability) with which the NF participants produced the foreign vowels
and the closest French vowels.

Kartushina et al. (2016) found that the NF participants produced the foreign vowels far more
accurately, and with greater precision, after than before training. Most importantly for the present
discussion, the training was not found to modify the precision with which the NF participants
produced native French vowels. This supports the view that L1 category precision is an
endogenous factor not shaped by language-specific phonetic factors.

Further support for this conclusion comes from a recent study of vowel production in Yoloxóchitl
Mixtec. DiCanio, Nam, Amith, García, & Whalen (2015) evaluated both the extension and
precision of vowel categories in elicited and spontaneous speech samples. The authors noted that
“with a few exception … their participants were very similar in their overall degree of vowel
[production] variability across style” (2015, p. 55). The two production samples differed
systematically, but the seven talkers maintained between-vowel differences and exhibited similar
degrees of precision in both.

For the category precision hypothesis to be accepted, it will be necessary to show in prospective
research that individual differences in L1 category precision affect the discernment of L1-L2
phonetic differences as predicted. It will also be necessary to show that differences in discernment
of cross-language phonetic differences will impact the production and perception of L2 sounds. It
will also be valuable to determine if individual differences in L1 category precision affect how
much L2 input learners need to establish consistent patterns of interlingual identification and if, as
we suspect, individual differences in category precision in monolinguals derive from individual
differences in auditory acuity, early-stage (pre-categorical) auditory processing, and auditory
working memory.

3.2.5 Bilingual phonetic categories

The SLM-r proposes that the capacity for phonetic category formation remains intact over the life
span, but that new categories are not formed for all L2 sounds differing audibly from the closest
L1 sound. By hypothesis, the likelihood that a new phonetic category will be formed for an L2
sound depends on (1) the degree of perceived phonetic dissimilarity of an L2 sound from the
closest L1 sound, (2) how precisely defined the closest L1 phonetic category is, and (3) the quantity
and quality of L2 input that has been received.

Categories formed for L2 sounds are defined by the statistical properties of input distributions.
This kind of distributional learning is slow in L1 acquisition, and so the SLM-r maintains that it
will also be slow in L2 learning. Much more needs to be known about the time course of
distributional learning, both in the L1 and in an L2. Feldman, Griffiths, and Morgan (2009)
provided evidence that listeners need not estimate the entire distribution of instances of a category
because “simply storing [a sufficient number of] exemplars can provide an alternative method for
estimating the distribution associated with a category” (2009, p. 774). Further, the development of
categories through the estimation of distributions eliminates the need for learners to have a priori
knowledge of what kind and how many categories exist in the L2 being learned (2009, p. 774).
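
The idea that stored exemplars can stand in for an explicit estimate of a category's distribution
can be illustrated with a minimal sketch. It is not the model of Feldman et al. (2009). It simply
assumes a Gaussian kernel density over stored tokens along one dimension (VOT, in ms). The
exemplar values, the bandwidth, the category labels, and the function names kernel_density and
categorize are all hypothetical, and the labels are supplied in advance only to keep the example
short.

```python
import math

def kernel_density(x, exemplars, bandwidth=5.0):
    """Estimate the density at x from stored exemplars using Gaussian kernels."""
    norm = 1.0 / (len(exemplars) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - e) / bandwidth) ** 2) for e in exemplars)

def categorize(x, categories, bandwidth=5.0):
    """Return the label whose stored exemplars give x the highest estimated density."""
    return max(categories, key=lambda label: kernel_density(x, categories[label], bandwidth))

# Hypothetical VOT exemplars in ms (illustrative numbers, not data from any study).
stored = {
    "short-lag": [5.0, 10.0, 12.0, 15.0, 20.0],
    "long-lag": [55.0, 60.0, 70.0, 75.0, 80.0],
}

print(categorize(18.0, stored))
print(categorize(65.0, stored))
```

No mean or variance is ever computed for a category in this sketch. The stored tokens themselves
define the estimate, and each newly stored token changes it.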

Purely distributional learning theories treat each token of a category as independent of neighboring
sounds, ignoring higher-level structure. Feldman, Griffiths, Goldwater, and Morgan (2013)
showed that the learning problem becomes substantially more tractable if one assumes that
children developing L1 phonetic categories learn to categorize speech sounds and to recognize
words simultaneously. However, modeling phonetic category formation in an L2 remains a
considerable challenge when the data sets to which L2 learning models are exposed resemble the
kind of highly variable input L2 learners actually receive (see Antetomaso et al., 2017, for a first
attempt).

The SLM-r proposes that the formation of a new L2 phonetic category for an L2 sound is a
three-stage process. First, an L2 learner must discern a phonetic difference (or differences) between
the realizations of an L2 sound and the L1 sound that is closest to it in phonetic space. Second, a
functional “equivalence class” of speech tokens that resemble one another, and so are close
together in phonetic space, must emerge (Kuhl, 1991; Kuhl et al., 2008; Kluender, Lotto, Holt, &
Bloedel, 1998). By hypothesis, the sounds making up such equivalence classes remain perceptually
linked to the closest L1 sound until the distribution of tokens defining the equivalence class has
stabilized. Third, at a later and as-yet undefined moment in phonetic development, the perceptual
link between the L2 “equivalence” class and the L1 category will be sundered. We speculate that
this delinking may be speeded by growth of the L2 lexicon, at least in literate learners of an L2
(Bundgaard-Nielsen, Best, & Tyler, 2011).

Once delinking has occurred, the development of a new L2 phonetic category will be based on
statistical regularities of the distribution of L2 sounds that are implicitly categorized as instances
of the new L2 phonetic category. The SLM-r regards L2 phonetic category formation as a gradual
process, not a one-time event. Consider, for example, the results of Thorin, Sadakata, Desain, and
McQueen (2018). These authors trained native Dutch (ND) university students to produce and
perceive the English vowels /ɛ/ and /æ/. The training resulted in somewhat greater improvement
for English /æ/ than /ɛ/, which is consistent with the fact that of the two English vowels, /æ/ is
more distant in an F1-F2 space from the closest Dutch vowel, /ɛ/ (see also Díaz, Mitterer,
Broersma, & Sebastián-Gallés, 2012; Flege, 1992). Evidence of discrimination peaks before
training and a shift in post-training phoneme boundaries suggested to the authors that the ND
students already had “weak” phonetic categories for English /æ/ before training, which became
“stronger” as a result of the training.

The SLM-r proposes that new phonetic categories will not be formed by individual learners for an
L2 sound that the learners judge as being too similar phonetically to the closest L1 sound.
Crucially, however, learners do not discard audible phonetic information in such cases. By
hypothesis, a perceptual link between the L2 sound and the closest L1 sound will continue to exist
and a composite L1-L2 phonetic category will develop, defined by the statistical regularities
present in the combined distributions of the perceptually linked L1 and L2 sounds.
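
As a rough illustration of a composite category defined by combined distributions, the sketch
below pools hypothetical L1 and L2 tokens along a single dimension (VOT, in ms) and summarizes
the pooled set. The SLM-r does not specify this computation. The token values, the equal weighting
of every token, and the function name composite_category are assumptions made for the example.

```python
from statistics import mean, pstdev

def composite_category(l1_tokens, l2_tokens):
    """Summarize the pooled distribution of perceptually linked L1 and L2 tokens."""
    pooled = list(l1_tokens) + list(l2_tokens)
    return {"mean": mean(pooled), "sd": pstdev(pooled), "n": len(pooled)}

# Hypothetical VOT values in ms (negative = pre-voicing); not data from any study.
l1_tokens = [-100.0, -90.0, -80.0]  # fully pre-voiced L1 tokens
little_l2 = [10.0]                  # little L2 input (short-lag tokens)
much_l2 = [10.0] * 6                # much more L2 input

early = composite_category(l1_tokens, little_l2)
later = composite_category(l1_tokens, much_l2)
print(early["mean"], later["mean"])
```

With more L2 tokens in the pool, the mean of the composite category moves toward the L2 values.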

3.2.6 L1-L2 interactions

The presence or absence of category formation is the key determinant of how phonetic systems
and subsystems reorganize. A method did not exist in 1995 for determining when a new L2
phonetic category had been formed and, alas, the same holds true today. One new technique we
consider promising is presented in the Supplementary Materials associated with the chapter, but it
has not yet been used in L2 speech research.

Brain imaging techniques have developed substantially since 1995 and may someday provide a
litmus test for L2 category formation. For example, if L2 learners use a new phonetic category
when processing tokens of an L2 vowel, their categorization responses might be associated with
“more efficient neural processing in frontal speech regions implicated in phonetic processing” than
would be the case if a new L2 phonetic category had not been formed (Golestani, 2016, p. 676). It
will be of special interest to determine if the processing of L2 sounds via a new phonetic category
will ever demonstrate the “neural commitment” seen for L1 sounds, namely, focal cortical
representations that persist for relatively brief intervals (Zhang, Kuhl, Imada, Kotani, & Tohkura,
2005).

Meanwhile, interactions between L1 and L2 phonetic categories provide a reflex that is diagnostic
of L2 category formation or its absence. According to the SLM-r, new L2 categories may shift
away from (i.e., dissimilate from) neighboring L1 categories to maintain phonetic contrast between
certain pairs of L1 and L2 sounds. This is because, by hypothesis, the L1 and L2 phonetic categories
of a bilingual exist in a common phonetic space. In the absence of category formation for an L2
sound, on the other hand, the SLM-r predicts a merger of the phonetic properties of an L1 sound
and the L2 sound to which it remains perceptually linked. This may cause the L1 sound to shift
toward (assimilate to) the L2 sound in phonetic space.

We will illustrate the interactions predicted by the SLM-r by considering the results of two studies
examining the same participants. Flege, Schirru and MacKay (2003) provided evidence of L1-L2
category dissimilation. These authors examined the production of English /eɪ/ and Italian /e/ by
four groups of native Italian (NI) immigrants who had lived in Canada for decades. The four NI
groups differed orthogonally in age of arrival in Canada (Early vs. Late) and amount of continued
Italian use (High vs. Low). The 36 participants in two “High-Italian-use” groups reported using
Italian more than the 36 participants in two “Low-Italian-use” groups (means=48% vs. 9%) and
used Italian in more social contexts and with more other NI immigrants than members of the Low-
Italian-use groups did.

English /eɪ/ is produced with substantial formant movement, Italian /e/ with little or none. Acoustic
analyses revealed that the 36 members of the two Late learner groups (Late-low, Late-high)
produced English /eɪ/ in an Italian-like way, that is, with significantly less formant movement than
NE speakers produced. This suggested that many Late learners had not formed new phonetic
categories for English /eɪ/. However, the NI speakers who were likely to have received the most
English native-speaker input, Early-low, produced English /eɪ/ with significantly more formant
movement than NE speakers did. This suggested that not only had members of the Early-low group
formed new phonetic categories for English /eɪ/, but their new L2 categories had also dissimilated from their
Italian /e/ categories to maintain contrast in a common phonetic space.

The SLM-r proposes that category assimilation may occur when a new phonetic category has not
been formed for one or more of the reasons stated above. By hypothesis, composite L1-L2 phonetic
categories develop in such cases. MacKay et al. (2001) obtained evidence of category assimilation
– the opposite of what was just reported for the same participants – in research examining how the
four NI groups produced and perceived /b/ in English and Italian.

Confirming past research, MacKay et al. (2001) found that NE speakers produced English /b/ in
three different ways: with full pre-voicing that continued until stop release, with partial pre-voicing
that ceased before release, or as short-lag stops. Also confirmed was the fact that Italian /b/ is
produced with full pre-voicing. The authors reasoned that phonetic contrast must be maintained
between phonetic elements making up the L1 and L2 phonetic subsystems of bilinguals. That being the
case, the NI speakers could not form a new “short-lag” phonetic category for English /b/ because
Italian /p/ is realized with short-lag VOT values.

MacKay et al. (2001) found that members of the two High-Italian-use groups incorrectly identified
naturally produced short-lag tokens of English /b/, as /p/, more often than did members of the two
Low-Italian-use groups. Members of the High-Italian-use groups produced English [b] in an
English way, that is, as short-lag stops, less often than members of the two Low-Italian-use groups.
The NI speakers likely to have received the most native-speaker English input over the course of
their lives did something that never happens in Italian: they produced Italian /b/ with pre-voicing
that ceased before stop release.

The authors suggested that the NI speakers “restructured” their Italian /b/ categories to varying
degrees for use in English. The SLM-r interpretation is that they developed composite L1-L2
categories based on the combined distribution of the Italian and English /b/ tokens they had heard
over the course of their lives. Depending on how their composite Italian-English /b/ categories
were specified, some NI speakers modified the realization rule used to produce /b/ in both English
and Italian so that it no longer assured pre-voicing that continued without interruption until stop
release.

The L1-L2 interactions under discussion here might also contribute indirectly to cross-dialect
differences. Caramazza and Yeni-Komshian (1974) found that French monolinguals in Quebec
(Canada) produced French /b d g/ with English-like short-lag VOT values far more often than did
French monolinguals in France (means=59% vs. 6%) and they also produced French /p t k/ with
somewhat longer VOT values. We speculate that both effects seen in Canadian French arose from
exposure by French monolinguals in Canada to the French spoken by French-English bilinguals
whose L1 production had been altered by learning and using English.

According to the SLM-r, L1-on-L2 and L2-on-L1 effects arise inevitably because the phonetic
elements of a bilingual’s L1 and L2 subsystems exist in a common phonetic space. Although the
model makes no predictions regarding the magnitude of such effects, it is worth considering three
factors that might modulate them. To begin, Lev-Ari and Peperkamp (2013) measured VOT in
English stops produced by English-French bilinguals who had lived in Paris for many years. These
authors proposed that the magnitude of the L2 (French) on L1 (English) effects they observed may
have been influenced by individual cognitive differences in “inhibitory skill”.

Second, the magnitude of cross-language phonetic effects may depend on how bilinguals deploy
their phonetic categories. We might expect L2-on-L1 effect sizes to increase, for example, the
more strongly the L2 is activated when L1 performance is examined, when bilinguals are
tested in the presence of other bilinguals, when they are using their L1 to discuss topics usually
discussed while speaking the L2, and so on (e.g., Grosjean, 1998, 2001).

Finally, we might expect the L2-on-L1 effect sizes to increase as proficiency in the L2 increases
as the result of more input and use of the L2. This might be evident even in fairly short periods of
time. Casillas (2018), for example, tested NE-speaking university students who differed according
to how many university-level Spanish language classes they had taken. The research examined the
size of a shift in the location of /b/-/p/ phoneme boundaries in VOT continua that lexically induced
“English” and “Spanish” modes of perception. Only the students who had taken the most
Spanish-language classes showed significant phoneme boundary shifts.

3.2.7 Feature weighting


The SLM proposed that the phonetic category formed for an L2 sound by an L2 learner might
differ from the phonetic categories formed for the same sound by monolingual native speakers of
the target L2 if the L2 sound were specified by “features … not exploited” in the learner’s L1 or
if features (perceptual cues) defining the L2 sound were “weighted differently” than the features
specifying the closest L1 sound (Flege, 1995, pp. 239-243).

As we now see it, the earlier SLM “feature hypothesis” was incongruent with the model’s first
postulate, namely, that “the mechanisms and processes used in learning the L1 sound system,
including category formation, remain intact over the life span and can be applied to L2 learning”
(Flege, 1995, p. 239). The SLM-r abandons the earlier SLM feature hypothesis due to the
emergence of research showing that Late learners can gain access to features used to define L2
categories not exploited in L1 (Flege, Aoyama, & Bohn, 2020, this volume). It formally adopts the
“full access” hypothesis proposed by Flege (2005b; see also Escudero & Boersma, 2004). We
dedicate the remainder of this section to justifying the change.

L1 research. As L1 categories develop, the multiple cues defining them are weighted optimally
for correct categorization. For example, research has shown that NE monolinguals use VOT as the
primary cue when categorizing word-initial stops as phonologically voiced or voiceless, making
lesser use of F0 onset frequency (e.g., Whalen, Abramson, Lisker, & Mody, 1993). The secondary
cue, F0, generally exerts a measurable effect on categorization only for a subset of stimuli having
ambiguous VOT values (Kong & Edwards, 2015, 2016; Lehet & Holt, 2017).

Learning to optimally integrate multiple cues to L1 phonetic categories requires years of input
(e.g., Morrongiello, Robson, Best, & Clifton, 1984; Nittrouer, 2004). For example, Idemaru and
Holt (2013) found that, to optimally categorize word-initial English liquids, monolingual NE
children must learn to give greater weight to the onset frequency of F3 than to F2 onset frequency
values. Use of F3 frequency, the primary cue for /r/ categorization by NE adults, develops rapidly,
but the use of a secondary cue, F2, continues to develop beyond 8 or 9 years of age.

Differences in cue weighting depend importantly on cue reliability (Idemaru & Holt, 2011;
Strange, 2011), which, in turn, depends on the statistical properties of input distributions to which
individuals have been exposed during L1 acquisition (Holt & Lotto, 2006, pp. 3060-3062).
Individual differences in cue weighting among monolinguals are likely to arise from exposure to
different input distributions of tokens specifying a phonetic category (Clayards, 2018; Lee &
Jongman, 2018). However, future research must explore the possibility that some individual
differences in cue weighting arise from endogenous differences in auditory acuity, early-stage
auditory processing, or auditory working memory.
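
One simple way to make the link between cue reliability and cue weighting concrete is sketched
below. It is an assumption-laden illustration, not the procedure used in any of the studies cited
above: reliability is operationalized as a d'-like separation of two categories along each cue, and
the weights are the separations normalized to sum to 1. The token values and the function names
d_prime and cue_weights are hypothetical.

```python
import math
from statistics import mean, pvariance

def d_prime(a, b):
    """Separation of two token sets along one cue: mean difference over pooled SD."""
    pooled_sd = math.sqrt((pvariance(a) + pvariance(b)) / 2.0)
    return abs(mean(a) - mean(b)) / pooled_sd

def cue_weights(cat_a, cat_b):
    """Normalize per-cue separations so that the resulting weights sum to 1."""
    separation = {cue: d_prime(cat_a[cue], cat_b[cue]) for cue in cat_a}
    total = sum(separation.values())
    return {cue: value / total for cue, value in separation.items()}

# Hypothetical input distributions (VOT in ms, F0 onset in Hz); not data from any study.
voiced = {"VOT": [5.0, 10.0, 15.0, 20.0, 10.0], "F0": [100.0, 110.0, 120.0, 105.0, 115.0]}
voiceless = {"VOT": [60.0, 70.0, 80.0, 65.0, 75.0], "F0": [110.0, 120.0, 130.0, 115.0, 125.0]}

weights = cue_weights(voiced, voiceless)
print(weights)
```

In this made-up input the two categories barely overlap in VOT but overlap heavily in F0, so VOT
receives most of the weight. Changing the input distributions changes the weights.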

The cue weighting patterns specified in phonetic categories are not applied rigidly by monolinguals
during the categorization of sounds in their native language. Human speech perception is
necessarily adaptive (Aslin, 2014), enabling listeners, for example, to better understand foreign-
accented renditions of their L1 after a brief exposure to foreign-accented talkers (e.g., Bradlow &
Bent, 2003). Also important is the fact that cue weighting may adapt dynamically to what has been
heard recently (Lehet & Holt, 2017; Schertz, Cho, Lotto, & Warner, 2016), and can be modified
through training (Francis, Kaganovich, & Driscoll-Huber, 2008).

Adaptation occurs at the segmental level in both production and perception. Nielsen (2011) found
that NE speakers produced /p/ with significantly longer VOT values after hearing experimental
stimuli with artificially lengthened VOT values (see also Clarke & Luce, 2005). Kraljic and
Samuel (2006) found that NE adults could recalibrate their categorization of stops based on brief
exposure to unusual productions of /t/ and /d/. The NE participants in Kraljic and Samuel
performed a lexical decision task. Half of them heard words in which the target sound, which was
ambiguous between /d/ and /t/, occurred in words known to have /d/ (e.g., crocodile) while the
remaining half of the participants were exposed to the same ambiguous stimuli in words known to
have /t/ (e.g., cafeteria). The participants exposed to ambiguous stimuli in /d/ words gave more /d/
responses in a post-test than those exposed to ambiguous stimuli in /t/ words. McQueen, Tyler,
and Cutler (2012) showed that the ability to recalibrate perception, which they termed “lexically
guided retuning”, is already present in young monolingual children.

L2 research. Iverson and Evans (2007) examined the production and perception of English vowels
by native speakers of Spanish (n=25), French (n=19), German (n=21) and Norwegian (n=18) who
had spent median periods in an English-speaking country that ranged from 0 years (the
Norwegians) to 3 years (the native French participants). The participants’ perceptual
representations for English vowels were defined by having them select the best exemplars of
various L1 and L2 vowel categories from a 5-dimensional array of stimuli differing in F1 and F2
frequencies, formant movement patterns, and duration. Participants whose L1 made little or no use
of duration and formant movement patterns were nevertheless observed to use those dimensions
when selecting the best exemplars of English vowels. This suggested the non-native speakers’
representations for English vowels incorporated information pertaining to these dimensions. This
conclusion regarding the use of previously unneeded dimensions was corroborated by another
finding of the study, namely that significantly fewer English vowels were correctly identified in
noise when the previously unneeded dimensions were neutralized than when those dimensions
remained present in the stimulus array.

Individual differences exist in cue weighting among monolinguals. That being the case, the SLM-
r predicts that individual differences will also be evident in the production and perception of L2
sounds that are perceptually linked to L1 sounds via the mandatory and automatic mechanism of
interlingual identification. Individual differences in cue weighting have in fact been observed in
both early (Idemaru, Holt, & Seltman, 2012; Kim, 2012) and later phases of L2 learning
(Chandrasekaran, Sampath & Wong, 2010; Schertz et al., 2016). The SLM-r proposes that the
influence of L1 cue weighting patterns will be stronger for L2 sounds which remain perceptually
linked to an L1 category than for L2 sounds for which a new L2 phonetic category has been
formed. Cue weighting patterns for newly formed L2 phonetic categories are expected to develop
as in monolingual L1 acquisition, that is, to be based on the reliability of multiple cues to correct
categorization found in input distributions.

In research examining the formation of non-speech auditory categories, Holt and Lotto (2006)
identified effects due to the variability of multiple dimensions in the input provided during
laboratory training that might apply to L2 learning. This research might provide a starting point
for explaining inter-subject variability in L2 cue weighting. As far as we know, however, the
suggestions offered by Holt and Lotto (2006) have not yet been applied to L2 speech learning,
probably because of difficulty in specifying the distribution of speech sounds to which learners of
an L2 have been exposed.

As is the case for monolinguals, the cue weighting patterns evident for L2 learners may reflect the
properties of speech stimuli heard in the recent past (Lehet & Holt, 2017; Schertz, Cho, Lotto, &
Warner, 2016) and change in response to perceptual training (Francis, Kaganovich, &
Driscoll-Huber, 2008; Giannakopoulou, Uther, & Ylinen, 2013; Hu et al., 2016; Schertz et al., 2016; Ylinen
et al., 2010). Seen from the perspective of an attention-to-dimension perceptual learning model
(Francis & Nusbaum, 2002), changes in cue weights induced by training underscore the
importance of attentional allocation, something demonstrated with elegant simplicity by Pisoni et
al. (1982).

L2 research has focused on the identification of differences in cue weighting patterns between
native and non-native speakers. Consider, for example, the perception of English /i/ and /ɪ/. Flege
et al. (1997) examined identification of these vowels in a two-alternative forced-choice (2AFC)
test by adult NE speakers and by two groups of 10 native Korean (NK) adults. The NK groups
differed in FTE years of English input (means=4.3 and 0.3 for the relatively “experienced” and
“inexperienced” groups). The synthetic vowel stimuli varied orthogonally in temporal and spectral
dimensions (duration vs. F1 and F2 frequencies). All 10 NE speakers made predominant use of
spectral cues whereas eight experienced and nine inexperienced NK speakers made predominant
use of temporal cues.
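
To show what "predominant use" of one dimension can look like in responses to orthogonally
varying stimuli, the sketch below computes, for each dimension, how much the proportion of /i/
responses changes across that dimension's steps when averaged over the other dimension. This
is a simplified stand-in for illustration, not the analysis reported by Flege et al. (1997). The
3 x 3 grid, the response proportions, and the function name marginal_range are hypothetical.

```python
def marginal_range(responses, dim):
    """Range of mean /i/-response proportions across the steps of one dimension.

    responses maps (spectral_step, duration_step) -> proportion of /i/ responses.
    dim is 0 for the spectral dimension and 1 for the duration dimension.
    """
    by_step = {}
    for steps, proportion in responses.items():
        by_step.setdefault(steps[dim], []).append(proportion)
    step_means = [sum(values) / len(values) for values in by_step.values()]
    return max(step_means) - min(step_means)

# Hypothetical listeners on a 3 x 3 stimulus grid (illustrative proportions only).
by_step_value = [0.1, 0.5, 0.9]
spectral_listener = {(s, d): by_step_value[s] for s in range(3) for d in range(3)}
temporal_listener = {(s, d): by_step_value[d] for s in range(3) for d in range(3)}

for name, listener in [("spectral", spectral_listener), ("temporal", temporal_listener)]:
    print(name, marginal_range(listener, 0), marginal_range(listener, 1))
```

A listener whose responses change only with the spectral steps shows a large range on dimension 0
and none on dimension 1. The reverse holds for a listener who relies on duration.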

Similarly, Kim, Clayards and Goad (2018) examined the use of spectral and temporal cues in an
/i/-to-/ɪ/ continuum by NK women and their children. These authors obtained a first sample 2.2
months after their participants had arrived in Canada to study English, and also 4, 8, and 12 months
after the participants’ arrival in Canada. As expected, the NE speakers made greater use of spectral
than temporal cues whereas the reverse held true for the NK speakers. However, the NK speakers
made somewhat greater use of spectral cues over the course of the longitudinal study, with greater
movement towards the English pattern evident for NK children than adults.

Although the findings of Flege et al. (1997) and Kim et al. (2018) were similar, the results of both
studies were difficult to interpret because the NK participants in both studies often reversed
category labels, something not seen in the NE participants’ responses. The reversals may have
reflected confusion by some NK participants regarding how to associate the two vowel categories
with the response alternatives that were offered (written labels in Flege et al., 1997, pictures in
Kim et al., 2018). Also, use of a 2AFC task might not have permitted listeners to adequately report
phonetic-level categorization (see, e.g., Pisoni et al., 1982).

The adult-child difference observed by Kim et al. (2018) is of special interest given that it was
obtained longitudinally. However, interpretive difficulty exists here as well because of differences
in the contexts of L2 learning for participants differing in age (see Flege, 2019). The NK adults
attended English classes 22 hours per week. The children attended school five days a week (hours
not reported) and also studied English after school an average of 31 hours per week. We can infer
that the children obtained more L2 input than their mothers, and possibly more native-speaker
input as well.

Other L2 cue weighting research has examined use of VOT and F0 as cues to the categorization
of L2 stops. As already mentioned, VOT is the primary cue to NE-speaking listeners’
categorization of word-initial stops as /t/ or /d/ and F0 onset frequency is a secondary cue. In
Korean, on the other hand, F0 is often the primary cue and VOT is often (but not always) a
secondary cue. Schertz et al. (2015) examined NK speakers’ use of the two cues when perceiving
English stops and measured the two dimensions in their English productions. The NK speakers
produced reliable VOT and F0 differences between English stops. Greater variability was evident
for perception than production, with some NK speakers using VOT as the primary cue, some using
F0 as the primary cue and some using both cues.

The results of Kong and Yoon (2013) provide insight into the role of input in the modification of
cue weighting patterns. These authors examined the use of VOT and F0 in the perception of
English stops by two groups of high-school students in Seoul. The stimuli consisted of an array of
tokens in which VOT and F0 onset frequency varied orthogonally. The “Low proficiency” students
were enrolled in a regular high school whereas the “High proficiency” students attended a special
foreign-language high school. The students judged stimuli using a visual analog scale with
endpoints ranging from “d-like” to “t-like”. The two groups differed little in their use of the VOT
dimension, but members of the Low-proficiency group were more sensitive to F0 variation than
members of the High proficiency group. This suggested that the more experienced students
reduced use of F0, the primary cue in Korean, when perceiving stops in English, where F0 is a
secondary cue.

Finally, the results obtained by Dmitrieva (2019) suggested that L2 learners can learn to reverse
the cue weighting pattern evident in their L1 when perceiving L2 sounds. Dmitrieva (2019) tested
34 monolingual speakers each of both English and Russian and also 37 Russian-English bilinguals
who had lived in the United States for an average of 39 years (all but four of whom had arrived
after the age of 15 years). The participants labelled, as /k/ or /g/, randomly presented stimuli from
an array differing orthogonally in duration of glottal pulsing present in the closures of word-final
stops and the duration of vowels preceding the final stops (six steps each). Russian monolinguals
relied more on glottal pulsing than vowel duration whereas English monolinguals showed the
reverse pattern. When tested in English, the bilinguals as a group did not differ from the
monolingual English group. That is, many or most of the Russians had learned to give greater
weight to vowel duration than glottal pulsing. When tested in Russian, the bilinguals were found
to make greater use of vowel duration than the Russian monolinguals did. This suggested that
learning to optimally categorize word-final English stops had modified how the bilinguals
perceived stops in their L1, Russian.

3.2.8 Individual differences in speech learning ability

The extent of native vs. non-native differences generally diminishes over time, especially for
individuals who began learning the L2 in childhood, but such differences are usually found to persist
even in many Early learners (Flege, 2019). An important objective of L2 research is to account for
inter-subject variability, that is to say, the magnitude of persistent native vs. non-native differences.
We have emphasized the importance of input as an explanation of inter-subject variability. What,
then, accounts for differences between individuals who seem to have received similar L2 input?

Individual differences in the capacity to learn speech at any age might contribute to the inter-
subject variability evident in L2 research. The time children need to attune fully to the ambient
language phonetic system varies (e.g., Smit, Hand, Freilinger, Bernthal, & Bird, 1990) but, in the
end, typically-developing children learn to speak their L1 without noticeable pronunciation errors.
L2 speech learning is of course different. It seems reasonable to assume that individual differences
in L1 learning will be evident later when an L2 is learned but we know of no systematic research
testing this hypothesis. Preliminary support for the hypothesis is the finding that “good” and “poor”
perceivers of L2 sounds may differ in perceiving L1 sounds when L1 perception is assessed
with sufficient sensitivity at the appropriate processing level (Díaz, Mitterer, Broersma, Escera, &
Sebastián-Gallés, 2015). Given that most children eventually learn L1 speech adequately, the
SLM-r proposes that endogenous differences in the capacity to learn speech, should they exist,
will primarily impact how much input is needed to reach specific speech learning milestones rather
than whether or not such milestones are eventually reached.

Differences in auditory acuity and in early-stage (pre-categorical) auditory processing are perhaps
the most obvious factors to consider in the search for endogenous individual differences that will
influence L2 speech learning. However, two early L2 speech studies dampened interest in auditory
factors. Stevens, Liberman, Studdert-Kennedy, and Öhman (1969) observed little difference
between native speakers of English and Swedish in the discrimination of front rounded vowels
found in Swedish but not English. This suggested that the auditory detection of differences
between Swedish vowels was just as possible for listeners who were unfamiliar with Swedish
vowels as it was for Swedes. Similarly, Miyawaki et al. (1975) found that native Japanese speakers
who were unable to discriminate English “ra” and “la” syllables had no trouble discriminating the
third formant (F3) components of the same stimuli when presented in isolation. Given that F3
frequency is the most important cue to the identification of a consonant as /r/ in English, many
interpreted this finding to mean that the notorious “Japanese r-l problem” resided at a phonetic
and/or phonological level rather than at an auditory level (for discussion, see Iverson, Wagner, &
Rosen, 2016).

The findings just mentioned do not necessarily mean that individual differences in auditory acuity,
early-stage (pre-categorical) auditory processing and auditory memory are unimportant. The
influence of endogenous “auditory” differences may be more evident in early than later stages of
L2 learning. Moreover, effects of auditory-level processing differences are usually evident only in
specific task conditions (e.g., Werker & Logan, 1985). In fact, later L2 research using methods
that required the phonetic categorization of speech sounds soon revealed important effects of L1
background, including effects attributable to language experience and phonetic context (e.g.,
Gottfried, 1984; Levy & Strange, 2008).

Normal-hearing monolingual adults differ in terms of auditory acuity, for example, in how finely
they can discriminate formant frequency differences in native language vowels (Kewley-Port,
2001). Differences exist in early-stage (pre-categorical) auditory processing, such as the amplitude
of the frequency-following response (Galbraith, Buranahirun, Kang, Ramo, & Lunde, 2000;
Hoormann, Falkenstein, Hohnsbein, & Blanke, 1992). Kidd, Watson, and Gygi (2007) identified
individual differences in four basic auditory abilities, and also in the ability to recognize familiar
non-speech auditory events. This last auditory capability may have derived from individual
differences in auditory working memory and/or attentional allocation.

The findings just mentioned raise the question of whether auditory acuity, and more generally pre-
categorical auditory processing (Iverson, Wagner, & Rosen, 2016), affects how much native-
speaker input is needed to learn L2 sounds once the L1 phonetic system has been established
(Kachlicka, Saito, & Tierney, 2019, p. 16). Existing research suggests that it does.

Hazan and Kim (2010) showed that auditory sensitivity to F2 frequency was the best single
predictor of how much NE speakers benefited from computer-based training on a phonetic contrast
found in Korean but not English. Lengeris and Hazan (2010) administered perceptual training on
English vowels to 18 native Greek (NG) adults living in Athens. The L2 perceptual training yielded
robust improvements in the accuracy with which the NG participants identified and produced
English vowels. The NG participants’ ability to discriminate non-speech stimuli (isolated F2
formants) was evaluated before training. Their ability to identify English vowels after the training
correlated significantly with their discrimination of non-speech sounds, Greek vowels, and English
vowels prior to training (correlations ranging from r=0.55 to 0.56). All three variables were also
found to correlate significantly (r=0.52 to 0.68) with measures of post-training English vowel
production accuracy.

Kachlicka et al. (2019) evaluated the relation between auditory processing and the identification of
two pairs of English vowels that are known to be difficult for native Polish (NP) speakers. The
authors tested 40 Poles who had arrived in the United Kingdom after the age of 18 years and had
lived in the UK for 1 to 6 years (mean=3.6 years). The NP participants reported using English from
18% to 97% of the time (mean=66%) and had studied English at school for 0.5 to 20 years
(mean=9.4 years). The percentage of correct identifications of English vowels correlated
significantly with measures of the NP participants’ spectral processing and neural encoding of F2
formant frequency.

Individuals also differ in how they process, code, and store L2 sounds in memory. Golestani,
Molko, Dehaene, LeBihan, and Pallier (2007) found that some native French (NF) speakers rapidly
learned to distinguish an unfamiliar (foreign) phonetic distinction between dental and retroflex
stops, but other NF speakers did so poorly or not at all. The behavioral difference between “fast”
and “slow” learners was related to individual differences in functional neuroanatomy and
lateralization of language processing.

The ability to accurately mimic sounds is crucial for the learning of both L1 and L2 sounds. Most
people readily note occasional disparities between what they meant to say, as specified by their
phonetic categories, and self-heard vocal output. This is because talkers monitor their vocal
output in real time via pathways that connect self-hearing, on the one hand, and phonetic
categories and realization rules, on the other hand (e.g., Guenther, Hampson, & Johnson, 1998).

Reiterer et al. (2013) identified two subgroups of native German (NG) speakers for a study
examining the production of German with a feigned English accent. Members of the two groups
were equally able to speak English but differed in ability to mimic sentences from an unknown
foreign language in a pre-test. Members of the “High-” and “Low-mimicry-ability” groups later
produced German and English sentences without special instruction and were also asked to
produce German sentences with an English foreign accent based on their prior exposure to English-
accented German (see also Flege & Hammond, 1982).

Acoustic analyses and brain imaging data in Reiterer et al. (2013) pointed to between-group
differences in the ability to access, store and retrieve “auditory episodic events”. Members of the
High-mimicry-ability group were said to deploy more “detailed phonetic” knowledge of English
sounds and to have greater “phonological awareness” than members of the Low-mimicry-ability
group when producing their L1 (German) with a feigned L2 (English) foreign accent. Brain
imaging data revealed between-group differences in the strength and extent of activation in the
sensory motor cortex in a zone within the left inferior parietal cortex thought to “integrate aspects
of speech production, phonological representations [and] working memory”. The authors
suggested that this may have been a “compensation strategy” by members of the Low-mimicry-
ability group arising from a “generally weaker auditory working memory”.

The importance of auditory working memory for L2 speech learning was shown by MacKay,
Meador, and Flege (2001). These authors tested the hypothesis that variation in phonological short-
term memory (PSTM) will exert a long-term effect on the identification of consonants in an L2. A
total of 72 normal hearing native Italian (NI) speakers who had all lived in Canada for at least 15
years (mean=35 years) participated. The NI participants differed substantially in age of arrival in
Canada from Italy, years of Canadian residence, self-reported frequency of Italian use and
competence in Italian. PSTM was assessed by having the NI speakers repeat Italian non-words
ranging in length from two (quite easy) to five syllables (very difficult). The PSTM scores were
found to correlate significantly with accuracy in identifying both word-initial and word-final
English consonants (r=0.42 and 0.53, p<.001).

The PSTM hypothesis was confirmed by regression analyses examining the influence of arrival
age, years of residence, frequency of L1 use, L1 competence and the PSTM scores on consonant
identification accuracy. The NI participants’ ages of arrival in Canada and their competence in
Italian together accounted for 25.1% of the variance for word-initial consonants and 18.9% of the
variance in the identification of word-final English consonants. The PSTM scores were found to
account for a significant 7.8% increase in the variance accounted for in the identification of word-initial
consonants and a significant 14.8% increase for word-final English consonants when entered into
the regression model following the other predictor variables.
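The arithmetic behind this sequential regression can be made explicit. The following is a minimal sketch, assuming that the 25.1% and 18.9% figures reported above represent the variance accounted for before the PSTM scores were entered; the helper name `total_variance_explained` is hypothetical and is not taken from MacKay et al. (2001).

```python
# Percentages of variance as reported above for MacKay et al. (2001).
# Assumption: "step1" is the variance accounted for before PSTM was entered.
REPORTED = {
    "word-initial": {"step1": 25.1, "pstm_increment": 7.8},
    "word-final": {"step1": 18.9, "pstm_increment": 14.8},
}


def total_variance_explained(step1: float, increment: float) -> float:
    """Total variance (percent) accounted for once the last predictor is added."""
    return round(step1 + increment, 1)


for position, values in REPORTED.items():
    total = total_variance_explained(values["step1"], values["pstm_increment"])
    print(f"{position} consonants: {total}% of variance after adding PSTM")
```

Under this assumption, the PSTM scores nearly double the variance accounted for in the identification of word-final consonants, while the increase for word-initial consonants is more modest.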

Many other individual differences that might potentially influence the basic capacity to learn
speech have been identified in the literature. These include, to name but a few, musical ability
(e.g., Slevc & Miyake, 2006), selective attention (e.g., Mora & Mora-Plaza, 2019), and phonemic
coding ability (Saito, Sun, & Tierney, 2019). The role of these other variables will need to be
investigated in greater detail in research focusing on individuals at both early and later stages of
L2 learning.

3.2.9 Individual differences in L1 phonetic categories

Many L2 speech studies have compared the performance of groups of L2 learners, for example,
immigrants differing in age of arrival in a predominantly L2-speaking country (“Early” vs. “Late”)
or frequency of L2 use (“High” vs. “Low”). This approach tacitly assumes that all monolingual
native speakers of the same L1 have identical, or at least very similar phonetic categories, but this
assumption is not always well founded. Differences have been observed in how individual
monolinguals produce and perceive native-language vowels and consonants.

Hillenbrand, Getty, Clark, and Wheeler (1995) observed substantial differences in how native
English (NE) men, women and children produced American English vowels. The differences were
especially evident in the production of certain pairs of English vowels. Consider, for example, the
vowels found in the English words bed (/ɛ/) and bad (/æ/). NE speakers normally make greater use
of spectral quality than duration cues to distinguish these vowels (e.g., Flege et al., 1997), but Kim
and Clayards (2019) observed “substantial” individual differences in use of these cues in both
perception and production (p. 781). Cebrian (2006) observed that NE speakers produce the vowel
in bait (often symbolized as /eɪ/) with varying amounts of diphthongization. Individual differences
in L1 vowel categories may be evident in a perception experiment examining which stimuli in an
array listeners “prefer” (Frieda et al., 2000).

Individual differences also exist for L1 consonants. For example, individual NE monolinguals may
produce English fricatives with substantial differences in centroid frequency and skew (Newman,
Clouse, & Burnham, 2001). Some NE speakers always produce /b d g/ with pre-voicing whereas
at least some NE speakers produce only short-lag VOT values (Dmitrieva, Llanos, Shultz, &
Francis, 2015). NE monolinguals always realize /p t k/ with long-lag VOT values, but the actual
values vary substantially and consistently across individuals (e.g., Allen, Miller, & DeSteno, 2003;
Theodore, Miller, & DeSteno, 2009). These individual differences in VOT production are evident
both in connected speech and in the production of isolated words, and are systematic in the sense that
individuals who produce relatively long VOT values for /p/ will do the same when producing /t/
and /k/ (Chodroff & Wilson, 2017). Moreover, individual differences in VOT production may
correspond to the VOT values that NE listeners “prefer” in a goodness rating task (Newman, 2003).

L1 phonetic category differences in children may arise from exposure to different dialects
(Docherty et al., 2011; Evans & Iverson, 2004) as well as to other statistically defined differences
in input distributions that cannot be attributed to dialect differences (e.g., Theodore, Monto, &
Graham, 2020). Listeners are able to detect and remember systematic differences between
individual talkers and to exploit this knowledge in speech perception. Word recognition research
has shown that individual differences in the production of L1 speech sounds may permit listeners
to recognize the identity of a talker, even when indexical properties of the voice have been removed
(Remez, Fellowes, & Rubin, 1997), and to better recognize words spoken by familiar talkers
(Nygaard, Sommers, & Pisoni, 1994; Allen & Miller, 2004).

The SLM-r proposes that individual differences in L1 phonetic categories will sometimes impact
L2 speech learning. For example, when first arriving in Spain, a NE speaker who produces little
formant movement in the English word bait (/eɪ/) will likely be observed to produce the Spanish
/e/, which shows little formant movement, more accurately than a NE speaker who produces the
same English vowel with substantial formant movement (Cebrian, 2006).

To illustrate this concept further, let’s consider the hypothetical case of NE speakers learning
Danish. As shown in Figure 3, Flege, Frieda, Walley, and Randazza (1998) obtained mean VOT
values for 60 English words as spoken by 20 NE monolinguals. The mean values ranged from 60
to 116 ms. Imagine that an investigator wished to evaluate the role of quantity of L2 input in a
study examining how two groups of NE speakers who were matched for length of residence in
Denmark but differed in frequency of Danish use (“Low use” < 20%, “High use” > 80%) produced
Danish stops. Imagine, further, that the High-use group included individuals who produced English
stops with a mean VOT value of 60 ms (participant “A”) and 116 ms (“B”) when they arrived in
Denmark.

Garibaldi and Bohn (2015) found that long-time NE-speaking residents of Denmark produced /t/
with comparable VOT values in English and Danish words (means=91.7 and 90.7 ms,
respectively). Both values were shorter on average than the VOT values produced in Danish words
by Danes (mean=140.0 ms). If participants A and B were both members of the “High use” group
and both had lived in Denmark for 3 years, B would likely be observed to produce Danish stops
just like many Danes whereas A would likely be found to produce Danish stops with VOT values
that were far too short for Danish. A researcher who did not know how A and B produced English
stops upon arrival in Denmark would erroneously conclude that B had somehow made better use
of Danish input than A did, and would be tempted to attribute this finding to an unknown individual
difference in speech learning aptitude.

Individual differences in the specification of L1 categories are rarely mentioned as a possible
explanation for inter-subject variability in L2 speech learning. A study hinting at this kind of
explanation was that of Mayr and Escudero (2010). These authors examined the production of
German vowels by seven NE college students who had studied German at school in England and
eight others who had also studied for a year in Germany. The NE students were observed to differ
in how they classified German vowels in terms of 14 different English vowels but no systematic
difference between the subgroups was observed in how accurately they produced German vowels.
NG-speaking listeners correctly identified only slightly more German vowels produced by the
study-abroad students than by the students who had only studied German at school in England
(means=59% vs. 50%). Mayr and Escudero (2010) suggested that differences in cross-language
mapping patterns may have contributed to the individual differences and that differences in the
native L1 dialects of the students might also have played a role (2010, p. 293).

Two other studies are relevant to the SLM-r hypothesis that differing L1 categories may lead to
differences in L2 speech learning. Escudero and Williams (2012) examined the AXB
discrimination and forced-choice identification of Dutch vowels by native Spanish speakers from
Spain and Peru. The authors attributed differences in the perception of some Dutch vowels between
the two Spanish dialect groups to acoustic phonetic differences in the realization of vowels in the
native L1 dialect (2012, p. EL411). We suspect that the participants differing in L1 dialect had
somewhat different L1 vowel categories. Chládková and Podlipský (2011) examined the
perceptual assimilation of Dutch vowels to L1 vowels by native speakers of Bohemian and
Moravian Czech. The two dialects of Czech have the same phonemic inventory (five pairs of
phonemically long and short vowels) but differ in how some vowels are specified phonetically.
The authors observed differences between the two dialect groups in the perceptual assimilation of
certain Dutch vowels to vowels in the native dialect but, unfortunately, did not determine how
these differences affected the participants’ accuracy in producing and perceiving Dutch vowels.

Some reported individual differences in L2 production may have arisen from inappropriate
elicitation techniques (see the Supplementary Materials) but individual differences that are reliable
need to be explained. The SLM-r proposes that explanations for many or most inter-subject
differences in L2 production and perception can be obtained by examining how individual learners
(1) specified L1 phonetic categories, both in terms of cue weighting and degree of category
precision, when they began learning an L2; (2) how they mapped L2 sounds onto L1 categories;
(3) how dissimilar they perceived L2 sounds to be from the closest L1 sound in their individual
phonetic inventory; and (4) how much and what kind of L2 input they received. Should
phonetically-based explanations not account for true individual differences in the production and
perception of L2 sounds, it will become necessary to evaluate the role of endogenous factors that
might influence whether new phonetic categories have or have not been formed for L2 sounds.
This includes probing for individual differences in auditory acuity, early-stage (pre-categorical)
auditory processing, and auditory working memory.

3.2.10. L2 speech learning landmarks

The SLM focused on between-group differences whereas the SLM-r focuses on how individuals
learn L2 sounds and how L2 learning influences their production and perception of L1 sounds.
This fundamental change in orientation goes well beyond the choice of a particular statistical
analysis technique. It requires new research designs and new ways to interpret the data patterns
they yield (Iverson & Evans, 2007, p. 2843).

The new SLM-r focus on individual learners was prompted by evidence that individuals may bring
somewhat different L1 categories to the task of L2 learning, and the observation that focusing on
groups may obscure differences between individuals (Hazan & Rosen, 1991, p. 197; Markham,
1999). Two practical considerations also prompted the decision to make individual learners the
primary unit of analysis for research carried out within the SLM-r framework. First, it is often
difficult or impossible to constitute groups differing in a single variable (e.g., “High” vs. “Low-
input” groups). Second, it is sometimes impossible to draw meaningful conclusions from grouped
data. For example, Escudero, Benders, and Lipski (2009) examined the use of spectral and
temporal cues to the Dutch /a:/-/ɑ/ contrast by native Dutch (ND) and Spanish (NS) speakers. The
authors reported that the NS group made significantly greater use of temporal cues than the ND
group did but 14 (37%) of the NS speakers made greater use of spectral than temporal cues.

Figure 3 provides evidence that convinces us, at least, of the need to focus on individual learners
of an L2. This figure shows the mean VOT values produced in 60 /t/-initial English words by
native Spanish (NS) speakers who arrived in the United States at or after the age of 16 years. Some
of the Late learners in this sample produced English /t/ with Spanish-like short-lag VOT values.
Others produced English /t/ with long-lag VOT values resembling those of native speakers, and
still others produced the English /t/ with mean VOT values that fell somewhere in between the
values typical for Spanish and English models. Appending labels to arbitrarily selected subsets of
the NS Late learners (e.g., “No learners”, “Slow learners”, “Superstars”) is tempting but cannot
explain the inter-subject variability.

Working within the SLM-r framework requires obtaining enough data from each participant to
permit treating each individual as a separate experiment. Meeting this condition makes it possible
to determine if an individual has or has not achieved specific L2 speech learning “landmarks”.
Consider, for example, the identification of word-final English stops by native Russian (NR)
speakers. For Russian monolinguals, closure voicing is a far more important cue for the
identification of word-final stops as /k/ or /g/ than is preceding vowel duration. Individual NR
speakers living in the United States either do or do not make significant use of vowel duration
when identifying final stops as /k/ or /g/, and they either do or do not learn to weight vowel duration
more highly than closure voicing (Dmitrieva, 2019). A statistically significant use of vowel
duration as a perceptual cue, and a switch from closure voicing to vowel duration as the primary
perceptual cue to stop voicing in English, are specific L2 landmarks that might be assessed in L2
speech learning research. Examples of other such landmarks are, for Korean learners of English, a
switch from primary use of spectral cues to primary use of temporal cues in the identification of
English vowels as /ɛ/ or /æ/ (Flege et al., 1997; Kim et al., 2018), and the production of English /r/
by native Japanese speakers with or without overlap in the F2 and F3 values (Iverson et al., 2005;
Flege, Aoyama, & Bohn, 2020, this volume).
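The two landmarks described above for NR listeners can be expressed as a simple per-individual check. The sketch below is purely illustrative: the class `CueUse`, the function `landmarks_reached`, and the numerical values are all hypothetical, and the "weights" stand in for whatever within-listener effect-size measure an investigator might choose.

```python
from dataclasses import dataclass


@dataclass
class CueUse:
    """Hypothetical summary of one listener's cue use for word-final /k/-/g/."""
    vowel_duration_weight: float       # assumed effect size for vowel duration
    closure_voicing_weight: float      # assumed effect size for closure voicing
    vowel_duration_significant: bool   # outcome of a within-listener test


def landmarks_reached(listener: CueUse) -> dict:
    """Report which of the two landmarks a single listener has reached."""
    uses_duration = listener.vowel_duration_significant
    duration_primary = uses_duration and (
        abs(listener.vowel_duration_weight) > abs(listener.closure_voicing_weight)
    )
    return {
        "significant_use_of_vowel_duration": uses_duration,
        "vowel_duration_is_primary_cue": duration_primary,
    }


# Invented values for two imaginary NR listeners (not data from any study).
listener_a = CueUse(0.2, 1.5, False)
listener_b = CueUse(1.4, 0.6, True)
print(landmarks_reached(listener_a))
print(landmarks_reached(listener_b))
```

Each participant is evaluated separately, in keeping with the idea of treating each individual as a separate experiment.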

The inter-subject variability illustrated in Figure 3 highlights the need to understand why
individual L2 learners may differ from one another. As we see it, this may require knowing how
individuals specified the closest L1 phonetic category (presumably Spanish /p/, /t/ and /k/ in the
current context) when they were first exposed to their L2 (here, English), how they mapped target
L2 sounds (/p t k/) onto L1 sounds when first exposed to the L2, how phonetically dissimilar the
L2 sounds were judged to be from relevant L1 sounds, and the quantity and quality of the L2
phonetic input that individuals received (see Flege & Wayland, 2019, for discussion). It is also
important to know if the individuals under examination have or have not formed new phonetic
categories for the target L2 sounds of interest (here, English /p/, /t/ and /k/) and whether the
presence or absence of category formation for the L2 sounds of interest was influenced by
individual differences in auditory acuity, pre-categorical auditory processing, and auditory
working memory.

3.2.11. Speech learning analyses

Observing when, or if, speech learning milestones have been achieved by individual L2 learners
is the first step in data analysis. Consider, for example, the application of this approach to the
learning of word-final English stops by native Russian (NR) speakers. Acoustic phonetic
distinctions between /b/-/p/, /d/-/t/, and /k/-/g/ in the final position of Russian words are
incompletely neutralized (e.g., Dmitrieva, Jongman, & Sereno, 2010; Kharlamov, 2014).
Dmitrieva (2019) tested NE monolinguals, NR monolinguals and Russians learning English in the
United States. Russian monolinguals made less perceptual use of vowel duration, and more use of
closure voicing, to distinguish /g/ from /k/ in the final position of Russian words than NE
monolinguals did for English words. The NR learners of English showed an increased use of vowel
duration as a perceptual cue and, in fact, some were found to closely resemble NE monolinguals
in this regard.

Imagine a similar study examining the categorization of word-final stops by 100 NE monolinguals
and 100 NR immigrants to the United States who are tested twice, the first time when the Russians
have lived for one year in the United States (Time1) and a second time three years later (Time2).
If 98 of the 100 NE monolinguals are found to make significant use of vowel duration at both
Time1 and Time2, and if a significantly larger number of the NR participants are found to do so
at Time2 than Time1 (say, 50 vs. 10), it would demonstrate phonetic learning by the Russians.
Such an analysis would not tell us, however, why 10 NR participants already showed a significant
use of vowel duration at Time1 nor why 50 had still not done so at Time2. A more comprehensive
study would be needed to answer these crucial questions. Linear mixed-effect models (e.g.,
Magezi, 2015) could be developed to draw more general conclusions about how L2 speech is
learned over time in a comprehensive longitudinal study. We will now illustrate the kind of
research questions that might be addressed within the SLM-r framework by considering a
hypothetical longitudinal study.
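Before turning to that study, the group-level comparison just described (50 vs. 10 of the 100 NR participants) can be sketched. Because the same individuals are tested twice, one possible test, which is our choice for illustration and is not prescribed by the model, is an exact McNemar test on the participants whose status changed between Time1 and Time2. The sketch assumes, hypothetically, that all 10 participants who used vowel duration at Time1 still did so at Time2, so that 40 participants gained the landmark and none lost it.

```python
from math import comb


def mcnemar_exact_p(gained: int, lost: int) -> float:
    """Two-sided exact McNemar p-value based on the discordant pairs only."""
    n = gained + lost
    if n == 0:
        return 1.0  # no participant changed status
    k = min(gained, lost)
    # Binomial tail probability under the null hypothesis of equal change rates
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed pattern: 40 participants reached the landmark by Time2, none lost it.
p_value = mcnemar_exact_p(gained=40, lost=0)
print(f"p = {p_value:.2e}")
```

A very small p-value here would demonstrate phonetic learning at the group level but, as noted above, would say nothing about why particular individuals did or did not reach the landmark.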

The aim of the hypothetical study we have in mind would be to determine how NR speakers
without special training or aptitude learn to produce and perceive /b d g p t k/ in the final position
of English words. All participants would be 20 years or older when they arrived in the United
States. Crucially, all would (a) be enrolled in the study within 4 months of their arrival in the
United States, (b) have had similar formal education in English in Russia, and (c) have had little experience
conversing in English before arriving. Their later experiences learning English in the United States
and their auditory capacities, on the other hand, would be expected to vary.

The Time0 sample in the hypothetical study would focus on Russian. A NR monolingual would
test NR speakers soon after their arrival in the United States. The NR participants would be asked
to produce and perceive all six Russian stops in the final position of Russian words and provide
information pertaining to prior experience in English. Individual differences in auditory acuity,
early-stage (pre-categorical) auditory processing, and auditory working memory would also be
evaluated at Time0. Finally, the participants would be asked to categorize productions of the six
Russian stops in a 6-alternative forced-choice task and rate the perceived phonetic dissimilarity of
pairs of the corresponding English and Russian stops (36 pair types in all). The auditory capacity
tests administered at Time0 would be re-administered at yearly intervals, but the results are not
expected to change over time.
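The figure of 36 pair types can be made concrete. The following is a minimal sketch, assuming that the 36 pair types arise from crossing each of the six English stops with each of the six Russian stops; the variable names are hypothetical.

```python
from itertools import product

STOPS = ["b", "d", "g", "p", "t", "k"]

# Every English stop paired with every Russian stop: 6 x 6 pair types.
pair_types = [
    (f"English /{english}/", f"Russian /{russian}/")
    for english, russian in product(STOPS, STOPS)
]

print(len(pair_types))
print(pair_types[0])
```

Under this assumption, six of the pair types involve corresponding stops (e.g., English /b/ with Russian /b/) and the remaining 30 involve non-corresponding stops.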

The samples obtained at Time1 and in subsequent samples, which focus on English, would be
elicited in English by an English monolingual. In each sample, the NR participants would be asked
to produce the six English stop consonants in the final position of English words, categorize
naturally produced tokens of English /b/, /d/, /g/, /p/, /t/ and /k/, report how many hours per week
overall they have used English to communicate verbally, and how many of those hours per week
they used English with NE speakers. We anticipate that many NR participants would report
increased use of English over the course of the longitudinal study but that important individual
differences would be evident in English use, especially use of English with NE speakers.

The predictor variables to be examined in the study would be the NR participants’ (1) use of vowel
duration and closure voicing to categorize Russian stops at Time0; (2) the precision of the phonetic
categories they have developed for all six Russian stops when assessed at Time0; (3) their auditory
capacity at Time0; (4) estimated total hours of weekly use of English at Time1 and in subsequent
samples; and (5) estimated hours of English use with English monolinguals at the same sampling
intervals. The dependent variables to be examined at Time1 and in subsequent samples would be
the NR participants’ use of vowel duration and closure voicing in the categorization and production
of all six word-final English stops. The accuracy with which these stops have been produced would
be assessed both acoustically and via listener judgements.

Among the research questions to be addressed are the following:

1. Will the NR participants show increasingly less use of closure voicing, and increasing use of
preceding vowel duration to distinguish /b/-/p/, /d/-/t/ and /k/-/g/ as they gain experience in
English?

According to the SLM-r, L2 speech learning depends on the input distributions of L2
sounds to which learners have been exposed. The expectation here is that the NR
participants will make increasing use of vowel duration and decreasing use of closure
voicing when categorizing English stops as a result of the greater reliability of the former
than the latter cue in the speech of NE speakers. Given the SLM-r hypothesis that production
and perception co-evolve, the model predicts the same trends in production.

The amount of input needed to achieve various L2 speech learning landmarks, and the
vowel duration and closure voicing effect sizes evident in subsequent analyses, may be
modulated by individual differences in auditory capacity. Specifically, the SLM-r predicts
that individuals with relatively limited auditory capacities will need more native-speaker
input to achieve the same landmarks (or effect sizes) than individuals with superior
auditory capacities. Evidence that some participants show no evidence of speech learning
would seriously undermine the SLM-r if that evidence of “failure to learn” cannot be
attributed to a paucity of English input, to inadequate auditory capacity, or to some
combination of both.

2. Will the NR participants bring somewhat different Russian phonetic categories to the task
of learning English, and will they differ in terms of how precisely their Russian categories
are defined?
We expect a positive answer to both questions. That being the case, the SLM-r predicts that
(a) individual differences in L1 category specification (e.g., whether individual participants
exploited vowel duration in Russian before arrival in the United States) and L1 precision
(see section 3.2.4) will influence how phonetically dissimilar the participants perceive
corresponding pairs of Russian and English stops to be at Time0, and (b) degree of
perceived cross-language phonetic dissimilarity at Time0 will subsequently influence the
extent to which individual NR participants approximate how most NE monolinguals
produce and categorize English /b/, /d/, /g/, /p/, /t/, and /k/.
3. Will the NR participants show greater evidence of learning for the voiced English stops
(/b/, /d/, /g/) than for the voiceless English stops (/p/, /t/, /k/)? Will they perceive English
/b/, /d/ and /g/ to be phonetically more dissimilar from corresponding “voiced” Russian
stops than they perceive English /p/, /t/ and /k/ to be from corresponding Russian stops? Will
perceived cross-language dissimilarity predict the amount of learning evident for the six
English stops?
4. A positive response to all three questions in #3 would suggest the formation of new
phonetic categories for English /b/, /d/ and /g/ but not English /p/, /t/ and /k/. This
interpretation could be evaluated within the SLM-r framework by repeating the Time0
sample, which focused on Russian stops, when English data collection has been completed.
The production and perception of Russian voiced stops are predicted to remain unchanged
if new categories have been established for the corresponding English stops (/b/, /d/, /g/)
whereas the production and perception of Russian voiceless stops are predicted to change
if “composite” Russian-English phonetic categories have developed in the absence of
phonetic category formation for the corresponding English stops (/p/, /t/, /k/). This might
be manifested, for example, by the more frequent production of audible release bursts in
the voiceless Russian stops, the prolongation of stop closure intervals, or the production of
stops with higher F1 offset frequency values than were evident at Time0.

The hypothetical study we just outlined intentionally omits a comparison of groups defined by age
of arrival in the United States (say, 7 to 12 vs. 20 to 25 years of age). As we see it, the very different
experiences that adult and child immigrants typically have when learning an L2 make such a
comparison impractical. Compared to children, for example, adult immigrants usually have had
far more formal instruction in English before arriving in the host country (usually from non-native
teachers), have larger vocabularies, and typically receive less native speaker input after arriving
than children. This is because adults typically acculturate less rapidly than children (Cheung,
Chudek, & Heine, 2011; Jia & Aaronson, 2003).

It would be valuable, however, to examine Russian children learning English in a separate or
extended study. The SLM-r predicts that the pattern of findings that emerge from research
examining adult and child L2 learners will be much the same, albeit extended over longer periods
of time for adults than children. This expectation derives from the SLM-r hypothesis that adults
and children learn L2 speech in the same way because they exploit the same capacities to learn
speech.

4. Summary

Like its predecessor, the SLM-r focuses on how sequential bilinguals produce and perceive
position-sensitive allophones of L2 vowels and consonants. Its aim is to account for how phonetic
systems reorganize over the life span in response to the phonetic input received during naturalistic
L2 learning. The core tenets of the SLM-r can be summarized as follows:

1. L2 experience. The SLM focused on highly “experienced” L2 learners and the question of
whether such learners will eventually “master” L2 sounds. The SLM-r has abandoned this
approach because it now seems evident, at least to us, that L2 learners can never match
monolingual native speakers of the target L2. This is because the phonetic elements making up
the L1 and L2 phonetic subsystems of a bilingual necessarily interact, and because the phonetic
input upon which new L2 phonetic categories are based cannot be identical to the input that
native speakers receive.

2. Production and perception. The SLM hypothesized that the accuracy of perceptual
representations for L2 sounds places an upper limit on the accuracy with which the L2 sounds
can be produced. The SLM-r, on the other hand, proposes that segmental production and
perception co-evolve without precedence.

3. L2 category formation. Phonetic category formation is possible regardless of age of first
exposure to an L2 and is crucial for phonetic organization and reorganization across the life
span. The creation of new phonetic categories for L2 sounds creates an important non-linearity
in the transformation of phonetic input into phonetic performance. When a new category is not
formed for L2 sounds that differ phonetically from the closest L1 sound, a composite L1-L2
phonetic category will develop that is based on phonetic input from two languages.

4. The full access hypothesis. According to the SLM “feature” hypothesis, a new phonetic
category formed for an L2 sound might differ from the phonetic category formed for the same
sound by native speakers if the L2 sound is defined, at least in part, by features not used in the
learner’s L1. The SLM-r adopts the “full access” hypothesis (Flege, 2005b) according to
which L2 learners can gain access to such non-L1 features. The SLM-r proposes that all
processes and mechanisms used to develop L1 phonetic categories, without exception, remain
intact and accessible for L2 learning.

5. Cue weighting. The SLM-r proposes that both new L2 phonetic categories and composite L1-
L2 phonetic categories are gradually shaped by the input distributions defining them and are
driven by the adaptive need to ensure the rapid and accurate categorization of phonetic
segments. By hypothesis, the weighting of multiple perceptual cues that define new L2
categories and composite L1-L2 categories is based on input distributions and so reflects the
reliability with which cues are present.

6. Phonetic factors. According to the SLM-r, the formation or non-formation of a new phonetic
category for an L2 sound depends primarily on (a) the sound’s degree of perceived phonetic
dissimilarity from the closest L1 sound, (b) the quantity and quality of L2 input obtained for
the sound in meaningful conversations, and (c) the precision with which the closest L1 category
is specified when L2 learning begins.

7. L1 category precision. The “category precision” hypothesis of the SLM-r differs importantly
from the earlier SLM “age” hypothesis which it replaces. It predicts that individuals having
relatively precise L1 phonetic categories will be better able to discern phonetic differences
between an L2 sound and the closest L1 sound than individuals having relatively imprecise L1
categories. This, in turn, will increase their likelihood of forming new phonetic categories for
L2 sounds. L1 category precision generally increases through childhood and into early
adolescence, but important individual differences exist at all ages. This means that variation in
L1 category precision can be dissociated from putative age-related changes in neurocognitive
plasticity at the time individuals are first exposed to an L2.

8. L1 phonetic category differences. Individual speakers of a single L1 may bring somewhat
different L1 phonetic categories to the task of learning an L2. Their L1 categories may differ in
terms of cue weighting, which is thought to depend primarily on the input received during L1
speech development, and also according to how precisely the L1 categories are defined.

9. Endogenous factors. Phonetic category formation for an L2 sound depends on the discernment
of cross-language phonetic differences, the creation of stable perceptual links between L1 and
L2 sounds, the aggregation of “equivalence classes” of L2 sounds that are perceived to be
distinct from the realizations of any L1 phonetic category, and finally the sundering of
previously established L1-L2 perceptual links. Individual differences in auditory acuity,
early-stage (pre-categorical) auditory processing, and working auditory memory may modulate these
phonetic processes by affecting how much L2 phonetic input is needed to pass from one stage
to the next.

10. Inter-subject variability. Individuals differ in terms of how accurately they produce and
perceive L2 sounds. By hypothesis, inter-subject phonetic variability can be explained, at least
in part, by knowing how individual learners’ L1 phonetic categories were specified when they
were first exposed to an L2, how they perceptually linked L2 sounds to L1 sounds via the
mechanism of interlingual identification, how dissimilar they perceived an L2 sound to be from
the closest L1 sound, and the quantity and quality of L2 phonetic input they have received.

11. Continuous learning. The phonetic categories and realization rules deployed in the L1 and L2
phonetic subsystems remain malleable across the life span, responding to variation in the
phonetic input that has been received, even recent input. An “end state” in learning can be said
to exist only for individuals who are no longer exposed to phonetic input differing from what
they were exposed to previously in life.
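The cue-weighting idea in tenet 5 can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions, not a formalism proposed by the SLM-r: it assumes that the "reliability" of a cue can be approximated by a d-prime-like separation between two categories along that cue dimension, and that weights are obtained by normalizing those separations. The function names, the cue labels, and the toy token values are all hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def cue_reliability(category_a, category_b):
    """Return a d-prime-like separation of two categories along one cue.

    Assumption: reliability is the distance between category means divided
    by the pooled standard deviation of the two input distributions.
    """
    pooled_sd = sqrt((pvariance(category_a) + pvariance(category_b)) / 2)
    return abs(mean(category_a) - mean(category_b)) / pooled_sd


def cue_weights(input_by_cue):
    """Normalize per-cue reliabilities so that the weights sum to 1."""
    reliabilities = {
        cue: cue_reliability(tokens_a, tokens_b)
        for cue, (tokens_a, tokens_b) in input_by_cue.items()
    }
    total = sum(reliabilities.values())
    return {cue: value / total for cue, value in reliabilities.items()}


# Hypothetical toy input: tokens of two stop categories measured on two cues.
# The values are invented for illustration and are not data from any study.
toy_input = {
    "vot_ms": ([10, 15, 20, 12, 18], [60, 70, 65, 75, 80]),
    "onset_f0_hz": ([110, 125, 118, 130, 122], [120, 135, 128, 140, 132]),
}

weights = cue_weights(toy_input)
for cue, weight in weights.items():
    print(f"{cue}: {weight:.3f}")
```

In this toy input the two categories are well separated along the VOT dimension and overlap heavily along the onset f0 dimension, so the sketch assigns the larger weight to VOT. Changing the input distributions changes the weights, which is the property the tenet describes.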

The SLM-r presented here provides a framework for research that may eventually permit an
understanding of how speech is learned across the life span and why individuals seemingly differ
in their ability to learn L2 speech. The model is based on the results of many published studies but
will, of course, need to be evaluated in prospective research. We recognize the immensity of this
task and realize that evaluating the model will require the expenditure of considerable resources
as well as the development of improved methodologies and measurement techniques. With that in mind,
we provide, in the Supplementary Materials that accompany this chapter, suggestions regarding
how best to obtain speech production data, how to assess the quantity and quality of L2 input, and
how to test for category formation.

Acknowledgments

The work presented here was supported by grants from the National Institute on Deafness and
Other Communication Disorders. Susan Guion played an important role in this research, and we
truly miss her. Special thanks are also due to Katsura Aoyama, Wieke Eefting, Anders Højen,
Satomi Imai, Ian MacKay, Murray Munro, Thorsten Piske, Carlo Schirru, Anna Maria Schmidt,
Naoyuki Takagi, Amanda Walley, and Ratree Wayland. We thank Charles Chang, Olga Dmitrieva,
Francois Grosjean, Nikola Eger, Natalia Kartushina, and Juan Carlo Mora Bonilla for comments
on earlier versions of this chapter.

References

Allen, J. S., Miller, J. L., & DeSteno, D. (2003). Individual talker differences in voice-onset-time.
Journal of the Acoustical Society of America, 113, 544-552.

Allen, J. S., & Miller, J. L. (2004). Listener sensitivity to individual talker differences in
voice-onset-time. Journal of the Acoustical Society of America, 115(6), 3171-3183.

Anderson, J., Morgan, J., & White, K. (2003). A statistical basis for speech sound discrimination.
Language and Speech, 46, 155-182.

Antetomaso, S., Miyazawa, K., Feldman, N., Elsner, M., Hitczenko, K., & Mazuka, R. (2017).
Modeling phonetic category learning from natural acoustic data. In M. Lamendola & J.
Scott (Eds.), Proceedings of the 41st annual Boston University Conference on Language
Development (pp. 32-45). Somerville, MA: Cascadilla Press.

Aslin, R. (2014). Phonetic category learning and its influence on speech production. Ecological
Psychology, 26(4), 4-15.

Baker, W., & Trofimovich, P. (2006). Perceptual paths to accurate production of L2 vowels: The
role of individual differences. International Review of Applied Linguistics, 44, 231-259.

Baker, W., Trofimovich, P., Flege, J. E., Mack, M., & Halter, R. (2008). Child-adult differences
in second-language phonological learning: The role of cross-language similarity. Language
and Speech, 51(4), 317-342.

Benders, T., Escudero, P. & Sjerps, M. J. (2012). The interrelation between acoustic context effects
and available response categories in speech sound categorization. Journal of the Acoustical
Society of America, 131(4), 3079-3097.

Bent, T. (2014). Children’s perception of foreign-accented words. Journal of Child Language,
41(6), 1334-1355.

Bent, T. (2018). Development of unfamiliar accent comprehension continues through adolescence.
Journal of Child Language, 45, 1400-1411.

Bent, T., & Holt, R. F. (2018). Shhh… I need quiet! Children's understanding of American, British,
and Japanese-accented English speakers. Language and Speech, 61(4), 657-673.

Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.),
Speech perception and linguistic experience: Issues in cross-language research (pp. 107-126).
Baltimore: York Press.

Best, C. T., & Tyler, M. D. (2007). Nonnative and second-language speech perception. In O. S.
Bohn & M. J. Munro (Eds.), Language experience in second language learning: In Honor
of James Emil Flege (pp. 13-44). Amsterdam: John Benjamins.

Bloomfield, L. (1933). Language. New York: Holt.

Bohn, O.-S. (2002). On phonetic similarity. In P. Burmeister, T. Piske & A. Rohde (Eds.). An
integrated view of language development: Papers in honor of Henning Wode (pp. 191-216).
Trier, Germany: Wissenschaftlicher Verlag.

Bohn, O.-S. (2020). Cross-language phonetic relationships account for most, but not all L2 speech
learning problems: The role of universal phonetic biases and generalized sensitivities. In
M. Wrembel, A. Kiełkiewicz-Janowiak, & P. Gąsiorowski (Eds.), Approaches to the
Study of Sound Structure and Speech: Interdisciplinary Work in Honour of Katarzyna
Dziubalska-Kołaczyk (pp. 171-184). Abingdon, England: Routledge.

Bohn, O.-S. & Bundgaard-Nielsen, R. L. (2009). Second language speech learning with diverse
inputs. In T. Piske & M. Young-Scholten (Eds.), Input matters in SLA (pp. 207-218). Clevedon,
UK: Multilingual Matters.

Bohn, O.-S. & Ellegaard, A. A. (2019). Perceptual assimilation and graded discrimination as
predictors of identification accuracy for learners differing in L2 experience: The case of
Danish learners' perception of English initial fricatives. Proceedings of the 19th
International Congress of Phonetic Sciences, Melbourne, Australia 2019, 2070-2074.

Bohn, O.-S., & Steinlen, A. K. (2003). Consonantal context affects cross-language perception of
vowels. Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona,
Spain 2003, 2289-2292.

Bosch, L. & Ramon-Casas, M. (2011). Variability in vowel production by bilingual speakers: Can
input properties hinder the early stabilization of contrastive categories? Journal of
Phonetics, 39, 514-526.

Bradlow, A., Akahane-Yamada, R., Pisoni, D., & Tohkura, Y. (1999). Training Japanese listeners
to identify English /r/ and /l/: Long-term retention of learning in perception and production.
Perception & Psychophysics, 61(5), 977-985.

Bradlow, A. R., & Bent, T. (2008). Listener adaptation to non-native speech. Cognition, 106(2),
707-729.

Brière, E. J. (1966). An investigation of phonological interferences. Language, 42(4), 768-796.

Broersma, M. (2005). Perception of familiar contrasts in unfamiliar positions. Journal of the
Acoustical Society of America, 117(6), 3890-3901.

Buckler, H., Oczak-Arsic, S., Siddiqui, N., & Johnson, E. K. (2017). Input matters: Speed of word
recognition in 2-year-olds exposed to multiple accents. Journal of Experimental Child
Psychology, 164, 87-100.

Bundgaard-Nielsen, R. L., Best, C. T., & Tyler, M. D. (2011). Vocabulary size is associated with
second-language vowel perception performance in adult learners. Studies in Second
Language Acquisition, 33, 433-461.

Callan, D. E., Jones, J. A., Callan, A. M., & Akahane-Yamada, R. (2004). Learning-induced neural
plasticity associated with improved identification performance after training of a difficult
second-language phonetic contrast. NeuroImage, 19, 113-124.

Callan, D. E., Tajima, K., Callan, A. M., Kubo, R., Masaki, S., & Akahane-Yamada, R. (2003).
Phonetic perceptual identification by native- and second-language speakers differentially
activates brain regions involved with acoustic phonetic processing and those involved with
articulatory-auditory/orosensory internal models. NeuroImage, 22, 1182-1194.

Casillas, J. V., & Simonet, M. (2018). Perceptual categorization and bilingual language modes:
Assessing the double phonemic boundary in early and late bilinguals. Journal of Phonetics,
71, 51-64.

Cebrian, J. (2006). Experience and the use of non-native duration in L2 vowel categorization.
Journal of Phonetics, 34, 372-387.

Chandrasekaran, B., Sampath, P., & Wong, P. C. M. (2010). Individual variability in
cue-weighting and lexical tone learning. Journal of the Acoustical Society of America, 128(1),
456-465.

Chládkova, K., & Podlipský, V. J. (2011). Native dialect matters: Perceptual assimilation of Dutch
vowels by Czech listeners. Journal of the Acoustical Society of America, 130(4), EL186-
EL192.

Chao, S-C., Ochoa, D. & Daliri, A. (2019). Production variability and categorical perception of
vowels are strongly linked. Frontiers in Human Neuroscience. doi:
10.3389/fnhum.2019.00096.

Cheung, B., Chudek, M., & Heine, S. (2011). Evidence for a sensitive period for acculturation:
Younger immigrants report acculturating at a faster rate. Psychological Science, 22(2), 147-
152.

Chodroff, E., & Wilson, C. (2017). Structure in talker-specific phonetic realization: Covariation
of stop consonant VOT in American English. Journal of Phonetics, 61, 30-47.

Clarke, C., & Luce, P. (2005). Perceptual adaptation to speaker characteristics: VOT boundaries
in stop voicing categorization. Proceedings of the ISCA Workshop on Plasticity in Speech
Perception (PSP2005); London, UK; 15-17 June 2005.

Clayards, M. (2018). Differences in cue weights for speech perception are correlated for
individuals within and across contrasts. Journal of the Acoustical Society of America,
144(3), EL172-EL177.

Darcy, I., & Krüger, F. (2012). Vowel perception and production in Turkish children acquiring L2
German. Journal of Phonetics, 40, 568-581.

DeKeyser, R., & Larson-Hall, J. (2005). What Does the Critical Period Really Mean? In J. F. Kroll
& A. M. B. de Groot (Eds.), Handbook of bilingualism: Psycholinguistic approaches (pp.
88-108). New York, NY, US: Oxford University Press.

de Leeuw, E., & Celata, C. (2019). Plasticity of native phonetic and phonological domains in the
context of bilingualism. Journal of Phonetics, 75, 88-93.

Díaz, B., Mitterer, H., Broersma, M., Escera, C., & Sebastián-Gallés, N. (2015). Variability in L2
phonemic learning originates from speech-specific capabilities: An MMN study on late
bilinguals. Bilingualism: Language and Cognition, 19(5), 955-970.

Díaz, B., Mitterer, H., Broersma, M., & Sebastián-Gallés, N. (2012). Individual differences in late
bilinguals’ L2 phonological processes: From acoustic-phonetic to lexical access. Learning
and Individual Differences, 22, 680-689.

DiCanio, C., Nam, H., Amith, J. D., García, R. C., & Whalen, D. H. (2015). Vowel variability in
elicited versus spontaneous speech: evidence from Mixtec. Journal of Phonetics, 48, 45-59.

Docherty, G. J., Watt, D., Llamas, C., Hall, D., & Nycz, J. (2011). Variation in voice onset time
along the Scottish border. Proceedings of the 17th International Congress of Phonetic
Sciences, Hong Kong 17-21 August 2011. 591-594.

Dmitrieva, O. (2019). Transferring perceptual cue-weighting from second language into first
language: Cues to voicing in Russian speakers of English. Journal of Phonetics, 83, 128-
143.

Dmitrieva, O., Llanos, F., Shultz, A. A., & Francis, A. L. (2015). Phonological status, not voice
onset time, determines the acoustic realization of onset f0 as a secondary voicing cue in
Spanish and English. Journal of Phonetics, 49, 77-95.

Dmitrieva, O., Jongman, A., & Sereno, J. A. (2010). Phonological neutralization by native and
non-native speakers: The case of Russian final devoicing. Journal of Phonetics, 38(3), 483-
492.

Earle, F. S., & Myers, E. B. (2015). Overnight consolidation promotes generalization across talkers
in the identification of nonnative speech sounds. Journal of the Acoustical Society of
America, 137(1), EL91-EL97.

Eilers, R. E., & Oller, D. K. (1976). The role of speech discrimination in developmental sound
substitutions. Journal of Child Language, 3(3), 319-329.

Elman, J. L., Diehl, R. L., & Buchwald, S. E. (1977). Perceptual switching in bilinguals. Journal
of the Acoustical Society of America, 62(4), 971-974.

Escudero, P., Benders, T., & Lipski, S. (2009). Native, non-native and L2 perceptual cue weighting
for Dutch vowels: The case of Dutch, German, and Spanish listeners. Journal of Phonetics,
37(4), 452-465.

Escudero, P., & Boersma, P. (2004). Bridging the gap between L2 speech perception research and
phonological theory. Studies in Second Language Acquisition, 26(4), 551-585.

Escudero, P., Sisinni, B. & Grimaldi, M. (2014). The effect of vowel inventory and acoustic
properties in Salento Italian learners of Southern British English vowels. Journal of the
Acoustical Society of America, 135(3), 1577-1584.

Escudero, P., & Williams, D. (2012). Native dialect influences second-language vowel perception:
Peruvian versus Iberian Spanish learners of Dutch. Journal of the Acoustical Society of
America, 131(5), EL406-EL412.

Evans, B. G., & Iverson, P. (2004). Vowel normalization for accent: An investigation of best
exemplar locations in northern and southern British English sentences. Journal of the
Acoustical Society of America, 115(1), 352-361.

Evans, S., & Davis, M. H. (2015). Hierarchical organization of auditory and motor representations
in speech perception: Evidence from searchlight similarity analysis. Cerebral Cortex,
25(12), 4772-4788. doi: 10.1093/cercor/bhv136.

Feldman, N. H., Griffiths, T. L., Goldwater, S., & Morgan, J. L. (2013). A role for the developing
lexicon in phonetic category acquisition. Psychological Review, 120(4), 751-778.

Feldman, N. H., Griffiths, T. L., & Morgan, J. L. (2009). The influence of categories on perception:
Explaining the perceptual magnet effect as optimal statistical inference. Psychological
Review, 116(4), 752-782.

Flege, J. E. (1984). The detection of French accent by American listeners. Journal of the Acoustical
Society of America, 76(3), 692-707.

Flege, J. E. (1987). The production of “new” and “similar” phones in a foreign language: evidence
for the effect of equivalence classification. Journal of Phonetics, 15, 47-65.

Flege, J. E. (1988). Factors affecting degree of perceived foreign accent in English sentences.
Journal of the Acoustical Society of America, 84(1), 70-79.

Flege, J. E. (1991). Age of learning affects the authenticity of voice-onset time (VOT) in stop
consonants produced in a second language. Journal of the Acoustical Society of America,
89, 395-411.

Flege, J. E. (1992). The intelligibility of English vowels spoken by British and Dutch talkers. In
R. D. Kent (Ed.), Intelligibility in speech disorders: Theory, measurement, and
management (pp. 157-232). Amsterdam: John Benjamins Publishing Company.

Flege, J. E. (1995). Second-language speech learning: Theory, findings, and problems. In W.
Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language
research (pp. 229-273). Timonium, MD: York Press.

Flege, J. E. (1998). Factors affecting degree of foreign accent in English sentences. Journal of the
Acoustical Society of America, 84, 70-79.

Flege, J. E. (1999). Relation between L2 production and perception. In J. Ohala et al. (Eds.),
Proceedings of the XIVth International Congress of Phonetic Sciences (pp. 1273-1276).
Berkeley, CA: Department of Linguistics, University of California at Berkeley.

Flege, J. E. (2005a). Origins and development of the Speech Learning Model. 1st Acoustical
Society of America Workshop in L2 speech learning. April 14-15, 2005, Simon Fraser
University, Vancouver, BC, Canada. DOI: 10.13140/RG.2.2.10181.19681.

Flege, J. E. (2005b). Evidence for plasticity in studies examining second language speech
acquisition. ISCA Workshop on Plasticity in Speech Perception, University College
London, June 2005. DOI: 10.13140/RG.2.2.34539.80167

Flege, J. E. (2007). Language contact in bilingualism: Phonetic system interactions. In J. Cole &
J. Hualde (Eds.), Laboratory Phonology 9 (pp. 353-380). Berlin: Mouton de Gruyter.

Flege, J. E. (2019). A non-critical period for second-language speech learning. In A. M. Nyvad,
M. Hejná et al. (Eds.), A sound approach to language matters – In honor of Ocke-Schwen
Bohn (pp. 501-541). Department of English, School of Communication & Culture, Aarhus
University. E-ISBN: 978-87-7507-440-2.

Flege, J. E., Aoyama, K., & Bohn, O.-S. (2020). The revised Speech Learning Model (SLM-r)
applied. This volume.

Flege, J. E., Bohn, O.-S., & Yang, S. (1997). Effects of experience on non-native speakers’
production and perception of English vowels. Journal of Phonetics, 25, 437-470.

Flege, J. E., & Davidian, R. (1984). Transfer and developmental processes in adult foreign
language speech production. Applied Psycholinguistics, 5, 323-347.

Flege, J. E. & Eefting, W. (1986). Linguistic and developmental effects on the production and
perception of stop consonants. Phonetica, 43, 155-171.

Flege, J. E., & Eefting, W. (1987). Production and perception of English stop consonants by native
Spanish speakers. Journal of Phonetics, 15(1), 67-83.

Flege, J. E., & Eefting, W. (1988). Imitation of a VOT continuum by native speakers of Spanish
and English: Evidence for phonetic category formation. Journal of the Acoustical Society
of America, 83, 729-740.

Flege, J. E., Frieda, E. M., Walley, A. C., & Randazza, L. A. (1998). Lexical factors and segmental
accuracy in second language speech production. Studies in Second Language Acquisition,
20(2), 155-187.

Flege, J. E., & Hammond, R. (1982). Mimicry of non-distinctive phonetic differences between
language varieties. Studies in Second Language Acquisition, 5(1), 1-16.

Flege, J. E., & Liu, S. (2001). The effect of experience on adults’ acquisition of a second language.
Studies in Second Language Acquisition, 23, 527-552.

Flege, J. E., & Munro, M. (1994). The word unit in second language speech production and
perception. Studies in Second Language Acquisition, 16, 381-411.

Flege, J. E., Munro, M. J., & Fox, R. A. (1994). Auditory and categorical effects on cross-language
vowel perception. Journal of the Acoustical Society of America, 95(6), 3623-3641.

Flege, J. E., Munro, M., & MacKay, I. R. A. (1995a). Factors affecting strength of perceived
foreign accent in a second language. Journal of the Acoustical Society of America, 97(5),
3126-3134.

Flege, J. E., Munro, M. J., & MacKay, I. R. A. (1995b). Effects of age of second-language learning
on the production of English consonants. Speech Communication, 16, 1-26.

Flege, J. E., Munro, M. J., & Skelton, L. (1992). Production of the word-final English /t/-/d/
contrast by native speakers of English, Mandarin, and Spanish. Journal of the Acoustical
Society of America, 92(1), 128-143.

Flege, J. E., & Port, R. (1981). Cross-language phonetic interference: Arabic to English. Language
and Speech, 24(2), 125-146.

Flege, J. E., Schirru, C., & MacKay, I. R. A. (2003). Interaction between the native and second
language phonetic systems. Speech Communication, 40, 467-491.

Flege, J. E., Takagi, N., & Mann, V. (1995). Japanese adults can learn to produce English /ɹ/ and
/l/ accurately. Language and Speech, 38, 25-55.

Flege, J. E., & Wang, C. (1989). Native-language phonotactic constraints affect how well Chinese
subjects perceive the word-final English /t/-/d/ contrast. Journal of Phonetics, 17, 299-315.

Flege, J. E., & Wayland, R. (2019). The role of input in native Spanish Late learners’ production
and perception of English phonetic segments. Journal of Second Language Studies, 2(1),
1-45.

Francis, A. L., & Nusbaum, H. C. (2002). Selective attention and the acquisition of new phonetic
categories. Journal of Experimental Psychology: Human Perception and Performance,
28(2), 349-366.

Francis, A. L., Kaganovich, N., & Driscoll-Huber, C. (2008). Cue-specific effects of categorization
training on the relative weighting of acoustic cues to consonant voicing in English. Journal
of the Acoustical Society of America, 124(2), 1234-1251.

Franken, M. K., Acheson, D. J., McQueen, J. M., Eisner, F., & Hagoort, P. (2017). Individual
variability as a window on production-perception interactions in speech motor control.
Journal of the Acoustical Society of America, 142(4), 2007-2018.

Frieda, E. M., Walley, A. C., Flege, J. E., & Sloane, M. E. (2000). Adults’ perception and
production of the English vowel /i/. Journal of Speech, Language and Hearing Research,
43, 129-143.

Garcia Lecumberri, M. L., Cooke, M., & Cutler, A. (2011). Non-native speech perception in
adverse conditions: A review. Speech Communication, 52, 864-886.

Galbraith, G. C., Buranahirun, C. E., Kang, J., Ramos, O. V., & Lunde, S. E. (2000). Individual
differences in autonomic activity affects brainstem auditory frequency-following response
amplitude in humans. Neuroscience Letters, 283(3), 201-204.

Garibaldi, C. L., & Bohn, O.-S. (2015). Phonetic similarity predicts ultimate attainment quite
well: The case of Danish /i, y, u/ and /d, t/ for native speakers of English and of Spanish.
Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow, 10-14
August 2015.

Giannakopoulou, A., Uther, M., & Ylinen, S. (2013). Enhanced plasticity in spoken language
acquisition for child learners: Evidence from phonetic training studies in child and adult
learners of English. Child Language Teaching and Therapy, 29(2), 201-218.

Golestani, N. (2016). Neuroimaging of phonetic perception in bilinguals. Bilingualism: Language
and Cognition, 19(4), 674-682.

Golestani, N., Molko, N., Dehaene, S., LeBihan, D., & Pallier, C. (2007). Brain structure predicts
the learning of foreign speech sounds. Cerebral Cortex, 17, 575-582.

Gottfried, T. L. (1984). Effects of consonant context on the perception of French vowels. Journal
of Phonetics, 12, 91-114.

Grosjean, F. (1998). Studying bilinguals: Methodological and conceptual issues. Bilingualism:
Language and Cognition, 1, 131-149.

Grosjean, F. (2001). The bilingual’s language modes. In J. Nicol (Ed.), One mind, two languages:
Bilingual language processing (pp. 1-22). Oxford: Blackwell.

Guenther, F., Hampson, M., & Johnson, D. (1998). A theoretical investigation of reference frames
for the planning of speech movements. Psychological Review, 105(4), 611-633.

Gupta, P., & Dell, G. S. (1999). The emergence of language from serial order and procedural
memory. In B. MacWhinney (Ed.), The emergence of language (pp. 447-481). Mahwah,
NJ: Lawrence Erlbaum.

Han, Z., & Odlin, T. (Eds.). (2006). Studies of fossilization in second language acquisition.
Clevedon, England: Multilingual Matters.

Harrington, J., Palethorpe, S., & Watson, C. (2000). Monophthongal vowel changes in Received
Pronunciation: an acoustic analysis of the Queen's Christmas broadcasts. Journal of the
International Phonetic Association, 30(1-2), 63-78.

Hazan, V., & Barrett, S. (2000). The development of phonemic categorization in children aged 6-
12. Journal of Phonetics, 28, 377-396.

Hazan, V., & Kim, Y. H. (2010). Can we predict who will benefit from computer-based phonetic
training? Interspeech 2010, Satellite Workshop on “Second Language Studies: Acquisition,
Learning, Education and Technology”, Waseda University, Tokyo, Japan, September 22-
24, 2010.

Hazan, V., & Rosen, S. (1991). Individual variability in the perception of cues to place variability
in initial stops. Perception & Psychophysics, 49(2), 187-200.

Heald, S. L. M., & Nusbaum, H. (2015). Variability in vowel production within and between days.
PLoS ONE, 10(9): e0136791. doi:10.1371/journal.pone.0136791.

Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of
American English vowels. Journal of the Acoustical Society of America, 97, 3099-3111.

Hintzman, D. L. (1986). “Schema abstraction” in a multiple trace memory model. Psychological
Review, 93(4), 411-428.

Hockett, C. F. (1958). A course in modern linguistics. New York: The MacMillan Company.

Højen, A., & Flege, J. E. (2006). Early learners’ discrimination of second-language vowels.
Journal of the Acoustical Society of America, 119(5), 3072-3084.

Holt, L. & Lotto, A. J. (2006). Cue weighting in auditory categorization: Implications for first and
second language acquisition. Journal of the Acoustical Society of America, 119(5), 3059-
3071.

Holt, L. L., & Lotto, A. J. (2010). Speech perception as categorization. Attention, Perception &
Psychophysics, 72(5), 1218-1227.

Hopp, H., & Schmid, M. S. (2013). Perceived foreign accent in first language attrition and second
language acquisition: The impact of age of acquisition and bilingualism. Applied
Psycholinguistics, 34, 361-394.

Hoormann, J., Falkenstein, M., Hohnsbein, J., & Blanke, L. (1992). The human frequency-
following response (FFR): Normal variability and relation to the click-evoked brainstem
response. Hearing Research, 59(2), 179-188.

Hu, W., Mi, L., Yang, Z., Tao, S., Li, M., Wang, W., Dong, Q., & Liu, C. (2016). Shifting
perceptual weights in L2 vowel identification after training. PLoS ONE, 11(9): e0162876.
doi:10.1371/journal.pone.0162876.

Idemaru, K., & Holt, L. L. (2011). Word recognition reflects dimension-based statistical learning.
Journal of Experimental Psychology: Human Perception and Performance, 37(6), 1939.

Idemaru, K., & Holt, L. (2013). The developmental trajectory of children’s perception and
production of English /r/-/l/. Journal of the Acoustical Society of America, 133(6), 4232-
4246.

Idemaru, K., Holt, L. L., & Seltman, H. (2012). Individual differences in cue weights are stable
across time: The case of Japanese stop lengths. Journal of the Acoustical Society of
America, 132(6), 3950-3964.

Imai, S., Walley, A. C. & Flege, J. E. (2005). Lexical frequency and neighborhood density effects
on the recognition of native and Spanish-accented words by native and Spanish listeners.
Journal of the Acoustical Society of America, 117(2), 896-907.

Iverson, P., & Evans, B. G. (2007). Learning English vowels with different first-language vowel
systems: Perception of formant targets, formant movement, and duration. Journal of the
Acoustical Society of America, 122(5), 2842-2854.

Iverson, P., Hazan, V., & Bannister, K. (2005). Phonetic training with acoustic cue manipulations:
A comparison of methods for teaching English /r/-/l/ to Japanese adults. Journal of the
Acoustical Society of America, 118(5), 3267-3278.

Iverson, P., Wagner, A., & Rosen, S. (2016). Effects of language experience on pre-categorical
perception: Distinguishing general from specialized processes in speech perception. Journal
of the Acoustical Society of America, 139(4), 1799-1809.

Jia, G., & Aaronson, D. (2003). A longitudinal study of Chinese children and adolescents learning
English in the United States. Applied Psycholinguistics, 24, 131-161.

Jia, G., Strange, W., Wu, Y., Collado, J., & Guan, Q. (2006). Perception and production of English
vowels by Mandarin speakers: Age related differences vary with amount of exposure.
Journal of the Acoustical Society of America, 119(2), 1118-1130.

Johnson, K. (2000). Adaptive dispersion in vowel perception. Phonetica, 57, 181-188.

Johnson, K., Flemming, E., & Wright, R. (1993). The hyperspace effect: Phonetic targets are
hyperarticulated. Language, 69, 505-528.

Jongman, A., & Wade, T. (2007). Acoustic variability and perceptual learning: The case of
non-native accented speech. In O.-S. Bohn & M. J. Munro (Eds.), Language experience in
second language learning: In honor of James Emil Flege (pp. 135-150). Amsterdam: John
Benjamins Publishing Company.

Kachlicka, M., Saito, K., & Tierney, A. (2019). Successful second language learning is tied to robust
domain-general auditory processing and stable neural representation of sound. Brain and
Language, 192, 15-24.

Kartushina, N., & Frauenfelder, U. H. (2013). On the role of L1 speech production in L2
perception: Evidence from Spanish learners of French. Interspeech 2013, 25-29 August
2013, Lyon, France, 2118-2122.

Kartushina, N., Hervais-Adelman, A., Frauenfelder, U. H., & Golestani, N. (2016). Mutual
influences between native and non-native vowels in production: Evidence from short-term
visual articulatory feedback training. Journal of Phonetics, 57, 21-39.

Kharlamov, V. (2014). Incomplete neutralization of the voicing contrast in word-final obstruents
in Russian: Phonological, lexical, and methodological influences. Journal of Phonetics,
43(1), 47-56.

Kent, R. D., & Forner, L. L. (1980). Speech segment durations in sentence recitations by children
and adults. Journal of Phonetics, 8, 157-168.

Kewley-Port, D. (2001). Vowel formant discrimination, II: Effects of stimulus uncertainty,
consonantal context, and training. Journal of the Acoustical Society of America, 110(4),
2141-2155.

Kidd, G. R., Watson, C. S., & Gygi, B. (2007). Individual differences in auditory abilities. Journal
of the Acoustical Society of America, 122(1), 418-435.

Kim, D., & Clayards, M. (2019). Individual differences in the link between perception and
production and the mechanism of phonetic imitation. Language, Cognition and
Neuroscience, 34(6), 769-786.

Kim, D., Clayards, M., & Goad, H. (2018). A longitudinal study of individual differences in the
acquisition of new vowel contrasts. Journal of Phonetics, 67, 1-20.

Kim, M. R. (2012). L1-L2 transfer in VOT and f0 production by Korean English learners: L1
sound change and L2 stop production. Phonetic and Speech Sciences, 4(3), 31-41.

Kleinschmidt, D. F., & Jaeger, T. F. (2015). Robust speech perception: Recognize the familiar,
generalize to the similar, and adapt to the novel. Psychological Review, 122(2), 148-203.

Kluender, K. R., Lotto, A. J., Holt, L. L. & Bloedel, S. L. (1998). Role of experience for language-
specific functional mappings of vowel sounds. Journal of the Acoustical Society of
America, 104(6), 3568-3582.

Kohler, K. (1981). Contrastive phonology and the acquisition of phonetic skills. Phonetica, 38,
213-226.

Kong, E. J., & Edwards, J. (2015). Individual differences in L2 learner’s perceptual cue weighting
patterns. 18th International Congress of Phonetic Sciences, 10-14 August 2015, Glasgow,
Scotland, UK.

Kong, E. J. & Edwards, J. (2016). Individual differences in categorical perception of speech: Cue
weighting and executive function. Journal of Phonetics, 59, 40-57.

Kong, E. J. & Yoon, I. H. (2013). L2 proficiency effect on the acoustic cue-weighting pattern by
Korean L2 learners of English. Journal of the Korean Society of Speech Sciences, 5(4), 81-
90.

Kraljic, T., & Samuel, A. G. (2006). Generalization in perceptual learning for speech.
Psychonomic Bulletin & Review, 13(2), 262-268.

Kuhl, P. (1983). Perception of auditory equivalence classes for speech in early infancy. Infant
Behavior and Development, 6, 263-285.

Kuhl, P. (1991). Human adults and human infants show a “perceptual magnet effect” for the
prototypes of speech categories, monkeys do not. Perception & Psychophysics, 50, 93-107.

Kuhl, P. (2000). A new view of language acquisition. Proceedings of the National Academy of
Sciences, 97(22), 11850-11857.

Kuhl, P., Conboy, B. T., Coffey-Corina, S., Padden, D., Rivera-Gaxiola, M., & Nelson, T. (2008).
Phonetic learning as a pathway to language: New data and native language magnet theory
expanded (NLM-e). Philosophical Transactions of the Royal Society B, 363, 979-1000.

Kuhl, P., Conboy, B. T., Padden, D., Nelson, T. & Pruitt, J. (2005). Early speech perception and
later language development: Implications for the “Critical Period”. Language Learning and
Development, 1, 237-264.

Labov, W. (1994). Principles of linguistic change. Vol. 1: Internal factors. Oxford & Cambridge,
MA: Blackwell.

Lado, R. (1957). Linguistics across cultures: Applied linguistics for language teachers. Ann
Arbor: University of Michigan Press.

Lee, H., & Jongman, A. (2018). Effects of sound change on the weighting of acoustic cues to the
three-way laryngeal stop contrast in Korean: diachronic and dialectal comparisons.
Language and Speech, 63(3), 509-530.

Lee, S., Potamianos, A., & Narayanan, S. (1999). Acoustics of children’s speech: Developmental
changes of temporal and spectral parameters. Journal of the Acoustical Society of America,
105(3), 1455-1468.

Lehet, M., & Holt, L. (2017). Dimension-based statistical learning affects both speech perception
and production. Cognitive Science, 41, 885-912.

Lengeris, A. (2009). Individual differences in second-language vowel learning. Unpublished PhD
thesis, University College London.

Lengeris, A., & Hazan, V. (2010). The effect of native vowel processing ability and frequency
discrimination acuity on the phonetic training of English vowels for native speakers of
Greek. Journal of the Acoustical Society of America, 128(6), 3757-3768.

Lenneberg, E. H. (1967). Biological foundations of language. New York: Wiley.

Lev-Ari, S., & Peperkamp, S. (2013). Low inhibitory skill leads to non-native perception and
production in bilinguals’ native language. Journal of Phonetics, 41, 320-331.

Levy, E. S. (2009a). Language experience and consonantal context effects on perceptual
assimilation of French vowels by American-English learners of French. Journal of the
Acoustical Society of America, 125(2), 1138-1152.

Levy, E. S. (2009b). On the assimilation-discrimination relationship in American English adults’
French vowel learning. Journal of the Acoustical Society of America, 126(5), 2670-2682.

Levy, E. S., & Strange, W. (2008). Perception of French vowels by American English adults with
and without French language experience. Journal of Phonetics, 36, 141-157.

Lindblom, B. (1990). Explaining phonetic variation: A sketch of the H&H theory. In W. J.
Hardcastle & A. Marchal (Eds.), Speech production and speech modeling (pp. 403-439).
Dordrecht, The Netherlands: Kluwer Academic.

MacKay, I. R. A., Flege, J. E., & Imai, S. (2006). Evaluating the effects of chronological age and
sentence duration on degree of perceived foreign accent. Applied Psycholinguistics, 27,
157-183.

MacKay, I. R. A., Flege, J. E., Piske, T., & Schirru, C. (2001). Category restructuring during second-
language acquisition. Journal of the Acoustical Society of America, 110, 516-528.

MacKay, I. R. A., Meador, D., & Flege, J. E. (2001). The identification of English consonants by
native speakers of Italian. Phonetica, 58, 103-125.

Magezi, D. (2015). Linear mixed-effects models for within-participant psychology experiments:
an introductory tutorial and free graphical user interface (LMMgui). Frontiers in
Psychology, January 22, 6:2. DOI: 10.3389/fpsyg.2015.00002.

Markham, D. (1999). Phonetic imitation, accent, and the learner. Lund, Sweden: Lund University
Press.

Markham, D., & Hazan, V. (2004). Acoustic-phonetic correlates of talker intelligibility for adults
and children. Journal of the Acoustical Society of America, 116(5), 3108-3118.

Maye, J., Werker, J., & Gerken, L. (2002). Infant sensitivity to distributional information can affect
phonetic discrimination. Cognition, 82, B101-B111.

Mayr, R., & Escudero, P. (2010). Explaining individual variation in L2 perception: Rounded
vowels in English learners of German. Bilingualism: Language and Cognition, 13(3), 279-
297.

McAllister, R., Flege, J. E., & Piske, T. (2003). The influence of the L1 on Swedish quantity by
native speakers of Spanish, English and Estonian. Journal of Phonetics, 30, 229-258.

McQueen, J. M., Tyler, M. D., & Cutler, A. (2012). Lexical retuning of children’s speech
perception: Evidence for knowledge about words’ component sounds. Language Learning
and Development, 8, 317-339.

Miller, J. L. (1994). On the internal structure of phonetic categories. Cognition, 50, 271-285.

Mielke, J., Baker, A., & Archangeli, D. (2016). Individual-level contact limits phonological
complexity: Evidence from bunched and retroflex /ɹ/. Language, 92(1), 101-140.

Mitterer, H., Reinisch, E., & McQueen, J. M. (2018). Allophones, not phonemes in spoken-word
recognition. Journal of Memory and Language, 98, 77-92.

Miyawaki, K., Jenkins, J. J., Strange, W., Liberman, A. M., Verbrugge, R., & Fujimura, O. (1975).
An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of
Japanese and English. Perception & Psychophysics, 18(5), 331-340.

Mochizuki, M. (1981). The identification of /r/ and /l/ in natural and synthesized speech. Journal of
Phonetics, 9(3), 283-303.

Mora, J. C., & Mora-Plaza, I. (2019). Contributions of cognitive attention control to L2 speech
learning. In A. M. Nyvad, M. Hejná et al. (Eds.), A sound approach to language matters –
In honor of Ocke-Schwen Bohn (pp. 477-499). Dept. of English, School of Communication
& Culture, Aarhus University.

Mora, J. C., Keidel, J. L., & Flege, J. E. (2010). Why are the Catalan contrasts between /e/-/ɛ/ and
/o/-/ɔ/ so difficult for even early Spanish-Catalan bilinguals to perceive? In K.
Dziubalska-Kołaczyk, M. Wrembel, & M. Kul (Eds.), New Sounds 2010: Proceedings of the 6th
International Symposium on the Acquisition of Second Language Speech (pp. 325-330).

Mora, J. C., Keidel, J. L., & Flege, J. E. (2015). Effects of Spanish use on the production of Catalan
vowels by early Spanish-Catalan bilinguals. In J. Romero & M. Riera (Eds.), The
phonetics-phonology interface: Representations and methodologies (pp. 33-53).
Amsterdam: John Benjamins.

Morrongiello, B. A., Robson, R. C., Best, C. T., & Clifton, R. K. (1984). Trading relations in the
perception of speech by 5-year-old children. Journal of Experimental Child Psychology,
37, 231-250.

Moyer, A. (2009). Input as a critical means to an end: Quantity and quality of experience in L2
phonological attainment. In T. Piske & M. Young-Scholten (Eds.) Input matters in SLA
(pp. 159-174). Bristol: Multilingual Matters.

Nam, Y., & Polka, L. (2016). The phonetic landscape in infant consonant perception is an uneven
terrain. Cognition, 155, 57-66.

Nathan, L., Wells, B., & Donlan, C. (1998). Children’s comprehension of unfamiliar regional
accents: A preliminary investigation. Journal of Child Language, 25, 343-365.

Neuman, A., & Hochberg, L. (1983). Children’s perception of speech in reverberation. Journal of
the Acoustical Society of America, 73, 2145-2149.

Newman, R. S. (2003). Using links between speech perception and speech production to evaluate
different acoustic metrics: A preliminary report. Journal of the Acoustical Society of
America, 113(5), 2850-2860.

Newman, R. S., Clouse, S. A., & Burnham, J. L. (2001). The perceptual consequences of within-
talker variability in fricative production. Journal of the Acoustical Society of America, 109,
1181-1196.

Newton, C., & Ridgway, S. (2015). Novel accent perception in typically-developing school-aged
children. Child Language Teaching and Therapy, 32(1), 111-123.

Nielsen, K. (2011). Specificity and abstractness of VOT imitation. Journal of Phonetics, 39, 132-
142.

Nittrouer, S. (2004). The role of temporal and dynamic signal components in the perception of
syllable-final stop voicing by children and adults. Journal of the Acoustical Society of
America, 115(4), 1777-1790.

Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship.
Journal of Experimental Psychology: General, 115, 39-57.

Nygaard, L. C., Sommers, M. S., & Pisoni, D. B. (1994). Speech perception as a talker-contingent
process. Psychological Science, 5, 42-46.

Peperkamp, S., & Bouchon, C. (2011). The relation between perception and production in L2
phonological processing. Interspeech 2011, 12th Annual Conference of the International
Speech Communication Association, Florence, Italy, August 27-31, 2011.

Perkell, J. S., Guenther, F. H., Lane, H., Matthies, M. L., Stockmann, E., Tiede, M., & Zandipour,
M. (2004). The distinctness of speakers’ productions of vowel contrasts is related to their
discrimination of the contrasts. Journal of the Acoustical Society of America, 116(4), 2338-
2344.

Perkell, J. S., Matthies, M. L., Tiede, M., Lane, H., Zandipour, M., Marrone, N., Stockmann, E., &
Guenther, F. H. (2004). The distinctness of speakers’ /s/-/ʃ/ contrast is related to their
auditory discrimination and use of an articulatory saturation effect. Journal of Speech,
Language, and Hearing Research, 47(6), 1259-1269.

Pisoni, D. B., Aslin, R. N., Perey, A. J., & Hennessy, B. L. (1982). Some effects of laboratory
training on identification and discrimination of voicing contrasts in stop consonants.
Journal of Experimental Psychology: Human Perception and Performance, 8, 297-314.

Pisoni, D., Lively, S., & Logan, J. (1994). Perceptual learning of nonnative speech contrasts:
Implications for theories of speech perception. In J. Goodman & H. Nusbaum (Eds.), The
development of speech perception: The transition from speech sounds to spoken words (pp.
121-166). Cambridge, MA: The MIT Press.

Polka, L., & Bohn, O.-S. (2003). Asymmetries in vowel perception. Speech Communication, 41(1),
221-231.

Polka, L., & Bohn, O.-S. (2011). Natural Referent Vowel (NRV) framework: An emerging view
of early phonetic development. Journal of Phonetics, 39(4), 467-478.

Reiterer, S. M., Hu, X., Sumathi, T. A., & Singh, N. C. (2013). Are you a good mimic? Neuro-
acoustic signatures for speech imitation ability. Frontiers in Psychology, 1-3. DOI:
10.3389/fpsyg.2013.00782.

Remez, R. E., Fellowes, J. M., & Rubin, P. E. (1997). Talker identification based on phonetic
information. Journal of Experimental Psychology: Human Perception and Performance,
23, 651-666.

Rochet, B. L. (1995). Perception and production of second-language speech sounds by adults. In
W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language
research (pp. 229-273). Timonium, MD: York Press.

Rogers, C. L., Lister, J. L., Febo, D. M., Besing, J. M., & Abrams, H. B. (2006). Effects of
bilingualism, noise, and reverberation on speech perception by listeners with normal
hearing. Applied Psycholinguistics, 27(3), 465-485.

Saito, K., Sun, H., & Tierney, A. (2019). Explicit and implicit aptitude effects on second language
speech learning: Scrutinizing segmental and suprasegmental sensitivity and performance
via behavioral and neurophysiological measures. Bilingualism: Language and Cognition,
22(5), 1123-1140.

Samuel, A. (1981). Phonemic restoration: Insights from a new methodology. Journal of
Experimental Psychology: General, 110(4), 474-494.

Sancier, M., & Fowler, C. A. (1997). Gestural drift in a bilingual speaker of Brazilian Portuguese
and English. Journal of Phonetics, 25, 421-438.

Schertz, J., Cho, T., Lotto, A., & Warner, N. (2015). Individual differences in phonetic cue use in
production and perception of a non-native sound contrast. Journal of Phonetics, 52, 183-204.

Schertz, J., Cho, T., Lotto, A., & Warner, N. (2016). Individual differences in perceptual
adaptability of foreign sound categories. Attention, Perception, & Psychophysics, 78, 355-
367.

Schmidtke, J. (2016). The bilingual disadvantage in speech understanding in noise is likely a
frequency effect related to reduced language exposure. Frontiers in Psychology, May 13,
7. DOI: 10.3389/fpsyg.2016.00678.

Schulze, K., Vargha-Khadem, F., & Mishkin, M. (2012). Test of a motor theory of long-term
auditory memory. Proceedings of the National Academy of Sciences, 109(18), 7121-7125.

Sheldon, A., & Strange, W. (1982). The acquisition of /r/ and /l/ by Japanese learners of English:
Evidence that speech production can precede speech perception. Applied Psycholinguistics,
3(3), 243-261.

Shultz, A. A., Francis, A. L., & Llanos, F. (2012). Differential cue weighting in perception and
production of consonant voicing. Journal of the Acoustical Society of America, 132(2),
EL95-EL101.

Slevc, L. R., & Miyake, A. (2006). Individual differences in second-language proficiency.
Psychological Science, 17(8), 675-681.

Smit, A., Hand, L., Freilinger, J., Bernthal, J., & Bird, A. (1990). The Iowa articulation norms
project and its Nebraska replication. Journal of Speech and Hearing Disorders, 55, 779-
798.

Smits, R., Sereno, J., & Jongman, A. (2006). Categorization of sounds. Journal of Experimental
Psychology: Human Perception and Performance, 32(3), 733-754.

Smith, B. L. (1979). A phonetic analysis of consonantal devoicing in children’s speech. Journal
of Child Language, 6(1), 19-28.

Snow, C., & Hoefnagel-Höhle, M. (1979). Individual differences in second-language ability: A
factor-analytic study. Language and Speech, 22, 151-162.

Song, J., & Iverson, P. (2018). Listening effort during speech perception enhances auditory and
lexical process for non-native listeners and accents. Cognition, 179, 163-170.

Song, J. Y., Shattuck-Hufnagel, S., & Demuth, K. (2015). Development of phonetic variants
(allophones) in 2-year-olds learning American English: a study of alveolar stops /t, d/ codas.
Journal of Phonetics, 55, 152-169.

Strange, W. (1992). Learning non-native phoneme contrasts: interactions among subject, stimulus,
and task variables. In Y. Tohkura, E. Vatikiotis-Bateson, & Y. Sagisaka (Eds.), Speech
perception, production, and linguistic structure (pp. 197-219). Tokyo: Ohmsha.

Strange, W. (2007). Cross-language phonetic similarity of vowels. In O.-S. Bohn & M. J. Munro
(Eds.), Language experience in second language speech learning: In honor of James Emil
Flege (pp. 35-55). Amsterdam: John Benjamins.

Strange, W. (2011). Automatic selective perception (ASP) of first and second language speech: A
working model. Journal of Phonetics, 39(4), 456-466.

Strange, W., Bohn, O.-S., Nishi, K., & Trent, S. A. (2005). Contextual variation in the acoustic
and perceptual similarity of North German and American English vowels. Journal of the
Acoustical Society of America, 118, 1751–1762.

Stevens, K. N., Liberman, A. M., Studdert-Kennedy, M., & Öhman, S. (1969). Cross-language
study of vowel perception. Language and Speech, 12, 1-23.

Takagi, N. (1993). Perception of American English /r/ and /l/ by adult Japanese learners of
English: A Unified View. Unpublished PhD dissertation, University of California at Irvine.

Theodore, R. M., Miller, J. L., & DeSteno, D. (2009). Individual talker differences in voice-onset
time: Contextual influences. Journal of the Acoustical Society of America, 125(6), 3974-3982.

Theodore, R. M., Monto, N. R., & Graham, S. (2020). Individual differences in distributional
learning for speech: What’s ideal for ideal observers? Journal of Speech, Language, and
Hearing Research, 63, 1-13.

Thorin, J., Sadakata, M., Desain, P., & McQueen, J. M. (2018). Perception and production in
interaction during non-native speech category learning. Journal of the Acoustical Society
of America, 144(1), 92-103.

Tourville, J. A., & Guenther, F. H. (2011). The DIVA model: A neural theory of speech
acquisition and production. Language and Cognitive Processes, 26(7), 952-981.

Trubetzkoy, N. (1939). Principles of phonology. Translated by C. A. Baltaxe. Berkeley: University
of California Press.

Tulving, E. (1981). Similarity relations in recognition. Journal of Verbal Learning and Verbal
Behavior, 20(5), 479-496.

Tyler, M. D. (2019). PAM-L2 and phonological category acquisition in the foreign language
classroom. In A. M. Nyvad, M. Hejná et al. (Eds.), A sound approach to language matters
– In honor of Ocke-Schwen Bohn (pp. 607-630). Department of English, School of
Communication & Culture, Aarhus University. E-ISBN: 978-87-7507-440-2.

Walley, A. C., & Flege, J. E. (1999). Effect of lexical status on children's and adults' perception of
native and non-native vowels. Journal of Phonetics, 27(3), 307-332.

Weinreich, U. (1953). Languages in contact: Findings and problems. The Hague: Mouton.

Werker, J. F., & Byers-Heinlein, K. (2008). Bilingualism in infancy: first steps in perception and
comprehension. Trends in Cognitive Sciences, 12(4), 144-150.

Werker, J. F., & Logan, J. (1985). Cross-language evidence for three factors in speech perception.
Perception & Psychophysics, 37, 35-44.

Westbury, J. R., Hashi, M., & Lindstrom, M. J. (1998). Differences among speakers in lingual
articulation for American English /r/. Speech Communication, 26(3), 203-226.

Whalen, D., Abramson, A. S., Lisker, L., & Mody, M. (1993). F0 gives voicing information even
without unambiguous voice onset times. Journal of the Acoustical Society of America,
93(4), 2152-2159.

Williams, L. (1977). The perception of stop consonant voicing by Spanish-English bilinguals.
Perception & Psychophysics, 21(4), 289-297.

Yeni-Komshian, G. H., Flege, J. E., & Liu, S. (2000). Pronunciation proficiency in the first and
second languages of Korean-English bilinguals. Bilingualism: Language and Cognition,
3(2), 131-149.

Ylinen, S., Uther, M., Latvala, A., Vepsäläinen, S., Iverson, P., Akahane-Yamada, R., & Näätänen,
R. (2010). Training the brain to weight speech cues differently: A study of Finnish second-
language users of English. Journal of Cognitive Neuroscience, 22(6), 1319-1332.

Zhang, Y., Kuhl, P. K., Imada, T., Kotani, M., & Tohkura, Y. I. (2005). Effects of language
experience: neural commitment to language-specific auditory patterns. NeuroImage, 26(3),
703-720.

Zhang, Y., & Wang, Y. (2007). Neural plasticity in speech acquisition and learning. Bilingualism:
Language and Cognition, 10(2), 147-160.

Figures

Figure 1. The generic three-level production-perception model assumed by the Speech
Learning Model.

Figure 2. The mean VOT (ms) in word-initial tokens of /p t k/ produced in English words in
1992 and 2003 by native Italian (NI) speakers in Canada (a), and by 20 native English speakers
and 20 NI speakers each who reported using English either more or less in 2003 compared to
1992 (b). The error bars in (b) bracket +/- 1 SEM.

Figure 3. Mean VOT values in the production of English /t/ by native speakers of English and
native Spanish Early and Late learners of English. The error bars bracket +/- 1 SEM around
the means, each of which was based on 60 observations.
