L2 phonology of Cantonese speakers of English:

Voicing and aspiration contrast of stops in onset and coda.1

Angus Fung
Department of Linguistics
University of Calgary

Supervisor: John Archibald

April 2004


I would like to express my gratitude to all those who helped me to complete this thesis. I am deeply
indebted to my supervisor, Prof. Dr. J. Archibald, whose help, suggestions and encouragement supported
me throughout the writing of this thesis. I extend my thanks to him for his countless hours of
discussion and commentary.

Second language acquisition is the phrase used to describe the process that

people go through when confronted by a need to use a language other than their native

one for communication. People acquire their first and second languages differently.

Some of the issues and processes involved in language acquisition include the idea of

innateness (Is language ability determined genetically?), the relevance of the language

input the language learner receives, and the nature of early (developmental) grammars

(O'Grady et al., 1989). In this paper, I am going to address a number of issues that

have to do with the acquisition of voicing and aspiration contrasts in a second language

(L2). My major focus will be on what Cantonese-speaking learners do when they are

learning English stops. I will also look at a few other languages and their acquisition of

new stop consonants in an L2.

Most if not all of the pronunciation problems encountered by Cantonese

learners of English may be adequately accounted for by the contrastive differences of

the two languages. I will also examine the phonological differences between the two

languages, including their phoneme inventories, the characteristics of the

phonemes, the distributions of the phonemes, and syllable structure. At the segmental

level, substitution by a related sound in the native language, deletion and epenthesis

are by far the most common strategies.

Cantonese and English are two typologically different languages. Cantonese is

one of the major dialects of Chinese and the language belongs to the Sino-Tibetan

language family. It is spoken in Guangdong (including Hong Kong), Macau, and in the

southern part of Guangxi (Figure 1). English, on the other hand, is a Germanic language

which belongs to the Indo-European family (Ethnologue, 2004).

Figure 1: Map of Guangdong Province (Wertz, 2003)


2. Phonetics of VOT

In this section, I will take a look at the production and perception of stops in

terms of voice onset time (VOT), one of the cues to voicing and aspiration

contrasts in stops.

2.1 Articulation

Voice Onset Time (VOT) is the duration of the period of time between the

release of a plosive and the beginning of vocal fold vibration. This period is usually

measured in milliseconds (ms). It is useful to distinguish at least three types of

plosives with different VOT: voiceless-unaspirated, voiceless-aspirated and voiced.

Figure 2 shows the waveforms of the two plosives “t” and “d”. The arrows

indicate the release burst of the stop consonant and the onset of glottal vibrations for

the vowel. Clearly, the VOT is longer for the voiceless than for the voiced stop. This is

due to glottal abduction, that is, the opening of the vocal folds for the voiceless

stop, and its temporal relationship to the oral closing and opening movements.

Figure 2. The picture is a waveform of English [t, d] each followed by the vowel
[a]. The y-axis represents amplitude. The x-axis is time - 1.5s overall. Morton, K.

2.2 Voicing and Aspiration

When a plosive has a fairly long positive VOT (longer than about

50 ms), the air from the lungs travels quite quickly through the vocal tract. It is

slowed down neither by the vocal folds, which are open, nor by a constriction in

the vocal tract, because the plosive has been released. The rapid airflow creates a weak

friction noise, which is heard as aspiration. When a voiceless unaspirated plosive is followed by a vowel, the time

when the vocal folds begin vibrating for the vowel will coincide almost exactly with

the time when the plosive is released (give or take up to 20 milliseconds). After a

voiceless aspirated stop, however, the vocal folds will not begin vibrating until well

after the plosive is released.

The production of stops is not always uniform in terms of VOT, but when

a language has two or more contrasting stops, for example /t/ and /d/ in

English, each stop is produced within a particular range of VOT. The

graph in Figure 3 shows the productions of a speaker of American English

for words beginning with /d/ or /t/. The productions of /d/ range from 0 ms to 25 ms and

those of /t/ range from 50 ms to 80 ms.

Figure 3. VOT production of a single normal adult speaker of American English for
words beginning with /d/ and /t/ (Blumstein et al., 1980).
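
The two ranges reported above can be written down as a small lookup. The following is only an illustrative sketch: the function name and the "outside both ranges" label are assumptions introduced here, and the ranges describe the single speaker in Figure 3 rather than English speakers in general.

```python
# Illustrative sketch only: VOT ranges are those reported for the single
# speaker in Figure 3; the helper name and fallback label are hypothetical.
D_RANGE_MS = (0, 25)    # productions of /d/
T_RANGE_MS = (50, 80)   # productions of /t/


def label_alveolar_stop(vot_ms):
    """Label a word-initial alveolar stop from its VOT in milliseconds."""
    if D_RANGE_MS[0] <= vot_ms <= D_RANGE_MS[1]:
        return "/d/"
    if T_RANGE_MS[0] <= vot_ms <= T_RANGE_MS[1]:
        return "/t/"
    return "outside both ranges"


for vot in (10, 37, 65):
    print(vot, "ms ->", label_alveolar_stop(vot))
```

The gap between 25 ms and 50 ms is left unlabelled on purpose, since the source reports no productions in that region.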

These are just two different possible ways of coordinating the timing between

vocal fold vibration and a closure in the mouth. Various languages make use of many

points along this VOT continuum. In the following diagrams (Figure 4), the top half

represents the closing and opening of a plosive in the mouth and the bottom half

represents the state of the vocal folds -- a straight line means voicelessness and a

wavy line means voicing.

Lip closure:

1 fully voiced /ba/
2 partially voiced /ba/
3 voiceless /pa/
4 aspirated /pʰa/
5 strongly aspirated /pʰa/

Figure 4. Different VOT of stops. (Russell, 1997)

Languages that make voicing contrasts usually choose two or three points

along this continuum (Abramson & Lisker, 1970). English has chosen to use position

2 for its voiced sounds and either 3 or 4 (depending on position in the word or

syllable) for voiceless sounds. French has chosen to use 1 (fully voiced) and 3

(voiceless unaspirated) (Flege 1987). Cantonese has chosen to use 3 (voiceless

unaspirated) and 5 (strongly aspirated) (Tsui & Ciocca 2000).
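
The choices described above can be compared mechanically. The sketch below encodes the positions each language is said to use and lists the positions that are new for a learner; the dictionary and function names are assumptions made for illustration, and "new position" is simply a set difference, not a claim about actual learning difficulty.

```python
# Positions on the VOT continuum, numbered as in Figure 4.
POSITIONS = {
    1: "fully voiced",
    2: "partially voiced",
    3: "voiceless",
    4: "aspirated",
    5: "strongly aspirated",
}

# Positions each language uses, as stated in the text (hypothetical encoding).
USED = {
    "English": {2, 3, 4},
    "French": {1, 3},
    "Cantonese": {3, 5},
}


def new_positions(l1, l2):
    """Positions used by the L2 that the L1 does not use."""
    return sorted(USED[l2] - USED[l1])


for pos in new_positions("Cantonese", "English"):
    print(pos, POSITIONS[pos])
```

For a Cantonese learner of English, this simple comparison picks out positions 2 and 4 as the ones with no L1 counterpart.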

2.3 Perception of VOT

The perception of the voicing and aspiration contrasts (e.g. /p/ vs. /b/, /pʰ/ vs.

/p/) in stops depends on acoustic cues such as VOT. We usually do not perceive

stimuli categorically (Kess 1992). For example, we do not see a colour spectrum from

blue to red as either pure blue or pure red and nothing in between. A colour can be

kind of blue and kind of green, whereas a stop cannot be kind of [d] and kind of [t]; it

is either a [d] or a [t].

One of the things that people seem to perceive categorically is speech. This is

called categorical perception because instead of getting a percept that is ambiguous,

you get a percept that perfectly matches an example of a particular category. So even

when the physical stimuli change continuously, people still perceive them

categorically. For example, both /b/ and /p/ are stop consonants and to produce these,

you close your lips, then open them, release some air, and the vocal cords begin

vibrating. The difference between /ba/ and /pa/ is the different VOT of the two stops.

For /b/, VOT is very short; voicing begins at almost the same time as the air is

released. For /p/, the onset of voicing is delayed.

To show the categorical perception of stops, a study by Pisoni & Tash (1974)

used a series of synthetic stimuli that span the VOT continuum between /ba/ and /pa/.

When people were asked to identify these stimuli, they generally had no difficulty:

the lower half of the continuum was consistently identified as /ba/ and the upper half as

/pa/, as shown in Figure 5. People did not report hearing something that was a bit like [b]

and a bit like [p]. Rather, they reported hearing either [b] or [p]. Thus, the actual VOT

of the individual stimulus appears to be discarded, and all that remains in the percept

is category membership.

Because of the categorical perception of speech, it is not an easy task for

people to distinguish all speech sounds. Generally, they can only distinguish the

speech sounds that result in meaningful differences in their native language. To test

infants’ ability to discriminate different speech sounds, Eimas et al. (1971)

tested two groups of infants, who were 1 month and 4 months of age.

Results showed that infants at both ages distinguished sounds that were members of

separate phonemes (i.e. categories) from one another but they failed to distinguish

sounds within a given category. The study also shows that infants can distinguish

speech sounds before they can produce them. Figure 6 shows the result of this

experiment. Infants perceived stops with VOTs of –20 ms and 0 ms as the

same stop, and likewise stops with VOTs of 60 ms and 80 ms. But stops with

VOTs of 20 ms and 40 ms were perceived as two different stops.

Figure 6. Experimental design of infant discrimination study (Eimas et al., 1971).
Note: S = perceived as the same stop; D = perceived as two different stops.
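
The pattern in Figure 6 follows from a single category boundary. The sketch below assumes a boundary at 30 ms, a hypothetical value chosen only because it lies between the 20 ms and 40 ms stimuli; the source does not state the exact boundary, and the function names are likewise introduced here for illustration.

```python
# Assumption: one category boundary somewhere between 20 ms and 40 ms.
# 30 ms is a hypothetical midpoint, not a value reported in the source.
BOUNDARY_MS = 30


def category(vot_ms):
    """Assign a stimulus to one of two categories by its VOT."""
    return "short-lag" if vot_ms < BOUNDARY_MS else "long-lag"


def predicted_response(vot_a, vot_b):
    """'S' if both stimuli fall in one category, 'D' if they straddle the boundary."""
    return "S" if category(vot_a) == category(vot_b) else "D"


PAIRS = [(-20, 0), (20, 40), (60, 80)]
for a, b in PAIRS:
    print(f"{a} ms vs {b} ms -> {predicted_response(a, b)}")
```

Each pair differs by the same 20 ms, so a listener responding to the physical difference alone would treat all three pairs alike; only the pair that straddles the boundary is predicted to be heard as different.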

In the next section, we are going to look at the differences between English and

Cantonese phonological systems, as this will help us to account for the problems and

difficulties encountered by Cantonese speakers in the process of learning English.


3. Phonology of Syllables

There are 24 consonants in English and 19 consonants in Cantonese. Both

languages have six stops at the bilabial, alveolar and

velar places of articulation. In English, /p, t, k/ are voiceless whereas /b, d, g/ are voiced. In

Cantonese, however, there are no voiced plosives; all plosives are voiceless. The

feature that distinguishes the stops is aspiration (/p, t, k/ vs. /pʰ, tʰ, kʰ/).

Table 1. An overview of English and Cantonese consonants (Chan & Li, 2000).
(C) = Cantonese; (E) = English. Columns give the place of articulation.

Manner            Bilabial  Labio-dental  Dental  Alveolar  Palato-alveolar  Palatal  Velar  Labio-velar  Glottal
Plosive      (C)  p, pʰ                           t, tʰ                               k, kʰ  kw, kwʰ
             (E)  p, b                            t, d                                k, g
Fricative    (C)            f                     s                                                       h
             (E)            f, v          θ, ð    s, z      ʃ, ʒ                                          h
Affricate    (C)                                  ts, tsʰ
             (E)                                            tʃ, dʒ
Nasal        (C)  m                               n                                   ŋ
             (E)  m                               n                                   ŋ
Lateral      (C)                                  l
             (E)                                  l
Approximant  (C)  w                                                          j
             (E)  w                               r                          j

English has a relatively complex syllable structure. There can be a maximum of

three consonants before a vowel and a maximum of four consonants after a vowel. One
such example is ‘strengths’ /strɛŋkθs/. The syllable structure of Cantonese, in

contrast, is rather simple; the possible combinations of sounds are restricted. Unlike

English, there are no consonant clusters in Cantonese. Thus, in terms of possible

configurations of V and C, English clearly outnumbers Cantonese, the latter being

limited to V, CV, VC, and CVC. Examples are given in Table 2 below:

Table 2
Syllable structure   Example   Gloss
V                    //        ‘exclamation showing surprise’
CV                   /fu/      ‘husband’
VC                   / an/     ‘late’
CVC                  /sik/     ‘colour’
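
The two syllable templates described above can be stated as patterns over C/V skeletons. This is a minimal sketch, assuming a single V for the nucleus and using hypothetical names (`CANTONESE`, `ENGLISH`, `fits`); it checks only the number of consonants, not which consonants are permitted in each position.

```python
import re

# Templates as described in the text: Cantonese allows V, CV, VC and CVC;
# English allows up to three consonants before and four after the vowel.
CANTONESE = re.compile(r"C?VC?")
ENGLISH = re.compile(r"C{0,3}VC{0,4}")


def fits(template, skeleton):
    """True if the whole C/V skeleton matches the syllable template."""
    return template.fullmatch(skeleton) is not None


for skeleton in ("V", "CV", "VC", "CVC", "CCCVCCCC"):
    print(skeleton, "Cantonese:", fits(CANTONESE, skeleton),
          "English:", fits(ENGLISH, skeleton))
```

The skeleton CCCVCCCC corresponds to ‘strengths’: it fits the English template but not the Cantonese one, which is where repair is expected.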

In terms of distribution of consonants, all the stops in English may occur in

initial or final position of a syllable, except [ŋ], which cannot occur in syllable-initial position. In

contrast, only /p, t, k/ in Cantonese may occur in syllable-final position, as illustrated

in Figure 7. It should be noted that, unlike plosives in English, Cantonese plosives in

word-final position are always unreleased, for example in the words ‘duck’ /ap/,

‘prosper’ /fat/, and ‘house’ /uk/. In English, by contrast,

unreleased stops occur only in connected speech, when a word-final stop is followed

by a word beginning with a stop. For example, the word-final [p] of the word “cup” is

unreleased when it is followed by a consonant, as in (1) but not (2).

(1) “cup to” /kptu/

(2) “cup on” /kpn/

Figure 7. Syllable structure of Cantonese and English.

Cantonese: syllable = onset (C) + rhyme; rhyme = nucleus V + coda (C)
  coda consonants: p, t, k, m, n, ŋ
English: syllable = onset CCC + rhyme; rhyme = nucleus V + coda CCCC
  consonants: p, b, t, d, k, g, f, v, s, z, ʃ, ʒ, h, tʃ, dʒ, m, n, l, w, r, j

4. Explaining L2 Behavior

Having examined the phonological differences between the two

languages, I would now like to review the behavior of L2 learners: how do we predict what

they will do if the target forms are not found in their native language?

Second language researchers have proposed a number of theories to explain

why certain target forms are more difficult to acquire than others. One of the earliest

was the Contrastive Analysis Hypothesis (Lado, 1957). This hypothesis stated that

when two languages are similar, positive transfer will occur and hence those forms will

be easy to learn; where they are different, negative transfer or interference will result

and those forms will be difficult to acquire. However, it turns out that defining

similarity and difference is not always easy. Some researchers (Eckman & Iverson

1993, 1994) suggested that typological markedness be the basis of prediction.

Structures that are simple and/or especially common in human language are said to be

unmarked, while structures that are more complex or less common are said to be

marked. A definition is given in Eckman (1981): "A phenomenon A in some language

is more marked relative to some other phenomenon B if, cross-linguistically, the

presence of A in a language necessarily implies the presence of B, but the presence of

B does not necessarily imply the presence of A." In other words, when a language has

voiced stops e.g. [d], we would expect that the language would have a voiceless

counterpart, [t] but not vice versa. From that, we could say that voiced stops are more

marked than voiceless ones.

Sometimes something that is not in your L1 can be easy to acquire. For example, English
does not contrast [ʃ] and [ʒ] in word-initial position, but English

speakers seem able to make the contrast in French onsets without trouble. The

Markedness Differential Hypothesis (MDH) investigates second language acquisition

by comparing the relative markedness of structures in L1 and L2. In those areas in

which there are differences between a target and a native language, the degree of

difficulty will be greater when the target language structure is more marked than the native

language structure, and smaller when the difference in markedness is smaller. That is, the degree of difficulty

among those target language (TL) structures that are different from those in the native

language (NL) will correspond directly to the degree of markedness. The two

factors that the MDH asks us to consider when predicting the L2

difficulty of the target language are as follows:

-The difference between the NL and TL.

-The markedness relationships holding between those areas of differences.

In (3), the presence of nasal vowels implies the presence of oral vowels but not
vice versa. There are languages which have [a] and [ã], and languages which have [a] alone,

but there are no languages which have [ã] but not [a]. From that, we know that nasal

vowels are more marked than oral vowels, and so we would predict that the degree of

difficulty is higher when there are nasal vowels in the target language but not in the

native language.
(3) [ã] implies [a]
∴ Nasal vowels are more marked than oral vowels. Hence, the prediction of the
MDH would be that nasal vowels are more difficult to acquire.
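
Eckman's implicational definition and the MDH prediction can be sketched as two small checks. Everything in the sketch is a simplification for illustration: the three toy inventories are invented, the feature labels `oral_vowel` and `nasal_vowel` stand in for [a] and [ã], and the function names are hypothetical.

```python
# Toy inventories, invented for illustration only (not real language data).
INVENTORIES = [
    {"oral_vowel", "nasal_vowel"},
    {"oral_vowel"},
    {"oral_vowel"},
]


def implies(inventories, a, b):
    """True if every inventory that contains a also contains b."""
    return all(b in inv for inv in inventories if a in inv)


def more_marked(inventories, a, b):
    """a is more marked than b: a implies b, but b does not imply a."""
    return implies(inventories, a, b) and not implies(inventories, b, a)


def mdh_predicts_difficulty(native, target_structure, inventories):
    """Difficulty is predicted for a TL structure that is absent from the NL
    and more marked than some structure the NL already has."""
    if target_structure in native:
        return False
    return any(more_marked(inventories, target_structure, s) for s in native)


print(more_marked(INVENTORIES, "nasal_vowel", "oral_vowel"))
print(mdh_predicts_difficulty({"oral_vowel"}, "nasal_vowel", INVENTORIES))
```

With these toy data, nasal vowels come out as more marked than oral vowels, and a learner whose native language has only oral vowels is predicted to have difficulty with nasal vowels in the target language.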

On the other hand, those TL differences that are not more marked will not be

difficult. MDH can explain several major patterns of difficulties found in second

language acquisition. Now that we know what kinds of target forms are difficult for L2

learners, we will discuss what L2 learners do when the target forms are difficult to acquire.


4.1 Repair Strategies

A common phenomenon in second language learning is the

modification of an L2 word so that it fits the L1 syllable structure. For example, in

Japanese loanwords, “strike” /straik/ becomes /sutoraiki/ because Japanese mainly

allows CV in its syllable structure. Another example is found in German speakers.

When they are learning English, they produce words with syllable-final
obstruent devoicing (producing [hæt] for [hæd] “had”) because they have no voicing

contrast at the end of words in their L1.

The consonant-vowel (CV) syllable is the least marked syllable structure because it

can be found in all languages in the world. In order to

pronounce the target English items, Cantonese speakers adopt a number of

strategies to break up the more complex, more marked syllable structures of English.

Epenthesis is one of the strategies Cantonese speakers use. A vowel, usually a schwa
/ə/, is inserted within a consonant cluster or after a final consonant of the syllable.

Another repair strategy is deletion. In this case, coda consonants or members of

consonant clusters are deleted in order to obtain the more optimal syllable (CV). The

final type of strategy concerning coda consonants or onset consonant clusters is

replacement or substitution. This strategy doesn’t alter the syllable structure and it

appears quite frequently in final voiced stops (Edge, 1991). For Cantonese L2 learners

of English, most of the errors found in these items involve the voicing feature:

devoicing is the most common error in final voiced stops. The following examples illustrate

the three strategies mentioned above.

(4) Solutions to syllable structure problems:
a. Epenthesis /dɔg/ → /dɔgə/
b. Deletion /dɔg/ → /dɔ/
c. Devoicing /dɔg/ → /dɔk/
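
The three repairs in (4) can be written as string operations on a CVC form. The sketch below uses the ASCII string "dog" as a stand-in transcription, and the function names, the vowel set, and the devoicing table are assumptions made for illustration rather than a model of any particular learner's grammar.

```python
# Hypothetical helpers illustrating the three repair strategies on a CVC form.
DEVOICE = {"b": "p", "d": "t", "g": "k"}
VOWELS = set("aeiouə")


def epenthesis(form, vowel="ə"):
    """Add a vowel after the final consonant: CVC -> CVCV."""
    return form + vowel


def deletion(form):
    """Delete the final consonant: CVC -> CV."""
    return form[:-1]


def devoicing(form):
    """Replace a final voiced stop with its voiceless counterpart: CVC -> CVC."""
    return form[:-1] + DEVOICE.get(form[-1], form[-1])


def skeleton(form):
    """Reduce a form to its C/V skeleton."""
    return "".join("V" if seg in VOWELS else "C" for seg in form)


for repair in (epenthesis, deletion, devoicing):
    output = repair("dog")
    print(repair.__name__, output, skeleton(output))
```

Epenthesis and deletion both remove the coda (CVCV and CV), while devoicing keeps the closed CVC shape and changes only the voicing of the final stop.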

Different strategies for syllable structure simplification result in different

outcomes: CVC sequences undergoing final consonant deletion or epenthesis surface

as CV syllables, whereas repair strategies such as final devoicing and substitution

maintain the relatively marked, closed CVC structure. Even though both deletion and

epenthesis convert the relatively complex CVC syllable into relatively simple CV

syllables, their outcomes differ as to what degree of ambiguity they impose on a word.

According to Weinberger (1994), recoverability is a principle “subsumed under

a theory of universal grammar”: languages, native speakers, and language learners avoid

or minimize ambiguity. Young children frequently delete segments in both onset and

coda position but very rarely make use of epenthesis. This is because their phonetic

ability is low and their functional knowledge (in terms of the recoverability principle)

is not yet developed. Adults learning L2s seem to exhibit far more instances of

epenthesis than children acquiring their L1. The reason why epenthesis is a more

common simplification strategy in adult L2 acquisition is, according to Weinberger

(1994), that even though adults’ phonetic skills in the target language lag behind those

of native speakers, they do have access to the recoverability principle.

To learn more about recoverability of L2 learners, Abrahamsson (2003) did a

study of Chinese-Swedish interphonology. Three Chinese subjects were included in

this longitudinal study of their L2 acquisition of Swedish. Recordings were made at

3- to 5-week intervals from August 1990 to May 1991. The experiment tested

his hypotheses about the developmental path of L2 learners and their selection of repair

strategies with regard to grammatical and functional factors. He predicted

that error frequencies would be relatively low in the initial stages, higher

at a later stage, and relatively low again at even later stages of acquisition.

Also, the proportion of epenthetic forms would be relatively low in early phases of development but

greater in later phases. Figure 8 shows the results of the overall error frequencies in

the experiment. The result agreed with his prediction that learners’ acquisition of

codas can be characterized by the following four phases: (a) an initial phase with

relatively high error rates, followed by a rapid decrease in error frequency; (b) a linear

increase in error frequency; (c) a stable plateau phase of relatively high error

frequencies; and (d) a possible decrease in error rates as acquisition proceeds.

Figure 8. Overall error frequencies (deletion + epenthesis), development over time.

Figure 9 gives a summarized description of what the pattern looks like when

the mean epenthesis proportions for the autumn semester 1990 are compared with the

mean proportions for the spring semester 1991. Subject C1 already used epenthesis

more than twice as much as deletion during the autumn semester (epenthesis-deletion

proportion: 2.13) and almost three times as often during the spring semester

(proportion: 2.87). C2’s use of epenthesis is barely half as frequent as his use of

deletion during the first semester (proportion: 0.44), but there is a significant increase

in his use of epenthesis, which is almost as frequent as deletion during the second

semester (proportion: 0.88). Finally, C4 increased her use of epenthesis, which was

nearly as frequent as deletion during the autumn of 1990 (proportion: 0.75), to a level

almost three times as frequent during the spring of 1991 (proportion: 2.77). This was

a significant change.

Figure 9. Proportion of epenthesis to deletion errors, development over time.
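
The epenthesis-deletion proportion is a simple ratio, so a value above 1 means epenthesis outnumbers deletion and a value below 1 means the reverse. The sketch below uses the six proportions reported in the text; the function names are hypothetical, and the underlying error counts are not given in the source, so none are assumed here.

```python
# Proportions (epenthesis / deletion) as reported in the text:
# subject -> (autumn 1990, spring 1991).
REPORTED = {
    "C1": (2.13, 2.87),
    "C2": (0.44, 0.88),
    "C4": (0.75, 2.77),
}


def proportion(epenthesis_count, deletion_count):
    """Epenthesis-deletion proportion for raw error counts."""
    return epenthesis_count / deletion_count


def dominant_strategy(p):
    """Which repair is more frequent for a given proportion."""
    if p > 1:
        return "epenthesis"
    if p < 1:
        return "deletion"
    return "equal"


for subject, (autumn, spring) in REPORTED.items():
    print(subject, dominant_strategy(autumn), "->", dominant_strategy(spring),
          f"(change x{spring / autumn:.2f})")
```

Read this way, all three subjects move toward epenthesis over time, although C2 still uses deletion more often than epenthesis in the second semester.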

The functional or grammatical role of the coda also determines the use of

different repair strategies. In Abrahamsson’s (2003) hypothesis, word-final codas that

are relatively more important for the retention of semantically relevant information

will generate lower overall frequencies of simplification, greater epenthesis-deletion

proportions, or both, than will codas containing information that is more recoverable

(or predictable) from other segments or features in the context. In Swedish, a coda /r/

can serve as a plural marker or a tense marker, and it also occurs in noninflected

words. According to Abrahamsson’s hypothesis, if the final consonant of a

noninflected word has been deleted, it is generally not expressed by other explicit

markers or features in the context, and it can be argued that deletion of the stem-final

/r/ results in much greater lexical-semantic ambiguity than the partial deletion of an

inflectional morpheme. It may therefore also be argued that the retention of final /r/ is

more beneficial in noninflected words. To test the hypothesis, inflected words that

ended in either the present-tense morpheme -r/-er or the plural morpheme -r/-ar/-er/-or

were compared with noninflected words with stems that ended in a single /r/. Figure

10 shows the proportions of epenthesis to deletion. Although the differences again

appear to be very small, all subjects produced significantly more epentheses for

noninflected forms than for inflected forms.

Figure 10. Proportion of epenthesis to deletion errors, inflectional vs. lexical /r/ codas.

Figure 11. Proportion of epenthesis to deletion errors, present tense vs. plural /r/ codas.

Two pairs of word classes were compared with respect to epenthesis and

deletion. One of them is the comparison between present tense and plural. As can be

seen in Figure 11, there is no consistency between the three subjects: C1 used

epenthesis significantly more often for present-tense (proportion: 0.1) than for plural

codas (0.02); subject C2 did not differentiate his use of epenthesis between the two

inflectional categories in any significant way. The other comparison deals with

differences between open- and closed-class words. Since the Swedish word-final /r/

of open-class words is less recoverable from the context, it is predicted to be pronounced

more accurately, with a lower overall error frequency and a higher proportion of

epenthesis, than the more recoverable or predictable /r/ of closed-class words. The

result is shown in Figure 12.

Figure 12. Proportion of epenthesis to deletion errors, closed-class vs. open-class /r/ codas.

It is generally believed that greater accuracy is obtained by L2 learners as style

becomes more formal in learners’ production of singleton consonants (Schmidt, 1987).

However, Lin (2001) found that in the case of consonant clusters, it is the learners’

choice of repair strategy but not the error rates that varies with the style of speech.

Twenty Chinese adults were included in her study of production of English onset

consonant clusters in four different types of tasks. The experiment included a wide

variety of task types, ranging from the most formal “reading of minimal pairs”, “word

list reading”, and “sentence reading” to the least formal “conversation”, as shown in

Figure 13.

Figure 13. Task types, from most formal to least formal: reading of minimal pairs, word list reading, sentence reading, conversation.

The error rates support her hypotheses and do not conform to

the general prediction that more accuracy will be obtained from L2 learners’

production of target items as the style becomes more formal. No significant

difference was found in the students’ error rates across the four speech tasks, as shown in

Figure 14.

Figure 14. Overall error rates in the four tasks. (Lin 2001)

Her study also showed that the use of epenthesis increased as the style of the

task became more formal, and the percentage of deletion and replacements became

higher in less formal tasks. It is also true that the proportion of epenthesis vs. deletion

should be greater in tasks without linguistic context than in tasks with linguistic

context. For tasks that were more formal or that required more attention to form or

pronunciation rather than to content, the use of epenthesis would increase. On the

other hand, when the tasks became less formal or as more attention was paid to

content rather than form, more instances of deletion and replacement would be

preferred. The results of her experiment indicate that what is shifted with style is the

learners’ choice of the repair strategies rather than the accuracy rates.

Figure 15. Percentages of the three strategies in the four tasks. (Lin 2001)
Note: MP = minimal pair; WL = word list; S = sentence; C = conversation.

5. Phonetics of L2 Learners

So, can L2 learners acquire new VOT? In this section, I will review the existing

literature on the acquisition of L2 stops that differ from those in the learners’

L1.

Curtin et al. (1998)

Curtin et al. (1998) studied the acquisition of Thai voice and aspiration by

English and French speakers. Thai has a 3-way voicing contrast phonemically in stops

which includes voiced, voiceless unaspirated and voiceless aspirated stops. English

also has the three phonetically different stops, but only two phonemically different

stops. Aspiration is not a contrastive feature in English and so there

is no lexical distinction between aspirated and non-aspirated stops. Still there is a

phonetic difference between the [p] in “spin” and the aspirated [pʰ] in “pin”.

Underspecification means that underlying representations are not fully specified and

that predictable information is not underlyingly present. Underspecification theory

expresses this by assuming that underlyingly both p's are not specified for aspiration.

In this study, Curtin et al. (1998) wanted to find out whether allophonic aspiration in

English [p] vs. [pʰ] aids in the acquisition of contrastive aspiration in Thai /p/ and /pʰ/.

They also wanted to compare the developmental progression of the English learners

to that of native speakers of French. Like English, French has a 2-way voicing contrast

phonemically. Phonetically, however, it makes only a voicing contrast, with no

aspiration contrast: there are voiced and voiceless stops in French, but

no aspirated stops.

There is some cross-language speech perception research (Abramson and

Lisker, 1970; Strange, 1972; Pisoni et al., 1982) which has shown that English

speakers find it easier to perceptually distinguish aspirated-unaspirated segments (e.g.

/pʰ/ vs. /p/) than voiced-voiceless segments (e.g. /p/ vs. /b/) in studies using synthetic VOT

stimuli. But in Curtin et al.’s (1998) study, the results showed the opposite in one of the

tasks. English speakers did better in distinguishing voiced-voiceless segments than

aspirated-unaspirated segments in a minimal pair task. Curtin et al. (1998) claimed

that these contradictory orders of acquisition of L2 voicing and aspiration contrasts by

native speakers of English (the aspiration contrast is perceptually easier to distinguish

for English speakers, but English subjects did better on the voicing contrast in this study)

can be explained by the generative phonological distinction between lexical and surface

representations: responses on that task must be made on the basis of lexically

stored representations. The details and the results of the experiment will be discussed

later in this section.

Aspiration is not part of the lexical representation in English; all voiceless

stops are stored as unaspirated in the lexicon, and aspiration emerges only in the fully specified

phonetic representation. Underspecification theory expresses this by assuming that

underlyingly both /p/s are not specified for aspiration in [pʰin] and in [spin]. The

aspiration feature in [pʰin] is later specified by a context-sensitive rule that applies at the beginning of

a syllable; aspiration does not apply in other contexts. English has no lexical

distinction between aspirated and non-aspirated stops, but there is still a phonetic

difference, as shown in (5).

(5) Lexical representation:  /pæt/    /spæt/   /bæt/
    Aspiration rule:         [pʰæt]   —        —
    Surface representation:  [pʰæt]   [spæt]   [bæt]
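
The derivation in (5) amounts to one context-sensitive rewrite. The sketch below is a deliberately minimal version, assuming monosyllabic forms so that "syllable-initial" can be read as "first segment"; the function name and that simplification are introduced here and are not part of the original analysis.

```python
VOICELESS_STOPS = {"p", "t", "k"}


def aspiration_rule(lexical):
    """Map a lexical form to its surface form.

    Assumption: the input is a single syllable, so a voiceless stop is
    syllable-initial exactly when it is the first segment of the form.
    """
    if lexical and lexical[0] in VOICELESS_STOPS:
        return lexical[0] + "ʰ" + lexical[1:]
    return lexical


for lexical in ("pæt", "spæt", "bæt"):
    print(f"/{lexical}/ -> [{aspiration_rule(lexical)}]")
```

Only /pæt/ is changed by the rule; /spæt/ and /bæt/ surface unchanged, which is why aspiration is predictable in English and need not be stored lexically.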

Triads of words that minimally differ in both voice and aspiration are found in

Thai. Neither of these features is predictable, and so both voice and aspiration features

are represented lexically.

(6) /bèt/ ‘fishhook’ /pèt/ ‘duck’ /pʰèt/ ‘spicy’

The first task of the study is a Minimal Pair Task. Nine Canadian English

speakers, 8 Canadian French speakers and 10 native speakers of Thai (controls) were

asked to choose between pictures of words that are in minimal pair relationship, when

presented with one word aurally. The pictures of the minimal pair are accompanied by

a picture of a foil that differs phonetically in more than one segment from the other

words. An aural presentation was heard and subjects were asked to respond by

pressing a key that corresponds to the position of the appropriate picture on the

screen. This task was used to study the development of lexical representation and to

find out if the subjects could lexically contrast both voice and aspiration, to see if they

can access the correct lexical entry if they hear a word.

The second task is called an ABX Task. In this task, a minimal pair ‘AB’ is

presented aurally followed by a third word ‘X’ that matches either A or B. The

tokens used for A, B and X were each produced by a different speaker. There were 72

trials: 16 each of Aspiration–Voiceless, Voiced–Voiceless and Aspiration–Voiced, and

24 Place controls. Subjects were asked to indicate whether the

third word ‘X’ matched A or B.
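
The trial composition and the scoring of an ABX task can be sketched as follows. The trial counts are those stated above; the scoring function, its tuple format, and the three example responses are hypothetical and are not taken from Curtin et al. (1998).

```python
# Trial counts as stated in the text.
TRIAL_COUNTS = {
    "Aspiration-Voiceless": 16,
    "Voiced-Voiceless": 16,
    "Aspiration-Voiced": 16,
    "Place control": 24,
}
TOTAL_TRIALS = sum(TRIAL_COUNTS.values())


def score_abx(trials):
    """Proportion correct per contrast.

    trials: iterable of (contrast, correct_answer, response), where the
    answers are 'A' or 'B' (a hypothetical record format).
    """
    totals, hits = {}, {}
    for contrast, correct, response in trials:
        totals[contrast] = totals.get(contrast, 0) + 1
        hits[contrast] = hits.get(contrast, 0) + (response == correct)
    return {contrast: hits[contrast] / totals[contrast] for contrast in totals}


# Invented responses, for illustration only.
example = [
    ("Voiced-Voiceless", "A", "A"),
    ("Voiced-Voiceless", "B", "A"),
    ("Aspiration-Voiceless", "B", "B"),
]
print(TOTAL_TRIALS)
print(score_abx(example))
```

Because X always matches either A or B, chance performance on each contrast is 0.5, which is the baseline against which the proportions correct are read.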

The results of the Minimal Pair task show that aspirated–unaspirated minimal

pairs were discriminated by both English and French groups at a level only slightly

better than chance; performance on the voicing contrast was better (Figure 16). This

experiment lasted for 11 days and results were collected on day 2, day 4 and day 11.

From the results on the last day (day 11), we can see a developmental difference

between some of the English and French subjects. This suggests that the presence of

surface aspiration in English might facilitate the establishment of a lexical aspiration

contrast in the L2 acquisition of Thai. Because of this, Curtin et al. (1998) suggested

that L1 surface features can be lexicalized in L2 acquisition.

Figure 16. Minimal Pair Task- proportion correct (Curtin et al. 1998)

French only has voicing contrast in both lexical and surface representations, so

as expected in the ABX task, French speakers performed better on the voicing contrast than

on aspiration (Figure 17), similar to what they did in the Minimal Pair task. English

speakers performed similarly on the voicing and aspiration contrasts in the ABX task, as

shown in Figure 17. These ABX results were quite different from what English

speakers did in the Minimal Pair task in which their performance on aspiration was

significantly worse than on voice.

Figure 17. ABX Task: proportion correct (Curtin et al. 1998)

Curtin et al. (1998) claimed that the Minimal Pair task accesses lexical

representations which lack aspiration in English, while the discrimination task

accesses surface representations which contain aspiration in English. We could see

from the results of an ABX discrimination task that English subjects did better than

the French subjects on aspiration.

L2 learners initially construct lexical representations that make use of only

those features that are present lexically in the L1, even though they may be able to

discriminate other L2 contrasts on the basis of surface features, and may eventually

lexicalize these surface features. Results show that aspirated–unaspirated Minimal

Pairs were better discriminated by the English speakers than by the French speakers. The

French speakers performed better on the voice contrast than on aspiration.

In the task that accesses lexical representations, English learners discriminated aspiration

poorly, while in the task that accesses surface representations, English speakers

did better in aspiration discrimination. This is supported by the results of the

discrimination task, in which English subjects did better than the French subjects.

Flege and Eefting (1988)

Flege and Eefting (1988) examined the imitation of a VOT continuum ranging

from /da/ to /ta/ (-60 to +90 ms) by subjects differing in age and/or linguistic

experience. Subjects were native speakers of English, native speakers of Spanish and

bilingual speakers of both. Spanish and English use different phonetic categories to

implement the contrast between /t/ and /d/. In Spanish, [d] is used to implement /d/

and [t] implements /t/. Spanish categories of [d] and [t] yield stops with VOT values

of approximately –80 ms and 20 ms respectively, in word initial position. In English,

/d/ is implemented by [d] and [t], and /t/ is implemented by [th]. English [d]

and [t] have VOT values of about –80 ms and 20 ms. The rule used to implement

[th] yields VOT values of approximately 80 ms (Flege and Eefting, 1986). Figure 18

illustrates how English and Spanish speakers divide up a VOT continuum based on

their native language categories.

Figure 18. Identification of a VOT continuum by English and Spanish speakers (horizontal axis: VOT in ms, from –80 to 80; rows: English /d/ and /t/, Spanish /d/ and /t/)
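The category mappings described above can be sketched in a few lines. The modal VOT values (–80, 20 and 80 ms) and the category-to-phoneme mappings come from the text; the decision rule (assign each stimulus to the native category with the nearest modal value) is a simplifying assumption made here for illustration only, and it does not reproduce the perceptual boundaries measured by Flege and Eefting.

```python
# Modal VOT values (ms) reported in the text for word-initial stops.
MODAL_VOT = {"[d]": -80, "[t]": 20, "[th]": 80}

# Phonetic category -> phoneme it implements, per language (from the text).
IMPLEMENTS = {
    "Spanish": {"[d]": "/d/", "[t]": "/t/"},
    "English": {"[d]": "/d/", "[t]": "/d/", "[th]": "/t/"},
}


def identify(vot_ms, language):
    """Label a stimulus with the phoneme of the nearest native category.

    Assumption: the listener picks the category whose modal VOT is
    closest to the stimulus. Real boundaries need not lie at midpoints.
    """
    categories = IMPLEMENTS[language]
    nearest = min(categories, key=lambda c: abs(vot_ms - MODAL_VOT[c]))
    return categories[nearest]


# Sample points along the -60 to +90 ms continuum.
for vot in (-60, 10, 40, 90):
    print(vot, identify(vot, "English"), identify(vot, "Spanish"))
```

Under this assumption, a short-lag stimulus such as 10 ms is labelled /d/ by the English mapping but /t/ by the Spanish mapping, which is the mismatch that Figure 18 depicts.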

In the experiment, subjects were asked to identify the stimuli before imitating

them. The stimuli, which consisted of a 16-member continuum ranging from /da/ to

/ta/, were presented twice on each trial. Results showed that, regardless of the

properties of the acoustic input, children and adults who spoke only Spanish

produced only lead and short-lag VOT responses, which correspond to the phonetic

categories of their L1; they perceived each stimulus on the VOT continuum as a member of one of

their L1 categories (Figure 19). English speakers also tended to stay within the categories

of their L1: they produced stops with only short-lag and long-lag VOT

values (Figure 20). On the other hand, native speakers of Spanish who spoke English

produced stops with VOT values falling into three modal VOT ranges (Figure 21).

They had acquired a new phonetic category that is not in their L1.

Figure 19. The frequency of VOT values produced by the native Spanish subjects. SA = Spanish adults; SC = Spanish children

Figure 20. The frequency of VOT values produced by the native English subjects. EA = English adults; EC = English children

Figure 21. The frequency of VOT values produced by the native Spanish speakers of English. LCB = late childhood bilinguals; ECB = early childhood bilinguals; BC = bilingual children

6. Phonology of L2 Learners

After looking at the phonetics of L2 learners, we will now consider what needs

to be acquired in the domain of phonology. In this section, we review

literature that examines the segmental level, which has to do with phonological segments

(consonants), and the prosodic level, which has to do with syllabification in L2


Eckman & Iverson (1993)

Even when the L1 has no clusters, some clusters are easier to acquire than

others. For example, [pl] is easier for L2 learners to acquire than [fl]. To explain this

phenomenon, Broselow & Finer (1991) proposed that a Minimal Sonority Distance

(MSD) parameter can predict the acquisition of L2 consonant

clusters in syllable onsets. The basis for the markedness of the clusters in Broselow &

Finer’s (1991) study is the Sonority Index shown in (7) and the proposed MSD

(7) Sonority Index
    Class        Scale
    Stops        1
    Fricatives   2
    Nasals       3
    Liquids      4
    Glides       5

The function of the MSD parameter is to provide a characterization of

consonant clusters allowed in a language. Languages can be constrained by the minimal

difference allowed in syllable onsets on the Sonority Index. Other things being equal,

languages that require a greater difference in sonority between adjacent segments will

have fewer kinds of consonant clusters in the onset. For example, a stop-liquid cluster [pr]

would be less marked than a stop-fricative cluster [ps]. But Eckman & Iverson (1993)

argued that it is typological markedness rather than sonority distance which better explains

L2 learners’ knowledge of English clusters in syllable onsets. They suggested the

Sequential Markedness Principle as a better explanation: “For any two segments A

and B and any given context X_Y, if A is less marked than B, then XAY is less

marked than XBY.” On this assumption, since [p] is less marked than [f], [pr]

clusters are less marked than [fr] clusters and are predicted to cause less IL difficulty

than do [fr] clusters.
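As a rough illustration of how the MSD parameter works, the sketch below computes the sonority distance of a two-segment onset from the Sonority Index in (7). The segment-to-class lookup is a hypothetical table covering only the examples discussed here, and the function names are my own rather than Broselow & Finer's.

```python
# Sonority Index from (7).
SONORITY = {"stop": 1, "fricative": 2, "nasal": 3, "liquid": 4, "glide": 5}

# Hypothetical lookup covering only the segments used as examples in the text.
SEGMENT_CLASS = {
    "p": "stop", "b": "stop",
    "f": "fricative", "s": "fricative",
    "l": "liquid", "r": "liquid",
}


def sonority_distance(cluster):
    """Sonority rise from the first to the second segment of a CC onset."""
    first, second = cluster
    return SONORITY[SEGMENT_CLASS[second]] - SONORITY[SEGMENT_CLASS[first]]


def allowed(cluster, msd):
    """True if the onset meets a language's Minimal Sonority Distance setting."""
    return sonority_distance(cluster) >= msd


for c in ("pr", "pl", "fl", "ps", "br"):
    print(c, sonority_distance(c))
```

With these values, [pr] and [pl] have a distance of 3, [fl] a distance of 2 and [ps] a distance of 1, so a language with an MSD setting of 2 would admit [fl] but exclude [ps]. Note that [pr] and [br] receive the same distance, so sonority distance alone cannot separate them; distinctions of that kind are what the markedness-based account addresses.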

Eckman & Iverson (1993) conducted an experiment with 11 subjects: 4 Japanese, 4

Korean, and 3 Cantonese speakers. They studied the production of English onset

consonant clusters (CCV). A subject was said to have an

onset in his or her IL if the subject produced that onset cluster at least 80% of the

time on at least 4 attempts. The data were collected in 8 casual conversations

of 5 to 10 minutes each. No attempt was made to control the vocabulary used by the

subject. They claimed that a less marked cluster would be present whenever one or

more of the more marked clusters is also present. Fifty-five potential tests of their claim (five

sets of onsets per subject × 11 subjects) were collected. Out of the 55 potential tests,

the data allow 50 to be tested (91%). Five of the potential tests yield no result

because the subject did not produce at least four tokens of the relevant clusters. Four

instances out of these 50 appeared to go against what typological markedness would

predict. In 92% of the cases, the subject’s performance obeyed the markedness
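The acquisition criterion and the tallies reported above can be verified with a small sketch. The function name and its return convention (None when there are too few attempts to test) are assumptions made for illustration; the thresholds and counts are those given in the text.

```python
def has_onset(correct, attempts, min_attempts=4, min_percent=80):
    """Apply the 80%-on-at-least-4-attempts criterion.

    Returns True or False when the cluster can be tested, and None when
    the subject produced too few tokens for a test to be possible.
    """
    if attempts < min_attempts:
        return None
    # Integer arithmetic avoids floating-point rounding at exactly 80%.
    return correct * 100 >= min_percent * attempts


# Tallies reported in the text.
potential_tests = 5 * 11            # five onset sets per subject, 11 subjects
testable = potential_tests - 5      # five tests had too few tokens
violations = 4
consistent = testable - violations

print(potential_tests, testable, consistent)
print(round(100 * testable / potential_tests))   # share of tests that were usable
print(round(100 * consistent / testable))        # share consistent with markedness
```

The two percentages printed at the end correspond to the 91% and 92% figures cited in the paragraph.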


Of the four cases which ostensibly violated what typological markedness would

predict, two were from Cantonese-speaking subjects who had acquired the two

clusters [br] and [fr] but not [pr]. Since [p] is less marked than [b] and [f], we would

expect that [pr] would also be less marked than [br] and [fr]. Analysis of the actual

errors from these two subjects showed that both of them substituted [ph] onsets for

the intended [pr] onsets. To explain this, Eckman & Iverson (1993) assumed

that, on the basis of similarities in VOT, the two subjects were associating their NL /p/

with the TL /b/, and their NL /ph/ with the TL /p/.

(8) Mapping of the NL obstruents onto the TL.

/p/ → /b/ Short-lag VOTs.
/ph/ → /p/ Long-lag VOTs.

With this assumption, the subjects’ production would agree with markedness

prediction because aspirated stops are typologically more marked relative to

unaspirated stops. Hence, the [ph]-liquid onset is more marked than the [p]-liquid onset

and the [f]-liquid onset. Eckman & Iverson’s explanation raises the question of

whether Cantonese speakers might have this kind of mapping.

Edge (1991)

Edge (1991) is a replication and extension of Eckman’s (1981) study on the

production of English word-final voiced obstruents by native speakers of Japanese

and Cantonese. In Edge’s (1991) study, data from native speakers of English were

included to account for native devoicing and epenthesis. This was done to avoid

classifying native-like articulation as evidence of IL rules, since devoicing, vowel

epenthesis, and the deletion of final voiced obstruents all characterize spoken English.

Seven Japanese, seven Cantonese and four native speakers of English were subjects of this study.

The tasks in this study included (1) a picture-elicited storytelling task which

contained words with voiced obstruents, (2) an oral reading of a short story and (3) an

oral reading of 41 randomly ordered words. The voiced obstruents were classified in

the data as target, deletion, glottal stop substitution, devoicing, epenthesis,

fricativization or other consonant substitution. In Eckman’s model, while the

surface phonetic forms are influenced by language-specific processes, the underlying

processes, such as terminal devoicing, are universal. Edge’s data from the Cantonese

speakers provide evidence for an IL rule of terminal devoicing and support

Eckman’s hypothesis. For the Cantonese subjects, 67% of the non-target variants

were devoiced and deletion appeared to be more frequent in connected speech. When

compared to deletion in the native English subjects’ data, deletion by the Cantonese

subjects is quite different in its distribution. While deletion of /v/ in function words

(fond of playing) rarely occurred, deletion of final /g/, as in dog, and of /d/ after a

diphthong, as in beside, occurred across phonetic environments. The results of

this experiment indicate that across the three tasks, devoicing is the strategy that was

most frequently used by Cantonese speakers. It is also important to take into account

native speech in formulating rules for a language learner’s IL production. After we’ve


Cichocki, et al. (1999)

Cichocki, et al. (1999) studied the acquisition of French consonants by native

speakers of Cantonese in onset and coda positions. The two consonant inventories

differ in several ways. French allows more consonants in both onset and coda

position. The number of consonants differs greatly between the two languages in coda
positions since Cantonese only allows unreleased stops /p, t, k/ and nasals in the

coda. Cantonese does not have the voicing contrast found in French stops but

does have an aspiration contrast that is implemented as voiceless unaspirated and

voiceless aspirated.

There were 6 subjects in this study and their level of proficiency in French

was at the upper beginner and lower intermediate levels. The subjects were asked to

read a passage in the first task. For the second task, subjects were given an English and

Cantonese translation of the items and were asked to give the French equivalent. The

37 words were expected to be well known. Only five words were unknown to some of

the subjects, and in only three cases were the target words read and repeated after the


In judging whether a response was acceptable or unacceptable, they followed

principles such as judging the response as acceptable when it was or contained a

merely sub-phonemic inaccuracy even though it contained a wrong nucleus, e.g. [ph]

was treated as acceptable for initial /p/. They also judged a response as acceptable when it ended

in a nonnuclear element agreeable with the target phoneme even though it contained a
wrong nucleus, e.g. [sz] for /s/ and [sz] for /z/. Finally, they also judged a response as acceptable

when it contained an allophone of the target but ended in a wrong phonetype,
e.g. [p] for /p/.

As we can see from the table below (Figure 17), focusing on the results for

stops, Cantonese speakers had greater problems in producing French initial voiceless

stops /p, t, k/ (accuracy around 50%) even though their native language has the

equivalent phone types. They made errors by producing the stops with prevoicing

and sometimes with a schwa-like vowel inserted after the consonant. In learning to

produce onset /p, t, k/, about 40% of their productions were voiced [b, d, g], 35% were

voiceless aspirated [ph, th, kh], and only about 20% were voiceless unaspirated [p, t, k].

This contradicts the MDH because these French stops have Cantonese counterparts

and one might expect them to be easily learned. In coda position, the result of this

experiment is as expected: Cantonese speakers have more difficulty with voiced stops

than with voiceless stops. The voiced stops are nearly always devoiced in final position.

As Figure 18 shows, of all the errors made in the production of stops, 95%

involved the presence or absence of the voice feature.

Figure 17. Cichocki, et al. (1999)

To account for the difficulties with French onset stops in Cantonese speakers’

production, Cichocki, et al. (1999) suggested that we could look at the patterns of

difficulty found in first language acquisition, which show that voiceless initial stops

are more difficult than voiced initial stops (Ingram, 1978). Cichocki, et al. also

claimed that one of the problems in this study is that all the subjects were learning

French as a second foreign language. This is because English is taught in all Hong Kong

schools and is the medium of instruction in many. The possibility of interference from

English cannot be neglected when we look at the data obtained in this study. My

prediction is that English speakers would not have this trouble because English

speakers have the voicing contrast in their L1. Cantonese speakers may have difficulties

contrasting voiced stops and voiceless unaspirated stops.

Figure 18. Cichocki, et al. (1999)

7. Discussion

Based on a comparison and contrast of the major differences between the

English and Cantonese phonological systems in this article, we have examined some

difficulties that Cantonese speakers may have when learning English pronunciation. It

is argued that most of the Cantonese ESL learners’ difficulties with English

pronunciation may be accounted for by reference to fundamental differences between

the phoneme inventories of the two languages, the characteristics and distribution of

the phonemes and the permissible syllable structures of the two languages in question.

In this section, we are going to look at differences between the acquisition of stops in

onset and coda position, and at how different repair strategies are used under different


Onset vs. Coda

In the data from Cantonese speakers of English collected by Eckman (1981),

Cantonese speakers exhibit a voice contrast in word-initial, -medial and -final position.

However, devoicing occurred in some voiced stops in coda position but not in onset and

word-medial position. Although voiced stops are absent in the L1 phonology,

Cantonese speakers seem to have no difficulty with onset voiced stops. Since coda is a

more marked position than onset, we would expect that people would have more

difficulties in coda position. Similar to the English and Spanish speakers in Flege &

Eefting’s (1988) study, Cantonese speakers judge tokens of [p, t, k] in their L1 and the

tokens of [b, d, g] to be realizations of the same phonetic categories in the coda

position even though they can detect auditorily the acoustic differences between

corresponding L1 and L2 stops.

In the study by Cichocki, et al. (1999), Cantonese speakers had fewer problems in

the production of onset voiced stops in the acquisition of French. This happened

only in the onset and not in the coda position. Since voiced

stops are more marked than voiceless stops, this is not what we would expect from the

prediction of the MDH. Comparable to the results in Eckman (1981), subjects in this study

also showed that they had more difficulties with coda voiced stops. Apart from the fact

that voiceless initial stops are more difficult than voiced initial stops in L1

acquisition studies, the reason why French voiceless onsets are difficult for

Cantonese speakers to acquire may also be due to the perception of the voicing contrast.

Cantonese subjects may have a wrong realization in time of the phonological units

(phonemes) that distinguish words. Voiced stops in French are easier for

Cantonese speakers to distinguish; as Flege (1987) stated, all other things being equal, we actually

learn L2 sounds which are dissimilar to the sounds in our L1 more easily than their

less dissimilar counterparts.

Repair strategies

In terms of the kind of repair strategies that Cantonese speakers will choose in

the acquisition of English voiced stops, we need to look at proficiency, formality and

the grammatical and functional aspects of the speech. In Abrahamsson’s (2003)

study, the data show that coda deletion is low in the initial phase of development; it

increases during the early phase and decreases during later phases. The

proportion of epenthesis to deletion increases over time, which means that the use

of epenthesis is relatively low at the early stage and increases later on in L2

development. The error rate increases because fluency also increases

considerably with higher L2 proficiency. Fluent speech is characterized by more focus

on content and less focus on form, and so the increase in deletion and epenthesis

would be found in the early phase of L2 development. Another factor that affects an

individual L2 learner’s use of epenthesis versus deletion is the need to

avoid ambiguity and facilitate recoverability. As suggested by Lin (2001), it

appears that the ratio of epenthesis to deletion in consonant clusters correlates

positively with the formality of the speech task, such that epenthesis is

frequently employed in formal tasks (e.g., word-list or minimal-pair reading) but less

frequently in less formal tasks (e.g., sentence, text, and story reading or natural

conversation), where deletion is the dominant simplification strategy. Other than that,

one aspect of recoverability from the context is whether the coda is a crucial part of a

noninflected lexical form or whether it is part of an inflectional morpheme. It can be

argued that the reduction of lexical forms generally increases lexical ambiguity, and this

might particularly be the case for content words. In contrast, the information

expressed by inflectional morphemes is usually redundantly expressed by other

formal markers or otherwise predictable from the context, and it might be argued that

inflectional information is more easily recoverable from the context than the

underlying form of a reduced lexical stem. It is more likely that word-final codas that

are part of a lexical stem will be pronounced correctly than word-final codas

that are part of an inflectional morpheme.

References


Abrahamsson, N. (2003). Development and recoverability of L2 codas: A longitudinal study of Chinese/Swedish interphonology. Studies in Second Language Acquisition, 25(3), 313–349.

Abramson, A. & Lisker, L. (1970). Discriminability along the voicing continuum: Cross-language tests. In Hala, B., Romportl, M. & Janota, P. (Eds.), Proceedings of the Sixth International Congress of Phonetic Sciences. Prague: Academia, 569–573.

Blumstein, S., Cooper, W., Goodglass, H., Statlender, S. & Gottlieb, J. (1980). Production deficits in aphasia: A voice-onset time analysis. Brain and Language, 9, 153–170.

Chan, A.Y.M. & Li, D.C.S. (2000). English and Cantonese phonology in contrast: Explaining Cantonese ESL learners’ English pronunciation problems. Language, Culture and Curriculum, 13, 67–85.

Cichocki, W., House, A.B., Kinloch, A.M. & Lister, A.C. (1999). Cantonese speakers and the acquisition of French consonants. Language Learning, 49, 95-

Curtin, S., Goad, H. & Pater, J. (1998). Phonological transfer and levels of representation: The perceptual acquisition of Thai voice and aspiration by English and French. Second Language Research, 14(4), 389–405.

Eckman, F. (1981). On predicting phonological difficulty in second language acquisition. Studies in Second Language Acquisition, 4, 18–30.

Eckman, F. & Iverson, G. (1993). Sonority and markedness among onset clusters in the interlanguage of ESL learners. Second Language Research, 9(3), 234–252.

Eckman, F. & Iverson, G. (1994). Pronunciation difficulties in ESL: Coda consonants in English interlanguage. In M. Yavas (Ed.), First and Second Language Phonology. San Diego: Singular Publishing Company, 251–265.

Edge, B.A. (1991). The production of word-final voiced obstruents in English by L1 speakers of Japanese and Cantonese. Studies in Second Language Acquisition, 13, 377–393.

Eimas, P.D., Siqueland, E.R., Jusczyk, P.W. & Vigorito, J. (1971). Speech perception in infants. Science, 171, 303–306.

Ethnologue. Website of the Summer Institute of Linguistics. (Jan, 2004)

Flege, J. (1987). The production of "new" and "similar" phones in a foreign language: Evidence for the effect of equivalence classification. Journal of Phonetics, 15, 47-

Flege, J.E. & Eefting, W. (1988). Imitation of a VOT continuum by native speakers of English and Spanish: Evidence for phonetic category formation. Journal of the Acoustical Society of America, 83, 729–740.

Hansen, J. (2001). Linguistic constraints on the acquisition of English syllable codas by native speakers of Mandarin Chinese. Applied Linguistics, 22, 338–365.

Kess, J.F. (1992). Psycholinguistics: Psychology, Linguistics, and the Study of Natural Language. Amsterdam: John Benjamins Publishers BV.

Lado, R. (1957). Linguistics across cultures. Ann Arbor: University of Michigan Press.

Lin, Y.H. (2001). Syllable simplification strategies: A stylistic perspective. Language Learning, 51(4), 681–718.

Morton, K. (1995). Kate Morton's Image Resource.

O'Grady, W., Dobrovolsky, M. & Aronoff, M. (1989). Contemporary Linguistics. New York: St. Martin's Press.

Pisoni, D. & Tash, J. (1974). Reaction times to comparisons within and across phonetic categories. Perception and Psychophysics, 15(2), 285–290.

Pisoni, D., Aslin, R., Perey, A. & Hennessy, B. (1982). Some effects of laboratory training on identification and discrimination of voicing contrasts in stop consonants. Journal of Experimental Psychology: Human Perception and Performance, 8, 297–314.

Radwanska-Williams, J. & Yam, J.P.S. (2001). The acquisition of English plosives by Chinese learners. In Phonetics Teaching & Learning Conference 2001.

Russell, K. (1997). Narrower transcriptions of English: Aspiration (and Voice Onset

Schmidt, R. (1987). Sociolinguistic variation and language transfer in phonology. In G. Ioup & S.H. Weinberger (Eds.), Interlanguage phonology, 365–377. Rowley, MA: Newbury House Publishers.

Strange, W. (1972). The effects of training on the perception of synthetic speech sounds: Voice onset time. Doctoral dissertation, University of Minnesota.

Tsui, I.Y.H. & Ciocca, V. (2000). The perception of aspiration and place of articulation of Cantonese initial stops by normal and sensorineural hearing-impaired listeners. The International Journal of Language and Communication Disorders, 35, 507–525.

Wertz, R.R. (2003). Geographical Database: Map of Guangdong Province