
Kjellin (2017): Phonetic Details of a Foreign Accent.

Most recent typo corrections 2020-01-21 at 16:01

Phonetic Details of a Foreign Accent: A tutorial for the ambitious

by Olle Kjellin (as yet unpublished)
Based on a very illustrative figure in a very instructive article¹, I'm going to show how to read
spectrograms and identify a small but crucial detail that makes a huge difference in the "foreign
accentedness" of the speech of learners of English, a detail that they absolutely should be made aware
of. After reading this tutorial, any interested teacher or learner can make their own spectrograms² and
reveal what has to be done to make a difference, provided, of course, that the learner does strive for a
listener-friendly pronunciation, and that the teacher wants to give the learner a chance.

First, study the following four figures carefully. Never mind the physics or technicalities, just look
at them as pieces of art. They show the acoustic/artistic characteristics of the words ice (upper row) and
eyes (lower row) as produced by an American speaker (left column) and a Brazilian speaker (right
column; but it could typically have been by almost any other "foreigner"). How are they different?
What seems to be the main feature for the Native Speaker (NS) to distinguish the two words? The NS
will have no problems at all with NS speech, even in a noisy environment. But on hearing the speech of
a Non-Native Speaker (NNS) the NS will have to put in much more effort to be sure to perceive
correctly ─ even in a noise-free environment. Why?

[Figure: ice (upper row) and eyes (lower row), spoken by an American speaker (left column) and a
Brazilian speaker (right column)]

1 Adapted from fig. 5 in: de Castro Gomes, M. L. (2013). Understanding the Brazilian way of speaking English. In
Levis, J. & LeVelle, K. (Eds.). Proceedings of the 4th Pronunciation in Second Language Learning and Teaching
Conference, Aug. 2012. (pp. 279-289). Ames, IA: Iowa State University. Available from
2 Use the fantastic Praat software, freely downloadable from

1. Oscillogram and spectrogram

[Figure: ice, American speaker. Legend: | a | ɪ | s |]
This is the "ice" spectrogram by the American speaker. Well, actually only the lower pane shows the
spectrogram. The trace in the upper pane is called an oscillogram. It shows, graphically, the actual
sound waves. The spectrogram below it shows the result of ingenious analyses of the sound waves,
analyses not much different from what our ears and brains indeed do all the time! It shows what we
perceive and, hopefully, understand of the speech.
The word ice in IPA³ script is [aɪs]. It contains an "ah" followed by an "ih" and an "s". The horizontal
axis of the graphs is time.
The shaded half in the left part shows what vowels look like, in this case the sequence of [a] and [ɪ].
Notice how high (wide) the oscillogram is in the shaded area, especially in the [a] part. The
excursions of the oscillogram curve up and down from the zero line show how loud the sound is. The size
of the spikes is called the amplitude. The amplitude is usually measured in dB (decibels). It is closely
related to the intensity of the sound, which is heard as loudness. Roughly, the more open the mouth, the
larger the amplitude and the louder the sound.
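To make the decibel idea concrete, here is a minimal sketch of how a level difference in dB can be computed from two amplitudes. The two test tones, the sample rate and the function names are my own assumptions for illustration; they are not taken from the recordings in the figures.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_db(amplitude, reference=1.0):
    """Level of `amplitude` relative to `reference`, in decibels."""
    return 20.0 * math.log10(amplitude / reference)

# Two made-up tones (10 ms at an assumed 16 kHz sample rate):
# a "loud" one, and a "soft" one with a tenth of its amplitude.
SAMPLE_RATE = 16000
loud = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(160)]
soft = [0.05 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(160)]

difference_db = level_db(rms(loud), reference=rms(soft))
print(f"The loud tone is {difference_db:.1f} dB above the soft one")
```

A tenfold difference in amplitude corresponds to 20 dB, which is why the dB scale is handy for comparing a strong vowel with a much weaker consonant.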
[Figure: ice (again)]
We'll stay a moment on the same picture. A phonetician can tell roughly what's what in it. Notice that
the amplitude of the [s] part is smaller than that of the [aɪ] part. Listen to yourself when you say
them: You will find that the vowels are much louder than the [s], but the latter has more of a "hiss"
character. The hiss sound is called a fricative. It is dominated by higher frequencies, and that is what
shows up as a "denser" (blacker) area higher up in the spectrogram.
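For the curious, the idea behind a spectrogram can be sketched in a few lines: cut the sound into short frames and ask, for each frame, how much energy there is at each frequency. The toy below is only a sketch under my own assumptions (an 8 kHz sample rate, 10 ms frames, and pure tones standing in for a vowel and a hiss; a real [s] is noise, not a tone). It is not how Praat does it internally.

```python
import math

SAMPLE_RATE = 8000   # samples per second (assumed)
FRAME = 80           # 10 ms frames, so analysis bins are 100 Hz apart

def dft_magnitudes(frame):
    """Magnitude of each DFT bin from 0 Hz up to half the sample rate."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def spectrogram(samples):
    """One column of magnitudes per non-overlapping frame."""
    return [dft_magnitudes(samples[s:s + FRAME])
            for s in range(0, len(samples) - FRAME + 1, FRAME)]

def tone(freq, n_samples, start=0):
    """A pure sine tone; `start` keeps the phase continuous across segments."""
    return [math.sin(2 * math.pi * freq * (start + i) / SAMPLE_RATE)
            for i in range(n_samples)]

# 40 ms of a low "vowel-like" tone followed by 40 ms of a high "hiss-like" tone
signal = tone(200, 320) + tone(3000, 320, start=320)

# For each frame, the frequency (in Hz) of the strongest bin
dominant_hz = [max(range(len(col)), key=col.__getitem__) * SAMPLE_RATE / FRAME
               for col in spectrogram(signal)]
print(dominant_hz)
```

In a real spectrogram every magnitude is drawn as a shade of grey, with time running left to right and frequency bottom to top, so the second half of this toy signal would darken the picture much higher up than the first half.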

Important: Note that the acoustic boundaries between the sounds, roughly indicated by vertical bars in
the legend, are very blurry and inexact in both the oscillogram and the spectrogram. This is because all
speech sounds are heavily interlaced and always influence their neighbours. Sometimes even across
neighbours! So the vertical bars in the legend mean nothing acoustical but are only a help for the
reader to visualize the approximate, relative duration of each phonetic segment.
[Figure legend: | a | ɪ | s |]
There are never, ever any sharp boundaries between naturally spoken speech sounds. You should rather
think of the centres of the sounds. The IPA symbols are placed (quite roughly) at their respective
acoustic centres in this picture. Even that is an approximation. The influence on and by neighbouring
sounds is called coarticulation. We will return to that later.

3 Read about the International Phonetic Alphabet here:

By convention, IPA symbols are always written within square brackets [...].
Listen to all the sounds here:

2. Voice on and off

[Figure: ice (yet again). Legend: ɪ | s]
This is again from the ice picture by the American speaker. But now only the middle part, cropped and
enlarged.
Notice that in the left part, where there is a vowel, you can see vertical lines in the spectrogram,
like pickets in a fence. Each such vertical line is the result of one beat of the vocal folds, the
glottal pulse, and each line also corresponds to a peak in the oscillogram above. If you were to count
the number of such peaks per second (horizontal axis), you would know the fundamental frequency of the
vowel at that moment. The fundamental frequency is called f0 ("eff zero" or "eff oh") by phoneticians.
It is this part of the acoustics that we hear as the speaker's intonation. Nowadays the computer can do
the counting. But our brains will hear the f0 and its variations instantaneously and react to the
speaker's intonation.
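The peak counting described above can be sketched like this. It is a deliberately naive illustration, assuming a clean artificial wave with exactly one peak per vocal-fold beat; the 120 Hz test tone, the sample rate and the function names are my own assumptions, and real pitch trackers (such as the one in Praat) use more robust methods.

```python
import math

def count_peaks(samples):
    """Count local maxima above zero (one per cycle for a simple wave)."""
    peaks = 0
    for i in range(1, len(samples) - 1):
        if samples[i] > 0 and samples[i - 1] < samples[i] >= samples[i + 1]:
            peaks += 1
    return peaks

def estimate_f0(samples, sample_rate):
    """Peaks per second, i.e. a crude estimate of the fundamental frequency."""
    duration_s = len(samples) / sample_rate
    return count_peaks(samples) / duration_s

# Half a second of an artificial "voice" beating 120 times per second
SAMPLE_RATE = 16000
voice = [math.sin(2 * math.pi * 120 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 2)]

f0 = estimate_f0(voice, SAMPLE_RATE)
print(f"Estimated f0: {f0:.0f} Hz")
```

Real speech has several smaller bumps within each glottal cycle, so simple peak counting would overcount there. The sketch only shows what "counting the peaks per second" means.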
In the right part of the spectrogram you can't see any vertical lines, because the voice is off. There is no
glottal pulse. The vocal folds are open, the air stream passes the vocal folds unobstructed, but is now
instead caught by a narrow slit between the tongue tip and the teeth and gum, where it results in that
typical hissing [s] sound. The spectrogram shows how the hissing sound is built, acoustically.

[Figure: ice (again). Legend: | a | ɪ | s |]
Notice how the amplitude of the oscillogram rapidly decreases as the word is about to go from vowel to
consonant (from shaded to unshaded area), and how this transition actually begins well before the [s]
itself. And even continues a little bit into it. The whole trough in the oscillogram is the region of
articulatory overlap between the [ɪ] and the [s].
The amplitude rises again when the [s] sound comes to life. After a short while it becomes smaller
again, as the speech effort is coming to its end, and in the ensuing breath relaxation phase its
amplitude increases again as the speaker exhales while still articulating the [s] with his/her tongue.
(So it's an enigma, where to choose to indicate the "acoustic centre" of the [s].)

3. Articulation and coarticulation

[Figure: ice (yet again). Legend: ɪ | s]
Look at what happens during the transition from vowel to consonant: The spectrogram reveals that there
is some commencing hiss noise even before the end of the vowel, and there are still a few beats of the
vocal folds even after the onset of the consonant. This belongs to the phenomenon of coarticulation,
mentioned above, and the cut-off point is arbitrary in its fine details. This is because the
articulatory organs responsible for the various components here work quite independently of one another
(disregarding the fact that they are situated in the same individual) and overlap in time: The voice is
produced when the air stream makes the vocal folds in the larynx (voice box) vibrate; the [s] is
produced when the air stream is pressed past the tongue tip; the vowel quality is produced by the
overall shape of the throat, tongue and lips; and the passage to the nose should be efficiently shut off
by the velum (soft palate), unless we want to say m, n or ng.
To accomplish all this we have 146 muscles (not counting the numerous muscles for respiration, posture and gestures)
that have to be exquisitely coordinated in time and space, each by its own specific nerve signals. This is quite a feat
indeed! Particularly considering that there is no 1-to-1 default mapping from nerve signals to speech sounds; the signals
always have to be adapted on the fly depending on many factors such as nerve length and thickness, nearest
preceding and following sounds, current position of head and body, amount of lung inflation, mode of speech (shout,
whisper, sing, beg, demand, etc.), presence of chewing gum, and many more. We can accomplish that thanks to having
practiced our first language so much that we have developed an audio-muscular or audio-motor memory with both
procedural and cognitive elements that makes us veritable experts in acoustic phonetics: We know at every instant how
to adjust to get the intended sound result just like the one ringing in our subconscious ears. Our articulation is
result-governed, ruled by our ears. (If we want to, we can practice our second languages too to almost the same degree.)
[Figure: ice (again). Legend: | a | ɪ | s |]
A normal, average rate of speech contains about 12-14 speech sounds per second distributed in about 2-3
stressed syllables per second. But it may take about a tenth of a second (phoneticians would say about
100 milliseconds, ms) before a muscle succeeds in moving its attached structures after being triggered
by a nerve signal. Small structures, such as the vocal folds, are light and move quickly; big
structures, like the jaw and tongue, are heavy and slow to move. Mid-sized structures like the lips and
the velum... well, you guessed it. So the brain has to dispatch its signals well in advance with very
different lead times for different structures, to get the correct result coordinated at the correct
instant.
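A little arithmetic on the figures just given shows why the signals must be sent ahead of time. This is only a sketch: the function names are mine, and the single 100 ms latency is the rough value from the text, not a measurement for any particular muscle.

```python
def sound_duration_ms(sounds_per_second):
    """Average time available for one speech sound."""
    return 1000.0 / sounds_per_second

def lead_in_sounds(sounds_per_second, latency_ms=100.0):
    """How many speech sounds go by during one muscle latency."""
    return latency_ms / sound_duration_ms(sounds_per_second)

for rate in (12, 14):  # the range of speech rates given in the text
    print(f"{rate} sounds/s: about {sound_duration_ms(rate):.0f} ms per sound, "
          f"so a 100 ms latency spans about {lead_in_sounds(rate):.1f} sounds")
```

Since each sound gets less than 100 ms on average, the command for a given sound has to leave the brain while the previous sound (or even the one before that) is still being spoken.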
To make things even worse, some nerves are faster and shorter, while other nerves are slower and longer. The nerves
to the vocal folds are particularly thin and slow, and particularly long at that, as they happen to make a long detour
down into the chest and up again (the left one even passing below the aortic arch), before reaching into the muscles of
the larynx in the neck. So transitions between voiced and unvoiced sounds are always a hassle. And hence particularly
sensitive to voicing coarticulation. (Think: dogs with voiced [gz], but cats with unvoiced [ts]; bored with voiced [o:d],
but pushed with unvoiced [ʃt]; subject with voiced [bd͡ʒ], but substance with unvoiced [pst]... or is it [bst]? Or
somewhere in between? Think about it! When, in relation to the lip closure for b, do the vocal folds actually stop
vibrating?) The nerves to the soft palate, on the other hand, are particularly short, thick and fast. So it usually happens
that nasality too is readily coarticulated among neighbouring sounds. When coarticulation causes neighbouring sounds
to become more similar to one another in some respects, phoneticians call it assimilation. You now know why and
how assimilation of voicing and nasality may occur, and why it happens so naturally.

4. Formants and vowel spectrum

[Figure: the [aɪ] part of ice. Legend: | a | ɪ |]
Next, let's have a closer look at the [aɪ] part of ice. Notice again that there is no clearly definable
border between the two vowel qualities. There is a continuous glide, and if you listen carefully, you
may hear an [e]-like sound about midway. So, where we choose to place the boundary sign is rather a
matter of taste. Vowels with two main timbres are called diphthongs, and in English a diphthong is
counted as one vowel. Yet, if you look at the black bands in the spectrogram you can indeed see both of
the two main timbres that you also can hear.
The black bands in the spectrogram are called formants. They are called formants because they form the
vowels. The lowest formant is called the first formant, F1. It is higher for [a] than for [ɪ], which you
will readily see in the spectrogram. F1 is higher the more open the mouth is.

The formants are acoustical phenomena that occur thanks to resonances in the speech tube. Take a half-
empty bottle, blow air across its neck, and you will get a tone. That's its resonance tone. Empty the bottle,
blow again, and you will get another tone. Higher or lower? It is lower, because there is more air to agitate.
But it is still the same bottle. The resonance frequency depends on the amount of air, or, more exactly, the
mass of that air, not on the bottle itself. A smaller mass vibrates faster than a bigger mass. Fast vibration =
high frequency, which we will hear as a high tone. And vice versa. If you set up an array of bottles with
carefully chosen masses of air in them, you can play Beethoven or Mozart, or whatever, when you agitate
those resonance frequencies by hitting the bottles with a stick or blowing air across their necks. Because
you have very cunningly planted some mutually suitable formants into them and organised them in an easily
playable array! You are a genius.
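If you want numbers for the blown bottle, a common idealisation is the so-called Helmholtz resonator. That model is an assumption I bring in here for illustration, and the bottle dimensions below are invented. The sketch ignores the usual "end correction" for the neck, so treat the resulting frequencies as rough. What matters is the direction: more air in the bottle gives a lower tone.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def helmholtz_hz(air_volume_m3, neck_area_m2, neck_length_m):
    """Resonance frequency of an idealised bottle (Helmholtz resonator)."""
    return (SPEED_OF_SOUND / (2 * math.pi)) * math.sqrt(
        neck_area_m2 / (air_volume_m3 * neck_length_m)
    )

# A hypothetical 0.75 litre bottle with a neck 2 cm wide and 5 cm long
NECK_AREA = math.pi * 0.01 ** 2   # m^2
NECK_LENGTH = 0.05                # m
BOTTLE_VOLUME = 0.75e-3           # m^3

f_half_empty = helmholtz_hz(BOTTLE_VOLUME / 2, NECK_AREA, NECK_LENGTH)
f_empty = helmholtz_hz(BOTTLE_VOLUME, NECK_AREA, NECK_LENGTH)

print(f"Half-empty bottle: about {f_half_empty:.0f} Hz")
print(f"Empty bottle:      about {f_empty:.0f} Hz (lower, because there is more air)")
```

In this model, doubling the air volume lowers the resonance by a factor of the square root of two, which is a bit less than half an octave.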
The human speech tube too is an ingenious apparatus
with a number of such "bottles" coupled in series, and
they have soft walls so you can continuously vary their
volumes (read: vary the masses of air contained in them)
on the fly, in order to get the mutually suitable formants
of your choice at any instant. You agitate those air
masses by literally blowing across their common neck,
which happens to be the larynx in your own neck. The
result is several resonance sounds at the same time.
Several variable resonance sounds at the same time.
They are the vowel formants, and thanks to smart
technology they show up as black bands in various
combinations in the spectrogram. Each particular
combination is unique for each particular vowel. Each
vowel spectrum is the unique key to the unique keyhole
in your ear. If it fits, you will perceive the vowel.

[Figure legend: | a | ɪ |]
Roughly speaking, you have one such "bottle" in your pharynx (the cavity behind your tongue), another bottle in
the oral cavity above your tongue, another bottle between your lips if you round them forward as for an oo, yet
another bottle under your tongue blade if you were to raise it, such as for an American vocalic r sound as in her,
and yet another bottle in your nasal cavity. Again roughly speaking, your pharyngeal bottle is the source for the
first formant, F1, which is also the lowest formant; and the oral cavity is the source for the second formant, F2,
the next lowest formant. The other cavities make the higher formants. The nasal formant is special, because this
cavity is coupled in parallel instead of in series. F1 and F2 are the main components for vowel keyholes.
F1 indicates mouth opening and is highest in open vowels like [a], and lowest in closed vowels like [i, u].
F2 mainly responds to tongue position and is highest in front vowels like [i], and lowest in back vowels like [u].
This is not the whole truth about speech acoustics, but it is a good-enough approximation.
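The "key and keyhole" idea can be sketched as a tiny lookup: given a measured F1 and F2, pick the vowel whose typical formants are nearest. The target values below are rough, illustrative assumptions of my own (they vary a lot between speakers), and a plain distance in Hz is a simplification of what the ear really does.

```python
import math

# Illustrative (F1, F2) targets in Hz; assumed round numbers, not measurements.
VOWEL_TARGETS = {
    "i": (300, 2300),  # closed, front: low F1, high F2
    "u": (300, 800),   # closed, back: low F1, low F2
    "a": (750, 1200),  # open: high F1, with F2 fairly close to F1
}

def nearest_vowel(f1_hz, f2_hz):
    """Return the vowel whose (F1, F2) target lies closest to the measurement."""
    return min(VOWEL_TARGETS,
               key=lambda v: math.dist((f1_hz, f2_hz), VOWEL_TARGETS[v]))

print(nearest_vowel(700, 1100))   # wide open mouth
print(nearest_vowel(320, 2100))   # nearly closed mouth, tongue forward
print(nearest_vowel(350, 900))    # nearly closed mouth, tongue back
```

With these assumed targets the three measurements come out as a, i and u respectively, following the rules of thumb above: F1 tracks mouth opening, F2 tracks tongue position.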

5. Manipulate your formants

You can increase the total mass of air by lengthening your speech tube. You actually do it every day,
regularly, subconsciously and automatically by rounding your lips and lowering your larynx to say, for
instance, moo.
And similarly, you can decrease the mass of air by shortening your tube to say mee. You simply spread
your lips and raise your larynx. Listen to your moo and your mee and feel them in your mouth. Can you
hear and confirm that moo somehow sounds and feels lower than mee?
[Figure legend: | a | ɪ |]
Now, try and sing moo on a melody! And then mee. They will still be moo and mee, because the
singing voice and the formants are (almost) independent of each other. The pitch of the singing (and
speaking) voice is due to the fundamental frequency of the vocal fold vibrations, the f0 as mentioned in
section 2 above. But the vowel formants are mainly due to the mass of air in the "bottles" above your
vocal folds. When you speak, sing or whisper, those air masses will be agitated, and by that the
formant spectrum will become audible, and so we can hear your vowel!

Opera singers have to practise very hard to be able to precisely adjust their loudest formants separately
depending on each note they sing, so that no formant note will conflict with the song note (f0) and ruin
the aria. In so-called throat-singing, such as for example khöömei of the Mongolian⁴ and Tuvan⁵
tradition, they actually sing with the formants instead, while keeping the f0 steady most of the time, as a
drone. Listen and enjoy!
We looked at the F1, the lowest black band in the spectrogram. F2 is the next lowest formant. It's a bit
trickier to follow in this case, but notice the band just above F1 in the early [a] part, suddenly
weakening and stepping upwards to end far separated from the F1 towards the end of the [ɪ] part. Those
are the main features of their "keys": F1 and F2 are close together in [a], as close as possible and a
little bit up from the bottom. And they are widely separated in [ɪ], as widely as possible, F1 low and
F2 high. These are the acoustic results of quite wide open jaws in [a] moving towards a little smile in
[ɪ], and the tongue moving from back to front.
[Figure legend: | a | ɪ |]
In the case of moo, the F1 and F2 are again very close to each other, but as far down towards the
bottom as they can go. When you perceive some "low" tone in your moo and some "high" tone in mee, it
is in fact the second formant, F2, that you hear. If you whisper the vowels, the F2 will dominate the
sound even more. Listen! Whispering is a very handy and helpful trick for all language teachers in the
world! Stay alert for more about this later.


6. The difference between eyes and ice

[Figure: eyes. Legend: | a | ɪ | z |]
Now we turn to the other word, eyes. In IPA it is [aɪz]. The [z] symbolises that kind of buzzing variety
of s that you hear at the end of the word. Not all languages have this sound, and for those speakers it
presents some difficulty. In phonetic parlance it is called a voiced fricative. You may see some
vertical lines (you remember, the lines that correspond to the pulses from the vocal folds) far into the
consonant, at least farther than into the [s], above. But actually not all the way through! Something
other than that will have to signal to the listener that this sound is to be regarded as a "voiced"
fricative.

[Figure: eyes (above) and ice (below), native speaker. Legend: | a | ɪ | s |]
Now it's time to compare the two words as spoken by the native speaker. Eyes above, and ice below. What
differences can you see?
Well, there are several big differences, mainly along the horizontal axis, time or duration. But other
factors differ, too.
1. The whole diphthong [aɪ] is clearly longer in eyes than in ice.
2. Particularly the [a] part is much longer in eyes.
3. Also the [ɪ] part is longer in eyes.
4. The [ɪ] part is much weaker in eyes.
5. The fricative consonant [z] is appreciably shorter in eyes than the [s] in ice.
6. The fricative consonant [z] in eyes is appreciably weaker than the [s] in ice. As noted above, this
"voiced" consonant [z] doesn't really have much voicing in it. The main factor seems to be its lower
amplitude (and shorter duration). So instead of "voiced-unvoiced", phoneticians often refer to them as
lenis-fortis, respectively.
The phenomenon that the vowel is (at least somewhat) longer before a "voiced" (lenis) consonant than before an
"unvoiced" (fortis) one is found in most other languages, too. But in English it turns out to be particularly
salient and one of the more important details to learn, for those who want to speak as natural English as possible. And
to teach, if you are a teacher who wants to give your students the best possible chances. Listen to native pronunciation
of bad-bat, bade-bait, dog-dock, laid-late, log-lock, lab-lap, need-neat, hob-hop, hobe-hope, etc., and confirm the
facts. Please note, we are not talking about "long" or "short" vowels as such here. Those terms apply to pairs like
beat-bit, bead-bid, sheep-ship, etc. But in our examples above both members of each pair contain the same-length vowel,
either a "long" or a "short" vowel as in the textbooks. But there still is a length distinction within each pair, triggered
by this "lenis-fortis" distinction.
Unfortunately, this characteristic seems seldom to be taught, or even mentioned, in ESL or EFL classes. Obviously
the Brazilian speaker in this example had not learnt it. Go back to page 1 and compare the Brazilian speaker's ice and eyes.
They look very similar, and most probably it was quite difficult for a native listener to perceive the difference. But I
emphasize: These duration distinctions may even be THE most important cues for the NS listener to parse and
perceive "voiced" (bdgz) vs. "unvoiced" (ptks) consonants in natural speech. And they work perfectly fine in a noisy
environment, too, because the slight voicing difference in [s] vs. [z] is of subordinate importance and may well drown
in the noise without deleterious consequences. The timing does the trick.
This author did not learn it, either, so I'm embarrassed to admit that spectrograms of my own pronunciation probably
would look just like the Brazilian ones. Typically Swedish broken English. I just don't dare to look.
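To see how timing alone could carry the ice/eyes distinction, here is a toy "listener" that looks only at durations and ignores voicing altogether. All numbers are hypothetical: the millisecond values are invented, not measured from the figures, and the 0.6 threshold is an arbitrary assumption chosen for the illustration.

```python
def vowel_share(vowel_ms, fricative_ms):
    """Fraction of the vowel-plus-fricative stretch taken up by the vowel."""
    return vowel_ms / (vowel_ms + fricative_ms)

def guess_final_fricative(vowel_ms, fricative_ms, threshold=0.6):
    """Guess lenis vs. fortis from timing only (threshold is an assumption)."""
    if vowel_share(vowel_ms, fricative_ms) > threshold:
        return "lenis [z]"
    return "fortis [s]"

# Invented durations in milliseconds, for illustration only
examples = {
    "native-like eyes (long vowel, short fricative)": (300, 120),
    "native-like ice (short vowel, long fricative)": (180, 220),
    "learner's eyes with ice-like timing": (200, 200),
}

for label, (vowel_ms, fricative_ms) in examples.items():
    print(f"{label}: {guess_final_fricative(vowel_ms, fricative_ms)}")
```

With these made-up values the third example comes out as fortis, i.e. it would be taken for ice no matter how much voicing the speaker puts into the final consonant. That is the trap described above.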

7. More wonderful tricks with the formants

This section will be postponed until later. Stay tuned!

SILBO GOMERO, EXPLICADO POR SILBADORES - Whistled language of the island of La Gomera, Canary Islands

El Silbo Gomero. La Gomera, Islas Canarias

Whistled language of the island of La Gomera (Canary Islands), the Silbo Gomero

Coming up in the future:

