

ssr: Speaker & Speech Recognition

Speech production & phonetics

Dr Philip Jackson
lecturer in speech & audio

Centre for Vision, Speech & Signal Processing,

Department of Electronic Engineering.
Speech production
The human vocal tract
Speech production 1
• Air pressure from lungs builds up
behind closed vocal cords
• Vocal cords are repeatedly forced apart
and pulled together again, producing a
series of small pulses of air
• Air in vocal tract vibrates quasi-periodically
• Rate of vibration of the vocal folds
determines fundamental frequency, f0,
which contributes to the perceived pitch
of the voice
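A minimal sketch of how the pulse rate of the vocal folds relates to f0. It assumes an idealised pulse train as the glottal source and a plain autocorrelation peak-pick as the estimator. Neither comes from the lecture, and the function names and the 50 to 400 Hz search range are illustrative choices.

```python
def pulse_train(f0_hz, sample_rate_hz, n_samples):
    """Idealised glottal source: one unit pulse per vocal-fold cycle."""
    period = round(sample_rate_hz / f0_hz)  # samples per cycle
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def estimate_f0(signal, sample_rate_hz, f_min_hz=50.0, f_max_hz=400.0):
    """Estimate f0 as the lag with the largest autocorrelation."""
    min_lag = int(sample_rate_hz / f_max_hz)  # shortest period searched
    max_lag = int(sample_rate_hz / f_min_hz)  # longest period searched

    def autocorr(lag):
        return sum(signal[n] * signal[n + lag]
                   for n in range(len(signal) - lag))

    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return sample_rate_hz / best_lag


# 125 Hz pulses sampled at 8 kHz give a period of 64 samples.
source = pulse_train(125.0, 8000.0, 1024)
f0_estimate = estimate_f0(source, 8000.0)  # 125.0
```

Real speech is only quasi-periodic, so a practical estimator would work on short frames and cope with missing or doubled peaks. This sketch only shows the link between period and f0.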
Speech production 2
• The vocal tract forms a resonator with a
complex shape, which admits some
harmonics of the fundamental, while
suppressing others
• A region of high-amplitude harmonics is
called a formant
• The first two formants, F1 and F2, are
important for vowel discrimination
• Speech sounds are classified according
to the way the sound is generated and
the position of the vocal apparatus
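A rough sketch of the idea that the vocal tract passes some harmonics of the fundamental and suppresses others. The resonance frequencies (300 Hz and 2300 Hz), the Q value and the choice to sum two simple second-order resonance curves are all assumptions made for illustration; they are not values from the lecture.

```python
import math


def resonator_gain(freq_hz, centre_hz, q=10.0):
    """Magnitude response of a simple second-order resonance (assumed model)."""
    r = freq_hz / centre_hz
    return 1.0 / math.sqrt((1.0 - r * r) ** 2 + (r / q) ** 2)


def harmonic_amplitudes(f0_hz, resonances_hz, max_hz=4000.0):
    """Amplitude of each harmonic of f0 after shaping by the resonances."""
    amps = {}
    k = 1
    while k * f0_hz <= max_hz:
        freq = k * f0_hz
        amps[freq] = sum(resonator_gain(freq, c) for c in resonances_hz)
        k += 1
    return amps


def formant_peaks(amps):
    """Harmonics that are stronger than both of their neighbours."""
    freqs = sorted(amps)
    return [f for lo, f, hi in zip(freqs, freqs[1:], freqs[2:])
            if amps[f] > amps[lo] and amps[f] > amps[hi]]


# Hypothetical resonances, with f0 = 100 Hz so harmonics fall every 100 Hz.
amps = harmonic_amplitudes(100.0, [300.0, 2300.0])
peaks = formant_peaks(amps)  # [300.0, 2300.0]
```

The two peaks play the role of F1 and F2: regions where the harmonics come out with high amplitude.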
Mid-sagittal MRI for vowel /i/
• Jaw
• Lips
• Tongue
• Velum
• Larynx
Anatomy of the larynx
Articulatory trajectories

Figure panels for the utterance “Kuda Mimi pada?”: acoustic waveform; y-coordinate of the tongue dorsum; y-coordinate of the lower lip

Lips and tongue
Dynamic imaging

The sounds of language
International Phonetic Association (IPA)
• By manner:
– Origin of the airstream, and whether it flows inward or outward
– Whether the vocal folds are vibrating (i.e., voiced or unvoiced)
– Whether the velum is raised or lowered
• By place:
– Which part of the vocal tract is involved, i.e., the place
of articulation
– Shape of the lips (rounded/spread)
Manner of articulation 1
• Plosive (a.k.a. stop)
– Produced by the abrupt release of a constriction
somewhere along the vocal tract
– English has plosives at the following places:
• Labial (lips): /p, b/
• Alveolar (ridge behind the upper teeth): /t, d/
• Velar (soft palate): /k, g/
– Characterised by a short period of silence, a burst and
then turbulence noise after release
– Timing of voice onset after release distinguishes
voiced/unvoiced cognates
– Characteristics depend on context
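A small sketch of how place of articulation and voice onset time (VOT) together pick out an English plosive. The place-to-phoneme pairs come from the list above; the 30 ms boundary is an illustrative assumption, and as noted above the real boundary depends on context.

```python
# (voiced, unvoiced) cognate pairs at each place, as listed above.
PLOSIVES = {
    "labial": ("b", "p"),
    "alveolar": ("d", "t"),
    "velar": ("g", "k"),
}

# Assumed decision boundary; not a value from the lecture.
VOT_THRESHOLD_MS = 30.0


def classify_plosive(place, vot_ms):
    """Short voice-onset lag -> voiced cognate, long lag -> unvoiced."""
    voiced, unvoiced = PLOSIVES[place]
    return voiced if vot_ms < VOT_THRESHOLD_MS else unvoiced


early_voicing = classify_plosive("labial", 10.0)  # "b"
late_voicing = classify_plosive("velar", 60.0)    # "k"
```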
Manner of articulation 2
• Trill – repeated vibration of one articulator against another
• Tap (a.k.a. flap) – a single brief touch of one articulator against another
• Nasal
– Oral cavity is occluded, and velum is lowered, so air
flows out through the nose
– Characterised by vowel-like structure, but weaker
– Identity cued by transitions from surrounding sounds
– English has only voiced nasals:
• Labial /m/, alveolar /n/, velar /N/
Manner of articulation 3
• Fricative
– Airstream is forced through a constriction,
causing turbulence
– Characterised by non-periodic (noisy) sound
– Frequency cut-off is inversely proportional to
the length of the cavity in front of the constriction
– English has voiced and unvoiced fricatives:
• Labio-dental: /f, v/
• Inter-dental: /T, D/
• Alveolar: /s, z/
• Palato-alveolar: /S, Z/
• Glottal: /h/
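The inverse relation between cut-off frequency and front-cavity length can be illustrated with a quarter-wavelength resonator, f = c / (4L). This model is an assumption brought in for the sketch, not something stated in the lecture, and the cavity lengths below are hypothetical rather than measured.

```python
SPEED_OF_SOUND_M_S = 340.0  # approximate value for air


def front_cavity_resonance_hz(length_m):
    """Lowest resonance of a tube closed at one end (assumed model)."""
    if length_m <= 0:
        raise ValueError("cavity length must be positive")
    return SPEED_OF_SOUND_M_S / (4.0 * length_m)


# Hypothetical front-cavity lengths: a short one and one twice as long.
short_cavity_hz = front_cavity_resonance_hz(0.01)
long_cavity_hz = front_cavity_resonance_hz(0.02)
```

Doubling the length halves the frequency. This fits the general pattern that fricatives made nearer the lips, with less cavity in front of the constriction, carry their noise energy at higher frequencies.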
Vowel space

[Figure: vowel chart with F2 on the horizontal axis (2.0 to 0.4 kHz, FRONT to BACK) and F1 on the vertical axis (0.3 to 0.7 kHz); vowels plotted: heed, hid, head, had, heard, hard, hod, hoard]

English phonemes
SAMPA  Example    SAMPA  Example    SAMPA  Example
i      heed       @U     hoe        s      sip
I      hid        aU     how        S      ship
e      head       I@     here       v      vat
{      had        e@     there      D      that
A      hard       U@     moor       z      zip
V      hut        w      wet        Z      measure
Q      hod        j      yet        tS     chin
O      hoard      p      pea        dZ     gin
U      hood       t      tea        m      map
u      who’d      k      key        n      nap
@      about      b      bee        N      hang
3      heard      d      dog        l      led
eI     hay        g      good       r      red
aI     high       f      fat        h      hit
OI     boy        T      thin
Phonemes (ideal)
• minimal unit of speech discrimination between words
– language (even dialect) dependent
• defined in terms of their distribution
• described in terms of manner and place of articulation
• useful in ASR because they allow access to
standard on-line dictionaries
• English has approximately 44 phonemes
• but, they are ideal, not real objects
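A short sketch of how the SAMPA symbols in the English phoneme table might be handled in software, for example when reading dictionary pronunciations. The symbol lists are copied from the table; the greedy longest-match tokeniser and its function name are assumptions for illustration. A greedy match cannot tell an affricate such as tS from a /t/ followed by /S/.

```python
MONOPHTHONGS = ["i", "I", "e", "{", "A", "V", "Q", "O", "U", "u", "@", "3"]
DIPHTHONGS = ["eI", "aI", "OI", "@U", "aU", "I@", "e@", "U@"]
CONSONANTS = ["w", "j", "p", "t", "k", "b", "d", "g", "f", "T",
              "s", "S", "v", "D", "z", "Z", "tS", "dZ",
              "m", "n", "N", "l", "r", "h"]

INVENTORY = set(MONOPHTHONGS + DIPHTHONGS + CONSONANTS)


def tokenize(sampa):
    """Split an unspaced SAMPA string, preferring two-character symbols."""
    phones = []
    i = 0
    while i < len(sampa):
        for width in (2, 1):
            symbol = sampa[i:i + width]
            if len(symbol) == width and symbol in INVENTORY:
                phones.append(symbol)
                i += width
                break
        else:
            raise ValueError(f"unknown SAMPA symbol at position {i}")
    return phones


inventory_size = len(INVENTORY)  # 44
chin = tokenize("tSIn")          # ["tS", "I", "n"]
hoe = tokenize("h@U")            # ["h", "@U"]
```

The inventory size matches the figure of approximately 44 phonemes quoted above.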
Phones (real)
• Speech is inherently variable.
• Generic variation:
– rate of speaking, loudness, context
• Inter-speaker variation:
– physical, age, gender, dialect
• Intra-speaker variation:
– health, mood, external factors
• Linguistic units have no boundaries
• Features are asynchronous, e.g.,
– nasality
– lip-rounding (effect beyond syllable possible)
• Cues to identity may be in surrounding sounds
– plosives and nasals are cued by transitions
– “pod” vs. “pot”, cued by the length of the preceding vowel
Speech is not acoustic text
Contextual effects
• Co-articulation affects how each sound
is realized in its context
– /{/ realized differently in “cab” and “cat”
– /{/ in “can” may be nasalized
– /r/ sound in “train” different to that in
– /p/ may be different in each of “pin”, “spin”,
and “apt”
– “can be” may be assimilated to “cam be”
• Allophones, variants of a phoneme
– differences are caused by context, and are
not contrastive (“milk” vs. “leak”)
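One common way for a recogniser to account for contextual effects like these is to label each phone with its neighbours. The sketch below uses a left-centre+right label format and a "sil" boundary marker. Both are conventions borrowed from general ASR practice and are assumptions here, not something the lecture prescribes. The SAMPA transcriptions of “cab” and “cat” are also supplied for illustration.

```python
def triphone_labels(phones, boundary="sil"):
    """Label each phone with its left and right neighbours."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{left}-{centre}+{right}"
            for left, centre, right in zip(padded, padded[1:], padded[2:])]


cab = triphone_labels(["k", "{", "b"])  # ["sil-k+{", "k-{+b", "{-b+sil"]
cat = triphone_labels(["k", "{", "t"])  # ["sil-k+{", "k-{+t", "{-t+sil"]
```

The vowel /{/ gets a different label in each word (k-{+b versus k-{+t), so the two realisations can be modelled separately even though they belong to the same phoneme.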
Fluent speech
• Reduction - target positions may not be reached
– vowels tend to be neutralised (centralised)
– consonants may not be fully articulated
• Elision - sounds get missed out in normal fluent speech
– “fish and chips” to “fish ‘n’ chips”
– “temporary” to “tempry”
– “bread and butter” to “brem budder”
• Epenthesis - sounds may be inserted
– “law and order” to “Laura Norder”
– “something” to “sumpfing”
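Elision and epenthesis can be made concrete by aligning a canonical pronunciation against what was actually said. The sketch below uses a standard minimum-edit-distance alignment, which is a technique chosen for illustration rather than one the lecture names. The SAMPA transcriptions are hypothetical renderings of the examples above.

```python
def align(canonical, realised):
    """Minimum-edit alignment as a list of (operation, canonical, realised)."""
    n, m = len(canonical), len(realised)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (canonical[i - 1] != realised[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the end to recover one optimal sequence of operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and canonical[i - 1] == realised[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (not same):
            ops.append(("match" if same else "substitute",
                        canonical[i - 1], realised[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", canonical[i - 1], None))  # elision
            i -= 1
        else:
            ops.append(("insert", None, realised[j - 1]))   # epenthesis
            j -= 1
    ops.reverse()
    return ops


# "fish and chips" -> "fish 'n' chips" (hypothetical transcriptions)
careful = ["f", "I", "S", "{", "n", "d", "tS", "I", "p", "s"]
fluent = ["f", "I", "S", "n", "tS", "I", "p", "s"]
elided = [c for op, c, _ in align(careful, fluent) if op == "delete"]

# "something" -> "sumpfing" (hypothetical transcriptions)
ops = align(["s", "V", "m", "T", "I", "N"],
            ["s", "V", "m", "p", "f", "I", "N"])
inserted = sum(1 for op, _, _ in ops if op == "insert")
```

In the first example the alignment reports the vowel and the /d/ of “and” as deleted. In the second it reports one inserted phone and one substituted phone.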
Beyond the phoneme
• homophones, or homonyms
– “to”, “too”, “two”
– “hear”, “here”
– “glasses”, for seeing or drinking
• ambiguity of segmentation
– “grey tape” or “great ape”
– “how to wreck a nice beach”
• intonation changes meaning
– “He’s gone” or “He’s gone?”
• emphasis (a.k.a. stress)
– “the cat sat on the mat”
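The segmentation ambiguity listed above (“grey tape” or “great ape”) can be reproduced with a toy lexicon. The four-word lexicon and its SAMPA transcriptions are assumptions made for this sketch, and the exhaustive recursive search is only practical at this scale.

```python
# Hypothetical mini-lexicon: phone sequence -> word.
LEXICON = {
    ("g", "r", "eI"): "grey",
    ("g", "r", "eI", "t"): "great",
    ("t", "eI", "p"): "tape",
    ("eI", "p"): "ape",
}


def segmentations(phones):
    """Return every way to split the phone sequence into lexicon words."""
    phones = tuple(phones)
    if not phones:
        return [[]]  # one way to segment nothing: no words
    results = []
    for end in range(1, len(phones) + 1):
        word = LEXICON.get(phones[:end])
        if word is not None:
            for rest in segmentations(phones[end:]):
                results.append([word] + rest)
    return results


readings = segmentations(["g", "r", "eI", "t", "eI", "p"])
# [["grey", "tape"], ["great", "ape"]]
```

Both readings cover the same phone string exactly, so the phone sequence alone cannot choose between them. Further cues or context are needed.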
External factors
• Noise - Lombard effect
• Vibration
– vibrations in the chest, oral and nasal
cavities interfere with speech production
• Fatigue
– speaking rate decreases, slurring occurs due
to loss of control
• Fear
– speaking rate increases, pitch rises due to
muscle tightening
• Cognitive loading
– interaction with other tasks, stress
Summary of speech sounds
• The speech signal
– varies from person to person, and occasion to occasion
– is not broken up into convenient units
– is altered by external physical factors
• Speech sounds
– are inherently confusable (i.e., articulated in largely
similar ways)
– may be inserted or missing altogether
– change with context, which is required to be able to
make an unambiguous interpretation