
Lecture: 7-8

Vocal Tract Model and Acoustic Phonetics
Dr. Shikha Tripathi, PES, Blr
¡  Speech production system
§  States of vocal cords
▪  Breathing, Voiced, Unvoiced and others
§  Glottal flow model (spectral coloring)
¡  Vocal Tract Model
§  Spectral shaping

23/08/2019 Dr.Shikha Tripathi@PESU Blr 2


¡  Model of Vocal Tract
§  LTI Model for speech production
§  Uniform lossless tube model
§  Concatenated tube model
§  Lip radiation
¡  Spectrograms
§  Narrowband
§  Wideband
¡  Acoustic Phonetics
¡  Auditory Perception
¡  Under the LTI assumption for the vocal tract, when the sound source
occurs at the glottis, the speech waveform can be
expressed as a convolution (denoted by *):
Speech output from lips = excitation (glottal flow input) * vocal tract impulse response

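¡  Ex: Consider a periodic glottal flow source (voicing), where g(n) is the
glottal flow over one cycle and p(n) is a periodic impulse train:

u(n) = g(n) * p(n)
¡  When u(n) is passed through an LTI vocal tract (VT) with impulse response
h(n), the VT output is given by

x(n) = h(n) * u(n) = h(n) * (g(n) * p(n))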
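The convolution chain x(n) = h(n) * (g(n) * p(n)) can be sketched numerically. This is a minimal illustration using only the Python standard library; the pitch period `P`, the glottal pulse `g`, and the damped-cosine vocal tract response `h` are toy values assumed for the example, not values from the lecture.

```python
import math

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

P = 40  # assumed pitch period in samples
p = [1.0 if n % P == 0 else 0.0 for n in range(400)]      # impulse train p(n)
g = [0.0, 0.3, 0.7, 1.0, 0.6, 0.2]                         # toy glottal pulse g(n)
h = [(0.85 ** n) * math.cos(0.6 * n) for n in range(30)]   # toy VT response h(n)

u = convolve(g, p)   # periodic glottal flow u(n) = g(n) * p(n)
x = convolve(h, u)   # VT output x(n) = h(n) * u(n)
```

Because the source is periodic, the output repeats every `P` samples; that periodicity is what produces harmonics in the spectrum.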



¡  A window centered at time τ, w(n, τ), is applied to the VT output to
obtain the speech segment

x(n, τ) = w(n, τ) [h(n) * (g(n) * p(n))]

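¡  In the frequency domain, with P the pitch period and ω_k = 2πk/P the
k-th harmonic frequency,

$$X(\omega,\tau) = \frac{1}{P}\sum_{k=-\infty}^{\infty} H(\omega_k)\,G(\omega_k)\,W(\omega-\omega_k,\tau)$$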
Figure: Illustration of the relation between the glottal source harmonics ω1, ω2, …, ωM, the vocal tract formants F1, F2, …, FN, and the spectral envelope H(ω)G(ω).

The peaks in the spectral envelope correspond to the vocal-tract formant frequencies F1, F2, …, FN.

A formant corresponds to the vocal tract poles, while the harmonics arise from the periodicity of the glottal source.
¡  For a perfectly periodic source, the spectrum of the
vocal tract is sampled only at the harmonic
frequencies

¡  This makes formant estimation difficult: a formant peak that falls between two harmonics is not observed directly
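The effect of harmonic sampling on formant estimation can be illustrated with a small sketch. It assumes a single two-pole digital resonator as a stand-in for one vocal tract formant; the formant frequency (500 Hz), bandwidth (80 Hz), sampling rate (8 kHz) and the function names are all hypothetical choices for the illustration.

```python
import cmath
import math

def resonator_gain(f_hz, formant_hz, bandwidth_hz, fs):
    """Magnitude response of a two-pole resonator 1/(1 - 2r cos(t) z^-1 + r^2 z^-2)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)       # pole radius from bandwidth
    theta = 2 * math.pi * formant_hz / fs            # pole angle from formant frequency
    z_inv = cmath.exp(-2j * math.pi * f_hz / fs)
    return 1.0 / abs(1 - 2 * r * math.cos(theta) * z_inv + r * r * z_inv ** 2)

def strongest_harmonic(f0_hz, formant_hz=500.0, bandwidth_hz=80.0, fs=8000.0):
    """Return the harmonic of f0 at which the sampled envelope is largest."""
    n_harmonics = int((fs / 2) // f0_hz)
    harmonics = [k * f0_hz for k in range(1, n_harmonics + 1)]
    return max(harmonics,
               key=lambda f: resonator_gain(f, formant_hz, bandwidth_hz, fs))

low_pitch_peak = strongest_harmonic(100.0)    # a harmonic lands on the formant
high_pitch_peak = strongest_harmonic(300.0)   # harmonics straddle the formant
```

With a 100 Hz pitch a harmonic falls on the 500 Hz resonance, so the strongest harmonic identifies the formant. With a 300 Hz pitch the nearest harmonics are 300 Hz and 600 Hz, so picking the strongest harmonic places the formant at 600 Hz instead of 500 Hz.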



¡  Consider a singer singing a tone whose first harmonic, ω1, is
much higher than the first formant F1 of the vowel being sung
¡  When the harmonics sample the nulls of the VT spectrum,
the resulting sound is weak, especially against accompanying instruments
¡  To enhance the sound, the singer creates a VT
configuration with a widened jaw, which
raises the first formant frequency so that it can
match the frequency of the first harmonic,
generating a louder sound

Figure panels: (a) first harmonic higher than first formant frequency; (b) first harmonic matched to first formant frequency.
¡  The nasal and oral components of the vocal tract are
coupled by the velum.
§  When the velum is lowered, opening the
nasal passage, and the oral tract is shut off by the tongue
or lips, sound propagates through the nasal passage and
out through the nose
¡  The resulting nasal sound has a spectrum that is
dominated by the low-frequency formants of the large
volume of the nasal cavity
¡  Since the nasal cavity, unlike the oral tract, has a fixed shape,
the characteristics of nasal sounds may be useful for
speaker identification
¡  The velum can be lowered even when the oral tract is open.
When this coupling occurs, we get a nasalized
sound
¡  Two effects of the nasal passage:
§  Formant bandwidths of the VT become broader
because of loss of energy through the nasal passage
§  Anti-resonances are introduced (zeros in the VT
transfer function due to absorption of energy at
the resonances of the nasal passage)

Fig. 2.2 Source/system model for a speech signal.

... simulation of sound generation and transmission in the vocal tract [36, 93], but, for the most part, it is sufficient to model the production of a sampled speech signal by a discrete-time system model such ...
u[n,τ ] = w[n,τ ]⋅ (g[n]* p[n]).

Illustration of periodic glottal flow: (a) typical glottal flow; (b) same as (a) with
lower pitch; (c) same as (a) with softer glottal flow.
Figure: LTI source/system model. An impulse train generator or a random signal generator excites the impulse response of the vocal tract to produce the generated speech.

¡  In spoken speech we can change the resonant modes of the vocal cavity, and
we can stretch the vocal cords to modify the pitch period for different
vowels. Thus the model should be time varying
Figure: Linear Time Varying (LTV) model. An impulse train generator with variable pitch period or a random signal generator is scaled by an amplitude (multiplier) and excites a vocal tract with variable impulse response h(n) to produce the generated speech.
¡  Normally the pitch period and impulse response
are assumed constant over an interval of about 20 ms.
¡  Hence a short-time window is used to view the
signal, and over the window the parameters are
assumed to remain constant.
¡  The parameters over this short segment are
described by the LTI model
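The short-time view described above amounts to cutting the signal into frames. A minimal sketch follows, assuming a 20 ms frame as in the text; the 10 ms hop and the function name `frame_signal` are illustrative assumptions, not part of the lecture.

```python
def frame_signal(x, fs, frame_ms=20.0, hop_ms=10.0):
    """Split x into overlapping frames; parameters are treated as constant per frame."""
    frame_len = int(round(fs * frame_ms / 1000.0))   # samples per frame
    hop = int(round(fs * hop_ms / 1000.0))           # samples between frame starts
    return [x[start:start + frame_len]
            for start in range(0, len(x) - frame_len + 1, hop)]

fs = 8000                      # assumed sampling rate in Hz
signal = list(range(800))      # 100 ms placeholder signal
frames = frame_signal(signal, fs)
```

Each frame would then be analyzed with a single (time-invariant) pitch period and impulse response.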



¡  Simple model assuming vocal tract area
function A(x,t) constant in both x and t (time
invariant with uniform cross section)
Figure: Uniform lossless tube with ideal terminations, extending from x = 0 to x = l: a tube of uniform cross section excited by an ideal source of volume velocity flow.



¡  Tube of non-uniform, time-varying cross section

¡  For frequencies below 4 kHz, plane wave propagation along the axis of the tube is assumed
¡  Another assumption: there are no losses due to viscosity or
thermal conduction, either in the bulk of the fluid or at the walls of the
tube.
¡  With these assumptions and the laws of conservation of mass,
momentum and energy, Portnoff (1973) gave the model using
pressure equations
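For reference, the lossless-tube equations usually quoted for this model (for example in Rabiner and Schafer's treatment) are written out below. The symbols are assumptions of this sketch: p(x,t) is the sound pressure, u(x,t) the volume velocity, A(x,t) the area function, ρ the density of air, and c the speed of sound.

```latex
-\frac{\partial p}{\partial x} = \rho\,\frac{\partial (u/A)}{\partial t},
\qquad
-\frac{\partial u}{\partial x} = \frac{1}{\rho c^{2}}\,\frac{\partial (pA)}{\partial t} + \frac{\partial A}{\partial t}
```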



If A(x,t) = A is constant then the equations modify to
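With a constant area the time derivative of A vanishes and A can be taken outside the derivatives. Under the same assumed notation (p for sound pressure, u for volume velocity, ρ for air density, c for the speed of sound), the equations take the standard form:

```latex
-\frac{\partial p}{\partial x} = \frac{\rho}{A}\,\frac{\partial u}{\partial t},
\qquad
-\frac{\partial u}{\partial x} = \frac{A}{\rho c^{2}}\,\frac{\partial p}{\partial t}
```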



¡  The solution of these equations has the form

¡  Similarly, for a lossless uniform electrical transmission line,
the voltage v(x,t) and current i(x,t) on the line satisfy the equations:
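The two results referred to on this slide can be written out as follows. This is the standard textbook form, with assumed notation: u⁺ and u⁻ are the forward and backward travelling volume-velocity waves, ρ is the air density, c the speed of sound, and L and C are the inductance and capacitance per unit length of the line.

```latex
% Travelling-wave solution for the uniform lossless tube
u(x,t) = u^{+}(t - x/c) - u^{-}(t + x/c),
\qquad
p(x,t) = \frac{\rho c}{A}\left[\,u^{+}(t - x/c) + u^{-}(t + x/c)\right]

% Lossless uniform transmission line
-\frac{\partial v}{\partial x} = L\,\frac{\partial i}{\partial t},
\qquad
-\frac{\partial i}{\partial x} = C\,\frac{\partial v}{\partial t}
```

Comparing the transmission-line pair with the constant-area tube equations shows the correspondence p ↔ v, u ↔ i, ρ/A ↔ L and A/(ρc²) ↔ C.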



Z0 = ρc/A (3.26)
The acoustic admittance, Y(x, Ω), and acoustic characteristic admittance,
Y0, are defined as the respective reciprocals.

The pressure equations are analogous to electrical circuit equations, giving the following analogy:

TABLE 3.1. Analogies Between Acoustic and Electrical Quantities
for Transmission Line and Acoustic Tube Analysis.

Acoustic Quantity                    Electrical Quantity
p(x, t)   Sound pressure             v(x, t)   Voltage
v(x, t)   Volume velocity            i(x, t)   Current
ρ/A       Acoustic inductance        L         Inductance
A/(ρc²)   Acoustic capacitance       C         Capacitance



¡  From these analogies, the uniform acoustic tube
behaves identically to a lossless uniform
transmission line terminated in a short circuit
(v(l,t) = 0) at one end and excited by a current
source (i(0,t) = ia(t)) at the other end.

Figure: transmission line driven by the current source ia(t) at x = 0 and short-circuited (v(l,t) = 0) at x = l.
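A well-known consequence of this short-circuit (p(l,t) = 0) model is that the uniform tube resonates at odd multiples of c/(4l). The sketch below computes those resonances; the formula is the standard quarter-wavelength result rather than something stated on the slides, and the tube length of 17.5 cm and sound speed of 350 m/s are assumed typical values.

```python
def uniform_tube_formants(length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances F_n = (2n - 1) c / (4 l) of a uniform lossless tube,
    driven at one end and ideally open (zero pressure) at the other."""
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, n_formants + 1)]

# Assumed vocal tract length of 17.5 cm.
formants = uniform_tube_formants(0.175)
```

For these assumed values the resonances come out at approximately 500, 1500 and 2500 Hz, evenly spaced as expected for a neutral, uniform tract.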



¡  In a practical situation there are losses in the tract.
¡  Assumption: the VT can be represented as a concatenation of
lossless acoustic tubes.

Figure: Concatenated tube model. The k-th tube has cross-sectional area Ak and length lk.

¡  If a large number of short tubes is used, we can
reasonably expect the resonant frequencies of the concatenated
tubes to be close to those of a tube with continuously varying area
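At each junction between adjacent tubes, part of the travelling wave is reflected. In the standard treatment of the concatenated tube model (not spelled out on the slide), the reflection coefficient at the junction between tube k and tube k+1 is r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k). The sketch below computes these values for a hypothetical area list.

```python
def reflection_coefficients(areas):
    """Junction reflection coefficients r_k = (A[k+1] - A[k]) / (A[k+1] + A[k])."""
    return [(a_next - a) / (a_next + a)
            for a, a_next in zip(areas, areas[1:])]

# Hypothetical cross-sectional areas (arbitrary units), glottis to lips.
areas = [1.0, 3.0, 3.0, 1.0]
r = reflection_coefficients(areas)
```

Equal adjacent areas give r_k = 0 (no reflection), which is why a uniform tube needs no junctions at all; every coefficient lies between -1 and 1 for positive areas.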
FIGURE 2.2. A block diagram of human speech production. Labeled elements: muscle force, lungs, trachea, vocal folds, pharyngeal cavity, tongue hump, oral sound output, nasal sound output.



... coupling can substantially influence the frequency characteristics of the sound radiated from the mouth. If the velum is lowered, the nasal tract is ...

... tract is not laid out along a straight line as in Figure 2.1, this type of model is a reasonable approximation for wavelengths of the sounds in speech.

The sounds of speech are generated in the system of Figure 2.1 in several ways. Voiced sounds (vowels, liquids, glides, nasals in Table 2.1) ...

Fig. 2.1 Schematic model of the vocal tract system. (After Flanagan et al. [35], 1970.)



¡  The internal losses affect the sound transmission properties
of the VT.
¡  We generally assume the boundary condition p(l,t) = 0 at the lips
¡  In electrical transmission line terminology this corresponds to a
short circuit
¡  A reasonable model treats the lip opening as an orifice in a sphere
¡  In this model, at low frequencies the opening can be considered
a radiating surface, with the radiated sound waves being
diffracted by the spherical baffle that represents the head.

¡  The radiating surface (lip opening) is small compared to the size of the
sphere
¡  Radiation: the transfer function considered so far is the
relation between the volume velocity at the lips and the volume
velocity at the source
¡  If we wish to obtain a model for the pressure at the lips, then
the effects of radiation must be included. We desire
PL(z) = R(z) UL(z)
where PL(z) is the pressure at the lips, R(z) the radiation at the lips, and UL(z) the volume velocity at the lips
¡  A reasonable approximation to the radiation effects is
obtained with R(z) = R0(1 − z^-1), a first backward
difference (all-zero model)
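The first backward difference R(z) = R0(1 - z^-1) can be sketched directly in the time domain, together with its gain at a given frequency. The function names, the default R0 = 1, and the 8 kHz sampling rate are assumptions made for this illustration.

```python
import cmath
import math

def lip_radiation(u_lips, r0=1.0):
    """Apply p_L(n) = R0 * (u_L(n) - u_L(n-1)), i.e. R(z) = R0 (1 - z^-1)."""
    out = []
    prev = 0.0
    for u in u_lips:
        out.append(r0 * (u - prev))
        prev = u
    return out

def radiation_gain_db(f_hz, fs, r0=1.0):
    """Gain of R(z) in dB, evaluated on the unit circle at frequency f_hz."""
    z_inv = cmath.exp(-2j * math.pi * f_hz / fs)
    return 20.0 * math.log10(abs(r0 * (1 - z_inv)))

fs = 8000.0
pressure = lip_radiation([1.0, 1.0, 1.0])   # constant flow gives no pressure after onset
octave_rise = radiation_gain_db(200.0, fs) - radiation_gain_db(100.0, fs)
```

At low frequencies the gain rises by roughly 6 dB per octave, so the radiation term acts as a high-frequency emphasis on the volume velocity at the lips.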

¡  The nature of the source: periodic, noisy, impulsive,
or a combination of these
¡  The shape of the VT: described by the place of the tongue
hump along the oral tract and the degree of
constriction of the hump; also determined by a
possible connection to the nasal passage through the velum
¡  The time-domain waveform, which gives the
pressure change with time at the lips output
¡  The time-varying spectral characteristics
revealed through the spectrogram



¡  Phoneme: Fundamental distinctive unit of a
language
§  It is the speech sound class that differentiates words
of a language (cat, bat, mat, pat)
§  In American English there are 40 phonemes
(between 32 and 64 for most languages)
¡  Allophones: Slight acoustic variations of each
phoneme
§  Permissible freedom allowed within each language in
producing a phoneme (the transition between phonemes
affects its utterance; context also plays a role)
¡  Phonetics: Study of the sounds of human speech
¡  Phonetic Transcription: Set of symbols used to
represent phonemes
¡  Three basic areas of study:

§  Articulatory phonetics: Study of the production of
speech by the articulators and vocal tract of the speaker
§  Acoustic phonetics: Study of the transmission of
speech from speaker to listener (study of the sound
waves generated by human vocal organs for
communication)
§  Auditory phonetics: Study of the
reception and perception of speech by the listener



¡  The relation between phonemes and their
acoustic realizations forms the basis for many
speech applications
¡  It also helps in understanding speech perception
¡  Acoustic phonetics treats the speech signal as the
output of the speech production process and relates
the signal to its linguistic input (e.g., a sentence)
¡  It considers differentiation of sounds on an
acoustic (not articulatory) basis
¡  There is much variation in speech for each
phoneme (due to variation in articulation by
different vocal tracts)
¡  In this study, common acoustic features are
described for each phoneme
¡  Most speech analysis uses spectral displays
which show distribution of speech energy as a
function of frequency
¡  Ear effectively extracts spectral amplitudes
from speech signals
¡  Most acoustic-phonetic features are more
apparent spectrally than in time domain
¡  Dynamic behavior of formants and spectral
regions of energy appear to be primary
acoustic cues to phonemes
¡  Spectrogram: A basic tool for spectral analysis
¡  Converts a two-dimensional speech waveform (amplitude/time)
into a three-dimensional pattern (amplitude/frequency/time)

Narrowband Spectrogram (gives good spectral resolution)
Wideband Spectrogram (gives good time resolution)

¡  Consider a narrowband (20 ms Hamming window)
and a wideband (4 ms Hamming window)
spectrogram for
‘which tea party did Baker go to?’
¡  The spectrogram is computed with a 512-point FFT
¡  For the narrowband case, the 20 ms window is shifted at a 5 ms
frame interval
¡  For the wideband case, the 4 ms window is shifted at a 1 ms frame
interval
¡  Both reveal the speech spectral envelope H(ω)G(ω),
consisting of the vocal tract formant and glottal
contributions
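The trade-off between the two window lengths can be illustrated on a synthetic pulse train. This is a minimal sketch under stated assumptions: an 8 kHz sampling rate, a 200 Hz pitch, and a direct evaluation of the windowed Fourier transform at chosen frequencies instead of the 512-point FFT mentioned above. The helper names are hypothetical.

```python
import cmath
import math

def hamming(n_samples):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def frame_magnitude(x, start, win_ms, freq_hz, fs):
    """|X(w, tau)| for one Hamming-windowed frame, at one analysis frequency."""
    n_win = int(round(fs * win_ms / 1000.0))
    w = hamming(n_win)
    acc = 0j
    for n in range(n_win):
        acc += w[n] * x[start + n] * cmath.exp(-2j * math.pi * freq_hz * n / fs)
    return abs(acc)

fs = 8000
period = 40                                   # 200 Hz pitch at 8 kHz (assumed)
x = [1.0 if n % period == 0 else 0.0 for n in range(800)]

# Narrowband (20 ms): on a harmonic vs. halfway between two harmonics.
nb_on_harmonic = frame_magnitude(x, 0, 20.0, 200.0, fs)
nb_between = frame_magnitude(x, 0, 20.0, 300.0, fs)

# Wideband (4 ms): a frame containing one pulse vs. a frame between pulses.
wb_pulse_200 = frame_magnitude(x, 24, 4.0, 200.0, fs)
wb_pulse_300 = frame_magnitude(x, 24, 4.0, 300.0, fs)
wb_gap = frame_magnitude(x, 44, 4.0, 200.0, fs)
```

The 20 ms window spans several pitch periods, so energy concentrates at the harmonics (horizontal striations). The 4 ms window is shorter than one pitch period, so its spectrum is flat across frequency but switches on and off with each glottal pulse (vertical striations).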
Figure: wideband and narrowband spectrograms of the utterance ‘which tea party did Baker go to?’
¡  The speech sound /t/ in ‘tea’ and ‘to’ is blurred in the
narrowband spectrogram but sharp in the
wideband spectrogram
¡  Consider an utterance that transitions from
normal voicing to diplophonic voicing as the
pitch becomes very low (“Jazz Hour”)



Figure: spectrograms for the utterance “Jazz Hour”
¡  Here the pitch is so low that horizontal striations
are barely visible in the narrowband spectrogram,
in spite of a window length of 40 ms
¡  In the wideband spectrogram, we clearly see
vertical striations corresponding to the primary
glottal pulses and the secondary diplophonic
pulses



3.4.1 Spectrograms

A basic tool for spectral analysis is the wideband spectrogram, which is discussed further in Chapter 6. A spectrogram converts a two-dimensional speech waveform (amplitude/time) into a three-dimensional pattern (amplitude/frequency/time) (Figure 3.10). With time and frequency on the horizontal and vertical axes, respectively, amplitude is noted by the darkness of the display. Peaks in the spectrum (e.g., formant resonances) appear as dark horizontal bands [73]. Voiced sounds cause vertical marks in the spectrogram due to an increase in speech amplitude each time the vocal folds close. The noise in unvoiced sounds causes rectangular dark patterns, randomly punctuated with light spots due to instantaneous ...

¡  Consider the spectrogram of short sections of
vowels from a male speaker
¡  Formants for each vowel are noted by dots

Figure 3.10 Spectrogram of short sections of English vowels from a male speaker. Formants for each vowel are noted by dots.



¡  With time and frequency on the horizontal and
vertical axes, respectively, amplitude is represented
by the darkness of the display
¡  Peaks in the spectrum (e.g., formant
resonances) appear as dark horizontal bands
§  Voiced sounds cause vertical marks in the
spectrogram due to an increase in speech
amplitude each time the vocal folds close.
§  The noise in unvoiced sounds causes rectangular
dark patterns, randomly punctuated with light
spots due to instantaneous variations in energy.



¡  Spectrograms are used to analyze formant frequencies
¡  If detailed information is needed, individual spectral cross
sections need to be analyzed

Figure 3.11 Cross-sections of spectra from the middle of English vowels of a male speaker, showing formants as spectral peaks. (Axes: amplitude, 0 to 60, versus frequency, 0 to 3 kHz.)

... variations in energy. Spectrograms portray only spectral amplitude, ignoring phase information ...
¡  Wideband spectrograms employ 300 Hz
bandpass filters with response times of a few
ms, which yield good time resolution (for
accurate durational measurements) but
smoothed spectra.
¡  Smoothing speech energy over 300 Hz
produces good formant displays of dark
bands, where the center frequency of each
resonance is assumed to be in the middle of
the band (provided that the skirts of a
resonance are approximately symmetric).
¡  Vowels are voiced (except when whispered),
are the phonemes with the greatest intensity,
and range in duration from 50 to 400 ms in
normal speech.
¡  Like all sounds excited solely by a periodic
glottal source, vowel energy is primarily
concentrated below 1 kHz and falls off at
about -6 dB/oct with frequency.
¡  Many relevant acoustic aspects of vowels can
be seen in Figure 3.12, which shows brief portions
of waveforms for five English vowels.
Figure 3.12 Typical acoustic waveforms for five English vowels (amplitude versus time). Each plot shows 40 ms of ...
¡  The signals are quasi-periodic due to repeated
excitations of the vocal tract by vocal fold
closures. Thus, vowels have line spectra with a
frequency spacing of F0 Hz (i.e., energy
concentrated at multiples of F0).
¡  The largest harmonic amplitudes are near the
low formant frequencies.



¡  Vowels are distinguished primarily by the
locations of their first three formant
frequencies.
¡  Due to varying vocal tract shapes and sizes,
formants vary considerably for different
speakers
¡  Figure 3.13 plots values of F1 and F2 for the ten
vowels spoken by 60 speakers.



Figure 3.13 Plot of F1 vs F2 for vowels spoken by 60 speakers. (After Peterson and Barney [31].) (Axes: first-formant frequency (Hz), up to 1400, versus second-formant frequency (Hz), 500 to 3500.)

¡  There is much overlap across speakers, such
that vowels with the same F1-F2 are heard as
different phonemes when uttered by
different speakers.
¡  Other aspects of the vowels (e.g., F0, upper
formants, bandwidths) enable listeners to
make correct interpretations.
¡  Each speaker keeps his vowels well apart in
three-dimensional F1-F3 space.



¡  The plot of F1–F2 values for vowels shows a pattern
called the vowel triangle

Figure 3.14 The vowel triangle for the vowels of Table 3.2. (Axes: horizontal, F2 (Hz), 800 to 2000; vertical, 200 to 800.)

¡  Point vowels /i, a, u/ have extreme values, and
most other vowels have formant values close to
one of the sides of the triangle
¡  The sides of the triangle are the /i/-/a/ axis for the front vowels and the /u/-/a/ axis
for the back vowels. (For English, a "vowel
quadrilateral" is more appropriate, including /æ/ as
the fourth point vowel, but the relative infrequency of
/æ/ in other languages has led to the tradition of a
vowel triangle.)
¡  The vowel triangle has articulatory as well as acoustic
interpretations: F1 decreases with tongue height, and
F2 decreases as the tongue is shifted backward.
¡  Thus, the high front /i/ has the lowest F1 and highest F2
of all vowels, and the low vowel /a/ has the highest F1.
¡  Vowels tend to have a more central tongue position
laterally as tongue height decreases, and so the F2
difference between the low vowels /æ/ and /a/ is much
less than between the high vowels /i/ and /u/.
In English, the combinations of features give 40 phonemes as follows

Phonemes in American English.


¡  Acoustic Phonetics
¡  Auditory Perception
¡  Time domain methods of speech processing
