Lec7-8 Speech Processing

Lecture: 7-8
Vocal Tract Model and Acoustic Phonetics
Dr. Shikha Tripathi, PES, Blr
¡  Speech production system
§  States of vocal cords
▪  Breathing, Voiced, Unvoiced and others
§  Glottal flow model (spectral coloring)
¡  Vocal Tract Model
§  Spectral shaping
23/08/2019 Dr.Shikha Tripathi@PESU Blr 2

¡  Model of Vocal Tract
§  LTI Model for speech production
§  Uniform lossless tube model
§  Concatenated tube model
§  Lip radiation
¡  Spectrograms
§  Narrowband
§  Wideband
¡  Acoustic Phonetics
¡  Auditory Perception
¡  Under the LTI assumption for the vocal tract, when the sound source
occurs at the glottis, the speech waveform can be expressed as
Speech output from lips = excitation (glottal flow input) * vocal tract impulse response,
where * denotes convolution
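¡  Ex: Consider a periodic glottal flow source (voicing):

$$u(n) = g(n) * p(n)$$

where g(n) is the glottal flow over one cycle and p(n) is a periodic impulse train with period P samples
¡  When u(n) is passed through the LTI vocal tract (VT) with impulse response h(n),
the VT output is given by

$$x(n) = h(n) * u(n) = h(n) * \big(g(n) * p(n)\big)$$

¡  A window centered at τ, w(n, τ), is applied to obtain a speech segment

$$x(n, \tau) = w(n, \tau)\,\big[h(n) * (g(n) * p(n))\big]$$

¡  In the frequency domain

$$X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau), \qquad \omega_k = \frac{2\pi k}{P}$$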
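The source/filter relation above can be sketched numerically as two convolutions. This is a minimal illustration only: the pulse shape `g`, the impulse response `h`, and the period of 8 samples are toy values chosen for the example, not values from the lecture.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def impulse_train(period, length):
    """p(n): unit impulses every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


# Toy (hypothetical) values, chosen only to make the structure visible.
g = [0.25, 0.5, 1.0, 0.5]          # one glottal pulse g(n)
h = [1.0, 0.6, 0.36]               # short vocal tract impulse response h(n)
p = impulse_train(period=8, length=24)

u = convolve(g, p)                 # periodic glottal flow u(n) = g(n) * p(n)
x = convolve(h, u)                 # vocal tract output x(n) = h(n) * u(n)
```

Because the pulse is shorter than the period, `u` is simply a copy of `g` at every impulse position, and `x` repeats the vocal tract response to that pulse once per period.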
Illustration of the relation between glottal source harmonics ω1, ω2, …, ωM, vocal tract
formants F1, F2, …, FN, and the spectral envelope H(ω)G(ω).
The peaks in the spectral envelope correspond to the vocal-tract formant frequencies F1, F2, …, FN.
A formant corresponds to the vocal tract poles, while the
harmonics arise from the periodicity of the glottal source.
¡  For a perfectly periodic source, the spectrum of the
vocal tract is sampled at the harmonic
frequencies
¡  This makes formant estimation difficult, because a formant
peak that falls between two harmonics is not observed directly
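A small sketch of why harmonic sampling complicates formant estimation. It assumes a single formant modeled as a two-pole digital resonator; the sampling rate, formant frequency, bandwidth, and pitch values are illustrative assumptions, not values from the lecture.

```python
import cmath
import math


def resonator_gain(f, formant, bandwidth, fs):
    """Magnitude response at f (Hz) of a two-pole resonator (assumed model)."""
    r = math.exp(-math.pi * bandwidth / fs)       # pole radius from bandwidth
    theta = 2 * math.pi * formant / fs            # pole angle from formant
    z_inv = cmath.exp(-2j * math.pi * f / fs)
    denom = 1 - 2 * r * math.cos(theta) * z_inv + r * r * z_inv ** 2
    return 1 / abs(denom)


def strongest_harmonic(f0, formant, bandwidth, fs):
    """Harmonic of f0 (below fs/2) with the largest envelope value."""
    harmonics = [k * f0 for k in range(1, int((fs / 2) // f0) + 1)]
    return max(harmonics, key=lambda f: resonator_gain(f, formant, bandwidth, fs))


fs = 8000                                          # assumed sampling rate (Hz)
on_grid = strongest_harmonic(100, 500, 80, fs)     # a harmonic sits on the formant
off_grid = strongest_harmonic(220, 500, 80, fs)    # formant lies between harmonics
```

With a 100 Hz pitch a harmonic lands on the 500 Hz formant, so picking the strongest harmonic recovers it. With a 220 Hz pitch the nearest harmonics are 440 Hz and 660 Hz, so a peak-picking estimate is off by tens of hertz even though the vocal tract has not changed.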

¡  Consider a singer singing a tone whose first harmonic, ω1, is
much higher than the first formant F1 of the vowel being sung
(a) First harmonic higher than first formant frequency
¡  When nulls of the VT spectrum are sampled at the harmonics,
the resulting sound is weak, especially against accompanying instruments
¡  To enhance the sound, the singer creates a VT
configuration with a widened jaw, which
raises the first formant frequency so that it can
match the frequency of the first harmonic,
generating a louder sound
(b) First harmonic matched to first formant frequency
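The effect of matching the first formant to the first harmonic can be illustrated with a rough calculation. The sketch below assumes a single second-order resonance with unity gain at DC; the 700 Hz harmonic, the 400 Hz and 700 Hz formant positions, and the 80 Hz bandwidth are hypothetical values picked for the example.

```python
import math


def resonance_gain(f, formant, bandwidth):
    """|H| at f (Hz) for a second-order resonance (assumed model)."""
    return formant ** 2 / math.sqrt((formant ** 2 - f ** 2) ** 2 + (bandwidth * f) ** 2)


def to_db(gain):
    return 20 * math.log10(gain)


first_harmonic = 700.0   # hypothetical sung pitch (Hz)
mismatched = resonance_gain(first_harmonic, formant=400.0, bandwidth=80.0)
matched = resonance_gain(first_harmonic, formant=700.0, bandwidth=80.0)
boost_db = to_db(matched) - to_db(mismatched)   # gain from tuning F1 to the harmonic
```

Under this simple model the gain at resonance equals formant/bandwidth, so tuning F1 onto the harmonic raises its level by a large margin compared with leaving the harmonic on the skirt of the resonance.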
¡  The nasal and oral components of the vocal tract are
coupled by the velum.
§  When the velum is lowered, introducing an opening
into the nasal passage, and the oral tract is shut off by the tongue
or lips, sound propagates through the nasal passage and
out through the nose
¡  The resulting nasal sound has a spectrum that is
dominated by the low-frequency formants of the large
volume of the nasal cavity
¡  Since the nasal cavity, unlike the oral tract, is constant in shape,
characteristics of nasal sounds may be useful for
speaker identification
¡  The velum can be lowered even when the oral tract is open.
When this coupling occurs, we get a nasalized
sound
¡  Two effects of the nasal passage:
§  Formant bandwidths of the VT become broader
because of loss of energy through the nasal passage
§  Anti-resonances are introduced (zeros in the VT
transfer function due to absorption of energy at
the resonances of the nasal passage)

Fig. 2.2 Source/system model for a speech signal.
... simulation of sound generation and transmission in the vocal tract
[36, 93], but, for the most part, it is sufficient to model the production
of a sampled speech signal by a discrete-time system model such ...
u[n,τ ] = w[n,τ ]⋅ (g[n]* p[n]).
Illustration of periodic glottal flow: (a) typical glottal flow; (b) same as (a) with
lower pitch; (c) same as (a) with softer glottal flow.
[Figure: LTI model of speech production. An impulse train generator or a random signal generator excites the impulse response of the vocal tract to produce the generated speech.]
¡  In spoken speech we can change the resonant modes of vocal cavity &
we can stretch the vocal cords to modify pitch period for different
vowels. Thus the model should be time varying
[Figure: Linear Time Varying (LTV) model. An impulse train generator with variable pitch period, or a random signal generator, is multiplied (x) by an amplitude and excites a variable vocal tract impulse response h(n) to produce the generated speech.]
¡  Normally the pitch period and the impulse response
are assumed constant over a period of 20 ms.
¡  Hence a short-time window is used to view the
signal, and over the window the parameters are
assumed to remain constant.
¡  The parameters over this short segment are
described by the LTI model
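The short-time view described above amounts to cutting the signal into frames and treating each frame as the output of a fixed LTI system. A minimal sketch follows; the 20 ms frame length comes from the text, while the 10 ms hop and the 8 kHz sampling rate are assumptions made for the example.

```python
def frame_signal(x, fs, frame_ms=20.0, hop_ms=10.0):
    """Split x into overlapping frames of frame_ms, advancing by hop_ms."""
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]


fs = 8000                          # assumed sampling rate (Hz)
signal = list(range(800))          # 100 ms of placeholder samples
frames = frame_signal(signal, fs)  # each frame is analyzed with fixed parameters
```

Each element of `frames` holds 160 samples (20 ms at 8 kHz); pitch period and vocal tract response would be estimated once per frame.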

¡  Simple model: assume the vocal tract area
function A(x,t) is constant in both x and t (time
invariant with uniform cross section)
[Figure: Uniform lossless tube with ideal terminations, extending from x = 0 to x = l. A tube of uniform cross section is excited by an ideal source of volume velocity flow.]

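¡  General case: tube of non-uniform, time-varying cross section A(x,t)
¡  For frequencies below 4 kHz, plane wave
propagation along the axis of the tube is assumed
¡  Another assumption: there are no losses due to viscosity or
thermal conduction, either in the bulk of the fluid or at the walls of the
tube.
¡  With these assumptions and the laws of conservation of mass,
momentum and energy, Portnoff (1973) gave the model using the
pressure equations

$$-\frac{\partial p}{\partial x} = \rho\,\frac{\partial (u/A)}{\partial t}, \qquad -\frac{\partial u}{\partial x} = \frac{1}{\rho c^{2}}\,\frac{\partial (pA)}{\partial t} + \frac{\partial A}{\partial t}$$

where p(x,t) is the sound pressure, u(x,t) is the volume velocity, ρ is the density of air and c is the velocity of sound.
If A(x,t) = A is constant then the equations modify to

$$-\frac{\partial p}{\partial x} = \frac{\rho}{A}\,\frac{\partial u}{\partial t}, \qquad -\frac{\partial u}{\partial x} = \frac{A}{\rho c^{2}}\,\frac{\partial p}{\partial t}$$

¡  The solution of these equations has the form

$$u(x,t) = u^{+}(t - x/c) - u^{-}(t + x/c), \qquad p(x,t) = \frac{\rho c}{A}\left[u^{+}(t - x/c) + u^{-}(t + x/c)\right]$$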
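¡  Similarly, in electrical transmission lines, for a lossless uniform line
the voltage v(x,t) and current i(x,t) in the line satisfy the equations:

$$-\frac{\partial v}{\partial x} = L\,\frac{\partial i}{\partial t}, \qquad -\frac{\partial i}{\partial x} = C\,\frac{\partial v}{\partial t}$$

where L and C are the inductance and capacitance per unit length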

$$Z_0 = \frac{\rho c}{A} \tag{3.26}$$

The acoustic admittance, Y(x, Ω), and acoustic characteristic admittance,
Y0, are defined as the respective reciprocals.
Several interesting uses are made of the acoustic impedance concept in
Appendix 3.A.2. First, the input impedances are characterized for the
open and closed termination cases. The results are shown in Table 3.2.

The pressure equations are analogous to
electrical circuit equations and we get the following
analogy:

TABLE 3.1. Analogies Between Acoustic and Electrical Quantities
for Transmission Line and Acoustic Tube Analysis.

Acoustic Quantity                    Electrical Quantity
p(x, t)    Sound pressure            v(x, t)    Voltage
v(x, t)    Volume velocity           i(x, t)    Current
ρ/A        Acoustic inductance       L          Inductance
A/(ρc²)    Acoustic capacitance      C          Capacitance
¡  From these analogies, the uniform acoustic tube
behaves identically to a lossless uniform
transmission line terminated in a short circuit
(v(l,t) = 0) at one end and excited by a current
source (i(0,t) = ia(t)) at the other end.
[Figure: transmission line from x = 0 to x = l, with current source i(0,t) = ia(t) at x = 0 and short circuit v(l,t) = 0 at x = l.]
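For the uniform lossless tube driven by a volume velocity source at one end and short-circuited (zero pressure) at the other, standard transmission-line theory gives quarter-wavelength resonances at F_n = (2n - 1)c / (4l). The slides do not state this formula or any numeric values, so the tube length of 17.5 cm and sound speed of 350 m/s below are assumptions commonly used for illustration.

```python
def uniform_tube_resonances(length_m, c=350.0, count=3):
    """Resonances (Hz) of a uniform lossless tube, closed at the source end
    and ideally open (p = 0) at the far end: F_n = (2n - 1) c / (4 l)."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]


# Assumed values: 17.5 cm tube, 350 m/s speed of sound.
formants = uniform_tube_resonances(0.175)
```

With these assumed values the resonances fall near 500, 1500 and 2500 Hz, roughly the formant pattern of a neutral vowel; a shorter tube scales all resonances upward.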

¡  In a practical situation there are losses in the tract.
¡  Assumption: the VT can be represented as a concatenation of
lossless acoustic tubes.
Concatenated tube model. The k-th tube has cross-sectional area Ak and length lk.
¡  If a large number of tubes of short length is used, we can
reasonably expect the resonant frequencies of the concatenated
tubes to be close to those of a tube with continuously varying area
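In the standard lossless-tube analysis, each junction between adjacent sections is characterized by a reflection coefficient r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k). The slides do not give this expression, so treat it as background from the usual textbook treatment; the area values below are hypothetical.

```python
def reflection_coefficients(areas):
    """Junction reflection coefficients for a chain of lossless tubes.

    r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k), one value per junction.
    """
    return [(a_next - a) / (a_next + a) for a, a_next in zip(areas, areas[1:])]


uniform = reflection_coefficients([2.0, 2.0, 2.0])       # no area change
stepped = reflection_coefficients([1.0, 3.0, 3.0, 0.5])  # hypothetical areas (cm^2)
```

A uniform tube gives zero reflection at every junction, which reduces the model to the single uniform tube above; larger area changes give coefficients closer to plus or minus one.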
[FIGURE 2.2. A block diagram of human speech production. Labels: muscle force, lungs, trachea, vocal folds, pharyngeal cavity, tongue hump, nasal sound, oral sound output.]

... coupling can substantially influence the frequency characteristics of the
sound radiated from the mouth. If the velum is lowered, the nasal tract is ...
... tract is not laid out along a straight line as in Figure 2.1, this type of
model is a reasonable approximation for wavelengths of the sounds in
speech.
The sounds of speech are generated in the system of Figure 2.1 in
several ways. Voiced sounds (vowels, liquids, glides, nasals in Table 2.1) ...

Fig. 2.1 Schematic model of the vocal tract system. (After Flanagan et al. [35].)
By Flanagan et al., 1970


¡  The internal losses affect the sound transmission properties
of the VT.
¡  We generally assume the boundary condition p(l,t) = 0 at the lips
¡  In electrical transmission line terminology this corresponds to a
short circuit
¡  A reasonable model is the lip opening as an orifice in a sphere
¡  In this model, at low frequencies the opening can be considered
a radiating surface, with the radiated sound waves being
diffracted by the spherical baffle that represents the head.
¡  The radiating surface (lip opening) is small compared to the size of
the sphere
¡  Radiation: the transfer function is considered as the
relation between the volume velocity at the lips and the volume
velocity at the source
¡  If we wish to obtain a model for the pressure at the lips, then the
effects of radiation must be included. We desire

$$P_L(z) = R(z)\, U_L(z)$$

where P_L(z) is the pressure at the lips, R(z) is the radiation at the lips and U_L(z) is the volume velocity at the lips
¡  A reasonable approximation to the radiation effects is
obtained with

$$R(z) = R_0\,(1 - z^{-1})$$

a first backward difference (all-zero model)
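The first backward difference R(z) = R0(1 - z^-1) is easy to apply in the time domain and to inspect in frequency. The sketch below assumes R0 = 1 and a zero initial condition; both are choices made for the example.

```python
import cmath
import math


def lip_radiation(u, r0=1.0):
    """Apply p[n] = r0 * (u[n] - u[n-1]) with u[-1] taken as 0."""
    prev = 0.0
    out = []
    for sample in u:
        out.append(r0 * (sample - prev))
        prev = sample
    return out


def radiation_gain(omega, r0=1.0):
    """|R(e^{j omega})| for R(z) = r0 * (1 - z^-1)."""
    return abs(r0 * (1 - cmath.exp(-1j * omega)))


step_response = lip_radiation([1.0, 1.0, 1.0, 1.0])
gain_dc = radiation_gain(0.0)
gain_nyquist = radiation_gain(math.pi)
```

The filter removes the constant (DC) part of the volume velocity and emphasizes high frequencies: its magnitude is 2·R0·|sin(ω/2)|, which rises by roughly 6 dB per octave at low frequencies.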

¡  The nature of the source: periodic, noisy, impulsive,
or a combination of these
¡  The shape of the VT: described by the place of the tongue
hump along the oral tract and the degree of
constriction of the hump; also determined by a
possible connection to the nasal passage through the velum
¡  The time-domain waveform, which gives the
pressure change with time at the lips output
¡  The time-varying spectral characteristics
revealed through the spectrogram

¡  Phoneme: Fundamental distinctive unit of a
language
§  It is the speech sound class that differentiates words
of a language (cat, bat, mat, pat)
§  In American English there are 40 phonemes
(between 32 and 64 for most languages)
¡  Allophones: Slight acoustic variations of each
phoneme
§  Permissible freedom allowed within each language in
producing a phoneme (the transition between phonemes
affects its utterance; context also plays a role)
¡  Phonetics: Study of the sounds of human speech
¡  Phonetic Transcription: Set of symbols used to
represent phonemes
¡  Three basic areas of study:
§  Articulatory phonetics: Study of the production of
speech by the articulators and vocal tract of the speaker
§  Acoustic phonetics: Study of the transmission of
speech from speaker to listener (study of the sound
waves generated by the human vocal organs for
communication)
§  Auditory phonetics: Study of the
reception and perception of speech by the listener

¡  The relation between phonemes and their
acoustic realizations forms the basis for many
speech applications
¡  It also helps in understanding speech perception
¡  Acoustic phonetics treats speech signal as the
output of speech production process and relates
the signal to its linguistic input (e.g., sentence)
¡  It considers differentiation of sounds on an
acoustic (not articulatory) basis
¡  There is much variation in speech for each
phoneme(due to variation in articulation by
different vocal tracts)
¡  In this study common acoustic features are
described for each phoneme
¡  Most speech analysis uses spectral displays
which show distribution of speech energy as a
function of frequency
¡  Ear effectively extracts spectral amplitudes
from speech signals
¡  Most acoustic-phonetic features are more
apparent spectrally than in time domain
¡  Dynamic behavior of formants and spectral
regions of energy appear to be primary
acoustic cues to phonemes
¡  Spectrogram: A basic tool for spectral analysis
¡  Converts a two-dimensional speech waveform (amplitude/time)
into a three-dimensional pattern (amplitude/frequency/
time)
Narrowband Spectrogram (gives good spectral resolution)
Wideband Spectrogram (gives good time resolution)

¡  Consider a narrowband (20-ms Hamming window)
and a wideband (4-ms Hamming window)
spectrogram for
‘which tea party did Baker go to?’
¡  The spectrogram is computed with a 512-point FFT
¡  For narrowband, the 20-ms window is shifted at a 5-ms
frame interval
¡  For wideband, the 4-ms window is shifted at a 1-ms frame
interval
¡  Both reveal the speech spectral envelope H(ω)G(ω),
consisting of the vocal tract formant and glottal
contributions
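The trade-off between the two settings can be made concrete by converting the window and hop durations to samples and estimating the spectral smearing of each window. The sketch below assumes a 10 kHz sampling rate (the slides do not state one) and uses the approximate null-to-null mainlobe width of a Hamming window, 8π/N radians per sample, i.e. about 4·fs/N Hz.

```python
def spectrogram_params(fs, win_ms, hop_ms):
    """Return (window samples, hop samples, approx. Hamming mainlobe width in Hz)."""
    win = int(round(fs * win_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    mainlobe_hz = 4.0 * fs / win      # approximate null-to-null width
    return win, hop, mainlobe_hz


fs = 10000                                   # assumed sampling rate (Hz)
narrowband = spectrogram_params(fs, 20, 5)   # long window, fine frequency detail
wideband = spectrogram_params(fs, 4, 1)      # short window, fine time detail
```

Under these assumptions both windows are shorter than 512 samples, so each frame is zero-padded before the 512-point FFT. The wideband mainlobe is about five times wider than the narrowband one, which is why it smooths over individual harmonics while following rapid events in time.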
‘which tea party did Baker go to?’
Wideband Spectrogram
Narrowband Spectrogram
¡  The speech sound ‘t’ in ‘tea’ and ‘to’ is blurry in the
narrowband spectrogram while sharp in the
wideband spectrogram
¡  Consider an utterance that transitions from
normal voicing to diplophonic voicing as the
pitch becomes very low (“Jazz Hour”)

Spectrograms for the utterance “Jazz Hour”
¡  Here the pitch is so low that horizontal striations
are barely visible in the narrowband spectrogram,
in spite of a window length of 40 ms
¡  In the wideband spectrogram, we clearly see
vertical striations corresponding to the primary
glottal pulses and the secondary diplophonic
pulses

3.4.1 Spectrograms
A basic tool for spectral analysis is the wideband spectrogram, which is discussed
further in Chapter 6. A spectrogram converts a two-dimensional speech waveform
(amplitude/time) into a three-dimensional pattern (amplitude/frequency/time) (Figure 3.10). With
time and frequency on the horizontal and vertical axes, respectively, amplitude is noted by the
darkness of the display. Peaks in the spectrum (e.g., formant resonances) appear as dark
horizontal bands [73]. Voiced sounds cause vertical marks in the spectrogram due to an
increase in speech amplitude each time the vocal folds close. The noise in unvoiced sounds
causes rectangular dark patterns, randomly punctuated with light spots due to instantaneous
variations in energy.
¡  Consider the spectrogram of short sections of
vowels from a male speaker
¡  Formants for each vowel are noted by dots
Figure 3.10 Spectrogram of short sections of English vowels from a male speaker.
Formants for each vowel are noted by dots.

¡  Time and Frequency on horizontal and
vertical axes resp., amplitude is represented
by darkness of the display
¡  Peaks in the spectrum(e.g., formant
resonances) appear as dark horizontal bands
§  Voiced sounds cause vertical marks in the
spectrogram due to an increase in speech
amplitude each time the vocal folds close.
§  The noise in unvoiced sounds causes rectangular
dark patterns, randomly punctuated with light
spots due to instantaneous variations in energy

¡  Spectrograms are used to analyze formant frequencies
¡  If detailed information is needed, individual spectral cross
sections need to be analyzed
[Figure 3.11 Cross-sections of spectra from the middle of English vowels of a male speaker, showing formants as spectral peaks. Axes: amplitude in dB (0 to 60) versus frequency in kHz (0 to 3).]
Spectrograms portray only spectral amplitude, ignoring phase information ...
¡  Wideband spectrograms employ 300 Hz
bandpass filters with response times of a few
ms, which yield good time resolution (for
accurate durational measurements) but
smoothed spectra.
¡  Smoothing speech energy over 300 Hz
produces good formant displays of dark
bands, where the center frequency of each
resonance is assumed to be in the middle of
the band (provided that the skirts of a
resonance are approximately symmetric).
¡  Vowels are voiced (except when whispered),
are the phonemes with the greatest intensity,
and range in duration from 50 to 400 ms in
normal speech.
¡  Like all sounds excited solely by a periodic
glottal source, vowel energy is primarily
concentrated below 1 kHz and falls off at
about -6 dB/oct with frequency.
¡  Many relevant acoustic aspects of vowels can
be seen in Figure 3.12, which shows brief portions
of waveforms for five English vowels.
[Figure 3.12 Typical acoustic waveforms for five English vowels. Panels: /i/, /e/, /ɑ/, /o/, /u/; axes: amplitude versus time. Each plot shows 40 ms of ...]
¡  The signals are quasi-periodic due to repeated
excitations of the vocal tract by vocal fold
closures. Thus, vowels have line spectra with
frequency spacing of F0 Hz (i.e, energy
concentrated at multiples of F0).
¡  The largest harmonic amplitudes are near the
low formant frequencies.

¡  Vowels are distinguished primarily by the
locations of their first three formant
frequencies.
¡  Due to varying vocal tract shapes and sizes,
formants vary considerably for different
speakers
¡  Figure 3.13 plots values of F1 and F2 for ten
vowels spoken by 60 speakers.

[Figure 3.13 Plot of F1 vs F2 for vowels spoken by 60 speakers. (After Peterson and Barney [31].) Axes: second-formant frequency (Hz), 500 to 3500, versus first-formant frequency (Hz).]

¡  There is much overlap across speakers, such
that vowels with the same F1-F2 values can be heard as
different phonemes when uttered by
different speakers.
¡  Other aspects of the vowels (e.g., F0, upper
formants, bandwidths) enable listeners to
make correct interpretations.
¡  Each speaker keeps his vowels well apart in
three-dimensional F1-F3 space.

¡  A plot of F1-F2 values for vowels shows a pattern
called the vowel triangle
[Figure 3.14 The vowel triangle for the vowels of Table 3.2. Horizontal axis: F2 (Hz), 800 to 2000; vertical axis ticks from 200 to 800.]
¡  Point vowels /i, a, u/ have extreme values, and
most other vowels have formant values close to
one of the sides of the triangle
¡  The sides of the triangle are the /i/-/a/ axis for the front vowels
and the /u/-/a/ axis for the back vowels. (For English, a "vowel
quadrilateral" is more appropriate, including /æ/ as
the fourth point vowel, but the relative infrequency of
/æ/ in other languages has led to the tradition of a
vowel triangle.)
¡  The vowel triangle has articulatory as well as acoustic
interpretations: F1 decreases with tongue height, and
F2 decreases as the tongue is shifted backward.
¡  Thus, the high front /i/ has the lowest F1 and highest F2
of all vowels, and the low vowel /a/ has the highest F1.
¡  Vowels tend to have a more central tongue position
laterally as tongue height decreases, and so the F2
difference between the low vowels /æ/ and /a/ is much
less than between the high vowels /i/ and /u/.
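The statements about the vowel triangle can be checked against formant averages. The values below are the adult male averages commonly quoted from Peterson and Barney; they are supplied here as an assumption for illustration and are not taken from the lecture's Table 3.2.

```python
# (F1, F2) in Hz; commonly quoted Peterson and Barney adult male averages (assumed).
FORMANTS = {
    "i": (270, 2290),   # high front
    "ae": (660, 1720),  # low front
    "a": (730, 1090),   # low back
    "u": (300, 870),    # high back
}


def lowest_f1(table):
    return min(table, key=lambda v: table[v][0])


def highest_f1(table):
    return max(table, key=lambda v: table[v][0])


def highest_f2(table):
    return max(table, key=lambda v: table[v][1])


def f2_gap(table, v1, v2):
    return abs(table[v1][1] - table[v2][1])


low_vowel_gap = f2_gap(FORMANTS, "ae", "a")    # F2 spread among low vowels
high_vowel_gap = f2_gap(FORMANTS, "i", "u")    # F2 spread among high vowels
```

With these values, /i/ has the lowest F1 and the highest F2, /a/ has the highest F1, and the F2 gap between the low vowels is less than half the gap between the high vowels, matching the shape of the triangle.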
In English, the combinations of features give 40 phonemes as follows
Phonemes in American English.

¡  Acoustic Phonetics
¡  Auditory Perception
¡  Time domain methods of speech processing

