FUNDAMENTALS OF SPEECH RECOGNITION:
A SHORT COURSE
by
Dr. Joseph Picone
Institute for Signal and Information Processing
Department of Electrical and Computer Engineering
Mississippi State University
ABSTRACT
Modern speech understanding systems merge interdisciplinary
technologies from Signal Processing, Pattern Recognition,
Natural Language, and Linguistics into a unified statistical
framework. These systems, which have applications in a wide
range of signal processing problems, represent a revolution in
Digital Signal Processing (DSP). Once a field dominated by
vector-oriented processors and linear algebra-based
mathematics, the current generation of DSP-based systems relies
on sophisticated statistical models implemented using a
complex software paradigm. Such systems are now capable of
understanding continuous speech input for vocabularies of
several thousand words in operational environments.
In this course, we will explore the core components of modern
statistically-based speech recognition systems. We will view the
speech recognition problem in terms of three tasks: signal
modeling, network searching, and language understanding. We
will conclude our discussion with an overview of state-of-the-art
systems, and a review of available resources to support further
research and technology development.
INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING
MAY 15-17, 1996 SHORT COURSE INTRO
Who am I?
Afﬁliations:
Associate Professor, Electrical and Computer Engineering
Director, Institute for Signal and Information Processing
Location:
Mississippi State University
413 Simrall, Box 9571
Mississippi State, Mississippi 39762
Phone/Fax: (601) 325-3149
Email: picone@isip.msstate.edu (http://isip.msstate.edu)
Education:
Ph.D. in E.E., Illinois Institute of Technology, December 1983
M.S. in E.E., Illinois Institute of Technology, May 1980
B.S. in E.E., Illinois Institute of Technology, May 1979
Areas of Research:
Speech Recognition and Understanding, Signal Processing
Educational Responsibilities:
Signals and Systems (third year UG course)
Introduction to Digital Signal Processing (fourth year B.S./M.S. course)
Fundamentals of Speech Recognition (M.S./Ph.D. course)
Research Experience:
MS State (1994-present):
large vocabulary speech recognition, object-oriented DSP
Texas Instruments (1987-1994):
telephone-based speech recognition, Tsukuba R&D Center
AT&T Bell Laboratories (1985-1987):
isolated word speech recognition, low-rate speech coding
Texas Instruments (1983-1985):
robust recognition, low-rate speech coding
AT&T Bell Laboratories (1980-1983):
medium-rate speech coding, low-rate speech coding
Professional Experience:
Senior Member of the IEEE, Professional Engineer
Associate Editor of the IEEE Transactions on Speech and Audio Processing
Associate Editor of the IEEE Signal Processing Magazine
Course Outline
Day/Time: Topic: Page:
Wednesday, May 15:
1. 8:30 AM to 10:00 AM Fundamentals of Speech 1
2. 10:30 AM to 12:00 PM Basic Signal Processing Concepts 20
3. 1:00 PM to 2:30 PM Signal Measurements 47
4. 3:00 PM to 4:30 PM Signal Processing in Speech Recognition 64
Thursday, May 16:
5. 8:30 AM to 10:00 AM Dynamic Programming (DP) 73
6. 10:30 AM to 12:00 PM Hidden Markov Models (HMMs) 85
7. 1:00 PM to 2:30 PM Acoustic Modeling and Training 104
8. 3:00 PM to 4:30 PM Speech Recognition Using HMMs 120
Friday, May 17:
9. 8:30 AM to 10:00 AM Language Modeling 134
10. 10:30 AM to 12:00 PM State of the Art and Future Directions 139
Suggested Textbooks:
1. J. Deller, et al., Discrete-Time Processing of Speech Signals, MacMillan
Publishing Co., ISBN 0023283017
2. L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition,
Prentice-Hall, ISBN 0130151572
Session I:
Fundamentals of Speech
A GENERIC SOLUTION
[Block diagram: the speech signal enters a signal model (acoustic modeling), whose output feeds a hierarchy of hypothesizers — phone, syllable, word, sentence, and pragmatics — each guided by a corresponding prediction component (phone prediction, syllable prediction, word prediction, sentence prediction). The final output is a concept or action.]
BASIC TECHNOLOGY:
A PATTERN RECOGNITION PARADIGM
BASED ON HIDDEN MARKOV MODELS
Search Algorithms: $P(W_t^i \mid O_t) = \frac{P(O_t \mid W_t^i)\, P(W_t^i)}{P(O_t)}$
Pattern Matching: $[\,W_t^i,\; P(O_t, O_{t-1}, \ldots, W_t^i)\,]$
Signal Model: $P(O_t \mid (W_{t-1}, W_t, W_{t+1}))$
Recognized Symbols: $P(S \mid O) = \arg\max_T \prod_i P(W_t^i \mid (O_t, O_{t-1}, \ldots))$
Language Model: $P(W_t^i)$ (prediction)
WHAT IS THE BASIC PROBLEM?
[Scatter plot: two measured features (Feature No. 1 vs. Feature No. 2) with overlapping regions for three phones — $ph_1$ in the context a $ph_1$ b, $ph_2$ in the context c $ph_2$ d, and $ph_3$ in the context e $ph_3$ f.]
What do we do about the regions of overlap?
Context is required to disambiguate them.
The problem of where words begin and end is similar.
Speech Production Physiology
[Anatomical diagram of the vocal apparatus, labeling the nasal cavity, nostril, lip, teeth, tongue, jaw, oral cavity, hard palate, soft palate (velum), pharyngeal cavity, larynx, esophagus, trachea, lung, and diaphragm.]
A Block Diagram of Human Speech Production
[Block diagram: muscle force drives the lungs; air flows through the trachea into the pharyngeal cavity, where the velum routes it to the oral cavity (shaped by the tongue hump) terminating at the mouth, and to the nasal cavity terminating at the nostrils.]
What Does A Speech Signal Look Like?
[Five waveform panels, each showing amplitude (±20000) over 0.50–1.00 secs: “ichi” (f1001) recorded with a Sanken MU2C dynamic microphone; “ichi” (f1001) with a condenser microphone; “san” (f1002) with the Sanken MU2C dynamic microphone; “san” (f1003) with the Sanken MU2C dynamic microphone; and “san” (m0001) with the Sanken MU2C dynamic microphone.]
What Does A Speech Signal Spectrogram Look Like?
Standard wideband spectrogram ($f_s = 10$ kHz, $T_w = 6$ ms)
Narrowband spectrogram ($f_s = 8$ kHz, $T_w = 30$ ms)
Orthographic: The doctor examined the patient’s knees.
[Spectrogram: 0–5 kHz over 0–2 secs for “Drown” (female).]
Phonemics and Phonetics
Some simple deﬁnitions:
Phoneme: • an ideal sound unit with a complete set of articulatory
gestures.
• the basic theoretical unit for describing how speech
conveys linguistic meaning.
• (For English, there are about 42 phonemes.)
• Types of phonemes: vowels, semivowels, diphthongs,
and consonants.
Phonemics: • the study of abstract units and their relationships in a
language
Phone: • the actual sounds that are produced in speaking (for
example, “d” in letter pronounced “l e d er”).
Phonetics: • the study of the actual sounds of the language
Allophones: • the collection of all minor variants of a given sound
(“t” in eight versus “t” in “top”)
• Monophones, Biphones, Triphones —sequences of
one, two, and three phones. Most often used to
describe acoustic models.
Three branches of phonetics:
• Articulatory phonetics: manner in which the speech sounds are
produced by the articulators of the vocal system.
• Acoustic phonetics: sounds of speech through the analysis of the
speech waveform and spectrum
• Auditory phonetics: studies the perceptual response to speech
sounds as reﬂected in listener trials.
Issues:
• Broad phonemic transcriptions vs. narrow phonetic transcriptions
Phonemic and Phonetic Transcription — Standards
Major governing bodies for phonetic alphabets:
International Phonetic Alphabet (IPA) — over 100 years of history
ARPAbet — developed in the late 1970s to support ARPA research
TIMIT — TI/MIT variant of ARPAbet used for the TIMIT corpus
Worldbet — developed recently by Jim Hieronymus (AT&T) to deal with
multiple languages within a single ASCII system
Example: [table of example transcriptions in IPA, ARPAbet, TIMIT, and Worldbet]
When We Put This All Together:
We Have An Acoustic Theory of Speech Production
Consonants Can Be Similarly Classiﬁed
Sound Propagation
A detailed acoustic theory must consider the effects of the following:
• Time variation of the vocal tract shape
• Losses due to heat conduction and viscous friction at the vocal tract walls
• Softness of the vocal tract walls
• Radiation of sound at the lips
• Nasal coupling
• Excitation of sound in the vocal tract
Let us begin by considering a simple case of a lossless tube:
$\rho \frac{\partial u}{\partial t} + \operatorname{grad} p = 0$
$\frac{1}{\rho c^2} \frac{\partial p}{\partial t} + \operatorname{div} u = 0$
or, in one dimension,
$-\frac{\partial p}{\partial x} = \rho \frac{\partial (u/A)}{\partial t}$
$-\frac{\partial u}{\partial x} = \frac{1}{\rho c^2} \frac{\partial (pA)}{\partial t} + \frac{\partial A}{\partial t}$
where $p = p(x,t)$, $u = u(x,t)$, and $A = A(x,t)$.
[Figure: a tube with cross-sectional area function A(x), with x running from the glottis to the lips.]
For wavelengths that are long compared to the dimensions of the vocal tract
(frequencies less than about 4000 Hz, which implies a wavelength of at least
8.5 cm), sound waves satisfy the following pair of equations:
$\rho \frac{\partial (u/A)}{\partial t} + \operatorname{grad} p = 0$
$\frac{1}{\rho c^2} \frac{\partial p}{\partial t} + \frac{\partial A}{\partial t} + \operatorname{div} u = 0$
or
$-\frac{\partial p}{\partial x} = \rho \frac{\partial (u/A)}{\partial t}$
$-\frac{\partial u}{\partial x} = \frac{1}{\rho c^2} \frac{\partial (pA)}{\partial t} + \frac{\partial A}{\partial t}$
where
$p = p(x,t)$ is the variation of the sound pressure in the tube,
$u = u(x,t)$ is the variation in the volume velocity,
$\rho$ is the density of air in the tube (1.2 mg/cc),
$c$ is the velocity of sound (35000 cm/s),
$A = A(x,t)$ is the area function (the vocal tract is about 17.5 cm long).
Uniform Lossless Tube
If $A(x,t) = A$, then the above equations reduce to:
$-\frac{\partial p}{\partial x} = \frac{\rho}{A} \frac{\partial u}{\partial t}$
$-\frac{\partial u}{\partial x} = \frac{A}{\rho c^2} \frac{\partial p}{\partial t}$
The solution is a traveling wave:
$u(x,t) = u^{+}(t - x/c) - u^{-}(t + x/c)$
$p(x,t) = \frac{\rho c}{A}\left[ u^{+}(t - x/c) + u^{-}(t + x/c) \right]$
which is analogous to a transmission line:
$-\frac{\partial v}{\partial x} = L \frac{\partial i}{\partial t}$
$-\frac{\partial i}{\partial x} = C \frac{\partial v}{\partial t}$
What are the salient features of the lossless transmission line model?
where
Acoustic Quantity | Analogous Electric Quantity
p — pressure | v — voltage
u — volume velocity | i — current
ρ/A — acoustic inductance | L — inductance
A/(ρc²) — acoustic capacitance | C — capacitance
The sinusoidal steady state solutions are:
$p(x,t) = jZ_0 \frac{\sin[\Omega(l - x)/c]}{\cos[\Omega l/c]}\, U_G(\Omega)\, e^{j\Omega t}$
$u(x,t) = \frac{\cos[\Omega(l - x)/c]}{\cos[\Omega l/c]}\, U_G(\Omega)\, e^{j\Omega t}$
where $Z_0 = \frac{\rho c}{A}$ is the characteristic impedance.
The transfer function is given by:
$\frac{U(l,\Omega)}{U(0,\Omega)} = \frac{1}{\cos(\Omega l/c)}$
This function has poles located at every $\Omega = \frac{(2n+1)\pi c}{2l}$. Note that these
correspond to the frequencies at which the tube becomes a quarter
wavelength: $\frac{\Omega l}{c} = \frac{\pi}{2} \;\Rightarrow\; f = \frac{c}{4l}$.
[Figure: |H(f)| for 0–4 kHz, with resonances at 500 Hz, 1.5 kHz, 2.5 kHz, and 3.5 kHz.]
Is this model realistic?
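As a quick numerical check, the pole locations implied by the quarter-wavelength condition can be computed directly; a minimal sketch using the $c$ and $l$ values from this slide:

```python
# Quarter-wavelength resonances of a uniform lossless tube,
# F_n = (2n + 1) * c / (4 * l), using the slide's constants.
c = 35000.0   # speed of sound in cm/s
l = 17.5      # vocal tract length in cm

resonances = [(2 * n + 1) * c / (4 * l) for n in range(4)]
print(resonances)  # [500.0, 1500.0, 2500.0, 3500.0]
```

These are exactly the 1 kHz-spaced resonances sketched in the |H(f)| plot.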
Resonator Geometry | Formant Patterns
[Figure: six two-tube resonator geometries and their first four formant frequencies, marked on a 0–4 kHz axis. The geometries use length ratios $L_2/L_1$ of 1.5, 1.0, 8, 1.2, and 1/3, area ratios $A_2/A_1$ of 8 and 1/8, and total lengths $L_1 + L_2$ of 17.6 cm (uniform tube L = 17.6 cm) and 14.5 cm. The resulting formant patterns ($F_1$–$F_4$, in Hz) are: (500, 1500, 2500, 3500), (320, 1200, 2300, 3430), (780, 1240, 2720, 3350), (220, 1800, 2230, 3800), (260, 1990, 3050, 4130), and (630, 1770, 2280, 3440).]
Excitation Models
How do we couple energy into the vocal tract?
The glottal impedance can be approximated by:
$Z_G = R_G + j\Omega L_G$
The boundary condition for the volume velocity is:
$U(0,\Omega) = U_G(\Omega) - P(0,\Omega)/Z_G(\Omega)$
[Figure: muscle force and subglottal pressure $p_s(t)$ from the lungs drive air flow through the trachea and vocal cords, modeled as $R_G$ and $L_G$, producing the glottal volume velocity $u_g(t)$ into the vocal tract.]
For voiced sounds, the glottal volume velocity looks something like this:
[Figure: glottal volume velocity pulses, 0–1000 cc/sec over 0–25 ms.]
The Complete Digital Model (Vocoder)
[Block diagram: an impulse train generator (driven by the fundamental frequency) feeds a glottal pulse model scaled by gain $A_v$; a random noise generator is scaled by gain $A_N$. Their sum, the excitation $u_G(n)$, drives the vocal tract model $V(z)$ followed by a lip radiation model, producing the output $p_L(n)$.]
Notes:
• Sample frequency is typically 8 kHz to 16 kHz
• Frame duration is typically 10 msec to 20 msec
• Window duration is typically 30 msec
• Fundamental frequency ranges from 50 Hz to 500 Hz
• Three resonant frequencies are usually found within 4 kHz bandwidth
• Some sounds, such as sibilants (“s”) have extremely high bandwidths
Questions:
What does the overall spectrum look like?
What happened to the nasal cavity?
What is the form of V(z)?
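A minimal source-filter synthesis sketch in the spirit of the vocoder block diagram: an impulse-train excitation $u_G(n)$ driving a single two-pole resonator standing in for $V(z)$. The 100 Hz pitch and 500 Hz resonance are illustrative values, not taken from the slide.

```python
import numpy as np

# Impulse-train excitation at f0, filtered by one two-pole resonator.
fs = 8000                                # sample frequency (Hz)
f0 = 100                                 # fundamental frequency (Hz)
u = np.zeros(fs // 2)                    # 0.5 s of excitation
u[::fs // f0] = 1.0                      # impulse train, one pulse per pitch period

# Two-pole resonator: V(z) = 1 / (1 + a1 z^-1 + a2 z^-2), pole radius r < 1
r, w0 = 0.95, 2 * np.pi * 500 / fs
a1, a2 = -2 * r * np.cos(w0), r * r

y = np.zeros(len(u))                     # zero initial conditions
for i in range(len(u)):
    y[i] = u[i] - a1 * y[i - 1] - a2 * y[i - 2]   # y[-1], y[-2] are still 0 here
```

A full vocoder would add the noise branch, a glottal pulse shape, more resonators, and lip radiation; this only shows the excitation/filter split.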
Fundamental Frequency Analysis
How do we determine the fundamental frequency?
We use the (statistical) autocorrelation function:
$\Psi(i) = \frac{\displaystyle\sum_{n=0}^{N-1} x(n)\, x(n-i)}{\sqrt{\left(\displaystyle\sum_{n=0}^{N-1} x(n)^2\right)\left(\displaystyle\sum_{n=0}^{N-1} x(n-i)^2\right)}}$
[Figure: $\Psi(i)$ for a voiced segment, with peaks at lags of 50, 100, and 150 samples (6.25 ms, 12.5 ms, and 18.75 ms), corresponding to 160.0 Hz, 80 Hz, and 53.3 Hz; the first strong peak gives $F_0$.]
Other common representations:
Average Magnitude Difference Function (AMDF):
$\gamma(i) = \sum_{n=0}^{N-1} \left| x(n) - x(n-i) \right|$
Zero Crossing Rate:
$F_0 = \frac{Z F_s}{2}$, where $Z = \frac{1}{2N}\sum_{n=0}^{N-1} \left| \operatorname{sgn}[x(n)] - \operatorname{sgn}[x(n-1)] \right|$ is the average number of zero crossings per sample.
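The autocorrelation approach above can be sketched directly; the 160 Hz test tone, window size, and the 50–500 Hz search range are illustrative assumptions.

```python
import numpy as np

# Autocorrelation-based F0 estimation using the normalized Psi(i).
fs = 8000
t = np.arange(1024) / fs
x = np.sin(2 * np.pi * 160.0 * t)        # voiced-like test signal

def psi(x, i):
    """Normalized autocorrelation Psi(i) over the available samples."""
    n = np.arange(i, len(x))             # keep x[n - i] in range
    num = np.sum(x[n] * x[n - i])
    den = np.sqrt(np.sum(x[n] ** 2) * np.sum(x[n - i] ** 2))
    return num / den

lags = range(fs // 500, fs // 50 + 1)    # candidate periods for 50-500 Hz
scores = {i: psi(x, i) for i in lags}
peak = max(scores.values())
best = min(i for i in lags if scores[i] >= 0.999 * peak)  # first strong peak
f0_est = fs / best
```

Taking the first strong peak rather than the global maximum avoids the subharmonic ambiguity visible in the figure (peaks at 160, 80, and 53.3 Hz).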
Session II:
Basic Signal Processing Concepts
The Sampling Theorem and Normalized Time/Frequency
If the highest frequency contained in an analog signal $x_a(t)$ is $F_{max} = B$ and
the signal is sampled at a rate $F_s > 2F_{max} = 2B$, then $x_a(t)$ can be EXACTLY
recovered from its sample values using the interpolation function:
$g(t) = \frac{\sin(2\pi B t)}{2\pi B t}$.
$x_a(t)$ may be expressed as:
$x_a(t) = \sum_{n=-\infty}^{\infty} x_a\!\left(\frac{n}{F_s}\right) g\!\left(t - \frac{n}{F_s}\right)$
where $x_a\!\left(\frac{n}{F_s}\right) = x_a(nT) = x(n)$.
Given a continuous signal:
$x(t) = A\cos(2\pi f t + \theta)$,
a discrete-time sinusoid may be expressed as:
$x(n) = A\cos\!\left(2\pi f \frac{n}{f_s} + \theta\right)$,
which, after regrouping, gives:
$x(n) = A\cos(\omega n + \theta)$,
where $\omega = 2\pi\!\left(\frac{f}{f_s}\right)$ is called normalized radian frequency and $n$
represents normalized time.
[Figure: a sampled sinusoid plotted against normalized time $n$, for $-10 \le n \le 10$.]
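The interpolation formula can be checked numerically; this sketch works at the critical rate $F_s = 2B$, where the $g(t)$ above is the exact kernel, and the 60 Hz test cosine and evaluation time are illustrative choices.

```python
import math

# Reconstruct a bandlimited signal at an off-grid time from its samples.
B = 100.0                 # bandwidth (Hz); signal content must lie below B
Fs = 2 * B                # sampling rate (critical rate)

def x_a(t):
    return math.cos(2 * math.pi * 60.0 * t)   # bandlimited test signal

def g(t):
    """Interpolation kernel g(t) = sin(2*pi*B*t) / (2*pi*B*t)."""
    u = 2 * math.pi * B * t
    return 1.0 if u == 0.0 else math.sin(u) / u

t0 = 0.01234              # an off-grid evaluation time
x_rec = sum(x_a(n / Fs) * g(t0 - n / Fs) for n in range(-5000, 5001))
```

The truncated sum agrees with the true signal value to well within 1% here; the slowly decaying $1/t$ tails of $g(t)$ are why so many terms are kept.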
Transforms
The $z$-transform of a discrete-time signal is defined as:
$X(z) \equiv \sum_{n=-\infty}^{\infty} x(n) z^{-n}$, with inverse $x(n) = \frac{1}{2\pi j}\oint_C X(z)\, z^{n-1}\,dz$
The Fourier transform of $x(n)$ can be computed from the $z$-transform as:
$X(\omega) = X(z)\big|_{z = e^{j\omega}} = \sum_{n=-\infty}^{\infty} x(n) e^{-j\omega n}$
The Fourier transform may be viewed as the $z$-transform evaluated around the
unit circle.
The Discrete Fourier Transform (DFT) is defined as a sampled version of the
Fourier transform shown above:
$X(k) = \sum_{n=0}^{N-1} x(n) e^{-j2\pi k n/N}, \quad k = 0, 1, 2, \ldots, N-1$
The inverse DFT is given by:
$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k) e^{j2\pi k n/N}, \quad n = 0, 1, 2, \ldots, N-1$
The Fast Fourier Transform (FFT) is simply an efficient computation of the DFT.
The Discrete Cosine Transform (DCT) is simply a DFT of a signal that is
assumed to be real and even (real and even signals have real spectra!):
$X(k) = x(0) + 2\sum_{n=1}^{N-1} x(n)\cos(2\pi k n/N), \quad k = 0, 1, 2, \ldots, N-1$
Note that these are not the only transforms used in speech processing
(wavelets, fractals, Wigner distributions, etc.).
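The DFT/IDFT pair above can be evaluated by direct summation; a pure-Python sketch with an illustrative size N and test tone:

```python
import cmath
import math

N = 16
x = [math.sin(2 * math.pi * 3 * n / N) for n in range(N)]  # 3 cycles in N samples

def dft(x):
    """X(k) = sum_n x(n) exp(-j 2 pi k n / N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """x(n) = (1/N) sum_k X(k) exp(+j 2 pi k n / N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

X = dft(x)
x_rec = idft(X)
```

For the sine with 3 cycles per window, the spectral energy concentrates in bins 3 and N-3, and the round trip recovers the input; an FFT computes the same X(k) in O(N log N).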
Time-Domain Windowing
Let $\{x(n)\}$ denote a finite duration segment of a signal:
$\hat{x}(n) = x(n)\, w(n)$
This introduces frequency domain aliasing (the so-called picket fence effect):
[Figure: log-magnitude spectrum of a windowed sinusoid, $f_s = 8000$ Hz, $f_1 = 1511$ Hz, $L = 25$, $N = 2048$.]
Popular Windows
Generalized Hanning: $w_H(k) = w(k)\left[\alpha + (1 - \alpha)\cos\!\left(\frac{2\pi}{N}k\right)\right]$, $0 < \alpha < 1$
$\alpha = 0.54$: Hamming window
$\alpha = 0.50$: Hanning window
Frame-Based Analysis With Overlap
Consider the problem of performing a piecewise linear analysis of a signal:
[Figure: overlapping analysis frames $x_1(n)$, $x_2(n)$, $x_3(n)$, each of length $L$.]
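The generalized Hanning family above is easy to generate; this sketch assumes a centered index $k$ and a rectangular base window $w(k) = 1$:

```python
import math

def generalized_hanning(N, alpha):
    """w_H(k) = alpha + (1 - alpha) cos(2 pi k / N), k centered on 0."""
    return [alpha + (1 - alpha) * math.cos(2 * math.pi * k / N)
            for k in range(-(N // 2), N // 2 + 1)]

hamming = generalized_hanning(256, 0.54)   # alpha = 0.54
hanning = generalized_hanning(256, 0.50)   # alpha = 0.50
```

Both taper to their endpoint values at the edges (0.08 for Hamming, 0.0 for Hanning) and reach 1.0 at the center, which is what trades main-lobe width against sidelobe level in the spectrum plot above.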
Difference Equations, Filters, and Signal Flow Graphs
A linear time-invariant system can be characterized by a constant-coefficient
difference equation:
$y(n) = -\sum_{k=1}^{N} a_k\, y(n-k) + \sum_{k=0}^{M} b_k\, x(n-k)$
Is this system linear if the coefficients are time-varying?
Such systems can be implemented as signal flow graphs:
[Figure: direct-form signal flow graph from $x(n)$ to $y(n)$, with a chain of $z^{-1}$ delays, feedforward coefficients $b_0, b_1, \ldots, b_N$, and feedback coefficients $a_1, a_2, \ldots, a_N$.]
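The difference equation above translates directly into code; the one-pole lowpass used to exercise it is an illustrative choice, not from the slide.

```python
# y(n) = -sum_k a_k y(n-k) + sum_k b_k x(n-k)
def filter_signal(b, a, x):
    """b = [b_0..b_M] (feedforward), a = [a_1..a_N] (feedback), x = input."""
    y = []
    for n in range(len(x)):
        acc = sum(bk * x[n - k] for k, bk in enumerate(b) if n - k >= 0)
        acc -= sum(ak * y[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0)
        y.append(acc)
    return y

# One-pole lowpass y(n) = 0.5 y(n-1) + x(n): impulse response 1, 0.5, 0.25, ...
h = filter_signal(b=[1.0], a=[-0.5], x=[1.0] + [0.0] * 7)
```

Note the sign convention: the feedback taps $a_k$ enter with a minus sign, matching the equation above, so a pole at $z = 0.5$ corresponds to $a_1 = -0.5$.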
Minimum Phase, Maximum Phase, Mixed Phase,
and Speech Perception
An FIR filter composed of all zeros that are inside the unit circle is minimum
phase. There are many realizations of a system with a given magnitude response;
one is a minimum-phase realization, one is a maximum-phase realization, others
are in-between. Any nonminimum phase pole-zero system can be decomposed
into:
$H(z) = H_{min}(z)\, H_{ap}(z)$
It can be shown that of all the possible realizations of $|H(\omega)|$, the minimum-phase
version is the most compact in time.
Define:
$E(n) = \sum_{k=0}^{n} h(k)^2$
Then $E_{min}(n) \ge E(n)$ for all $n$ and all possible realizations of $|H(\omega)|$.
Why is minimum phase such an important concept in speech processing?
We prefer systems that are invertible:
$H(z)\, H^{-1}(z) = 1$
We would like both systems to be stable. The inverse of a nonminimum phase
system is not stable.
We end with a very simple question:
Is phase important in speech processing?
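The partial-energy property $E_{min}(n) \ge E(n)$ can be seen on the simplest possible pair: two FIR filters with identical magnitude response, one with its zero inside the unit circle and one with it outside. The zero location 0.5 is an illustrative choice.

```python
# Partial energy E(n) = sum_{k=0}^{n} h(k)^2 for two same-|H| realizations.
def partial_energy(h):
    return [sum(hk * hk for hk in h[:n + 1]) for n in range(len(h))]

h_min = [1.0, -0.5]   # zero at z = 0.5 (inside the unit circle): minimum phase
h_max = [-0.5, 1.0]   # zero at z = 2.0 (outside): maximum phase, same |H(w)|

E_min = partial_energy(h_min)
E_max = partial_energy(h_max)
```

Both reach the same total energy, but the minimum-phase response front-loads it, which is exactly the "most compact in time" statement above.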
Probability Spaces
A formal definition of probability involves the specification of:
• a sample space
The sample space, $S$, is the set of all possible outcomes, plus the
null outcome. Each element in $S$ is called a sample point.
• a field (or algebra)
The field, or algebra, is a set of subsets of $S$ closed under
complementation and union (recall Venn diagrams).
• and a probability measure.
A probability measure obeys these axioms:
1. $P(S) = 1$ (implies probabilities less than or equal to one)
2. $P(A) \ge 0$
3. For two mutually exclusive events: $P(A \cup B) = P(A) + P(B)$
Two events are said to be statistically independent if:
$P(A \cap B) = P(A)\, P(B)$
The conditional probability of B given A is:
$P(B \mid A) = \frac{P(B \cap A)}{P(A)}$
Hence,
$P(B \cap A) = P(B \mid A)\, P(A)$
[Venn diagrams: overlapping events A and B, and mutually exclusive events A and B.]
INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING
MAY 1517, 1996 TEXAS INSTRUMENTS PAGE 27 OF 147
Functions of Random Variables
Probability Density Functions and Cumulative Distributions:
[Figure: a discrete pdf $f(x)$ (histogram) and a continuous pdf, with the corresponding discrete (staircase) and continuous cumulative distributions $F(x)$.]
Probability of Events:
$P(2 < x \le 3) = \int_{2}^{3} f(x)\,dx = F(3) - F(2)$
Important Probability Density Functions
Uniform (Unix rand function):
$f(x) = \begin{cases} \frac{1}{b-a}, & a < x \le b \\ 0, & \text{elsewhere} \end{cases}$
Gaussian:
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}$
Laplacian (speech signal amplitude, durations):
$f(x) = \frac{1}{\sqrt{2\sigma^2}} \exp\!\left\{ -\frac{\sqrt{2}\,|x|}{\sigma} \right\}$
Gamma (durations):
$f(x) = \frac{\sqrt{k}}{2\sqrt{\pi |x|}} \exp\{ -k|x| \}$
We can extend these concepts to N-dimensional space. For example:
$P(A \in A_x \mid B \in A_y) = \frac{P(A, B)}{P(B)} = \frac{\int_{A_x}\int_{A_y} f(x,y)\,dy\,dx}{\int_{A_y} f(y)\,dy}$
Two random variables are statistically independent if:
$P(A, B) = P(A)\, P(B)$
This implies:
$f_{xy}(x,y) = f_x(x)\, f_y(y)$ and $F_{xy}(x,y) = F_x(x)\, F_y(y)$
Mixture Distributions
A simple way to form a more generalizable pdf that still obeys some parametric
shape is to use a weighted sum of one or more pdfs. We refer to this as a
mixture distribution.
If we start with a Gaussian pdf:
$f_i(x) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left\{ -\frac{(x-\mu_i)^2}{2\sigma_i^2} \right\}$
the most common form is a Gaussian mixture:
$f(x) = \sum_{i=1}^{N} c_i\, f_i(x)$, where $\sum_{i=1}^{N} c_i = 1$
Obviously this can be generalized to other shapes.
Such distributions are useful to accommodate multimodal behavior:
[Figure: three Gaussian components $f_1(x)$, $f_2(x)$, $f_3(x)$ with means $\mu_1$, $\mu_2$, $\mu_3$, and the resulting multimodal mixture density.]
Derivation of the optimal coefficients, however, is most often a nonlinear
optimization problem.
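Evaluating a Gaussian mixture is a direct transcription of the sum above; the three components (means 1, 2, 3 and a common standard deviation of 0.5) are illustrative values.

```python
import math

def gaussian(x, mu, sigma):
    """Component density f_i(x)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

weights = [0.5, 0.3, 0.2]                      # c_i, summing to 1
params = [(1.0, 0.5), (2.0, 0.5), (3.0, 0.5)]  # (mu_i, sigma_i)

def mixture(x):
    """f(x) = sum_i c_i f_i(x)."""
    return sum(c * gaussian(x, mu, s) for c, (mu, s) in zip(weights, params))

# Crude Riemann-sum check that the mixture still integrates to 1.
total = sum(mixture(k * 0.01) for k in range(-500, 1000)) * 0.01
```

Because the weights sum to one and each component is a proper density, the mixture remains a proper density; fitting the $c_i$, $\mu_i$, $\sigma_i$ to data is the nonlinear optimization problem noted above (typically solved with EM).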
Expectations and Moments
The statistical average of a scalar function, $g(x)$, of a random variable is:
$E[g(x)] = \sum_{i=1}^{\infty} g(x_i)\, P(x = x_i)$ (discrete) and $E[g(x)] \equiv \int_{-\infty}^{\infty} g(x)\, f(x)\,dx$ (continuous)
The central moment is defined as:
$E[(x - \mu)^i] = \int_{-\infty}^{\infty} (x - \mu)^i f(x)\,dx$
The joint central moment between two random variables is defined as:
$E[(x - \mu_x)^i (y - \mu_y)^k] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \mu_x)^i (y - \mu_y)^k f(x,y)\,dx\,dy$
We can define a correlation coefficient between two random variables as:
$\rho_{xy} = \frac{c_{xy}}{\sigma_x \sigma_y}$
We can extend these concepts to a vector of random variables:
$x = [x_1, x_2, \ldots, x_N]^T$
$f_x(x) = \frac{1}{(2\pi)^{N/2} |C|^{1/2}} \exp\!\left\{ -\frac{1}{2}(x - \mu)^T C^{-1} (x - \mu) \right\}$
What is the difference between a random vector and a random process?
What does wide sense stationary mean? strict sense stationary?
What does it mean to have an ergodic random process?
How does this influence our signal processing algorithms?
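The correlation coefficient above has a straightforward sample estimate; the four-point data set is illustrative.

```python
# rho_xy = c_xy / (sigma_x * sigma_y), estimated from paired samples.
def corr_coeff(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n   # joint central moment
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cxy / (sx * sy)

xs = [1.0, 2.0, 3.0, 4.0]
rho = corr_coeff(xs, [2 * x + 1 for x in xs])   # exact linear dependence
```

An exact linear relation gives $\rho_{xy} = 1$; uncorrelated variables give values near zero.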
Correlation and Covariance of WSS Random Processes
For a signal, $x(n)$, we can compute the following useful quantities:
Autocovariance:
$c(i,j) = E[(x(n-i) - \mu_i)(x(n-j) - \mu_j)] = E[x(n-i)\,x(n-j)] - \mu_i \mu_j$
$\approx \frac{1}{N}\sum_{n=0}^{N-1} x(n-i)\,x(n-j) - \left(\frac{1}{N}\sum_{n=0}^{N-1} x(n-i)\right)\left(\frac{1}{N}\sum_{n=0}^{N-1} x(n-j)\right)$
If a random process is wide sense stationary:
$c(i,j) = c(i-j,\, 0)$
Hence, we define a very useful function known as the autocorrelation:
$r(k) = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\,x(n-k)$
If $x(n)$ is zero mean WSS: $c(i,j) = r(i-j)$
What is the relationship between the autocorrelation and the spectrum?
$\mathrm{DFT}\{r(k)\} = |X(k)|^2$
For a linear time-invariant system, $h(n)$:
$\mathrm{DFT}\{r_y(k)\} = |\mathrm{DFT}\{h(n)\}|^2 \, \mathrm{DFT}\{r_x(k)\}$
The notion of random noise is central to signal processing:
white noise?
Gaussian white noise?
zero-mean Gaussian white noise?
colored noise?
Therefore, we now embark upon one of the last great mysteries of life:
How do we compare two random vectors?
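The autocorrelation/spectrum relationship can be checked numerically if the lag is taken circularly (the convention under which the DFT identity holds exactly); the random test signal and length are illustrative.

```python
import numpy as np

N = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(N)

# Circular autocorrelation r(k) = (1/N) sum_n x(n) x((n - k) mod N)
r = np.array([np.mean(x * np.roll(x, k)) for k in range(N)])

X = np.fft.fft(x)
lhs = np.fft.fft(r)            # DFT of the autocorrelation
rhs = (np.abs(X) ** 2) / N     # power spectrum; the 1/N comes from r's definition
```

Up to the 1/N normalization carried by $r(k)$, the DFT of the autocorrelation equals the magnitude-squared spectrum, which is the identity quoted above.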
Distance Measures
What is the distance between pt. a and pt. b?
[Figure: two feature vectors a and b plotted in a plane whose axes are “gold bars” and “diamonds”.]
The N-dimensional real Cartesian space, denoted $\Re^N$, is the collection of all
N-dimensional vectors with real elements. A metric, or distance measure, is a
real-valued function with three properties:
$\forall x, y, z \in \Re^N$:
1. $d(x,y) \ge 0$
2. $d(x,y) = 0$ if and only if $x = y$
3. $d(x,y) \le d(x,z) + d(z,y)$
The Minkowski metric of order $s$, or the $l_s$ metric, between $x$ and $y$ is:
$d_s(x,y) \equiv \sqrt[s]{\sum_{k=1}^{N} |x_k - y_k|^s} = \|x - y\|_s$
(the $l_s$ norm of the difference vector).
Important cases are:
1. $l_1$, or city block metric (sum of absolute values): $d_1(x,y) = \sum_{k=1}^{N} |x_k - y_k|$
2. $l_2$, or Euclidean metric (mean-squared error): $d_2(x,y) = \sqrt{\sum_{k=1}^{N} |x_k - y_k|^2}$
3. $l_\infty$, or Chebyshev metric: $d_\infty(x,y) = \max_k |x_k - y_k|$
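The Minkowski family above is a one-liner per case; the 3-4-5 test vectors are illustrative.

```python
# d_s(x, y) = (sum_k |x_k - y_k|^s)^(1/s); Chebyshev is the s -> infinity limit.
def minkowski(x, y, s):
    return sum(abs(a - b) ** s for a, b in zip(x, y)) ** (1.0 / s)

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

a, b = [0.0, 0.0], [3.0, 4.0]
d1 = minkowski(a, b, 1)   # city block
d2 = minkowski(a, b, 2)   # Euclidean
dinf = chebyshev(a, b)    # Chebyshev
```

For this pair the three metrics give 7, 5, and 4 respectively, illustrating $d_1 \ge d_2 \ge d_\infty$.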
We can similarly define a weighted Euclidean distance metric:
$d_{2w}(x,y) = \sqrt{(x - y)^T W (x - y)}$
where $x = [x_1, x_2, \ldots, x_k]^T$, $y = [y_1, y_2, \ldots, y_k]^T$, and $W = [w_{ij}]$ is a $k \times k$ weight matrix.
Why are Euclidean distances so popular?
One reason is efficient computation. Suppose we are given a set of $M$
reference vectors, $x_m$, a measurement, $y$, and we want to find the nearest
neighbor:
$NN = \min_m d_2(x_m, y)$
This can be simplified as follows. We note the minimum of a square root is the
same as the minimum of a square (both are monotonically increasing functions):
$d_2^2(x_m, y) = \sum_{j=1}^{k} (x_{m_j} - y_j)^2 = \sum_{j=1}^{k} \left( x_{m_j}^2 - 2 x_{m_j} y_j + y_j^2 \right) = \|x_m\|^2 - 2\, x_m \bullet y + \|y\|^2 = C_m + C_y - 2\, x_m \bullet y$
Therefore,
$NN = \min_m d_2^2(x_m, y) = \min_m \left( C_m - 2\, x_m \bullet y \right)$
Thus, a Euclidean distance is virtually equivalent to a dot product (which
can be computed very quickly on a vector processor). In fact, if all reference
vectors have the same magnitude, $C_m$ can be ignored (normalized
codebook).
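The dot-product shortcut derived above is easy to verify against a full Euclidean search; the random codebook is illustrative data.

```python
import numpy as np

# argmin_m d2^2(x_m, y) = argmin_m (C_m - 2 x_m . y), with C_m = ||x_m||^2
# (C_y = ||y||^2 is the same for every m, so it can be dropped).
rng = np.random.default_rng(1)
codebook = rng.standard_normal((100, 8))     # M reference vectors x_m
y = rng.standard_normal(8)                   # measurement

# Full squared-Euclidean search
full = np.argmin(((codebook - y) ** 2).sum(axis=1))

# Dot-product search: precompute C_m once per codebook
C_m = (codebook ** 2).sum(axis=1)
fast = np.argmin(C_m - 2 * codebook @ y)
```

The two searches pick the same codeword; the dot-product form reduces the per-query work to one matrix-vector product, which is the vector-processor advantage noted above.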
Prewhitening of Features
Consider the problem of comparing features of different scales:
[Figure: points a and b in the gold bars/diamonds plane, shown in two coordinate systems.]
Suppose we represent these points in space in two coordinate systems using
the transformation $z = Vx$:
System 1: basis $\beta_1 = 1\hat{i} + 0\hat{j}$ and $\beta_2 = 0\hat{i} + 1\hat{j}$, so
$a = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $b = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \end{bmatrix}$, and $d_2(a,b) = \sqrt{0^2 + 1^2} = 1$
System 2: basis $\gamma_1 = -2\hat{i} + 0\hat{j}$ and $\gamma_2 = -1\hat{i} + 1\hat{j}$, so
$a = \begin{bmatrix} -2 & -1 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \end{bmatrix}$, $b = \begin{bmatrix} -2 & -1 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} -\frac{3}{2} \\ 2 \end{bmatrix}$, and $d_2(a,b) = \sqrt{\left(-1 + \frac{3}{2}\right)^2 + (1 - 2)^2} = \sqrt{\frac{5}{4}}$
The magnitude of the distance has changed. Though the rank-ordering of
distances under such linear transformations won't change, the cumulative
effects of such changes in distances can be damaging in pattern
recognition. Why?
We can simplify the distance calculation in the transformed space:
$d_2^2(Vx, Vy) = [Vx - Vy]^T [Vx - Vy] = [x - y]^T V^T V [x - y] = d_{2W}(x, y)$
This is just a weighted Euclidean distance.
Suppose all dimensions of the vector are not equal in importance. For
example, suppose one dimension has virtually no variation, while another is
very reliable. Suppose two dimensions are statistically correlated. What is a
statistically optimal transformation?
Consider a decomposition of the covariance matrix (which is symmetric):
$C = \Phi \Lambda \Phi^T$
where $\Phi$ denotes a matrix whose columns are eigenvectors of $C$ and $\Lambda$ denotes
a diagonal matrix whose elements are the eigenvalues of $C$. Consider:
$z = \Lambda^{-1/2} \Phi^T x$
The covariance of $z$, $C_z$, is easily shown to be an identity matrix (prove this!)
We can also show that:
$d_2^2(z_1, z_2) = [x_1 - x_2]^T C_x^{-1} [x_1 - x_2]$
Again, just a weighted Euclidean distance.
• If the covariance matrix of the transformed vector is a diagonal matrix,
the transformation is said to be an orthogonal transform.
• If the covariance matrix is an identity matrix, the transform is said to be
an orthonormal transform.
• A common approximation to this procedure is to assume the dimensions
of $x$ are uncorrelated but of unequal variances, and to approximate $C$ by
a diagonal matrix, $\Lambda$. Why? This is known as variance-weighting.
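The "prove this!" above can at least be checked numerically: build the whitening matrix from the eigendecomposition and confirm the transformed covariance is the identity. The 2-D covariance is an illustrative value.

```python
import numpy as np

# Prewhitening: z = Lambda^{-1/2} Phi^T x for C = Phi Lambda Phi^T.
C = np.array([[4.0, 1.0],
              [1.0, 2.0]])
lam, Phi = np.linalg.eigh(C)          # eigenvalues, eigenvectors (as columns)
V = np.diag(lam ** -0.5) @ Phi.T      # whitening matrix

# Covariance of z = V x is V C V^T, which should be the identity
Cz = V @ C @ V.T
```

Substituting $C = \Phi\Lambda\Phi^T$ shows why: $V C V^T = \Lambda^{-1/2}\Phi^T \Phi \Lambda \Phi^T \Phi \Lambda^{-1/2} = I$, since $\Phi^T\Phi = I$ for a symmetric $C$.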
“Noise Reduction”
The prewhitening transform, $z = \Lambda^{-1/2} \Phi^T x$, is normally created as a $k \times k$
matrix in which the eigenvalues are ordered from largest to smallest:
$\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_k \end{bmatrix} = \begin{bmatrix} \lambda_1^{-1/2} & & & \\ & \lambda_2^{-1/2} & & \\ & & \ddots & \\ & & & \lambda_k^{-1/2} \end{bmatrix} \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1k} \\ v_{21} & v_{22} & \cdots & v_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ v_{k1} & v_{k2} & \cdots & v_{kk} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_k \end{bmatrix}$
where
$\lambda_1 > \lambda_2 > \cdots > \lambda_k$.
In this case, a new feature vector can be formed by truncating the
transformation matrix to $l < k$ rows. This is essentially discarding the least
important features.
A measure of the amount of discriminatory power contained in a feature, or
a set of features, can be defined as follows:
$\%\,\mathrm{var} = \frac{\sum_{j=1}^{l} \lambda_j}{\sum_{j=1}^{k} \lambda_j}$
This is the percent of the variance accounted for by the first $l$ features.
Similarly, the coefficients of the eigenvectors tell us which dimensions of the
input feature vector contribute most heavily to a dimension of the output
feature vector. This is useful in determining the “meaning” of a particular
feature (for example, the first decorrelated feature often is correlated with
the overall spectral slope in a speech recognition system — this is
sometimes an indication of the type of microphone).
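The %variance measure above is a two-line computation once the eigenvalues are sorted; the eigenvalue list is illustrative.

```python
# Fraction of total variance captured by the l largest eigenvalues.
eigvals = [9.0, 4.0, 1.0, 0.5, 0.5]          # lambda_1 > ... > lambda_k

def pct_var(eigvals, l):
    return sum(eigvals[:l]) / sum(eigvals)

p2 = pct_var(eigvals, 2)   # variance accounted for by the first two features
```

Here the first two of five features account for 13/15 of the variance, so truncating to l = 2 rows would discard little information.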
Computational Procedures
Computing a “noise-free” covariance matrix is often difficult. One might
attempt to do something simple, such as:
$c_{ij} = \frac{1}{N}\sum_{n=0}^{N-1} (x_i(n) - \mu_i)(x_j(n) - \mu_j)$ and $\mu_i = \frac{1}{N}\sum_{n=0}^{N-1} x_i(n)$
On paper, this appears reasonable. However, often, the complete set of
feature vectors contains valid data (speech signals) and noise (non-speech
signals). Hence, we will often compute the covariance matrix across a
subset of the data, such as the particular acoustic event (a phoneme or
word) we are interested in.
Second, the covariance matrix is often ill-conditioned. Stabilization
procedures are used in which the elements of the covariance matrix are
limited by some minimum value (a noise floor or minimum SNR) so that the
covariance matrix is better conditioned.
But how do we compute eigenvalues and eigenvectors on a computer?
One of the hardest things to do numerically! Why?
Suggestion: use a canned routine (see Numerical Recipes in C).
The definitive source is EISPACK (originally implemented in Fortran, now
available in C). A simple method for symmetric matrices is known as the
Jacobi transformation. In this method, a sequence of transformations is
applied that sets one off-diagonal element to zero at a time. The product of
the successive transformations is the eigenvector matrix.
Another method, known as the QR decomposition, factors the covariance
matrix into a series of transformations:
$C = QR$
where $Q$ is orthogonal and $R$ is upper triangular. This is based on a
transformation known as the Householder transform that reduces columns
of a matrix below the diagonal to zero.
Maximum Likelihood Classiﬁcation
Consider the problem of assigning a measurement to one of two sets:
[Figure: two class-conditional densities $f_{x|c}(x \mid c = 1)$ and $f_{x|c}(x \mid c = 2)$ with means $\mu_1$ and $\mu_2$, and a measurement $\hat{x}$ between them.]
What is the best criterion for making a decision?
Ideally, we would select the class for which the conditional probability is highest:
$c^{*} = \underset{c}{\operatorname{argmax}}\; P\big( (c = \hat{c}) \mid (x = \hat{x}) \big)$
However, we can't estimate this probability directly from the training data. Hence,
we consider:
$c^{*} = \underset{c}{\operatorname{argmax}}\; P\big( (x = \hat{x}) \mid (c = \hat{c}) \big)$
By definition
$P\big( (c = \hat{c}) \mid (x = \hat{x}) \big) = \frac{P\big( (c = \hat{c}), (x = \hat{x}) \big)}{P(x = \hat{x})}$
and
$P\big( (x = \hat{x}) \mid (c = \hat{c}) \big) = \frac{P\big( (c = \hat{c}), (x = \hat{x}) \big)}{P(c = \hat{c})}$
from which we have
$P\big( (c = \hat{c}) \mid (x = \hat{x}) \big) = \frac{P\big( (x = \hat{x}) \mid (c = \hat{c}) \big)\, P(c = \hat{c})}{P(x = \hat{x})}$
Clearly, the choice of c that maximizes the right side also maximizes the left side.
Therefore,

    c* = argmax_c [ P(x = x̂ | c = ĉ) P(c = ĉ) ]

and, if the class probabilities are equal,

    c* = argmax_c [ P(x = x̂ | c = ĉ) ]

A quantity related to the probability of an event which is used to make a decision
about the occurrence of that event is often called a likelihood measure.

A decision rule that maximizes a likelihood is called a maximum likelihood
decision.

In a case where the number of outcomes is not finite, we can use an analogous
continuous distribution. It is common to assume a multivariate Gaussian
distribution:

    f_{x|c}(x₁, …, x_N | c) = f_{x|c}(x̂ | ĉ)
        = (2π)^{−N/2} |C_{x|c}|^{−1/2} exp{ −(1/2)(x̂ − μ_{x|c})ᵀ C_{x|c}⁻¹ (x̂ − μ_{x|c}) }

We can elect to maximize the log likelihood, ln[f_{x|c}(x|c)], rather than the
likelihood itself. This gives the decision rule:

    c* = argmin_c { (x̂ − μ_{x|c})ᵀ C_{x|c}⁻¹ (x̂ − μ_{x|c}) + ln|C_{x|c}| }

(Note that the maximization became a minimization.)

We can define a distance measure based on this as:

    d_ml(x̂, μ_{x|c}) = (x̂ − μ_{x|c})ᵀ C_{x|c}⁻¹ (x̂ − μ_{x|c}) + ln|C_{x|c}|
Note that the distance is conditioned on each class mean and covariance.
This is why “generic” distance comparisons are a joke.
If the mean and covariance are the same across all classes, this expression
simplifies to:

    d_M(x̂, μ_{x|c}) = (x̂ − μ_{x|c})ᵀ C_{x|c}⁻¹ (x̂ − μ_{x|c})

This is frequently called the Mahalanobis distance. But this is nothing more
than a weighted Euclidean distance.

This result has a relatively simple geometric interpretation for the case of a
single random variable with classes of equal variances:
The decision rule involves setting a threshold:

    a = (μ₁ + μ₂)/2 + ( σ² / (μ₁ − μ₂) ) ln[ P(c = 2) / P(c = 1) ]

and,

    if x < a, then x ∈ (c = 1); else, x ∈ (c = 2)

If the variances are not equal, the threshold shifts towards the distribution
with the smaller variance.

What is an example of an application where the classes are not
equiprobable?

[Figure: two equal-variance densities f_{x|c}(x | c = 1) and f_{x|c}(x | c = 2),
centered at μ₁ and μ₂, with the decision threshold a between them.]
Probabilistic Distance Measures
How do we compare two probability distributions to measure their overlap?
Probabilistic distance measures take the form:

    J = ∫_{−∞}^{∞} g( { f_{x|c}(x̂ | ĉ), P(c = ĉ) }, ĉ = 1, 2, …, K ) dx̂

where

1. J is nonnegative
2. J attains a maximum when all classes are disjoint
3. J = 0 when all classes are equiprobable

Two important examples of such measures are:

(1) Bhattacharyya distance:

    J_B = −ln ∫_{−∞}^{∞} √( f_{x|c}(x̂ | 1) f_{x|c}(x̂ | 2) ) dx̂

(2) Divergence:

    J_D = ∫_{−∞}^{∞} [ f_{x|c}(x̂ | 1) − f_{x|c}(x̂ | 2) ] ln[ f_{x|c}(x̂ | 1) / f_{x|c}(x̂ | 2) ] dx̂

Both reduce to a Mahalanobis-like distance for the case of Gaussian vectors
with equal class covariances.

Such metrics will be important when we attempt to cluster feature vectors
and acoustic models.
Probabilistic Dependence Measures
A probabilistic dependence measure indicates how strongly a feature is
associated with its class assignment. When features are independent of
their class assignment, the class conditional pdf’s are identical to the
mixture pdf:
    f_{x|c}(x̂ | ĉ) = f_x(x̂),    ∀c

When there is a strong dependence, the conditional distribution should be
significantly different than the mixture. Such measures take the form:

    J = ∫_{−∞}^{∞} g( { f_{x|c}(x̂ | ĉ), f_x(x̂), P(c = ĉ) }, ĉ = 1, 2, …, K ) dx̂

An example of such a measure is the average mutual information:

    M_avg(c, x̂) = Σ_{ĉ=1}^{K} P(c = ĉ) ∫_{−∞}^{∞} … ∫_{−∞}^{∞} f_{x|c}(x̂ | ĉ) log₂[ f_{x|c}(x̂ | ĉ) / f_x(x̂) ] dx̂

The discrete version of this is:

    M_avg(c, x) = Σ_{ĉ=1}^{K} P(c = ĉ) Σ_{l=1}^{L} P(x = x_l | c = ĉ) log₂[ P(x = x_l | c = ĉ) / P(x = x_l) ]

Mutual information is closely related to entropy, as we shall see shortly.

Such distance measures can be used to cluster data and generate vector
quantization codebooks. A simple and intuitive algorithm is known as the
K-means algorithm:

Initialization: Choose K centroids
Recursion: 1. Assign all vectors to their nearest neighbor.
2. Recompute the centroids as the average of all vectors
assigned to the same centroid.
3. Check the overall distortion. Return to step 1 if some
distortion criterion is not met.
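The K-means steps above can be sketched in C for the one-dimensional case (assumes at most 1024 vectors; the stopping criterion here is "assignments no longer change" rather than a distortion threshold):

```c
#include <math.h>

/* One-dimensional K-means: assign each value to its nearest centroid,
   recompute centroids, and stop when the assignments stop changing. */
void kmeans1d(const double *x, int n, double *cent, int k)
{
    int assign[1024], changed = 1;
    for (int i = 0; i < n; i++) assign[i] = -1;
    while (changed) {
        changed = 0;
        /* Step 1: nearest-neighbor assignment. */
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int j = 1; j < k; j++)
                if (fabs(x[i] - cent[j]) < fabs(x[i] - cent[best]))
                    best = j;
            if (assign[i] != best) { assign[i] = best; changed = 1; }
        }
        /* Step 2: recompute centroids as the mean of their members. */
        for (int j = 0; j < k; j++) {
            double sum = 0.0; int cnt = 0;
            for (int i = 0; i < n; i++)
                if (assign[i] == j) { sum += x[i]; cnt++; }
            if (cnt > 0) cent[j] = sum / cnt;
        }
    }
}
```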
What Is Information? (When not to bet at a casino...)
Consider two distributions of discrete random variables:
Which variable is more unpredictable?
Now, consider sampling random numbers from a random number generator
whose statistics are not known. The more numbers we draw, the more we
discover about the underlying distribution. Assuming the underlying distribution is
from one of the above distributions, how much more information do we receive
with each new number we draw?
The answer lies in the shape of the distributions. For the random variable x, each
class is equally likely. Each new number we draw provides the maximum amount
of information, because, on the average, it will be from a different class (so we
discover a new class with every number). On the other hand, for y, chances are,
c=3 will occur 5 times more often than the other classes, so each new sample will
not provide as much information.
We can define the information associated with each class, or outcome, as:

    I(c = ĉ) ≡ log₂[ 1 / P(c = ĉ) ] = −log₂ P(c = ĉ)

Since 0 ≤ P(c = ĉ) ≤ 1, information is a positive quantity. A base 2 logarithm is
used so that discrete outcomes can be measured in bits (K = 2^M equally likely
outcomes can be indexed with M = log₂K bits). For the distributions above,

    I_x(c = 1) = −log₂(1/4) = 2 bits        I_y(c = 1) = −log₂(1/8) = 3 bits

Huh??? Does this make sense?

[Figure: f(x) is uniform with value 1/4 over the classes c = 1, …, 4; f(y) assigns
probability 5/8 to c = 3 and 1/8 to each of c = 1, 2, and 4.]
What is Entropy?
Entropy is the expected (average) information across all outcomes:

    H(c) ≡ E[ I(c) ] = −Σ_{k=1}^{K} P(c = c_k) log₂ P(c = c_k)

Entropy using log₂ is also measured in bits, since it is an average of information.
For example,

    H_x = −Σ_{k=1}^{4} (1/4) log₂(1/4) = 2.0 bits

    H_y = −[ Σ_{k=1}^{3} (1/8) log₂(1/8) + (5/8) log₂(5/8) ] ≈ 1.55 bits

We can generalize this to a joint outcome of N random vectors from the same
distribution, which we refer to as the joint entropy:

    H[x(1), …, x(N)] = −Σ_{l₁=1}^{L} … Σ_{l_N=1}^{L} P[ x(1) = x_{l₁}, …, x(N) = x_{l_N} ]
                           × log₂ P[ x(1) = x_{l₁}, …, x(N) = x_{l_N} ]

If the random vectors are statistically independent:

    H[x(1), …, x(N)] = Σ_{n=1}^{N} H[x(n)]

If the random vectors are independent and identically distributed:

    H[x(1), …, x(N)] = N H[x(1)]

We can also define conditional entropy as:

    H(x|y) = −Σ_{k=1}^{K} Σ_{l=1}^{L} P(x = x_l, y = y_k) log₂ P(x = x_l | y = y_k)

For continuous distributions, we can define an analogous quantity for entropy:

    H = −∫_{−∞}^{∞} f(x) log₂ f(x) dx    (bits)

A zero-mean Gaussian random variable has maximum entropy, (1/2) log₂(2πeσ²).
Why?
Mutual Information
The pairing of random vectors produces less information than the events
taken individually. Stated formally:

    I(x, y) ≤ I(x) + I(y)

The shared information between these events is called the mutual
information, and is defined as:

    M(x, y) ≡ [ I(x) + I(y) ] − I(x, y)

From this definition, we note:

    M(x, y) = log₂[ P(x, y) / ( P(x)P(y) ) ]
            = log₂[ 1/P(x) ] + log₂[ P(x, y)/P(y) ]
            = log₂[ 1/P(x) ] − log₂[ 1/P(x|y) ]
            = I(x) − I(x|y)
            = I(y) − I(y|x)

This emphasizes the idea that this is information shared between these two
random variables.

We can define the average mutual information as the expectation of the
mutual information:

    M(x, y) = E{ log₂[ P(x, y) / ( P(x)P(y) ) ] }
            = Σ_{k=1}^{K} Σ_{l=1}^{L} P(x = x_l, y = y_k) log₂[ P(x = x_l, y = y_k) / ( P(x = x_l)P(y = y_k) ) ]

Note that:

    M(x, y) = H(x) − H(x|y) = H(y) − H(y|x)

Also note that if x and y are independent, then there is no mutual
information between them.

Note that to compute mutual information between two random variables, we
need a joint probability density function.
How Does Entropy Relate To DSP?
Consider a window of a signal:
What does the sampled ztransform assume about the signal outside the
window?
What does the DFT assume about the signal outside the window?
How do these inﬂuence the resulting spectrum that is computed?
What other assumptions could we make about the signal outside the
window? How many valid signals are there?
How about ﬁnding the spectrum that corresponds to the signal that matches
the measured signal within the window, and has maximum entropy?
What does this imply about the signal outside the window?
This is known as the principle of maximum entropy spectral estimation.
Later we will see how this relates to minimizing the meansquare error.
[Figure: a signal x(n) with a finite analysis window marked on the time axis.]
Session III:
Signal Measurements
Short-Term Measurements
What is the point of this lecture?
[Figure: short-term energy of a speech signal (via a recursive 50 Hz LPF),
computed for several frame/window durations: T_f = 5 ms, T_w = 10 ms;
T_f = 10 ms, T_w = 20 ms; T_f = 20 ms, T_w = 30 ms; T_f = 20 ms, T_w = 30 ms
with a Hamming window; T_f = 20 ms, T_w = 60 ms with a Hamming window.]
Time/Frequency Properties of Windows
    %Overlap = [ (T_w − T_f) / T_w ] × 100%

• Generalized Hanning:

    w_H(k) = w(k) [ α + (1 − α) cos(2πk/N) ],    0 < α < 1

    α = 0.54: Hamming window
    α = 0.50: Hanning window
Recursive-in-Time Approaches
Define the short-term estimate of the power as:

    P(n) = (1/N_s) Σ_{m=0}^{N_s−1} [ w(m) s(n − N_s/2 + m) ]²

We can view the above operation as a moving-average filter applied to the
sequence s²(n).

This can be computed recursively using a linear constant-coefficient
difference equation:

    P(n) = −Σ_{i=1}^{N_a} a_pw(i) P(n − i) + Σ_{j=1}^{N_b} b_pw(j) s²(n − j)

Common forms of this general equation are:

    P(n) = αP(n − 1) + s²(n)                               (leaky integrator)

    P(n) = αP(n − 1) + (1 − α)s²(n)                        (first-order weighted average)

    P(n) = αP(n − 1) + βP(n − 2) + s²(n) + γs²(n − 1)      (2nd-order integrator)

Of course, these are nothing more than various types of lowpass filters, or
adaptive controllers. How do we compute the constants for these
equations?

In what other applications have we seen such filters?
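The first-order weighted average above is one line of C per sample (the function name is mine):

```c
/* First-order weighted-average power estimate from the difference
   equation above: P(n) = alpha*P(n-1) + (1 - alpha)*s^2(n).       */
double update_power(double prev, double s, double alpha)
{
    return alpha * prev + (1.0 - alpha) * s * s;
}
```

Driven by a constant-amplitude input, the estimate converges to the input power; α controls the trade-off between smoothness and tracking speed (larger α, slower but smoother).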
Relationship to Control Systems
The first-order systems can be related to physical quantities by observing
that the system consists of one real pole:

    H(z) = 1 / (1 − αz⁻¹)

α can be defined in terms of the bandwidth of the pole.

For second-order systems, we have a number of alternatives. Recall that a
second-order system can consist of at most one zero and one pole and their
complex conjugates. Classical filter design algorithms can be used to design
the filter in terms of a bandwidth and an attenuation.

An alternate approach is to design the system in terms of its unit-step
response:
There are many forms of such controllers (often known as
servo-controllers). One very interesting family of such systems is those
that correct to the velocity and acceleration of the input. All such systems
can be implemented as a digital filter.
[Figure: unit-step response P(n) to an input u(n), annotated with overshoot,
rise times, settling time, and the final response threshold; alongside, the
equivalent impulse response h(n) with its rise time and fall time.]
Short-Term Autocorrelation and Covariance
Recall our definition of the autocorrelation function:

    r(k) = (1/N) Σ_{n=k}^{N−1} s(n)s(n − k) = (1/N) Σ_{n=0}^{N−1−k} s(n)s(n + k),    0 ≤ k ≤ p

Note:
• this can be regarded as a dot product of s(n) and s(n + k).
• let's not forget preemphasis, windowing, and centering the window
w.r.t. the frame, and that scaling is optional.

What would C++ code look like?
array-style:

for (k = 0; k < M; k++) {
  r[k] = 0.0;
  for (n = 0; n < N - k; n++) {
    r[k] += s[n] * s[n + k];
  }
}

pointer-style:

for (k = 0; k < M; k++, r++) {
  *r = 0.0; s = sig; sk = sig + k;
  for (n = 0; n < N - k; n++) {
    *r += *s++ * *sk++;
  }
}
We note that we can save some multiplications by reusing products:

    r[3] = s[3](s[0] + s[6]) + s[4](s[1] + s[7]) + … + s[N]s[N − 3]

This is known as the factored autocorrelation computation.
It saves about 25% CPU, replacing multiplications with additions and more
complicated indexing.

Similarly, recall our definition of the covariance function:

    c(k, l) = (1/N) Σ_{n=p}^{N−1} s(n − k)s(n − l),    0 ≤ l ≤ k ≤ p

Note:
• we use N − p points
• symmetric, so that only the terms k ≥ l need to be computed

This can be simplified using the recursion:

    c(k, l) = c(k − 1, l − 1) + s(p − k)s(p − l) − s(N − k)s(N − l)
Example of the Autocorrelation Function
Autocorrelation functions for the word "three" comparing the consonant
portion of the waveform to the vowel (256-point Hamming window).

Note:
• shape for the low-order lags — what does this correspond to?
• regularity of peaks for the vowel — why?
• exponentially-decaying shape — which harmonic?
• what does a negative correlation value mean?
An Application of Short-Term Power: Speech SNR Measurement
Problem: Can we estimate the SNR from a speech ﬁle?
Energy Histogram:
The SNR can be defined as:

    SNR = 10 log₁₀( E_s / E_n ) = 10 log₁₀( [ (E_s + E_n) − E_n ] / E_n )

What percentiles to use?
Typically, 80%/20%, 85%/15%, or 95%/15% are used.
[Figure: energy histogram p(E) and cumulative distribution cdf(E); the nominal
signal+noise level is read at the 80th percentile and the nominal noise level at
the 20th percentile (the cdf reaches 100% by definition).]
Linear Prediction (General)
Let us define a speech signal, s(n), and a predicted value:

    s̃(n) = Σ_{k=1}^{p} α_k s(n − k)

Why the added terms? The prediction error is given by:

    e(n) = s(n) − s̃(n) = s(n) − Σ_{k=1}^{p} α_k s(n − k)

We would like to minimize the error by finding the best, or optimal, value of {α_k}.

Let us define the short-time average prediction error:

    E = Σ_n e²(n)

      = Σ_n [ s(n) − Σ_{k=1}^{p} α_k s(n − k) ]²

      = Σ_n s²(n) − 2 Σ_n s(n) [ Σ_{k=1}^{p} α_k s(n − k) ] + Σ_n [ Σ_{k=1}^{p} α_k s(n − k) ]²

      = Σ_n s²(n) − 2 Σ_{k=1}^{p} α_k Σ_n s(n)s(n − k) + Σ_n [ Σ_{k=1}^{p} α_k s(n − k) ]²

We can minimize the error w.r.t. α_l for each 1 ≤ l ≤ p by differentiating E and
setting the result equal to zero:

    ∂E/∂α_l = 0 = −2 Σ_n s(n)s(n − l) + 2 Σ_n [ Σ_{k=1}^{p} α_k s(n − k) ] s(n − l)
Linear Prediction (Cont.)
Rearranging terms:

    Σ_n s(n)s(n − l) = Σ_{k=1}^{p} α_k [ Σ_n s(n − k)s(n − l) ]

or,

    c(l, 0) = Σ_{k=1}^{p} α_k c(k, l)

This equation is known as the linear prediction (Yule-Walker) equation.
{α_k} are known as linear prediction coefficients, or predictor coefficients.

By enumerating the equations for each value of l, we can express this in
matrix form:

    c = Cα

where,

    α = [ α₁, α₂, …, α_p ]ᵀ

    C = | c(1,1)  c(1,2)  …  c(1,p) |
        | c(2,1)  c(2,2)  …  c(2,p) |
        |   …       …     …    …    |
        | c(p,1)  c(p,2)  …  c(p,p) |

    c = [ c(1,0), c(2,0), …, c(p,0) ]ᵀ

The solution to this equation involves a matrix inversion:

    α = C⁻¹c

and is known as the covariance method. Under what conditions does C⁻¹
exist?

Note that the covariance matrix is symmetric. A fast algorithm to find the
solution to this equation is known as the Cholesky decomposition (an
approach in which the covariance matrix is factored into lower and upper
triangular matrices: C = VDVᵀ).
The Autocorrelation Method
Using a different interpretation of the limits on the error minimization —
forcing data only within the frame to be used — we can compute the
solution to the linear prediction equation using the autocorrelation method:

    α = R⁻¹r

where,

    α = [ α₁, α₂, …, α_p ]ᵀ

    R = | r(0)    r(1)    …  r(p−1) |
        | r(1)    r(0)    …  r(p−2) |
        |  …       …      …    …    |
        | r(p−1)  r(p−2)  …  r(0)   |

    r = [ r(1), r(2), …, r(p) ]ᵀ

Note that R is symmetric, and all of the elements along the diagonal are
equal (R is Toeplitz), which means (1) an inverse always exists; (2) the roots
of the resulting polynomial lie inside the unit circle (the synthesis filter is
minimum phase).

The linear prediction process can be viewed as a filter by noting:

    e(n) = s(n) − Σ_{k=1}^{p} α_k s(n − k)

and

    E(z) = S(z)A(z)

where

    A(z) = 1 − Σ_{k=1}^{p} α_k z⁻ᵏ

A(z) is called the analyzer; what type of filter is it? (pole/zero? phase?)
1/A(z) is called the synthesizer; under what conditions is it stable?
Linear Prediction Error
We can return to our expression for the error:

    E = Σ_n e²(n) = Σ_n [ s(n) − Σ_{k=1}^{p} α_k s(n − k) ]²

and substitute our expression for {α_k} and show that:

Autocorrelation Method:

    E = r(0) − Σ_{k=1}^{p} α_k r(k)

Covariance Method:

    E = c(0, 0) − Σ_{k=1}^{p} α_k c(0, k)

Later, we will discuss the properties of these equations as they relate to the
magnitude of the error. For now, note that the same linear prediction
equation that applied to the signal applies to the autocorrelation function,
except that samples are replaced by the autocorrelation lag (and hence the
delay term is replaced by a lag index).

Since the same coefficients satisfy both equations, this confirms our
hypothesis that this is a model of the minimum-phase version of the input
signal.

Linear prediction has numerous formulations, including the covariance
method, autocorrelation formulation, lattice method, inverse filter
formulation, spectral estimation formulation, maximum likelihood
formulation, and inner product formulation. Discussions are found in
disciplines ranging from system identification and econometrics to signal
processing, probability, statistical mechanics, and operations research.
LevinsonDurbin Recursion
The predictor coefficients can be efficiently computed for the autocorrelation
method using the Levinson-Durbin recursion:

    E₀ = r(0)
    for i = 1, 2, …, p:
        k_i = [ r(i) − Σ_{j=1}^{i−1} a_{i−1}(j) r(i − j) ] / E_{i−1}
        a_i(i) = k_i
        for j = 1, 2, …, i − 1:
            a_i(j) = a_{i−1}(j) − k_i a_{i−1}(i − j)
        E_i = (1 − k_i²) E_{i−1}

This recursion gives us great insight into the linear prediction process. First,
we note that the intermediate variables, {k_i}, are referred to as reflection
coefficients.

Example: p = 2

    E₀ = r(0)

    k₁ = r(1)/r(0)

    a₁(1) = k₁ = r(1)/r(0)

    E₁ = (1 − k₁²)E₀ = [ r²(0) − r²(1) ] / r(0)

    k₂ = [ r(2)r(0) − r²(1) ] / [ r²(0) − r²(1) ]

    a₂(2) = k₂ = [ r(2)r(0) − r²(1) ] / [ r²(0) − r²(1) ]

    a₂(1) = a₁(1) − k₂a₁(1) = [ r(1)r(0) − r(1)r(2) ] / [ r²(0) − r²(1) ]

    α₁ = a₂(1),    α₂ = a₂(2)

This reduces the LP problem to O(p²) and saves an order of magnitude in
computational complexity, and makes this analysis amenable to fixed-point
digital signal processors and microprocessors.
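The recursion above translates almost line for line into C (coefficients are stored in a[1..p]; the order is assumed to be at most 31 for the scratch buffer):

```c
#include <math.h>

/* Levinson-Durbin recursion: solves the autocorrelation normal equations
   for predictor coefficients a[1..p] and returns the final error E_p.   */
double levinson(const double *r, int p, double *a)
{
    double E = r[0], prev[32];
    for (int i = 1; i <= p; i++) {
        /* k_i = ( r(i) - sum_j a_{i-1}(j) r(i-j) ) / E_{i-1} */
        double acc = r[i];
        for (int j = 1; j < i; j++) acc -= a[j] * r[i - j];
        double k = acc / E;
        /* update coefficients: a_i(j) = a_{i-1}(j) - k_i a_{i-1}(i-j) */
        for (int j = 1; j < i; j++) prev[j] = a[j];
        a[i] = k;
        for (int j = 1; j < i; j++) a[j] = prev[j] - k * prev[i - j];
        E *= (1.0 - k * k);  /* E_i = (1 - k_i^2) E_{i-1} */
    }
    return E;
}
```

For r = {1, 0.5, 0.25} (an exactly first-order sequence) the p = 2 solution is α₁ = 0.5, α₂ = 0, illustrating the orthogonality property discussed on the next slide: the best order-1 model survives unchanged inside the order-2 model.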
Linear Prediction Coefﬁcient Transformations
The predictor coefficients and reflection coefficients can be transformed
back and forth with no loss of information:

Predictor to reflection coefficient transformation:

    for i = p, p − 1, …, 1:
        k_i = a_i(i)
        a_{i−1}(j) = [ a_i(j) + k_i a_i(i − j) ] / (1 − k_i²),    1 ≤ j ≤ i − 1

Reflection to predictor coefficient transformation:

    for i = 1, 2, …, p:
        a_i(i) = k_i
        a_i(j) = a_{i−1}(j) − k_i a_{i−1}(i − j),    1 ≤ j ≤ i − 1

Also, note that these recursions require intermediate storage for {a_i}.

From the above recursions, it is clear that k_i ≠ ±1. In fact, there are several
important results related to k_i:

(1) |k_i| < 1 implies a stable synthesis filter (poles inside the unit circle).
(2) |k_i| = 1 implies a harmonic process (poles on the unit circle).
(3) |k_i| > 1 implies an unstable synthesis filter (poles outside the unit circle).
(4) E₀ > E₁ > … > E_p

This gives us insight into how to determine the LP order during the
calculations. We also see that reflection coefficients are orthogonal in the sense
that the best order "p" model is also the first "p" coefficients in the order
"p+1" LP model (very important!).
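The backward (predictor-to-reflection) recursion above can be sketched in C; the input array is copied first because the recursion overwrites the coefficients level by level (order assumed at most 31):

```c
#include <math.h>

/* Predictor-to-reflection conversion: recover k[1..p] from a_in[1..p]
   using a_{i-1}(j) = (a_i(j) + k_i a_i(i-j)) / (1 - k_i^2).          */
void predictor_to_reflection(const double *a_in, int p, double *k)
{
    double a[32], tmp[32];
    for (int j = 1; j <= p; j++) a[j] = a_in[j];
    for (int i = p; i >= 1; i--) {
        k[i] = a[i];
        double den = 1.0 - k[i] * k[i];
        for (int j = 1; j <= i - 1; j++)
            tmp[j] = (a[j] + k[i] * a[i - j]) / den;
        for (int j = 1; j <= i - 1; j++) a[j] = tmp[j];
    }
}
```

As a check, the forward recursion with k₁ = 0.5, k₂ = 0.25 produces a₂(1) = 0.375 and a₂(2) = 0.25; running those predictor coefficients backward recovers the same reflection coefficients.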
Spectral-Matching Interpretation of Linear Prediction
Noise Weighting
The Real Cepstrum
Goal: Deconvolve the spectrum for multiplicative processes.

In practice, we use the "real" cepstrum:

    C_s(ω) = log|S(f)| = log|V(f)U(f)| = log|V(f)| + log|U(f)|

V(f) and U(f) manifest themselves at the low and high end of the "quefrency"
domain, respectively.

We can derive cepstral parameters directly from LP analysis:

    ln[ 1/A(z) ] = C(z) = Σ_{n=1}^{∞} c_n z⁻ⁿ

To obtain the relationship between cepstral and predictor coefficients, we can
differentiate both sides with respect to z⁻¹:

    d/dz⁻¹ { ln[ 1 / (1 − Σ_{k=1}^{p} α_k z⁻ᵏ) ] } = d/dz⁻¹ [ Σ_{n=1}^{∞} c_n z⁻ⁿ ]

which simplifies to:

    c₁ = a₁

    c_n = Σ_{k=1}^{n−1} (1 − k/n) α_k c_{n−k} + a_n

Note that the order of the cepstral coefficients need not be the same as the order
of the LP model. Typically, 10-16 LP coefficients are used to generate 10-12
cepstral coefficients.
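The recursion above can be sketched in C, handling the case where the cepstral order exceeds the LP order (a_n = 0 for n > p; the function name is mine):

```c
#include <math.h>

/* Cepstral coefficients c[1..nc] from predictor coefficients a[1..p]
   via the recursion: c_n = a_n + sum_{k=1}^{n-1} (1 - k/n) a_k c_{n-k}. */
void lpc_to_cepstrum(const double *a, int p, double *c, int nc)
{
    for (int n = 1; n <= nc; n++) {
        c[n] = (n <= p) ? a[n] : 0.0;   /* a_n vanishes beyond the LP order */
        for (int k = 1; k < n && k <= p; k++)
            c[n] += (1.0 - (double)k / n) * a[k] * c[n - k];
    }
}
```

For a₁ = 0.5, a₂ = 0.25 (p = 2) this gives c₁ = 0.5, c₂ = 0.25 + (1/2)(0.5)(0.5) = 0.375, and c₃ = (2/3)(0.5)(0.375) + (1/3)(0.25)(0.5) = 1/6.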
Session IV:
Signal Processing
In Speech Recognition
THE SIGNAL MODEL ("FRONT-END")

[Block diagram: Speech → Spectral Shaping → (Conditioned Signal) →
Spectral Analysis → (Spectral Measurements) → Parametric Transform →
(Spectral Parameters) → Spectral Modeling → (Observation Vectors).
The early stages are digital signal processing; the later stages belong to
speech recognition.]
A TYPICAL "FRONT-END"

[Block diagram: Signal → Fourier Transform → Cepstral Analysis →
mel-spaced cepstral coefficients (absolute spectral measurements, plus
energy). A Time Derivative block produces the first derivative (rate of
change, delta-energy), and a second Time Derivative block produces the
second derivative.]
Putting It All Together
Mel Cepstrum
    Bark(f) = 13 atan( 0.76 f / 1000 ) + 3.5 atan( (f / 7500)² )

    mel(f) = 2595 log₁₀( 1 + f / 700 )

    BW_crit(f) = 25 + 75 [ 1 + 1.4 (f / 1000)² ]^0.69
Mel Cepstrum Computation Via A Filter Bank
Index | Bark Center Freq. (Hz) | Bark BW (Hz) | Mel Center Freq. (Hz) | Mel BW (Hz)
1 50 100 100 100
2 150 100 200 100
3 250 100 300 100
4 350 100 400 100
5 450 110 500 100
6 570 120 600 100
7 700 140 700 100
8 840 150 800 100
9 1000 160 900 100
10 1170 190 1000 124
11 1370 210 1149 160
12 1600 240 1320 184
13 1850 280 1516 211
14 2150 320 1741 242
15 2500 380 2000 278
16 2900 450 2297 320
17 3400 550 2639 367
18 4000 700 3031 422
19 4800 900 3482 484
20 5800 1100 4000 556
21 7000 1300 4595 639
22 8500 1800 5278 734
23 10500 2500 6063 843
24 13500 3500 6964 969
Mel Cepstrum Computation Via Oversampling Using An FFT

[Figure: spectrum S(f) sampled at mel-spaced center frequencies f_{i−1}, f_i, ….]

• Note that an FFT yields frequency samples at (k/N) f_s
• Oversampling provides a smoother estimate of the envelope of the spectrum
• Other analogous efficient sampling techniques exist for different
frequency scales (bilinear transform, sampled autocorrelation, etc.)
Perceptual Linear Prediction
Goals: Apply greater weight to perceptually-important portions of the spectrum;
avoid uniform weighting across the frequency band
Algorithm:
• Compute the spectrum via a DFT
• Warp the spectrum along the Bark frequency scale
• Convolve the warped spectrum with the power spectrum of the simulated
critical band masking curve and downsample (to typically 18 spectral
samples)
• Preemphasize by the simulated equal-loudness curve
• Simulate the nonlinear relationship between intensity and perceived
loudness by performing a cubic-root amplitude compression
• Compute an LP model
Claims:
• Improved speaker independent recognition performance
• Increased robustness to noise, variations in the channel, and microphones
Linear Regression and Parameter Trajectories
Premise: Time differentiation of features is a noisy process.

Approach: Fit a polynomial to the data to provide a smooth trajectory for a
parameter; use closed-loop estimation of the polynomial
coefficients.

Static feature:

    s(n) = c_k(n)

Dynamic feature:

    ṡ(n) ≈ c_k(n + ∆) − c_k(n − ∆)

Acceleration feature:

    s̈(n) ≈ ṡ(n + ∆) − ṡ(n − ∆)

We can generalize this using an r-th order regression analysis:

    R_rk(t, T, ∆T) = [ Σ_{X=1}^{L} P_r(X, L) C_k( t + (X − (L+1)/2)∆T ) ] / [ Σ_{X=1}^{L} P_r²(X, L) ]

where L = T/∆T (the number of analysis frames in time length T) is odd,
and the orthogonal polynomials are of the form:

    P₀(X, L) = 1

    P₁(X, L) = X

    P₂(X, L) = X² − (1/12)(L² − 1)

    P₃(X, L) = X³ − (1/20)(3L² − 7)X

This approach has been generalized in such a way that the weights on the
coefficients can be estimated directly from training data to maximize the
likelihood of the estimated feature (maximum likelihood linear regression).
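The first-order (r = 1) case above reduces to the familiar regression-based delta feature over a centered window of L = 2M + 1 frames; a C sketch (assumes t is at least M frames from either end of the parameter track):

```c
#include <math.h>

/* First-order regression (delta) feature over a centered window:
       d[t] = sum_{x=-M}^{M} x * c[t+x]  /  sum_{x=-M}^{M} x^2       */
double delta_feature(const double *c, int t, int M)
{
    double num = 0.0, den = 0.0;
    for (int x = -M; x <= M; x++) {
        num += x * c[t + x];
        den += (double)(x * x);
    }
    return num / den;
}
```

On a parameter track that is exactly linear in time, the regression recovers the slope exactly, whereas the simple two-point difference above would amplify any frame-to-frame noise.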
Session V:
Dynamic Programming
The Principle of Dynamic Programming
• An efﬁcient algorithm for ﬁnding the optimal path through a network
Word Spotting Via
Relaxed and Unconstrained Endpointing
• Endpoints, or boundaries, need not be ﬁxed — numerous types of
constraints can be invoked
Slope Constraints:
Increased Efﬁciency and Improved Performance
• Local constraints used to achieve slope constraints
DTW, Syntactic Constraints, and Beam Search
Consider the problem of connected digit recognition: “325 1739”. In the
simplest case, any digit can follow any other digit, but we might know the
exact number of digits spoken.
An elegant solution to the problem of finding the best overall sentence
hypothesis is known as level building (which typically assumes all models are
the same length).
• Though this algorithm is no longer widely used, it gives us a glimpse into
the complexity of the syntactic pattern recognition problem.
[Figure: level-building grid with reference models R₁ through R₅ on the
vertical axis and test frames F₁ through F₃₀ on the horizontal axis, marking
possible word endings for the first word and possible starts for the second
word.]
Level Building For An Unknown Number Of Words
• Paths can terminate on any level boundary indicating a different number
of words was recognized (note the signiﬁcant increase in complexity)
• A search band around the optimal path can be maintained to reduce the
search space
• Nextbest hypothesis can be generated (Nbest)
• Heuristics can be applied to deal with free endpoints, insertion of silence
between words, etc.
• Major weakness is the assumption that all models are the same length!
[Figure: level-building search with reference models R₁ through R₅ versus
test frames F₁ through F₃₀; levels M = 1 through M = 5 correspond to
word-count hypotheses, and a beam constrains the search around the
optimal path.]
The OneStage Algorithm (“Bridle Algorithm”)
The level building approach is not conducive to models of different lengths,
and does not make it easy to include syntactic constraints (which words can
follow previous hypothesized words).
An elegant algorithm to perform this search in one pass is demonstrated
below:
[Figure: one-stage search grid with the test utterance on the horizontal axis
and the frames of reference models R₁, R₂, and R₃ stacked on the vertical
axis; paths may cross model boundaries as one word hypothesis extends
into the next.]
• Very close to current state-of-the-art doubly-stochastic algorithms (HMM)
• Conceptually simple, but difficult to implement because we must
remember information about the interconnections of hypotheses
• Amenable to beam-search concepts and fast-match concepts
• Supports syntactic constraints by limiting the choices for extending a
hypothesis
• Becomes complex when extended to allow arbitrary amounts of silence
between words
• How do we train?
Introduction of Syntactic Information
• The search space for vocabularies of hundreds of words can become
unmanageable if we allow any word to follow any other word (often called
the no-grammar case)
• Our rudimentary knowledge of language tells us that, in reality, only a
small subset of the vocabulary can follow a given word hypothesis, but
that this subset is sensitive to the given word (we often refer to this as
"context-sensitive")
• In real applications, user-interface design is crucial (much like the
problem of designing GUIs), and normally results in a specification of a
language or collection of sentence patterns that are permissible
• A simple way to express and manipulate this information in a dynamic
programming framework is via a state machine:
[Figure: a state machine with Start and Stop nodes connecting states
A, B, C, D, and E.]
For example, when you enter state C, you output one of the following
words: {daddy, mommy}.
If:
state A: give
state B: me
state C: {daddy, mommy}
state D: come
state E: here
We can generate phrases such as:
Daddy give me
• We can represent such information numerous ways (as we shall see)
Early Attempts At Introducing Syntactic Information Were "Ad-Hoc"
[Block diagram: Speech Signal → Feature Extractor → (Measurements) →
Unconstrained Endpoint Dynamic Programming (Word Spotting) →
Recognized Sequence of Words ("Sentences"); a Finite Automaton supplies
word probabilities P(w₁), P(w₂), …, P(w_i), and Reference Models supply
the word templates.]
BASIC TECHNOLOGY — A PATTERN RECOGNITION PARADIGM
BASED ON HIDDEN MARKOV MODELS

    Search Algorithms:   P(W_t^i | O_t) = P(O_t | W_t^i) P(W_t^i) / P(O_t)

    Pattern Matching:    [ W_t^i, P(O_t, O_{t-1}, ... | W_t^i) ]

    Signal Model:        P(O_t | (W_{t-1}, W_t, W_{t+1}))

    Recognized Symbols:  P(S | O) = argmax_T  prod_i  P(W_t^i | O_t, O_{t-1}, ...)

    Language Model (Prediction):  P(W_t^i)
BEAM SEARCH
• The modern view of speech recognition is a problem consisting of
two operations: signal modeling and search.
• Finding the most probable sequence of words is an optimization
problem that can be solved by dynamic programming
• Optimal solutions are intractable; fortunately, suboptimal solutions
yield good performance
• Beam search algorithms are used to trade off complexity vs.
accuracy
[Figure: log probability of competing hypotheses plotted against time; hypotheses falling outside the beam around the best path are pruned, which can occasionally cause a search error.]
Session VI:
Hidden Markov Models
A Simple Markov Model For Weather Prediction
What is a first-order Markov chain?

    P[q_t = j | q_{t-1} = i, q_{t-2} = k, ...] = P[q_t = j | q_{t-1} = i]

We consider only those processes for which the right-hand side is
independent of time:

    a_ij = P[q_t = j | q_{t-1} = i],   1 <= i, j <= N

with the following properties:

    a_ij >= 0   for all i, j

    sum_{j=1}^{N} a_ij = 1   for all i

The above process can be considered observable because the output
process is a set of states at each instant of time, where each state
corresponds to an observable event.
Later, we will relax this constraint, and make the output related to the states
by a second random process.

Example: A three-state model of the weather

    State 1: precipitation (rain, snow, hail, etc.)
    State 2: cloudy
    State 3: sunny

[Figure: a fully connected three-state diagram whose transition probabilities form the matrix]

    A = | 0.4  0.3  0.3 |
        | 0.2  0.6  0.2 |
        | 0.1  0.1  0.8 |
Basic Calculations
Example: What is the probability that the weather for eight consecutive
days is "sun-sun-sun-rain-rain-sun-cloudy-sun"?

Solution:

    O = ( sun sun sun rain rain sun cloudy sun )
        (  3   3   3    1    1   3    2     3  )

    P(O | Model) = P[3] P[3|3] P[3|3] P[1|3] P[1|1] P[3|1] P[2|3] P[3|2]
                 = pi_3 (a_33)^2 a_31 a_11 a_13 a_32 a_23
                 = 1.536 x 10^-4

Example: Given that the system is in a known state, what is the probability
that it stays in that state for d days?

    O = ( i i i ... i j )

    P(O | Model, q_1 = i) = P(O, q_1 = i | Model) / P(q_1 = i)
                          = pi_i a_ii^{d-1} (1 - a_ii) / pi_i
                          = a_ii^{d-1} (1 - a_ii)
                          = p_i(d)

Note the exponential character of this distribution.

We can compute the expected number of observations in a state given
that we started in that state:

    d_i = sum_{d=1}^{inf} d p_i(d) = sum_{d=1}^{inf} d a_ii^{d-1} (1 - a_ii) = 1 / (1 - a_ii)

Thus, the expected number of consecutive sunny days is 1/(1 - 0.8) = 5;
the expected number of cloudy days is 2.5, etc.

What have we learned from this example?
Why Are They Called “Hidden” Markov Models?
Consider the problem of predicting the outcome of a coin toss experiment.
You observe the following sequence:

    O = ( H H T T T H T T H ... H )

What is a reasonable model of the system?

1-Coin Model (Observable Markov Model): two states, Heads and Tails, with
transition probabilities P(H) and 1 - P(H); the state sequence IS the
observation sequence:

    O = H H T T H T H H T T H ...
    S = 1 1 2 2 1 2 1 1 2 2 1 ...

2-Coins Model (Hidden Markov Model): two states with self-transition
probabilities a_11 and a_22 (and cross transitions 1 - a_11, 1 - a_22);
state 1 emits heads with probability P(H) = P_1 and tails with P(T) = 1 - P_1,
state 2 with P_2 and 1 - P_2. The state sequence is hidden:

    O = H H T T H T H H T T H ...
    S = 1 1 2 2 1 2 1 1 2 2 1 ...

3-Coins Model (Hidden Markov Model): three fully connected states with
transition probabilities a_11 ... a_33; state i emits heads with probability
P(H) = P_i and tails with P(T) = 1 - P_i:

    O = H H T T H T H H T T H ...
    S = 3 1 2 3 3 1 1 2 3 1 3 ...
Why Are They Called Doubly Stochastic Systems?
The Urn-and-Ball Model

[Figure: three urns, each containing colored balls; urn i is characterized by
an output distribution P(red) = b_i(1), P(green) = b_i(2), P(blue) = b_i(3),
P(yellow) = b_i(4), .... A hidden process selects an urn at each step, and a
ball is drawn from the selected urn.]
O = {green, blue, green, yellow, red, ..., blue}
How can we determine the appropriate model for the observation
sequence given the system above?
Elements of a Hidden Markov Model (HMM)
• N — the number of states
• M — the number of distinct observations per state
• The state-transition probability distribution
• The output probability distribution
• The initial state distribution
We can write this succinctly as:
Note that the probability of being in any state at any time is completely
determined by knowing the initial state and the transition probabilities:
Two basic problems:
(1) how do we train the system?
(2) how do we estimate the probability of a given sequence
(recognition)?
This gives rise to a third problem:
If the states are hidden, how do we know what states were used to
generate a given output?
How do we represent continuous distributions (such as feature vectors)?
    A = { a_ij }        (state-transition probabilities)

    B = { b_j(k) }      (output probabilities)

    pi = { pi_i }       (initial state distribution)

    lambda = ( A, B, pi )

    pi(t) = A^{t-1} pi
Formalities
The discrete observation HMM is restricted to the production of a finite set
of discrete observations (or sequences). The output distribution at any state
is given by:

    b(k|i) ≡ P(y(t) = k | x(t) = i)

The observation probabilities are assumed to be independent of time. We
can write the probability of observing a particular observation, y(t), as:

    b(y(t)|i) ≡ P(y(t) = y(t) | x(t) = i)

The observation probability distribution can be represented as a matrix
whose dimension is K rows x S states.
We can define the observation probability vector as:

    p(t) = [ P(y(t) = 1), P(y(t) = 2), ..., P(y(t) = K) ]^T

or,

    p(t) = B pi(t) = B A^{t-1} pi(1)

The mathematical specification of an HMM can be summarized as:

    M = { S, pi(1), A, B, { y_k, 1 <= k <= K } }

For example, reviewing our coin-toss model:
[Figure: the three-coin model — a fully connected three-state machine with
transition probabilities a_11 ... a_33; state i emits heads with probability
P_i and tails with probability 1 - P_i.]

    S = 3

    pi(1) = [ 1/3, 1/3, 1/3 ]^T

    A = | a_11  a_12  a_13 |
        | a_21  a_22  a_23 |
        | a_31  a_32  a_33 |

    B = | P_1      P_2      P_3     |
        | 1 - P_1  1 - P_2  1 - P_3 |
Recognition Using Discrete HMMs
Denote any partial sequence of observations in time by:

    y_{t1}^{t2} ≡ { y(t1), y(t1 + 1), y(t1 + 2), ..., y(t2) }

The forward partial sequence of observations at time t is

    y_1^t ≡ { y(1), y(2), ..., y(t) }

The backward partial sequence of observations at time t is

    y_{t+1}^T ≡ { y(t + 1), y(t + 2), ..., y(T) }

A complete set of observations of length T is denoted as y ≡ y_1^T.

What is the likelihood of an HMM?

We would like to calculate P(M | y = y) — however, we can't. We can
(see the introductory notes) calculate P(y = y | M). Consider the brute-force
method of computing this. Let theta = { i_1, i_2, ..., i_T } denote a specific
state sequence. The probability of a given observation sequence being
produced by this state sequence is:

    P(y | theta, M) = b(y(1)|i_1) b(y(2)|i_2) ... b(y(T)|i_T)

The probability of the state sequence is

    P(theta | M) = P(x(1) = i_1) a(i_2|i_1) a(i_3|i_2) ... a(i_T|i_{T-1})

Therefore,

    P(y, theta | M) = P(x(1) = i_1) a(i_2|i_1) a(i_3|i_2) ... a(i_T|i_{T-1})
                      x b(y(1)|i_1) b(y(2)|i_2) ... b(y(T)|i_T)

To find P(y | M), we must sum over all possible paths:

    P(y | M) = sum_{all theta} P(y, theta | M)

This requires O(2T S^T) flops. For S = 5 and T = 100, this gives about
1.6 x 10^72 computations per HMM!
The "Any Path" Method (Forward-Backward, Baum-Welch)
The forward-backward (F-B) algorithm begins by defining a "forward-going"
probability sequence:

    alpha(y_1^t, i) ≡ P(y_1^t = y_1^t, x(t) = i | M)

and a "backward-going" probability sequence:

    beta(y_{t+1}^T | i) ≡ P(y_{t+1}^T = y_{t+1}^T | x(t) = i, M)

Let us next consider the contribution to the overall sequence probability
made by a single transition:

    alpha(y_1^{t+1}, j) = alpha(y_1^t, i) x P(x(t+1) = j | x(t) = i)
                          x P(y(t+1) = y(t+1) | x(t+1) = j)
                        = alpha(y_1^t, i) a(j|i) b(y(t+1)|j)

Summing over all possibilities for reaching state j:

    alpha(y_1^{t+1}, j) = sum_{i=1}^{S} alpha(y_1^t, i) a(j|i) b(y(t+1)|j)

[Figure: a trellis of states 1 ... S at times y(t-1), y(t), y(t+1), showing all
transitions from the alpha(y_1^t, i) column feeding alpha(y_1^{t+1}, j).]
Baum-Welch (Continued)
The recursion is initiated by setting:

    alpha(y_1^1, j) = P(x(1) = j) b(y(1)|j)

Similarly, we can derive an expression for beta:

    beta(y_{t+1}^T | i) = sum_{j=1}^{S} beta(y_{t+2}^T | j) a(j|i) b(y(t+1)|j)

This recursion is initialized by:

    beta(y_{T+1}^T | i) ≡ 1, if i is a legal final state
                          0, otherwise

We still need to find P(y | M):

    P(y, x(t) = i | M) = alpha(y_1^t, i) beta(y_{t+1}^T | i)

for any state i. Therefore,

    P(y | M) = sum_{i=1}^{S} alpha(y_1^t, i) beta(y_{t+1}^T | i)

But we also note that we should be able to compute this probability using
only the forward direction. By considering t = T, we can write:

    P(y | M) = sum_{i=1}^{S} alpha(y_1^T, i)

These equations suggest a recursion in which, for each value of t, we iterate
over ALL states and update alpha(y_1^t, j). When t = T, P(y | M) is computed by
summing over ALL states.

The complexity of this algorithm is O(S^2 T); for S = 5 and T = 100,
approximately 2500 flops are required (compared to 10^72 flops for the
exhaustive search).
The Viterbi Algorithm
Instead of allowing any path to produce the output sequence, and hence
creating the need to sum over all paths, we can simply assume only one
path produced the output. We would like to find the single most likely path
that could have produced the output. Calculation of this path and probability
is straightforward, using the dynamic programming algorithm previously
discussed:

    D(t, i) = a(i|j*) b(k|i) D(t - 1, j*)

where

    j* = argmax_{valid j} { D(t - 1, j) }

(in other words, the predecessor node with the best score). Often,
probabilities are replaced with the logarithm of the probability, which
converts multiplications to summations. In this case, the HMM looks
remarkably similar to our familiar DP systems.

Beam Search
In the context of the best-path method, it is easy to see that we can employ
a beam search similar to what we used in DP systems:

    D_min(t, i) >= D_min(t, i*_t) - delta(t)

In other words, for a path to survive, its score must be within a range of the
best current score. This can be viewed as a time-synchronous beam
search. It has the advantage that, since all hypotheses are at the same point
in time, their scores can be compared directly. This is due to the fact that
each hypothesis accounts for the same amount of time (same number of
frames).
Training Discrete Observation HMMs
Training refers to the problem of finding { pi(1), A, B } such that the model,
M, after an iteration of training, better represents the training data than the
previous model. The number of states is usually not varied or reestimated,
other than via the modification of the model inventory. The a priori
probabilities of the likelihood of a model, pi(1), are normally not reestimated
either, since these typically come from the language model.

The first algorithm we will discuss is one based on the Forward-Backward
algorithm (Baum-Welch reestimation). Define:

    u_{j|i} ≡ label for a transition from state i to state j
    u_{.|i} ≡ set of transitions exiting state i
    u_{j|.} ≡ set of transitions entering state j

Also, u(t) denotes a random variable that models the transitions at time t,
and y_j(t) a random variable that models the observation being emitted at
state j at time t. The symbol "." is used to denote an arbitrary event.

Next, we need to define some intermediate quantities related to particular
events at a given state at a given time:

    zeta(i, j; t) ≡ P(u(t) = u_{j|i} | y, M)
                 = P(u(t) = u_{j|i}, y | M) / P(y | M)
                 = alpha(y_1^t, i) a(j|i) b(y(t+1)|j) beta(y_{t+2}^T | j) / P(y | M),
                   for t = 1, 2, ..., T  (0 for other t)

where the sequences alpha, beta, a, and b were defined previously (last lecture).
Intuitively, we can think of this as the probability of observing a transition
from state i to state j at time t for a particular observation sequence, y (the
utterance in progress), and model M.
We can also make the following definition:

    gamma(i; t) ≡ P(u(t) ∈ u_{.|i} | y, M) = sum_{j=1}^{S} zeta(i, j; t)
               = alpha(y_1^t, i) beta(y_{t+1}^T | i) / P(y | M),
                 for t = 1, 2, ..., T  (0 for other t)

This is the probability of exiting state i. Also,

    nu(j; t) ≡ P(x(t) = j | y, M)
            = gamma(j; t),                 t = 1, 2, ..., T - 1
            = alpha(y_1^T, j) / P(y | M),  t = T
              (0 for other t)
            = alpha(y_1^t, j) beta(y_{t+1}^T | j) / P(y | M),
              for t = 1, 2, ..., T  (0 for other t)

which is the probability of being in state j at time t. Finally,

    delta(j, k; t) ≡ P(y_j(t) = k | y, M)
                  = nu(j; t), if y(t) = k and 1 <= t <= T;  0 otherwise
                  = alpha(y_1^t, j) beta(y_{t+1}^T | j) / P(y | M),
                    if y(t) = k and 1 <= t <= T;  0 otherwise

which is the probability of observing symbol k at state j at time t.

Note that we make extensive use of the forward and backward probabilities
in these computations. This will be key to reducing the complexity of the
computations by allowing an iterative computation.
From these four quantities, we can define four more intermediate quantities
by summing over time:

    zeta(i, j; .) = P(u(.) ∈ u_{j|i} | y, M) = sum_{t=1}^{T} zeta(i, j; t)

    gamma(i; .) = P(u(.) ∈ u_{.|i} | y, M) = sum_{t=1}^{T} gamma(i; t)

    nu(j; .) = P(u(.) ∈ u_{j|.} | y, M) = sum_{t=1}^{T} nu(j; t)

    delta(j, k; .) = P(y_j(.) = k | y, M) = sum_{t=1}^{T} delta(j, k; t)
                   = sum_{t=1, y(t)=k}^{T} nu(j; t)

Finally, we can begin relating these quantities to the problem of reestimating
the model parameters. Let us define four more random variables:

    n(u_{j|i}) ≡ number of transitions of the type u_{j|i}
    n(u_{.|i}) ≡ number of transitions of the type u_{.|i}
    n(u_{j|.}) ≡ number of transitions of the type u_{j|.}
    n(y_j(.) = k) ≡ number of times the observation k and state j jointly occur

We can see that:

    zeta(i, j; .) = E{ n(u_{j|i}) | y, M }
    gamma(i; .) = E{ n(u_{.|i}) | y, M }
    nu(j; .) = E{ n(u_{j|.}) | y, M }
    delta(j, k; .) = E{ n(y_j(.) = k) | y, M }

What we have done up to this point is to develop expressions for the
estimates of the underlying components of the model parameters in terms
of the state sequences that occur during training.

But how can this be when the internal structure of the model is hidden?
Following this line of reasoning, an estimate of the transition probability is:

    a(j|i) = E{ n(u_{j|i}) | y, M } / E{ n(u_{.|i}) | y, M }
           = zeta(i, j; .) / gamma(i; .)
           = [ sum_{t=1}^{T-1} alpha(y_1^t, i) a(j|i) b(y(t+1)|j) beta(y_{t+2}^T | j) ]
             / [ sum_{t=1}^{T-1} alpha(y_1^t, i) beta(y_{t+1}^T | i) ]

Similarly,

    b(k|j) = E{ n(y_j(.) = k) | y, M } / E{ n(u_{j|.}) | y, M }
           = delta(j, k; .) / nu(j; .)
           = [ sum_{t=1, y(t)=k}^{T} alpha(y_1^t, j) beta(y_{t+1}^T | j) ]
             / [ sum_{t=1}^{T} alpha(y_1^t, j) beta(y_{t+1}^T | j) ]

Finally,

    P(x(1) = i) = alpha(y_1^1, i) beta(y_2^T | i) / P(y | M)

This process is often called reestimation by recognition, because we need
to recognize the input with the previous models in order that we can
compute the new model parameters from the set of state sequences used to
recognize the data (hence, the need to iterate).

But will it converge? Baum and his colleagues showed that the new model
guarantees that:

    P(y | M_new) >= P(y | M)
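The reestimation formulas can be collected into a single forward-backward pass. The sketch below (a hypothetical two-state model and one short training string; NumPy only) performs one Baum-Welch iteration; Baum's result guarantees the new model's likelihood does not decrease.

```python
import numpy as np

def forward_backward(obs, A, B, pi):
    """Return alpha, beta, and P(y | M) for a discrete HMM."""
    S, T = len(pi), len(obs)
    alpha = np.zeros((T, S)); beta = np.ones((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()

def baum_welch_step(obs, A, B, pi):
    """One reestimation iteration; guarantees P(y|M') >= P(y|M)."""
    alpha, beta, Py = forward_backward(obs, A, B, pi)
    # zeta[t, i, j]: probability of an i -> j transition at time t
    zeta = (alpha[:-1, :, None] * A[None, :, :] *
            (B[:, obs[1:]].T * beta[1:])[:, None, :]) / Py
    gamma = alpha * beta / Py                 # state-occupancy probabilities
    A_new = zeta.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):               # expected emission counts
        B_new[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new, gamma[0]

A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 0, 0, 1]
A1, B1, pi1 = baum_welch_step(obs, A, B, pi)
```

Iterating the step drives the likelihood monotonically upward toward a (possibly local) maximum, which is exactly the convergence behavior discussed next.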
Since this is a highly nonlinear optimization, it can get stuck in local maxima:

[Figure: P(y | M) plotted as a function of the model M, showing the global
maximum, a local maximum, and the perturbation distance between them.]

We can overcome this by starting training from a different initial point, or
"bootstrapping" models from previous models.

Analogous procedures exist for the Viterbi algorithm, though they are
much simpler and more intuitive (and more DP-like):

    a(j|i) = n(u_{j|i}) / n(u_{.|i})

and,

    b(k|j) = n(y_j(.) = k) / n(u_{j|.})

where the counts are accumulated along the single best path. These have
been shown to give comparable performance to the forward-backward
algorithm at significantly reduced computation. It is also generalizable to
alternate formulations of the topology of the acoustic model (or language
model) drawn from formal language theory. (In fact, we can even eliminate
the first-order Markovian assumption.)

Further, the above algorithms are easily applied to many problems
associated with language modeling: estimating transition probabilities and
word probabilities, efficient parsing, and learning hidden structure.

But what if a transition is never observed in the training database?
Continuous Density HMMs
The discrete HMM incorporates a discrete probability density function,
captured in the matrix B, to describe the probability of outputting a symbol:

[Figure: a discrete output distribution for state k, b(k|j), plotted against the
symbol index k = 1, 2, 3, 4, 5, 6, ....]

Signal measurements, or feature vectors, are continuous-valued
N-dimensional vectors. In order to use our discrete HMM technology, we
must vector quantize (VQ) this data — reduce the continuous-valued
vectors to discrete values chosen from a set of M codebook vectors. Initially,
most HMMs were based on VQ front ends. However, recently, the
continuous density model has become widely accepted.

Let us assume a parametric model of the observation pdf:

    M = { S, pi(1), A, { f_{y|x}(xi|i), 1 <= i <= S } }

The likelihood of generating observation y(t) in state j is defined as:

    b(y(t)|j) ≡ f_{y|x}(y(t)|j)

Note that taking the negative logarithm of b(.) will produce a log-likelihood,
or a Mahalanobis-like distance. But what form should we choose for f(.)?
Let's assume a Gaussian model, of course:

    f_{y|x}(y|i) = (1 / sqrt(2 pi |C_i|)) exp{ -(1/2) (y - mu_i)^T C_i^{-1} (y - mu_i) }

Note that this amounts to assigning a mean and covariance matrix to each
state — a significant increase in complexity. However, shortcuts such as
variance-weighting can help reduce complexity.

Also, note that the log of the output probability at each state becomes
precisely the Mahalanobis distance (principal components) we studied at
the beginning of the course.
Mixture Distributions
Of course, the output distribution need not be Gaussian, or can be
multimodal to reflect the fact that several contexts are being encoded into a
single state (male/female, allophonic variations of a phoneme, etc.). Much
like a VQ approach can model any discrete distribution, we can use a
weighted linear combination of Gaussians, or a mixture distribution, to
achieve a more complex statistical model.

[Figure: an output density b(y|j) built from three mixture components
u_1, u_2, u_3, together with the composite density (offset).]

Mathematically, this is expressed as:

    f_{y|x}(y|i) = sum_{m=1}^{M} c_{im} N(y; mu_{im}, C_{im})

In order for this to be a valid pdf, the mixture coefficients must be
nonnegative and satisfy the constraint:

    sum_{m=1}^{M} c_{im} = 1,   1 <= i <= S

Note that mixture distributions add significant complexity to the system: M
means and covariances at each state.

Analogous reestimation formulae can be derived by defining the
intermediate quantity:

    nu(i, t; l) ≡ P(x(t) = i, y(t) produced in accordance with mixture l)

                  alpha(y_1^t, i) beta(y_{t+1}^T | i)
                = ------------------------------------------------
                  sum_{j=1}^{S} alpha(y_1^t, j) beta(y_{t+1}^T | j)

                      c_{il} N(y(t); mu_{il}, C_{il})
                  x  ----------------------------------------------
                     sum_{m=1}^{M} c_{im} N(y(t); mu_{im}, C_{im})
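Evaluating such a mixture output density is straightforward (a sketch with hypothetical weights, means, and covariances in two dimensions):

```python
import numpy as np

def gaussian_pdf(y, mu, C):
    """Multivariate Gaussian density N(y; mu, C)."""
    d = len(mu)
    diff = y - mu
    expo = -0.5 * diff @ np.linalg.solve(C, diff)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(C))
    return np.exp(expo) / norm

def mixture_pdf(y, weights, mus, Cs):
    """b(y|i) as a weighted sum of Gaussians; the weights must sum to one."""
    return sum(c * gaussian_pdf(y, mu, C)
               for c, mu, C in zip(weights, mus, Cs))

# A hypothetical bimodal state distribution in two dimensions
weights = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Cs  = [np.eye(2), 2.0 * np.eye(2)]
print(mixture_pdf(np.array([0.0, 0.0]), weights, mus, Cs))
```

Each added component costs one more mean vector and covariance matrix per state, which is the complexity growth the slide warns about.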
The mixture coefficients can now be reestimated using:

    c_{il} = nu(i; ., l) / sum_{m=1}^{M} nu(i; ., m)

the mean vectors can be reestimated as:

    mu_{il} = [ sum_{t=1}^{T} nu(i, t; l) y(t) ] / nu(i; ., l)

the covariance matrices can be reestimated as:

    C_{il} = [ sum_{t=1}^{T} nu(i, t; l) (y(t) - mu_{il})(y(t) - mu_{il})^T ] / nu(i; ., l)

and the transition probabilities and initial probabilities are reestimated as
usual.

The Viterbi procedure once again has a simpler interpretation:

    mu_{il} = (1 / N_{il}) sum_{t=1, y(t)~il}^{T} y(t)

and

    C_{il} = (1 / N_{il}) sum_{t=1, y(t)~il}^{T} (y(t) - mu_{il})(y(t) - mu_{il})^T

The mixture coefficient is reestimated as the number of vectors associated
with a given mixture at a given state:

    c_{il} = N_{il} / N_i
Session VII:
Acoustic Modeling and Training
State Duration Probabilities
Recall that the probability of staying in a state was given by an
exponentially-decaying distribution:

    P(O | Model, q_1 = i) = P(O, q_1 = i | Model) / P(q_1 = i) = a_ii^{d-1} (1 - a_ii)

This model is not necessarily appropriate for speech. There are three
approaches in use today:

• Finite-State Models (encoded in the acoustic model topology):

[Figure: a three-state model expanded into a "macro state" of sub-states
(11, 12, 21, 22, 31, 32), encoding duration in the topology.]

(Note that this model doesn't have skip states; with skip states, it
becomes much more complex.)

• Discrete State Duration Models (D parameters per state):

    P(d_i = d) = tau_d,   1 <= d <= D

• Parametric State Duration Models (one to two parameters per state), in
which a parametric density f(d_i) — for example, a Gaussian parameterized
by a mean and variance per state — models the duration.

Reestimation equations exist for all three cases. Duration models are often
important for larger models, such as words, where duration variations can
be significant, but not as important for smaller units, such as
context-dependent phones, where duration variations are much better
understood and predicted.
Scaling in HMMs
As difficult as it may seem to believe, standard HMM calculations exceed
the precision of 32-bit floating-point numbers for the simplest of models. The
large number of multiplications of numbers less than one leads to
underflow. Hence, we must incorporate some form of scaling.

It is possible to scale the forward-backward calculation (see Section 12.2.5)
by normalizing the calculations by:

    c(t) = 1 / sum_{i=1}^{S} alpha~(y_1^t, i)

at each time step (time-synchronous normalization).

However, a simpler and more effective way to scale is to deal with log
probabilities, which work out nicely in the case of continuous distributions.
Even so, we need to somehow prevent the best-path score from growing
with time (increasingly negative in the case of log probs). Fortunately, at
each time step, we can normalize all candidate best-path scores, and
proceed with the dynamic programming search. More on this later...

Similarly, it is often desirable to trade off the importance of transition
probabilities and observation probabilities. Hence, the likelihood of an
output symbol being observed at a state can be written as:

    P(y_1^t, x(t) = i | M) = P(y_1^{t-1}, x(t-1) = j | M) (a_ij)^alpha (b_i(y(t)))^beta

or, in the log prob space:

    log P(y_1^t, ...) = log P(y_1^{t-1}, x(t-1) = j, ...) + alpha log(a_ij) + beta D_{y,mu}

where D_{y,mu} denotes the (log-domain) distance associated with the output
distribution. This result emphasizes the similarities between HMMs and DTW.
The weights alpha and beta can be used to control the importance of the
"language model."
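The underflow problem, and the log-domain remedy, can be demonstrated directly (a sketch; the per-frame likelihood of 10^-3 is a hypothetical but typical magnitude):

```python
import numpy as np

# 2000 frames of typical per-frame likelihoods: the raw product underflows
probs = np.full(2000, 1e-3)
print(np.prod(probs))          # 0.0 -- underflow even in double precision

# In the log domain the same quantity is perfectly representable
log_total = np.sum(np.log(probs))
print(log_total)               # about -13815.51

def logsumexp(x):
    """Stable log(sum(exp(x))), needed when summing path probabilities."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

print(logsumexp(np.array([-1000.0, -1001.0])))  # about -999.69
```

Viterbi needs only the max of log scores, but any-path (forward) sums require the stable log-sum-exp shown above.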
An Overview of the Training Schedule
Note that a priori segmentation of the utterance is not required, and that the
recognizer is forced to recognize the utterance during training (via the build
grammar operation). This forces the recognizer to learn contextual
variations, provided the seed model construction is done “properly.”
What about speaker independence?
Speaker dependence?
Speaker adaptation?
Channel adaptation?
[Figure: supervised training loop — Seed Model Construction (from
hand-excised data) feeds Build Grammar, Recognize, Backtrace/Update,
and Replace Parameters; the loop advances to the next utterance until the
last utterance, then to the next iteration until convergence.]
Alternative Criteria For Optimization
As we have previously seen, an HMM system using the standard
Baum-Welch reestimation algorithm learns to emulate the statistics of the
training database. We refer to this training mode as "representation." This is
not necessarily equivalent to minimizing the recognition error rate. It
depends to a great extent on how well the statistics of the training
database match the test database. Often, especially in open-set,
speaker-independent cases, the test database contains data never seen in the
training database.

A potential solution to this problem is to attempt to force the recognizer to
learn to discriminate (reduce recognition errors explicitly). This is analogous
to developing a system to make an M-way choice in an N-dimensional
space by learning decision boundaries rather than clustering the data and
modeling the statistics of each cluster.

One approach to modifying the HMM structure to do this is to use a different
cost function. For example, we can minimize the discrimination information
(or equivalently the cross-entropy) between the data and the statistical
characterizations of the data implied by the model.

Recall the definition of the discrimination information:

    J_DI = integral_{-inf}^{inf} f_y(y) log [ f_y(y) / f_y(y | M) ] dy

Note that a small DI is good, because it implies a given measurement
matches both the signal statistics and the model statistics (which means the
relative mismatch between the two is small).

However, such approaches have proven to be intractable — leading to
highly nonlinear equations. A more successful approach has been to
maximize the average mutual information. Recall our definition for the
average mutual information:

    M(y, M) = sum_{l=1}^{L} sum_{r=1}^{R} P(y = y_l, M = M_r)
              x log [ P(y = y_l, M = M_r) / ( P(y = y_l) P(M = M_r) ) ]
This can be written as:

    M(y, M) = sum_{l=1}^{L} sum_{r=1}^{R} P(y = y_l, M = M_r)
              x [ log P(y = y_l | M = M_r) - log P(y = y_l) ]

            = sum_{l=1}^{L} sum_{r=1}^{R} P(y = y_l, M = M_r)
              x [ log P(y = y_l | M = M_r)
                  - log sum_{m=1}^{R} P(y = y_l | M = M_m) P(M = M_m) ]

Note that the last term constitutes a rewrite of P(y = y_l).

If we assume there is exactly one training string (which is the error we want
to correct), and it is to be used to train M_l, then, if we assume
P(y = y_l | M = M_r) ≈ delta(l - r), we can approximate M(y, M) by:

    M(y, M) ≈ sum_{l=1}^{L} [ log P(y = y_l, M = M_l)
              - log sum_{m=1}^{R} P(y = y_l | M = M_m) P(M = M_m) ]

The first term in the summation corresponds to the probability of correct
recognition, which we want to maximize. The second term corresponds to
the probability of incorrect recognition, which we want to minimize.

This method has a rather simple interpretation for discrete HMMs:

[Figure: an eight-state network in which counts are incremented along the
best path and decremented along competing incorrect paths.]
An Overview of the Corrective Training Schedule
[Figure: corrective training schedule — the supervised training loop (Seed
Model Construction from hand-excised data, Build Grammar, Recognize,
Backtrace/Update, Replace Parameters, iterating over utterances until
convergence) is followed by a corrective loop: Recognize (Open), and on
each recognition error, Backtrace/Update and Replace Parameters,
iterating over utterances until convergence.]
Unfortunately, this method is not readily extensible to continuous speech,
and has proved inconclusive in providing measurable reductions in error
rate. However, discriminative training algorithms continue to be an
important area of research in HMMs.
Later, when we study neural networks, we will observe that some of the
neural network approaches are ideally suited towards implementation of
discriminative training.
Distance Measures for HMMs
Recall the K-means clustering algorithm:
Initialization: Choose K centroids
Recursion: 1. Assign all vectors to their nearest centroid.
2. Recompute the centroids as the average of all vectors
assigned to the same centroid.
3. Check the overall distortion. Return to step 1 if some
distortion criterion is not met.
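A minimal version of this procedure might look as follows (a sketch; the two sample clouds are hypothetical):

```python
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    """Plain K-means: assign to nearest centroid, recompute, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].copy()
    for _ in range(iters):
        # 1. assign every vector to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 2. recompute each centroid as the mean of its assigned vectors
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

# Two well-separated sample clouds
X = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5.0, 0.1, (20, 2))])
centroids, labels = kmeans(X, 2)
```

A production version would also track the overall distortion and stop when it falls below a criterion, as step 3 above prescribes.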
Clustering of HMM models is often important in reducing the number of
context-dependent phone models (which can often approach 10,000 for
English) to a manageable number (typically a few thousand models are
used). We can use standard clustering algorithms, but we need some way of
computing the distance between two models.
A useful distance measure can be defined as:

    D(M_1, M_2) ≡ (1 / T_2) [ log P(y_2 | M_1) - log P(y_2 | M_2) ]

where y_2 is a sequence generated by M_2 of length T_2. Note that this
distance metric is not symmetric:

    D(M_1, M_2) != D(M_2, M_1)
A symmetric version of this is:

    D'(M_1, M_2) = [ D(M_1, M_2) + D(M_2, M_1) ] / 2

The sequence y_2 is often taken to be the sequence of mean vectors
associated with each state (typically for continuous distributions).

Often, phonological knowledge is used to cluster models. Models sharing
similar phonetic contexts are merged to reduce the complexity of the model
inventory. Interestingly enough, this can often be performed by inserting an
additional network into the system that maps context-dependent phone
models to a pool of states.
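The distance measure can be sketched with a scaled forward recursion (hypothetical models and stand-in observation sequences; in practice y_2 would be generated by, or representative of, M_2):

```python
import numpy as np

def log_likelihood(obs, A, B, pi):
    """log P(y | M) via the forward recursion with per-frame scaling."""
    alpha = pi * B[:, obs[0]]
    ll = 0.0
    for t in range(1, len(obs)):
        c = alpha.sum()
        ll += np.log(c)
        alpha = ((alpha / c) @ A) * B[:, obs[t]]
    return ll + np.log(alpha.sum())

def hmm_distance(M1, M2, obs2):
    """D(M1, M2) = (1/T2) [log P(y2 | M1) - log P(y2 | M2)]."""
    return (log_likelihood(obs2, *M1) - log_likelihood(obs2, *M2)) / len(obs2)

def symmetric_distance(M1, M2, obs1, obs2):
    """D'(M1, M2) = [D(M1, M2) + D(M2, M1)] / 2."""
    return 0.5 * (hmm_distance(M1, M2, obs2) + hmm_distance(M2, M1, obs1))

# Two hypothetical discrete HMMs, each as a (A, B, pi) tuple
M1 = (np.array([[0.7, 0.3], [0.4, 0.6]]),
      np.array([[0.9, 0.1], [0.2, 0.8]]), np.array([0.5, 0.5]))
M2 = (np.array([[0.5, 0.5], [0.5, 0.5]]),
      np.array([[0.6, 0.4], [0.4, 0.6]]), np.array([0.5, 0.5]))
obs1 = [0, 0, 1, 0, 0, 1]   # stand-ins for sequences from M1 and M2
obs2 = [0, 1, 1, 0, 1, 0]
print(symmetric_distance(M1, M2, obs1, obs2))
```

By construction the distance of a model to itself is zero, and the symmetrized form no longer depends on the argument order.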
The DTW Analogy
Alternative Acoustic Model Topologies
Managing Model Complexity
[Figure: a spectrum of acoustic model complexity — from continuous-density
mixture distributions, through semi-continuous HMMs, tied mixtures, and
tied/clustered states, to discrete-density vector quantization — with memory,
CPU, storage, the number of free parameters, and (perhaps) performance
increasing toward the continuous-density end.]
• Numerous techniques to robustly estimate model parameters; among the
most popular is deleted interpolation:
    A = epsilon_t A_t + (1 - epsilon_t) A_u

    B = epsilon_t B_t + (1 - epsilon_t) B_u
Seed Model Construction: Duration Distributions
Seed Model Construction: Number of States Proportional to Duration
Duration in Context Clustering
Examples of Word Models
Session VIII:
Speech Recognition
Using
Hidden Markov Models
BASIC TECHNOLOGY:
A PATTERN RECOGNITION PARADIGM
BASED ON HIDDEN MARKOV MODELS

[Block diagram relating the component models; the labels are:]

Recognized Symbols:  P(S|O) = arg max_T ∏_i P(W_t^i | O_t, O_t-1, …)

Search Algorithms:  P(W_t^i | O_t) = P(O_t | W_t^i) P(W_t^i) / P(O_t)

Pattern Matching:  [ W_t^i, P(O_t, O_t-1, … | W_t^i) ]

Signal Model:  P(O_t | W_t-1, W_t, W_t+1)

Language Model (Prediction):  P(W_t^i)
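The decision rule in the Search Algorithms box can be sketched directly. Since P(O_t) is the same for every hypothesis, it can be dropped, leaving a comparison of log P(O|W) + log P(W); the scores below are made-up numbers, not from the course.

```python
import math

# Bayes decision rule, sketched: pick the word maximizing
# P(O|W) * P(W), i.e. log P(O|W) + log P(W). P(O) is constant
# over hypotheses and is omitted.

def recognize(acoustic_logp, language_logp):
    return max(acoustic_logp,
               key=lambda w: acoustic_logp[w] + language_logp[w])

acoustic_logp = {"oh": -10.0, "zero": -9.5}                 # log P(O|W)
language_logp = {"oh": math.log(0.7), "zero": math.log(0.3)}  # log P(W)
print(recognize(acoustic_logp, language_logp))
```

Here "zero" wins acoustically, but the language model's preference for "oh" overturns it, which is the essential interplay between the two knowledge sources.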
High Performance Isolated Word Recognition
Using A Continuous Speech Recognizer

Isolated Word Recognition:

    [Network: S → Non-speech → {Word} → Non-speech → S]

Non-speech: typically an acoustic model of one frame in duration that
models the background noise.

{Word}: any word from the set of possible words that can be spoken.

• The key point here is that, with such a system, the recognizer finds the
optimal start/stop times of the utterance with respect to the acoustic
model inventory (a hypothesis-directed search).

Simple Continuous Speech Recognition ("No Grammar"):

    [Network: S → a self-looping Non-speech/{Word} node → S]

• The system recognizes arbitrarily long sequences of words or non-speech
events.
The Great Debate: "Bottom-Up" or "Top-Down"

[Parse tree for the sentence "THE CHILD CRIED AS SHE LEFT IN THE RED
PLANE": the root S expands through syntactic categories (NP, V, CONJ,
PRON, PP, PREP, ART, ADJ, N) down to the words, which are aligned with a
phonetic transcription.]

Parsing refers to the problem of determining whether a given sequence could
have been generated from a given state machine.

This computation, as we shall see, typically requires an elaborate search
of all possible combinations of symbols output from the state machine.

It can be performed efficiently in a "bottom-up" fashion if the
probabilities of the input symbols are extremely accurate and only a few
symbols are possible at each point at the lower levels of the tree.

If the input symbols are ambiguous, "top-down" parsing is typically
preferred.
Network Searching and Beam Search

Premise: suboptimal solutions are useful; global optima are hard to find.

[Diagram: a network of numbered states (1, 2, 3, 4, …) expanding into many
partial hypotheses (5, 6, 7, 8, …, j), with the best path highlighted.]

The network search considers several competing constraints:

• The longer the hypothesis, the lower the probability.
• The overall best hypothesis will not necessarily be the best initial
hypothesis.
• We want to limit the evaluation of the same substring by different
hypotheses (difficult in practice).
• We would like to maintain as few active hypotheses as possible.
• All states in the network will not be active at all times.
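These constraints can be made concrete in a minimal beam-search sketch over a small weighted network (the state names and probabilities below are illustrative, not from the course). At every time step only the top few hypotheses survive, so the globally best path can occasionally be pruned, which is the "suboptimal but useful" premise.

```python
import math

# Minimal beam search over a toy network. arcs maps a state to its
# outgoing (next_state, log_prob) pairs; beam maps active states to
# their best log score so far.

def beam_search(start, arcs, steps, beam_width):
    beam = {start: 0.0}
    for _ in range(steps):
        candidates = {}
        for state, score in beam.items():
            for nxt, logp in arcs.get(state, []):
                cand = score + logp
                if cand > candidates.get(nxt, -math.inf):
                    candidates[nxt] = cand
        # prune: keep only the best `beam_width` hypotheses
        beam = dict(sorted(candidates.items(),
                           key=lambda kv: kv[1], reverse=True)[:beam_width])
    return beam

arcs = {
    "S": [("a", math.log(0.6)), ("b", math.log(0.4))],
    "a": [("c", math.log(0.2)), ("a", math.log(0.8))],
    "b": [("c", math.log(0.9)), ("b", math.log(0.1))],
}
result = beam_search("S", arcs, steps=2, beam_width=2)
print(result)
```

Working in log probabilities keeps long hypotheses from underflowing, and the fixed beam width caps the number of active hypotheses at each step.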
Popular Search Algorithms

• Time Synchronous (Viterbi, DP)
• State Synchronous (Baum-Welch)
• Stack Decoding (N-Best, Best-First, A*)
• Hybrid Schemes
  — Forward-Backward
  — Multi-pass
  — Fast Matching
Generalization of the HMM

Consider the following state diagram showing a simple language model
involving constrained digit sequences:

[State diagram: states S, A, B, C, D, E, F, G, H, I, J linked by arcs,
each labeled with an output word and its probability (e.g., "zero (0.3)",
"oh (0.7)", "one (1.0)") and a transition probability (e.g., 0.9 or 0.1);
most states also carry a self-loop.]

Note the similarities to our acoustic models.

What is the probability of the sequence "zero one two three four five zero
six six seven seven eight eight"?

How would you find the average length of a digit sequence generated from
this language model?
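Questions like these can be answered mechanically. The sketch below scores a word string against a small stochastic finite state automaton by dynamic programming; the toy grammar here is ours, not the one in the figure, and each arc probability already combines the transition and output probabilities.

```python
# Score a word string against a stochastic FSA by summing over all state
# paths. arcs maps a state to (next_state, word, prob) triples.

def score(arcs, start, finals, words):
    probs = {start: 1.0}                      # state -> path probability
    for w in words:
        nxt = {}
        for state, p in probs.items():
            for to, word, ap in arcs.get(state, []):
                if word == w:
                    nxt[to] = nxt.get(to, 0.0) + p * ap
        probs = nxt
    return sum(p for s, p in probs.items() if s in finals)

arcs = {
    "S": [("A", "zero", 0.3), ("A", "oh", 0.7)],
    "A": [("A", "zero", 0.15), ("A", "oh", 0.35), ("B", "one", 0.5)],
}
p = score(arcs, "S", {"B"}, ["oh", "zero", "one"])
print(p)                                      # 0.7 * 0.15 * 0.5
```

The average sequence length could be estimated the same way, by sampling paths from the automaton and averaging their lengths.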
In the terminology associated with formal language theory, this HMM is
known as a finite state automaton.

The word stochastic can also be applied because the transitions and output
symbols are governed by probability distributions.

Further, since there are multiple transitions and observations generated at
any point in time (hence, ambiguous output), this particular graph is
classified as a nondeterministic automaton.

In the future, we will refer to this system as a stochastic finite state
automaton (FSA or SFSA) when it is used to model linguistic information.
We can also express this system as a regular grammar:

    S → zero A    (p1)        S → oh A      (p2)
    S → one A     (p3)        S → five A    (p4)
    A → zero A    (p5)        A → oh A      (p6)
    A → one B     (p7)        A → six B     (p8)
    B → one B     (p9)        B → two C     (p10)
    C → two C     (p11)       C → three D   (p12)
    D → three D   (p13)       D → four E    (p14)
    E → four E    (p15)       E → five F    (p16)
    F → zero F    (p17)       F → oh F      (p18)
    F → five F    (p19)       F → six G     (p20)
    G → six G     (p21)       G → seven H   (p22)
    H → seven H   (p23)       H → four H    (p24)
    H → eight I   (p25)       H → nine J    (p26)
    I → eight I   (p'27)      I → eight.    (p''27)
    I → nine J    (p'28)      I → nine.     (p''28)
Note that rule probabilities are not quite the same as transition
probabilities, since they must combine transition probabilities and output
probabilities. For example, consider p7:

    p7 = (0.9)(0.8)

In general,

    P(y = y_k | x = x_i) = Σ_j a_ij b_j(k)

Note that we must adjust probabilities at the terminal states when the
grammar is nondeterministic:

    p_k = p'_k + p''_k

to allow generation of a final terminal.

Hence, our transition from HMMs to stochastic formal languages is clear
and well understood.
The Artificial Neural Network (ANN)

• Premise: complex computational operations can be implemented by massive
integration of individual components.

• Topology and interconnections are key: in many ANN systems, spatial
relationships between nodes have some physical relevance.

• Properties of large-scale systems: ANNs also reflect a growing body of
theory stating that large-scale systems built from a small unit need not
simply mirror properties of a smaller system (contrast fractals and
chaotic systems with digital filters).

Why Artificial Neural Networks?

• Important physical observations:
  — The human central nervous system contains 10^11 to 10^14 nerve cells,
each of which interacts with 10^3 to 10^4 other neurons.
  — Inputs may be excitatory (promote firing) or inhibitory.
The Artificial Neuron — Nonlinear

[Diagram: a node n_k sums weighted vector inputs (y'_1, y'_2, y'_3),
subtracts a threshold Θ_k, and passes the result through a nonlinearity
S(•) to produce the scalar output y_k.]

    y_k ≡ S( Σ_{i=1..N} w_ki y'_i - θ_k )

The HMM State — Linear

[Diagram: a state n_k with incoming transitions a_ik, a_lk, outgoing
transitions a_kq, a_kr, a_ks, and output distribution b_k(y_k).]

    α(y_1..t+1, j) = Σ_i α(y_1..t, i) a(j|i) b_j(y_t+1)
Typical Thresholding Functions — A Key Difference

The input to the thresholding function is a weighted sum of the inputs:

    u_k ≡ w_k^T y'

The output is typically defined by a nonlinear function:

[Plots of four common choices of S(u): linear, ramp, step, and sigmoid.]

For example, the sigmoid is:

    S(u) = ( 1 + e^(-u) )^(-1)

Sometimes a bias is introduced into the threshold function:

    y_k ≡ S( w_k^T y' - θ_k ) = S( u_k - θ_k )

This can be represented as an extra input whose value is always -1:

    y'_(N+1) = -1        w_(k,N+1) = θ_k
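The bias-folding trick is easy to verify numerically. The sketch below (our own toy weights) computes one neuron's output both ways and shows the results coincide:

```python
import math

# One artificial neuron: u_k = w_k . y', output S(u_k - theta_k).
# The bias theta_k can be folded in as an extra input fixed at -1
# with weight theta_k.

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def neuron(w, y, theta):
    u = sum(wi * yi for wi, yi in zip(w, y))
    return sigmoid(u - theta)

def neuron_folded(w, y, theta):
    # append the constant input y'_{N+1} = -1 with weight w_{k,N+1} = theta
    return neuron(w + [theta], y + [-1.0], 0.0)

w, y, theta = [0.5, -0.2], [1.0, 2.0], 0.3
a = neuron(w, y, theta)
b = neuron_folded(w, y, theta)
print(a, b)                         # identical outputs
```

Folding the bias into the weight vector lets a training algorithm learn thresholds and weights with one uniform update rule.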
Radial Basis Functions

Another popular formulation involves the use of a Euclidean distance:

    y_k = S( Σ_{i=1..N} (w_ik - y'_i)^2 - θ_k ) = S( ||w_k - y'||^2 - θ_k )

Note the parallel to a continuous-distribution HMM.

This approach has a simple geometric interpretation:

[Diagram: a radial basis unit centered at w_k with radius Θ_k in the
(y'_1, y'_2) plane, together with its nonlinearity S(u).]

Another popular variant of this design is to use a Gaussian nonlinearity:

    S(u) = e^(-u^2)

What types of problems are such networks useful for?

• pattern classification (N-way choice; vector quantization)
• associative memory (generate an output from a noisy input; character
recognition)
• feature extraction (similarity transformations; dimensionality reduction)

We will focus on multilayer perceptrons in our studies. These have been
shown to be quite useful for a wide range of problems.
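A single radial basis unit with the Gaussian nonlinearity can be sketched in a few lines (the center and test points are made-up values):

```python
import math

# One radial-basis unit: activation depends on the squared Euclidean
# distance between input y' and center w_k, passed through the
# Gaussian nonlinearity S(u) = exp(-u^2).

def rbf(center, y, theta):
    u = sum((wi - yi) ** 2 for wi, yi in zip(center, y)) - theta
    return math.exp(-u ** 2)

center = [1.0, 1.0]
print(rbf(center, [1.0, 1.0], 0.0))   # on-center input: maximal response
print(rbf(center, [3.0, 0.0], 0.0))   # off-center input: response decays
```

The response is largest for inputs near the center and falls off radially, which is the geometric interpretation shown in the diagram.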
Multilayer Perceptrons (MLP)

[Diagram: an input layer (x_0, x_1, …, x_N), a hidden layer, and an output
layer (y_0, y_1, …, y_N), with every node feeding forward to the next
layer.]

This architecture has the following characteristics:

• The network is segregated into layers: N_i cells per layer, L layers.
• It is a feed-forward, or nonrecurrent, network (no feedback from the
output of a node to the input of a node).

An alternate formulation of such a net is known as the learning vector
quantizer (LVQ), to be discussed later.

The MLP network, not surprisingly, uses a supervised learning algorithm:
the network is presented the input and the corresponding output, and must
learn the weights that minimize the difference between the two.

The LVQ network uses unsupervised learning: the network adjusts itself
automatically to the input data, thereby clustering the data (learning the
boundaries that segregate the data). LVQ is popular because it supports
discriminative training.
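The feed-forward computation itself is short. The sketch below runs one forward pass through a two-layer perceptron; the weights and thresholds are arbitrary illustrative numbers, and each layer computes S(Wx - θ) with a sigmoid nonlinearity.

```python
import math

# Minimal forward pass through a feed-forward MLP: no feedback,
# each layer applies weights, subtracts thresholds, and squashes.

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def layer(W, theta, x):
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) - t)
            for row, t in zip(W, theta)]

def mlp_forward(layers, x):
    for W, theta in layers:
        x = layer(W, theta, x)
    return x

layers = [
    ([[0.5, -0.3], [0.8, 0.1]], [0.0, 0.2]),   # input -> hidden
    ([[1.0, -1.0]], [0.0]),                    # hidden -> output
]
out = mlp_forward(layers, [1.0, 2.0])
print(out)
```

Supervised training (back-propagation) would adjust the weights to pull this output toward a desired target; the forward pass above is the part both training and recognition share.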
Why Artificial Neural Networks?

• An ability to separate classes that are not linearly separable:

[Diagram: two class layouts for classes A and B, one separable by a linear
decision boundary and one with meshed classes.]

A three-layer perceptron is required to determine arbitrarily shaped
decision regions.

• Nonlinear statistical models: the ANN is capable of modeling arbitrarily
complex probability distributions, much like the difference between VQ and
continuous distributions in an HMM.

• Context-sensitive statistics: again, the ANN can learn complex
statistical dependencies, provided there are enough degrees of freedom in
the system.

Why not Artificial Neural Networks? (The Price We Pay...)

• Difficult to deal with patterns of unequal length.
• Temporal relationships not explicitly modeled.

And, of course, both of these are extremely important to the speech
recognition problem.
Session IX:
Language Modeling
Perplexity

How do we evaluate the difficulty of a recognition task?

• Instantaneous/local properties that influence peak resource
requirements: maximum number of branches at a node; acoustic
confusability.

• Global properties that influence average difficulty: perplexity.

If there are W possible words output from a random source, and the source
is not statistically independent (as words in a language tend to be), the
entropy associated with this source is defined as:

    H(w) = - lim_{N→∞} (1/N) Σ_{w_1,…,w_N} P(w_1,…,w_N) log P(w_1,…,w_N)

For an ergodic source, we can compute temporal averages:

    H(w) = - lim_{N→∞} (1/N) log P(w_1,…,w_N)

Of course, these probabilities must be estimated from training data (or
test data).

Perplexity, a common measure used in speech recognition, is simply:

    Q(w) = 2^H(w)

and represents the average branching factor of the language.

Perplexities range from 11 for digit recognition to several hundred for
LVCSR. For a language in which all words are equally likely for all time,
Q(w) = W.
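For a memoryless source the definition collapses to a single-line computation, sketched below with toy distributions (not course data):

```python
import math

# Perplexity of a memoryless word source: Q = 2^H, with H the per-word
# entropy in bits. For W equally likely words, Q = W, the average
# branching factor; skewing the distribution lowers the perplexity.

def perplexity(probs):
    h = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** h

print(perplexity([0.1] * 10))            # uniform 10-word source
print(perplexity([0.7, 0.1, 0.1, 0.1]))  # skewed 4-word source
```

The uniform 10-word source gives exactly the digit-task perplexity of 10, while the skewed 4-word source scores well below its vocabulary size of 4.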
Types of Language Models

Common Language Models:

• No Grammar (Digits)
• Sentence pattern grammars (Resource Management)
• Word Pair/Bigram (RM, Wall Street Journal)
• Word Class (WSJ, etc.)
• Trigram (WSJ, etc.)
• Back-Off Models (merging bigrams and trigrams)
• Long-Range N-Grams and Co-Occurrences (SWITCHBOARD)
• Triggers and Cache Models (WSJ)
• Link Grammars (SWITCHBOARD)

How do we deal with Out-Of-Vocabulary words?

• Garbage Models
• Monophone and Biphone Grammars with optional Dictionary Lookup
• Filler Models

Can we adapt the language model?
Other Formal Languages

• Unrestricted Grammars

• Context Sensitive (N-Grams, Unification):

    ω_1 A ω_2 → ω_1 β ω_2

• Context Free (LR):

    A → β

• Regular (or Finite State):

    A → aB        A → b

• Issues
  — stochastic parsing
  — memory, pruning, and complexity
What can you do with all of this?

[Figure: levels of representation for the sentence "The doctor examined
the patient's knees.":]

Orthographic: The doctor examined the patient's knees.

Phonemic: /# dh i # d A k t exr # I g z ae m ex n + d # dh i #
p e sh ex n t + z # n i + z #/

Phonetic: dh ex d A k t exr I g z ae m I n d dhex pH eI sh I nt s n i: z

Syntactic: [parse tree: S → NP [case:subj] VP; the verb carries
tense:past, arg1:subj [type:caregiver], arg2:obj [type:bodypart]; the noun
phrases carry feature structures for definiteness (def:+), form (nom, gen,
acc), number (sing, pl), and semantic type (caregiver, patient, bodypart).]

Logical: ∃(X) & ∃(Y) & ∃(Z) & doctor(X) & patient(Y) & knees(Z) &
partof(Y,Z) & examined(X,Z)

[Also shown: a 0 to 5 kHz spectrogram of the utterance (about 2 seconds)
with a time-aligned phonetic transcription.]
Session X:
State of the Art
WHAT IS SPEECH UNDERSTANDING?

• Speech Recognition: transcription of words; performance measured by word
error rate.

• Speech Understanding: acting on intentions; performance measured by the
number of "queries" successfully answered (adds a natural language
dimension to the problem).

DIMENSIONS OF THE PROBLEM

• Performance is a function of the vocabulary:
  how many words are known to the system
  how acoustically similar are the competing choices at each point
  how many words are possible at each point in the dialog

• Vocabulary size is a function of memory size.

• Usability is a function of the number of possible combinations of
vocabulary words: restrictive syntax is good for performance and bad for
usability.

• Hence, performance is a function of memory size.
TODAY'S TECHNOLOGY

[Chart: word error rate (0.1% to 100.0%, log scale) versus vocabulary size
(10 to 100K words), annotated with resource requirements (10 to 100 MIPS;
100K to 100M of memory). The endpoints are digit recognition (10 words,
perplexity = 10) and NAB newswire transcription (100K words,
perplexity = 250).]
PROGRESS IN CSR PERFORMANCE

[Chart: word error rate (%) on a log scale (1 to 100) versus evaluation
year (1987 through 1994) for four tasks:]

• Resource Management (1000 words)
• Wall Street Journal (5000 words)
• Wall Street Journal (20,000 words)
• NAB News (unlimited vocabulary)

Note: machine performance is still at least two orders of magnitude lower
than human performance.

(Data from George Doddington, ARPA HLT Program Manager, HLT'95)
WHAT IS THE CONTEXT FOR SPEECH
UNDERSTANDING RESEARCH IN THE 1990's?

• The Application Need
  Interaction with on-line data (Internet)
  Automated information agents ("24-hour Help Desk")
  Global multilingual interactions

• The Technology Challenge
  Create natural, transparent interaction with computers ("2001,"
  "Star Trek")
  Bring computing to the masses (vanishing window)
  Intelligent presentation of information ("hands/eyes busy" applications)

• Application Areas
  Command/Control (Telecom/Workstations)
  Database Query (Internet)
  Dictation (Workstations)
  Machine Translation (Workstations)
  Real-Time Interpretation (Telecom)
THE TECHNICAL CHALLENGE

• Barriers

Speech understanding is a vast empirical problem:
  Hierarchies of hidden representations (part-of-speech, noun phrases,
  units of meaning) produce immensely complex models.
  Training of complex models requires huge and dynamic knowledge bases.

Interconnected and interdependent levels of representation:
  Correct recognition and transcription of speech depends on understanding
  the meaning encoded in speech.
  Correct understanding and interpretation of text depends on the domain
  of discourse.

• Approach

Capitalize on exponential growth in computer power and memory:
  Statistical modeling and automatic training
  Shared resources and infrastructure

Application-focused technical tasks:
  New metrics of performance based on user feedback and productivity
  enhancement (human factors)

(G. Doddington, ARPA HLT Program Manager, "Spoken Language Technology
Discussion," ARPA Human Language Technology Workshop, January 1995)
Example No. 1: HTK (Cambridge University)

Signal Processing:
  Sample Frequency = 16 kHz
  Frame Duration = 10 ms
  Window Duration = 25 ms (Hamming window)
  FFT-Based Spectrum Analysis
  Mel-Frequency Filter Bank (24)
  Log-DCT Cepstral Computation (Energy plus 12 coefficients)
  Linear Regression Using A 5-Frame Window For Δ and Δ² Parameters

Acoustic Modeling:
  Simple Left-To-Right Topology With Three Emitting States
  10-Component Gaussian Mixtures At Each State
  Baum-Welch Reestimation
  Cross-Word Context-Dependent Phone Models
  State-Tying Using Phonetic Decision Trees (Common Broad Phonetic
  Classes)
  +/- 2 Phones Used In Phonetic Decision Trees, Including Word Boundary

Language Modeling:
  N-Gram Language Model (3-, 4-, and 5-Grams Have Been Reported)
  Discounting (De-Weight More Probable N-Grams)
  Back-Off Model Using Bigrams
  Single-Pass Time-Synchronous Tree Search

Notes:
  Approx. 3 to 8 Million Free Parameters
  State-of-the-Art Performance
  Relatively Low Complexity
  Excellent Real-Time Performance
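The delta-parameter step above can be sketched as follows. The regression form is the standard least-squares slope over a 5-frame window; the feature values here are made up, and a real front end would apply this per cepstral coefficient.

```python
# Delta coefficients by linear regression over a (2*width+1)-frame
# window, with edge frames clamped to the ends of the utterance.

def deltas(frames, width=2):
    """frames: per-frame values of one cepstral coefficient.
    Returns the regression slope at each frame."""
    norm = 2.0 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(len(frames)):
        num = sum(k * (frames[min(t + k, len(frames) - 1)]
                       - frames[max(t - k, 0)])
                  for k in range(1, width + 1))
        out.append(num / norm)
    return out

c = [0.0, 1.0, 2.0, 3.0, 4.0]     # a linearly rising coefficient
d = deltas(c)                      # slope is 1.0 away from the edges
print(d)
```

Applying the same regression to the delta stream would yield the Δ² (acceleration) parameters.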
Example No. 2: Abbot Hybrid Connectionist-HMM (Cambridge Univ.)

Signal Processing:
  Sample Frequency = 16 kHz
  Frame Duration = 16 ms
  Window Duration = 32 ms
  FFT-Based Spectrum Analysis
  Mel-Freq. Filter Bank (20) + 3 Voicing Features (replaced by PLP)
  Each Feature Normalized To Zero Mean and Unit Variance (Grand Covar.)

Acoustic Modeling:
  Recurrent Neural Network
  Single-Layer Feed-Forward Network With Four-Frame Delay
  Sigmoidal Nonlinearities
  A Single Network Computes All Posterior Phone Probabilities
  Viterbi-Like Back-Propagation-Through-Time Training
  Parallel Combination of Backward-In-Time and Forward-In-Time Networks
  Context-Dependent Probabilities Computed Using A Second Layer
  Linear Input Mapping For Speaker Adaptation (Similar to MLLR)

Language Modeling:
  Single-Pass Stack Decoding Algorithm
  N-Gram Language Model
  Phone-Level and Word-Level Pruning
  Tree-Based Lexicon
  NO Context-Dependent Cross-Word Models

Notes:
  Approx. 80K Free Parameters
  Performance Comparable To Best HMMs
  Relatively Low Complexity
  Excellent Real-Time Performance
Additional Resources
Web:
CMU: http://www.ri.cmu.edu/speech
comp.speech: ftp://svrftp.eng.cam.ac.uk/pub/pub/comp.speech/
UCSB: http://mambo.ucsc.edu/psl/speech.html
IEEE SP (Rice): http://spib.rice.edu/spib.html
Public Domain:
Oregon Graduate Institute
Cambridge Connectionist Group — AbbotDemo
... and many more ...
ISIP (about one year away)
Commercial:
Entropic Waves / HTK
Matlab (coming soon; also Entropic)
MAY 1517, 1996
SHORT COURSE
INTRO
Who am I? Afﬁliations: Associate Professor, Electrical and Computer Engineering Director, Institute for Signal and Information Processing Location: Mississippi State University 413 Simrall, Box 9571 Mississippi State, Mississippi 39762 Phone/Fax: (601) 3253149 Email: picone@isip.msstate.edu (http://isip.msstate.edu) Education: Ph.D. in E.E., Illinois Institute of Technology, December 1983 M.S. in E.E., Illinois Institute of Technology, May 1980 B.S. in E.E., Illinois Institute of Technology, May 1979 Areas of Research: Speech Recognition and Understanding, Signal Processing Educational Responsibilities: Signal and Systems (third year UG course) Introduction to Digital Signal Processing (fourth year B.S./M.S. course) Fundamentals of Speech Recognition (M.S./Ph.D. course) Research Experience: MS State (1994present): large vocabulary speech recognition, objectoriented DSP Texas Instruments (19871994): telephonebased speech recognition, Tsukuba R&D Center AT&T Bell Laboratories (19851987): isolated word speech recognition, lowrate speech coding Texas Instruments (19831985): robust recognition, low rate speech coding AT&T Bell Laboratories (19801983): medium rate speech coding, low rate speech coding Professional Experience: Senior Member of the IEEE, Professional Engineer Associate Editor of the IEEE Speech and Audio Transactions Associate Editor of the IEEE Signal Processing Magazine
INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING
MAY 1517, 1996
SHORT COURSE
INTRO
Course Outline Day/Time: Wednesday, May 13: 1. 8:30 AM to 10:00 AM 2. 10:30 AM to 12:00 PM 3. 1:00 PM to 2:30 PM 4. 3:00 PM to 4:30 PM Thursday, May 14: 5. 8:30 AM to 10:00 AM 6. 10:30 AM to 12:00 PM 7. 1:00 PM to 2:30 PM 8. 3:00 PM to 4:30 PM Friday, May 15: 9. 8:30 AM to 10:00 AM 10. 10:30 AM to 12:00 PM Language Modeling State of the Art and Future Directions 134 139 Dynamic Programming (DP) Hidden Markov Models (HMMs) Acoustic Modeling and Training Speech Recognition Using HMMs 73 85 104 120 Fundamentals of Speech Basic Signal Processing Concepts Signal Measurements Signal Processing in Speech Recognition 1 20 47 64 Topic: Page:
Suggested Textbooks: 1. J. Deller, et. al., DiscreteTime Processing of Speech Signals, MacMillan Publishing Co., ISBN 0023283017 2. L. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, PrenticeHall, ISBN 0130151572
INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING
MAY 1517. 1996 TEXAS INSTRUMENTS PAGE 1 OF 147 Session I: Fundamentals of Speech INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING .
MAY 1517. 1996 TEXAS INSTRUMENTS PAGE 2 OF 147 A GENERIC SOLUTION Concept or Action Pragmatics Hypothesizer Sentence Prediction Sentence Hypothesizer Word Prediction Word Hypothesizer Syllable Prediction Syllable Hypothesizer Phone Prediction Phone Hypothesizer Speech Signal Signal Model Acoustic Modeling INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING .
P(O t. … )) i i i Language Model: P(W t) P(O t W t)P(W t) i Search Algorithms: P(W t O t) = P(O t) i i Prediction i i Pattern Matching: [ W t.MAY 1517. W t. W t + 1 )) INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING . Ot – 1. … W t) ] Signal Model: P(O t ( W t – 1. 1996 TEXAS INSTRUMENTS PAGE 3 OF 147 BASIC TECHNOLOGY: A PATTERN RECOGNITION PARADIGM BASED ON HIDDEN MARKOV MODELS Recognized Symbols: P(S O) = arg max T ∏ P(W t ( Ot. O t – 1.
1996 TEXAS INSTRUMENTS PAGE 4 OF 147 WHAT IS THE BASIC PROBLEM? Feature No. INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING .MAY 1517. 1 What do we do about the regions of overlap? Context is required to disambiguate them. 2 ph3 in the context e ph3 f ph1 in the context: a ph1 b ph2 in the context: c ph2 d Feature No. The problem of where words begin and end is similar.
1996 TEXAS INSTRUMENTS PAGE 5 OF 147 Speech Production Physiology Hard Palate Soft Palate (velum) Nasal cavity Nostril Pharyngeal cavity Larynx Teeth Esophagus Oral Jaw Trachea Lip Tongue Lung Diaphragm INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING .MAY 1517.
1996 TEXAS INSTRUMENTS PAGE 6 OF 147 A Block Diagram of Human Speech Production Nasal cavity Velum nostrils Pharyngeal cavity Tongue hump Oral cavity mouth Lungs Muscle force INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING .MAY 1517.
50 0.0 20000.60 0.00 INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING .80 0.0 0.70 Time (secs) 0.MAY 1517.50 0.80 0.0 0.90 1.0 10000.60 Condenser microphone 0.0 10000.0 “san” (f1002) Sanken MU2C dynamic microphone 20000.80 0.0 “san” (m0001) Sanken MU2C dynamic microphone 20000.90 1.50 0.0 10000.0 20000.0 10000.80 0.70 Time (secs) 0.0 0.0 10000.00 20000.70 Time (secs) 0.50 0.0 0.0 10000.0 10000.70 Time (secs) 0.60 Sanken MU2C dynamic microphone 0.0 20000.0 0.0 10000.0 10000.00 20000.0 10000.0 0.0 0.0 0.0 0.0 0.60 0.70 Time (secs) 0.80 0.50 20000.90 1.90 1.90 1.00 “ichi” (f1001) 20000.60 0.00 “san” (f1003) Sanken MU2C dynamic microphone 0. 1996 TEXAS INSTRUMENTS PAGE 7 OF 147 What Does A Speech Signal Look Like? “ichi” (f1001) 20000.
5 2 Narrowband Spectrogram ( f s = 8 kHz .5 1 Time (secs) 1.MAY 1517. T w = 6 ms ): Orthographic: 5 kHz 4 kHz 3 kHz 2 kHz 1 kHz 0 kHz The doctor examined the patient’s knees. 1996 TEXAS INSTRUMENTS PAGE 8 OF 147 What Does A Speech Signal Spectrogram Look Like? Standard wideband spectrogram ( f s = 10 kHz . T w = 30 ms ): “Drown” (female) INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING . 0 0.
• (For English. • the basic theoretical unit for describing how speech conveys linguistic meaning. semivowels. • the study of the actual sounds of the language Phonetics: Allophones: • the collection of all minor variants of a given sound (“t” in eight versus “t” in “top”) • Monophones.MAY 1517. Three branches of phonetics: • Articulatory phonetics: manner in which the speech sounds are produced by the articulators of the vocal system. there are about 42 phonemes. Phonemics: • the study of abstract units and their relationships in a language Phone: • the actual sounds that are produced in speaking (for example.) • Types of phonemes: vowels. Biphones. Most often used to describe acoustic models. narrow phonetic transcriptions INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING . and three phones. Issues: • Broad phonemic transcriptions vs. 1996 TEXAS INSTRUMENTS PAGE 9 OF 147 Phonemics and Phonetics Some simple deﬁnitions: Phoneme: • an ideal sound unit with a complete set of articulatory gestures. • Acoustic phonetics: sounds of speech through the analysis of the speech waveform and spectrum • Auditory phonetics: studies the perceptual response to speech sounds as reﬂected in listener trials. two. dipthongs. Triphones — sequences of one. “d” in letter pronounced “l e d er”). and consonants.
MAY 1517. 1996 TEXAS INSTRUMENTS PAGE 10 OF 147 Phonemic and Phonetic Transcription .Standards Major governing bodies for phonetic alphabets: International Phonetic Alphabet (IPA) — over 100 years of history ARPAbet — developed in the late 1970’s to support ARPA research TIMIT — TI/MIT variant of ARPAbet used for the TIMIT corpus Worldbet — developed recently by Jim Hieronymous (AT&T) to deal with multiple languages within a single ASCII system Example: INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING .
MAY 1517. 1996 TEXAS INSTRUMENTS PAGE 11 OF 147 When We Put This All Together: We Have An Acoustic Theory of Speech Production INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING .
1996 TEXAS INSTRUMENTS PAGE 12 OF 147 Consonants Can Be Similarly Classiﬁed INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING .MAY 1517.
MAY 1517..2 ∂t ρc ∂( u ⁄ A ) ∂p – .= ρ ∂t ∂x 1 ∂( pA ) Lips ∂A ∂u – .+ A(x)∂x 2 ∂t ρc ∂t x INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING ..+ grad"" p = 0 ∂t Glottis 1 ∂ p ..+ div u = 0 . tt) A u p p x ) A ∂u ρ . 1996 TEXAS INSTRUMENTS PAGE 13 OF 147 Sound Propagation A detailed acoustic theory must consider the effects of the following: • Time variation of the vocal tract shape • Losses due to heat conduction and viscous friction at the vocal tract walls • Softness of the vocal tract walls • Radiation of sound at the lips • Nasal coupling • Excitation of sound in the vocal tract Let us begin by considering a simple case of a lossless tube: c ρ = u ((x.= .
= .– .A ∂t ∂x The solution is a traveling wave: u(x..5 cm).[ u (t – x ⁄ c) + u (t + x ⁄ c) ] A which is analogous to a transmission line: ∂i ∂v – . t) = A .MAY 1517.2 mg/cc) is the velocity of sound (35000 cm/s) ∂( u ⁄ A ) ∂p – . t ) is the variation in the volume velocity ρ c is the density of air in the tube (1.+ div u = 0 . t) = u (t – x ⁄ c) – u (t + x ⁄ c) ρc + p(x.= ρ ∂t ∂x 1 ∂( pA ) ∂ A ∂u – . sound waves satisfy the following pair of equations: ∂( u ⁄ A ) ρ .+ ∂t ∂x ρc 2 ∂t or A = A(x.2 ∂t ∂t ρc where p = p ( x.= C ∂t ∂x +  A ∂p ∂u .∂x ρc 2 ∂t What are the salient features of the lossless transmission line model? INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING . t) is the area function (about 17.+ grad"" p = 0 ∂t 1 ∂ p ∂A ..= .= L ∂t ∂x ∂v ∂i – . t) = . which implies a wavelength of 8. t ) is the variation of the sound pressure in the tube u = u ( x. then the above equations reduce to: ρ ∂u ∂p – .+ .5 cm long) Uniform Lossless Tube If A(x.= . 1996 TEXAS INSTRUMENTS PAGE 14 OF 147 For frequencies that are long compared to the dimensions of the vocal tract (less than about 4000 Hz.
Note that these 2l correspond to the frequencies at which the tube becomes a quarter Ωl π c wavelength: .volume velocity ρ/A . Ω) cos ( Ωl ⁄ c ) ( 2n + 1 )πc This function has poles located at every . t) = jZ o U G(Ω)e cos [ Ωl ⁄ c ] jΩt cos [ Ω ( l – x ) ⁄ c ] u(x.voltage i .MAY 1517.is the characteristic impedance.= U (0.capacitance The sinusoisdal steady state solutions are: jΩt sin [ Ω ( l – x ) ⁄ c ] p(x. ⇒ Ω =  ..inductance C . A The transfer function is given by: U (l. .acoustic capacitance Analogous Electric Quantity v .acoustic inductance A/(ρc2) . c 2 4l H(f) f 0 1 kHz 2 kHz 3 kHz 4 kHz Is this model realistic? INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING . t) = U G(Ω)e cos [ Ωl ⁄ c ] ρc where Z 0 = .pressure u . Ω) 1 .= . 1996 TEXAS INSTRUMENTS PAGE 15 OF 147 where Acoustic Quantity p .current L .
6 cm F1 x 500 Formant Patterns F2 x 1500 F3 x 2500 F4 x 3500 L2 ⁄ L1 = 8 2 A2 ⁄ A1 = 8 1 F1 x 320 F2 x 1200 F3 x 2300 F4 x 3430 L 2 ⁄ L 1 = 1.5 2 A2 ⁄ A1 = 8 1 F1 x 260 F2 x 1990 F3 x 3050 F4 x 4130 L 1 + L 2 = 14.2 2 A2 ⁄ A1 = 1 ⁄ 8 1 F1 x 780 F2 x 1240 F3 x 2720 F4 x 3350 L 2 ⁄ L 1 = 1.6 cm L 2 ⁄ L 1 = 1.0 2 A2 ⁄ A1 = 8 1 F1 x 220 F2 x 1800 F3 x 2230 F4 x 3800 L 1 + L 2 = 17.5 cm L2 ⁄ L1 = 1 ⁄ 3 2 1 A2 ⁄ A1 = 1 ⁄ 8 F1 x 630 F2 x F3 x F4 x 3440 1770 2280 L 1 + L 2 = 17.MAY 1517. 1996 TEXAS INSTRUMENTS PAGE 16 OF 147 Resonator Geometry L=17.6 cm INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING .
Ω) ⁄ Z G(Ω) For voiced sounds. Ω) = U G(Ω) – P(0. 1996 TEXAS INSTRUMENTS PAGE 17 OF 147 Excitation Models How do we couple energy into the vocal tract? Air Flow Trachea Vocal Cords Vocal Tract Muscle Force RG ug(t) LG Lungs ps(t) Subglottal Pressure The glottal impedance can be approximated by: Z G = R G + jΩL G The boundary condition for the volume velocity is: U (0.MAY 1517. the glottal volume velocity looks something like this: Volume Velocity (cc/sec) 1000 500 0 0 5 10 15 time (ms) 20 25 INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING .
MAY 1517. such as sibilants (“s”) have extremely high bandwidths Questions: What does the overall spectrum look like? What happened to the nasal cavity? What is the form of V(z)? INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING . 1996 TEXAS INSTRUMENTS PAGE 18 OF 147 The Complete Digital Model (Vocoder) fundamental frequency Av Impulse Train Generator Glottal Pulse Model x Vocal Tract Model V(z) Random Noise Generator Lip Radiation Model uG(n) pL(n) x AN Notes: • Sample frequency is typically 8 kHz to 16 kHz • Frame duration is typically 10 msec to 20 msec • Window duration is typically 30 msec • Fundamental frequency ranges from 50 Hz to 500 Hz • Three resonant frequencies are usually found within 4 kHz bandwidth • Some sounds.
Fundamental Frequency Analysis

How do we determine the fundamental frequency, F0?

We use the (statistical) autocorrelation function:

    Ψ(i) = [ Σ_{n=0}^{N-1} x(n)x(n-i) ] / sqrt( Σ_{n=0}^{N-1} x²(n) · Σ_{n=0}^{N-1} x²(n-i) )

(Figure: Ψ(i) versus lag for an 8 kHz signal; lags of 50, 100, and 150 samples correspond to lag times of 6.25 ms (160 Hz), 12.5 ms (80 Hz), and 18.75 ms (53.3 Hz).)

Other common representations:

Average Magnitude Difference Function (AMDF):

    γ(i) = Σ_{n=0}^{N-1} |x(n) - x(n-i)|

Zero Crossing Rate:

    Z = Σ_{n=0}^{N-1} |sgn[x(n)] - sgn[x(n-1)]|

For a pure tone, each period produces two crossings and each crossing contributes 2 to Z, so F0 can be estimated as F0 = Z·Fs/(4N).
Session II: Basic Signal Processing Concepts
The Sampling Theorem and Normalized Time/Frequency

If the highest frequency contained in an analog signal x_a(t) is F_max = B, and the signal is sampled at a rate F_s > 2F_max = 2B, then x_a(t) can be EXACTLY recovered from its sample values using the interpolation function:

    g(t) = sin(2πBt) / (2πBt)

x_a(t) may be expressed as:

    x_a(t) = Σ_{n=-∞}^{∞} x_a(n/F_s) g(t - n/F_s)

where x_a(n/F_s) = x_a(nT) = x(n).

Given a continuous signal:

    x(t) = A cos(2πft + θ)

a discrete-time sinusoid may be expressed as:

    x(n) = A cos(2πf n/f_s + θ)

which, after regrouping, gives:

    x(n) = A cos(ωn + θ)

where ω = 2πf/f_s, and is called normalized radian frequency, and n represents normalized time.
Transforms

The z-transform of a discrete-time signal is defined as:

    X(z) ≡ Σ_{n=-∞}^{∞} x(n) z^{-n}

with the inverse transform given by the contour integral:

    x(n) = (1/2πj) ∮_C X(z) z^{n-1} dz

The Fourier transform of x(n) can be computed from the z-transform as:

    X(ω) = X(z)|_{z = e^{jω}} = Σ_{n=-∞}^{∞} x(n) e^{-jωn}

The Fourier transform may be viewed as the z-transform evaluated around the unit circle.

The Discrete Fourier Transform (DFT) is defined as a sampled version of the Fourier transform shown above:

    X(k) = Σ_{n=0}^{N-1} x(n) e^{-j2πkn/N},    k = 0, 1, 2, ..., N-1

The inverse DFT is given by:

    x(n) = (1/N) Σ_{k=0}^{N-1} X(k) e^{j2πkn/N},    n = 0, 1, 2, ..., N-1

The Fast Fourier Transform (FFT) is simply an efficient computation of the DFT.

The Discrete Cosine Transform (DCT) is simply a DFT of a signal that is assumed to be real and even (real and even signals have real spectra!):

    X(k) = x(0) + 2 Σ_{n=1}^{N-1} x(n) cos(2πkn/N),    k = 0, 1, 2, ..., N-1

Note that these are not the only transforms used in speech processing (wavelets, Wigner distributions, fractals, etc.).
Time-Domain Windowing

Let {x(n)} denote a finite-duration segment of a signal:

    x̂(n) = x(n) w(n)

This introduces frequency-domain aliasing (the so-called picket fence effect):

(Figure: log-magnitude spectra of a windowed sinusoid; f_s = 8000 Hz, f1 = 1511 Hz, L = 25, N = 2048.)

Popular Windows

Generalized Hanning:

    w_H(k) = w(k) [α + (1 - α) cos(2πk/N)],    0 < α < 1

    α = 0.54: Hamming window
    α = 0.50: Hanning window

Frame-Based Analysis With Overlap

Consider the problem of performing a piecewise linear analysis of a signal:

(Figure: overlapping analysis frames x1(n), x2(n), x3(n), each of length L.)
Difference Equations, Filters, and Signal Flow Graphs

A linear time-invariant system can be characterized by a constant-coefficient difference equation:

    y(n) = -Σ_{k=1}^{N} a_k y(n-k) + Σ_{k=0}^{M} b_k x(n-k)

Is this system linear if the coefficients are time-varying?

Such systems can be implemented as signal flow graphs:

(Figure: a direct-form signal flow graph with feedforward coefficients b_0, b_1, ..., b_N, feedback coefficients a_1, a_2, ..., a_N, and unit delays z⁻¹.)
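The difference equation above can be evaluated directly; a C sketch assuming initial rest (samples before n = 0 are zero). This is the naive form, not an optimized direct-form II structure:

```c
/* y(n) = -sum_{k=1..Na} a[k] y(n-k) + sum_{k=0..Nb} b[k] x(n-k),
   with zero initial conditions. a[0] is unused to match the 1..Na
   indexing of the equation. */
void difference_eq(const double *x, double *y, int len,
                   const double *a, int Na,   /* feedback a[1..Na]    */
                   const double *b, int Nb)   /* feedforward b[0..Nb] */
{
    for (int n = 0; n < len; n++) {
        double acc = 0.0;
        for (int k = 1; k <= Na; k++)
            if (n - k >= 0) acc -= a[k] * y[n - k];
        for (int k = 0; k <= Nb; k++)
            if (n - k >= 0) acc += b[k] * x[n - k];
        y[n] = acc;
    }
}
```

For example, a[1] = -0.5 and b[0] = 1 implements y(n) = 0.5 y(n-1) + x(n), whose impulse response is 1, 0.5, 0.25, ...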
Minimum Phase, Maximum Phase, Mixed Phase, and Speech Perception

An FIR filter composed of all zeros that are inside the unit circle is minimum phase. There are many realizations of a system with a given magnitude response: one is a minimum-phase realization, one is a maximum-phase realization, and the others are in-between.

Any nonminimum-phase pole/zero system can be decomposed into:

    H(z) = H_min(z) H_ap(z)

It can be shown that of all the possible realizations of |H(ω)|, the minimum-phase version is the most compact in time. Define:

    E(n) = Σ_{k=0}^{n} |h(k)|²

Then E_min(n) ≥ E(n) for all n and all possible realizations of |H(ω)|.

Why is minimum phase such an important concept in speech processing? We prefer systems that are invertible:

    H(z) H⁻¹(z) = 1

and we would like both systems to be stable. The inverse of a nonminimum-phase system is not stable.

We end with a very simple question: Is phase important in speech processing?
Probability Spaces

A formal definition of probability involves the specification of:

• a sample space
The sample space, S, is the set of all possible outcomes, plus the null outcome. Each element in S is called a sample point.

• a field (or algebra)
The field, or algebra, is a set of subsets of S closed under complementation and union (recall Venn diagrams).

• and a probability measure.
A probability measure obeys these axioms:
1. P(S) = 1 (implies probabilities are less than or equal to one)
2. P(A) ≥ 0
3. For two mutually exclusive events: P(A ∪ B) = P(A) + P(B)

Two events are said to be statistically independent if:

    P(A ∩ B) = P(A) P(B)

The conditional probability of B given A is:

    P(B|A) = P(B ∩ A) / P(A)

Hence,

    P(B ∩ A) = P(B|A) P(A)
(Venn diagrams: mutually exclusive events A and B, and overlapping events A and B, within the sample space S.)
Functions of Random Variables

Probability Density Functions:

(Figure: a discrete pdf (histogram) and a continuous pdf f(x), plotted over outcomes 1 through 4.)

Cumulative Distributions:

(Figure: the corresponding discrete and continuous cumulative distributions F(x).)

Probability of Events:

    P(2 < x ≤ 3) = ∫_2^3 f(x) dx = F(3) - F(2)
Important Probability Density Functions

Uniform (Unix rand function):

    f(x) = 1/(b - a),   a < x ≤ b
         = 0,           elsewhere

Gaussian:

    f(x) = (1 / sqrt(2πσ²)) exp( -(x - µ)² / 2σ² )

Laplacian (speech signal amplitude, durations):

    f(x) = (1 / sqrt(2σ²)) exp( -sqrt(2)|x| / σ )

Gamma (durations):

    f(x) = ( sqrt(k) / (2 sqrt(πx)) ) exp( -kx ),   x > 0

We can extend these concepts to N-dimensional space. For example:

    P(A ∈ A_x | B ∈ A_y) = P(A, B) / P(B)
                         = ∫_{A_x} ∫_{A_y} f(x, y) dy dx  /  ∫_{A_y} f(y) dy

Two random variables are statistically independent if:

    P(A, B) = P(A) P(B)

This implies:

    f_xy(x, y) = f_x(x) f_y(y)   and   F_xy(x, y) = F_x(x) F_y(y)
Mixture Distributions

A simple way to form a more generalizable pdf that still obeys some parametric shape is to use a weighted sum of one or more pdfs. We refer to this as a mixture distribution. Such distributions are useful to accommodate multimodal behavior:

(Figure: three component densities f1(x), f2(x), f3(x), centered at µ1, µ2, µ3, summed into a multimodal mixture.)

If we start with a Gaussian pdf:

    f_i(x) = (1 / sqrt(2πσ_i²)) exp( -(x - µ_i)² / 2σ_i² )

the most common form is a Gaussian mixture:

    f(x) = Σ_{i=1}^{N} c_i f_i(x)    where    Σ_{i=1}^{N} c_i = 1

Obviously this can be generalized to other shapes. Derivation of the optimal coefficients, however, is most often a nonlinear optimization problem.
Expectations and Moments

The statistical average of a scalar function, g(x), of a random variable is:

    E[g(x)] = Σ_i g(x_i) P(x = x_i)    (discrete case)

and

    E[g(x)] ≡ ∫_{-∞}^{∞} g(x) f(x) dx    (continuous case)

The central moment is defined as:

    E[(x - µ)^i] = ∫_{-∞}^{∞} (x - µ)^i f(x) dx

The joint central moment between two random variables is defined as:

    E[(x - µ_x)^i (y - µ_y)^k] = ∫_{-∞}^{∞} ∫_{-∞}^{∞} (x - µ_x)^i (y - µ_y)^k f(x, y) dx dy

We can define a correlation coefficient between two random variables as:

    ρ_xy = c_xy / (σ_x σ_y)

We can extend these concepts to a vector of random variables:

    x = [x_1, x_2, ..., x_N]^T

    f_x(x) = (1 / ((2π)^{N/2} |C_x|^{1/2})) exp( -(1/2)(x - µ)^T C_x⁻¹ (x - µ) )

What is the difference between a random vector and a random process? What does wide-sense stationary mean? Strict-sense stationary? What does it mean to have an ergodic random process? How does this influence our signal processing algorithms?
Correlation and Covariance of WSS Random Processes

For a signal, x(n), we can compute the following useful quantities:

Autocovariance:

    c(i, j) = E[ (x(n-i) - µ_i)(x(n-j) - µ_j) ]
            = E[ x(n-i) x(n-j) ] - µ_i µ_j
            = (1/N) Σ_{n=0}^{N-1} x(n-i)x(n-j)
              - [ (1/N) Σ_{n=0}^{N-1} x(n-i) ][ (1/N) Σ_{n=0}^{N-1} x(n-j) ]

If a random process is wide-sense stationary:

    c(i, j) = c(|i - j|, 0)

Hence, we define a very useful function known as the autocorrelation:

    r(k) = (1/N) Σ_{n=0}^{N-1} x(n) x(n-k)

If x(n) is zero-mean WSS: c(i, j) = r(|i - j|).

What is the relationship between the autocorrelation and the spectrum?

    DFT{ r(k) } = |X(k)|²

For a linear time-invariant system, h(n):

    DFT{ r_y(k) } = |DFT{ h(n) }|² DFT{ r_x(k) }

The notion of random noise is central to signal processing: white noise? Gaussian white noise? zero-mean Gaussian white noise? colored noise?

Therefore, we now embark upon one of the last great mysteries of life: How do we compare two random vectors?
Distance Measures

What is the distance between pt. a and pt. b?

(Figure: two points a and b plotted in a plane whose axes are "gold bars" and "diamonds".)

The N-dimensional real Cartesian space, denoted ℜ^N, is the collection of all N-dimensional vectors with real elements. A metric, or distance measure, is a real-valued function with three properties: for all x, y, z ∈ ℜ^N:

1. d(x, y) ≥ 0
2. d(x, y) = 0 if and only if x = y
3. d(x, y) ≤ d(x, z) + d(z, y)

The Minkowski metric of order s, or the l_s metric, between x and y is:

    d_s(x, y) ≡ ( Σ_{k=1}^{N} |x_k - y_k|^s )^{1/s} = ||x - y||_s

(the norm of the difference vector). Important cases are:

1. l_1, or city block metric (sum of absolute values):

    d_1(x, y) = Σ_{k=1}^{N} |x_k - y_k|

2. l_2, or Euclidean metric (mean-squared error):

    d_2(x, y) = ( Σ_{k=1}^{N} |x_k - y_k|² )^{1/2}

3. l_∞, or Chebyshev metric:

    d_∞(x, y) = max_k |x_k - y_k|
We can similarly define a weighted Euclidean distance metric:

    d_2w(x, y) = ( (x - y)^T W (x - y) )^{1/2}

where x = [x_1, x_2, ..., x_k]^T, y = [y_1, y_2, ..., y_k]^T, and W is a k × k weighting matrix with elements w_ij.

Why are Euclidean distances so popular? One reason is efficient computation. Suppose we are given a set of M reference vectors, x_m, a measurement, y, and we want to find the nearest neighbor:

    NN = min_m d_2(x_m, y)

This can be simplified as follows. We note the minimum of a square root is the same as the minimum of a square (both are monotonically increasing functions):

    d_2²(x_m, y) = Σ_{j=1}^{k} (x_mj - y_j)²
                 = Σ_{j=1}^{k} ( x_mj² - 2 x_mj y_j + y_j² )
                 = ||x_m||² - 2 x_m • y + ||y||²
                 = C_m + C_y - 2 x_m • y

Therefore,

    NN = min_m [ C_m - 2 x_m • y ]

Thus, if all reference vectors have the same magnitude, C_m can be ignored (normalized codebook). In fact, a Euclidean distance is virtually equivalent to a dot product (which can be computed very quickly on a vector processor).
Prewhitening of Features

Consider the problem of comparing features of different scales. Suppose we represent the points a and b in two coordinate systems using the transformation z = Vx.

System 1 uses the standard basis:

    β_1 = 1î + 0ĵ   and   β_2 = 0î + 1ĵ

System 2 uses a skewed, scaled basis:

    γ_1 = -2î + 0ĵ   and   γ_2 = -1î + 1ĵ

(Figure: the points a and b plotted in each coordinate system, and d_2(a, b) computed in each.)

The magnitude of the distance has changed. Though the rank-ordering of distances under such linear transformations won't change, the cumulative effects of such changes in distances can be damaging in pattern recognition. Why?
We can simplify the distance calculation in the transformed space:

    d_2²(Vx, Vy) = [Vx - Vy]^T [Vx - Vy]
                 = [x - y]^T V^T V [x - y]
                 = d_2W²(x, y)

This is just a weighted Euclidean distance with W = V^T V.

• If the covariance matrix of the transformed vector is a diagonal matrix, the transformation is said to be an orthogonal transform.
• If the covariance matrix is an identity matrix, the transform is said to be an orthonormal transform.
• A common approximation to this procedure is to assume the dimensions of x are uncorrelated but of unequal variances, and to approximate C by a diagonal matrix. This is known as variance-weighting. Why?

Suppose all dimensions of the vector are not equal in importance. For example, suppose one dimension has virtually no variation, while another is very reliable. Suppose two dimensions are statistically correlated. What is a statistically optimal transformation?

Consider a decomposition of the covariance matrix (which is symmetric):

    C = Φ Λ Φ^T

where Φ denotes a matrix of eigenvectors of C and Λ denotes a diagonal matrix whose elements are the eigenvalues of C. Consider:

    z = Λ^{-1/2} Φ x

The covariance of z, C_z, is easily shown to be an identity matrix (prove this!). We can also show that:

    d_2²(z_1, z_2) = [x_1 - x_2]^T C_x⁻¹ [x_1 - x_2]

Again, just a weighted Euclidean distance.
"Noise Reduction"

The prewhitening transform, z = Λ^{-1/2} Φ x, is normally created as a k × k matrix in which the eigenvalues are ordered from largest to smallest:

    z_1       λ_1^{-1/2}  [ v_11 v_12 ... v_1k ]  x_1
    z_2   =   λ_2^{-1/2}  [ v_21 v_22 ... v_2k ]  x_2
    ...                   ...                     ...
    z_k       λ_k^{-1/2}  [ v_k1 v_k2 ... v_kk ]  x_k

where λ_1 > λ_2 > ... > λ_k.

In this case, the coefficients of the eigenvectors tell us which dimensions of the input feature vector contribute most heavily to a dimension of the output feature vector. This is useful in determining the "meaning" of a particular feature (for example, the first decorrelated feature often is correlated with the overall spectral slope in a speech recognition system; this is sometimes an indication of the type of microphone).

A measure of the amount of discriminatory power contained in a feature, or a set of features, can be defined as follows:

    %var = Σ_{j=1}^{l} λ_j / Σ_{j=1}^{k} λ_j

This is the percent of the variance accounted for by the first l features.

Similarly, a new feature vector can be formed by truncating the transformation matrix to l < k rows. This is essentially discarding the least important features.
Computational Procedures

Computing a "noise-free" covariance matrix is often difficult. One might attempt to do something simple, such as:

    c_ij = (1/N) Σ_{n=0}^{N-1} (x_i - µ_i)(x_j - µ_j)    with    µ_i = (1/N) Σ_{n=0}^{N-1} x_i

On paper, this appears reasonable. However, often the complete set of feature vectors contains valid data (speech signals) and noise (non-speech signals). Hence, we will often compute the covariance matrix across a subset of the data, such as the particular acoustic event (a phoneme or word) we are interested in.

Second, the covariance matrix is often ill-conditioned. Stabilization procedures are used in which the elements of the covariance matrix are limited by some minimum value (a noise-floor or minimum SNR) so that the covariance matrix is better conditioned.

But how do we compute eigenvalues and eigenvectors on a computer? One of the hardest things to do numerically! Why? Suggestion: use a canned routine (see Numerical Recipes in C). The definitive source is EISPACK (originally implemented in Fortran, now available in C).

A simple method for symmetric matrices is known as the Jacobi transformation. In this method, a sequence of transformations is applied that sets one off-diagonal element to zero at a time. The product of the subsequent transformations is the eigenvector matrix.

Another method, known as the QR decomposition, factors the covariance matrix into a series of transformations:

    C = QR

where Q is orthogonal and R is upper triangular. This is based on a transformation known as the Householder transform that reduces columns of a matrix below the diagonal to zero.
Maximum Likelihood Classification

Consider the problem of assigning a measurement to one of two sets:

(Figure: two class-conditional densities f_{x|c}(x|c=1) and f_{x|c}(x|c=2), centered at µ_1 and µ_2, with a measurement x̂ between them.)

What is the best criterion for making a decision? Ideally, we would select the class for which the conditional probability is highest:

    c* = argmax_c P( (c = ĉ) | (x = x̂) )

However, we can't estimate this probability directly from the training data. Hence, we consider:

    c* = argmax_c P( (x = x̂) | (c = ĉ) )

By definition,

    P( (x = x̂) | (c = ĉ) ) = P( (c = ĉ), (x = x̂) ) / P(c = ĉ)

and

    P( (c = ĉ) | (x = x̂) ) = P( (c = ĉ), (x = x̂) ) / P(x = x̂)

from which we have

    P( (c = ĉ) | (x = x̂) ) = P( (x = x̂) | (c = ĉ) ) P(c = ĉ) / P(x = x̂)
Clearly, the choice of ĉ that maximizes the right side also maximizes the left side. Therefore,

    c* = argmax_c [ P( (x = x̂) | (c = ĉ) ) P(c = ĉ) ]
       = argmax_c [ P( (x = x̂) | (c = ĉ) ) ]    if the class probabilities are equal.

A quantity related to the probability of an event which is used to make a decision about the occurrence of that event is often called a likelihood measure. A decision rule that maximizes a likelihood is called a maximum likelihood decision. In a case where the number of outcomes is not finite, we can use an analogous continuous distribution.

It is common to assume a multivariate Gaussian distribution:

    f_{x|c}(x̂_1, ..., x̂_N | c) = f_{x|c}(x̂ | c)
        = (1 / ((2π)^{N/2} |C_{x|c}|^{1/2})) exp( -(1/2)(x̂ - µ_{x|c})^T C_{x|c}⁻¹ (x̂ - µ_{x|c}) )

We can elect to maximize the log, ln[f_{x|c}(x̂|c)], rather than the likelihood (we refer to this as the log likelihood). This gives the decision rule:

    c* = argmin_c [ (x̂ - µ_{x|c})^T C_{x|c}⁻¹ (x̂ - µ_{x|c}) + ln|C_{x|c}| ]

(Note that the maximization became a minimization.)

We can define a distance measure based on this as:

    d_ml(x̂, µ_{x|c}) = (x̂ - µ_{x|c})^T C_{x|c}⁻¹ (x̂ - µ_{x|c}) + ln|C_{x|c}|
Note that the distance is conditioned on each class mean and covariance. This is why "generic" distance comparisons are a joke.

If the mean and covariance are the same across all classes, this expression simplifies to:

    d_M(x̂, µ_{x|c}) = (x̂ - µ_{x|c})^T C_{x|c}⁻¹ (x̂ - µ_{x|c})

This is frequently called the Mahalanobis distance. But this is nothing more than a weighted Euclidean distance.

This result has a relatively simple geometric interpretation for the case of a single random variable with classes of equal variances:

(Figure: two one-dimensional class-conditional densities with means µ_1 and µ_2, and a decision threshold a between them.)

The decision rule involves setting a threshold:

    a = (µ_1 + µ_2)/2 + ( σ² / (µ_1 - µ_2) ) ln( P(c = 2) / P(c = 1) )

and, if x < a, then x ∈ (c = 1); else x ∈ (c = 2). If the variances are not equal, the threshold shifts towards the distribution with the smaller variance.

What is an example of an application where the classes are not equiprobable?
Probabilistic Distance Measures

How do we compare two probability distributions to measure their overlap? Probabilistic distance measures take the form:

    J = ∫_{-∞}^{∞} g{ f_{x|c}(x̂|c), c = 1, 2, ..., K } dx̂

where
1. J is nonnegative
2. J attains a maximum when all classes are disjoint
3. J = 0 when all classes have identical class-conditional distributions

Two important examples of such measures are:

(1) Bhattacharyya distance:

    J_B = -ln ∫_{-∞}^{∞} sqrt( f_{x|c}(x̂|1) f_{x|c}(x̂|2) ) dx̂

(2) Divergence:

    J_D = ∫_{-∞}^{∞} [ f_{x|c}(x̂|1) - f_{x|c}(x̂|2) ] ln( f_{x|c}(x̂|1) / f_{x|c}(x̂|2) ) dx̂

Both reduce to a Mahalanobis-like distance for the case of Gaussian vectors with equal class covariances. Such metrics will be important when we attempt to cluster feature vectors and acoustic models.
Probabilistic Dependence Measures

A probabilistic dependence measure indicates how strongly a feature is associated with its class assignment. Such measures take the form:

    J = ∫_{-∞}^{∞} g{ f_{x|c}(x̂|c), P(c = ĉ), c = 1, 2, ..., K } dx̂

When features are independent of their class assignment, the class-conditional pdfs are identical to the mixture pdf:

    f_{x|c}(x̂|c) = f_x(x̂)    for all c

When there is a strong dependence, the conditional distribution should be significantly different than the mixture.

An example of such a measure is the average mutual information:

    M_avg(c, x) = Σ_{c=1}^{K} P(c = ĉ) ∫ ... ∫ f_{x|c}(x̂|c) log₂( f_{x|c}(x̂|c) / f_x(x̂) ) dx̂

The discrete version of this is:

    M_avg(c, x) = Σ_{c=1}^{K} Σ_{l=1}^{L} P(c = ĉ) P(x = x_l | c = ĉ) log₂( P(x = x_l | c = ĉ) / P(x = x_l) )

Mutual information is closely related to entropy, as we shall see shortly.

Such distance measures can be used to cluster data and generate vector quantization codebooks. A simple and intuitive algorithm is known as the K-means algorithm:

Initialization: Choose K centroids
Recursion:
1. Assign all vectors to their nearest neighbor.
2. Recompute the centroids as the average of all vectors assigned to the same centroid.
3. Check the overall distortion. Return to step 1 if some distortion criterion is not met.
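The K-means recursion above can be sketched in C. This one-dimensional version uses a fixed iteration count as a simple stand-in for the distortion check, and assumes no more than 1024 data points; it is illustrative, not a production clusterer.

```c
#include <math.h>

/* One-dimensional K-means: assign each point to its nearest centroid,
   then recompute each centroid as the mean of its cluster. */
void kmeans_1d(const double *x, int N, double *centroid, int K, int iters)
{
    int assign[1024];   /* assumes N <= 1024 for this sketch */
    for (int it = 0; it < iters; it++) {
        /* Step 1: nearest-neighbor assignment */
        for (int n = 0; n < N; n++) {
            int best = 0;
            for (int k = 1; k < K; k++)
                if (fabs(x[n] - centroid[k]) < fabs(x[n] - centroid[best]))
                    best = k;
            assign[n] = best;
        }
        /* Step 2: recompute centroids as cluster means */
        for (int k = 0; k < K; k++) {
            double sum = 0.0;
            int cnt = 0;
            for (int n = 0; n < N; n++)
                if (assign[n] == k) { sum += x[n]; cnt++; }
            if (cnt > 0) centroid[k] = sum / cnt;
        }
    }
}
```

With data {0, 1, 9, 10} and initial centroids {0, 10}, the procedure converges immediately to centroids 0.5 and 9.5.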
What Is Information? (When not to bet at a casino...)

Consider two distributions of discrete random variables:

(Figure: a uniform random variable x with f(x) = 1/4 for each of c = 1, ..., 4, and a random variable y with f(y) = 1/8 for c = 1, 2, 4 and f(y) = 5/8 for c = 3.)

Which variable is more unpredictable?

Now, consider sampling random numbers from a random number generator whose statistics are not known. Assuming the underlying distribution is one of the above, how much more information do we receive with each new number we draw? The answer lies in the shape of the distributions. For the random variable x, each class is equally likely. Each new number we draw provides the maximum amount of information, because, chances are, it will be from a different class (so we discover a new class with every number). On the other hand, for y, c = 3 will occur 5 times more often than the other classes, so each new sample will not provide as much information. The more numbers we draw, the more we discover about the underlying distribution.

We can define the information associated with each class, or outcome, as:

    I(c = ĉ) ≡ log₂( 1 / P(c = ĉ) ) = -log₂ P(c = ĉ)

Since 0 ≤ P(c = ĉ) ≤ 1, information is a positive quantity. A base 2 logarithm is used so that information is measured in bits: K equally likely outcomes are measured in log₂K bits. For the distributions above:

    I_x(c = 1) = -log₂(1/4) = 2 bits
    I_y(c = 1) = -log₂(1/8) = 3 bits

Huh??? Does this make sense?
What is Entropy?

Entropy is the expected (average) information across all outcomes:

    H(c) ≡ E[I(c)] = -Σ_{k=1}^{K} P(c = c_k) log₂ P(c = c_k)

Entropy using log₂ is also measured in bits, since it is an average of information. For the two distributions above:

    H_x = -Σ_{k=1}^{4} (1/4) log₂(1/4) = 2.0 bits

    H_y = -[ 3 × (1/8) log₂(1/8) + (5/8) log₂(5/8) ] ≈ 1.55 bits

We can generalize this to a joint outcome of N random vectors from the same distribution, which we refer to as the joint entropy:

    H[x(1), ..., x(N)] = -Σ_{l_1} ... Σ_{l_N} P[(x(1) = x_{l_1}), ..., (x(N) = x_{l_N})]
                            × log₂ P[(x(1) = x_{l_1}), ..., (x(N) = x_{l_N})]

If the random vectors are statistically independent:

    H[x(1), ..., x(N)] = Σ_{n=1}^{N} H[x(n)]

If the random vectors are independent and identically distributed:

    H[x(1), ..., x(N)] = N H[x(1)]

We can also define conditional entropy as:

    H(x|y) = -Σ_{k=1}^{K} Σ_{l=1}^{L} P((x = x_l), (y = y_k)) log₂ P((x = x_l) | (y = y_k))

For continuous distributions, we can define an analogous quantity for entropy:

    H = -∫_{-∞}^{∞} f(x) log₂ f(x) dx    (bits)

A zero-mean Gaussian random variable has maximum entropy for a given variance: (1/2) log₂(2πeσ²). Why?
Mutual Information

The pairing of random vectors produces less information than the events taken individually. Stated formally:

    I(x, y) ≤ I(x) + I(y)

The shared information between these events is called the mutual information, and is defined as:

    M(x, y) ≡ [ I(x) + I(y) ] - I(x, y)
            = log₂( P(x, y) / (P(x) P(y)) )
            = I(x) - I(x|y) = I(y) - I(y|x)

This emphasizes the idea that this is information shared between these two random variables. From this definition, we note:

    M(x, y) = H(x) - H(x|y) = H(y) - H(y|x)

Also note that if x and y are independent, then there is no mutual information between them.

Note that to compute mutual information between two random variables, we need a joint probability density function. We can define the average mutual information as the expectation of the mutual information:

    M_avg(x, y) = E[ log₂( P(x, y) / (P(x) P(y)) ) ]
                = Σ_{k=1}^{K} Σ_{l=1}^{L} P((x = x_l), (y = y_k))
                      log₂( P((x = x_l), (y = y_k)) / (P(x = x_l) P(y = y_k)) )
How Does Entropy Relate To DSP?

Consider a window of a signal:

(Figure: a signal x(n) with a finite analysis window.)

What does the sampled z-transform assume about the signal outside the window? What does the DFT assume about the signal outside the window? How do these influence the resulting spectrum that is computed?

What other assumptions could we make about the signal outside the window? How many valid signals are there?

How about finding the spectrum that corresponds to the signal that matches the measured signal within the window, and has maximum entropy? What does this imply about the signal outside the window?

This is known as the principle of maximum entropy spectral estimation. Later we will see how this relates to minimizing the mean-square error.
Session III: Signal Measurements
Short-Term Measurements

What is the point of this lecture?

(Figure: a speech signal analyzed with several short-term smoothers:
  T_f = 5 ms,  T_w = 10 ms
  T_f = 10 ms, T_w = 20 ms
  T_f = 20 ms, T_w = 30 ms
  T_f = 20 ms, T_w = 30 ms, Hamming window
  T_f = 20 ms, T_w = 60 ms, Hamming window
  recursive 50 Hz lowpass filter)
Time/Frequency Properties of Windows

    %Overlap = (T_w - T_f) / T_w × 100%

• Generalized Hanning:

    w_H(k) = w(k) [α + (1 - α) cos(2πk/N)],    0 < α < 1

    α = 0.54: Hamming window
    α = 0.50: Hanning window
Recursive-in-Time Approaches

Define the short-term estimate of the power as:

    P(n) = (1/N_s) Σ_{m=0}^{N_s - 1} [ w(m) s(n - N_s/2 + m) ]²

We can view the above operation as a moving-average filter applied to the sequence s²(n). This can be computed recursively using a linear constant-coefficient difference equation:

    P(n) = -Σ_{i=1}^{N_a} a_pw(i) P(n-i) + Σ_{j=1}^{N_b} b_pw(j) s²(n-j)

Common forms of this general equation are:

    P(n) = αP(n-1) + s²(n)                        (leaky integrator)
    P(n) = αP(n-1) + (1 - α)s²(n)                 (first-order weighted average)
    P(n) = αP(n-1) + βP(n-2) + s²(n) + γs²(n-1)   (second-order integrator)

Of course, these are nothing more than various types of lowpass filters, or adaptive controllers. How do we compute the constants for these equations? In what other applications have we seen such filters?
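The first-order weighted average above is one line of C per sample; a sketch (the function name is ours):

```c
/* First-order weighted average power estimate:
   P(n) = alpha * P(n-1) + (1 - alpha) * s(n)^2.
   Call once per sample, feeding back the previous estimate. */
double power_update(double P_prev, double s, double alpha)
{
    return alpha * P_prev + (1.0 - alpha) * s * s;
}
```

For a constant-amplitude input the estimate converges to the input power: with s(n) = 1 and α = 0.9, P(n) approaches 1.0, since the fixed point of P = 0.9P + 0.1 is 1.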
Relationship to Control Systems

The first-order systems can be related to physical quantities by observing that the system consists of one real pole:

    H(z) = 1 / (1 - αz⁻¹)

α can be defined in terms of the bandwidth of the pole.

For second-order systems, we have a number of alternatives. Recall that a second-order system can consist of at most one zero and one pole and their complex conjugates. Classical filter design algorithms can be used to design the filter in terms of a bandwidth and an attenuation.

An alternate approach is to design the system in terms of its unit-step response:

(Figure: the unit-step response P(n) of a second-order system, annotated with rise time, overshoot, settling time, fall time, and the final-response threshold, along with the equivalent impulse response h(n).)

There are many forms of such controllers (often known as servo-controllers). One very interesting family of such systems are those that correct to the velocity and acceleration of the input. All such systems can be implemented as a digital filter.
Short-Term Autocorrelation and Covariance

Recall our definition of the autocorrelation function:

    r(k) = (1/N) Σ_{n=k}^{N-1} s(n)s(n-k) = (1/N) Σ_{n=0}^{N-1-k} s(n)s(n+k),    0 < k < p

Note:
• this can be regarded as a dot product of s(n) and s(n + k).
• let's not forget preemphasis, windowing, and centering the window w.r.t. the frame; scaling is optional.

What would C++ code look like?

array-style:

    for(k=0; k<M; k++) {
        r[k] = 0.0;
        for(n=0; n<N-k; n++) {
            r[k] += s[n] * s[n+k];
        }
    }

pointer-style:

    for(k=0; k<M; k++) {
        r[k] = 0.0;
        s = sig;
        sk = sig + k;
        for(n=0; n<N-k; n++) {
            r[k] += (*s++) * (*sk++);
        }
    }

We note that we can save some multiplications by reusing products:

    r[3] = s[3](s[0] + s[6]) + s[4](s[1] + s[7]) + ... + s[N]s[N-3]

This is known as the factored autocorrelation computation. It saves about 25% CPU, replacing multiplications with additions and more complicated indexing.

Similarly, recall our definition of the covariance function:

    c(k, l) = (1/N) Σ_{n=p}^{N-1} s(n-k)s(n-l),    0 < l < k < p

Note:
• we use N - p points
• it is symmetric, so that only the k ≥ l terms need to be computed

This can be simplified using the recursion:

    c(k, l) = c(k-1, l-1) + s(p-k)s(p-l) - s(N-k)s(N-l)
Example of the Autocorrelation Function

(Figure: autocorrelation functions for the word "three", comparing the consonant portion of the waveform to the vowel (256-point Hamming window).)

Note:
• the shape for the low-order lags: what does this correspond to?
• the regularity of peaks for the vowel: why?
• the exponentially-decaying shape: which harmonic?
• what does a negative correlation value mean?
An Application of Short-Term Power: Speech SNR Measurement

Problem: Can we estimate the SNR from a speech file?

(Figure: a speech waveform, its energy histogram p(E), and the cumulative distribution cdf(E); the 20% point of the cdf marks the nominal noise level, and the 80% point marks the nominal signal+noise level.)

The SNR can be defined as:

    SNR = 10 log₁₀( ((E_s + E_n) - E_n) / E_n ) = 10 log₁₀( E_s / E_n )

What percentiles to use? Typically, 80%/20%, 85%/15%, or 95%/15% are used.
Linear Prediction (General)

Let us define a speech signal, s(n), and a predicted value:

    s̃(n) = Σ_{k=1}^{p} α_k s(n-k)

Why the added terms?

The prediction error is given by:

    e(n) = s(n) - s̃(n) = s(n) - Σ_{k=1}^{p} α_k s(n-k)

We would like to minimize the error by finding the best, or optimal, value of {α_k}. Let us define the short-time average prediction error:

    E = Σ_n e²(n)
      = Σ_n [ s(n) - Σ_{k=1}^{p} α_k s(n-k) ]²
      = Σ_n s²(n) - 2 Σ_n s(n) [ Σ_{k=1}^{p} α_k s(n-k) ] + Σ_n [ Σ_{k=1}^{p} α_k s(n-k) ]²

We can minimize the error w.r.t. α_l for each 1 ≤ l ≤ p by differentiating E and setting the result equal to zero:

    ∂E/∂α_l = 0 = -2 Σ_n s(n)s(n-l) + 2 Σ_n [ Σ_{k=1}^{p} α_k s(n-k) ] s(n-l)
Linear Prediction (Cont.)
Rearranging terms:
  Σ_n s(n)s(n-l) = Σ_{k=1}^{p} α_k Σ_n s(n-k)s(n-l)
or,
  c(l, 0) = Σ_{k=1}^{p} α_k c(k, l)
This equation is known as the linear prediction (Yule-Walker) equation. {α_k} are known as linear prediction coefficients, or predictor coefficients.
By enumerating the equations for each value of l, we can express this in matrix form:
  c = C α
where,
  C = [ c(1,1) c(1,2) … c(1,p) ; c(2,1) c(2,2) … c(2,p) ; … ; c(p,1) c(p,2) … c(p,p) ]
  α = [ α_1, α_2, …, α_p ]^T
  c = [ c(1,0), c(2,0), …, c(p,0) ]^T
The solution to this equation involves a matrix inversion:
  α = C⁻¹ c
and is known as the covariance method. Under what conditions does C⁻¹ exist?
Note that the covariance matrix is symmetric. A fast algorithm to find the solution to this equation is known as the Cholesky decomposition (a V D V^T approach in which the covariance matrix is factored into lower and upper triangular matrices).
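The covariance method can be sketched with plain normal-equation algebra. This is illustrative only: it substitutes textbook Gaussian elimination for the Cholesky factorization mentioned above, and the AR(2) test signal and its coefficients are invented for the demonstration:

```python
import random

def gauss_solve(A, b):
    # Plain Gaussian elimination with partial pivoting
    # (a stand-in for the Cholesky decomposition of the symmetric C)
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for cidx in range(col, n + 1):
                M[r][cidx] -= f * M[col][cidx]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def covariance_lp(s, p):
    # Covariance method: c(k, l) = sum_{n=p}^{N-1} s(n-k) s(n-l); solve C a = c
    N = len(s)
    def c(k, l):
        return sum(s[n - k] * s[n - l] for n in range(p, N))
    C = [[c(k, l) for l in range(1, p + 1)] for k in range(1, p + 1)]
    rhs = [c(k, 0) for k in range(1, p + 1)]
    return gauss_solve(C, rhs)

# Synthetic AR(2) signal with known coefficients (illustrative values)
random.seed(1)
a_true = [1.3, -0.6]
s = [0.0, 0.0]
for _ in range(3000):
    s.append(a_true[0] * s[-1] + a_true[1] * s[-2] + random.gauss(0.0, 1.0))
a_est = covariance_lp(s, 2)
# a_est recovers approximately [1.3, -0.6]
```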
The Autocorrelation Method
Using a different interpretation of the limits on the error minimization — forcing data only within the frame to be used — we can compute the solution to the linear prediction equation using the autocorrelation method:
  α = R⁻¹ r
where,
  R = [ r(0) r(1) … r(p-1) ; r(1) r(0) … r(p-2) ; … ; r(p-1) r(p-2) … r(0) ]
  α = [ α_1, α_2, …, α_p ]^T
  r = [ r(1), r(2), …, r(p) ]^T
Note that R is symmetric, and all of the elements along the diagonal are equal, which means (1) an inverse always exists, and (2) the roots are in the left-half plane.
The linear prediction process can be viewed as a filter by noting:
  e(n) = s(n) - Σ_{k=1}^{p} α_k s(n-k)
and,
  E(z) = S(z) A(z),  where  A(z) = 1 - Σ_{k=1}^{p} α_k z^{-k}
A(z) is called the analyzer; what type of filter is it? (pole/zero? phase?)
1/A(z) is called the synthesizer; under what conditions is it stable?
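The analyzer A(z) is just the inverse filter e(n) = s(n) - Σ α_k s(n-k). A minimal sketch (not from the course): for a damped sinusoid r^n cos(ωn), the exact second-order predictor is α_1 = 2r cos ω, α_2 = -r², so inverse filtering should drive the residual energy to essentially zero:

```python
import math

def lp_residual(s, a):
    # Inverse filter (analyzer) A(z): e(n) = s(n) - sum_k a[k-1] s(n-k)
    p = len(a)
    return [s[n] - sum(a[k] * s[n - 1 - k] for k in range(p))
            for n in range(p, len(s))]

# Strongly predictable signal: damped sinusoid (a two-pole resonance)
r_pole, omega = 0.9, 0.2 * math.pi
s = [r_pole ** n * math.cos(omega * n) for n in range(200)]
a = [2.0 * r_pole * math.cos(omega), -r_pole ** 2]   # exact 2nd-order predictor
e = lp_residual(s, a)
energy_in = sum(x * x for x in s[2:])
energy_out = sum(x * x for x in e)
# energy_out is at floating-point noise level: the analyzer whitens the signal
```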
Linear Prediction Error
We can return to our expression for the error:
  E = Σ_n e²(n) = Σ_n [ s(n) - Σ_{k=1}^{p} α_k s(n-k) ]²
and substitute our expression for {α_k} and show that:
Autocorrelation Method:
  E = r(0) - Σ_{k=1}^{p} α_k r(k)
Covariance Method:
  E = c(0, 0) - Σ_{k=1}^{p} α_k c(0, k)
Later, we will discuss the properties of these equations as they relate to the magnitude of the error. For now, note that the same linear prediction equation that applied to the signal applies to the autocorrelation function, except that samples are replaced by the autocorrelation lags (and hence the delay term is replaced by a lag index). Since the same coefficients satisfy both equations, this confirms our hypothesis that this is a model of the minimum-phase version of the input signal.
Linear prediction has numerous formulations, including the covariance method, autocorrelation formulation, lattice method, inverse filter formulation, spectral estimation formulation, maximum likelihood formulation, and inner product formulation. Discussions are found in disciplines ranging from system identification, econometrics, signal processing, probability, and statistical mechanics, to operations research.
Levinson-Durbin Recursion
The predictor coefficients can be efficiently computed for the autocorrelation method using the Levinson-Durbin recursion:
  E_0 = r(0)
  for i = 1, 2, …, p:
    k_i = [ r(i) - Σ_{j=1}^{i-1} a_{i-1}(j) r(i-j) ] / E_{i-1}
    a_i(i) = k_i
    a_i(j) = a_{i-1}(j) - k_i a_{i-1}(i-j),  for j = 1, 2, …, i-1
    E_i = (1 - k_i²) E_{i-1}
This recursion gives us great insight into the linear prediction process. First, we note that the intermediate variables, {k_i}, are referred to as reflection coefficients.
Example: p = 2
  E_0 = r(0)
  k_1 = r(1) / r(0)
  a_1(1) = k_1 = r(1) / r(0)
  E_1 = (1 - k_1²) E_0 = [r²(0) - r²(1)] / r(0)
  k_2 = [r(2)r(0) - r²(1)] / [r²(0) - r²(1)]
  a_2(2) = k_2 = [r(2)r(0) - r²(1)] / [r²(0) - r²(1)]
  a_2(1) = a_1(1) - k_2 a_1(1) = [r(1)r(0) - r(1)r(2)] / [r²(0) - r²(1)]
  α_1 = a_2(1)
  α_2 = a_2(2)
This reduces the LP problem to O(p²) and saves an order of magnitude in computational complexity, and makes this analysis amenable to fixed-point digital signal processors and microprocessors.
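The recursion translates almost line for line into code. A minimal sketch (the toy autocorrelation values are invented for the check against the p = 2 example above):

```python
def levinson_durbin(r, p):
    # Levinson-Durbin recursion: solves the Yule-Walker equations in O(p^2)
    a = [0.0] * (p + 1)          # a[j] holds a_i(j); a[0] unused
    E = r[0]                     # E_0 = r(0)
    k = []
    for i in range(1, p + 1):
        ki = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / E
        k.append(ki)
        a_new = a[:]
        a_new[i] = ki                            # a_i(i) = k_i
        for j in range(1, i):
            a_new[j] = a[j] - ki * a[i - j]      # a_i(j) = a_{i-1}(j) - k_i a_{i-1}(i-j)
        a = a_new
        E = (1.0 - ki * ki) * E                  # E_i = (1 - k_i^2) E_{i-1}
    return a[1:], k, E

# Toy values: r(0) = 1.0, r(1) = 0.5, r(2) = 0.1
a, k, E = levinson_durbin([1.0, 0.5, 0.1], 2)
# Reproduces the p = 2 example: k_1 = 0.5, k_2 = -0.2,
# alpha_1 = 0.6, alpha_2 = -0.2, E_2 = 0.72
```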
Linear Prediction Coefficient Transformations
The predictor coefficients and reflection coefficients can be transformed back and forth with no loss of information:
Predictor to reflection coefficient transformation:
  for i = p, p-1, …, 1:
    k_i = a_i(i)
    a_{i-1}(j) = [ a_i(j) + k_i a_i(i-j) ] / (1 - k_i²),  1 ≤ j ≤ i-1
Reflection to predictor coefficient transformation:
  for i = 1, 2, …, p:
    a_i(i) = k_i
    a_i(j) = a_{i-1}(j) - k_i a_{i-1}(i-j),  1 ≤ j ≤ i-1
Also, note that these recursions require intermediate storage for {a_i}.
From the above recursions, it is clear that |k_i| ≠ 1. In fact, there are several important results related to k_i:
(1) |k_i| < 1 implies a stable synthesis filter;
(2) |k_i| = 1 implies a harmonic process (poles on the unit circle);
(3) |k_i| > 1 implies an unstable synthesis filter (poles outside the unit circle);
(4) E_0 > E_1 > … > E_p.
This gives us insight into how to determine the LP order during the calculations. We also see that reflection coefficients are orthogonal in the sense that the best order-p model is also the first p coefficients in the order-(p+1) LP model (very important!).
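Both transformations above can be sketched and checked against each other as a round trip (the coefficient values below are illustrative, chosen to match the earlier p = 2 example):

```python
def reflection_to_predictor(k):
    # Forward recursion: builds a_p(.) from reflection coefficients k_1..k_p
    a = [0.0]                          # 1-based: a[j] = a_i(j)
    for i in range(1, len(k) + 1):
        new = [0.0] * (i + 1)
        new[i] = k[i - 1]                              # a_i(i) = k_i
        for j in range(1, i):
            new[j] = a[j] - k[i - 1] * a[i - j]        # step-up recursion
        a = new
    return a[1:]

def predictor_to_reflection(a):
    # Backward recursion: recovers k_i for i = p, p-1, ..., 1
    p = len(a)
    a_i = [0.0] + list(a)
    k = [0.0] * (p + 1)
    for i in range(p, 0, -1):
        k[i] = a_i[i]
        prev = [0.0] * i
        for j in range(1, i):
            prev[j] = (a_i[j] + k[i] * a_i[i - j]) / (1.0 - k[i] * k[i])
        a_i = prev
    return k[1:]

a = reflection_to_predictor([0.5, -0.2])     # -> [0.6, -0.2], as in the example
k = predictor_to_reflection(a)               # round trip -> [0.5, -0.2]
```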
Spectral-Matching Interpretation of Linear Prediction
Noise-Weighting
Typically. we use the “real” cepstrum: C s ( ω ) = log S ( f ) = log V ( f )U ( f ) = log V ( f ) + log U ( f ) V ( f ) and U ( f ) manifest themselves at the low and high end of the “quefrency” domain respectively. 1016 LP coefﬁcients are used to generate 1012 cepstral coefﬁcients. INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING . = – 1 p –1 –1 dz dz dz n = 1 –k 1– αk z ∑ ∑ k=1 which simpliﬁes to c1 = a1 n–1 cn = ∑ ( 1 – k ⁄ n )αk cn – k + an k=1 Note that the order of the cepstral coefﬁcients need not be the same as the order of the LP model. we can differentiate both sides is taken with respect to z : ∞ –n 1 d d d cn z { ln A ( z ) } = ln . We can derive cepstral parameters directly from LP analysis: ln A ( z ) = C ( z ) = ∑ cn z n=1 –1 ∞ –n To obtain the relationship between cepstral and predictor coefﬁcients.MAY 1517. 1996 TEXAS INSTRUMENTS PAGE 63 OF 147 The Real Cepstrum Goal: Deconvolve spectrum for multiplicative processes In practice.
Session IV: Signal Processing In Speech Recognition
THE SIGNAL MODEL ("FRONT-END")
[Block diagram:]
Speech → Digital Signal Processing (Spectral Shaping) → Conditioned Signal → Spectral Analysis → Spectral Measurements → Spectral Modeling → Spectral Parameters → Parametric Transform → Observation Vectors → Speech Recognition
A TYPICAL "FRONT-END"
[Block diagram:]
Signal → Fourier Transform → Cepstral Analysis → mel-spaced cepstral coefficients and energy (absolute spectral measurements)
  → Time-Derivative → first derivative (rate of change) and delta-energy
  → Time-Derivative → second derivative
Putting It All Together
Mel Cepstrum
Bark scale:
  f_Bark = 13 atan( 0.76 f / 1000 ) + 3.5 atan( (f / 7500)² )
Critical bandwidth:
  BW_crit = 25 + 75 [ 1 + 1.4 (f / 1000)² ]^0.69
Mel scale:
  f_mel = 2595 log10( 1 + f / 700 )
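These scale formulas drop straight into code. A minimal sketch (the spot checks are illustrative; roughly, 1000 Hz maps to about 1000 mel on this scale):

```python
import math

def hz_to_mel(f):
    # f_mel = 2595 log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_bark(f):
    # f_Bark = 13 atan(0.76 f / 1000) + 3.5 atan((f / 7500)^2)
    return 13.0 * math.atan(0.76 * f / 1000.0) + 3.5 * math.atan((f / 7500.0) ** 2)

def critical_bandwidth(f):
    # BW_crit = 25 + 75 (1 + 1.4 (f / 1000)^2)^0.69
    return 25.0 + 75.0 * (1.0 + 1.4 * (f / 1000.0) ** 2) ** 0.69

m1k = hz_to_mel(1000.0)      # close to 1000 mel
b500 = hz_to_bark(500.0)     # roughly 4.7 Bark
bw100 = critical_bandwidth(100.0)   # roughly 100 Hz, matching the table
```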
Mel Cepstrum Computation Via A Filter Bank

        Bark Scale              Mel Scale
Index   Center    BW (Hz)      Center    BW (Hz)
        Freq.(Hz)              Freq.(Hz)
  1        50       100           100       100
  2       150       100           200       100
  3       250       100           300       100
  4       350       100           400       100
  5       450       110           500       100
  6       570       120           600       100
  7       700       140           700       100
  8       840       150           800       100
  9      1000       160           900       100
 10      1170       190          1000       124
 11      1370       210          1149       160
 12      1600       240          1320       184
 13      1850       280          1516       211
 14      2150       320          1741       242
 15      2500       380          2000       278
 16      2900       450          2297       320
 17      3400       550          2639       367
 18      4000       700          3031       422
 19      4800       900          3482       484
 20      5800      1100          4000       556
 21      7000      1300          4595       639
 22      8500      1800          5278       734
 23     10500      2500          6063       843
 24     13500      3500          6964       969
Mel Cepstrum Computation Via Oversampling Using An FFT
[Figure: samples of the spectral envelope S(f) taken at frequencies …, f_{i-1}, f_i, …]
• Note that an FFT yields frequency samples at (k/N) f_s
• Oversampling provides a smoother estimate of the envelope of the spectrum
• Other analogous efficient sampling techniques exist for different frequency scales (bilinear transform, sampled autocorrelation, etc.)
Perceptual Linear Prediction
Goals:
• Apply greater weight to perceptually-important portions of the spectrum
• Avoid uniform weighting across the frequency band
Algorithm:
• Compute the spectrum via a DFT
• Warp the spectrum along the Bark frequency scale
• Convolve the warped spectrum with the power spectrum of the simulated critical band masking curve and downsample (to typically 18 spectral samples)
• Pre-emphasize by the simulated equal-loudness curve
• Simulate the nonlinear relationship between intensity and perceived loudness by performing a cubic-root amplitude compression
• Compute an LP model
Claims:
• Improved speaker-independent recognition performance
• Increased robustness to noise, variations in the channel, and microphones
Linear Regression and Parameter Trajectories
Premise: Time differentiation of features is a noisy process.
Approach: Fit a polynomial to the data to provide a smooth trajectory for a parameter; use closed-loop estimation of the polynomial coefficients.
Static feature:
  s(n) = c_k(n)
Dynamic feature:
  ṡ(n) ≈ c_k(n+Δ) - c_k(n-Δ)
Acceleration feature:
  s̈(n) ≈ ṡ(n+Δ) - ṡ(n-Δ)
We can generalize this using an r-th order regression analysis:
  R_rk(t, ΔT) = [ Σ_{X=1}^{L} P_r(X, L) C_k( t + (X - (L+1)/2) ΔT ) ] / [ Σ_{X=1}^{L} P_r²(X, L) ]
where L = T/ΔT (the number of analysis frames in time length T) is odd, and the orthogonal polynomials are of the form:
  P_0(X, L) = 1
  P_1(X, L) = X
  P_2(X, L) = X² - (1/12)(L² - 1)
  P_3(X, L) = X³ - (1/20)(3L² - 7) X
This approach has been generalized in such a way that the weights on the coefficients can be estimated directly from training data to maximize the likelihood of the estimated feature (maximum likelihood linear regression).
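A first-order regression over a window of frames is the familiar "delta" feature. The sketch below uses the common normalized-slope form (as popularized by HTK-style front-ends), not necessarily the exact weighting on the slide, and the ramp data is invented for the check:

```python
def delta(c, t, D=2):
    # First-order regression (slope) over +/- D frames:
    # delta(t) = sum_{d=1}^{D} d (c[t+d] - c[t-d]) / (2 sum_{d=1}^{D} d^2)
    # Frame indices are clamped at the sequence boundaries.
    T = len(c)
    num = sum(d * (c[min(t + d, T - 1)] - c[max(t - d, 0)])
              for d in range(1, D + 1))
    den = 2.0 * sum(d * d for d in range(1, D + 1))
    return num / den

# A parameter moving at a constant rate of 0.5 per frame:
ramp = [0.5 * n for n in range(20)]
slope = delta(ramp, 10)    # recovers the true rate, 0.5
```

Because the regression averages over several frames, it is far less noisy than a simple two-point difference, which is exactly the premise stated above.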
Session V: Dynamic Programming
The Principle of Dynamic Programming
• An efficient algorithm for finding the optimal path through a network
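The classic speech application of this principle is dynamic time warping: aligning a test pattern against a reference. A minimal sketch (not from the course; it uses the simple symmetric local path with predecessors (i-1, j), (i, j-1), (i-1, j-1) and no slope constraints):

```python
def dtw(test, ref, dist=lambda a, b: abs(a - b)):
    # Dynamic programming alignment: D(i, j) = d(i, j) + min of the three
    # allowed predecessors; D(n, m) is the optimal cumulative distance.
    INF = float("inf")
    n, m = len(test), len(ref)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(test[i - 1], ref[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A time-stretched copy of the reference aligns with zero cost:
cost_same = dtw([1, 2, 2, 3], [1, 2, 3])
cost_diff = dtw([1, 2, 3], [1, 2, 4])
```

The grid has n·m cells, so the search is O(nm) rather than exponential in the number of possible warping paths — precisely the efficiency the principle buys.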
Word Spotting Via Relaxed and Unconstrained Endpointing
• Endpoints, or boundaries, need not be fixed — numerous types of constraints can be invoked
Slope Constraints: Increased Efficiency and Improved Performance
• Local constraints used to achieve slope constraints
DTW, Syntactic Constraints, and Beam Search
Consider the problem of connected digit recognition: "325 1739". In the simplest case, any digit can follow any other digit, but we might know the exact number of digits spoken. An elegant solution to the problem of finding the best overall sentence hypothesis is known as level building (typically assumes models are the same length).
[Figure: a grid of reference models R1-R5 against test frames F1-F30, showing possible word endings for the first word and possible starts for the second word.]
• Though this algorithm is no longer widely used, it gives us a glimpse into the complexity of the syntactic pattern recognition problem.
MAY 1517. Major weakness is the assumption that all models are the same length! F1 F5 F10 INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING . etc. 1996 TEXAS INSTRUMENTS PAGE 79 OF 147 Level Building For An Unknown Number Of Words Reference M=5 R5 M=4 R4 M=3 Beam R3 M=2 R2 M=1 R1 • • • • • F20 F25 F30 F15 Test Paths can terminate on any level boundary indicating a different number of words was recognized (note the signiﬁcant increase in complexity) A search band around the optimal path can be maintained to reduce the search space Nextbest hypothesis can be generated (Nbest) Heuristics can be applied to deal with free endpoints. insertion of silence between words.
The One-Stage Algorithm ("Bridle Algorithm")
The level building approach is not conducive to models of different lengths, and does not make it easy to include syntactic constraints (which words can follow previously hypothesized words). An elegant algorithm to perform this search in one pass is demonstrated below:
[Figure: reference models R1-R3 stacked against the test axis, with paths allowed to jump from the end of any model to the start of any model.]
• Very close to current state-of-the-art doubly-stochastic algorithms (HMM)
• Conceptually simple, but difficult to implement, because we must remember information about the interconnections of hypotheses
• Amenable to beam-search concepts and fast-match concepts
• Supports syntactic constraints by limiting the choices for extending a hypothesis
• Becomes complex when extended to allow arbitrary amounts of silence between words
• How do we train?
Introduction of Syntactic Information
• The search space for vocabularies of hundreds of words can become unmanageable if we allow any word to follow any other word (often called the no-grammar case)
• Our rudimentary knowledge of language tells us that, in reality, only a small subset of the vocabulary can follow a given word hypothesis, and that this subset is sensitive to the given word (we often refer to this as "context-sensitive")
• In real applications, user-interface design is crucial (much like the problem of designing GUIs), and normally results in a specification of a language or collection of sentence patterns that are permissible
• A simple way to express and manipulate this information in a dynamic programming framework is via a state machine:
  Start → A → B → C → D → E → Stop
For example, when you enter state C, you output one of the following words: {daddy, mommy}. If:
  state A: give
  state B: me
  state C: {daddy, mommy}
  state D: come
  state E: here
we can generate phrases such as: "Daddy give me …"
• We can represent such information numerous ways (as we shall see)
Early Attempts At Introducing Syntactic Information Were "Ad-Hoc"
[Block diagram, bottom to top:]
Speech Signal → Feature Extractor → Measurements → Unconstrained Endpoint Dynamic Programming (Word Spotting), using Reference Models → word scores P(w_1), P(w_2), P(w_3), P(w_4), …, P(w_i) → Finite Automaton → Recognized Sequence of Words ("Sentences")
BASIC TECHNOLOGY — A PATTERN RECOGNITION PARADIGM BASED ON HIDDEN MARKOV MODELS
Recognized Symbols:
  P(S | O) = arg max over S of Π_{t=1}^{T} P(W_t | O_t, O_{t-1}, …)
Signal Model (pattern matching):
  P(O_t | W_{t-1}, W_t, W_{t+1})
Language Model (prediction):
  P(W_t | W_{t-1}, …)
Search Algorithms:
  P(W_t | O_t) = P(O_t | W_t) P(W_t) / P(O_t)
MAY 1517. accuracy Log(P) A Search Error? Best Path Time INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING . suboptimal solutions yield good performance • Beam search algorithms are used to tradeoff complexity vs. 1996 TEXAS INSTRUMENTS PAGE 84 OF 147 BEAM SEARCH • The modern view of speech recognition is a problem consisting of two operations: signal modeling and search. fortunately. • Finding the most probable sequence of words is an optimization problem that can be solved by dynamic programming • Optimal solutions are intractable.
Session VI: Hidden Markov Models
A Simple Markov Model For Weather Prediction
What is a first-order Markov chain?
  P[ q_t = j | q_{t-1} = i, q_{t-2} = k, … ] = P[ q_t = j | q_{t-1} = i ]
We consider only those processes for which the right-hand side is independent of time:
  a_ij = P[ q_t = j | q_{t-1} = i ]
with the following properties:
  a_ij ≥ 0  ∀ i, j,  1 ≤ i, j ≤ N
  Σ_{j=1}^{N} a_ij = 1  ∀ i
The above process can be considered observable because the output process is a set of states at each instant of time, where each state corresponds to an observable event. Later, we will relax this constraint, and make the output related to the states by a second random process.
Example: A three-state model of the weather
  State 1: precipitation (rain, snow, hail, etc.)
  State 2: cloudy
  State 3: sunny
[Transition diagram, with probabilities: a_11 = 0.4, a_12 = 0.3, a_13 = 0.3; a_21 = 0.2, a_22 = 0.6, a_23 = 0.2; a_31 = 0.1, a_32 = 0.1, a_33 = 0.8]
Basic Calculations
Example: What is the probability that the weather for eight consecutive days is "sun-sun-sun-rain-rain-sun-cloudy-sun"?
Solution:
  O = (sun, sun, sun, rain, rain, sun, cloudy, sun) = (3, 3, 3, 1, 1, 3, 2, 3)
  P(O | Model) = P[3] P[3|3] P[3|3] P[1|3] P[1|1] P[3|1] P[2|3] P[3|2]
               = π_3 a_33 a_33 a_31 a_11 a_13 a_32 a_23
               = 1.536 × 10⁻⁴
Example: Given that the system is in a known state, what is the probability that it stays in that state for d days?
  O = (i, i, i, …, i, j ≠ i)
  P(O | Model, q_1 = i) = P(O, q_1 = i | Model) / P(q_1 = i)
                        = π_i a_ii^{d-1} (1 - a_ii) / π_i
                        = a_ii^{d-1} (1 - a_ii) = p_i(d)
Note the exponential character of this distribution. We can compute the expected number of observations in a state, given that we started in that state:
  d̄_i = Σ_{d=1}^{∞} d p_i(d) = Σ_{d=1}^{∞} d a_ii^{d-1} (1 - a_ii) = 1 / (1 - a_ii)
Thus, the expected number of consecutive sunny days is 1/(1 - 0.8) = 5; the expected number of cloudy days is 2.5.
What have we learned from this example?
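Both calculations above can be checked in a few lines. The sketch assumes the classic three-state weather transition matrix used in this example (a_33 = 0.8, a_22 = 0.6, etc.); states are numbered 1 = rain, 2 = cloudy, 3 = sunny:

```python
# A[i-1][j-1] = a_ij = P(state j at time t | state i at time t-1)
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_probability(states, pi):
    # P(q_1, ..., q_T) = pi(q_1) * prod_t a(q_t | q_{t-1}); states are 1-based
    p = pi[states[0] - 1]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev - 1][cur - 1]
    return p

def expected_duration(i):
    # d_i = 1 / (1 - a_ii), from the geometric model p_i(d) = a_ii^(d-1)(1 - a_ii)
    return 1.0 / (1.0 - A[i - 1][i - 1])

# "sun sun sun rain rain sun cloudy sun", given we start sunny (pi = [0, 0, 1])
p = sequence_probability([3, 3, 3, 1, 1, 3, 2, 3], [0.0, 0.0, 1.0])
# p reproduces the slide's 1.536e-4; expected_duration(3) = 5 days of sun
```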
Why Are They Called "Hidden" Markov Models?
Consider the problem of predicting the outcome of a coin toss experiment. You observe the following sequence:
  O = (H H T T T H T T H … H)
What is a reasonable model of the system?
1-Coin Model (Observable Markov Model):
Two states, Heads and Tails, with transition probabilities P(H) and 1 - P(H); the state sequence is the observation sequence:
  O = H H T T H T H H T T H …
  S = 1 1 2 2 1 2 1 1 2 2 1 …
2-Coins Model (Hidden Markov Model):
Two states with transition probabilities a_11, 1 - a_11, a_22, 1 - a_22; state 1 emits H with probability P_1 (T with 1 - P_1), state 2 with probability P_2 (T with 1 - P_2):
  O = H H T T H T H H T T H …
  S = 1 1 2 2 1 2 1 1 2 2 1 …
3-Coins Model (Hidden Markov Model):
Three states with transition probabilities {a_ij} (a_11, a_12, a_13, a_21, a_22, a_23, a_31, a_32, a_33); state i emits H with probability P_i and T with probability 1 - P_i:
  O = H H T T H T H H T T H …
  S = 3 1 2 3 3 1 1 2 3 1 3 …
Why Are They Called Doubly Stochastic Systems?
The Urn-and-Ball Model: three urns, each with its own output distribution over ball colors:
  Urn 1: P(red) = b_1(1), P(green) = b_1(2), P(blue) = b_1(3), P(yellow) = b_1(4)
  Urn 2: P(red) = b_2(1), P(green) = b_2(2), P(blue) = b_2(3), P(yellow) = b_2(4)
  Urn 3: P(red) = b_3(1), P(green) = b_3(2), P(blue) = b_3(3), P(yellow) = b_3(4)
An observation sequence might be:
  O = {green, blue, green, yellow, red, …, blue}
How can we determine the appropriate model for the observation sequence given the system above?
Elements of a Hidden Markov Model (HMM)
• N — the number of states
• M — the number of distinct observations per state
• The state-transition probability distribution A = {a_ij}
• The output probability distribution B = {b_j(k)}
• The initial state distribution π = {π_i}
We can write this succinctly as: λ = (A, B, π)
Note that the probability of being in any state at any time is completely determined by knowing the initial state and the transition probabilities:
  π(t) = A^{t-1} π
Two basic problems: (1) how do we train the system? (2) how do we estimate the probability of a given sequence (recognition)?
This gives rise to a third problem: if the states are hidden, how do we know what states were used to generate a given output? How do we represent continuous distributions (such as feature vectors)?
Formalities
The discrete observation HMM is restricted to the production of a finite set of discrete observations (or sequences). The output distribution at any state is given by:
  b(k | i) ≡ P(y(t) = k | x(t) = i)
The observation probabilities are assumed to be independent of time. We can write the probability of observing a particular observation, y(t), as:
  b(y(t) | i) ≡ P(y(t) = y(t) | x(t) = i)
The observation probability distribution can be represented as a matrix whose dimension is K rows x S states. We can define the observation probability vector as:
  p(t) = [ P(y(t) = 1), P(y(t) = 2), …, P(y(t) = K) ]^T
or,
  p(t) = B π(t) = B A^{t-1} π(1)
The mathematical specification of an HMM can be summarized as:
  M = { S, π(1), A, B, {y_k, 1 ≤ k ≤ K} }
For example, reviewing our coin-toss model:
  S = 3
  π(1) = [ 1/3, 1/3, 1/3 ]^T
  A = [ a_11 a_12 a_13 ; a_21 a_22 a_23 ; a_31 a_32 a_33 ]
  B = [ P_1 P_2 P_3 ; 1-P_1 1-P_2 1-P_3 ]
Recognition Using Discrete HMMs
Denote any partial sequence of observations in time by:
  y_{t1}^{t2} ≡ { y(t1), y(t1+1), y(t1+2), …, y(t2) }
The forward partial sequence of observations at time t is
  y_1^t ≡ { y(1), y(2), …, y(t) }
The backward partial sequence of observations at time t is
  y_{t+1}^T ≡ { y(t+1), y(t+2), …, y(T) }
A complete set of observations of length T is denoted as y ≡ y_1^T.
What is the likelihood of an HMM?
We would like to calculate P(M | y = y) — however, we can't. We can (see the introductory notes) calculate P(y = y | M). Let ϑ = {i_1, i_2, …, i_T} denote a specific state sequence. The probability of a given observation sequence being produced by this state sequence is:
  P(y | ϑ, M) = b(y(1) | i_1) b(y(2) | i_2) … b(y(T) | i_T)
The probability of the state sequence is:
  P(ϑ | M) = P(x(1) = i_1) a(i_2 | i_1) a(i_3 | i_2) … a(i_T | i_{T-1})
Therefore,
  P(y, ϑ | M) = P(x(1) = i_1) a(i_2 | i_1) … a(i_T | i_{T-1}) × b(y(1) | i_1) b(y(2) | i_2) … b(y(T) | i_T)
To find P(y | M), we must sum over all possible paths:
  P(y | M) = Σ_{∀ϑ} P(y, ϑ | M)
Consider the brute force method of computing this. This requires O(2T S^T) flops. For S = 5 and T = 100, this gives about 1.6 × 10⁷² computations per HMM!
The "Any Path" Method (Forward-Backward, Baum-Welch)
The forward-backward (FB) algorithm begins by defining a "forward-going" probability sequence:
  α(y_1^t, i) ≡ P(y = y_1^t, x(t) = i | M)
and a "backward-going" probability sequence:
  β(y_{t+1}^T | i) ≡ P(y = y_{t+1}^T | x(t) = i, M)
Let us next consider the contribution to the overall sequence probability made by a single transition:
  α(y_1^t, i) P(x(t+1) = j | x(t) = i) P(y(t+1) = y(t+1) | x(t+1) = j)
    = α(y_1^t, i) a(j | i) b(y(t+1) | j)
[Figure: trellis showing every state i at time t feeding state j at time t+1.]
Summing over all possibilities for reaching state j:
  α(y_1^{t+1}, j) = Σ_{i=1}^{S} α(y_1^t, i) a(j | i) b(y(t+1) | j)
Baum-Welch (Continued)
The recursion is initiated by setting:
  α(y_1^1, j) = P(x(1) = j) b(y(1) | j)
Similarly, we can derive an expression for β:
  β(y_{t+1}^T | i) = Σ_{j=1}^{S} β(y_{t+2}^T | j) a(j | i) b(y(t+1) | j)
This recursion is initialized by:
  β(y_{T+1}^T | i) ≡ 1 if i is a legal final state, 0 otherwise.
By considering t = T, we can write:
  P(y | M) = Σ_{i=1}^{S} α(y_1^T, i)
But we also note that we should be able to compute this probability using both directions, for any t:
  P(y | M) = Σ_{i=1}^{S} α(y_1^t, i) β(y_{t+1}^T | i)
These equations suggest a recursion in which, for each value of t, we iterate over ALL states and update α(y_1^t, j). When t = T, P(y | M) is computed by summing over ALL states.
The complexity of this algorithm is O(S²T), or for S = 5 and T = 100, approximately 2500 flops are required (compared to 10⁷² flops for the exhaustive search).
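The forward half of the recursion is enough to compute P(y | M). A minimal sketch (the two-state, two-symbol model below is hypothetical; the numbers are illustrative only), which can be verified against the brute-force sum over all state sequences:

```python
def forward(A, B, pi, obs):
    # alpha_1(i) = pi_i b(obs_1 | i)
    # alpha_{t+1}(j) = [sum_i alpha_t(i) a(j | i)] b(obs_{t+1} | j)
    S = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(S)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(S)) * B[j][obs[t]]
                 for j in range(S)]
    return sum(alpha)    # P(y | M): sum over all terminal states

# Hypothetical model (illustrative numbers only)
A = [[0.7, 0.3], [0.4, 0.6]]    # A[i][j] = a(j | i)
B = [[0.9, 0.1], [0.2, 0.8]]    # B[i][k] = b(k | i)
pi = [0.5, 0.5]
obs = [0, 1, 0]
p_fwd = forward(A, B, pi, obs)
```

This is the O(S²T) computation: S² work per frame instead of a sum over Sᵀ paths.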
The Viterbi Algorithm
Instead of allowing any path to produce the output sequence, creating the need to sum over all paths, we can simply assume only one path produced the output. We would like to find the single most likely path that could have produced the output. Calculation of this path and its probability is straightforward, using the dynamic programming algorithm previously discussed:
  D(t, i) = a(i | j*) b(k | i) D(t-1, j*)
where
  j* = arg max over valid j of { D(t-1, j) }
(in other words, the predecessor node with the best score). Often, probabilities are replaced with the logarithm of the probability, which converts multiplications to summations. In this case, the HMM looks remarkably similar to our familiar DP systems.
Beam Search
In the context of the best path method, it is easy to see that we can employ a beam search similar to what we used in DP systems: for a path to survive, its score must be within a range of the best current score,
  D(t, i) ≥ D(t, i*_t) - δ(t)
This can be viewed as a time-synchronous beam search. It has the advantage that, since all hypotheses are at the same point in time, their scores can be compared directly. This is due to the fact that each hypothesis accounts for the same amount of time (same number of frames).
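A log-domain Viterbi sketch follows (not from the course; the two-state model is hypothetical, and note one deliberate difference from the slide's shorthand: the max here includes the transition term when choosing the predecessor, which is the usual Viterbi formulation):

```python
import math

def viterbi(A, B, pi, obs):
    # Log-domain Viterbi with backpointers; all probabilities assumed > 0
    S = len(pi)
    score = [math.log(pi[i] * B[i][obs[0]]) for i in range(S)]
    back = []
    for t in range(1, len(obs)):
        new_score, ptr = [], []
        for j in range(S):
            best_i = max(range(S), key=lambda i: score[i] + math.log(A[i][j]))
            new_score.append(score[best_i] + math.log(A[best_i][j])
                             + math.log(B[j][obs[t]]))
            ptr.append(best_i)
        score, back = new_score, back + [ptr]
    # Backtrack the single best state sequence
    q = max(range(S), key=lambda i: score[i])
    path = [q]
    for ptr in reversed(back):
        q = ptr[q]
        path.append(q)
    return path[::-1], max(score)

# Hypothetical 2-state, 2-symbol model (illustrative numbers only)
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
obs = [0, 0, 1, 1]
path, logp = viterbi(A, B, pi, obs)
# The best path tracks the symbol change: state 0 early, state 1 late
```

Because every hypothesis here sits at the same frame t, a beam test (score within δ of the best) can prune states directly, exactly as described above.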
Training Discrete Observation HMMs
Training refers to the problem of finding {π(1), A, B} such that the model, M, after an iteration of training, better represents the training data than the previous model. The number of states is usually not varied or reestimated, other than via the modification of the model inventory. The a priori probabilities of the likelihood of a model, π(1), are normally not reestimated as well, since these typically come from the language model.
The first algorithm we will discuss is one based on the Forward-Backward algorithm (Baum-Welch reestimation):
  u_{j|i} ≡ label for a transition from state i to state j
  u_{•|i} ≡ set of transitions exiting state i
  u_{j|•} ≡ set of transitions entering state j
Also, u(t) denotes a random variable that models the transitions at time t, and y_j(t) a random variable that models the observation being emitted at state j at time t. The symbol "•" is used to denote an arbitrary event.
Next, we need to define some intermediate quantities related to particular events at a given state at a given time:
  ζ(i, j, t) ≡ P(u(t) = u_{j|i} | y, M) = P(u(t) = u_{j|i}, y | M) / P(y | M)
            = α(y_1^t, i) a(j | i) b(y(t+1) | j) β(y_{t+2}^T | j) / P(y | M),  t = 1, 2, …, T-1
            = 0, other t
where the sequences α, β, a, and b were defined previously (last lecture). Intuitively, we can think of this as the probability of observing a transition from state i to state j at time t for a particular observation sequence, y (the utterance in progress), and model M.
We can also make the following definition:
  γ(i, t) ≡ P(u(t) ∈ u_{•|i} | y, M) = Σ_{j=1}^{S} ζ(i, j, t)
          = α(y_1^t, i) β(y_{t+1}^T | i) / P(y | M),  t = 1, 2, …, T
          = 0, other t
This is the probability of exiting state i. Also,
  ν(j, t) ≡ P(x(t) = j | y, M)
          = α(y_1^t, j) β(y_{t+1}^T | j) / P(y | M),  t = 1, 2, …, T
          = 0, other t
which is the probability of being in state j at time t. Finally,
  δ(j, k, t) ≡ P(y_j(t) = k | y, M)
            = α(y_1^t, j) β(y_{t+1}^T | j) / P(y | M), if y(t) = k and 1 ≤ t ≤ T
            = 0, otherwise
which is the probability of observing symbol k at state j at time t.
Note that we make extensive use of the forward and backward probabilities in these computations. This will be key to reducing the complexity of the computations by allowing an iterative computation.
From these four quantities, we can define four more random variables:
  n(u_{j|i}) ≡ number of transitions of the type u_{j|i}
  n(u_{•|i}) ≡ number of transitions of the type u_{•|i}
  n(u_{j|•}) ≡ number of transitions of the type u_{j|•}
  n(y_j(•) = k) ≡ number of times the observation k and state j jointly occur
We can see that:
  ζ(i, j, •) = P(u(•) ∈ u_{j|i} | y, M) = Σ_{t=1}^{T} ζ(i, j, t) = E{ n(u_{j|i}) | y, M }
  γ(i, •) = P(u(•) ∈ u_{•|i} | y, M) = Σ_{t=1}^{T} γ(i, t) = E{ n(u_{•|i}) | y, M }
  ν(j, •) = P(u(•) ∈ u_{j|•} | y, M) = Σ_{t=1}^{T} ν(j, t) = E{ n(u_{j|•}) | y, M }
  δ(j, k, •) = P(y_j(•) = k | y, M) = Σ_{t=1, y(t)=k}^{T} δ(j, k, t) = E{ n(y_j(•) = k) | y, M }
What we have done up to this point is to develop expressions for the estimates of the underlying components of the model parameters in terms of the state sequences that occur during training. Next, we can begin relating these quantities to the problem of reestimating the model parameters. But how can this be when the internal structure of the model is hidden?
Following this line of reasoning, an estimate of the transition probability is:
  ā(j | i) = E{ n(u_{j|i}) | y, M } / E{ n(u_{•|i}) | y, M } = ζ(i, j, •) / γ(i, •)
           = [ Σ_{t=1}^{T-1} α(y_1^t, i) a(j | i) b(y(t+1) | j) β(y_{t+2}^T | j) ]
             / [ Σ_{t=1}^{T-1} α(y_1^t, i) β(y_{t+1}^T | i) ]
Similarly,
  b̄(k | j) = δ(j, k, •) / ν(j, •)
           = [ Σ_{t: y(t)=k} α(y_1^t, j) β(y_{t+1}^T | j) ]
             / [ Σ_{t=1}^{T} α(y_1^t, j) β(y_{t+1}^T | j) ]
Finally,
  P̄(x(1) = i) = α(y_1^1, i) β(y_2^T | i) / P(y | M)
This process is often called reestimation by recognition, because we need to recognize the input with the previous models in order that we can compute the new model parameters from the set of state sequences used to recognize the data (hence, the need to iterate). But will it converge? Baum and his colleagues showed that the new model guarantees that:
  P(y | M̄) ≥ P(y | M)
Since this is a highly nonlinear optimization, it can get stuck in local maxima:
[Figure: P(y | M) as a function of M, showing a local maximum, the global maximum, and the perturbation distance between them.]
We can overcome this by starting training from a different initial point, or "bootstrapping" models from previous models.
Analogous procedures exist for the Viterbi algorithm, though they are much simpler and more intuitive (and more DP-like):
  ā(j | i) = E{ n(u_{j|i}) | y, M } / E{ n(u_{•|i}) | y, M }
and,
  b̄(k | j) = E{ n(y_j(•) = k) | y, M } / E{ n(u_{•|j}) | y, M }
where the expectations are now counts accumulated along the single best path. These have been shown to give comparable performance to the forward-backward algorithm at significantly reduced computation.
Further, the above algorithms are easily applied to many problems associated with language modeling: estimating transition probabilities and word probabilities, efficient parsing, and learning hidden structure. It also is generalizable to alternate formulations of the topology of the acoustic model (or language model) drawn from formal language theory. (In fact, we can even eliminate the first-order Markovian assumption.)
But what if a transition is never observed in the training database?
Continuous Density HMMs

The discrete HMM incorporates a discrete probability density function, captured in the matrix $B$, to describe the probability of outputting a symbol:

[Figure: the output distribution $b(k \mid j)$ for a state, plotted over the codebook symbols $k = 1, 2, 3, \ldots$]

Signal measurements, or feature vectors, are continuous-valued N-dimensional vectors. In order to use our discrete HMM technology, we must vector quantize (VQ) this data: reduce the continuous-valued vectors to discrete values chosen from a set of M codebook vectors. Initially, most HMMs were based on VQ front ends. Recently, however, the continuous density model has become widely accepted.

Let us assume a parametric model of the observation pdf:

$$M = \{\, S,\ \pi(1),\ A,\ f_{y \mid x}(\xi \mid i) \,\}, \quad 1 \leq i \leq S$$

The likelihood of generating observation $y(t)$ in state $j$ is defined as:

$$b(y(t) \mid j) \equiv f_{y \mid x}(y(t) \mid j)$$

Note that taking the negative logarithm of $b(\cdot)$ will produce a log-likelihood, or a Mahalanobis-like distance. But what form should we choose for $f(\cdot)$? Let's assume a Gaussian model, of course:

$$f_{y \mid x}(y \mid i) = \frac{1}{\sqrt{(2\pi)^N \lvert C_i \rvert}} \exp\left( -\frac{1}{2} (y - \mu_i)^T C_i^{-1} (y - \mu_i) \right)$$

Note that this amounts to assigning a mean vector and covariance matrix to each state, a significant increase in complexity. However, shortcuts such as variance-weighting can help reduce complexity. Also, note that the log of the output probability at each state becomes precisely the Mahalanobis distance (principal components) we studied at the beginning of the course.
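The link between the Gaussian output probability and a Mahalanobis-like distance can be checked directly; the mean and covariance below are made-up example values:

```python
import numpy as np

def log_b(y, mu, C):
    """log f(y | state) for a multivariate Gaussian; negating it yields the
    Mahalanobis distance plus a state-dependent constant."""
    N = len(mu)
    d = y - mu
    maha = d @ np.linalg.solve(C, d)            # (y - mu)^T C^{-1} (y - mu)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + maha)

mu = np.array([1.0, -1.0])
C = np.array([[2.0, 0.5], [0.5, 1.0]])
score = log_b(np.array([0.5, -0.5]), mu, C)
```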
Mixture Distributions

Of course, the output distribution need not be Gaussian, and it may need to be multimodal to reflect the fact that several contexts are being encoded into a single state (male/female, allophonic variations of a phoneme, etc.). Much as a VQ approach can model any discrete distribution, we can use a weighted linear combination of Gaussians, or a mixture distribution, to achieve a more complex statistical model.

[Figure: $b(y \mid j)$ built from three Gaussian mixtures centered at $\mu_1$, $\mu_2$, $\mu_3$, with the composite density shown offset.]

Mathematically, this is expressed as:

$$f_{y \mid x}(y \mid i) = \sum_{m=1}^{M} c_{im}\, \aleph(y;\, \mu_{im},\, C_{im})$$

In order for this to be a valid pdf, the mixture coefficients must be nonnegative and satisfy the constraint:

$$\sum_{m=1}^{M} c_{im} = 1, \quad 1 \leq i \leq S$$

Note that mixture distributions add significant complexity to the system: $M$ means and covariances at each state.

Analogous reestimation formulae can be derived by defining the intermediate quantity:

$$\nu(i; t, l) \equiv P(x(t) = i,\ y(t) \text{ produced in accordance with mixture } l \mid y, M)$$

$$= \frac{\alpha(y_1^t, i)\, \beta(y_{t+1}^T \mid i)}{\sum_{j=1}^{S} \alpha(y_1^t, j)\, \beta(y_{t+1}^T \mid j)} \times \frac{c_{il}\, \aleph(y_t;\, \mu_{il},\, C_{il})}{\sum_{m=1}^{M} c_{im}\, \aleph(y_t;\, \mu_{im},\, C_{im})}$$
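Evaluating the mixture density, and the validity constraint on its coefficients, can be sketched as follows; the weights and component parameters are arbitrary example values:

```python
import numpy as np

def gauss(y, mu, C):
    """Multivariate Gaussian density N(y; mu, C)."""
    N = len(mu)
    d = y - mu
    return np.exp(-0.5 * d @ np.linalg.solve(C, d)) / np.sqrt(
        (2.0 * np.pi) ** N * np.linalg.det(C))

def mixture_pdf(y, c, mus, Cs):
    """f(y | state) = sum_m c_m N(y; mu_m, C_m); needs c_m >= 0, sum c_m = 1."""
    assert np.all(np.asarray(c) >= 0) and np.isclose(np.sum(c), 1.0)
    return sum(cm * gauss(y, mu, C) for cm, mu, C in zip(c, mus, Cs))

# Three 1-D components (made-up values):
c = [0.5, 0.3, 0.2]
mus = [np.array([0.0]), np.array([2.0]), np.array([4.0])]
Cs = [np.eye(1), np.eye(1), np.eye(1)]
val = mixture_pdf(np.array([1.0]), c, mus, Cs)
```

Because the coefficients sum to one, the mixture integrates to one like any valid pdf.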
The mixture coefficients can now be reestimated using:

$$c_{il} = \frac{\nu(i; \bullet, l)}{\sum_{m=1}^{M} \nu(i; \bullet, m)}$$

the mean vectors can be reestimated as:

$$\mu_{il} = \frac{\sum_{t=1}^{T} \nu(i; t, l)\, y(t)}{\nu(i; \bullet, l)}$$

and the covariance matrices can be reestimated as:

$$C_{il} = \frac{\sum_{t=1}^{T} \nu(i; t, l)\, [y(t) - \mu_{il}][y(t) - \mu_{il}]^T}{\nu(i; \bullet, l)}$$

The transition probabilities and initial probabilities are reestimated as usual.

The Viterbi procedure once again has a simpler interpretation:

$$\mu_{il} = \frac{1}{N_{il}} \sum_{\substack{t = 1 \\ y(t) \sim il}}^{T} y(t) \qquad\qquad C_{il} = \frac{1}{N_{il}} \sum_{\substack{t = 1 \\ y(t) \sim il}}^{T} [y(t) - \mu_{il}][y(t) - \mu_{il}]^T$$

The mixture coefficient is reestimated as the fraction of vectors associated with a given mixture at a given state:

$$c_{il} = \frac{N_{il}}{N_i}$$
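Given the occupancies $\nu(i; t, l)$ for one state, the reestimates above are simple weighted averages. A sketch with made-up observations and hard (Viterbi-style) occupancies, for which the update reduces to per-mixture sample averages:

```python
import numpy as np

def reestimate_state_mixtures(Y, nu):
    """Y: (T, N) observations; nu: (T, L) occupancies nu[t, l] = nu(i; t, l)
    for a fixed state i. Returns updated (c, mu, C) for that state."""
    Nl = nu.sum(axis=0)                          # nu(i; ., l)
    c = Nl / Nl.sum()                            # mixture coefficients
    mu = (nu.T @ Y) / Nl[:, None]                # weighted mean vectors
    C = np.stack([((nu[:, l, None] * (Y - mu[l])).T @ (Y - mu[l])) / Nl[l]
                  for l in range(nu.shape[1])])  # weighted covariances
    return c, mu, C

# Hard assignments: first two frames belong to mixture 0, last two to mixture 1.
Y = np.array([[0.0], [0.2], [4.0], [4.2]])
nu = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
c, mu, C = reestimate_state_mixtures(Y, nu)
```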
Session VII: Acoustic Modeling and Training
State Duration Probabilities

Recall that the probability of staying in a state was given by an exponentially decaying distribution:

$$P(d_i = d) = P(q_1 = \cdots = q_d = i,\ q_{d+1} \neq i \mid q_1 = i,\ \text{Model}) = a_{ii}^{d-1}(1 - a_{ii})$$

This model is not necessarily appropriate for speech, where duration variations can be significant. There are three approaches in use today:

• Finite-State Models (encoded in the acoustic model topology):
  [Figure: a "macro state" built from parallel chains of simple states. Note that this model doesn't have skip states.]

• Discrete State Duration Models ($D$ parameters per state):
  $$P(d_i = d) = \tau_d, \quad 1 \leq d \leq D$$

• Parametric State Duration Models (one to two parameters), for example a Gaussian:
  $$f(d \mid i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( -\frac{(d - \bar{d}_i)^2}{2\sigma_i^2} \right)$$

Reestimation equations exist for all three cases. Duration models are often important for larger units, such as words, where duration variations can be significant, but not as important for smaller units, such as context-dependent phones, where duration variations are much better understood and predicted.
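The implied geometric duration distribution can be checked numerically; the self-loop value 0.9 is an arbitrary example:

```python
def duration_pmf(a_ii, d):
    """P(state duration == d) implied by a self-loop: a_ii^(d-1) * (1 - a_ii)."""
    return a_ii ** (d - 1) * (1.0 - a_ii)

a_ii = 0.9
durations = range(1, 2001)                 # truncation; the tail is negligible
pmf = [duration_pmf(a_ii, d) for d in durations]
mean_duration = sum(d * p for d, p in zip(durations, pmf))
```

The pmf decays monotonically from $d = 1$ with mean $1/(1 - a_{ii})$, which is exactly why this model fits observed speech durations poorly.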
Scaling in HMMs

As difficult as it may seem to believe, standard HMM calculations exceed the precision of 32-bit floating point numbers for the simplest of models. The large number of multiplications of numbers less than one leads to underflow. Hence, we must incorporate some form of scaling.

It is possible to scale the forward-backward calculation (see Section 12.2.5) by normalizing the calculations by:

$$c(t) = \frac{1}{\sum_{i=1}^{S} \tilde{\alpha}(y_1^t, i)}$$

at each time step (time-synchronous normalization). Fortunately, a simpler and more effective way to scale is to deal with log probabilities, which work out nicely in the case of continuous distributions. Similarly, in a dynamic programming search we need to somehow prevent the best-path score from growing with time (increasingly negative in the case of log probs). Hence, we can normalize all candidate best-path scores at each time step, and proceed with the dynamic programming search.

Even so, it is often desirable to trade off the importance of transition probabilities and observation probabilities. Hence, the log-likelihood of an output symbol being observed at a state can be written as:

$$\log P(y_1^t, x(t) = i \mid M) = \log P(y_1^{t-1}, x(t-1) = j \mid M) + \alpha \log(a_{ij}) + \beta \log b_i(y(t))$$

where, in the continuous-density case, $\log b_i(\cdot)$ is the Mahalanobis-like distance $D_{y,\mu}$. The weights $\alpha$ and $\beta$ can be used to control the importance of the "language model." This result emphasizes the similarities between HMMs and DTW. More on this later.
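A sketch contrasting the unscaled forward pass with a log-space version; for long sequences the unscaled probabilities underflow toward zero, while the log-space score stays usable. The model values are illustrative:

```python
import numpy as np

def logsumexp(v):
    """Numerically stable log(sum(exp(v)))."""
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def log_forward(A, B, pi, y):
    """Forward algorithm carried out entirely on log probabilities."""
    logA, logB = np.log(A), np.log(B)
    la = np.log(pi) + logB[:, y[0]]
    for t in range(1, len(y)):
        la = np.array([logsumexp(la + logA[:, j]) for j in range(len(la))]) \
             + logB[:, y[t]]
    return logsumexp(la)                      # log P(y | M)

A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
short = [0, 1, 0, 1]
long_seq = [0, 1] * 2500                      # 5000 frames: linear domain underflows
ll = log_forward(A, B, pi, long_seq)
```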
An Overview of the Training Schedule

[Flow diagram: Hand-Excised Data and Seed Model Construction feed a supervised training loop: Build Grammar, Recognize, Backtrace/Update, Next Utterance until the last utterance, then Replace Parameters and test for convergence before the next iteration.]

Note that a priori segmentation of the utterance is not required, and that the recognizer is forced to recognize the utterance during training (via the build-grammar operation). This forces the recognizer to learn contextual variations, provided the seed model construction is done "properly."

What about speaker independence? Speaker dependence? Speaker adaptation? Channel adaptation?
Alternative Criteria For Optimization

As we have previously seen, an HMM system using the standard Baum-Welch reestimation algorithm learns to emulate the statistics of the training database. We refer to this training mode as "representation." This is not necessarily equivalent to minimizing the recognition error rate; how well it works depends to a great degree on the extent to which the statistics of the training database match the test database. Often, especially in open-set speaker-independent cases, the test database contains data never seen in the training database.

A potential solution to this problem is to force the recognizer to learn to discriminate (reduce recognition errors explicitly). This is analogous to developing a system to make an M-way choice in an N-dimensional space by learning decision boundaries rather than clustering the data and modeling the statistics of each cluster.

One approach to modifying the HMM structure to do this is to use a different cost function. For example, we can minimize the discrimination information (or, equivalently, the cross-entropy) between the data and the statistical characterizations of the data implied by the model. Recall the definition of the discrimination information:

$$J_{DI} = \int_{-\infty}^{\infty} f_y(y) \log \frac{f_y(y)}{f(y \mid M)}\, dy$$

Note that a small DI is good, because it implies a given measurement matches both the signal statistics and the model statistics (which means the relative mismatch between the two is small). However, such approaches have proven to be intractable, leading to highly nonlinear estimation equations.

A more successful approach has been to maximize the average mutual information. Recall our definition for the average mutual information:

$$M(y, M) = \sum_{l=1}^{L} \sum_{r=1}^{R} P(y = y^l, M = M_r) \log \frac{P(y = y^l, M = M_r)}{P(y = y^l)\, P(M = M_r)}$$
This can be written as:

$$M(y, M) = \sum_{l=1}^{L} \sum_{r=1}^{R} P(y = y^l, M = M_r) \times \left[ \log P(y = y^l \mid M = M_r) - \log \sum_{m=1}^{R} P(y = y^l \mid M = M_m)\, P(M = M_m) \right]$$

Note that the last term constitutes a rewrite of $P(y = y^l)$.

Then, if we assume there is exactly one training string per model, so that $y^l$ is the string used to train $M_l$ and $P(y = y^l \mid M = M_r) \approx \delta(l - r)$, we can approximate $M(y, M)$ by:

$$M(y, M) \approx \sum_{l=1}^{L} \left[ \log P(y = y^l \mid M = M_l) - \log \sum_{m=1}^{R} P(y = y^l \mid M = M_m)\, P(M = M_m) \right]$$

The first term in the summation corresponds to the probability of correct recognition, which we want to maximize. The second term corresponds to the probability of incorrect recognition (which is the error we want to correct), which we want to minimize.

This method has a rather simple interpretation for discrete HMMs:

[Figure: a state network in which counts are incremented along the best (correct) path and decremented along competing incorrect paths.]
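The increment/decrement interpretation for discrete HMMs can be sketched as follows; the floor keeps decremented counts positive, and all numbers are illustrative:

```python
import numpy as np

def corrective_update(counts, best_path, competing_path, step=1.0, floor=0.1):
    """Increment transition counts along the correct path, decrement them along
    a competing incorrect path, then renormalize into probabilities."""
    counts = counts.copy()
    for i, j in zip(best_path, best_path[1:]):
        counts[i, j] += step
    for i, j in zip(competing_path, competing_path[1:]):
        counts[i, j] = max(floor, counts[i, j] - step)
    return counts, counts / counts.sum(axis=1, keepdims=True)

counts0 = np.full((3, 3), 5.0)                 # made-up starting counts
counts1, A_new = corrective_update(counts0, [0, 1, 2], [0, 2, 2])
```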
An Overview of the Corrective Training Schedule

[Flow diagram: the supervised training loop of the previous schedule (Hand-Excised Data, Seed Model Construction, Build Grammar, Recognize, Backtrace/Update, Replace Parameters, Convergence?), followed by a corrective training loop: Recognize (Open), on a Recognition Error do Backtrace/Update, advance to the Next Utterance until the last, then Replace Parameters and test for Convergence.]
Unfortunately, this method is not readily extensible to continuous speech, and has proved inconclusive in providing measurable reductions in error rate. However, discriminative training algorithms continue to be an important area of research in HMMs. Later, when we study neural networks, we will observe that some of the neural network approaches are ideally suited to implementation of discriminative training.

Distance Measures for HMMs

Clustering of HMM models is often important in reducing the number of context-dependent phone models (which can often approach 10,000 for English) to a manageable number (typically a few thousand models are used). We can use standard clustering algorithms, but we need some way of computing the distance between two models.

Recall the K-MEANS clustering algorithm:

Initialization: Choose K centroids
Recursion:
1. Assign all vectors to their nearest centroid.
2. Recompute the centroids as the average of all vectors assigned to the same centroid.
3. Check the overall distortion. Return to step 1 if some distortion criterion is not met.

A useful distance measure can be defined as:

$$D(M_1, M_2) \equiv \frac{1}{T_2} \left[ \log P(y^2 \mid M_1) - \log P(y^2 \mid M_2) \right]$$

where $y^2$ is a sequence of length $T_2$ generated by $M_2$. Note that this distance metric is not symmetric:

$$D(M_1, M_2) \neq D(M_2, M_1)$$
A symmetric version of this is:

$$D'(M_1, M_2) = \frac{D(M_1, M_2) + D(M_2, M_1)}{2}$$

The sequence $y$ is often taken to be the sequence of mean vectors associated with each state (typically for continuous distributions).

Often, phonological knowledge is used to cluster models. Models sharing similar phonetic contexts are merged to reduce the complexity of the model inventory. Interestingly enough, this can often be performed by inserting an additional network into the system that maps context-dependent phone models to a pool of states.
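The distance and its symmetric version can be sketched by Monte Carlo: generate a sequence from one model and score it under both. The two discrete models below are made-up (one sticky and peaked, one uniform), and the sequence length is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(model, T):
    """Generate an observation sequence of length T from a discrete HMM."""
    A, B, pi = model
    y, x = [], rng.choice(len(pi), p=pi)
    for _ in range(T):
        y.append(int(rng.choice(B.shape[1], p=B[x])))
        x = rng.choice(len(pi), p=A[x])
    return y

def loglik(model, y):
    """Scaled forward pass: log P(y | M)."""
    A, B, pi = model
    alpha = pi * B[:, y[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, len(y)):
        alpha = (alpha @ A) * B[:, y[t]]
        s = alpha.sum()
        ll += np.log(s)
        alpha /= s
    return ll

def D(M1, M2, T=2000):
    y = sample(M2, T)                          # y generated by M2
    return (loglik(M1, y) - loglik(M2, y)) / T

def D_sym(M1, M2, T=2000):
    return 0.5 * (D(M1, M2, T) + D(M2, M1, T))

M1 = (np.array([[0.9, 0.1], [0.1, 0.9]]),
      np.array([[0.9, 0.1], [0.1, 0.9]]), np.array([0.5, 0.5]))
M2 = (np.array([[0.5, 0.5], [0.5, 0.5]]),
      np.array([[0.5, 0.5], [0.5, 0.5]]), np.array([0.5, 0.5]))
```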
The DTW Analogy

[Figure only on this page.]
Alternative Acoustic Model Topologies

[Figure only on this page.]
Managing Model Complexity

[Diagram: a spectrum of modeling choices, from discrete density with vector quantization, through semi-continuous HMMs with tied mixtures, to continuous density mixture distributions with clustered and tied states. Moving toward richer models increases the number of free parameters, storage, memory, and CPU, and (potentially) performance.]

• Numerous techniques exist to robustly estimate model parameters; among the most popular is deleted interpolation:

$$A = \varepsilon_t A_t + (1 - \varepsilon_t) A_u$$
$$B = \varepsilon_t B_t + (1 - \varepsilon_t) B_u$$
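Deleted interpolation just mixes two estimates of the same stochastic matrix; if both inputs are row-stochastic and $0 \leq \varepsilon \leq 1$, the result is too. A sketch with made-up matrices:

```python
import numpy as np

def deleted_interpolation(P_t, P_u, eps):
    """A = eps * A_t + (1 - eps) * A_u; eps weights the detailed (task-trained)
    estimate against the smoother, more robust estimate."""
    assert 0.0 <= eps <= 1.0
    return eps * P_t + (1.0 - eps) * P_u

A_t = np.array([[0.8, 0.2], [0.1, 0.9]])       # detailed (sparse-data) estimate
A_u = np.array([[0.5, 0.5], [0.5, 0.5]])       # smooth fallback estimate
A = deleted_interpolation(A_t, A_u, 0.7)
```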
Seed Model Construction: Duration Distributions

[Figure only on this page.]

Seed Model Construction: Number of States Proportional to Duration

[Figure only on this page.]

Duration in Context Clustering

[Figure only on this page.]

Examples of Word Models

[Figure only on this page.]
Session VIII: Speech Recognition Using Hidden Markov Models
BASIC TECHNOLOGY: A PATTERN RECOGNITION PARADIGM BASED ON HIDDEN MARKOV MODELS

[Block diagram relating the component models:]

• Recognized Symbols: $P(S \mid O) = \arg\max \prod_{t=1}^{T} P(W_t^i \mid O_t, O_{t-1}, \ldots)$
• Search Algorithms (Bayes rule): $P(W_t^i \mid O_t) = \dfrac{P(O_t \mid W_t^i)\, P(W_t^i)}{P(O_t)}$
• Language Model (prediction): $P(W_t^i)$
• Pattern Matching: $P(W_t^i \mid O_t, O_{t-1}, \ldots)$
• Signal Model: $P(O_t \mid W_{t-1}, W_t, W_{t+1})$
High Performance Isolated Word Recognition Using A Continuous Speech Recognizer

Isolated Word Recognition:

S: Nonspeech, then {Word}, then Nonspeech

• Nonspeech: typically an acoustic model of one frame in duration that models the background noise.
• {Word}: any word from the set of possible words that can be spoken.
• The key point here is that, with such a system, the recognizer finds the optimal start/stop times of the utterance with respect to the acoustic model inventory (a hypothesis-directed search).

Simple Continuous Speech Recognition ("No Grammar"):

S: Nonspeech / {Word}, looping, then Nonspeech

• The system recognizes arbitrarily long sequences of words or nonspeech events.
The Great Debate: "Bottom-Up" or "Top-Down"

[Parse tree for "THE CHILD CRIED AS SHE LEFT IN THE RED PLANE", shown at the syntactic level (S, NP, VP, PP, ART, N, V, PREP, ADJ, CONJ, PRON), the word level, and the phonetic level (Dx CYld krYd xS Si lEft In Dx rEd plen).]

Parsing refers to the problem of determining if a given sequence could have been generated from a given state machine. This computation, as we shall see, typically requires an elaborate search of all possible combinations of symbols output from the state machine.

This computation can be efficiently performed in a "bottom-up" fashion if the probabilities of the input symbols are extremely accurate, and only a few symbols are possible at each point at the lower levels of the tree. If the input symbols are ambiguous, "top-down" parsing is typically preferred.
Network Searching and Beam Search

Premise: Suboptimal solutions are useful; global optimums are hard to find.

[Figure: a trellis of active hypotheses with the best path highlighted.]

The network search considers several competing constraints:

• The longer the hypothesis, the lower the probability
• The overall best hypothesis will not necessarily be the best initial hypothesis
• We want to limit the evaluation of the same substring by different hypotheses (difficult in practice)
• We would like to maintain as few active hypotheses as possible
• All states in the network will not be active at all times
Popular Search Algorithms

• Time Synchronous (Viterbi, DP)
• State Synchronous (Baum-Welch)
• Stack Decoding (N-Best, Best-First, A*)
• Hybrid Schemes
  - Forward-Backward
  - Multipass
  - Fast Matching
Generalization of the HMM

Consider the following state diagram showing a simple language model involving constrained digit sequences:

[State diagram: states S, A, B, C, D, E, F, G, H, I, J connected by arcs labeled with transition probabilities (e.g., 0.9, 0.1, 0.5, 1.0) and word output probabilities, e.g. "zero (0.8) / oh (0.2)", "one (1.0)", "two (1.0)", "three (1.0)", "four (1.0)", "five (1.0)", "six (1.0)", "seven (1.0)", "eight (1.0)", "nine (1.0)".]

Note the similarities to our acoustic models.

What is the probability of the sequence "zero one two three four five zero six six seven seven eight eight"?

How would you find the average length of a digit sequence generated from this language model?
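One way to answer the average-length question: treat the grammar as an absorbing Markov chain and use the fundamental matrix $N = (I - Q)^{-1}$; the row sums of $N$ give the expected number of steps (words) before termination. The two-state $Q$ below is a made-up illustration, not the digit grammar on the slide:

```python
import numpy as np

# Transient-to-transient transition probabilities (illustrative values);
# the remaining mass in each row exits to the final (absorbing) state.
Q = np.array([[0.5, 0.4],
              [0.1, 0.8]])
N = np.linalg.inv(np.eye(2) - Q)       # fundamental matrix
expected_words = N.sum(axis=1)         # expected emissions per starting state
```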
In the terminology associated with formal language theory, this HMM is known as a finite state automaton. Further, this particular graph is classified as a nondeterministic automaton, since there are multiple transitions and observations generated at any point in time (hence, ambiguous output). The word stochastic can also be applied because the transitions and output symbols are governed by probability distributions. In the future, we will refer to this system as a stochastic finite state automaton (FSA or SFSA) when it is used to model linguistic information.

We can also express this system as a regular grammar, with one production per arc and word, for example:

S → zero A (p1)    S → one D (p2)    A → three A (p3)    ...    I → nine J (p'27)    I → nine (p''27)

[The full rule listing assigns probabilities p1 through p28; the primed probabilities (p'27, p''27, p'28, p''28) cover the nondeterministic final transitions.]
Note that rule probabilities are not quite the same as transition probabilities, since they must combine transition probabilities and output probabilities. For example, consider $p_7$:

$$p_7 = (0.9)(0.8)$$

In general,

$$P(y = y_k \mid x = x_i) = \sum_j a_{ij}\, b_j(k)$$

Note that we must adjust probabilities at the terminal states when the grammar is nondeterministic:

$$p_k = p'_k + p''_k$$

to allow generation of a final terminal. Hence, our transition from HMMs to stochastic formal languages is clear and well understood.
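The combination of transition and output probabilities can be sketched directly; the model is illustrative, chosen so one term mirrors the $p_7 = (0.9)(0.8)$ example:

```python
import numpy as np

def symbol_given_state(A, B, i):
    """P(y = k | x = i) = sum_j a_ij * b_j(k): marginalize over the next state."""
    return A[i] @ B

# From state 0 we move to state 1 w.p. 0.9, and state 1 emits symbol 0 w.p. 0.8.
A = np.array([[0.1, 0.9], [0.5, 0.5]])
B = np.array([[0.5, 0.5], [0.8, 0.2]])
p = symbol_given_state(A, B, 0)
```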
The Artificial Neural Network (ANN)

• Premise: complex computational operations can be implemented by massive integration of individual components.
• Topology and interconnections are key: in many ANN systems, spatial relationships between nodes have some physical relevance.
• Properties of large-scale systems: ANNs also reflect a growing body of theory stating that large-scale systems built from a small unit need not simply mirror properties of a smaller system (contrast fractals and chaotic systems with digital filters).

Why Artificial Neural Networks?

• Important physical observations: the human central nervous system contains 10^11 to 10^14 nerve cells, each of which interacts with 10^3 to 10^4 other neurons. Inputs may be excitatory (promote firing) or inhibitory.

The HMM State (linear) versus The Artificial Neuron (nonlinear):

$$\alpha(y_1^{t+1}, j) = \sum_{i=1}^{S} \alpha(y_1^t, i)\, a(j \mid i)\, b(y(t+1) \mid j) \qquad\qquad y_k \equiv S\!\left( \sum_i w_{ki}\, y'_i - \theta_k \right)$$

[Figure: an HMM state $n_k$ with vector input, incoming arcs $a_{ik}$, outgoing arcs $a_{kq}, a_{kr}, a_{ks}$, and output distribution $b_k(y_k)$, beside an artificial neuron summing weighted inputs through a nonlinearity $S(\cdot)$ to a scalar output.]
Typical Thresholding Functions: A Key Difference

The input to the thresholding function is a weighted sum of the inputs:

$$u_k \equiv w_k^T y'$$

The output is typically defined by a nonlinear function: linear, ramp, step, or sigmoid:

$$S(u) = (1 + e^{-u})^{-1} \quad \text{(sigmoid)}$$

[Figure: plots of the linear, ramp, step, and sigmoid nonlinearities.]

Sometimes a bias is introduced into the threshold function:

$$y_k \equiv S(w_k^T y' - \theta_k) = S(u_k - \theta_k)$$

This can be represented as an extra input whose value is always $-1$:

$$y'_{N+1} = -1, \qquad w_{k, N+1} = \theta_k$$
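The bias-as-extra-input identity can be checked directly; the weights and inputs below are arbitrary example values:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def neuron(w, y, theta):
    """y_k = S(w^T y' - theta_k) with a sigmoid nonlinearity."""
    return sigmoid(w @ y - theta)

def neuron_augmented(w, y, theta):
    """Same unit with the bias folded in: y'_{N+1} = -1, w_{k,N+1} = theta."""
    return sigmoid(np.append(w, theta) @ np.append(y, -1.0))

w = np.array([0.5, -1.2, 2.0])
y = np.array([1.0, 0.3, -0.4])
out = neuron(w, y, 0.25)
```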
Radial Basis Functions

Another popular formulation involves the use of a Euclidean distance:

$$y_k = S\!\left( \sum_{i=1}^{N} (w_{ik} - y'_i)^2 - \theta_k \right) = S\!\left( \lVert w_k - y' \rVert^2 - \theta_k \right)$$

Note the parallel to a continuous distribution HMM. This approach has a simple geometric interpretation:

[Figure: a circular decision region of radius $\Theta_k$ around $w_k$ in the $(y'_1, y'_2)$ plane, beside the corresponding step nonlinearity $S(u)$.]

Another popular variant of this design is to use a Gaussian nonlinearity:

$$S(u) = e^{-u^2}$$

What types of problems are such networks useful for?

• pattern classification (N-way choice; character recognition)
• associative memory (generate an output from a noisy input; vector quantization)
• feature extraction (similarity transformations, dimensionality reduction)

These have been shown to be quite useful for a wide range of problems. We will focus on multilayer perceptrons in our studies.
Multilayer Perceptrons (MLP)

This architecture has the following characteristics:

• Network segregated into layers: $N_i$ cells per layer, $L$ layers
• Feedforward, or nonrecurrent, network (no feedback from the output of a node to the input of a node)

[Figure: an input layer $x_0, x_1, \ldots, x_N$, a hidden layer, and an output layer $y_0, y_1, \ldots, y_N$.]

An alternate formulation of such a net is known as the learning vector quantizer (LVQ), to be discussed later. The LVQ network uses unsupervised learning: the network adjusts itself automatically to the input data, thereby clustering the data (learning the boundaries representing a segregation of the data). LVQ is popular because it supports discriminative training.

The MLP network, not surprisingly, uses a supervised learning algorithm. The network is presented the input and the corresponding output, and must learn the optimal weights of the coefficients to minimize the difference between these two.
Why Artificial Neural Networks?

• An ability to separate classes that are not linearly separable:

[Figure: two cases, a linearly separable pair of classes divided by a straight decision boundary, and "meshed" classes requiring an arbitrarily shaped region.]

A three-layer perceptron is required to determine arbitrarily shaped decision regions.

• Nonlinear statistical models: the ANN is capable of modeling arbitrarily complex probability distributions, much like the difference between VQ and continuous distributions in HMMs.

• Context-sensitive statistics: the ANN can learn complex statistical dependencies, provided there are enough degrees of freedom in the system.

Both of these are extremely important to the speech recognition problem.

Why not Artificial Neural Networks? (The Price We Pay...)

• Difficult to deal with patterns of unequal length
• Temporal relationships not explicitly modeled
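A concrete instance of the "not linearly separable" point: XOR cannot be computed by a single threshold unit, but a two-layer construction with hand-picked weights (an illustrative choice, not a trained network) computes it exactly:

```python
def step(u):
    """Hard threshold nonlinearity."""
    return 1.0 if u > 0 else 0.0

def xor_net(x1, x2):
    """Hidden unit h1 fires for 'at least one input on', h2 for 'both on';
    the output unit fires when h1 is on but h2 is not: exactly XOR."""
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    return step(h1 - h2 - 0.5)

outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```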
Session IX: Language Modeling
Perplexity

How do we evaluate the difficulty of a recognition task?

• Instantaneous/local properties that influence peak resource requirements: maximum number of branches at a node; acoustic confusability
• Global properties that influence average difficulty: perplexity

If there are W possible words output from a random source, the entropy associated with this source is defined as:

$$H(w) = -\lim_{N \to \infty} \frac{1}{N}\, E\left\{ \log P(w_1^N = \hat{w}_1^N) \right\} = -\lim_{N \to \infty} \frac{1}{N} \sum_{\hat{w}_1^N} P(w_1^N = \hat{w}_1^N) \log P(w_1^N = \hat{w}_1^N)$$

For an ergodic source, we can compute temporal averages:

$$H(w) = -\lim_{N \to \infty} \frac{1}{N} \log P(w_1^N = \hat{w}_1^N)$$

Of course, these probabilities must be estimated from training data (or test data), and the source is not statistically independent (as words in a language tend to be).

Perplexity, a common measure used in speech recognition, is simply:

$$Q(w) = 2^{H(w)}$$

and represents the average branching factor of the language.

• For a language in which all words are equally likely for all time, $Q(w) = W$.
• Perplexities range from 11 for digit recognition to several hundred for LVCSR.
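Entropy and perplexity of a memoryless word source can be computed directly; for the equally-likely case the perplexity equals the vocabulary size, matching the digit-recognition figure of 11 quoted above (assuming an 11-word digit vocabulary). The skewed distribution is a made-up example:

```python
import math

def entropy_bits(p):
    """H = -sum p log2 p for a memoryless word source."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def perplexity(p):
    """Q = 2^H: the average branching factor of the language."""
    return 2.0 ** entropy_bits(p)

uniform_11 = [1.0 / 11] * 11          # 11 equally likely words
skewed = [0.5, 0.25, 0.125, 0.125]    # same vocabulary size, lower perplexity
```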
Types of Language Models

Common Language Models:

• No Grammar (Digits)
• Sentence pattern grammars (Resource Management)
• Word Pair/Bigram (RM, Wall Street Journal)
• Trigram (WSJ, etc.)
• Word Class (WSJ, etc.)
• Back-Off Models (Merging Bigrams and Trigrams)
• Long Range N-Grams and Co-Occurrences (SWITCHBOARD)
• Triggers and Cache Models (WSJ)
• Link Grammars (SWITCHBOARD)

How do we deal with Out-Of-Vocabulary words?

• Garbage Models
• Monophone and Biphone Grammars with optional Dictionary Lookup
• Filler Models

Can we adapt the language model?
Other Formal Languages

• Unrestricted Grammars
• Context Sensitive (N-Grams, Unification): $\omega_1 A \omega_2 \to \omega_1 \beta \omega_2$
• Context Free (LR): $A \to \beta$
• Regular (or Finite State): $A \to aB$, $A \to b$
• Issues: stochastic parsing; memory, pruning, and complexity
What can you do with all of this?

Logical: $\exists(X) \wedge \exists(Y) \wedge \exists(Z) \wedge \text{doctor}(X) \wedge \text{patient}(Y) \wedge \text{knees}(Z) \wedge \text{partof}(Y, Z) \wedge \text{examined}(X, Z)$

Syntactic: [Parse tree: S dominating NP [case:subj] and VP; the verb carries tense: past, arg1: subj [type:caregiver], arg2: obj [type:bodypart]; determiners marked def:+, nouns marked for form, number, and semantic type (caregiver, patient, bodypart).]

Orthographic: The doctor # examined the patient's # knees.

Phonemic: /# dh i # d A k t exr I g z ae m ex n + d # dh i # p e sh ex n t + z # n i + z #/

Phonetic: dh ex d A k t exr I g z ae m I n d dh ex pH eI sh I n t s n i: z

[Figure: the corresponding spectrogram, 0 to 5 kHz over roughly 2 seconds.]
Session X: State of the Art
WHAT IS SPEECH UNDERSTANDING?

• Speech Recognition: transcription of words; performance measured by word error rate
• Speech Understanding: acting on intentions; performance measured by the number of "queries" successfully answered (adds a natural language dimension to the problem)

DIMENSIONS OF THE PROBLEM

• Performance is a function of the vocabulary: how many words are known to the system; how acoustically similar the competing choices are at each point; how many words are possible at each point in the dialog
• Vocabulary size is a function of memory size
• Usability is a function of the number of possible combinations of vocabulary words: restrictive syntax is good for performance and bad for usability
• Hence, performance is a function of memory size
TODAY'S TECHNOLOGY

[Chart: word error rate (0.1% to 100.0%), memory (100K to 100M), and processing (10 to 100K MIPS) plotted against vocabulary size (10 to 100K words), spanning digit recognition (10 words, perplexity = 10) up to NAB Newswire (100K words, perplexity = 250).]
PROGRESS IN CSR PERFORMANCE

[Chart: word error rate (log scale, 1% to 100%) versus year (1987 to 1994) for Resource Management (1000 words), Wall Street Journal (5000 words), Wall Street Journal (20,000 words), and NAB News (unlimited vocabulary).]

Note: Machine performance is still at least two orders of magnitude lower than humans (data from George Doddington, ARPA HLT Program Manager, HLT'95).
WHAT IS THE CONTEXT FOR SPEECH UNDERSTANDING RESEARCH IN THE 1990's?

• The Application Need: interaction with online data (Internet); automated information agents ("24-hour Help Desk"); global multilingual interactions
• The Technology Challenge: create natural, transparent interaction with computers ("2001," "Star Trek"); bring computing to the masses (vanishing window); intelligent presentation of information ("hands/eyes busy" applications)
• Application Areas: Command/Control (Telecom/Workstations); Database Query (Internet); Dictation (Workstations); Machine Translation (Workstations); Real-Time Interpretation (Telecom)
THE TECHNICAL CHALLENGE

• Barriers

Speech understanding is a vast empirical problem:
  - Hierarchies of hidden representations (part-of-speech, noun phrases, units of meaning) produce immensely complex models
  - Training of complex models requires huge and dynamic knowledge bases

Interconnected and interdependent levels of representation:
  - Correct recognition and transcription of speech depends on understanding the meaning encoded in speech
  - Correct understanding and interpretation of text depends on the domain of discourse

• Approach

Capitalize on exponential growth in computer power and memory:
  - Statistical modeling and automatic training
  - Shared resources and infrastructure

Application-focused technical tasks:
  - New metrics of performance based on user feedback and productivity enhancement (human factors)

(G. Doddington, ARPA HLT Program Manager, "Spoken Language Technology Discussion," ARPA Human Language Technology Workshop, January 1995)
Example No. 1: HTK (Cambridge University)

Signal Processing:
  Sample Frequency = 16 kHz
  Frame Duration = 10 ms
  Window Duration = 25 ms (Hamming window)
  FFT-Based Spectrum Analysis
  Mel-Frequency Filter Bank (24)
  Log-DCT Cepstral Computation (Energy plus 12 coefficients)
  Linear Regression Using a 5-Frame Window for ∆ and ∆² Parameters

Acoustic Modeling:
  Simple Left-To-Right Topology With Three Emitting States
  10-Component Gaussian Mixtures at Each State
  Baum-Welch Reestimation
  Cross-Word Context-Dependent Phone Models
  State-Tying Using Phonetic Decision Trees (Common Broad Phonetic Classes)
  +/- 2 Phones Used in Phonetic Decision Trees, Including Word Boundary

Language Modeling:
  N-Gram Language Model (3-, 4-, and 5-Grams Have Been Reported)
  Discounting (De-Weight More Probable N-Grams)
  Back-Off Model Using Bigrams
  Single-Pass Time-Synchronous Tree Search

Notes:
  Approx. 3 to 8 Million Free Parameters
  State-of-the-Art Performance
  Relatively Low Complexity
  Excellent Real-Time Performance
Example No. 2: Abbot Hybrid Connectionist HMM (Cambridge University)

Signal Processing:
  Sample Frequency = 16 kHz
  Frame Duration = 16 ms
  Window Duration = 32 ms
  FFT-Based Spectrum Analysis
  Mel-Frequency Filter Bank (20) + 3 Voicing Features (replaced by PLP)
  Each Feature Normalized to Zero Mean and Unit Variance (Grand Covariance)

Acoustic Modeling:
  Recurrent Neural Network
  Single-Layer Feed-Forward Network With Four-Frame Delay
  Sigmoidal Nonlinearities
  A Single Network Computes All Posterior Phone Probabilities
  Viterbi-Like Back-Propagation-Through-Time Training
  Parallel Combination of Backward-In-Time and Forward-In-Time Networks
  Context-Dependent Probabilities Computed Using a Second Layer
  Linear Input Mapping for Speaker Adaptation (Similar to MLLR)

Language Modeling:
  Single-Pass Stack Decoding Algorithm
  N-Gram Language Model
  Phone-Level and Word-Level Pruning
  Tree-Based Lexicon
  NO Context-Dependent Cross-Word Models

Notes:
  Approx. 80K Free Parameters
  Performance Comparable to Best HMMs
  Relatively Low Complexity
  Excellent Real-Time Performance
Additional Resources

Web:
• CMU: http://www.cmu.edu/speech
• comp.speech: ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/
• UCSC: http://mambo.ucsc.edu/psl/speech.html
• IEEE SP (Rice): http://spib.rice.edu/spib.html

Public Domain:
• Oregon Graduate Institute
• Cambridge Connectionist Group (AbbotDemo)
• ISIP (about one year away)

Commercial:
• Entropic Waves / HTK
• Matlab (coming soon, also Entropic)
• ... and many more