Are you sure?
This action might not be possible to undo. Are you sure you want to continue?
B. Plannerer
March 28, 2005
1
This document . . .
. . . was written after I presented a tutorial on speech recognition on the graduates
workshop ”SMG”, November 2000, Venice, held by Prof. Dr. H. G. Tillmann, Institut
f¨ ur Phonetik und Sprachliche Kommunikation, University of Munich, Germany. Ini
tially, I intended to give a short overview over the principles of speech recognition and
the industrial applications of speech recognition technology to an audience consisting
both of engineers and non–engineers. As I worked on the slides, I tried to explain most
of the principles and algorithms of speech recognition by using pictures and examples
that can also be understood by non–technicians. For the engineers in the audience
and to prevent me from making inappropriate simpliﬁcations, I also added the formal
framework. Thus, I always started to explain the algorithms by using pictures and
examples to give an idea of what the algorithm does and then went into some of the
details of the formulae. Although I was afraid that this might be the perfect approach
to boring both the engineers and the non–engineers of my audience, it turned out
that they both could live with this mixed presentation. Being asked to convert my
slides into this written introduction to speech recognition, I took the chance to add
some more details and formal framework which can be skipped by the non–engineer
readers.
Version 1.1
In version 1.1 of this document some error corrections were made. In some
chapters examples and pictures were added. Of course, errors (old and new
ones)will still be contained and additional reﬁnements might be necessary. If
you have any suggestions how to improve this document, feel free to send an
email to
plannerer@ieee.org
Copyright
This work is subject to copyright. All rights are reserved, whether the whole or part of this document is concerned,
specifically the rights of translation, reuse of illustrations, reproduction and storage in data banks. This document
may be copied, duplicated or modified only with the written permission of the author.
c 2001,2002,2003 Bernd Plannerer, Munich, Germany
plannerer@ieee.org
Who should read this?
This introduction to speech recognition is intended to provide students and
graduates with the necessary background to understand the principles of speech
recognition technology. It will give an overview over the very basics of pattern
matching and stochastic modelling and it will show how the dynamic program
ming algorithm is used for recognition of isolated words as well as for the recog
nition of continuous speech. While it is impossible to present every detail of all
these topics, it is intended to provide enough information to the interested reader
to select among the numerous scientiﬁc publications on the subject. Those who
are not interested in further studies should at least be provided with an idea of
how automated speech recognition works and what today’s systems can do —
and can’t do.
2
Chapter 1
The Speech Signal
1.1 Production of Speech
While you are producing speech sounds, the air ﬂow from your lungs ﬁrst passes
the glottis and then your throat and mouth. Depending on which speech sound
you articulate, the speech signal can be excited in three possible ways:
• voiced excitation The glottis is closed. The air pressure forces the glot
tis to open and close periodically thus generating a periodic pulse train
(triangle–shaped). This ”fundamental frequency” usually lies in the range
from 80Hz to 350Hz.
• unvoiced excitation The glottis is open and the air passes a narrow
passage in the throat or mouth. This results in a turbulence which gener
ates a noise signal. The spectral shape of the noise is determined by the
location of the narrowness.
• transient excitation A closure in the throat or mouth will raise the air
pressure. By suddenly opening the closure the air pressure drops down
immediately. (”plosive burst”)
With some speech sounds these three kinds of excitation occur in combination.
The spectral shape of the speech signal is determined by the shape of the vocal
tract (the pipe formed by your throat, tongue, teeth and lips). By changing
the shape of the pipe (and in addition opening and closing the air ﬂow through
your nose) you change the spectral shape of the speech signal, thus articulating
diﬀerent speech sounds.
1.2 Technical Characteristics of the Speech Sig
nal
An engineer looking at (or listening to) a speech signal might characterize it as
follows:
• The bandwidth of the signal is 4 kHz
3
CHAPTER 1. THE SPEECH SIGNAL 4
• The signal is periodic with a fundamental frequency between 80 Hz and
350 Hz
• There are peaks in the spectral distribution of energy at
(2n −1) ∗ 500 Hz ; n = 1, 2, 3, . . . (1.1)
• The envelope of the power spectrum of the signal shows a decrease with
increasing frequency (6dB per octave)
This is a very rough and technical description of the speech signal. But where
do those characteristics come from?
1.2.1 Bandwidth
The bandwidth of the speech signal is much higher than the 4 kHz stated above.
In fact, for the fricatives, there is still a signiﬁcant amount of energy in the
spectrum for high and even ultrasonic frequencies. However, as we all know
from using the (analog) phone, it seems that within a bandwidth of 4 kHz the
speech signal contains all the information necessary to understand a human
voice.
1.2.2 Fundamental Frequency
As described earlier, using voiced excitation for the speech sound will result in
a pulse train, the socalled fundamental frequency. Voiced excitation is used
when articulating vowels and some of the consonants. For fricatives (e.g., /f/ as
in fish or /s/, as in mess ), unvoiced excitation (noise) is used. In these cases,
usually no fundamental frequency can be detected. On the other hand, the zero
crossing rate of the signal is very high. Plosives (like /p/ as in put), which use
transient excitation, you can best detect in the speech signal by looking for the
short silence necessary to build up the air pressure before the plosive bursts out.
1.2.3 Peaks in the Spectrum
After passing the glottis, the vocal tract gives a characteristic spectral shape to
the speech signal. If one simpliﬁes the vocal tract to a straight pipe (the length
is about 17cm), one can see that the pipe shows resonance at the frequencies as
given by (1.1). These frequencies are called formant frequencies. Depending on
the shape of the vocal tract (the diameter of the pipe changes along the pipe),
the frequency of the formants (especially of the 1st and 2nd formant) change
and therefore characterize the vowel being articulated.
1.2.4 The Envelope of the Power Spectrum Decreases with
Increasing Frequency
The pulse sequence from the glottis has a power spectrum decreasing towards
higher frequencies by 12dB per octave. The emission characteristics of the lips
show a highpass characteristic with +6dB per octave. Thus, this results in an
overall decrease of 6dB per octave.
CHAPTER 1. THE SPEECH SIGNAL 5
1.3 A Very Simple Model of Speech Production
As we have seen, the production of speech can be separated into two parts:
Producing the excitation signal and forming the spectral shape. Thus, we can
draw a simpliﬁed model of speech production [ST95, Rus88]:
emission
R(f)
lips
spectral shaping
H(f)
vocal tract
P(f)
voiced excitation
pulse train
white noise
N(f)
unvoiced excitation
S(f)
X(f)
u
v
Figure 1.1: A simple model of speech production
This model works as follows: Voiced excitation is modelled by a pulse generator
which generates a pulse train (of triangle–shaped pulses) with its spectrum given
by P(f). The unvoiced excitation is modelled by a white noise generator with
spectrum N(f). To mix voiced and unvoiced excitation, one can adjust the
signal amplitude of the impulse generator (v) and the noise generator (u). The
output of both generators is then added and fed into the box modelling the
vocal tract and performing the spectral shaping with the transmission function
H(f). The emission characteristics of the lips is modelled by R(f). Hence, the
spectrum S(f) of the speech signal is given as:
S(f) = (v · P(f) +u · N(f)) · H(f) · R(f) = X(f) · H(f) · R(f) (1.2)
To inﬂuence the speech sound, we have the following parameters in our speech
production model:
• the mixture between voiced and unvoiced excitation (determined by v and
u)
• the fundamental frequency (determined by P(f))
• the spectral shaping (determined by H(f))
• the signal amplitude (depending on v and u)
These are the technical parameters describing a speech signal. To perform
speech recognition, the parameters given above have to be computed from the
time signal (this is called speech signal analysis or ”acoustic preprocessing”) and
then forwarded to the speech recognizer. For the speech recognizer, the most
valuable information is contained in the way the spectral shape of the speech
signal changes in time. To reﬂect these dynamic changes, the spectral shape is
determined in short intervals of time, e.g., every 10 ms. By directly computing
the spectrum of the speech signal, the fundamental frequency would be implicitly
contained in the measured spectrum (resulting in unwanted ”ripples” in the
spectrum). Figure 1.2 shows the time signal of the vowel /a:/ and ﬁg. 1.3 shows
the logarithmic power spectrum of the vowel computed via FFT.
CHAPTER 1. THE SPEECH SIGNAL 6
Figure 1.2: Time signal of the vowel /a:/ (f
s
= 11kHz, length = 100ms). The
high peaks in the time signal are caused by the pulse train P(f) generated by
voiced excitation.
Figure 1.3: Log power spectrum of the vowel /a:/ (f
s
= 11kHz, N = 512). The
ripples in the spectrum are caused by P(f).
One could also measure the spectral shape by means of an analog ﬁlter bank us
ing several bandpass ﬁlters as is depicted below. After rectifying and smoothing
the ﬁlter outputs, the output voltage would represent the energy contained in
the frequency band deﬁned by the corresponding bandpass ﬁlter (cf. [Rus88]).
1.4 Speech Parameters Used by Speech Recog
nition Systems
As shown above, the direct computation of the power spectrum from the speech
signal results in a spectrum containing ”ripples” caused by the excitation spec
trum X(f). Depending on the implementation of the acoustic preprocessing
however, special transformations are used to separate the excitation spectrum
X(f) from the spectral shaping of the vocal tract H(f). Thus, a smooth spectral
shape (without the ripples), which represents H(f) can be estimated from the
speech signal. Most speech recognition systems use the so–called mel frequency
cepstral coeﬃcients (MFCC) and its ﬁrst (and sometimes second) derivative in
time to better reﬂect dynamic changes.
1.4.1 Computation of the Short Term Spectra
As we recall, it is necessary to compute the speech parameters in short time
intervals to reﬂect the dynamic change of the speech signal. Typically, the spec
CHAPTER 1. THE SPEECH SIGNAL 7
vn(t)
v2(t)
v1(t)
fn
f2
f1
s(t)
Figure 1.4: A ﬁlter bank for spectral analysis
tral parameters of speech are estimated in time intervals of 10ms. First, we
have to sample and digitize the speech signal. Depending on the implemen
tation, a sampling frequency f
s
between 8kHz and 16kHz and usually a 16bit
quantization of the signal amplitude is used. After digitizing the analog speech
signal, we get a series of speech samples s(k · ∆t) where ∆t = 1/f
s
or, for easier
notation, simply s(k). Now a preemphasis ﬁlter is used to eliminate the 6dB
per octave decay of the spectral energy:
ˆ s(k) = s(k) −0.97 · s(k −1) (1.3)
Then, a short piece of signal is cut out of the whole speech signal. This is done
by multiplying the speech samples ˆ s(k) with a windowing function w(k) to cut
out a short segment of the speech signal, v
m
(k) starting with sample number
k = m and ending with sample number k = m + N − 1. The length N of the
segment (its duration) is usually chosen to lie between 16ms to 25 ms, while the
time window is shifted in time intervals of about 10ms to compute the next set
of speech parameters. Thus, overlapping segments are used for speech analysis.
Many window functions can be used, the most common one is the so–called
HammingWindow:
w(k) =
0.54 −0.46 cos
2πk
N−1
: if k = 0, 1, . . . N −1
0 : else
(1.4)
where N is the length of the time window in samples. By multiplying our speech
signal with the time window, we get a short speech segment v
m
(k):
v
m
(k) =
ˆ s(k) · w(k −m) : if k = m, m+ 1, . . . m+N −1
0 : else
(1.5)
CHAPTER 1. THE SPEECH SIGNAL 8
As already mentioned, N denotes the length of the speech segment given in
samples (the window length is typically between 16ms and 25ms) while m is
the start time of the segment. The start time m is incremented in intervals of
(usually) 10ms, so that the speech segments are overlapping each other. All the
following operations refer to this speech segment v
m
(k), k = m. . . m + N − 1.
To simplify the notation, we shift the signal in time by m samples to the left, so
that our time index runs from 0 . . . N −1 again. From the windowed signal, we
want to compute its discrete power spectrum V (n)
2
. First of all, the complex
spectrum V (n) is computed. The complex spectrum V (n) has the following
properties:
• The spectrum V (n) is deﬁned within the range from n = −∞to n = +∞.
• V (n) is periodic with period N, i.e.,
V (n ±i · N) = V (n) ; i = 1, 2, . . . (1.6)
• Since v
m
(k) is real–valued, the absolute values of the coeﬃcients are also
symmetric:
V (−n) = V (n) (1.7)
To compute the spectrum, we compute the discrete Fourier transform (DFT,
which gives us the discrete, complex–valued short term spectrum
ˆ
V (n) of the
speech signal (for a good introduction to the DFT and FFT, see [Bri87], and
for both FFT and Hartley Transform theory and its applications see [Bra00]):
ˆ
V (n) =
N−1
¸
k=0
v(k) · e
−j2πkn/N
; n = 0, 1, . . . N −1 (1.8)
The DFT gives us N discrete complex values for the spectrum
ˆ
V (n) at the
frequencies n · ∆f where
∆f =
1
N∆t
(1.9)
Remember that the complex spectrum V (n) is deﬁned for n = −∞ to n = +∞,
but is periodic with period N (1.6). Thus, the N diﬀerent values of
ˆ
V (n) are
suﬃcient to represent V (n). One should keep in mind that we have to interpret
the values of
ˆ
V (n) ranging from n = N/2 to n = N − 1 as the values for the
negative frequencies of the spectrum V (n · ∆f), i.e for −N/2 · ∆f to −1 · ∆f.
One could think that with a frequency range from −N/2 · ∆f to +N/2 · ∆f
we should have N +1 diﬀerent values for V (n), but since V (n) is periodic with
period N (1.6), we know that
V (−N/2 · ∆f) = V (N/2 · ∆f)
. So nothing is wrong, and the N diﬀerent values we get from
ˆ
V (n) are suﬃcient
to describe the spectrum V (n). For further processing, we are only interested
in the power spectrum of the signal. So we can compute the squares of the
absolute values, V (n)
2
.
Due to the periodicity (1.6) and symmetry (1.7) of V (n), only the values
V (0)
2
. . . V (N/2)
2
are used for further processing, giving a total number of
N/2 + 1 values.
It should be noted that V (0) contains only the DCoﬀset of the signal and
therefore provides no useful information for our speech recognition task.
CHAPTER 1. THE SPEECH SIGNAL 9
1.4.2 Mel Spectral Coeﬃcients
As was shown in perception experiments, the human ear does not show a linear
frequency resolution but builds several groups of frequencies and integrates the
spectral energies within a given group. Furthermore, the midfrequency and
bandwidth of these groups are non–linearly distributed. The non–linear warping
of the frequency axis can be modeled by the so–called melscale. The frequency
groups are assumed to be linearly distributed along the melscale. The so–called
mel–frequency f
mel
can be computed from the frequency f as follows:
f
mel
(f) = 2595 · log
1 +
f
700Hz
(1.10)
Figure 1.5 shows a plot of the mel scale (1.10).
0 500 1000 1500 2000 2500 3000 3500 4000
0
500
1000
1500
2000
2500
f
[
m
e
l
]
f [Hz]
mel frequency scale
Figure 1.5: The mel frequency scale
The human ear has high frequency resolution in low–frequency parts of the spec
trum and low frequency resolution in the high–frequency parts of the spectrum.
The coeﬃcients of the power spectrum V (n)
2
are now transformed to reﬂect
the frequency resolution of the human ear.
A common way to do this is to use K triangle–shaped windows in the spectral
domain to build a weighted sum over those power spectrum coeﬃcients V (n)
2
which lie within the window. We denote the windowing coeﬃcients as
η
kn
; k = 0, 1, . . . K −1 ; n = 0, 1, . . . N/2
CHAPTER 1. THE SPEECH SIGNAL 10
This gives us a new set of coeﬃcients,
G(k) ; k = 0, 1, . . . , K −1
the so–called mel spectral coeﬃcients, e.g., [ST95].
G(k) =
N/2
¸
n=0
η
kn
· V (n)
2
; k = 0, 1, . . . K −1 (1.11)
Caused by the symmetry of the original spectrum (1.7), the mel power spectrum
is also symmetric in k:
G(k) = G(−k) (1.12)
Therefore, it is suﬃcient to consider only the positive range of k, k = 0, 1, . . . K−
1.
1.4.3 Example: Mel Spectral Coeﬃcients
This example will show how to compute the window coeﬃcients η
kn
for a given
set of parameters. We start with the deﬁnition of the preprocessing parameters:
f
s
= 8000 Hz
N = 256
K = 22
From these values, we can derive some useful parameters. First, we compute
the maximum frequency of the sampled signal: f
g
=
fs
2
= 4000 Hz.
According to (1.10), this corresponds to a maximum mel frequency of f
max
[mel] =
2146.1 mel. The center frequencies of our triangleshaped windows are equally
spaced on the mel scale. Therefore, the frequency distance bewteen every two
center frequencies can easily be computed:
∆
mel
=
f
max
K + 1
= 93.3 mel
Thus, the center frequencies on the mel scale can be expressed as:
f
c
(k) = (k + 1) · ∆
mel
; k = 0 · · · K −1
Given the center frequencies in the mel scale, the center frequencies on the linear
frequency scale can easily be computed by solving equantion (1.10).
In ﬁgure 1.6 you can see a plot of the resulting center frequencies f
c
(k) on the
linear frequency scale vs. the mel frequency scale.
Now the values for the center frequencies on the linear frequency scale have to
be mapped to the indices of the FFT. The frequency resolution of our FFT with
length N is given as ∆f =
fs
N
= 31.25Hz.
Dividing f
c
(k) on the linear frequency scale by ∆f and rounding to the next in
teger, we get the corresponding indices of the coeﬃcients of the power spectrum
V (n)
2
. The rounding to the next integer can be interpreted as a quantization
of the linear frequency scale where the frequency resolution is given by ∆f. We
denote the power spectrum indices corresponding to the center frequencies as
n
c
(k).
CHAPTER 1. THE SPEECH SIGNAL 11
0 500 1000 1500 2000 2500 3000 3500 4000
0
500
1000
1500
2000
2500
mel filter frequencies
f[Hz]
f
[
m
e
l
]
Figure 1.6: Center frequencies of the mel ﬁlter. (f
s
= 8kHz ; K = 22).
In ﬁgure 1.7 you can see a plot of the power spectrum coeﬃcient indices n
c
(k)
versus the mel spectral coeﬃcient index k ; k = 0 · · · K −1.
Now we know the center frequencies f
c
(k) of our windows and the coresponding
coeﬃcients of the power spectrum n
c
(k).
All what is left to do is to compute the windowing coeﬃcients η
kn
; k =
0, 1, . . . K − 1 ; n = 0, 1, . . . N/2. The windowing coeﬃcients η
kn
can be repre
sented as a windowing matrix
˜
W = [η
kn
]. The rows of this matrix represent the
individual windows k (”’ﬁlter channels”’), the columns of the matrix represent
the weighting coeﬃcients for each power spectrum coeﬃcient n.
For each window (rows k in the matrix), the weighting coeﬃcients η
kn
are
computed to form a triangleshaped window with a maximum value of 1.0 at
the center frequency f
c
(k). The column index of the window’s maximum is
given by the corresponding index of the power spectrum n
c
(k). The left and
right boundaries of the triangles of a given window are deﬁned to be the center
frequencies of the left and right ﬁlter channel, i.e., f
c
(k−1) and f
c
(k+1). Thus,
the corresponding weighting coeﬃcients η
k,nc(k−1)
and η
k,nc(k+1)
are deﬁned to
be zero. The coeﬃcients within the window are chosen to linearly raise from
zero at column index n
c
(k − 1) to the maximum value 1.0 at n
c
(k) and then
again decrease towards zero at index n
c
(k + 1).
The resulting matrix
˜
W is shown in Figure 1.8. A black value on the grayscale
denotes a coeﬃcient value of 1.0. Note that the bandwidth (number of power
spectrum coeﬃcients within a window) increases with the center frequency of
CHAPTER 1. THE SPEECH SIGNAL 12
2 4 6 8 10 12 14 16 18 20 22
10
20
30
40
50
60
70
80
90
100
110
FFT indices
filter channel
F
F
T
i
n
d
e
x
Figure 1.7: Indices of power spectrum vs mel spectral coeﬃcient index.
the window.
1.4.4 Cepstral Transformation
Before we continue, lets remember how the spectrum of the speech signal was
described by (1.2). Since the transmission function of the vocal tract H(f) is
multiplied with the spectrum of the excitation signal X(f), we had those un
wanted ”ripples” in the spectrum. For the speech recognition task, a smoothed
spectrum is required which should represent H(f) but not X(f). To cope with
this problem, cepstral analysis is used. If we look at (1.2), we can separate the
product of spectral functions into the interesting vocal tract spectrum and the
part describing the excitation and emission properties:
S(f) = X(f) · H(f) · R(f) = H(f) · U(f) (1.13)
We can now transform the product of the spectral functions to a sum by taking
the logarithm on both sides of the equation:
log (S(f)) = log (H(f) · U(f))
= log (H(f)) + log (U(f)) (1.14)
CHAPTER 1. THE SPEECH SIGNAL 13
matrix of filter coefficients
FFT index
f
i
l
t
e
r
c
h
a
n
n
e
l
20 40 60 80 100 120
2
4
6
8
10
12
14
16
18
20
22
Figure 1.8: Matrix of coeﬃcients η
kn
.
This holds also for the absolute values of the power spectrum and also for their
squares:
log
S(f)
2
= log
H(f)
2
· U(f)
2
= log
H(f)
2
+ log
U(f)
2
(1.15)
In ﬁgure 1.9 we see an example of the log power spectrum, which contains
unwanted ripples caused by the excitation signal U(f) = X(f) · R(f).
In the log–spectral domain we could now subtract the unwanted portion of the
signal, if we knew U(f)
2
exactly. But all we know is that U(f) produces the
”ripples”, which now are an additive component in the log–spectral domain, and
that if we would interpret this log–spectrum as a time signal, the ”ripples” would
have a ”high frequency” compared to the spectral shape of H(f). To get rid of
the inﬂuence of U(f), one would have to get rid of the ”highfrequency” parts of
the log–spectrum (remember, we are dealing with the spectral coeﬃcients as if
they would represent a time signal). This would be a kind of low–pass ﬁltering.
The ﬁltering can be done by transforming the log–spectrum back into the time–
domain (in the following, FT
−1
denotes the inverse Fourier transform):
ˆ s(d) = FT
−1
¸
log
S(f)
2
¸
= FT
−1
¸
log
H(f)
2
¸
+FT
−1
¸
log
U(f)
2
¸
(1.16)
CHAPTER 1. THE SPEECH SIGNAL 14
Figure 1.9: Log power spectrum of the vowel /a:/ (f
s
= 11kHz, N = 512). The
ripples in the spectrum are caused by X(f).
The inverse Fourier transform brings us back to the time–domain (d is also called
the delay or quefrency), giving the so–called cepstrum (a reversed ”spectrum”).
The resulting cepstrum is real–valued, since U(f)
2
and H(f)
2
are both real
valued and both are even: U(f)
2
= U(−f)
2
and H(f)
2
= H(−f)
2
. Apply
ing the inverse DFT to the log power spectrum coeﬃcients log
V (n)
2
yields:
ˆ s(d) =
1
N
N−1
¸
n=0
log
V (n)
2
· e
j2πdn/N
; d = 0, 1, . . . N −1 (1.17)
Figure 1.10 shows the result of the inverse DFT applied on the log power spec
trum shown in ﬁg. 1.9.
The peak in the cepstrum reﬂects the ripples of the log power spectrum: Since
the inverse Fourier transformation (1.17) is applied to the log power spectrum,
an oscillation in the frequency domain with a Hertzperiod of
p
f
= n · ∆f = n ·
f
s
N
=
n
N · ∆t
(1.18)
will show up in the cepstral domain as a peak at time t:
t =
1
p
f
=
N
n
·
1
f
s
=
N · ∆t
n
(1.19)
The cepstral index d of the peak expressed as a function of the period length n
is:
d =
t
∆t
=
N · ∆t
n · ∆t
=
N
n
(1.20)
The low–pass ﬁltering of our energy spectrum can now be done by setting the
highervalued coeﬃcients of ˆ s(d) to zero and then transforming back into the
frequency domain. The process of ﬁltering in the cepstral domain is also called
liftering. In ﬁgure 1.10, all coeﬃcients right of the vertical line were set to zero
and the resulting signal was transformed back into the frequency domain, as
shown in ﬁg. 1.11.
One can clearly see that this results in a ”smoothed” version of the log power
spectrum if we compare ﬁgures 1.11 and 1.9.
CHAPTER 1. THE SPEECH SIGNAL 15
Figure 1.10: Cepstrum of the vowel /a:/ (f
s
= 11kHz, N = 512). The ripples
in the spectrum result in a peak in the cepstrum.
Figure 1.11: Power spectrum of the vowel /a:/ after cepstral smoothing. All
but the ﬁrst 32 cepstral coeﬃcients were set to zero before transforming back
into the frequency domain.
It should be noted that due to the symmetry (1.7) of V (n)
2
, is is possible to
replace the inverse DFT by the more eﬃcient cosine transform [ST95]:
ˆ s(0) =
2
N
N/2−1
¸
n=0
log
V (n)
2
(1.21)
ˆ s(d) =
4
N
N/2−1
¸
n=0
log
V (n)
2
· cos
πd (2n + 1)
N
; d = 1, 2, ...N/2
(1.22)
1.4.5 Mel Cepstrum
Now that we are familiar with the cepstral transformation and cepstral smooth
ing, we will compute the mel cepstrum commonly used in speech recognition.
As stated above, for speech recognition, the mel spectrum is used to reﬂect the
perception characteristics of the human ear. In analogy to computing the cep
strum, we now take the logarithm of the mel power spectrum (1.11) (instead
of the power spectrum itself) and transform it into the quefrency domain to
compute the so–called mel cepstrum. Only the ﬁrst Q (less than 14) coeﬃcients
of the mel cepstrum are used in typical speech recognition systems. The restric
tion to the ﬁrst Q coeﬃcients reﬂects the low–pass liftering process as described
above.
CHAPTER 1. THE SPEECH SIGNAL 16
Since the mel power spectrum is symmetric due to (1.12), the FourierTransform
can be replaced by a simple cosine transform:
c(q) =
K−1
¸
k=0
log (G(k)) · cos
πq(2k + 1)
2K
; q = 0, 1, . . . , Q−1 (1.23)
While successive coeﬃcients G(k) of the mel power spectrum are correlated, the
Mel Frequency Cepstral Coeﬃcients (MFCC) resulting from the cosine trans
form (1.23) are decorrelated. The MFCC are used directly for further processing
in the speech recognition system instead of transforming them back to the fre
quency domain.
1.4.6 Signal Energy
Furthermore, the signal energy is added to the set of parameters. It can simply
be computed from the speech samples s(n) within the time window by:
e =
N−1
¸
n=0
s
2
(n) (1.24)
1.4.7 Dynamic Parameters
As stated earlier in eqn (1.5), the MFCC are computed for a speech segment
v
m
at time index m in short time intervals of typically 10ms. In order to better
reﬂect the dynamic changes of the MFCC c
m
(q) (and also of the energy e
m
) in
time, usually the ﬁrst and second derivatives in time are also computed, e.g by
computing the diﬀerence of two coeﬃcients lying τ time indices in the past and
in the future of the time index under consideration.
For the ﬁrst derivative we get:
∆c
m
(q) = c
m+τ
(q) −c
m−τ
(q) ; q = 0, 1, ..Q−1 (1.25)
The sevond derivative is computed from the diﬀerences of the ﬁrst derivatives:
∆∆c
m
(q) = ∆c
m+τ
(q) −∆c
m−τ
(q) ; q = 0, 1, ..Q−1 (1.26)
The time intervall usually lies in the range 2 ≤ τ ≤ 4.
This results in a total number of up to 63 parameters (e.g., [Ney93]) which are
computed every 10ms. Of course, the choice of parameters for acoustic pre
processing has a strong impact on the performance of the speech recognition
systems. One can spend years of research on the acoustic preprocessing mod
ules of a speech recognition system, and many Ph.D. theses have been written
on that subject. For our purposes however, it is suﬃcient to remember that
the information contained in the speech signal can be represented by a set of
parameters which has to be measured in short intervals of time to reﬂect the
dynamic change of those parameters.
Chapter 2
Features and Vector Space
Until now, we have seen that the speech signal can be characterized by a set of
parameters (features), which will be measured in short intervals of time during
a preprocessing step. Before we start to look at the speech recognition task, we
will ﬁrst get familiar with the concept of feature vectors and vector space.
2.1 Feature Vectors and Vector Space
If you have a set of numbers representing certain features of an object you want
to describe, it is useful for further processing to construct a vector out of these
numbers by assigning each measured value to one component of the vector. As
an example, think of an air conditioning system which will measure the tem
perature and relative humidity in your oﬃce. If you measure those parameters
every second or so and you put the temperature into the ﬁrst component and
the humidity into the second component of a vector, you will get a series of
two–dimensional vectors describing how the air in your oﬃce changes in time.
Since these so–called feature vectors have two components, we can interpret
the vectors as points in a two–dimensional vector space. Thus we can draw a
two–dimensional map of our measurements as sketched below. Each point in
our map represents the temperature and humidity in our oﬃce at a given time.
As we know, there are certain values of temperature and humidity which we
ﬁnd more comfortable than other values. In the map the comfortable value–
pairs are shown as points labelled ”+” and the less comfortable ones are shown
as ””. You can see that they form regions of convenience and inconvenience,
respectively.
2.2 A Simple Classiﬁcation Problem
Lets assume we would want to know if a value–pair we measured in our oﬃce
would be judged as comfortable or as uncomfortable by you. One way to ﬁnd
out is to initially run a test series trying out many value–pairs and labelling each
point either ”+” or ”” in order to draw a map as the one you saw above. Now if
you have measured a new value–pair and you are to judge if it will be convenient
or not to a person, you would have to judge if it lies within those regions which
are marked in your map as ”+” or if it lies in those marked as ””. This is our
17
CHAPTER 2. FEATURES AND VECTOR SPACE 18
rel. humidity
60% 40% 50%
22°C
18°C
20°C
temperature
Figure 2.1: A map of feature vectors
ﬁrst example of a classiﬁcation task: We have two classes (”comfortable” and
”uncomfortable”) and a vector in feature space which has to be assigned to one
of these classes. — But how do you describe the shape of the regions and how
can you decide if a measured vector lies within or without a given region? In
the following chapter we will learn how to represent the regions by prototypes
and how to measure the distance of a point to a region.
2.3 Classiﬁcation of Vectors
2.3.1 Prototype Vectors
The problem of how to represent the regions of ”comfortable” and ”uncom
fortable” feature vectors of our classiﬁcation task can be solved by several
approaches. One of the easiest is to select several of the feature vectors we
measured in our experiments for each of our classes (in our example we have
only two classes) and to declare the selected vectors as ”prototypes” representing
their class. We will later discuss how one can ﬁnd a good selection of prototypes
using the ”k–means algorithm”. For now, we simply assume that we were able
to make a good choice of the prototypes, as shown in ﬁgure 2.2.
2.3.2 Nearest Neighbor Classiﬁcation
The classiﬁcation of an unknown vector is now accomplished as follows: Measure
the distance of the unknown vector to all classes. Then assign the unknown vec
tor to the class with the smallest distance. The distance of the unknown vector
to a given class is deﬁned as the smallest distance between the unknown vector
and all of the prototypes representing the given class. One could also verbalize
the classiﬁcation task as: Find the nearest prototype to the unknown vector
and assign the unknown vector to the class this ”nearest neighbor” represents
CHAPTER 2. FEATURES AND VECTOR SPACE 19
?
50% 60% 40%
18°C
20°C
22°C
rel. humidity
temperature
Figure 2.2: Selected prototypes
(hence the name). Fig. 2.2 shows the unknown vector and the two ”nearest
neighbors” of prototypes of the two classes. The classiﬁcation task we described
can be formalized as follows: Let Ω =
¸
ω
1
, ω
2
, . . . , ω
(V −1)
¸
be the set of classes,
V being the total number of classes. Each class is represented by its prototype
vectors p
k,ωv
, where k = 0, 1, . . . , (K
ωv
− 1). Let x denote the unclassiﬁed vec
tor. Let the distance measure between the vector and a prototype be denoted
as d (x, p
k,ωv
) (eg., the Euclidean distance. We will discuss several distance
measures later) . Then the class distance between x and the class ω
v
is deﬁned
as:
d
ωv
(x) = min
k
{d (x, p
k,ωv
)} ; k = 0, 1, . . . , (K
ωv
−1) (2.1)
Using this class distance, we can write the classiﬁcation task as follows:
x ∈ ω
v
⇔ v = arg min
v
{d
ωv
(x)} (2.2)
Chapter 3
Distance Measures
So far, we have found a way to classify an unknown vector by calculation of
its class–distances to predeﬁned classes, which in turn are deﬁned by the dis
tances to their individual prototype vectors. Now we will brieﬂy look at some
commonly used distance measures. Depending on the application at hand, each
of the distance measures has its pros and cons, and we will discuss their most
important properties.
3.1 Euclidean Distance
The Euclidean distance measure is the ”standard” distance measure between
two vectors in feature space (with dimension DIM) as you know it from school:
d
2
Euclid
(x, p) =
DIM−1
¸
i=0
(x
i
−p
i
)
2
(3.1)
To calculate the Euclidean distance measure, you have to compute the sum of
the squares of the diﬀerences between the individual components of x and p.
This can also be written as the following scalar product:
d
2
Euclid
(x, p) = (x − p)
′
· (x − p) (3.2)
where ′ denotes the vector transpose. Note that both equations (3.1) and (3.2)
compute the square of the Euclidean distance, d
2
instead of d. The Euclid
ean distance is probably the most commonly used distance measure in pattern
recognition.
3.2 City Block Distance
The computation of the Euclidean distance involves computing the squares of
the individual diﬀerences thus involving many multiplications. To reduce the
computational complexity, one can also use the absolute values of the diﬀerences
instead of their squares. This is similar to measuring the distance between two
points on a street map: You go three blocks to the East, then two blocks to the
South (instead of straight trough the buildings as the Euclidean distance would
20
CHAPTER 3. DISTANCE MEASURES 21
assume). Then you sum up all the absolute values for all the dimensions of the
vector space:
d
City−Block
(x, p) =
DIM−1
¸
i=0
x
i
−p
i
 (3.3)
3.3 Weighted Euclidean Distance
Both the Euclidean distance and the City Block distance are treating the indi
vidual dimensions of the feature space equally, i.e., the distances in each dimen
sion contributes in the same way to the overall distance. But if we remember our
example from section 2.1, we see that for real–world applications, the individual
dimensions will have diﬀerent scales also. While in our oﬃce the temperature
values will have a range of typically between 18 and 22 degrees Celsius, the
humidity will have a range from 40 to 60 percent relative humidity. While a
small diﬀerence in humidity of e.g., 4 percent relative humidity might not even
be noticed by a person, a temperature diﬀerence of 4 degrees Celsius certainly
will. In Figure 3.1 we see a more abstract example involving two classes and
two dimensions. The dimension x
1
has a wider range of values than dimension
x
2
, so all the measured values (or prototypes) are spread wider along the axis
denoted as ”x
1
” as compared to axis ”x
2
”. Obviously, an Euclidean or City
Block distance measure would give the wrong result, classifying the unknown
vector as ”class A” instead of ”class B” which would (probably) be the correct
result.
?
X1
class "B"
class "A"
X2
Figure 3.1: Two dimensions with diﬀerent scales
To cope with this problem, the diﬀerent scales of the dimensions of our feature
vectors have to be compensated when computing the distance. This can be
done by multiplying each contributing term with a scaling factor speciﬁc for
the respective dimension. This leads us to the so–called ”Weighted Euclidean
Distance”:
CHAPTER 3. DISTANCE MEASURES 22
d
2
Weighted Euclid
(x, p) =
DIM−1
¸
i=0
λ
i
· (x
i
−p
i
)
2
(3.4)
As before, this can be rewritten as:
d
2
Weighted Euclid
(x, p) = (x − p)
′
·
˜
Λ · (x − p) (3.5)
Where
˜
Λ denotes the diagonal matrix of all the scaling factors:
˜
Λ = diag
λ
0
. . .
λ
i
. . .
λ
DIM−1
¸
¸
¸
¸
¸
¸
The scaling factors are usually chosen to compensate the variances of the mea
sured features:
λ
i
=
1
σ
i
2
The variance of dimension i is computed from a training set of N vectors
{x
0
, x
1
, . . . , x
N−1
}. Let x
i,n
denote the ith element of vector x
n
, then the
variances σ
i
2
can be estimated from the training set as follows::
σ
i
2
=
1
N −1
N−1
¸
n=0
(x
i,n
−m
i
)
2
(3.6)
where m
i
is the mean value of the training set for dimension i:
m
i
=
1
N
N−1
¸
n=0
x
i,n
(3.7)
3.4 Mahalanobis Distance
So far, we can deal with diﬀerent scales of our features using the weighted
Euclidean distance measure. This works very well if there is no correlation
between the individual features as it would be the case if the features we selected
for our vector space were statistically independent from each other. What if
they are not? Figure 3.2 shows an example in which the features x
1
and x
2
are
correlated.
Obviously, for both classes A and B, a high value of x
1
correlates with a high
value for x
2
(with respect to the mean vector (center) of the class), which is
indicated by the orientation of the two ellipses. In this case, we would want
the distance measure to regard both the correlation and scale properties of the
features. Here a simple scale transformation as in (3.4) will not be suﬃcient.
Instead, the correlations between the individual components of the feature vec
tor will have to be regarded when computing the distance between two vectors.
This leads us to a new distance measure, the so–called Mahalanobis Distance:
CHAPTER 3. DISTANCE MEASURES 23
X1
class "B" class "A"
X2
?
Figure 3.2: Correlated features
d
Mahalanobis
x, p,
˜
C
= (x − p)
′
·
˜
C
−1
· (x − p) (3.8)
Where
˜
C
−1
denotes the inverse of the covariance matrix
˜
C:
˜
C =
σ
0,0
. . . σ
0,j
. . . σ
0,DIM−1
.
.
.
.
.
.
.
.
.
σ
i,0
. . . σ
i,j
. . . σ
i,DIM−1
.
.
.
.
.
.
.
.
.
σ
DIM−1,0
. . . σ
DIM−1,j
. . . σ
DIM−1,DIM−1
¸
¸
¸
¸
¸
¸
¸
¸
The individual components σ
i,j
represent the covariances between the compo
nents i and j. The covariance matrix can be estimated from a training set of N
vectors {x
0
, x
1
, . . . , x
N−1
} as follows:
˜
C =
1
N −1
N−1
¸
n=0
(x
n
− m) · (x
n
− m)
′
(3.9)
where m is the mean vector of the training set:
m =
1
N
N−1
¸
n=0
x
n
(3.10)
looking at a single element of the matrix
˜
C, say, σ
i,j
, this can also be written
as:
σ
i,j
=
1
N −1
N−1
¸
n=0
(x
i,n
−m
i
) · (x
j,n
−m
j
) (3.11)
CHAPTER 3. DISTANCE MEASURES 24
From (3.6) and (3.11) it is obvious that σ
i,i
, the values of the main diagonal of
˜
C, are identical to the variances σ
i
2
of the dimension i. Two properties of
˜
C can
be seen immediately form (3.11): First,
˜
C is symmetric and second, the values
σ
i,i
of its main diagonal are either positive or zero. A zero variance would mean
that for this speciﬁc dimension of the vector we chose a feature which does not
change its value at all).
3.4.1 Properties of the Covariance Matrix
How can we interpret the elements of the covariance matrix? For the σ
i,i
, we
already know that they represent the variances of the individual features. A
big variance for one dimension reﬂects the fact that the measured values of this
feature tend to be spread over a wide range of values with respect to the mean
value, while a variance close to zero would indicate that the values measured
are all nearly identical and therefore very close to the mean value. For the
covariances σ
i,j
, i = j, we will expect a positive value of σ
i,j
if a big value
for feature i can be frequently observed together with a big value for feature
j (again, with respect to the mean value) and observations with small values
for feature i also tend to show small values for feature j . In this case, for a
two–dimensional vector space example, the clusters look like the two clusters of
the classes A and B in Fig. 3.2, i.e., they are oriented from bottom–left to top–
right. On the other hand, if a cluster is oriented from top–left to bottom–right,
high values for feature i often are observed with low values for feature j and the
value for the covariance σ
i,j
would be negative (due to the symmetry of
˜
C, you
can swap the indices i and j in the description above).
3.4.2 Clusters With Individual and Shared Covariance Ma
trices
In our Fig. 3.2 the two classes A and B each consisted of one cluster of vectors
having its own center (mean vector, that is) and its own covariance matrix.
But in general, each class of a classiﬁcation task may consist of several clusters
which belong to that class, if the ”shape” of the class in our feature space is
more complex. As an example, think of the shape of the ”uncomfortable” class
of Fig. 2.1. In these cases, each of those clusters will have its own mean vector
(the ”prototype vector” representing the cluster) and its own covariance matrix.
However, in some applications, it is suﬃcient (or necessary, due to a lack of
training data) to estimate one covariance matrix out of all training vectors of
one class, regardless to which cluster of the class the vectors are assigned. The
resulting covariance matrix is then class speciﬁc but is shared among all clusters
of the class while the mean vectors are estimated individually for the clusters.
To go even further, it is also possible to estimate a single ”global” covariance
matrix which is shared among all classes (and their clusters). This covariance
matrix is computed out of the complete trainig set of all vectors regardless of
the cluster and class assignments.
CHAPTER 3. DISTANCE MEASURES 25
3.4.3 Using the Mahalanobis Distance for the Classiﬁca
tion Task
With the Mahalanobis distance, we have found a distance measure capable of
dealing with diﬀerent scales and correlations between the dimensions of our
feature space. In section 7.3, we will learn that it plays a signiﬁcant role when
dealing with the gaussian probability density function.
To use the Mahalanobis distance in our nearest neighbor classiﬁcation task, we
would not only have to ﬁnd the appropriate prototype vectors for the classes
(that is, ﬁnding the mean vectors of the clusters), but for each cluster we also
have to compute the covariance matrix from the set of vectors which belong to
the cluster. We will later learn how the ”k–means algorithm” helps us not only
in ﬁnding prototypes from a given training set, but also in ﬁnding the vectors
that belong to the clusters represented by these prototypes. Once the prototype
vectors and the corresponding (either shared or individual) covariance matrices
are computed, we can insert the Mahalanobis distance according to (3.8) into
the equation for the class distance (2.1) and get:
d
ωv
(x) = min
k
d
Mahalanobis
x, p
k,ωv
,
˜
C
k,ωv
¸
; k = 0, 1, . . . , (K
ωv
−1) (3.12)
Then we can perform the Nearest Neighbor classiﬁcation as in (2.2).
Chapter 4
Dynamic Time Warping
In the last chapter, we were dealing with the task of classifying single vectors
to a given set of classes which were represented by prototype vectors computed
from a set of training vectors. Several distance measures were presented, some
of them using additional sets of parameters (e.g., the covariance matrices) which
also had to be computed from a training vectors.
How does this relate to speech recognition?
As we saw in chapter 1, our speech signal is represented by a series of feature
vectors which are computed every 10ms. A whole word will comprise dozens
of those vectors, and we know that the number of vectors (the duration) of a
word will depend on how fast a person is speaking. Therefore, our classiﬁcation
task is diﬀerent from what we have learned before. In speech recognition, we
have to classify not only single vectors, but sequences of vectors. Lets assume
we would want to recognize a few command words or digits. For an utter
ance of a word w which is T
X
vectors long, we will get a sequence of vectors
˜
X = {x
0
, x
1
, . . . , x
TX−1
} from the acoustic preprocessing stage. What we need
here is a way to compute a ”distance” between this unknown sequence of vectors
˜
X and known sequences of vectors
˜
W
k
=
w
k0
, w
k1
, . . . , w
kTW
k
¸
which are pro
totypes for the words we want to recognize. Let our vocabulary (here: the set of
classes Ω) contain V diﬀerent words w
0
, w
1
, . . . w
V −1
. In analogy to the Nearest
Neighbor classiﬁcation task from chapter 2, we will allow a word w
v
(here: class
ω
v
∈ Ω) to be represented by a set of prototypes
˜
W
k,ωv
, k = 0, 1, . . . , (K
ωv
−1) to
reﬂect all the variations possible due to diﬀerent pronunciation or even diﬀerent
speakers.
4.1 Deﬁnition of the Classiﬁcation Task
If we have a suitable distance measure D(
˜
X,
˜
W
k,ωv
) between
˜
X and a prototype
˜
W
k,ωv
, we can deﬁne the class distance D
ωv
(
˜
X) in analogy to (2.2) as follows:
D
ωv
(
˜
X) = min
k
D
˜
X,
˜
W
k,ωv
¸
; k = 0, 1, . . . , (K
ωv
−1) (4.1)
where ω
v
is the class of utterances whose orthographic representation is given
by the word w
v
Then we can write our classiﬁcation task as:
26
CHAPTER 4. DYNAMIC TIME WARPING 27
˜
X ∈ ω
v
⇔v = arg min
v
D
ωv
˜
X
¸
(4.2)
This deﬁnition implies that the vocabulary of our speech recognizer is limited
to the set of classes Ω =
¸
ω
0
, ω
1
, . . . , ω
(V −1)
¸
. In other words, we can only
recognize V words, and our classiﬁer will match any speech utterance to one of
those V words in the vocabulary, even if the utterance was intended to mean
completely diﬀerent. The only criterium the classiﬁer applies for its choice is the
acoustic similarity between the utterance
˜
X and the known prototypes
˜
W
k,ωv
as deﬁned by the distance measure D(
˜
X,
˜
W
k,ωv
).
4.2 Distance Between Two Sequences of Vectors
As we saw before, classiﬁcation of a spoken utterance would be easy if we had
a good distance measure D(
˜
X,
˜
W) at hand (in the following, we will skip the
additional indices for ease of notation). What should the properties of this
distance measure be ? The distance measure we need must:
• Measure the distance between two sequences of vectors of diﬀerent length
(T
X
and T
W
, that is)
• While computing the distance, ﬁnd an optimal assignment between the
individual feature vectors of
˜
X and
˜
W
• Compute a total distance out of the sum of distances between individual
pairs of feature vectors of
˜
X and
˜
W
4.2.1 Comparing Sequences With Diﬀerent Lengths
The main problem is to ﬁnd the optimal assignment between the individual
vectors of
˜
X and
˜
W. In Fig. 4.1 we can see two sequences
˜
X and
˜
W which
consist of six and eight vectors, respectively. The sequence
˜
W was rotated by
90 degrees, so the time index for this sequence runs from the bottom of the
sequence to its top. The two sequences span a grid of possible assignments
between the vectors. Each path through this grid (as the path shown in the
ﬁgure) represents one possible assignment of the vector pairs. For example, the
ﬁrst vector of
˜
X is assigned the ﬁrst vector of
˜
W, the second vector of
˜
X is
assigned to the second vector of
˜
W, and so on.
Fig. 4.1 shows as an example the following path P given by the sequence of
time index pairs of the vector sequences (or the grid point indices, respectively):
P =
¸
g
1
, g
2
, . . . , g
Lp
¸
= {(0, 0) , (1, 1) , (2, 2) , (3, 2) , (4, 2) , (5, 3) , (6, 4) , (7, 4)}
(4.3)
The length L
P
of path P is determined by the maximum of the number of
vectors contained in
˜
X and
˜
W. The assignment between the time indices of
˜
W
and
˜
X as given by P can be interpreted as ”time warping” between the time
axes of
˜
W and
˜
X. In our example, the vectors x
2
, x
3
and x
4
were all assigned to
w
2
, thus warping the duration of w
2
so that it lasts three time indices instead of
one. By this kind of time warping, the diﬀerent lengths of the vector sequences
can be compensated.
CHAPTER 4. DYNAMIC TIME WARPING 28
w0
w1
w2
w3
w4
w5
x0 x1 x2 x3 x4 x5 x6 x7
Figure 4.1: Possible assignment between the vector pairs of
˜
X and
˜
W
For the given path P, the distance measure between the vector sequences can
now be computed as the sum of the distances between the individual vectors.
Let l denote the sequence index of the grid points. Let d(g
l
) denote the vector
distance d (x
i
, w
j
) for the time indices i and j deﬁned by the grid point g
l
= (i, j)
(the distance d (x
i
, w
j
) could be the Euclidean distance from section 3.1 ). Then
the overall distance can be computed as:
D
˜
X,
˜
W, P,
=
LP
¸
l=1
d (g
l
) (4.4)
4.2.2 Finding the Optimal Path
As we stated above, once we have the path, computing the distance D is a
simple task. How do we ﬁnd it ?
The criterium of optimality we want to use in searching the optimal path P
opt
should be to minimize D
˜
X,
˜
W, P
:
P
opt
= arg min
P
D
˜
X,
˜
W, P
¸
(4.5)
Fortunately, it is not necessary to compute all possible paths P and correspond
ing distances D
˜
X,
˜
W, P
to ﬁnd the optimum.
Out of the huge number of theoretically possible paths, only a fraction is rea
sonable for our purposes. We know that both sequences of vectors represent
feature vectors measured in short time intervals. Therefore, we might want to
restrict the time warping to reasonable boundaries: The ﬁrst vectors of
˜
X and
CHAPTER 4. DYNAMIC TIME WARPING 29
˜
W should be assigned to each other as well as their last vectors. For the time
indices in between, we want to avoid any giant leap backward or forward in
time, but want to restrict the time warping just to the ”reuse” of the preceding
vector(s) to locally warp the duration of a short segment of speech signal. With
these restrictions, we can draw a diagram of possible ”local” path alternatives
for one grid point and its possible predecessors (of course, many other local path
diagrams are possible):
w
t
x
t
(i1,j)
(i1,j1)
(i,j1) (i,j)
Figure 4.2: Local path alternatives for a grid point
Note that Fig. 4.2 does not show the possible extensions of the path from a
given point but the possible predecessor paths for a given grid point. We will
soon get more familiar with this way of thinking. As we can see, a grid point
(i, j) can have the following predecessors:
• (i − 1, j) : keep the time index j of
˜
X while the time index of
˜
W is
incremented
• (i −1, j −1) : both time indices of
˜
X and
˜
W are incremented
• (i, j − 1) : keep of the time index i of
˜
W while the time index of
˜
X is
incremented
All possible paths P which we will consider as possible candidates for being
the optimal path P
opt
can be constructed as a concatenation of the local path
alternatives as described above. To reach a given grid point (i, j) from (i−1, j −
1) , the diagonal transition involves only the single vector distance at grid point
(i, j) as opposed to using the vertical or horizontal transition, where also the
distances for the grid points (i − 1, j) or (i, j − 1) would have to be added. To
compensate this eﬀect, the local distance d( w
i
, x
j
) is added twice when using
the diagonal transition.
Bellman’s Principle
Now that we have deﬁned the local path alternatives, we will use Bellman’s
Principle to search the optimal path P
opt
.
Applied to our problem, Bellman’s Principle states the following:
CHAPTER 4. DYNAMIC TIME WARPING 30
If P
opt
is the optimal path through the matrix of grid points beginning
at (0, 0) and ending at (T
W
−1, T
X
−1), and the grid point (i, j) is part
of path P
opt
, then the partial path from (0, 0) to (i, j) is also part of
P
opt
.
From that, we can construct a way of iteratively ﬁnding our optimal path P
opt
:
According to the local path alternatives diagram we chose, there are only three
possible predecessor paths leading to a grid point (i, j): The partial paths from
(0, 0) to the grid points (i −1, j), (i −1, j −1) and (i, j −1) ).
Let’s assume we would know the optimal paths (and therefore the accumulated
distance δ(.) along that paths) leading from (0, 0) to these grid points. All these
path hypotheses are possible predecessor paths for the optimal path leading from
(0, 0) to (i, j).
Then we can ﬁnd the (globally) optimal path from (0, 0) to grid point (i, j) by
selecting exactly the one path hypothesis among our alternatives which mini
mizes the accumulated distance δ(i, j) of the resulting path from (0, 0) to (i, j).
The optimization we have to perform is as follows:
δ(i, j) = min
δ(i, j −1) + d( w
i
, x
j
)
δ(i −1, j −1) + 2 · d( w
i
, x
j
)
δ(i −1, j) + d( w
i
, x
j
)
(4.6)
By this optimization, it is ensured that we reach the grid point (i, j) via the
optimal path beginning from (0, 0) and that therefore the accumulated distance
δ(i, j) is the minimum among all possible paths from (0, 0) to (i, j).
Since the decision for the best predecessor path hypothesis reduces the number
of paths leading to grid point (i, j) to exactly one, it is also said that the possible
path hypotheses are recombined during the optimization step.
Applying this rule, we have found a way to iteratively compute the optimal path
from point (0, 0) to point (T
W
−1, T
X
−1), starting with point (0, 0):
• Initialization: For the grid point (0, 0), the computation of the optimal
path is simple: It contains only grid point (0, 0) and the accumulated
distance along that path is simply d( w
0
, x
0
).
• Iteration: Now we have to expand the computation of the partial paths
to all grid points until we reach (T
W
−1, T
X
−1): According to the local
path alternatives, we can now only compute two points from grid point
(0, 0): The upper point (1, 0), and the right point (0, 1). Optimization of
(4.6) is easy in these cases, since there is only one valid predecessor: The
start point (0.0). So we add δ(0, 0) to the vector distances deﬁned by the
grid points (1, 0) and (0, 1) to compute δ(1, 0) and δ(0, 1). Now we look at
the points which can be computed from the three points we just ﬁnished.
For each of these points (i, j), we search the optimal predecessor point
out of the set of possible predecessors (Of course, for the leftmost column
(i, 0) and the bottom row (0, j) the recombination of path hypotheses is
always trivial). That way we walk trough the matrix from bottomleft to
topright.
• Termination: D(
˜
W,
˜
X) = δ(T
W
− 1, T
X
− 1) is the distance between
˜
W
and
˜
X.
CHAPTER 4. DYNAMIC TIME WARPING 31
Fig. 4.3 shows the iteration through the matrix beginning with the start point
(0, 0). Filled points are already computed, empty points are not. The dotted
arrows indicate the possible path hypotheses over which the optimization (4.6)
has to be performed. The solid lines show the resulting partial paths after the
decision for one of the path hypotheses during the optimization step. Once we
reached the top–right corner of our matrix, the accumulated distance δ(T
W
−
1, T
X
−1) is the distance D(
˜
W,
˜
X) between the vector sequences. If we are also
interested in obtaining not only the distance D(
˜
W,
˜
X), but also the optimal path
P, we have — in addition to the accumulated distances — also to keep track of
all the decisions we make during the optimization steps (4.6). The optimal path
is known only after the termination of the algorithm, when we have made the
last recombination for the three possible path hypotheses leading to the top–
right grid point (T
W
−1, T
X
−1). Once this decision is made, the optimal path
can be found by reversely following all the local decisions down to the origin
(0, 0). This procedure is called backtracking.
4.2.3 Dynamic Programming / Dynamic Time Warping
The procedure we deﬁned to compare two sequences of vectors is also known
as Dynamic Programming (DP) or as Dynamic Time Warping (DTW). We will
use it extensively and will add some modiﬁcations and reﬁnements during the
next chapters. In addition to the verbal description in the section above, we
will now deﬁne a more formal framework and will add some hints to possible
realizations of the algorithm.
The Dynamic Programming Algorithm
In the following formal framework we will iterate through the matrix column by
column, starting with the leftmost column and beginning each column at the
bottom and continuing to the top.
For ease of notation, we deﬁne d(i, j) to be the distance d( w
i
, x
j
) between the
two vectors w
i
and x
i
.
Since we are iterating through the matrix from left to right, and the optimiza
tion for column j according to (4.6) uses only the accumulated distances from
columns j and j −1, we will use only two arrays δ
j
(i) and δ
j−1
(i) to hold the val
ues for those two columns instead of using a whole matrix for the accumulated
distances δ(i, j):
Let δ
j
(i) be the accumulated distance δ(i, j) at grid point (i, j) and δ
j−1
(i) the
accumulated distance δ(i, j −1) at grid point (i, j −1).
It should be mentioned that it possible to use a single array for time indices j
and j − 1. One can overwrite the old values of the array with the new ones.
However, for clarity, the algorithm using two arrays is described here and the
formulation for a single–array algorithm is left to the reader.
To keep track of all the selections among the path hypotheses during the opti
mization, we have to store each path alternative chosen for every grid point. We
could for every grid point (i, j) either store the indices k and l of the predecessor
point (k, l) or we could only store a code number for one of the three path al
ternatives (horizontal, diagonal and vertical path) and compute the predecessor
point (k, l) out of the code and the current point (i, j). For ease of notation,
CHAPTER 4. DYNAMIC TIME WARPING 32
we assume in the following that the indices of the predecessor grid point will be
stored:
Let ψ(i, j) be the predecessor grid point (k, l) chosen during the optimization
step at grid point (i, j).
With these deﬁnitions, the dynamic programming (DP)algorithm can be written
as follows:
◦ Initialization:
◦ Grid point (0, 0):
ψ(0, 0) = (−1, −1)
δ
j
(0) = d(0, 0)
(4.7)
◦ Initialize ﬁrst column (only vertical path possible):
for i = 1 to T
W
−1
{
δ
j
(i) = d(i, 0) +δ
j
(i −1)
ψ(i, 0) = (i −1, 0)
(4.8)
}
◦ Iteration:
compute all columns:
for j = 1 to T
X
−1
{
◦ swap arrays δ
j−1
(.) and δ
j
(.)
◦ ﬁrst point (i = 0, only horizontal path possible):
δ
j
(0) = d(0, j) +δ
j−1
(0)
ψ(0, j) = (0, j −1)
(4.9)
◦ compute column j:
for i = 1 to T
W
−1
{
◦ optimization step:
δj(i) = min
δj−1(i) + d(i, j)
δj−1(i − 1) + 2 · d(i, j)
δj(i − 1) + d(i, j)
(4.10)
◦ tracking of path decisions:
ψ(i, j) = arg min
(k,l)∈
(i, j − 1),
(i − 1, j − 1),
(i − 1, j)
¸
δj−1(i) + d(i, j)
δj−1(i − 1) + 2 · d(i, j)
δj(i − 1) + d(i, j)
(4.11)
}
CHAPTER 4. DYNAMIC TIME WARPING 33
}
◦ Termination:
D(T
W
−1, T
X
−1) = δ
j
(T
W
−1, T
X
−1) (4.12)
◦ Backtracking:
◦ Initialization:
i = T
W
−1, j = T
X
−1 (4.13)
◦ while ψ(i, j) = −1
{
◦ get predecessor point:
i, j = ψ(i, j) (4.14)
}
Additional Constraints
Note that the algorithm above will ﬁnd the optimum path by considering all
path hypotheses through the matrix of grid points. However, there will be paths
regarded during computation which are not very likely to be the optimum path
due to the extreme time warping they involve. For example, think of the borders
of the grid: There is a path going horizontally from (0, 0) until (T
W
− 1, 0)
and then running vertically to (T
W
− 1, T
X
− 1). Paths running more or less
diagonally trough the matrix will be much more likely candidates for being the
optimal path. To avoid unnecessary computational eﬀorts, the DP algorithm
is therefore usually restricted to searching only in a certain region around the
diagonal path leading from (0, 0) to (T
W
−1, T
X
−1). Of course, this approach
takes the (small) risk of missing the globally optimal path and instead ﬁnding
only the best path within the deﬁned region. Later we will learn far more
sophisticated methods to limit the so–called search space of the DP algorithm.
CHAPTER 4. DYNAMIC TIME WARPING 34
t
x
t
w
t
x
t
w
w
t
x
t
t
x
t
w
w
t
x
t
w
t
x
t
w
t
x
t t
x
t
w
t
x
t
w w
t
x
t t
x
t
w
w
t
x
t t
x
t
w w
t
x
t
w
t
x
t
t
w
t
x
t
x
t
w
t
x
t
w w
t
x
t t
x
t
w
Figure 4.3: Iteration steps ﬁnding the optimal path
Chapter 5
Recognition of Isolated
Words
In the last chapter we saw how we can compute the distance between two
sequences of vectors to perform the classiﬁcation task deﬁned by (4.2). The
classiﬁcation was performed by computing the distance between an unknown
vector sequence to all prototypes of all classes and then assigning the unknown
sequence to the class having the smallest class distance.
Now we will reformulate the classiﬁcation task so that it will depend directly
on the optimum path chosen by the dynamic programming algorithm and not
by the class distances.
While the description of the DTW classiﬁcation algorithm in chapter 4 might let
us think that one would compute all the distances sequentially and then select
the minimum distance, it is more useful in practical applications to compute all
the distances between the unknown vector sequence and the class prototypes
in parallel. This is possible since the DTW algorithm needs only the values for
time index t and (t−1) and therefore there is no need to wait until the utterance
of the unknown vector sequence is completed. Instead, one can start with the
recognition process immediately as soon as the utterance begins (we will not
deal with the question of how to recognize the start and end of an utterance
here).
To do so, we have to reorganize our search space a little bit. First, lets assume
the total number of all prototypes over all classes is given by M. If we want
to compute the distances to all M prototypes simultaneously, we have to keep
track of the accumulated distances between the unknown vector sequence and
the prototype sequences individually. Hence, instead of the column (or two
columns, depending on the implementation) we used to hold the accumulated
distance values for all grid points, we now have to provide M columns during
the DTW procedure.
Now we introduce an additional ”virtual” grid point together with a specialized
local path alternative for this point: The possible predecessors for this point are
deﬁned to be the upper–right grid points of the individual grid matrices of the
prototypes. In other words, the virtual grid point can only be reached from the
end of each prototype word, and among all the possible prototype words, the one
with the smallest accumulated distance is chosen. By introducing this virtual
35
CHAPTER 5. RECOGNITION OF ISOLATED WORDS 36
grid point, the classiﬁcation task itself (selecting the class with the smallest class
distance) is integrated into the framework of ﬁnding the optimal path.
Now all we have to do is to run the DTW algorithm for each time index j and
along all columns of all prototype sequences. At the last time slot (T
W
−1) we
perform the optimization step for the virtual grid point, i.e, the predecessor grid
point to the virtual grid point is chosen to be the prototype word having the
smallest accumulated distance. Note that the search space we have to consider
is spanned by the length of the unknown vector sequence on one hand and the
sum of the length of all prototype sequences of all classes on the other hand.
Figure 5.1 shows the individual grids for the prototypes (only three are shown
here) and the selected optimal path to the virtual grid point. The backtracking
procedure can of course be restricted to keeping track of the ﬁnal optimization
step when the best predecessor for the virtual grid point is chosen. The clas
siﬁcation task is then performed by assigning the unknown vector sequence to
the very class to which the prototype belongs whose word end grid point was
chosen.
WM
W1
W0
X
MIN
Figure 5.1: Classiﬁcation task redeﬁned as ﬁnding the optimal path among all
prototype words
CHAPTER 5. RECOGNITION OF ISOLATED WORDS 37
Of course, this is just a diﬀerent (and quite complicated) deﬁnition of how we can
perform the DTW classiﬁcation task we already deﬁned in (4.2). Therefore, only
a verbal description was given and we did not bother with a formal description.
However, by the reformulation of the DTW classiﬁcation we learned a few things:
• The DTW algorithm can be used for real–time computation of the dis
tances
• The classiﬁcation task has been integrated into the search for the optimal
path
• Instead of the accumulated distance, now the optimal path itself is impor
tant for the classiﬁcation task
In the next chapter, we will see how the search algorithm described above can
be extended to the recognition of sequences of words.
Chapter 6
Recognition of Connected
Words
In the last chapter, we learned how to recognize isolated words, e.g., simple com
mand words like ”start”, ”stop”, ”yes’, ”no”. The classiﬁcation was performed
by searching the optimal path through the search space, which was spanned
by the individual prototype vector sequences on one hand and by the unknown
vector sequence on the other hand.
What if we want to recognize sequences of words, like a credit card number or
telephone number? In this case, the classiﬁcation task can be divided into two
diﬀerent subtasks: The segmentation of the utterance, i.e., ﬁnding the bound
aries between the words and the classiﬁcation of the individual words within
their boundaries. As one can imagine, the number of possible combinations of
words of a given vocabulary together with the fact that each word may widely
vary in its length provides for a huge number of combinations to be considered
during the classiﬁcation. Fortunately, there is a solution for the problem, which
is able to ﬁnd the word boundaries and to classify the words within the bound
aries in one single step. Best of all, you already know the algorithm, because it
is the Dynamic Programming algorithm again ! In the following, ﬁrst a verbal
description of what modiﬁcations are necessary to apply the DP algorithm on
the problem of connected word recognition is given. Then a formal description
of the algorithm is presented.
6.1 The Search Space for Connected Word Recog
nition
If we want to apply the DP algorithm to our new task, we will ﬁrst have to
deﬁne the search space for our new task. For simplicity, let’s assume that each
word of our vocabulary is represented by only one prototype, which will make
the notation a bit easier for us. An unknown utterance of a given length T
X
will contain several words out of our vocabulary, but we do not know which
words these are and at what time index they begin or end. Therefore, we will
have to match the utterance against all prototypes and we will have to test for
38
CHAPTER 6. RECOGNITION OF CONNECTED WORDS 39
all possible word ends and word beginnings. The optimal path we want to ﬁnd
with our DP algorithm must therefore be allowed to run through a search space
which is spanned by the set of all vectors of all prototypes in one dimension and
by the vector sequence of the unknown utterance in the other dimension, as is
shown in Fig. 6.1.
1
X
j = T
alternatives
modified path
node
virtual grid
j = 0
v = 0
v = V
v
X
W
V
=
"
n
i
n
e
"
W
v
=
"
f
o
u
r
"
W
0
=
"
z
e
r
o
"
X = "zero four zero nine zero"
Figure 6.1: DTW algorithm for connected word recognition
6.2 Modiﬁcation of the Local Path Alternatives
Remember that in our DP algorithm in section 4 we were dealing with the local
path alternatives each grid could select as possible predecessors. The alterna
tives were including grid points one time index in the past (the ”horizontal”
and ”diagonal” transitions) and in the present (the ”vertical” transition). This
allowed us to implement the DP algorithm in real–time, column per column.
Now we have to modify our deﬁnition of local path alternatives for the ﬁrst
vector of each prototype: The ﬁrst vector of a prototype denotes the beginning
of the word. Remember that in the case of isolated word recognition, the grid
point corresponding to that vector was initialized in the ﬁrst column with the
distance between the ﬁrst vector of the prototype and the ﬁrst vector of the
utterance. During the DP, i.e., while we were moving from column to column,
this grid point had only one predecessor, it could only be preceded by the same
grid point in the preceding column (the ”horizontal” transition), indicating that
the ﬁrst vector of the prototype sequence was warped in time to match the next
vector of the unknown utterance.
For connected word recognition however, the grid point corresponding to the
ﬁrst vector of a prototype has more path alternatives: It may either select the
same grid point in the preceding column (this is the ”horizontal” transition,
which we already know), or it may select every last grid point (word end, that
CHAPTER 6. RECOGNITION OF CONNECTED WORDS 40
is) of all prototypes (including the prototype itself ) of the preceding column as
a possible predecessor for between–word path recombination. In this case, the
prototype word under consideration is assumed to start at the current time
index and the word whose word end was chosen is assumed to end at the time
index before.
Fig. 6.2 shows the local path alternatives for the ﬁrst grid point of the column
associated with a prototype.
w
0
w
T
X
of first grid point
Local path alternatives
v
Figure 6.2: Local path alternatives of the ﬁrst grid point of a column
By allowing these additional path alternatives, we allow the optimal path to run
through several words in our search space, where a word boundary can be fond
at those times at which the path jumps from the last grid point of a prototype
column to the ﬁrst grid point of a prototype column.
As in chapter 5, an additional ”virtual grid point” is used to represent the
selection of the best word end at the end of the utterance.
If we now use the DP algorithm again to ﬁnd the optimal path through our
search space, it will automatically match all words at all boundaries. If we then
CHAPTER 6. RECOGNITION OF CONNECTED WORDS 41
use the backtracking to determine the optimal path, this path will represent the
sequence of prototype words which best matches the given utterance !
6.3 Backtracking
To allow proper backtracking, we do not need to keep track of all the path
alternatives selected within the prototype sequences, but we only need to keep
track of the predecessor grid points selected during the between–word path
recombinations. So for backtracking, we need backtracking memory to hold for
each time index j the following values:
• For each time index j the best word end w(j) at time j. Any word starting
at time (j + 1) will select this best word end at time t as its predecessor.
• For each time index j the start time t
s
(j) of the word end w(j), i.e., the
time index at which the path ending at time index j in word w(j) has
entered the word.
Using these values, we can go backwards in time to ﬁnd all the words and word
boundaries deﬁned by the optimal path.
Note that there is only one best word end for each time index j which all words
starting at time index (j + 1) will select as a predecessor word. Therefore, it is
also suﬃcient to keep track of the start time t
s
(j) for this best word end. The
best word end at the end of the utterance (time index (T
X
−1)) is then used as
the starting point for the backtracking procedure.
Figure 6.1 shows the search space, the virtual grid point, the modiﬁed path
alternatives at the ﬁrst grid points of the prototypes and a possible path through
the search space, resulting in a corresponding sequence of words.
6.4 Formal Description
We now will deduce a formal description of the connected word recognition
algorithm.
6.4.1 Dimensions of the Search Space
Let V be the size of our vocabulary. For simplicity, we will assume that each
word w
v
of our vocabulary is represented by a single prototype vector sequence
˜
W
v
only. The use of multiple prototypes per word can easily be implemented
by mapping the sequence of prototypes found by the DP algorithm to the ap
propriate sequence of words.
Let T
X
be the length of the unknown utterance
˜
X and let T
v
, v = 0, 1, . . . , (V −1)
be the length of the prototype sequence of prototype
˜
W
v
.
Using these deﬁnitions, our search space is spanned by the length of the utter
ance T
X
in one dimension and the sum of the length of the prototype vector
sequences in the other dimension. Instead of referring to a single time index for
all the vectors of all the prototypes, we will use the prototype index v and the
time index i within a given prototype separately:
Let v be the index of the prototype sequence, v = 0, 1, . . . (V − 1). This will
divide our search space into V separate ”partitions” P
v
, each representing the
CHAPTER 6. RECOGNITION OF CONNECTED WORDS 42
grid points corresponding to a prototype
˜
W
v
of a word w
v
. These partitions are
visualized in Fig. 6.1. Let i be the time index within a given partition P
v
, with
i = 0, 1, . . . , (T
v
− 1). Let j be the time index of the unknown utterance with
j = 0, 1, . . . , (T
X
−1).
Given these deﬁnitions, a grid point in our search space can be identiﬁed by
three indices, (i, v, j). We denote the local distance between a vector x
j
of the
utterance
˜
X and a prototype vector w
v,i
of prototype
˜
W
v
by d( w
v,i
, x
j
) and the
accumulated distance for grid point (i, v, j) we denote by δ(i, v, j).
6.4.2 Local Path Recombination
Within–Word Transition
For the grid points within a partition P
v
, we use the same local path recombi
nation as in (4.6):
δ(i, v, j) = min
δ(i, v, j −1) + d( w
v,i
, x
j
)
δ(i −1, v, j −1) + 2 · d( w
v,i
, x
j
)
δ(i −1, v, j) + d( w
v,i
, x
j
)
(6.1)
This equation holds for all grid points within the partition except for the ﬁrst
grid point, i.e., for all grid points (i, v, j) with partition time index i = 1, . . . , (T
v
−
1), partition index v = 0, 1, . . . (V −1) and utterance time index j = 1, 2, . . . , (T
X
−
1). Please note that – as in section 4 – , there is no predecessor for the ﬁrst
time index j = 0 of the utterance, so we have to limit the path alternatives
for the grid points (i, v, 0) to their only predecessor grid point (i − 1, v, 0), the
”vertical” transition.
Between–Word Transitions
In addition, we have to deﬁne the local path alternatives for the between–word
transitions, which are possible at the ﬁrst grid point of the column of each
partition, i.e., all grid points (0, v, j). In this case, it is allowed to select either
the ”horizontal” transition from grid point (0, v, j −1) to grid point (0, v, j), or
to select any word end, i.e., the last grid point of any partition, (T
v
− 1, v, 1).
From all these alternatives, the best predecessor is selected. To simplify the
notation, the optimization over all partitions (word ends) is carried out as a
separate step for a given time index j −1, resulting in γ(j −1), the accumulated
distance of the best word end hypothesis at time index (j −1):
γ(j −1) = min
v=0,1,...,V −1
{δ(T
v
−1, v, 1) (6.2)
This step recombines all path hypotheses containing a word end for the time
index (j − 1) (between–word path recombination). The optimization step for
the ﬁrst grid point (0, v, j) of each partition can now be written as:
δ(0, v, j) = min
δ(0, v, j −1) + d( w
v,i
, x
j
)
γ(j −1) + d( w
v,i
, x
j
)
(6.3)
This optimization step will be processed for all partition indices v = 0, 1, . . . (V −
1) and utterance time index j = 1, 2, . . . , (T
X
−1). Again, for time index j = 0,
there is no predecessor since the utterance starts at j = 0. Therefore, we need
CHAPTER 6. RECOGNITION OF CONNECTED WORDS 43
to initialize all the grid points (0, v, 0) (i.e., all the words starting at j = 0) with
the local vector distances d( w
v,0
, x
0
).
6.4.3 Termination
After we have processed all time indices j = 0, 1, . . . , (T
X
− 1), we have for
each partition P
v
a hypothesis for a word end at time j = (T
X
− 1). These V
path hypotheses are now recombined and – as for all other times j – the best
accumulated distance is entered into γ(T
X
− 1). This step performs the ﬁnal
recombination of paths in the virtual grid point of Fig. 6.1, its value representing
the accumulated distance for the globally optimal path.
6.4.4 Backtracking
During the backtracking operation, we have to run backwards in time from the
end of the utterance to its beginning, and we have to follow all word boundaries
along the optimal path. This results in a list of word indices, representing the
sequence of words (in reversed order) contained in our vocabulary which best
matches the spoken utterance.
Keeping Track of the Word Boundaries
To follow the path from the end of the utterance to its beginning, we need
to track for each time index j which best word end was chosen during the
optimization (6.2). We will hold this information in the array w(j) :
w(j) = arg min
v=0,1,...,V −1
{δ((T
v
−1), v, j) (6.4)
For the best word end at every time index j, we also need to know the start time
of that word. For the backtracking procedure, this enables us to run backwards
in time through the array of best word ends:
Beginning with time index j = (T
X
−1) (the last time index of the utterance),
we look up the best word w(j) and its start time
ˆ
j = t
s
(j), which leads us to
its predecessor word w(
ˆ
j −1) which had to end at time index (
ˆ
j −1). Using the
start time of that word, we ﬁnd its predecessor word again and so on.
So, for the best word end w(j) at time index j we need to know at what time in
dex the path ending in that word has entered the ﬁrst grid point of the partition
with index v = w(j). We will hold this information in an array t
s
(j).
It is important to understand that this information has to be forwarded during
the within–word path recombination (6.1) and also during the between–word
path recombination (6.3): Whenever we are selecting a path alternative (k, v, l)
as a predecessor for a given grid point (i, v, j), we not only have to add its
accumulated distance, but we also have to keep track of the point in time when
that selected path alternative entered the partition P
v
.
Therefore, in addition to the accumulated distance δ(i, v, j) we need another
variable ξ(i, v, j) to hold the start time for each grid point (i, v, j).
Extensions for Backtracking
Now we are able to extend all the recombination steps by the bookkeeping
needed for backtracking:
CHAPTER 6. RECOGNITION OF CONNECTED WORDS 44
First we start with the last grid points (T
v
− 1, j − 1, v) of all partitions v:
For these grid points, which represent the word ends, we have to perform the
optimization over all possible word ends, i.e., we have to ﬁnd the best word end
at time j −1 and store its its accumulated distance into γ(j −1), its partition
index into w(j −1) and the corresponding word start time into t
s
(j −1). Note
that the word end optimization is done for the ”old values”, i.e., for time index
j − 1, since the between–word path recombination will use γ(j − 1) for ﬁnding
the best predecessor word at time j.
• Find best word end:
v
opt
= arg min
v=0,1,...,(V −1)
{δ(T
v
−1, v, j −1)} . (6.5)
• Store accumulated distance of the best word end:
γ(j −1) = δ(T
vopt
−1, v
opt
, j −1) (6.6)
• Store the word index for backtracking:
w(j −1) = v
opt
(6.7)
• Store word start time for backtracking:
t
s
(j −1) = ξ(T
vopt
−1, v
opt
, j −1) (6.8)
Now we will extend the within–word optimization step to keep track of the start
times within the partitions:
• Within–word path recombination for partition v:
δ(i, v, j) = min
δ(i, v, j −1) + d( w
v,i
, x
j
)
δ(i −1, v, j −1) + 2 · d( w
v,i
, x
j
)
δ(i −1, v, j) + d( w
v,i
, x
j
)
(6.9)
• Within–word backtracking variable ξ(i, v, j) for partition v:
– Get predecessor grid point (k, v, l):
(k, v, l) = arg min
k,v,l∈
(i, v, j −1),
(i −1, v, j −1),
(i −1, v, j)
δ(i, v, j − 1) + d( w
v,i
, x
j
)
δ(i −1, v, j − 1) + 2 · d( w
v,i
, x
j
)
δ(i −1, v, j) + d( w
v,i
, x
j
)
(6.10)
– Forward the start time of the predecessor:
ξ(i, v, j) = ξ(k, v, l) (6.11)
For the ﬁrst grid point (0, v, j), there are two possible cases: Either the ”hori
zontal” transition is selected, in which case we simply have to forward the start
time ξ(0, v, j − 1), which comes along with that path hypothesis, or the best
word end at time j − 1 is selected. In that case, we have just (re)started the
word v at time j and we have to enter the start time j into ξ(i, v, j). Now we
are able to extend the betweenword path recombination in analogy with (6.3):
CHAPTER 6. RECOGNITION OF CONNECTED WORDS 45
• Path recombination at the ﬁrst grid point:
δ(0, v, j) = min
δ(0, v, j −1) + d( w
v,i
, x
j
)
γ(j −1) + d( w
v,i
, x
j
)
(6.12)
• Forward start time in variable ξ(0, v, j):
ξ(0, v, j) =
ξ(0, v, j −1) , if δ(0, v, j −1) ≤ γ(j −1)
j , else
(6.13)
6.5 Implementation issues
As explained in section 4, the fact that all optimization steps need only the
columns for the last time index helps us in saving computer memory. Only an
array δ
j
(i, v) is needed to hold the accumulated distances δ(i, v, j). The start
time indices forwarded during recombination will also be held in an array ξ
j
(i, v).
For backtracking, only the two arrays w(j) and t
s
(j) are needed. So we only
have to deal with a few arrays (and several ”partition–local ” variables) instead
of explicitly holding in memory the information for all grid points (i, v, j) of the
huge matrix spanned by our search space.
Of course, all the additional constraints of the search space as mentioned in
section 4.2.3 can be applied here, too.
With the deﬁnitions of the equations for path recombination and backtracking
given above, it is easy to write down the complete algorithm in analogy to the
one given in section 4. This is left as an exercise to the reader.
6.6 Summary
What have we learned in this chapter?
We have extended the dynamic programming algorithm to deal with sequence
of words instead of isolated words. This was achieved by reformulating the
classiﬁcation task in such a way that it could be described as ﬁnding the optimal
path through a matrix of grid points together with the local schemes for within–
word and between–word path recombination. Since the recognition process was
reduced to ﬁnding the optimal path through the search space, we had to store
the information needed for backtracking while processing the columns of the
search space. We found that we only needed to store the information about the
best word end optimization steps during the between–word path recombinations,
which we can hold in two additional arrays.
Speech recognition systems based on the DP algorithm using prototype vector
sequences to match against the unknown utterance (also known as dynamic
pattern matching) were widely used some years ago, and were even extended by
a ﬁnite state syntax (for a detailed description see e.g., [Ney84]).
Of course, there are many details to work out to implement a working speech
recognition system based on the DP algorithm as described above. Some of the
open issues not dealt with in the sections above are:
• How do we detect the start and the end of the utterance properly?
• How many prototype vector sequences
˜
W
v
should we use to represent a
word of the vocabulary?
CHAPTER 6. RECOGNITION OF CONNECTED WORDS 46
• How do we ﬁnd appropriate prototypes for the words ?
• For a speaker–independent system the number of prototypes will have to
be very large to cover all the variabilities involved with speaker–independent
systems.
• The more prototypes we use, the larger our search space will be, consuming
more memory and more computation time. How can we implement a
speaker–independent system dealing with large vocabularies?
As we will see in the next chapters, there are better ways to model the prototype
words for a speech recognition system.
Chapter 7
Hidden Markov Models
In the last chapter, we were using the DP algorithm to ﬁnd the optimum match
between an utterance and a sequence of words contained in our vocabulary.
The words of the vocabulary were represented by one or more prototype vector
sequences. All variability of speech that naturally occurs if a word is spoken at
diﬀerent times by a person or even by diﬀerent persons had to be reﬂected by the
prototype vector sequences. It should be clear that the number of prototypes
we have to store in order to implement a speaker–independent system might
become quite large, especially if we are dealing with a large vocabulary size
also.
What we are rather looking for to handle this problem is a way to represent
the words of our vocabulary in a more generic form than just to store many
speech samples for each word. One would like to look for a model of the speech
generation process for each word itself instead of storing samples of its output.
If, for example, we would have a general stochastic model for the generation of
feature vectors corresponding to a given word, then we could calculate how good
a given utterance ﬁts to our model. If we calculate a ﬁt–value for each model of
our vocabulary, then we can assign the unknown utterance to the model which
best ﬁts to the utterance.
This is a very simplistic description of a general classiﬁcation method, the so–
called statistical classiﬁcation. For additional reading on statistical methods for
speech recognition, the book [Jel98] is strongly recommended.
In the following, we will see how it can be used for speech recognition.
7.1 Statistical Classiﬁcation
The classiﬁcation task for isolated word recognition can be formulated in a
statistical framework as follows:
Let’s assume we have for each word w
v
of our vocabulary a model Λ
v
which
is deﬁned by a set of parameters λ
v
. What kind of parameters these are we
will learn later. For now, we assume that we are able to compute for a vector
sequence
˜
X the probability density that the utterance
˜
X was produced by the
model Λ
v
. We denote this conditional probability density function, which de
pends on the model Λ
v
, as p(
˜
XΛ
v
). It will later often be called the emission
probability density. Note that p(
˜
XΛ
v
) is not a probability, but rather a class
47
CHAPTER 7. HIDDEN MARKOV MODELS 48
speciﬁc probability density function (PDF), since we are dealing with continu
ous values of each element of the vectors x
j
of the sequence, instead of discrete
values.
7.1.1 Bayes Classiﬁcation
The task of classifying isolated words can now be verbalized as: Assign the un
known utterance
˜
X to that class ω
v
, to which the unknown utterance ”belongs”
to with the highest probability.
What we need here is the so–called a–posteriori probability P(ω
v

˜
X). It de
notes the probability that the utterance
˜
X belongs to the class ω
v
and will be
computed with respect to all our models Λ for all classes ω.
The classiﬁcation task at hand can now be deﬁned as:
˜
X ∈ ω
v
⇔v = arg max
v
P(ω
v

˜
X)
¸
(7.1)
Note that in contrary to (4.2), we have to chose the maximum a–posteriori
probability instead of the minimum class distance. The classiﬁcation scheme as
deﬁned above is called the ”Bayes Classiﬁcation”, e.g., [Rus88].
How can we compute the probability P(ω
v

˜
X) measuring the probability that
˜
X belongs to class ω
v
? Using Bayes’ equation it is computed as follows:
P(ω
v

˜
X) =
p(
˜
X, ω
v
)
p(
˜
X)
=
p(
˜
Xω
v
) · P(ω
v
)
p(
˜
X)
(7.2)
If we insert (7.2) into (7.1), the unconditional emission probability density p(
˜
X)
can be eliminated, since we are performing a maximum search and p(
˜
X) is
independent from the class under consideration. Thus, we can rewrite (7.1):
˜
X ∈ ω
v
⇔v = arg max
v
p(
˜
Xω
v
) · P(ω
v
)
¸
(7.3)
So we only need the probability density functions p(
˜
Xω
v
) and the a–priori
probabilities of the classes P(ω
v
), which denote the unconditional (i.e., not
depending on the measured vector sequence
˜
X) probability that the utterance
belongs to class ω
v
.
We can now use our word models Λ as approximations for the speech generation
process and by using the emission probability densities of our word models, (7.3)
can be rewritten as:
˜
X ∈ Λ
v
⇔v = arg max
v
p(
˜
XΛ
v
) · P(Λ
v
)
¸
(7.4)
7.1.2 Maximum Likelihood Classiﬁcation
Equation (7.4) uses the probability density functions p(
˜
XΛ
v
) – which we assume
we are able to compute – and the a–priori probabilities of the models P(Λ
v
),
which could be determined by simply counting the word frequencies within a
given training set.
However, the speech recognition task we are currently dealing with is the recog
nition of isolated words, like simple command words. It it reasonable to assume
that these command words will occur with equally distributed probability, e.g.,
CHAPTER 7. HIDDEN MARKOV MODELS 49
for a ”yes / no” task, we can assume that the utterance ”yes” will occur as often
as ”no”. If we assume equally distributed a–priori probabilities, P(Λ
v
) becomes
a constant value (1/V ) and we can rewrite (7.4) as:
˜
X ∈ Λ
v
⇔v = arg max
v
p(
˜
XΛ
v
)
¸
(7.5)
Now the classiﬁcation depends solely on the model (or class) speciﬁc probability
densities p(
˜
XΛ
v
), which are also called the likelihood functions. Therefore, this
classiﬁcation approach is called the maximum likelihood classiﬁcation. In the
next sections we will see how to compute the density functions p(
˜
XΛ
v
).
7.2 Hidden Markov Models
One major breakthrough in speech recognition was the introduction of the sta
tistical classiﬁcation framework described above for speech recognition purposes.
As we have seen, the important part of that framework is the emission proba
bility density p(
˜
XΛ
v
). These density functions have to be estimated for each
word model Λ
v
to reﬂect the generation (or ”emission”) process of all possible
vector sequences
˜
X
v
belonging to the word class ω
v
. As we know, these vector
sequences my vary in their length as well as in the individual spectral shapes
of the feature vectors within that sequence. Thus, a model is needed which is
capable of dealing with both of these variabilities. The Hidden Markov Models
(short: HMMs) are used successfully in speech recognition for many years.
In the following, we will only discuss those properties of the HMMs which we
will need to understand how they are used in algorithms for speech recognition.
A very detailed introduction into HMMs can be found in [RJ86, Rab89, Jel98],
while early applications are described in [LRS85a, LRS89].
7.2.1 The Parameters of a Hidden Markov Model
The HMM is modelling a stochastic process deﬁned by a set of states and tran
sition probabilities between those states, where each state describes a stationary
stochastic process and the transition from one state to another state describes
how the process changes its characteristics in time.
Each state of the HMM can model the generation (or: emission) of the observed
symbols (or in our case of the feature vectors) using a stationary stochastic
emission process. For a given observation (vector) however, we do not know
in which state the model has been when emitting that vector. The underlying
stochastic process is therefore ”hidden” from the observer.
In our application, each state of the HMM will model a certain segment of the
vector sequence of the utterance. These segments (which are assumed to be
stationary) are described by the stationary emission processes assigned to the
states, while the dynamic changes of the vector sequence will be modelled by
transitions between the states of the HMM.
Now it is time to introduce a little bit of formal framework:
Let the HMM Λ
v
have N
v
diﬀerent states denoted as s
v
θ
with θ ∈ Ω
N
v , where
Ω
N
v = {0, 1, . . . , (N
v
−1)} is the set of state indices.
CHAPTER 7. HIDDEN MARKOV MODELS 50
Let a
v
i,j
; i, j ∈ {0, 1, . . . , (N
v
−1)} be the transition probability between state
s
v
i
and s
v
j
. The set of all transition probabilities can be compactly described
by the matrix
˜
A
(N
v
×N
v
)
=
a
v
i,j
. Each state s
v
θ
has assigned its stationary
emission process deﬁned by the emission probability density function p (xs
v
θ
).
In addition, we deﬁne the initial state probability u
v
θ
; θ ∈ {0, 1, . . . , (N
v
−1)},
which denotes the initial probability of the model being in state s
v
θ
when the
generation process starts.
7.2.2 HMMs for Word Modelling
A vector sequence
˜
X =
¸
x
0
, x
1
, . . . , x
(TX−1)
¸
which belongs to a word w
v
can
now be generated by the HMM Λ
v
as follows (to simplify the notation, we skip
the word class index v):
At each time t = 0, 1, ...(T
X
−1) the HMM is in a state s
θt
and will generate a
vector x
t
with the emission probability density p (x
t
s
θt
) (the initial state prob
ability u
θ
gives the probability being in state s
θ
at time t = 0). Then it will
make a state transition from state s
θt
to state s
θt+1
with the transition proba
bility a
θt,θt+1
and so on. A given vector sequence
˜
X can thus be generated by
running through a sequence of state indices Θ =
¸
θ
0
, θ1, . . . θ
(TX−1)
¸
. However,
given only the sequence
˜
X, one can not determine which sequence Θ was used
to generate
˜
X i.e., the state sequence is hidden from the observer.
The state transitions of the HMM reﬂect the sequence of quasi–stationary seg
ments of speech contained in the utterance
˜
X. Therefore, one usually restricts
the transition probability densities of the HMM to a ”from left to right” struc
ture, i.e., being in a certain state s
θ
, only states with the same or a higher state
index θ can be reached with nonzero probability. The initial state probabil
ity is usually set to one for the ﬁrst state and to zero for all the other states,
so the generation process always starts in state s
0
. Usually the state transi
tion probabilities are restricted to reach only the two next state indices (e.g.,
[LRS85a]):
a
θ,θ+∆
> 0 , if 0 ≤ ∆ ≤ 2
a
θ,θ+∆
= 0 , else
(7.6)
This leads to a left–to–right HMM structure as shown for ﬁve states in Fig. 7.1
0
s
1
s
2
s
3
s
4
s
33
a a
44
a
22 11
a a
00
01
a a
12 23
a a
34
a
24 13
a a
02
Figure 7.1: Typical left–to–right HMM for word recognition
Note that by choosing nonzero self–transitions probabilities a
θ,θ
, vector se
quences of inﬁnite length can generated, the shortest length of the sequence is
deﬁned by the shortest path through the model, e.g., in Fig. 7.1, the shortest
vector sequence must have three vectors.
CHAPTER 7. HIDDEN MARKOV MODELS 51
7.2.3 Emission Probability Density of a Vector Sequence
As we saw in section 7.1.2, we have to compute the emission probability den
sity p
˜
XΛ
for a given sequence
˜
X and model Λ. We remember that the
HMM will generate the vector sequence
˜
X along a state index sequence Θ =
¸
θ
0
, θ1, . . . θ
(TX−1)
¸
. Assuming that we know the state sequence Θ which pro
duces
˜
X, we can easily compute the probability density p
˜
X, ΘΛ
, which is the
probability density of generating
˜
X along the state sequence Θ using the model
Λ. This is done by simply multiplying all the emission probability densities and
transition probabilities along the state sequence Θ:
p
˜
X, ΘΛ
= u
θ0
· p (x
0
s
θ0
) ·
TX−1
¸
t=1
a
θ
(t−1)
,θt
· p (x
t
s
θt
)
(7.7)
To compute the desired emission probability density p
˜
XΛ
, we now have to
sum up p
˜
X, ΘΛ
over all possible state sequences Θ with length T
X
. The
number of possible state sequences equals the number of T
X
– tupels that can
be generated from the set of state indices Ω
N
. We denote the set of all state
sequences of length T
X
given the set of state indices Ω
N
as Ω
(N
(T
X
)
)
.
Then, p
˜
XΛ
can be computed as:
p
˜
XΛ
=
¸
Θ ∈ Ω
(
N
(T
X
)
)
p
˜
X, ΘΛ
(7.8)
If we insert (7.7) into (7.8), we get:
p
˜
XΛ
=
¸
Θ ∈ Ω
(
N
(T
X
)
)
u
θ0
· p (x
0
s
θ0
) ·
TX−1
¸
t=1
a
θ
(t−1)
,θt
· p (x
t
s
θt
)
(7.9)
To compute the sum over all possible state sequences would require lots of
computational eﬀorts, but fortunately there exists an eﬃcient algorithm to do
so, the so–called Baum–Welch algorithm [Rab89].
Instead of using the Baum–Welch algorithm, we choose a diﬀerent approach:
We will not compute the sum over all state sequences Θ ∈ Ω
(N
(T
X
)
)
, but
we will approximate the sum by the single state sequence Θ
opt
which has the
highest contribution to the sum (7.9):
p
˜
XΛ
≈ p
˜
X, Θ
opt
Λ
= u
θ0
· p (x
0
s
θ0
) ·
TX−1
¸
t=1
a
θ
(t−1)
,θt
· p (x
t
s
θt
)
(7.10)
Where Θ
opt
is deﬁned as the state sequence which maximizes the emission prob
ability density p
˜
X, ΘΛ
:
Θ
opt
= arg max
Θ ∈ Ω
(
N
(T
X
)
)
p
˜
X, ΘΛ
(7.11)
CHAPTER 7. HIDDEN MARKOV MODELS 52
If we evaluate (7.10), we have to deal with product terms only, so that we
can easily take the logarithm of the individual probability (density) values and
replace the product by the sum of those log–values.
Since the probability density values may vary over several orders of magnitude,
taking the logarithmhas the additional advantage of reducing the dynamic range
of those values. The logarithm of the probability density is often called the log
score value.
How do we ﬁnd the optimal state sequence Θ
opt
?
Fig. 7.2 shows all the possible state sequences or paths (also called the trellis)
for the HMM shown in Fig. 7.1. In this Figure, the HMM is rotated by 90
degrees and on the horizontal axis the vectors of the utterance
˜
X are shown.
At every time index t, each possible path will go through one of the N states
of the HMM. In other words, at every grid point shown in Fig. 7.2, a path
recombination takes place. If we are using the log score values for the transition
probabilities and the emission probability densities, the evaluation of (7.10) is
transformed to adding up all the log score values along a given path, in analogy
to adding all the local distances along the path in section 4.
s
4 4
4
a
s
3
3
4
a
a
3
3
2
4
a
s
2
a
2
3
2
2
a a
1
3
0
2
a
1
2
a
a
1
1
s
1
s
0
a
0
1
0
0
a
X
Figure 7.2: Possible state sequences through the HMM
Now it is easy to ﬁnd the optimum state sequence / the optimum path through
the grid: It can be done with the DP algorithm (Also known as Viterbi algo
rithm) we learned in section 4 ! All we have to do is to deﬁne the equation for
the local recombination.
CHAPTER 7. HIDDEN MARKOV MODELS 53
7.2.4 Computing the Emission Probability Density
To compute p
˜
X, Θ
opt
Λ
, we have to ﬁnd the path or state sequence which
yields the maximum accumulated sum of log scores along that path. In section
4, we were searching for the minimum accumulated distance, instead. If we
would use the negative log score values, we could even search for the path with
the minimum accumulated negative log score. However, it is more intuitive to
deal with higher scores for better matches, so we will use log scores and search
for the maximum in the following.
Scores and Local Path Recombination
We will start with the deﬁnition of the log score values. Since we are dealing with
one given HMM Λ
v
here, we will leave out the model index v in the following.
Let b(t, s
θ
) denote the logarithm of the emission probability density of vector
x
t
being emitted by state s
θ
:
b(t, s
θ
) = log (p (x
t
s
θ
)) (7.12)
Let d(s
i
, s
θ
) be the log probability for the state transition s
i
→ s
θ
:
d(s
i
, s
θ
) = log (a
si,s
θ
) (7.13)
Let δ(s
θ
, t) be the accumulated log probability density for the optimal path
beginning at time index 0 and ending in state s
θ
at time index t (If we further
assume that u
0
= 1, then all paths are starting at time index 0 in state s
0
).
Now we will deﬁne the local path alternatives for path recombination. Depend
ing on the transition probabilities of the model, a given state may have up to
N predecessors. In our left–to–right structure, the states s
θ
in the middle of
the model have only three possible predecessors (see Fig. 7.3): The state s
θ
itself, and its two predecessors with the lower state indices (θ −1) and (θ −2).
However, this is not true for the ﬁrst and second state of the model. To allow
for more complex model structures also, we deﬁne the set Ξ(s
θ
) as the set of
all valid predecessor state indices for a given state s
θ
.
With these deﬁnitions, the local path recombination scheme can be written as:
δ(s
θ
, t) = b(t, s
θ
) + max
i ∈ Ξ(s
θ
)
{d(s
i
, s
θ
) +δ(s
i
, (t −1))} (7.14)
Note that in contrary to (4.6), all predecessor grid points have the time index
(t −1) here (there is no vertical transition possible, since each state transition in
the HMM implies an increase of the time index t). This local path recombination
will now be performed for each time index t and within each time index for all
state indices θ.
The Viterbi Algorithm
Now, we can formulate the Viterbi (or DP) algorithm for computing the score
log
p
˜
X, Θ
opt
Λ
. To enable the backtracking of the state sequence Θ
opt
, we
will use the backtracking variable ψ(θ, t) in analogy to section 4:
CHAPTER 7. HIDDEN MARKOV MODELS 54
t
(i2)
(i1)
i
t (t1)
i
Figure 7.3: Possible predecessor states for a given HMM state
• Initialization (t = 0):
For the left–to–right structure we use, we deﬁne u
0
= 1. Therefore, all
other initial state probabilities are zero, enforcing all paths to begin in
state s
0
with the emission of the vector x
0
:
– Initialize ﬁrst column:
δ(s
θ
, 0) =
b(0, s
0
) , θ = 0
−∞ , θ = 1, 2, . . . , (N −1)
(7.15)
– Initialize backtracking variables:
ψ(θ, 0) = −1 ; θ = 0, 1, . . . , (N −1) (7.16)
• Iteration:
Perform the local path recombination as given by (7.14) for all states
θ = 0, 1, . . . , (N −1) and all time indices t = 1, 2, . . . , (T
X
−1):
– Optimization step:
δ(s
θ
, t) = b(t, s
θ
) + max
i ∈ Ξ(s
θ
)
{d(s
i
, s
θ
) +δ(s
i
, t −1)} (7.17)
– Backtracking:
ψ(θ, t) = arg max
i ∈ Ξ(s
θ
)
{d(s
i
, s
θ
) +δ(s
i
, t −1)} (7.18)
• Termination:
In our left–to–right model, we will consider only paths which end in the
last state of the HMM, the state s
N−1
. After the last iteration step is
done, we get:
log
p
˜
X, Θ
opt
Λ
= δ(s
N−1
, (T
X
−1)) (7.19)
CHAPTER 7. HIDDEN MARKOV MODELS 55
• Backtracking procedure:
Beginning with time index (T
X
− 1), go backwards in time through the
backtracking variables:
– Initialization:
t = (T
X
−1) ; θ
(TX−1)
= N −1 (7.20)
– Iteration:
For t = (T
X
−1), . . . , 1:
θ
(t−1)
= ψ(θ
t
, t) (7.21)
– Termination:
θ
0
= ψ(θ
1
, 1) ≡ 0 (7.22)
7.3 Modelling the Emission Probability Densi
ties
So far, we have assumed that we can compute the value of the state–speciﬁc
emission probability density p (xs
θ
) of a HMM Λ. Now we have to deal with
the question of how p (xs
θ
) is modelled within each state of a HMM and how
to compute the probability density for a given vector x. As we recall, in the
states of the HMM we model stationary emission processes which we assume to
correspond with stationary segments of speech. Within those segments, we have
to allow for the wide variability of the emitted vectors caused by the variability
of the characteristic features of the speech signal.
7.3.1 Continuous Mixture Densities
How do we model the emission process ?
As we remember from chapter 2, we are dealing with feature vectors where each
component is representing a continuous valued feature of the speech signal.
Before we deal with the question of how to model probability density functions
for multi–dimensional feature vectors, we will ﬁrst look at the one–dimensional
case.
Lets assume we measure only one continuously valued feature x. If we assume
the measurement of this value to be disturbed by many statistically independent
processes, we can assume that our measurements will assume a Gaussian distri
bution of values, centered around a mean value m, which we then will assume
to be a good estimate for the true value of x. The values we measure can be
characterized by the Gaussian probability density function. As we recall from
school, the Gaussian PDF φ(x) is deﬁned as:
φ(x) =
1
√
2πσ
2
· e
−
1
2
·
(x−m)
2
σ
2
(7.23)
Where m represents the mean value and σ
2
denotes the variance of the PDF.
These two parameters fully characterize the one–dimensional Gaussian PDF.
CHAPTER 7. HIDDEN MARKOV MODELS 56
Gaussian Probability Density Function
In our application, we measure feature vectors in a multi–dimensional feature
space, so we will use a multivariate Gaussian PDF, which looks like this:
φ(x) =
1
(2π)
DIM
· 
˜
C
· e
(−
1
2
( x− m)
′
·
˜
C
−1
· ( x− m))
(7.24)
Where m denotes the mean vector and
˜
C denotes the covariance matrix. You
know these parameters already, because we used them in computing the Maha
lanobis distance (3.8) in section 3.4. Note that the the Mahalanobis distance is
actually part of the exponent of the PDF (7.24).
Continuous Mixture Densities
The Gaussian PDF (7.24) can characterize the observation probability for vec
tors generated by a single Gaussian process. The PDF of this process has a
maximum value at the position of the mean vector m and its value exponen
tially decreases with increasing distance from the mean vector. The regions of
constant probability density are of elliptical shape, and their orientation is de
termined by the Eigenvectors of the covariance matrix (e.g., [Rus88]). However,
for speech recognition, we would like to model more complex probability dis
tributions, which have more than one maximum and whose regions of constant
probability density are not elliptically shaped, but have complex shapes like the
regions we saw in Fig. 2.1.
To do so, the weighted sum over a set of K Gaussian densities can be used to
model p(x) (e.g., [CHLP92, LRS85b, Rab89]):
p(x) =
K−1
¸
k=0
c
k
· φ
k
(x) (7.25)
Where φ
k
(x) is the Gaussian PDF as in (7.24). The weighting coeﬃcients c
k
are called the mixture coeﬃcients and have to ﬁt the constraint:
K−1
¸
k=0
c
k
= 1 (7.26)
A PDF modelled by this weighted sum of Gaussian densities is called a Con
tinuous Mixture Density, a HMM using this modelling approach is called a
Continuous Mixture Density HMM.
Figure 7.4 shows the scatterplot of an emission process consisting of three
Gaussian emission processes. The process parameters are:
c
1
= 0.3 ;
˜
C
1
=
¸
5 2
2 2
; m
1
= [1, 1.5]
′
c
2
= 0.4 ;
˜
C
2
=
¸
1 −1
−1 5
; m
2
= [−3, −1]
′
c
2
= 0.3 ;
˜
C
3
=
¸
1 0
0 1
; m
3
= [3, −3]
′
The corresponding probability density function is visualized as a suface in ﬁg.
7.5.
CHAPTER 7. HIDDEN MARKOV MODELS 57
−8 −6 −4 −2 0 2 4 6 8
−10
−8
−6
−4
−2
0
2
4
6
8
scatterplot
Figure 7.4: scatterplot of vectors emitted by a gaussian mixture density process
The complexity of the PDF modelled by (7.25) can be inﬂuenced by selecting
the appropriate number of mixtures K and may be set individually for every
state of the HMM. However, since a high number of mixtures also means that
a high number of parameters has to be estimated from a given training set of
data, always some compromise between the granularity of the modeling and the
reliability of the estimation of the parameters has to be found.
7.3.2 Approximations
Using mixture densities leads to some practical problems:
First, the computational eﬀort to compute all the densities for all states of a
HMM might be very high.
We left out the state indices in the sections before for better readability, but
we have to remember that in every state s
θ
of the HMM a state–speciﬁc emis
sion probability density function p (xs
θ
) has to be modelled, each having K
θ
gaussian densities. Since the computation of each Gaussian density involves the
multiplication with the covariance matrix, this leads to a signiﬁcant computa
tional complexity.
Second, the high number of parameters demands a huge training material to
reliably estimate the model parameters.
Many methods are known to reduce the computational complexity and the num
ber of parameters as well. In the following, just some examples of the various
CHAPTER 7. HIDDEN MARKOV MODELS 58
Figure 7.5: probability density function
possibilities are given.
Maximum Approximation
In (7.25), the sum of all K gaussian mixtures is computed. However, since value
for a gaussian density exponentially decreases with the distance from the mean
vector, one can approximate the sum over all mixtures by selecting just the one
term which yields the highest contribution to the sum (e.g.,[Ney93]):
p(x) ≈ max
k=0,1,...,K−1
{c
k
· φ
k
(x)} (7.27)
Since we do not have to compute the sum over the mixture densities anymore,
the log values of the mixture coeﬃcient and the gaussian PDF can be used,
which we know helps us in avoiding numerical problems:
log (p(x)) ≈ max
k=0,1,...,K−1
{log(c
k
) + log (φ
k
(x))} (7.28)
Diagonal Covariance Matrix
To reduce the number of parameters and the computational eﬀort simultane
ously, it is often assumed that the individual features of the feature vector are
not correlated. Then diagonal covariance matrices can be used instead of full
covariance matrices. This drastically reduces the number of parameters and the
computational eﬀort. As a consequence, a higher number of mixtures can be
CHAPTER 7. HIDDEN MARKOV MODELS 59
used, and correlation of the feature vectors can thus (to a certain degree) be
modelled by the combination of diagonal mixtures [LRS85b, CHLP92].
Parameter Tying
To reduce the number of parameters even further, some parameters can be
”shared” or tied among several mixtures of a given state, among several states
of an HMM and even among the HMMs. The idea behind parameter tying
is that if two mixtures (or states or models) have tied parameters, i.e., are
using the same value for that parameter, then the training material for these
mixtures (states, models) can also be pooled to increase the reliability of the
estimation of the tied parameter. A huge number of approaches for Models with
tied parameters exists and is beyond the scope of this introduction. Just a few
examples shall be given here:
• Using a single covariance matrix:
A commonly used approach is to share the covariance matrix among all
states and even among all models, while the mean vectors and the mixture
coeﬃcients are estimated individually for each state. These parameters al
low a reliable estimate and therefore the number of mixtures per state can
be signiﬁcantly increased, thus implicitly modelling state–speciﬁc vari
ances and correlations (e.g., [Ney93]).
• Semicontinuous HMM:
Instead of estimating individual gaussian densities φ
k,θ,Λ
(x) for each state
θ and Model Λ, The densities are shared among all models and states.
Only the mixture coeﬃcients c
k,θ,Λ
are estimated individually for each
state (e.g.,[XDHH90, HL92]).
• Generalised Parameter Tying:
While the examples above were deﬁning the set of tied parameters by
an a–priori decision, it is also possible to introduce parameter tying on
several levels (e.g., tied mean vectors, tied covariances, tied gaussian den
sities, tied mixtures, tied states, tied models), depending on the amount of
training material available as well as determined on incorporated apriori
knowledge. Examples can be found in [You92, SYW94, You96, You01].
Chapter 8
Speech Recognition with
HMMs
Now that we are able to compute the likelihood functions p(
˜
XΛ
v
), we can
perform the maximum likelihood classiﬁcation (7.5) as described in section 7.1.2.
In analogy to chapter 5, we will use the optimum path sequence found by the
Viterbi algorithm to perform the maximum likelihood classiﬁcation.
8.1 Isolated Word Recognition
First of all, we will organize our search space according to the task at hand. To
do so, we will build a superHMM, which consists of all word models Λ
v
and
is extended by two virtual states, a virtual start state and a virtual end state,
respectively. In the following, the state index will be extended by the additional
index v to identify the model Λ
v
a state s
(θ,v)
belongs to. Fig. 8.1 shows the
structure of the superHMM.
s
4
44
a
34
a
s
3
a
33
24
a
a
23
s
2
22
a
a
13
12
a
s
1
a
11
02
a
a
01
s
0
00
a
a
00
0
s
01
a
a
02
11
a
1
s
a
12
13
a
a
22
2
s
23
a
a
24
33
a
3
s
a
34
a
44
4
s
s
4
44
a
34
a
s
3
a
33
24
a
a
23
s
2
22
a
a
13
12
a
s
1
a
11
02
a
a
01
s
0
00
a
End Start
Figure 8.1: The model structure for isolated word recognition
Now we adapt the Viterbi algorithm we used in section 7.2.4 accordingly:
60
CHAPTER 8. SPEECH RECOGNITION WITH HMMS 61
The virtual start state will not emit any vectors but will only provide the ﬁrst
states of each HMM with a predecessor for the virtual time slot t = −1. To
initialize the Viterbi algorithm, we simply set the accumulated score for s
start
to
zero and the backtracking variable ψ(s
start
, −1) to −1. For each ﬁrst state s
(θ,v)
of each HMM Λ
v
, the set of predecessor states Ξ
s
(θ,v)
is deﬁned to contain
only the state s
start
. For simplicity, the state transition scores d(s
start
, s
(θ,v)
)
are set to zero instead of log(
1
V
).
For the end state of the super HMM, the set of predecessors Ξ(s
end
) will contain
all end states s
((Nv−1),v)
. The transition scores d(s
((Nv−1),v)
, s
end
) are set to
zero.
Now, the Viterbi iteration (7.17) and backtracking procedure (7.18) is performed
for all states s
(θ,v)
for all HMMs Λ
v
and for all time indices t = 0, 1, . . . , (T
X
−1).
Due to the deﬁnition of the variables for our start state as given above, it is
ensured that the ﬁrst state of each model will be initialized properly for time
t = 0. If backtracking is desired, now both the state and the model index (i, v)
have to be stored in ψ(θ, t).
To terminate the Viterbi algorithm, we perform the last path recombination for
the virtual time index T
X
:
• Optimization step:
δ(s
end
, T
X
) = max
(i,v) ∈ Ξ(s
end
)
¸
δ(s
(i,v)
, (T
X
−1))
¸
(8.1)
• Backtracking:
ψ(s
end
, T
X
) = arg max
(i,v) ∈ Ξ(s
end
)
¸
δ(s
(i,v)
, (T
X
−1))
¸
(8.2)
The word index v contained in ψ(s
end
, T
X
) contains the best path among all
possible word ends to s
end
and is therefore the desired classiﬁcation result ac
cording to the maximum likelihood approach.
By constructing the Super HMM framework we were able to solve the classiﬁ
cation task by ﬁnding the optimal state sequence, as we did in in chapter 5. Of
course, as in that chapter, the Viterbi algorithm allows a real–time implemen
tation.
To construct large HMMs which reﬂect the speech recognition task at hand
from smaller HMMs and performing the recognition task at hand by search
ing the optimum state sequence for that HMM is one of the most important
issues in the stochastic modelling framework. These super models can even
contain additional knowledge sources, as e.g., a so–called language model which
models the a–priori probabilities of sequences of words. Because with these
large HMMs those knowledge sources can be fully integrated into the frame
work (even into the training procedures), they are also called uniﬁed stochastic
engines, e.g.[XHH93].
We will see in the following section, that the extension of that framework to
connected word recognition is straightforward.
CHAPTER 8. SPEECH RECOGNITION WITH HMMS 62
8.2 Connected Word Recognition
8.2.1 Deﬁnition of the Search Space
If we want to perform connected word recognition as in chapter 6, we will have
to construct a super HMM capable of generating an unlimited sequence of words
from our vocabulary. This is an easy thing to do: we simply take our model for
isolated word recognition and introduce a ”loop” transition from the last state
back to the ﬁrst state of the model. As before, those two virtual states will
not emit any vectors and therefore will not consume any time index. They are
just used to simplify the the ”bookkeeping” for the path recombinations after
leaving the last states of the individual word models and the transition into the
ﬁrst states of the word models. The resulting model is shown in Fig. 8.2.
s
4
44
a
34
a
s
3
a
33
24
a
a
23
s
2
22
a
a
13
12
a
s
1
a
11
02
a
a
01
s
0
00
a
a
00
0
s
01
a
a
02
11
a
1
s
a
12
13
a
a
22
2
s
23
a
a
24
33
a
3
s
a
34
a
44
4
s
s
4
44
a
34
a
s
3
a
33
24
a
a
23
s
2
22
a
a
13
12
a
s
1
a
11
02
a
a
01
s
0
00
a
S R
Figure 8.2: The model structure for connected word recognition
Note that this extension is in total analogy to what we did in chapter 6:
At each time index t all word models may make a transition from their last state
to the virtual ”R” state for path recombination. After the path recombination
a virtual transition to state ”S” is made and state ”S” may be used as a possible
predecessor state for the ﬁrst states of all word models.
In chapter 6, for between–word path recombination, the ﬁrst grid point of each
word was allowed to select either the the ”horizontal” transition or the best
word end as its predecessor grid end. By introducing the variable γ(j − 1) as
in (6.2) to simplify the between–word path recombination (6.3), we did in fact
introduce a virtual state like S
R
for path recombination.
To deﬁne the dimensions of our search space is very simple: It is spanned by all
states of all word models Λ
v
in one dimension and by the vectors of the vector
sequence
˜
X in the other dimension. To identify the individual states s
θ
of the
word models, we use again the model index v as an additional index, in analogy
to the use of vector indices i and partition indices v we had in chapter 6.
CHAPTER 8. SPEECH RECOGNITION WITH HMMS 63
8.2.2 The Viterbi Algorithm for Connected Word Recog
nition
Now, all we have to to is to reformulate the equations for path recombination
and backtracking given in section 7.2.4 to ﬁt to the new problem.
Let Ξ
s
(θ,v)
be the set of predecessor states of state s
(θ,v)
of word model Λ
v
.
Note that the predecessor states are identiﬁed by the tupel (i, k) consisting of
the state index i and the model index k. We extend the set of predecessors Ξ
to consider also the two states s
R
and s
S
: For s
R
, Ξ(s
R
) will contain the state
indices of all last states s
(Nv−1),v
of all models Λ
v
.
The state s
S
has only the predecessor s
R
and will not emit any vector. Since
it is used only as a predecessor for the ﬁrst states of the word HMMs, we can
deﬁne S
R
as their predecessor state directly, in fact merging the states s
S
and
s
R
into s
R
: For all ﬁrst states of all models Λ
v
, Ξ(s
0,v
) will contain only the
the state index of s
R
and the state index for the state s
0,v
itself to allow the
self–transition .
We start with the local path recombination in analogy to (7.14):
δ(s
(θ,v)
, t) = b(t, s
(θ,v)
) + max
(i,k) ∈ Ξ(s
(θ,v))
¸
d(s
(i,k)
, s
θ,v
) +δ(s
i,k)
, (t −1))
¸
(8.3)
The equation for the local path recombination (8.3) holds also for the ﬁrst states
of each model due to the deﬁnition of the set of predecessors.
For the virtual state s
R
, the recombination of the paths at the word ends is
performed for the time index t. This has to be done after all the local path
recombinations to ensure that all local paths leading to word ends are already
recombined correctly before we are selecting the best word end. To emphasize
this fact in section 6.4.4, we explicitly made this step at the beginning of time
index j using the index of the old values j − 1. Now that we are used to it,
we just keep in mind that the recombination for the state s
R
has to be tha last
action for time index t:
δ(s
R
, t) = max
(i,k) ∈ Ξ(sR)
¸
δ(s
i,k)
, t)
¸
(8.4)
To allow backtracking of the word transitions made during the between–word
path recombination (8.4), we need the backtracking arrays w(t) and t
s
(t) as in
section 6.4.4 and again we must forward the start time of the word model during
the local path recombination.
Now, the Viterbi algorithm for connected word recognition with HMMs can be
deﬁned as follows:
• Initialization:
For the ﬁrst time index t = 0, we must initialize the ﬁrst columns for all
models. While the ﬁrst states of all models will be initialized with the log
score of emitting the ﬁrst vector of the utterance, all other states will be
initialized with the default value −∞.
The word start time for all models at that time index is t = 0, of course.
– Initialize ﬁrst columns:
For all v = 0, 1, . . . , (V −1):
CHAPTER 8. SPEECH RECOGNITION WITH HMMS 64
δ(s
(θ,v)
, 0) =
b(0, s
(0,v)
) , θ = 0
−∞ , θ = 1, 2, . . . , (N
v
−1)
(8.5)
– Initialize state–level backtracking variables:
For all v = 0, 1, . . . , (V −1):
ξ((θ, v), 0) =
0 , θ = 0
−1 , θ = 1, 2, . . . , (N
v
−1)
(8.6)
– Init word level backtracking:
t
s
(0) = 0 ; w(0) = −1 (8.7)
• Iteration for t = 1, 2, . . . , (T
X
−1):
– Local path recombination:
δ(s
(θ,v)
, t) = b(t, s
(θ,v)
)+ max
(i,k) ∈ Ξ(s
(θ,v))
¸
d(s
(i,k)
, s
θ,v
) +δ(s
(i,k)
, (t −1))
¸
(8.8)
– Backtracking at state level:
Forward the word start time associated with the path ending in
δ(s
(θ,v)
, t):
∗ Get predecessor grid point:
(i, k) = arg max
(i,k) ∈ Ξ(s
(θ,v))
¸
d(s
(i,k)
, s
θ,v
) +δ(s
(i,k)
, (t −1))
¸
(8.9)
∗ Forward the start time:
ξ((θ, v), t) = ξ((i, k), t −1) (8.10)
– Path recombination at the word ends:
Select the best word end at time t and enter the score into δ(s
R
, t).
δ(s
R
, t) = max
(θ,v) ∈ Ξ(sR)
¸
δ(s
((θ,v)
, t)
¸
(8.11)
– Backtracking at word level:
Store the word index and word start time of the best word end at
time t:
∗ Get best word end state (i, k) at time t:
(i, k) = arg max
(θ,v) ∈ Ξ(sR)
¸
δ(s
((θ,v)
, t)
¸
(8.12)
∗ Store word index:
w(t) = k (8.13)
∗ Store word start time:
t
s
(t) = ξ((i, k), t) (8.14)
CHAPTER 8. SPEECH RECOGNITION WITH HMMS 65
• Termination:
At t = (T
X
− 1), the accumulated score of the utterance can be found
in δ(s
R
, (T
X
− 1)), the index of the best word end (the last word of the
utterance) can be found in w(T
X
− 1) and the start time of that word in
t
s
(T
X
−1)
• Backtracking Procedure:
Beginning with time index (T
X
− 1), go backwards in time through the
backtracking variables:
– Initialization:
t = (T
X
−1) (8.15)
– Iteration:
While t > 0:
w
t
= w(t) ; t = t
s
(t); (8.16)
8.3 Summary
Now you know the principles of the algorithms used for simple speech recognition
tasks like the recognition of connected digits. Of course, real implementations
will deal with additional problems like the reduction of the computational ef
forts during the Viterbi search or the incorporation of silence models into the
search process.
We know now how we can use the Maximum Likelihood classiﬁcation for match
ing a given utterance against a predeﬁned vocabulary represented by HMMs.
The algorithms we used were quite straightforward and had a very regular struc
ture. The ﬂip side of the coin is that this simple approach has several disadvan
tages, of which at least some should be mentioned here:
• No matter what the utterance is, our algorithm will always produce a
sequence of words out of our vocabulary.
• The connected word recognizer we described is unable to either provide
us with a conﬁdence measure for the accuracy of the recognized word
sequence or detect ”out of vocabulary” words.
• We will get only the best match for the word sequence and no alternative
”nbest” word sequences are presented.
• During the recognition process, only the acoustic knowledge represented
by the HMMs is used, no other ”higher–level” knowledge sources are ex
ploited.
• It has no knowledge about any semantic restrictions of the word sequence
and is in no way capable of really ”understanding” the utterance.
Chapter 9
Conclusion
This short introduction covered the basics of speech recognition up to the prin
ciples of connected word recognition. We also pointed out the disadvantages of
these primitive algorithms. However, there are so many open topics we did not
even come close to deal with. Among them are the following:
• We know now how the basic algorithm for connected word recognition
with HMMs works. — But how does the training procedure for HMMs
look like?
• If we have large vocabularies, we will not have enough training samples for
the creation of word models. How can we build word models from smaller
units which can be reliably estimated?
• How can we search ”nbest” word sequences instead of the best word
sequence only?
• How can we integrate additional knowledge sources like ﬁnite state gram
mars or stochastic language models into the recognition process?
• How can we reduce the computational complexity of the search algorithm
so that we are able to meet real time constraints ?
We will learn more about some of these topics in the successor of this textbook
which will be entitled ”Speech Recognition: Selected Topics” and will be
available soon !
66
Bibliography
[Bra00] R. N. Bracewell. The Fourier Transform and its Applications.
McGrawHill, 2000.
[Bri87] E. Oran Brigham. Schnelle Fourier Transformation. Oldenbourg
Verlag, 1987.
[CHLP92] L. R. Rabiner C.H. Lee and R. Pieraccini. Speaker Independent
Continuous Speech Recognition Using Continuous Density Hidden
Markov Models., volume F75 of NATO ASI Series, Speech Recogni
tion and Understanding. Recent Advances. Ed. by P. Laface and R.
De Mori. Springer Verlag, Berlin Heidelberg, 1992.
[HL92] X. D. Huang and K. F. Lee. Phonene classiﬁcation using semicontin
uous hidden markov models. IEEE Trans. on Signal Processessing,
40(5):1962–1067, May 1992.
[Jel98] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press,
1998.
[LRS85a] S. E. Levinson L.R. Rabiner, B.H. Juang and M. M. Sondhi. Recog
nition of isolated digits using hidden markov models with continu
ous mixture densities. AT & T Technical Journal, 64(6):1211–1234,
JulyAugust 1985.
[LRS85b] S. E. Levinson L.R. Rabiner, B.H. Juang and M. M. Sondhi. Some
properties of continuous hidden markov model representations. AT
& T Technical Journal, 64(6):1251–1270, JulyAugust 1985.
[LRS89] J. G. Wilpon L.R. Rabiner and Frank K. SOONG. High per
formance connected digit recognition using hidden markov mod
els. IEEE Transactions on Acoustics, Speech and Signal Processing,
37(8):1214–1225, August 1989.
[Ney84] H. Ney. The use of a onestage dynamic programming algorithm
for connected word recognition. IEEE Transactions on Acoustics,
Speech and Signal Processing, ASSP32(2):263–271, April 1984.
[Ney93] H. Ney. Modeling and search in continuous speech recognition. Pro
ceedings of EUROSPEECH 1993, pages 491–498, 1993.
[Rab89] L.R. Rabiner. A tutorial on hidden markov models and selected ap
plications in speech recognition. Proceedings of the IEEE, 77(2):257–
286, February 1989.
67
BIBLIOGRAPHY 68
[RJ86] L.R. Rabiner and B.H. Juang. An introduction to hidden markov
models. IEEE ASSP Magazine, pages 4–16, January 1986.
[Rus88] G. Ruske. Automatische Spracherkennung. Oldenbourg Verlag,
M¨ unchen, 1988.
[ST95] E. G. SchukatTalamazzini. Automatische Spracherkennung. Vieweg
Verlag, 1995.
[SYW94] J.J. Odell S. Young and P.C. Woodland. Treebased state tying
for high accuracy acoustic modelling. Proc. Human Language Tech
nology Workshop, Plainsboro NJ, Morgan Kaufman Publishers Inc.,
pages 307–312, 1994.
[XDHH90] K. F. Lee X. D. Huang and H. W. Hon. On semicontinuous hidden
markov modeling. Proceedings ICASSP 1990, Albuquerque, Mexico,
pages 689–692, April 1990.
[XHH93] F. Alleva X. Huang, M. Belin and M. Hwang. Uniﬁed stochas
tic engine (use) for speech recognition. Proceedings ICASSP 1993,,
II:636–639, 1993.
[You92] S. J. Young. The general use of tying in phonemebased hmm speech
recognisers. Proceedings of ICASSP 1992, I(2):569–572, 1992.
[You96] S. Young. Large vocabulary continuous speech recognition: A re
view. IEEE Signal Processing Magazine, 13(5):45–57, 1996.
[You01] S. Young. Statistical modelling in continuous speech recognition.
Proc. Int. Conference on Uncertainity in Artiﬁcial Intelligence,
Seattle, WA, August 2001.
1
This document . . .
. . . was written after I presented a tutorial on speech recognition on the graduates workshop ”SMG”, November 2000, Venice, held by Prof. Dr. H. G. Tillmann, Institut f¨r Phonetik und Sprachliche Kommunikation, University of Munich, Germany. Iniu tially, I intended to give a short overview over the principles of speech recognition and the industrial applications of speech recognition technology to an audience consisting both of engineers and non–engineers. As I worked on the slides, I tried to explain most of the principles and algorithms of speech recognition by using pictures and examples that can also be understood by non–technicians. For the engineers in the audience and to prevent me from making inappropriate simpliﬁcations, I also added the formal framework. Thus, I always started to explain the algorithms by using pictures and examples to give an idea of what the algorithm does and then went into some of the details of the formulae. Although I was afraid that this might be the perfect approach to boring both the engineers and the non–engineers of my audience, it turned out that they both could live with this mixed presentation. Being asked to convert my slides into this written introduction to speech recognition, I took the chance to add some more details and formal framework which can be skipped by the non–engineer readers.
Version 1.1
In version 1.1 of this document some error corrections were made. In some chapters examples and pictures were added. Of course, errors (old and new ones)will still be contained and additional reﬁnements might be necessary. If you have any suggestions how to improve this document, feel free to send an email to plannerer@ieee.org
Copyright
This work is subject to copyright. All rights are reserved, whether the whole or part of this document is concerned, specifically the rights of translation, reuse of illustrations, reproduction and storage in data banks. This document may be copied, duplicated or modified only with the written permission of the author. c 2001,2002,2003 Bernd Plannerer, Munich, Germany plannerer@ieee.org
Who should read this?
This introduction to speech recognition is intended to provide students and graduates with the necessary background to understand the principles of speech recognition technology. It will give an overview over the very basics of pattern matching and stochastic modelling and it will show how the dynamic programming algorithm is used for recognition of isolated words as well as for the recognition of continuous speech. While it is impossible to present every detail of all these topics, it is intended to provide enough information to the interested reader to select among the numerous scientiﬁc publications on the subject. Those who are not interested in further studies should at least be provided with an idea of how automated speech recognition works and what today’s systems can do — and can’t do.
2
By changing the shape of the pipe (and in addition opening and closing the air ﬂow through your nose) you change the spectral shape of the speech signal.2 Technical Characteristics of the Speech Signal An engineer looking at (or listening to) a speech signal might characterize it as follows: • The bandwidth of the signal is 4 kHz 3 . thus articulating diﬀerent speech sounds. The spectral shape of the speech signal is determined by the shape of the vocal tract (the pipe formed by your throat. The air pressure forces the glottis to open and close periodically thus generating a periodic pulse train (triangle–shaped). 1. the speech signal can be excited in three possible ways: • voiced excitation The glottis is closed. Depending on which speech sound you articulate. By suddenly opening the closure the air pressure drops down immediately. This ”fundamental frequency” usually lies in the range from 80Hz to 350Hz. the air ﬂow from your lungs ﬁrst passes the glottis and then your throat and mouth. tongue. (”plosive burst”) With some speech sounds these three kinds of excitation occur in combination.Chapter 1 The Speech Signal 1.1 Production of Speech While you are producing speech sounds. teeth and lips). This results in a turbulence which generates a noise signal. • transient excitation A closure in the throat or mouth will raise the air pressure. • unvoiced excitation The glottis is open and the air passes a narrow passage in the throat or mouth. The spectral shape of the noise is determined by the location of the narrowness.
1 Bandwidth The bandwidth of the speech signal is much higher than the 4 kHz stated above. .2 Fundamental Frequency As described earlier. the zero crossing rate of the signal is very high.1).4 The Envelope of the Power Spectrum Decreases with Increasing Frequency The pulse sequence from the glottis has a power spectrum decreasing towards higher frequencies by 12dB per octave. the frequency of the formants (especially of the 1st and 2nd formant) change and therefore characterize the vowel being articulated.. as we all know from using the (analog) phone. Voiced excitation is used when articulating vowels and some of the consonants. 1.g. . For fricatives (e. /f/ as in fish or /s/. These frequencies are called formant frequencies. On the other hand. But where do those characteristics come from? 1. as in mess ).2. the socalled fundamental frequency. . Depending on the shape of the vocal tract (the diameter of the pipe changes along the pipe). However. for the fricatives. usually no fundamental frequency can be detected. Plosives (like /p/ as in put). 3. one can see that the pipe shows resonance at the frequencies as given by (1. 1. In fact. the vocal tract gives a characteristic spectral shape to the speech signal. you can best detect in the speech signal by looking for the short silence necessary to build up the air pressure before the plosive bursts out. 1. using voiced excitation for the speech sound will result in a pulse train. it seems that within a bandwidth of 4 kHz the speech signal contains all the information necessary to understand a human voice. this results in an overall decrease of 6dB per octave.2. unvoiced excitation (noise) is used. Thus. In these cases. If one simpliﬁes the vocal tract to a straight pipe (the length is about 17cm). there is still a signiﬁcant amount of energy in the spectrum for high and even ultrasonic frequencies. 2. n = 1. The emission characteristics of the lips show a highpass characteristic with +6dB per octave.2. which use transient excitation.1) • The envelope of the power spectrum of the signal shows a decrease with increasing frequency (6dB per octave) This is a very rough and technical description of the speech signal. (1.CHAPTER 1.3 Peaks in the Spectrum After passing the glottis. THE SPEECH SIGNAL 4 • The signal is periodic with a fundamental frequency between 80 Hz and 350 Hz • There are peaks in the spectral distribution of energy at (2n − 1) ∗ 500 Hz .2. .
we have the following parameters in our speech production model: • the mixture between voiced and unvoiced excitation (determined by v and u) • the fundamental frequency (determined by P (f )) • the spectral shaping (determined by H(f )) • the signal amplitude (depending on v and u) These are the technical parameters describing a speech signal. The output of both generators is then added and fed into the box modelling the vocal tract and performing the spectral shaping with the transmission function H(f ). one can adjust the signal amplitude of the impulse generator (v) and the noise generator (u). By directly computing the spectrum of the speech signal. the spectral shape is determined in short intervals of time. every 10 ms.1: A simple model of speech production This model works as follows: Voiced excitation is modelled by a pulse generator which generates a pulse train (of triangle–shaped pulses) with its spectrum given by P (f ). e. the most valuable information is contained in the way the spectral shape of the speech signal changes in time. For the speech recognizer. the parameters given above have to be computed from the time signal (this is called speech signal analysis or ”acoustic preprocessing”) and then forwarded to the speech recognizer. To reﬂect these dynamic changes. To mix voiced and unvoiced excitation. we can draw a simpliﬁed model of speech production [ST95.. The emission characteristics of the lips is modelled by R(f ). the fundamental frequency would be implicitly contained in the measured spectrum (resulting in unwanted ”ripples” in the spectrum). Figure 1. the production of speech can be separated into two parts: Producing the excitation signal and forming the spectral shape.2) To inﬂuence the speech sound.CHAPTER 1.3 A Very Simple Model of Speech Production As we have seen. To perform speech recognition. THE SPEECH SIGNAL 5 1. Rus88]: voiced excitation pulse train P(f) v vocal tract spectral shaping H(f) lips emission R(f) S(f) X(f) unvoiced excitation white noise N(f) u Figure 1. Thus. Hence. The unvoiced excitation is modelled by a white noise generator with spectrum N (f ). . 1.g. the spectrum S(f ) of the speech signal is given as: S(f ) = (v · P (f ) + u · N (f )) · H(f ) · R(f ) = X(f ) · H(f ) · R(f ) (1.2 shows the time signal of the vowel /a:/ and ﬁg.3 shows the logarithmic power spectrum of the vowel computed via FFT.
The high peaks in the time signal are caused by the pulse train P (f ) generated by voiced excitation. Most speech recognition systems use the so–called mel frequency cepstral coeﬃcients (MFCC) and its ﬁrst (and sometimes second) derivative in time to better reﬂect dynamic changes.4 Speech Parameters Used by Speech Recognition Systems As shown above.2: Time signal of the vowel /a:/ (fs = 11kHz. Figure 1.CHAPTER 1. the spec . 1. the direct computation of the power spectrum from the speech signal results in a spectrum containing ”ripples” caused by the excitation spectrum X(f ).4. N = 512). One could also measure the spectral shape by means of an analog ﬁlter bank using several bandpass ﬁlters as is depicted below. [Rus88]). Thus. Typically.3: Log power spectrum of the vowel /a:/ (fs = 11kHz. special transformations are used to separate the excitation spectrum X(f ) from the spectral shaping of the vocal tract H(f ). 1. After rectifying and smoothing the ﬁlter outputs. THE SPEECH SIGNAL 6 Figure 1. the output voltage would represent the energy contained in the frequency band deﬁned by the corresponding bandpass ﬁlter (cf. which represents H(f ) can be estimated from the speech signal. it is necessary to compute the speech parameters in short time intervals to reﬂect the dynamic change of the speech signal. The ripples in the spectrum are caused by P (f ). Depending on the implementation of the acoustic preprocessing however.1 Computation of the Short Term Spectra As we recall. a smooth spectral shape (without the ripples). length = 100ms).
. .5) . . we get a series of speech samples s(k · ∆t) where ∆t = 1/fs or.3) Then. Many window functions can be used.97 · s(k − 1) ˆ (1. the most common one is the so–called HammingWindow: w(k) = 0. m + N − 1 ˆ 0 : else (1. while the time window is shifted in time intervals of about 10ms to compute the next set of speech parameters.46 cos 2πk N −1 : if k = 0. a short piece of signal is cut out of the whole speech signal. This is done by multiplying the speech samples s(k) with a windowing function w(k) to cut ˆ out a short segment of the speech signal. Thus. we get a short speech segment vm (k): vm (k) = s(k) · w(k − m) : if k = m. THE SPEECH SIGNAL f1 v1(t) 7 f2 s(t) v2(t) fn vn(t) Figure 1. m + 1.4: A ﬁlter bank for spectral analysis tral parameters of speech are estimated in time intervals of 10ms. Now a preemphasis ﬁlter is used to eliminate the 6dB per octave decay of the spectral energy: s(k) = s(k) − 0. The length N of the segment (its duration) is usually chosen to lie between 16ms to 25 ms. . . 1. First. By multiplying our speech signal with the time window. simply s(k). for easier notation. . After digitizing the analog speech signal. overlapping segments are used for speech analysis.CHAPTER 1. Depending on the implementation.4) where N is the length of the time window in samples. a sampling frequency fs between 8kHz and 16kHz and usually a 16bit quantization of the signal amplitude is used. we have to sample and digitize the speech signal. vm (k) starting with sample number k = m and ending with sample number k = m + N − 1.54 − 0. N − 1 0 : else (1.
7) To compute the spectrum.6) and symmetry (1. n = 0. (1. Thus. the complex spectrum V (n) is computed. To simplify the notation.8) ˆ The DFT gives us N discrete complex values for the spectrum V (n) at the frequencies n · ∆f where 1 ∆f = (1. V (n)2 . All the following operations refer to this speech segment vm (k). . ˆ which gives us the discrete.. The complex spectrum V (n) has the following properties: • The spectrum V (n) is deﬁned within the range from n = −∞ to n = +∞. i = 1. the absolute values of the coeﬃcients are also symmetric: V (−n) = V (n) (1.e for −N/2 · ∆f to −1 · ∆f . . N denotes the length of the speech segment given in samples (the window length is typically between 16ms and 25ms) while m is the start time of the segment. N − 1 again.CHAPTER 1. and for both FFT and Hartley Transform theory and its applications see [Bra00]): N −1 ˆ V (n) = k=0 v(k) · e−j2πkn/N .6).e. k = m . The start time m is incremented in intervals of (usually) 10ms. ˆ but is periodic with period N (1. V (n ± i · N ) = V (n) . . Due to the periodicity (1. 1. . so that our time index runs from 0 . V (−N/2 · ∆f ) = V (N/2 · ∆f ) . For further processing. So we can compute the squares of the absolute values. . i. V (N/2)2 are used for further processing. From the windowed signal. we shift the signal in time by m samples to the left. One could think that with a frequency range from −N/2 · ∆f to +N/2 · ∆f we should have N + 1 diﬀerent values for V (n). i. . m + N − 1. One should keep in mind that we have to interpret ˆ the values of V (n) ranging from n = N/2 to n = N − 1 as the values for the negative frequencies of the spectrum V (n · ∆f ). we compute the discrete Fourier transform (DFT. . . . so that the speech segments are overlapping each other. So nothing is wrong. we want to compute its discrete power spectrum V (n)2 . see [Bri87]. complex–valued short term spectrum V (n) of the speech signal (for a good introduction to the DFT and FFT. . THE SPEECH SIGNAL 8 As already mentioned. 2. only the values V (0)2 .6) • Since vm (k) is real–valued. but since V (n) is periodic with period N (1. we are only interested in the power spectrum of the signal.6).9) N ∆t Remember that the complex spectrum V (n) is deﬁned for n = −∞ to n = +∞. N − 1 (1. • V (n) is periodic with period N . . and the N diﬀerent values we get from V (n) are suﬃcient to describe the spectrum V (n). giving a total number of N/2 + 1 values. we know that ˆ .7) of V (n). . the N diﬀerent values of V (n) are suﬃcient to represent V (n). It should be noted that V (0) contains only the DCoﬀset of the signal and therefore provides no useful information for our speech recognition task. First of all.
10).5: The mel frequency scale The human ear has high frequency resolution in low–frequency parts of the spectrum and low frequency resolution in the high–frequency parts of the spectrum. N/2 . . A common way to do this is to use K triangle–shaped windows in the spectral domain to build a weighted sum over those power spectrum coeﬃcients V (n)2 which lie within the window. The non–linear warping of the frequency axis can be modeled by the so–called melscale. Furthermore.4. 1. 1. the midfrequency and bandwidth of these groups are non–linearly distributed. The so–called mel–frequency fmel can be computed from the frequency f as follows: fmel (f ) = 2595 · log 1 + Figure 1. . K − 1 . The frequency groups are assumed to be linearly distributed along the melscale.10) 2000 1500 f [mel] 1000 500 0 0 500 1000 1500 2000 f [Hz] 2500 3000 3500 4000 Figure 1. . . THE SPEECH SIGNAL 9 1. mel frequency scale 2500 f 700Hz (1. the human ear does not show a linear frequency resolution but builds several groups of frequencies and integrates the spectral energies within a given group.2 Mel Spectral Coeﬃcients As was shown in perception experiments.CHAPTER 1. . The coeﬃcients of the power spectrum V (n)2 are now transformed to reﬂect the frequency resolution of the human ear. k = 0. . We denote the windowing coeﬃcients as ηkn . n = 0.5 shows a plot of the mel scale (1.
the frequency distance bewteen every two center frequencies can easily be computed: fmax ∆mel = = 93. k = 0.3 Example: Mel Spectral Coeﬃcients This example will show how to compute the window coeﬃcients ηkn for a given set of parameters. [ST95]. we can derive some useful parameters. In ﬁgure 1.10). Now the values for the center frequencies on the linear frequency scale have to be mapped to the indices of the FFT. G(k) . we compute s the maximum frequency of the sampled signal: fg = f2 = 4000 Hz.10). 1.12) Therefore. We denote the power spectrum indices corresponding to the center frequencies as nc (k). we get the corresponding indices of the coeﬃcients of the power spectrum V (n)2 .7). N Dividing fc (k) on the linear frequency scale by ∆f and rounding to the next integer. the center frequencies on the linear frequency scale can easily be computed by solving equantion (1. First. . . k = 0 · · · K − 1 Given the center frequencies in the mel scale. the mel frequency scale.11) Caused by the symmetry of the original spectrum (1. The center frequencies of our triangleshaped windows are equally spaced on the mel scale. .1 mel.4. According to (1. We start with the deﬁnition of the preprocessing parameters: fs N K = = = 8000 Hz 256 22 From these values. . THE SPEECH SIGNAL 10 This gives us a new set of coeﬃcients. 1. K − 1 the so–called mel spectral coeﬃcients.3 mel K+1 Thus. e. The rounding to the next integer can be interpreted as a quantization of the linear frequency scale where the frequency resolution is given by ∆f . Therefore. . N/2 G(k) = n=0 ηkn · V (n)2 .25Hz. .CHAPTER 1. K− 1. it is suﬃcient to consider only the positive range of k. 1. . the center frequencies on the mel scale can be expressed as: fc (k) = (k + 1) · ∆mel . this corresponds to a maximum mel frequency of fmax [mel] = 2146. k = 0. . 1. K − 1 (1. . The frequency resolution of our FFT with length N is given as ∆f = fs = 31.g. k = 0.6 you can see a plot of the resulting center frequencies fc (k) on the linear frequency scale vs. .. the mel power spectrum is also symmetric in k: G(k) = G(−k) (1. .
. K − 1 .7 you can see a plot of the power spectrum coeﬃcient indices nc (k) versus the mel spectral coeﬃcient index k .6: Center frequencies of the mel ﬁlter. .. All what is left to do is to compute the windowing coeﬃcients ηkn . N/2. the columns of the matrix represent the weighting coeﬃcients for each power spectrum coeﬃcient n. For each window (rows k in the matrix).0 at the center frequency fc (k). . 1. The windowing coeﬃcients ηkn can be repre˜ sented as a windowing matrix W = [ηkn ]. In ﬁgure 1. i. The column index of the window’s maximum is given by the corresponding index of the power spectrum nc (k). The left and right boundaries of the triangles of a given window are deﬁned to be the center frequencies of the left and right ﬁlter channel.8. the weighting coeﬃcients ηkn are computed to form a triangleshaped window with a maximum value of 1. n = 0.0. . fc (k − 1) and fc (k + 1). . A black value on the grayscale denotes a coeﬃcient value of 1. Thus. The coeﬃcients within the window are chosen to linearly raise from zero at column index nc (k − 1) to the maximum value 1. The rows of this matrix represent the individual windows k (”’ﬁlter channels”’). Note that the bandwidth (number of power spectrum coeﬃcients within a window) increases with the center frequency of .CHAPTER 1. K = 22). THE SPEECH SIGNAL 11 mel filter frequencies 2500 2000 1500 f[mel] 1000 500 0 0 500 1000 1500 2000 f[Hz] 2500 3000 3500 4000 Figure 1. ˜ The resulting matrix W is shown in Figure 1. Now we know the center frequencies fc (k) of our windows and the coresponding coeﬃcients of the power spectrum nc (k). (fs = 8kHz . k = 0. k = 0 · · · K − 1.e. the corresponding weighting coeﬃcients ηk.nc (k−1) and ηk.nc (k+1) are deﬁned to be zero. .0 at nc (k) and then again decrease towards zero at index nc (k + 1). 1.
we can separate the product of spectral functions into the interesting vocal tract spectrum and the part describing the excitation and emission properties: S(f ) = X(f ) · H(f ) · R(f ) = H(f ) · U (f ) (1. a smoothed spectrum is required which should represent H(f ) but not X(f ). the window. we had those unwanted ”ripples” in the spectrum. Since the transmission function of the vocal tract H(f ) is multiplied with the spectrum of the excitation signal X(f ).2). 1.13) We can now transform the product of the spectral functions to a sum by taking the logarithm on both sides of the equation: log (S(f )) = = log (H(f ) · U (f )) log (H(f )) + log (U (f )) (1. For the speech recognition task. THE SPEECH SIGNAL 12 FFT indices 110 100 90 80 70 FFT index 60 50 40 30 20 10 2 4 6 8 10 12 filter channel 14 16 18 20 22 Figure 1.14) .4 Cepstral Transformation Before we continue. If we look at (1. cepstral analysis is used.4. lets remember how the spectrum of the speech signal was described by (1.7: Indices of power spectrum vs mel spectral coeﬃcient index.CHAPTER 1. To cope with this problem.2).
But all we know is that U (f ) produces the ”ripples”. which now are an additive component in the log–spectral domain.8: Matrix of coeﬃcients ηkn .9 we see an example of the log power spectrum. THE SPEECH SIGNAL 13 matrix of filter coefficients 2 4 6 8 filter channel 10 12 14 16 18 20 22 20 40 60 FFT index 80 100 120 Figure 1. which contains unwanted ripples caused by the excitation signal U (f ) = X(f ) · R(f ). if we knew U (f )2 exactly. The ﬁltering can be done by transforming the log–spectrum back into the time– domain (in the following. F T −1 denotes the inverse Fourier transform): s(d) = F T −1 log S(f )2 ˆ = F T −1 log H(f )2 + F T −1 log U (f )2 (1. To get rid of the inﬂuence of U (f ). the ”ripples” would have a ”high frequency” compared to the spectral shape of H(f ).15) In ﬁgure 1. one would have to get rid of the ”highfrequency” parts of the log–spectrum (remember.CHAPTER 1. In the log–spectral domain we could now subtract the unwanted portion of the signal. This holds also for the absolute values of the power spectrum and also for their squares: log S(f )2 = log H(f )2 · U (f )2 = log H(f )2 + log U (f )2 (1. we are dealing with the spectral coeﬃcients as if they would represent a time signal).16) . This would be a kind of low–pass ﬁltering. and that if we would interpret this log–spectrum as a time signal.
. .19) The cepstral index d of the peak expressed as a function of the period length n is: t N · ∆t N d= = = (1. Applying the inverse DFT to the log power spectrum coeﬃcients log V (n)2 yields: s(d) = ˆ 1 N N −1 n=0 log V (n)2 · ej2πdn/N .10 shows the result of the inverse DFT applied on the log power spectrum shown in ﬁg.18) will show up in the cepstral domain as a peak at time t: t= 1 N 1 N · ∆t = · = pf n fs n (1. an oscillation in the frequency domain with a Hertzperiod of pf = n · ∆f = n · fs n = N N · ∆t (1.17) Figure 1. The process of ﬁltering in the cepstral domain is also called liftering. The ripples in the spectrum are caused by X(f ). 1. The peak in the cepstrum reﬂects the ripples of the log power spectrum: Since the inverse Fourier transformation (1. all coeﬃcients right of the vertical line were set to zero and the resulting signal was transformed back into the frequency domain.CHAPTER 1.11. since U (f )2 and H(f )2 are both realvalued and both are even: U (f )2 = U (−f )2 and H(f )2 = H(−f )2 . The resulting cepstrum is real–valued. One can clearly see that this results in a ”smoothed” version of the log power spectrum if we compare ﬁgures 1.11 and 1. giving the so–called cepstrum (a reversed ”spectrum”). The inverse Fourier transform brings us back to the time–domain (d is also called the delay or quefrency). d = 0. 1.9.9.20) ∆t n · ∆t n The low–pass ﬁltering of our energy spectrum can now be done by setting the highervalued coeﬃcients of s(d) to zero and then transforming back into the ˆ frequency domain. N = 512). . 1.9: Log power spectrum of the vowel /a:/ (fs = 11kHz. . In ﬁgure 1.10. as shown in ﬁg.17) is applied to the log power spectrum. THE SPEECH SIGNAL 14 Figure 1. N − 1 (1.
.11: Power spectrum of the vowel /a:/ after cepstral smoothing. Figure 1. N = 512).CHAPTER 1.7) of V (n)2 . we now take the logarithm of the mel power spectrum (1. the mel spectrum is used to reﬂect the perception characteristics of the human ear.5 Mel Cepstrum Now that we are familiar with the cepstral transformation and cepstral smoothing. we will compute the mel cepstrum commonly used in speech recognition. The ripples in the spectrum result in a peak in the cepstrum. is is possible to replace the inverse DFT by the more eﬃcient cosine transform [ST95]: s(0) = ˆ 2 N 4 N N/2−1 n=0 N/2−1 log V (n)2 log V (n)2 · cos πd (2n + 1) N (1. In analogy to computing the cepstrum. All but the ﬁrst 32 cepstral coeﬃcients were set to zero before transforming back into the frequency domain. It should be noted that due to the symmetry (1. for speech recognition. THE SPEECH SIGNAL 15 Figure 1. The restriction to the ﬁrst Q coeﬃcients reﬂects the low–pass liftering process as described above. 2.22) n=0 1. Only the ﬁrst Q (less than 14) coeﬃcients of the mel cepstrum are used in typical speech recognition systems. As stated above. .N/2 (1..21) s(d) = ˆ . d = 1.11) (instead of the power spectrum itself) and transform it into the quefrency domain to compute the so–called mel cepstrum.4..10: Cepstrum of the vowel /a:/ (fs = 11kHz.
In order to better reﬂect the dynamic changes of the MFCC cm (q) (and also of the energy em ) in time. One can spend years of research on the acoustic preprocessing modules of a speech recognition system. 1. the MFCC are computed for a speech segment vm at time index m in short time intervals of typically 10ms. the choice of parameters for acoustic preprocessing has a strong impact on the performance of the speech recognition systems. .12).23) While successive coeﬃcients G(k) of the mel power spectrum are correlated. 1. .4.. the Mel Frequency Cepstral Coeﬃcients (MFCC) resulting from the cosine transform (1.24) 1.5).D. the FourierTransform can be replaced by a simple cosine transform: K−1 c(q) = k=0 log (G(k)) · cos πq(2k + 1) 2K . usually the ﬁrst and second derivatives in time are also computed. theses have been written on that subject.26) The time intervall usually lies in the range 2 ≤ τ ≤ 4.. q = 0.CHAPTER 1.g by computing the diﬀerence of two coeﬃcients lying τ time indices in the past and in the future of the time index under consideration. q = 0. it is suﬃcient to remember that the information contained in the speech signal can be represented by a set of parameters which has to be measured in short intervals of time to reﬂect the dynamic change of those parameters. the signal energy is added to the set of parameters. THE SPEECH SIGNAL 16 Since the mel power spectrum is symmetric due to (1.25) The sevond derivative is computed from the diﬀerences of the ﬁrst derivatives: (1. . For the ﬁrst derivative we get: ∆cm (q) = cm+τ (q) − cm−τ (q) .g. It can simply be computed from the speech samples s(n) within the time window by: N −1 e= n=0 s2 (n) (1. . and many Ph.Q − 1 ∆∆cm (q) = ∆cm+τ (q) − ∆cm−τ (q) .23) are decorrelated. 1.4.Q − 1 (1. . The MFCC are used directly for further processing in the speech recognition system instead of transforming them back to the frequency domain. Of course. e.6 Signal Energy Furthermore. 1..7 Dynamic Parameters As stated earlier in eqn (1. [Ney93]) which are computed every 10ms. This results in a total number of up to 63 parameters (e. . . Q − 1 (1. For our purposes however. q = 0.
it is useful for further processing to construct a vector out of these numbers by assigning each measured value to one component of the vector. As we know. Each point in our map represents the temperature and humidity in our oﬃce at a given time.1 Feature Vectors and Vector Space If you have a set of numbers representing certain features of an object you want to describe. 2. there are certain values of temperature and humidity which we ﬁnd more comfortable than other values. Now if you have measured a new value–pair and you are to judge if it will be convenient or not to a person. Since these so–called feature vectors have two components. As an example. You can see that they form regions of convenience and inconvenience. which will be measured in short intervals of time during a preprocessing step. we have seen that the speech signal can be characterized by a set of parameters (features). you will get a series of two–dimensional vectors describing how the air in your oﬃce changes in time. think of an air conditioning system which will measure the temperature and relative humidity in your oﬃce.2 A Simple Classiﬁcation Problem Lets assume we would want to know if a value–pair we measured in our oﬃce would be judged as comfortable or as uncomfortable by you. we will ﬁrst get familiar with the concept of feature vectors and vector space.Chapter 2 Features and Vector Space Until now. respectively. This is our 17 . we can interpret the vectors as points in a two–dimensional vector space. If you measure those parameters every second or so and you put the temperature into the ﬁrst component and the humidity into the second component of a vector. 2. In the map the comfortable value– pairs are shown as points labelled ”+” and the less comfortable ones are shown as ””. Thus we can draw a two–dimensional map of our measurements as sketched below. One way to ﬁnd out is to initially run a test series trying out many value–pairs and labelling each point either ”+” or ”” in order to draw a map as the one you saw above. you would have to judge if it lies within those regions which are marked in your map as ”+” or if it lies in those marked as ””. Before we start to look at the speech recognition task.
We will later discuss how one can ﬁnd a good selection of prototypes using the ”k–means algorithm”.1: A map of feature vectors ﬁrst example of a classiﬁcation task: We have two classes (”comfortable” and ”uncomfortable”) and a vector in feature space which has to be assigned to one of these classes. FEATURES AND VECTOR SPACE temperature 18 22°C 20°C 18°C rel. as shown in ﬁgure 2.1 Classiﬁcation of Vectors Prototype Vectors The problem of how to represent the regions of ”comfortable” and ”uncomfortable” feature vectors of our classiﬁcation task can be solved by several approaches. One could also verbalize the classiﬁcation task as: Find the nearest prototype to the unknown vector and assign the unknown vector to the class this ”nearest neighbor” represents . One of the easiest is to select several of the feature vectors we measured in our experiments for each of our classes (in our example we have only two classes) and to declare the selected vectors as ”prototypes” representing their class.3.2 Nearest Neighbor Classiﬁcation The classiﬁcation of an unknown vector is now accomplished as follows: Measure the distance of the unknown vector to all classes.3. The distance of the unknown vector to a given class is deﬁned as the smallest distance between the unknown vector and all of the prototypes representing the given class. we simply assume that we were able to make a good choice of the prototypes. For now. 2. Then assign the unknown vector to the class with the smallest distance. humidity 40% 50% 60% Figure 2.CHAPTER 2.2. — But how do you describe the shape of the regions and how can you decide if a measured vector lies within or without a given region? In the following chapter we will learn how to represent the regions by prototypes and how to measure the distance of a point to a region.3 2. 2.
.1) Using this class distance. we can write the classiﬁcation task as follows: x ∈ ωv ⇔ v = arg min {dωv (x)} v (2. . Let the distance measure between the vector and a prototype be denoted as d (x. 1. where k = 0.ωv )} . Then the class distance between x and the class ωv is deﬁned as: dωv (x) = min {d (x. humidity 40% 50% 60% Figure 2.2) . We will discuss several distance measures later) .ωv ) (eg.ωv . . . ω(V −1) be the set of classes. . FEATURES AND VECTOR SPACE temperature 19 22°C 20°C ? 18°C rel. Fig.CHAPTER 2. the Euclidean distance. . (Kωv − 1) k (2. (Kωv − 1). k = 0. Let x denote the unclassiﬁed vector. . The classiﬁcation task we described can be formalized as follows: Let Ω = ω1 . pk. . pk. . V being the total number of classes. 2.2: Selected prototypes (hence the name). ω2 . 1. .2 shows the unknown vector and the two ”nearest neighbors” of prototypes of the two classes.. . Each class is represented by its prototype vectors pk. .
Now we will brieﬂy look at some commonly used distance measures.2 City Block Distance The computation of the Euclidean distance involves computing the squares of the individual diﬀerences thus involving many multiplications.1) and (3. 3.2) where ′ denotes the vector transpose.1 Euclidean Distance The Euclidean distance measure is the ”standard” distance measure between two vectors in feature space (with dimension DIM ) as you know it from school: DIM−1 d2 Euclid (x. you have to compute the sum of the squares of the diﬀerences between the individual components of x and p. 3. p) = (x − p) · (x − p) ′ (3. To reduce the computational complexity. This is similar to measuring the distance between two points on a street map: You go three blocks to the East. The Euclidean distance is probably the most commonly used distance measure in pattern recognition. each of the distance measures has its pros and cons. Note that both equations (3. we have found a way to classify an unknown vector by calculation of its class–distances to predeﬁned classes. and we will discuss their most important properties.1) To calculate the Euclidean distance measure. p) = i=0 (xi − pi ) 2 (3.Chapter 3 Distance Measures So far. d2 instead of d. Depending on the application at hand. then two blocks to the South (instead of straight trough the buildings as the Euclidean distance would 20 .2) compute the square of the Euclidean distance. one can also use the absolute values of the diﬀerences instead of their squares. which in turn are deﬁned by the distances to their individual prototype vectors. This can also be written as the following scalar product: d2 Euclid (x.
the diﬀerent scales of the dimensions of our feature vectors have to be compensated when computing the distance. an Euclidean or City Block distance measure would give the wrong result.3) 3. Then you sum up all the absolute values for all the dimensions of the vector space: DIM−1 dCity−Block (x. While in our oﬃce the temperature values will have a range of typically between 18 and 22 degrees Celsius.1: Two dimensions with diﬀerent scales To cope with this problem. the individual dimensions will have diﬀerent scales also. a temperature diﬀerence of 4 degrees Celsius certainly will.g. we see that for real–world applications. But if we remember our example from section 2. The dimension x1 has a wider range of values than dimension x2 .1. DISTANCE MEASURES 21 assume). i.CHAPTER 3. the distances in each dimension contributes in the same way to the overall distance. p) = i=0 xi − pi  (3.. 4 percent relative humidity might not even be noticed by a person. X2 class "A" ? class "B" X1 Figure 3. While a small diﬀerence in humidity of e.1 we see a more abstract example involving two classes and two dimensions. the humidity will have a range from 40 to 60 percent relative humidity. In Figure 3. classifying the unknown vector as ”class A” instead of ”class B” which would (probably) be the correct result. so all the measured values (or prototypes) are spread wider along the axis denoted as ”x1 ” as compared to axis ”x2 ”.e. Obviously.. This leads us to the so–called ”Weighted Euclidean Distance”: . This can be done by multiplying each contributing term with a scaling factor speciﬁc for the respective dimension.3 Weighted Euclidean Distance Both the Euclidean distance and the City Block distance are treating the individual dimensions of the feature space equally.
this can be rewritten as: 2 dW eighted Euclid ′ ˜ (x.n n=0 (3. ˜ Λ = diag λi . p) = i=0 λi · (xi − pi )2 (3.. a high value of x1 correlates with a high value for x2 (with respect to the mean vector (center) of the class). for both classes A and B. This leads us to a new distance measure. the so–called Mahalanobis Distance: . which is indicated by the orientation of the two ellipses. the correlations between the individual components of the feature vector will have to be regarded when computing the distance between two vectors. DISTANCE MEASURES 22 DIM−1 2 dW eighted Euclid (x. . Let xi.4) As before. we would want the distance measure to regard both the correlation and scale properties of the features.. In this case. . What if they are not? Figure 3.n − mi ) 2 (3. Instead.4) will not be suﬃcient. then the variances σi 2 can be estimated from the training set as follows:: σi 2 = 1 N −1 N −1 n=0 (xi. . Obviously.6) where mi is the mean value of the training set for dimension i: mi = 1 N N −1 xi. p) = (x − p) · Λ · (x − p) (3.2 shows an example in which the features x1 and x2 are correlated. x1 .5) ˜ Where Λ denotes the diagonal matrix of all the scaling factors: λ0 . xN −1 }.n denote the ith element of vector xn . .7) 3.4 Mahalanobis Distance So far.. we can deal with diﬀerent scales of our features using the weighted Euclidean distance measure. λDIM−1 The scaling factors are usually chosen to compensate the variances of the measured features: λi = 1 σi 2 The variance of dimension i is computed from a training set of N vectors {x0 . Here a simple scale transformation as in (3. This works very well if there is no correlation between the individual features as it would be the case if the features we selected for our vector space were statistically independent from each other..CHAPTER 3.
9) where m is the mean vector of the training set: m= 1 N N −1 xn n=0 (3. . . σi. DISTANCE MEASURES 23 X2 class "A" class "B" ? X1 Figure 3.j . C = (x − p)′ · C −1 · (x − p) ˜ ˜ Where C −1 denotes the inverse of the covariance matrix C: σ0. .j represent the covariances between the components i and j.0 . σ0. σDIM−1.CHAPTER 3.0 . . . . σi. σ0. .8) The individual components σi. say.. ˜= σi.. . . . The covariance matrix can be estimated from a training set of N vectors {x0 . x1 .j . .j . .11) . .0 . . this can also be written as: σi. xN −1 } as follows: ˜ C= 1 N −1 N −1 n=0 (xn − m) · (xn − m) ′ (3. . .... .DIM−1 (3. p. . .. .j = 1 N −1 N −1 n=0 (xi. σDIM−1.n − mi ) · (xj.DIM−1 C .2: Correlated features ˜ ˜ dMahalanobis x.n − mj ) (3...j .DIM−1 . .10) ˜ looking at a single element of the matrix C. σi. . . . σDIM−1.
11): First. we already know that they represent the variances of the individual features. with respect to the mean value) and observations with small values for feature i also tend to show small values for feature j . In this case.2. 3. think of the shape of the ”uncomfortable” class of Fig. To go even further. if a cluster is oriented from top–left to bottom–right. .2 Clusters With Individual and Shared Covariance Matrices In our Fig. you can swap the indices i and j in the description above).e. 2. DISTANCE MEASURES 24 From (3. they are oriented from bottom–left to top– right.1. i = j. the clusters look like the two clusters of the classes A and B in Fig.i . A big variance for one dimension reﬂects the fact that the measured values of this feature tend to be spread over a wide range of values with respect to the mean value. 3.i of its main diagonal are either positive or zero.j would be negative (due to the symmetry of C.6) and (3. for a two–dimensional vector space example. On the other hand. i. that is) and its own covariance matrix. it is suﬃcient (or necessary.j . Two properties of C can ˜ is symmetric and second. it is also possible to estimate a single ”global” covariance matrix which is shared among all classes (and their clusters).11) it is obvious that σi. are identical to the variances σi 2 of the dimension i. 3. while a variance close to zero would indicate that the values measured are all nearly identical and therefore very close to the mean value. we will expect a positive value of σi.i .1 Properties of the Covariance Matrix How can we interpret the elements of the covariance matrix? For the σi. This covariance matrix is computed out of the complete trainig set of all vectors regardless of the cluster and class assignments. each class of a classiﬁcation task may consist of several clusters which belong to that class.j if a big value for feature i can be frequently observed together with a big value for feature j (again. A zero variance would mean that for this speciﬁc dimension of the vector we chose a feature which does not change its value at all). each of those clusters will have its own mean vector (the ”prototype vector” representing the cluster) and its own covariance matrix. 3. the values be seen immediately form (3. high values for feature i often are observed with low values for feature j and the ˜ value for the covariance σi.CHAPTER 3. in some applications. The resulting covariance matrix is then class speciﬁc but is shared among all clusters of the class while the mean vectors are estimated individually for the clusters. But in general. As an example.2 the two classes A and B each consisted of one cluster of vectors having its own center (mean vector.4. regardless to which cluster of the class the vectors are assigned. due to a lack of training data) to estimate one covariance matrix out of all training vectors of one class. C σi.4. However. if the ”shape” of the class in our feature space is more complex. the values of the main diagonal of ˜ ˜ C. For the covariances σi. In these cases..
. k = 0.12) Then we can perform the Nearest Neighbor classiﬁcation as in (2. (Kωv − 1) (3. we will learn that it plays a signiﬁcant role when dealing with the gaussian probability density function.ωv . we can insert the Mahalanobis distance according to (3. but for each cluster we also have to compute the covariance matrix from the set of vectors which belong to the cluster. Once the prototype vectors and the corresponding (either shared or individual) covariance matrices are computed. pk.2). To use the Mahalanobis distance in our nearest neighbor classiﬁcation task.8) into the equation for the class distance (2.ωv k .3. In section 7. . we would not only have to ﬁnd the appropriate prototype vectors for the classes (that is. .3 Using the Mahalanobis Distance for the Classiﬁcation Task With the Mahalanobis distance. but also in ﬁnding the vectors that belong to the clusters represented by these prototypes. 1.1) and get: ˜ dωv (x) = min dMahalanobis x. . . ﬁnding the mean vectors of the clusters).4. DISTANCE MEASURES 25 3. we have found a distance measure capable of dealing with diﬀerent scales and correlations between the dimensions of our feature space. We will later learn how the ”k–means algorithm” helps us not only in ﬁnding prototypes from a given training set. Ck.CHAPTER 3.
Wk.Chapter 4 Dynamic Time Warping In the last chapter. . Let our vocabulary (here: the set of classes Ω) contain V diﬀerent words w0 . What we need here is a way to compute a ”distance” between this unknown sequence of vectors ˜ ˜ X and known sequences of vectors Wk = wk 0 .2) as follows: W ˜ ˜ ˜ Dωv (X) = min D X. . . .1 Deﬁnition of the Classiﬁcation Task ˜ ˜ ˜ If we have a suitable distance measure D(X.ωv . . . the covariance matrices) which also had to be computed from a training vectors. we have to classify not only single vectors. In speech recognition. . some of them using additional sets of parameters (e. . Therefore.ωv k .ωv ) between X and a prototype ˜ ˜ k. w1 . . A whole word will comprise dozens of those vectors. our classiﬁcation task is diﬀerent from what we have learned before. we will allow a word wv (here: class ˜ ωv ∈ Ω) to be represented by a set of prototypes Wk. . and we know that the number of vectors (the duration) of a word will depend on how fast a person is speaking. (Kωv − 1) (4.g. we were dealing with the task of classifying single vectors to a given set of classes which were represented by prototype vectors computed from a set of training vectors. wV −1 . . . but sequences of vectors. (Kωv −1) to reﬂect all the variations possible due to diﬀerent pronunciation or even diﬀerent speakers. For an utterance of a word w which is TX vectors long.1) where ωv is the class of utterances whose orthographic representation is given by the word wv Then we can write our classiﬁcation task as: 26 . . wk TWk which are prototypes for the words we want to recognize. Lets assume we would want to recognize a few command words or digits. our speech signal is represented by a series of feature vectors which are computed every 10ms. . we will get a sequence of vectors ˜ X = {x0 . k = 0. . . Several distance measures were presented. x1 . 1. Wk. How does this relate to speech recognition? As we saw in chapter 1. we can deﬁne the class distance Dωv (X) in analogy to (2. . 1. . In analogy to the Nearest Neighbor classiﬁcation task from chapter 2. wk 1 .. .ωv . 4. k = 0. xTX −1 } from the acoustic preprocessing stage.
The sequence W 90 degrees. . 4) . . W ) at hand (in the following. The assignment between the time indices of W ˜ as given by P can be interpreted as ”time warping” between the time and X ˜ ˜ axes of W and X. (6. even if the utterance was intended to mean completely diﬀerent. 2) . 2) . 3) . The two sequences span a grid of possible assignments between the vectors. (2.2 Distance Between Two Sequences of Vectors As we saw before. In Fig. (7. x3 and x4 were all assigned to w2 . . For example. thus warping the duration of w2 so that it lasts three time indices instead of one. the second vector of X is ˜ . What should the properties of this distance measure be ? The distance measure we need must: • Measure the distance between two sequences of vectors of diﬀerent length (TX and TW . In our example. 1) . we can only recognize V words. .1 we can see two sequences X and W which ˜ was rotated by consist of six and eight vectors.2. gLp = {(0.2) This deﬁnition implies that the vocabulary of our speech recognizer is limited to the set of classes Ω = ω0 . (5. .1 Comparing Sequences With Diﬀerent Lengths The main problem is to ﬁnd the optimal assignment between the individual ˜ ˜ ˜ ˜ vectors of X and W . 4. In other words. . ω(V −1) .CHAPTER 4. assigned to the second vector of W Fig. and so on. The only criterium the classiﬁer applies for its choice is the ˜ ˜ acoustic similarity between the utterance X and the known prototypes Wk. . . 4)} (4.3) The length LP of path P is determined by the maximum of the number of ˜ ˜ ˜ vectors contained in X and W . respectively. the vectors x2 . Each path through this grid (as the path shown in the ﬁgure) represents one possible assignment of the vector pairs. the diﬀerent lengths of the vector sequences can be compensated. 4. DYNAMIC TIME WARPING 27 ˜ ˜ X ∈ ωv ⇔ v = arg min Dωv X v (4.ωv ˜ Wk. g2 . 0) .ωv ). (1. respectively): P = g1 . that is) • While computing the distance. and our classiﬁer will match any speech utterance to one of those V words in the vocabulary. classiﬁcation of a spoken utterance would be easy if we had ˜ ˜ a good distance measure D(X. ω1 . . the ˜ ˜ ˜ ﬁrst vector of X is assigned the ﬁrst vector of W . 4.1 shows as an example the following path P given by the sequence of time index pairs of the vector sequences (or the grid point indices. 2) . By this kind of time warping. (3. (4. so the time index for this sequence runs from the bottom of the sequence to its top. ˜ as deﬁned by the distance measure D(X. ﬁnd an optimal assignment between the ˜ ˜ individual feature vectors of X and W • Compute a total distance out of the sum of distances between individual ˜ ˜ pairs of feature vectors of X and W 4. we will skip the additional indices for ease of notation).
it is not necessary to compute all possible paths P and correspond˜ ˜ ing distances D X. P. We know that both sequences of vectors represent feature vectors measured in short time intervals. W . Therefore. wj ) for the time indices i and j deﬁned by the grid point gl = (i.4) 4. Let d(gl ) denote the vector distance d (xi . once we have the path.5) Fortunately. P P (4.1 ). P to ﬁnd the optimum. How do we ﬁnd it ? The criterium of optimality we want to use in searching the optimal path Popt ˜ ˜ should be to minimize D X. only a fraction is reasonable for our purposes. Let l denote the sequence index of the grid points. DYNAMIC TIME WARPING 28 w5 w4 w3 w2 w1 w0 x0 x1 x2 x3 x4 x5 x6 x7 ˜ ˜ Figure 4. W . computing the distance D is a simple task. W . the distance measure between the vector sequences can now be computed as the sum of the distances between the individual vectors. j) (the distance d (xi .2. wj ) could be the Euclidean distance from section 3.CHAPTER 4. = l=1 d (gl ) (4. P : ˜ ˜ Popt = arg min D X.2 Finding the Optimal Path As we stated above. Then the overall distance can be computed as: LP ˜ ˜ D X. W . Out of the huge number of theoretically possible paths. we might want to ˜ restrict the time warping to reasonable boundaries: The ﬁrst vectors of X and .1: Possible assignment between the vector pairs of X and W For the given path P .
To compensate this eﬀect. Bellman’s Principle states the following: . j) from (i − 1. To reach a given grid point (i. j − 1) : both time indices of X and W are incremented ˜ ˜ • (i. j − 1) . but want to restrict the time warping just to the ”reuse” of the preceding vector(s) to locally warp the duration of a short segment of speech signal. j − 1) would have to be added. the diagonal transition involves only the single vector distance at grid point (i. j) or (i. j) can have the following predecessors: ˜ ˜ • (i − 1.2: Local path alternatives for a grid point Note that Fig. For the time indices in between. Bellman’s Principle Now that we have deﬁned the local path alternatives.j) t x Figure 4. a grid point (i.2 does not show the possible extensions of the path from a given point but the possible predecessor paths for a given grid point. With these restrictions. 4. j − 1) : keep of the time index i of W while the time index of X is incremented All possible paths P which we will consider as possible candidates for being the optimal path Popt can be constructed as a concatenation of the local path alternatives as described above. we can draw a diagram of possible ”local” path alternatives for one grid point and its possible predecessors (of course.CHAPTER 4. We will soon get more familiar with this way of thinking. we will use Bellman’s Principle to search the optimal path Popt .j1) (i1. j) as opposed to using the vertical or horizontal transition. j) : keep the time index j of X while the time index of W is incremented ˜ ˜ • (i − 1. Applied to our problem. where also the distances for the grid points (i − 1.j) (i1. xj ) is added twice when using the diagonal transition. As we can see. the local distance d(wi . DYNAMIC TIME WARPING 29 ˜ W should be assigned to each other as well as their last vectors.j1) (i. we want to avoid any giant leap backward or forward in time. many other local path diagrams are possible): tw (i.
j − 1) and (i. 1). j).6) δ(i − 1. j) + d(wi . j − 1) + d(wi . 0) and δ(0. 0): The upper point (1. we can now only compute two points from grid point (0. since there is only one valid predecessor: The start point (0. 0) to these grid points. From that. starting with point (0. Applying this rule. xj ) • Initialization: For the grid point (0. j) to exactly one. 0) to grid point (i. xj ) δ(i − 1. and the grid point (i. j). j) by selecting exactly the one path hypothesis among our alternatives which minimizes the accumulated distance δ(i. 1) to compute δ(1. That way we walk trough the matrix from bottomleft to topright. j) = min (4.) along that paths) leading from (0. (i − 1. 0) to (i. j − 1) + 2 · d(wi . j) is also part of Popt . ˜ ˜ ˜ • Termination: D(W . 0) to (i. All these path hypotheses are possible predecessor paths for the optimal path leading from (0. X) = δ(TW − 1. • Iteration: Now we have to expand the computation of the partial paths to all grid points until we reach (TW − 1. j). TX − 1). it is also said that the possible path hypotheses are recombined during the optimization step. j). 0) to the grid points (i − 1. Optimization of (4. DYNAMIC TIME WARPING 30 By this optimization. j): The partial paths from (0. For each of these points (i. Let’s assume we would know the optimal paths (and therefore the accumulated distance δ(. 0) to (i. we can construct a way of iteratively ﬁnding our optimal path Popt : According to the local path alternatives diagram we chose. TX − 1): According to the local path alternatives. j − 1) ). . j) via the optimal path beginning from (0. we have found a way to iteratively compute the optimal path from point (0. then the partial path from (0. 0) to the vector distances deﬁned by the grid points (1. j) is the minimum among all possible paths from (0. and the right point (0. 0) to (i. j) the recombination of path hypotheses is always trivial). 0) and that therefore the accumulated distance δ(i. 0) to point (TW − 1. TX − 1). 0) and ending at (TW − 1. The optimization we have to perform is as follows: δ(i. 0) and the bottom row (0.0).6) is easy in these cases. 0): If Popt is the optimal path through the matrix of grid points beginning at (0. we search the optimal predecessor point out of the set of possible predecessors (Of course. xj ) δ(i. for the leftmost column (i. j) is part of path Popt . 0). TX − 1) is the distance between W ˜ and X. 1). Now we look at the points which can be computed from the three points we just ﬁnished. there are only three possible predecessor paths leading to a grid point (i. it is ensured that we reach the grid point (i. 0) and the accumulated distance along that path is simply d(w0 . j).CHAPTER 4. Then we can ﬁnd the (globally) optimal path from (0. 0) and (0. j) of the resulting path from (0. 0). So we add δ(0. x0 ). Since the decision for the best predecessor path hypothesis reduces the number of paths leading to grid point (i. the computation of the optimal path is simple: It contains only grid point (0.
we deﬁne d(i. Once we reached the top–right corner of our matrix. 0). It should be mentioned that it possible to use a single array for time indices j and j − 1. but also the optimal path P . diagonal and vertical path) and compute the predecessor point (k. In addition to the verbal description in the section above. One can overwrite the old values of the array with the new ones.6) has to be performed. 4. empty points are not. Filled points are already computed. starting with the leftmost column and beginning each column at the bottom and continuing to the top. The solid lines show the resulting partial paths after the decision for one of the path hypotheses during the optimization step. we will use only two arrays δj (i) and δj−1 (i) to hold the values for those two columns instead of using a whole matrix for the accumulated distances δ(i. 4. xj ) between the two vectors wi and xi . For ease of notation. j) to be the distance d(wi . However. j) at grid point (i. we have to store each path alternative chosen for every grid point. X).6). We could for every grid point (i.6) uses only the accumulated distances from columns j and j −1.3 Dynamic Programming / Dynamic Time Warping The procedure we deﬁned to compare two sequences of vectors is also known as Dynamic Programming (DP) or as Dynamic Time Warping (DTW). The optimal path is known only after the termination of the algorithm. j) either store the indices k and l of the predecessor point (k. X) between the vector sequences. The Dynamic Programming Algorithm In the following formal framework we will iterate through the matrix column by column. If we are also ˜ ˜ interested in obtaining not only the distance D(W . Since we are iterating through the matrix from left to right. we will now deﬁne a more formal framework and will add some hints to possible realizations of the algorithm. j − 1). when we have made the last recombination for the three possible path hypotheses leading to the top– right grid point (TW − 1. Once this decision is made. l) out of the code and the current point (i. for clarity.3 shows the iteration through the matrix beginning with the start point (0. the accumulated distance δ(TW − ˜ ˜ 1. We will use it extensively and will add some modiﬁcations and reﬁnements during the next chapters. This procedure is called backtracking. we have — in addition to the accumulated distances — also to keep track of all the decisions we make during the optimization steps (4. . j) and δj−1 (i) the accumulated distance δ(i. j): Let δj (i) be the accumulated distance δ(i. j).2. For ease of notation. The dotted arrows indicate the possible path hypotheses over which the optimization (4. l) or we could only store a code number for one of the three path alternatives (horizontal. TX − 1) is the distance D(W . To keep track of all the selections among the path hypotheses during the optimization. 0). TX − 1). j − 1) at grid point (i.CHAPTER 4. the algorithm using two arrays is described here and the formulation for a single–array algorithm is left to the reader. DYNAMIC TIME WARPING 31 Fig. and the optimization for column j according to (4. the optimal path can be found by reversely following all the local decisions down to the origin (0.
j) (4. the dynamic programming (DP)algorithm can be written as follows: ◦ Initialization: ◦ Grid point (0. j) be the predecessor grid point (k. 0) } ◦ Iteration: compute all columns: for j = 1 to TX − 1 { ◦ swap arrays δj−1 (. j) = (0.) and δj (. 0) + δj (i − 1) ψ(i. j) (4.8) (4. DYNAMIC TIME WARPING 32 we assume in the following that the indices of the predecessor grid point will be stored: Let ψ(i. j) d(i.7) ◦ Initialize ﬁrst column (only vertical path possible): for i = 1 to TW − 1 { δj (i) = d(i. (i − 1. j) 2 · d(i. j) = (k.) ◦ ﬁrst point (i = 0. only horizontal path possible): δj (0) = d(0. j) δj−1 (i) δj−1 (i − 1) δj (i − 1) + + + d(i. j).11) } . (i − 1. 0) = (i − 1. l) chosen during the optimization step at grid point (i.CHAPTER 4.9) (4. j) d(i. j) + δj−1 (0) ψ(0. j − 1). 0): ψ(0. With these deﬁnitions. −1) d(0.l)∈ arg min (i. 0) = δj (0) = (−1. j − 1) ◦ compute column j: for i = 1 to TW − 1 { ◦ optimization step: δj (i) = min δj−1 (i) δj−1 (i − 1) δj (i − 1) + + + d(i. j − 1). j) 2 · d(i.10) ◦ tracking of path decisions: ψ(i. 0) (4.
TX − 1) = δj (TW − 1. TX − 1).CHAPTER 4. there will be paths regarded during computation which are not very likely to be the optimum path due to the extreme time warping they involve. 0) to (TW − 1. this approach takes the (small) risk of missing the globally optimal path and instead ﬁnding only the best path within the deﬁned region. For example. TX − 1). However.14) (4.13) . Later we will learn far more sophisticated methods to limit the so–called search space of the DP algorithm. 0) until (TW − 1.12) i = TW − 1. think of the borders of the grid: There is a path going horizontally from (0. (4. 0) and then running vertically to (TW − 1. Of course. j = ψ(i. To avoid unnecessary computational eﬀorts. the DP algorithm is therefore usually restricted to searching only in a certain region around the diagonal path leading from (0. j) } Additional Constraints Note that the algorithm above will ﬁnd the optimum path by considering all path hypotheses through the matrix of grid points. DYNAMIC TIME WARPING 33 } ◦ Termination: D(TW − 1. Paths running more or less diagonally trough the matrix will be much more likely candidates for being the optimal path. j) = −1 { ◦ get predecessor point: i. TX − 1) ◦ Backtracking: ◦ Initialization: ◦ while ψ(i. j = TX − 1 (4.
CHAPTER 4. DYNAMIC TIME WARPING 34 tw tw tw tw tx tx tx tx t w t w t w t w t x t x t x t x tw tw tw tw t x t x t x t x t w t w t w t w tx tx tx tx t w t w t w t w t x t x t x t x Figure 4.3: Iteration steps ﬁnding the optimal path .
lets assume the total number of all prototypes over all classes is given by M .Chapter 5 Recognition of Isolated Words In the last chapter we saw how we can compute the distance between two sequences of vectors to perform the classiﬁcation task deﬁned by (4.2). First. the virtual grid point can only be reached from the end of each prototype word. Hence. Now we will reformulate the classiﬁcation task so that it will depend directly on the optimum path chosen by the dynamic programming algorithm and not by the class distances. By introducing this virtual 35 . If we want to compute the distances to all M prototypes simultaneously. To do so. and among all the possible prototype words. we have to keep track of the accumulated distances between the unknown vector sequence and the prototype sequences individually. we have to reorganize our search space a little bit. The classiﬁcation was performed by computing the distance between an unknown vector sequence to all prototypes of all classes and then assigning the unknown sequence to the class having the smallest class distance. the one with the smallest accumulated distance is chosen. In other words. While the description of the DTW classiﬁcation algorithm in chapter 4 might let us think that one would compute all the distances sequentially and then select the minimum distance. instead of the column (or two columns. we now have to provide M columns during the DTW procedure. Instead. depending on the implementation) we used to hold the accumulated distance values for all grid points. it is more useful in practical applications to compute all the distances between the unknown vector sequence and the class prototypes in parallel. This is possible since the DTW algorithm needs only the values for time index t and (t−1) and therefore there is no need to wait until the utterance of the unknown vector sequence is completed. Now we introduce an additional ”virtual” grid point together with a specialized local path alternative for this point: The possible predecessors for this point are deﬁned to be the upper–right grid points of the individual grid matrices of the prototypes. one can start with the recognition process immediately as soon as the utterance begins (we will not deal with the question of how to recognize the start and end of an utterance here).
RECOGNITION OF ISOLATED WORDS 36 grid point.e. WM W1 MIN W0 X Figure 5. Figure 5. i. Note that the search space we have to consider is spanned by the length of the unknown vector sequence on one hand and the sum of the length of all prototype sequences of all classes on the other hand. the classiﬁcation task itself (selecting the class with the smallest class distance) is integrated into the framework of ﬁnding the optimal path. At the last time slot (TW − 1) we perform the optimization step for the virtual grid point. The classiﬁcation task is then performed by assigning the unknown vector sequence to the very class to which the prototype belongs whose word end grid point was chosen. The backtracking procedure can of course be restricted to keeping track of the ﬁnal optimization step when the best predecessor for the virtual grid point is chosen. Now all we have to do is to run the DTW algorithm for each time index j and along all columns of all prototype sequences. the predecessor grid point to the virtual grid point is chosen to be the prototype word having the smallest accumulated distance.1: Classiﬁcation task redeﬁned as ﬁnding the optimal path among all prototype words .1 shows the individual grids for the prototypes (only three are shown here) and the selected optimal path to the virtual grid point.CHAPTER 5.
this is just a diﬀerent (and quite complicated) deﬁnition of how we can perform the DTW classiﬁcation task we already deﬁned in (4. now the optimal path itself is important for the classiﬁcation task In the next chapter. RECOGNITION OF ISOLATED WORDS 37 Of course. . Therefore. only a verbal description was given and we did not bother with a formal description. by the reformulation of the DTW classiﬁcation we learned a few things: • The DTW algorithm can be used for real–time computation of the distances • The classiﬁcation task has been integrated into the search for the optimal path • Instead of the accumulated distance.2). we will see how the search algorithm described above can be extended to the recognition of sequences of words. However.CHAPTER 5.
What if we want to recognize sequences of words. we will have to match the utterance against all prototypes and we will have to test for 38 . ﬁnding the boundaries between the words and the classiﬁcation of the individual words within their boundaries. Best of all. ”no”.Chapter 6 Recognition of Connected Words In the last chapter. Therefore. As one can imagine. which is able to ﬁnd the word boundaries and to classify the words within the boundaries in one single step. which will make the notation a bit easier for us. simple command words like ”start”. but we do not know which words these are and at what time index they begin or end. e. we learned how to recognize isolated words. you already know the algorithm. The classiﬁcation was performed by searching the optimal path through the search space. For simplicity.g.1 The Search Space for Connected Word Recognition If we want to apply the DP algorithm to our new task. the classiﬁcation task can be divided into two diﬀerent subtasks: The segmentation of the utterance. the number of possible combinations of words of a given vocabulary together with the fact that each word may widely vary in its length provides for a huge number of combinations to be considered during the classiﬁcation. Then a formal description of the algorithm is presented. there is a solution for the problem. An unknown utterance of a given length TX will contain several words out of our vocabulary. ﬁrst a verbal description of what modiﬁcations are necessary to apply the DP algorithm on the problem of connected word recognition is given.. Fortunately. ”yes’. which was spanned by the individual prototype vector sequences on one hand and by the unknown vector sequence on the other hand.. i. ”stop”. like a credit card number or telephone number? In this case. because it is the Dynamic Programming algorithm again ! In the following. 6. we will ﬁrst have to deﬁne the search space for our new task.e. let’s assume that each word of our vocabulary is represented by only one prototype.
2 Modiﬁcation of the Local Path Alternatives Remember that in our DP algorithm in section 4 we were dealing with the local path alternatives each grid could select as possible predecessors.e. During the DP.1: DTW algorithm for connected word recognition 6. This allowed us to implement the DP algorithm in real–time.. while we were moving from column to column. Remember that in the case of isolated word recognition. the grid point corresponding to the ﬁrst vector of a prototype has more path alternatives: It may either select the same grid point in the preceding column (this is the ”horizontal” transition. column per column. as is shown in Fig. the grid point corresponding to that vector was initialized in the ﬁrst column with the distance between the ﬁrst vector of the prototype and the ﬁrst vector of the utterance. this grid point had only one predecessor. For connected word recognition however. RECOGNITION OF CONNECTED WORDS 39 all possible word ends and word beginnings. indicating that the ﬁrst vector of the prototype sequence was warped in time to match the next vector of the unknown utterance. that . v=V WV = "nine" v virtual grid node Wv = "four" v=0 W0 = "zero" modified path alternatives j=0 j = TX 1 X = "zero four zero nine zero" X Figure 6. which we already know). it could only be preceded by the same grid point in the preceding column (the ”horizontal” transition). The alternatives were including grid points one time index in the past (the ”horizontal” and ”diagonal” transitions) and in the present (the ”vertical” transition). 6. Now we have to modify our deﬁnition of local path alternatives for the ﬁrst vector of each prototype: The ﬁrst vector of a prototype denotes the beginning of the word.CHAPTER 6.1. or it may select every last grid point (word end. i. The optimal path we want to ﬁnd with our DP algorithm must therefore be allowed to run through a search space which is spanned by the set of all vectors of all prototypes in one dimension and by the vector sequence of the unknown utterance in the other dimension.
As in chapter 5. If we now use the DP algorithm again to ﬁnd the optimal path through our search space.2: Local path alternatives of the ﬁrst grid point of a column By allowing these additional path alternatives.2 shows the local path alternatives for the ﬁrst grid point of the column associated with a prototype. Fig. w v Local path alternatives of first grid point w 0 T X Figure 6. an additional ”virtual grid point” is used to represent the selection of the best word end at the end of the utterance. In this case. the prototype word under consideration is assumed to start at the current time index and the word whose word end was chosen is assumed to end at the time index before. where a word boundary can be fond at those times at which the path jumps from the last grid point of a prototype column to the ﬁrst grid point of a prototype column. we allow the optimal path to run through several words in our search space. 6. RECOGNITION OF CONNECTED WORDS 40 is) of all prototypes (including the prototype itself ) of the preceding column as a possible predecessor for between–word path recombination.CHAPTER 6. it will automatically match all words at all boundaries. If we then .
i. . we need backtracking memory to hold for each time index j the following values: • For each time index j the best word end w(j) at time j. The use of multiple prototypes per word can easily be implemented by mapping the sequence of prototypes found by the DP algorithm to the appropriate sequence of words. v = 0. ˜ Let TX be the length of the unknown utterance X and let Tv . we can go backwards in time to ﬁnd all the words and word boundaries deﬁned by the optimal path. 6.. the virtual grid point.4 Formal Description We now will deduce a formal description of the connected word recognition algorithm. v = 0. Figure 6. Using these values. it is also suﬃcient to keep track of the start time ts (j) for this best word end. but we only need to keep track of the predecessor grid points selected during the between–word path recombinations. we do not need to keep track of all the path alternatives selected within the prototype sequences. The best word end at the end of the utterance (time index (TX − 1)) is then used as the starting point for the backtracking procedure.1 Dimensions of the Search Space Let V be the size of our vocabulary. So for backtracking. Using these deﬁnitions. For simplicity. resulting in a corresponding sequence of words.e. (V − 1). our search space is spanned by the length of the utterance TX in one dimension and the sum of the length of the prototype vector sequences in the other dimension. Therefore. • For each time index j the start time ts (j) of the word end w(j). the modiﬁed path alternatives at the ﬁrst grid points of the prototypes and a possible path through the search space.4. 1. . . .CHAPTER 6. . each representing the . we will use the prototype index v and the time index i within a given prototype separately: Let v be the index of the prototype sequence. 6. (V −1) ˜ be the length of the prototype sequence of prototype Wv . Any word starting at time (j + 1) will select this best word end at time t as its predecessor. This will divide our search space into V separate ”partitions” Pv .3 Backtracking To allow proper backtracking.1 shows the search space. . Note that there is only one best word end for each time index j which all words starting at time index (j + 1) will select as a predecessor word. Instead of referring to a single time index for all the vectors of all the prototypes. we will assume that each word wv of our vocabulary is represented by a single prototype vector sequence ˜ Wv only. this path will represent the sequence of prototype words which best matches the given utterance ! 6. the time index at which the path ending at time index j in word w(j) has entered the word. 1. RECOGNITION OF CONNECTED WORDS 41 use the backtracking to determine the optimal path. .
so we have to limit the path alternatives for the grid points (i.6): δ(i. . (V − 1) and utterance time index j = 1. the optimization over all partitions (word ends) is carried out as a separate step for a given time index j − 1. partition index v = 0. v. v. To simplify the notation. v. . (Tv − 1.. resulting in γ(j − 1). . all grid points (0.i . v. 1. 6. v. . j − 1) + γ(j − 1) + d(wv. which are possible at the ﬁrst grid point of the column of each partition. . 1.. These partitions are visualized in Fig. . 6. v.i of prototype Wv by d(wv. xj ) d(wv. there is no predecessor since the utterance starts at j = 0..i .i .3) This optimization step will be processed for all partition indices v = 0..i . j) + d(wv. . (Tv − 1). j). xj ) and the accumulated distance for grid point (i.e. . for all grid points (i. the ”vertical” transition. (i. xj ) This equation holds for all grid points within the partition except for the ﬁrst grid point. 0) to their only predecessor grid point (i − 1. RECOGNITION OF CONNECTED WORDS 42 ˜ grid points corresponding to a prototype Wv of a word wv . j) we denote by δ(i. . In this case. . 0). (Tv − 1).e. j − 1) + 2 · d(wv. .CHAPTER 6.i . 2.1.2 Local Path Recombination Within–Word Transition For the grid points within a partition Pv . . we have to deﬁne the local path alternatives for the between–word transitions. (V −1) and utterance time index j = 1. Given these deﬁnitions.. j). Between–Word Transitions In addition. v. . . We denote the local distance between a vector xj of the ˜ ˜ utterance X and a prototype vector wv. 2. v.4. with i = 0. it is allowed to select either the ”horizontal” transition from grid point (0. (TX − 1). .1. . 1. xj ) δ(i − 1. j) = min (6. (TX − 1). v. j − 1) + d(wv. v. v.1) δ(i − 1.i . . i. . Let i be the time index within a given partition Pv . xj ) (6. v. . v. the best predecessor is selected. we use the same local path recombination as in (4. From all these alternatives. v. i.. Please note that – as in section 4 – . Again. v. xj ) δ(i. v. j). v. j) = min δ(0.V −1 min {δ(Tv − 1. . the accumulated distance of the best word end hypothesis at time index (j − 1): γ(j − 1) = v=0. . the last grid point of any partition. j). 1). we need . j) of each partition can now be written as: δ(0. 1. j) with partition time index i = 1. The optimization step for the ﬁrst grid point (0. .. .e. j − 1) to grid point (0.2) This step recombines all path hypotheses containing a word end for the time index (j − 1) (between–word path recombination). i. (TX − 1). Therefore. . . 1) (6. there is no predecessor for the ﬁrst time index j = 0 of the utterance. a grid point in our search space can be identiﬁed by three indices. Let j be the time index of the unknown utterance with j = 0. or to select any word end. . for time index j = 0. v.
.. we ﬁnd its predecessor word again and so on. v. representing the sequence of words (in reversed order) contained in our vocabulary which best matches the spoken utterance. we look up the best word w(j) and its start time ˆ = ts (j). . for the best word end w(j) at time index j we need to know at what time index the path ending in that word has entered the ﬁrst grid point of the partition with index v = w(j).CHAPTER 6. Extensions for Backtracking Now we are able to extend all the recombination steps by the bookkeeping needed for backtracking: .1.. but we also have to keep track of the point in time when that selected path alternative entered the partition Pv . its value representing the accumulated distance for the globally optimal path. we have for each partition Pv a hypothesis for a word end at time j = (TX − 1). v. It is important to understand that this information has to be forwarded during the within–word path recombination (6.1. x0 ). (TX − 1). Using the j j start time of that word.. . j).e. So. 6.3 Termination After we have processed all time indices j = 0.4 Backtracking During the backtracking operation.1) and also during the between–word path recombination (6. and we have to follow all word boundaries along the optimal path. Therefore.4. These V path hypotheses are now recombined and – as for all other times j – the best accumulated distance is entered into γ(TX − 1). in addition to the accumulated distance δ(i. v.4) For the best word end at every time index j. we have to run backwards in time from the end of the utterance to its beginning. l) as a predecessor for a given grid point (i.0 . we need to track for each time index j which best word end was chosen during the optimization (6. RECOGNITION OF CONNECTED WORDS 43 to initialize all the grid points (0. This step performs the ﬁnal recombination of paths in the virtual grid point of Fig.3): Whenever we are selecting a path alternative (k.2). j) (6. Keeping Track of the Word Boundaries To follow the path from the end of the utterance to its beginning. This results in a list of word indices. We will hold this information in the array w(j) : w(j) = arg min v=0. 1. we also need to know the start time of that word.. which leads us to j its predecessor word w(ˆ − 1) which had to end at time index (ˆ − 1). For the backtracking procedure. 6. we not only have to add its accumulated distance. j) we need another variable ξ(i. v. all the words starting at j = 0) with the local vector distances d(wv. We will hold this information in an array ts (j). this enables us to run backwards in time through the array of best word ends: Beginning with time index j = (TX − 1) (the last time index of the utterance). j). v. j) to hold the start time for each grid point (i.V −1 {δ((Tv − 1).. . v.4. 0) (i. 6. v.
j − 1) + arg min δ(i − 1.11) – Forward the start time of the predecessor: For the ﬁrst grid point (0. its partition index into w(j − 1) and the corresponding word start time into ts (j − 1). j).(V −1) {δ(Tv − 1. since the between–word path recombination will use γ(j − 1) for ﬁnding the best predecessor word at time j. vopt . xj ) 2 · d(wv. • Find best word end: vopt = arg min v=0. we have just (re)started the word v at time j and we have to enter the start time j into ξ(i. v. v. we have to ﬁnd the best word end at time j − 1 and store its its accumulated distance into γ(j − 1). j − 1). v.e.i .CHAPTER 6.7) (6.1. which represent the word ends. j) = ξ(k. j) ξ(i. v.. j). l) = δ(i. j) + – Get predecessor grid point (k. j) + (i. we have to perform the optimization over all possible word ends. v. xj ) d(wv. j − 1)} . j − 1).6) • Within–word backtracking variable ξ(i. xj ) v: d(wv. RECOGNITION OF CONNECTED WORDS 44 First we start with the last grid points (Tv − 1.3): . there are two possible cases: Either the ”horizontal” transition is selected. v. in which case we simply have to forward the start time ξ(0.. l) d(wv. j − 1) + δ(i − 1. v. j − 1. Now we are able to extend the betweenword path recombination in analogy with (6. v. xj ) 2 · d(wv.5) • Store accumulated distance of the best word end: γ(j − 1) = δ(Tvopt − 1. v. i. v. In that case.. xj ) d(wv. vopt . j − 1) + δ(i − 1. l): (k. v. (6.i .i . Note that the word end optimization is done for the ”old values”.i .8) Now we will extend the within–word optimization step to keep track of the start times within the partitions: • Within–word path recombination for partition δ(i. v) of all partitions v: For these grid points. v.. or the best word end at time j − 1 is selected. i. j − 1) + δ(i.10) (6. j − 1) (6. (i − 1. j − 1). v. xj ) (6. for time index j − 1. j) = min δ(i − 1. j) for partition v: (6.9) (6.e.. which comes along with that path hypothesis. v.i . k.. v. v. j − 1) • Store the word index for backtracking: w(j − 1) = vopt • Store word start time for backtracking: ts (j − 1) = ξ(Tvopt − 1.v.i . v. v.l∈ (i − 1. v.
Since the recognition process was reduced to ﬁnding the optimal path through the search space. The start time indices forwarded during recombination will also be held in an array ξj (i. v. With the deﬁnitions of the equations for path recombination and backtracking given above.6 Summary What have we learned in this chapter? We have extended the dynamic programming algorithm to deal with sequence of words instead of isolated words.i . [Ney84]). there are many details to work out to implement a working speech recognition system based on the DP algorithm as described above. only the two arrays w(j) and ts (j) are needed. RECOGNITION OF CONNECTED WORDS 45 • Path recombination at the ﬁrst grid point: δ(0. v) is needed to hold the accumulated distances δ(i.13) 6. 6. v. v. j) = min δ(0. Only an array δj (i. j − 1) ≤ γ(j − 1) j . v.5 Implementation issues As explained in section 4. if δ(0.3 can be applied here.i . xj ) (6. v). it is easy to write down the complete algorithm in analogy to the one given in section 4. j): ξ(0. too.2. j − 1) . Some of the open issues not dealt with in the sections above are: ˜ • How many prototype vector sequences Wv should we use to represent a word of the vocabulary? • How do we detect the start and the end of the utterance properly? . else (6. Of course. we had to store the information needed for backtracking while processing the columns of the search space. j − 1) + γ(j − 1) + d(wv.. Speech recognition systems based on the DP algorithm using prototype vector sequences to match against the unknown utterance (also known as dynamic pattern matching) were widely used some years ago. We found that we only needed to store the information about the best word end optimization steps during the between–word path recombinations. j) of the huge matrix spanned by our search space. the fact that all optimization steps need only the columns for the last time index helps us in saving computer memory. v. xj ) d(wv.12) • Forward start time in variable ξ(0.CHAPTER 6. For backtracking. This is left as an exercise to the reader.g. v. j) = ξ(0. and were even extended by a ﬁnite state syntax (for a detailed description see e. v. So we only have to deal with a few arrays (and several ”partition–local ” variables) instead of explicitly holding in memory the information for all grid points (i. Of course. which we can hold in two additional arrays. This was achieved by reformulating the classiﬁcation task in such a way that it could be described as ﬁnding the optimal path through a matrix of grid points together with the local schemes for within– word and between–word path recombination. all the additional constraints of the search space as mentioned in section 4. v. j).
How can we implement a speaker–independent system dealing with large vocabularies? As we will see in the next chapters.CHAPTER 6. there are better ways to model the prototype words for a speech recognition system. RECOGNITION OF CONNECTED WORDS 46 • How do we ﬁnd appropriate prototypes for the words ? • For a speaker–independent system the number of prototypes will have to be very large to cover all the variabilities involved with speaker–independent systems. • The more prototypes we use. . consuming more memory and more computation time. the larger our search space will be.
What kind of parameters these are we will learn later. but rather a class 47 . we were using the DP algorithm to ﬁnd the optimum match between an utterance and a sequence of words contained in our vocabulary. 7. For now. as p(XΛv ). For additional reading on statistical methods for speech recognition. If. If we calculate a ﬁt–value for each model of our vocabulary.1 Statistical Classiﬁcation The classiﬁcation task for isolated word recognition can be formulated in a statistical framework as follows: Let’s assume we have for each word wv of our vocabulary a model Λv which is deﬁned by a set of parameters λv . It will later often be called the emission ˜ probability density. the so– called statistical classiﬁcation. which de˜ pends on the model Λv . In the following. for example. then we could calculate how good a given utterance ﬁts to our model. the book [Jel98] is strongly recommended. This is a very simplistic description of a general classiﬁcation method. Note that p(XΛv ) is not a probability. It should be clear that the number of prototypes we have to store in order to implement a speaker–independent system might become quite large. we will see how it can be used for speech recognition. The words of the vocabulary were represented by one or more prototype vector sequences. especially if we are dealing with a large vocabulary size also. All variability of speech that naturally occurs if a word is spoken at diﬀerent times by a person or even by diﬀerent persons had to be reﬂected by the prototype vector sequences. then we can assign the unknown utterance to the model which best ﬁts to the utterance. we assume that we are able to compute for a vector ˜ ˜ sequence X the probability density that the utterance X was produced by the model Λv . we would have a general stochastic model for the generation of feature vectors corresponding to a given word. One would like to look for a model of the speech generation process for each word itself instead of storing samples of its output. What we are rather looking for to handle this problem is a way to represent the words of our vocabulary in a more generic form than just to store many speech samples for each word.Chapter 7 Hidden Markov Models In the last chapter. We denote this conditional probability density function.
˜ How can we compute the probability P (ωv X) measuring the probability that ˜ belongs to class ωv ? Using Bayes’ equation it is computed as follows: X ˜ ˜ p(X. The classiﬁcation task at hand can now be deﬁned as: ˜ ˜ X ∈ ωv ⇔ v = arg max P (ωv X) v (7. the speech recognition task we are currently dealing with is the recognition of isolated words..3) can be rewritten as: ˜ ˜ X ∈ Λv ⇔ v = arg max p(XΛv ) · P (Λv ) v (7.3) ˜ So we only need the probability density functions p(Xωv ) and the a–priori probabilities of the classes P (ωv ). [Rus88]. 7. It de˜ belongs to the class ωv and will be notes the probability that the utterance X computed with respect to all our models Λ for all classes ω.1. which could be determined by simply counting the word frequencies within a given training set. since we are performing a maximum search and p(X) independent from the class under consideration. the unconditional emission probability density p(X) ˜ is can be eliminated. to which the unknown utterance ”belongs” to with the highest probability.1 Bayes Classiﬁcation The task of classifying isolated words can now be verbalized as: Assign the un˜ known utterance X to that class ωv . However. we have to chose the maximum a–posteriori probability instead of the minimum class distance.e. .CHAPTER 7.4) uses the probability density functions p(XΛv ) – which we assume we are able to compute – and the a–priori probabilities of the models P (Λv ).2) ˜ If we insert (7. instead of discrete values. like simple command words. ˜ What we need here is the so–called a–posteriori probability P (ωv X). HIDDEN MARKOV MODELS 48 speciﬁc probability density function (PDF).1). We can now use our word models Λ as approximations for the speech generation process and by using the emission probability densities of our word models.g...1): ˜ ˜ X ∈ ωv ⇔ v = arg max p(Xωv ) · P (ωv ) v (7.4) 7. which denote the unconditional (i. ωv ) p(Xωv ) · P (ωv ) ˜ P (ωv X) = = ˜ ˜ p(X) p(X) (7. (7. e.g. Thus. The classiﬁcation scheme as deﬁned above is called the ”Bayes Classiﬁcation”.1.2 Maximum Likelihood Classiﬁcation ˜ Equation (7.1) Note that in contrary to (4. not ˜ depending on the measured vector sequence X) probability that the utterance belongs to class ωv . we can rewrite (7. It it reasonable to assume that these command words will occur with equally distributed probability.2). since we are dealing with continuous values of each element of the vectors xj of the sequence. e.2) into (7.
. 7. while early applications are described in [LRS85a. these vector sequences my vary in their length as well as in the individual spectral shapes of the feature vectors within that sequence. the important part of that framework is the emission proba˜ bility density p(XΛv ). . . Each state of the HMM can model the generation (or: emission) of the observed symbols (or in our case of the feature vectors) using a stationary stochastic emission process.2 Hidden Markov Models One major breakthrough in speech recognition was the introduction of the statistical classiﬁcation framework described above for speech recognition purposes. In our application. As we have seen. These density functions have to be estimated for each word model Λv to reﬂect the generation (or ”emission”) process of all possible ˜ vector sequences Xv belonging to the word class ωv . P (Λv ) becomes a constant value (1/V ) and we can rewrite (7. which are also called the likelihood functions.4) as: ˜ ˜ X ∈ Λv ⇔ v = arg max p(XΛv ) v (7. A very detailed introduction into HMMs can be found in [RJ86. The underlying stochastic process is therefore ”hidden” from the observer. where θ ΩN v = {0. we will only discuss those properties of the HMMs which we will need to understand how they are used in algorithms for speech recognition. Therefore. Rab89. LRS89]. The Hidden Markov Models (short: HMMs) are used successfully in speech recognition for many years. Thus. while the dynamic changes of the vector sequence will be modelled by transitions between the states of the HMM. Jel98]. we can assume that the utterance ”yes” will occur as often as ”no”. For a given observation (vector) however. a model is needed which is capable of dealing with both of these variabilities. each state of the HMM will model a certain segment of the vector sequence of the utterance. If we assume equally distributed a–priori probabilities.CHAPTER 7. this classiﬁcation approach is called the maximum likelihood classiﬁcation. 7. These segments (which are assumed to be stationary) are described by the stationary emission processes assigned to the states. In the ˜ next sections we will see how to compute the density functions p(XΛv ). In the following.2. 1. Now it is time to introduce a little bit of formal framework: Let the HMM Λv have N v diﬀerent states denoted as sv with θ ∈ ΩN v .1 The Parameters of a Hidden Markov Model The HMM is modelling a stochastic process deﬁned by a set of states and transition probabilities between those states. (N v − 1)} is the set of state indices. As we know. we do not know in which state the model has been when emitting that vector. where each state describes a stationary stochastic process and the transition from one state to another state describes how the process changes its characteristics in time. . .5) Now the classiﬁcation depends solely on the model (or class) speciﬁc probability ˜ densities p(XΛv ). HIDDEN MARKOV MODELS 49 for a ”yes / no” task.
.e. only states with the same or a higher state index θ can be reached with nonzero probability.2 HMMs for Word Modelling ˜ A vector sequence X = x0 .e. we skip the word class index v): At each time t = 0. 1. Usually the state transition probabilities are restricted to reach only the two next state indices (e. The initial state probability is usually set to one for the ﬁrst state and to zero for all the other states. 1. being in a certain state sθ . . Each state sv has assigned its stationary i. the shortest vector sequence must have three vectors.g.. e. . so the generation process always starts in state s0 . . θ In addition. if 0 ≤ ∆ ≤ 2 (7.g. [LRS85a]): aθ.2. ..θ+∆ > 0 . However. one can not determine which sequence Θ was used ˜ to generate X i. x(TX −1) which belongs to a word wv can now be generated by the HMM Λv as follows (to simplify the notation. 7.θ+∆ = 0 . vector sequences of inﬁnite length can generated. . A given vector sequence X can thus be generated by running through a sequence of state indices Θ = θ0 . the shortest length of the sequence is deﬁned by the shortest path through the model.θt+1 and so on.1. θ which denotes the initial probability of the model being in state sv when the θ generation process starts. . . Then it will make a state transition from state sθt to state sθt+1 with the transition proba˜ bility aθt . . 1.(TX − 1) the HMM is in a state sθt and will generate a vector xt with the emission probability density p (xt sθt ) (the initial state probability uθ gives the probability being in state sθ at time t = 0). ˜ given only the sequence X. .1 a 00 a s 0 a 11 a s 1 a 22 a s 2 a 33 a s 3 a 44 01 12 23 34 s 4 a 02 a 13 a 24 Figure 7. we deﬁne the initial state probability uv .6) aθ. Therefore. else This leads to a left–to–right HMM structure as shown for ﬁve states in Fig. HIDDEN MARKOV MODELS 50 Let av . (N v − 1)} be the transition probability between state i.CHAPTER 7.. . . θ1. x1 . The set of all transition probabilities can be compactly described i j ˜ by the matrix A(N v ×N v ) = av . . one usually restricts the transition probability densities of the HMM to a ”from left to right” structure.j θ emission process deﬁned by the emission probability density function p (xsv ).θ . . θ(TX −1) . i.. (N v − 1)}..1: Typical left–to–right HMM for word recognition Note that by choosing nonzero self–transitions probabilities aθ. j ∈ {0. . .j sv and sv . The state transitions of the HMM reﬂect the sequence of quasi–stationary seg˜ ments of speech contained in the utterance X. 7. the state sequence is hidden from the observer. θ ∈ {0. . i. in Fig. . 7.
2. which is the ˜ probability density of generating X along the state sequence Θ using the model Λ.θt · p (xt sθt ) (7. but fortunately there exists an eﬃcient algorithm to do so. we now have to ˜ sum up p X. This is done by simply multiplying all the emission probability densities and transition probabilities along the state sequence Θ: TX −1 ˜ p X.9): TX −1 ˜ ˜ p XΛ ≈ p X. we have to compute the emission probability den˜ ˜ sity p XΛ for a given sequence X and model Λ.8) If we insert (7. ΘΛ = uθ0 · p (x0 sθ0 ) · t=1 aθ(t−1) . we get: TX −1 ˜ p XΛ = Θ∈Ω uθ0 · p (x0 sθ0 ) · N (TX ) ( ) t=1 aθ(t−1) .2. HIDDEN MARKOV MODELS 51 7.1. . ΘΛ (N (TX ) ) (7. we choose a diﬀerent approach: We will not compute the sum over all state sequences Θ ∈ Ω(N (TX ) ) .10) Where Θopt is deﬁned as the state sequence which maximizes the emission prob˜ ability density p X. . Assuming that we know the state sequence Θ which pro˜ ˜ duces X. Instead of using the Baum–Welch algorithm. ΘΛ over all possible state sequences Θ with length TX . but we will approximate the sum by the single state sequence Θopt which has the highest contribution to the sum (7. ˜ Then. We remember that the ˜ HMM will generate the vector sequence X along a state index sequence Θ = θ0 .CHAPTER 7. ΘΛ (TX ) (N ) (7. θ(TX −1) .θt · p (xt sθt ) (7. Θopt Λ = uθ0 · p (x0 sθ0 ) · t=1 aθ(t−1) . the so–called Baum–Welch algorithm [Rab89]. The number of possible state sequences equals the number of TX – tupels that can be generated from the set of state indices ΩN .3 Emission Probability Density of a Vector Sequence As we saw in section 7.9) To compute the sum over all possible state sequences would require lots of computational eﬀorts.θt · p (xt sθt ) (7.7) ˜ To compute the desired emission probability density p XΛ . ΘΛ .8). ΘΛ : Θopt = Θ∈Ω ˜ arg max p X. We denote the set of all state sequences of length TX given the set of state indices ΩN as Ω(N (TX ) ) . we can easily compute the probability density p X. . p XΛ can be computed as: ˜ p XΛ = Θ∈Ω ˜ p X.7) into (7. θ1.11) .
10) is transformed to adding up all the log score values along a given path. HIDDEN MARKOV MODELS 52 If we evaluate (7. each possible path will go through one of the N states of the HMM.1. the evaluation of (7.2: Possible state sequences through the HMM Now it is easy to ﬁnd the optimum state sequence / the optimum path through the grid: It can be done with the DP algorithm (Also known as Viterbi algorithm) we learned in section 4 ! All we have to do is to deﬁne the equation for the local recombination. in analogy to adding all the local distances along the path in section 4. 7. In this Figure. The logarithm of the probability density is often called the log score value. we have to deal with product terms only. If we are using the log score values for the transition probabilities and the emission probability densities.2.CHAPTER 7. a path recombination takes place. at every grid point shown in Fig. In other words. Since the probability density values may vary over several orders of magnitude. . At every time index t.2 shows all the possible state sequences or paths (also called the trellis) for the HMM shown in Fig. How do we ﬁnd the optimal state sequence Θopt ? Fig. so that we can easily take the logarithm of the individual probability (density) values and replace the product by the sum of those log–values. a 44 s4 a 34 a 33 a 23 a 22 a 12 a 11 a 00 a 01 s0 s1 a 02 a 13 s2 a 24 s3 X Figure 7.10). taking the logarithm has the additional advantage of reducing the dynamic range of those values. the HMM is rotated by 90 ˜ degrees and on the horizontal axis the vectors of the utterance X are shown. 7. 7.
12) Let δ(sθ . instead. The Viterbi Algorithm Now.6).2. we can formulate the Viterbi (or DP) algorithm for computing the score ˜ log p X.14) Note that in contrary to (4. sθ ) = log (asi . Let b(t. then all paths are starting at time index 0 in state s0 ). Θopt Λ . we have to ﬁnd the path or state sequence which yields the maximum accumulated sum of log scores along that path. since each state transition in the HMM implies an increase of the time index t). t) = b(t. sθ ) = log (p (xt sθ )) Let d(si . If we would use the negative log score values. we deﬁne the set Ξ (sθ ) as the set of all valid predecessor state indices for a given state sθ . t) in analogy to section 4: . t) be the accumulated log probability density for the optimal path beginning at time index 0 and ending in state sθ at time index t (If we further assume that u0 = 1. we will leave out the model index v in the following. With these deﬁnitions. HIDDEN MARKOV MODELS 53 7. sθ ) + max {d(si . 7. This local path recombination will now be performed for each time index t and within each time index for all state indices θ. sθ ) denote the logarithm of the emission probability density of vector xt being emitted by state sθ : b(t.CHAPTER 7. the local path recombination scheme can be written as: δ(sθ . we could even search for the path with the minimum accumulated negative log score. In section 4. (t − 1))} i ∈ Ξ(sθ ) (7. all predecessor grid points have the time index (t − 1) here (there is no vertical transition possible. and its two predecessors with the lower state indices (θ − 1) and (θ − 2). it is more intuitive to deal with higher scores for better matches. To allow for more complex model structures also. Now we will deﬁne the local path alternatives for path recombination. sθ ) be the log probability for the state transition si → sθ : d(si . the states sθ in the middle of the model have only three possible predecessors (see Fig.sθ ) (7. In our left–to–right structure. However. so we will use log scores and search for the maximum in the following. To enable the backtracking of the state sequence Θopt . Θopt Λ . However. a given state may have up to N predecessors. this is not true for the ﬁrst and second state of the model. Since we are dealing with one given HMM Λv here.3): The state sθ itself.13) (7. Depending on the transition probabilities of the model. we were searching for the minimum accumulated distance. Scores and Local Path Recombination We will start with the deﬁnition of the log score values.4 Computing the Emission Probability Density ˜ To compute p X. we will use the backtracking variable ψ(θ. sθ ) + δ(si .
17) – Backtracking: ψ(θ. . t − 1)} i ∈ Ξ(sθ ) (7. . sθ ) + max {d(si . θ=0 θ = 1. HIDDEN MARKOV MODELS i 54 i (i1) (i2) (t1) t t Figure 7. s0 ) . . . After the last iteration step is done. all other initial state probabilities are zero. −∞ . . (TX − 1)) (7. the state sN −1 . we get: ˜ log p X.19) . . 0) = −1 . . t − 1)} i ∈ Ξ(sθ ) (7. 1. we deﬁne u0 = 1. . . sθ ) + δ(si . (N − 1) (7. . . . 2. we will consider only paths which end in the last state of the HMM.14) for all states θ = 0. sθ ) + δ(si . (TX − 1): – Optimization step: δ(sθ . 2. 0) = b(0. Therefore. t) = b(t.16) • Iteration: Perform the local path recombination as given by (7. (N − 1) (7. .3: Possible predecessor states for a given HMM state • Initialization (t = 0): For the left–to–right structure we use. θ = 0. 1. . . (N − 1) and all time indices t = 1. Θopt Λ = δ(sN −1 . .18) • Termination: In our left–to–right model.15) – Initialize backtracking variables: ψ(θ. t) = arg max {d(si .CHAPTER 7. enforcing all paths to begin in state s0 with the emission of the vector x0 : – Initialize ﬁrst column: δ(sθ .
1) ≡ 0 (7. 1: θ(t−1) = ψ(θt . Within those segments. As we recall. Now we have to deal with the question of how p (xsθ ) is modelled within each state of a HMM and how to compute the probability density for a given vector x.CHAPTER 7. we will ﬁrst look at the one–dimensional case. centered around a mean value m. The values we measure can be characterized by the Gaussian probability density function. θ(TX −1) = N − 1 – Iteration: For t = (TX − 1). the Gaussian PDF φ(x) is deﬁned as: φ(x) = √ ·e 2πσ 2 1 −1· 2 (x−m)2 σ2 (7.20) 7. As we recall from school. 7.1 Continuous Mixture Densities How do we model the emission process ? As we remember from chapter 2.22) (7. we have to allow for the wide variability of the emitted vectors caused by the variability of the characteristic features of the speech signal. Before we deal with the question of how to model probability density functions for multi–dimensional feature vectors. HIDDEN MARKOV MODELS 55 • Backtracking procedure: Beginning with time index (TX − 1). Lets assume we measure only one continuously valued feature x. we are dealing with feature vectors where each component is representing a continuous valued feature of the speech signal. .23) Where m represents the mean value and σ 2 denotes the variance of the PDF. . If we assume the measurement of this value to be disturbed by many statistically independent processes. These two parameters fully characterize the one–dimensional Gaussian PDF. we have assumed that we can compute the value of the state–speciﬁc emission probability density p (xsθ ) of a HMM Λ.21) (7. which we then will assume to be a good estimate for the true value of x. . t) – Termination: θ0 = ψ(θ1 . in the states of the HMM we model stationary emission processes which we assume to correspond with stationary segments of speech. we can assume that our measurements will assume a Gaussian distribution of values.3 Modelling the Emission Probability Densities So far. . go backwards in time through the backtracking variables: – Initialization: t = (TX − 1) . .3.
which have more than one maximum and whose regions of constant probability density are not elliptically shaped.5]′ ′ . m3 = [3. You know these parameters already. so we will use a multivariate Gaussian PDF. but have complex shapes like the regions we saw in Fig.g. ˜ C1 = 2 2 −1 5 0 1 .26) A PDF modelled by this weighted sum of Gaussian densities is called a Continuous Mixture Density. 1. HIDDEN MARKOV MODELS 56 Gaussian Probability Density Function In our application.24).4.24) ˜ Where m denotes the mean vector and C denotes the covariance matrix. −1] .8) in section 3.4 shows the scatterplot of an emission process consisting of three Gaussian emission processes.1. because we used them in computing the Mahalanobis distance (3.5. [Rus88]).. for speech recognition. 2. a HMM using this modelling approach is called a Continuous Mixture Density HMM. the weighted sum over a set of K Gaussian densities can be used to model p(x) (e. m2 = [−3.g. Rab89]): K−1 p(x) = k=0 ck · φk (x) (7. C2 = −1 1 ˜ c2 = 0. The regions of constant probability density are of elliptical shape. . LRS85b. m1 = [1. −3] ′ The corresponding probability density function is visualized as a suface in ﬁg.3 . The weighting coeﬃcients ck are called the mixture coeﬃcients and have to ﬁt the constraint: K−1 ck = 1 k=0 (7.24). However.4 . which looks like this: φ(x) = (2π) 1 DIM ˜ · C · e (− 2 (x−m) 1 ′ ˜ · C −1 · (x−m)) (7. we measure feature vectors in a multi–dimensional feature space. Continuous Mixture Densities The Gaussian PDF (7. Note that the the Mahalanobis distance is actually part of the exponent of the PDF (7. Figure 7.CHAPTER 7. 7. C3 = 0 c1 = 0. The process parameters are: 5 2 1 ˜ c2 = 0. The PDF of this process has a maximum value at the position of the mean vector m and its value exponentially decreases with increasing distance from the mean vector.3 .25) Where φk (x) is the Gaussian PDF as in (7. we would like to model more complex probability distributions..24) can characterize the observation probability for vectors generated by a single Gaussian process. [CHLP92. and their orientation is determined by the Eigenvectors of the covariance matrix (e. To do so.
CHAPTER 7. HIDDEN MARKOV MODELS
57
scatterplot 8
6
4
2
0
−2
−4
−6
−8
−10 −8
−6
−4
−2
0
2
4
6
8
Figure 7.4: scatterplot of vectors emitted by a gaussian mixture density process
The complexity of the PDF modelled by (7.25) can be inﬂuenced by selecting the appropriate number of mixtures K and may be set individually for every state of the HMM. However, since a high number of mixtures also means that a high number of parameters has to be estimated from a given training set of data, always some compromise between the granularity of the modeling and the reliability of the estimation of the parameters has to be found.
7.3.2
Approximations
Using mixture densities leads to some practical problems: First, the computational eﬀort to compute all the densities for all states of a HMM might be very high. We left out the state indices in the sections before for better readability, but we have to remember that in every state sθ of the HMM a state–speciﬁc emission probability density function p (xsθ ) has to be modelled, each having Kθ gaussian densities. Since the computation of each Gaussian density involves the multiplication with the covariance matrix, this leads to a signiﬁcant computational complexity. Second, the high number of parameters demands a huge training material to reliably estimate the model parameters. Many methods are known to reduce the computational complexity and the number of parameters as well. In the following, just some examples of the various
CHAPTER 7. HIDDEN MARKOV MODELS
58
Figure 7.5: probability density function
possibilities are given. Maximum Approximation In (7.25), the sum of all K gaussian mixtures is computed. However, since value for a gaussian density exponentially decreases with the distance from the mean vector, one can approximate the sum over all mixtures by selecting just the one term which yields the highest contribution to the sum (e.g.,[Ney93]): p(x) ≈
k=0,1,...,K−1
max
{ck · φk (x)}
(7.27)
Since we do not have to compute the sum over the mixture densities anymore, the log values of the mixture coeﬃcient and the gaussian PDF can be used, which we know helps us in avoiding numerical problems: log (p(x)) ≈
k=0,1,...,K−1
max
{log(ck ) + log (φk (x))}
(7.28)
Diagonal Covariance Matrix To reduce the number of parameters and the computational eﬀort simultaneously, it is often assumed that the individual features of the feature vector are not correlated. Then diagonal covariance matrices can be used instead of full covariance matrices. This drastically reduces the number of parameters and the computational eﬀort. As a consequence, a higher number of mixtures can be
CHAPTER 7. HIDDEN MARKOV MODELS
59
used, and correlation of the feature vectors can thus (to a certain degree) be modelled by the combination of diagonal mixtures [LRS85b, CHLP92]. Parameter Tying To reduce the number of parameters even further, some parameters can be ”shared” or tied among several mixtures of a given state, among several states of an HMM and even among the HMMs. The idea behind parameter tying is that if two mixtures (or states or models) have tied parameters, i.e., are using the same value for that parameter, then the training material for these mixtures (states, models) can also be pooled to increase the reliability of the estimation of the tied parameter. A huge number of approaches for Models with tied parameters exists and is beyond the scope of this introduction. Just a few examples shall be given here: • Using a single covariance matrix: A commonly used approach is to share the covariance matrix among all states and even among all models, while the mean vectors and the mixture coeﬃcients are estimated individually for each state. These parameters allow a reliable estimate and therefore the number of mixtures per state can be signiﬁcantly increased, thus implicitly modelling state–speciﬁc variances and correlations (e.g., [Ney93]). • Semicontinuous HMM: Instead of estimating individual gaussian densities φk,θ,Λ (x) for each state θ and Model Λ, The densities are shared among all models and states. Only the mixture coeﬃcients ck,θ,Λ are estimated individually for each state (e.g.,[XDHH90, HL92]). • Generalised Parameter Tying: While the examples above were deﬁning the set of tied parameters by an a–priori decision, it is also possible to introduce parameter tying on several levels (e.g., tied mean vectors, tied covariances, tied gaussian densities, tied mixtures, tied states, tied models), depending on the amount of training material available as well as determined on incorporated apriori knowledge. Examples can be found in [You92, SYW94, You96, You01].
In analogy to chapter 5. 8. we will use the optimum path sequence found by the Viterbi algorithm to perform the maximum likelihood classiﬁcation. respectively. which consists of all word models Λv and is extended by two virtual states.2. a virtual start state and a virtual end state.2.v) belongs to. we will organize our search space according to the task at hand. 8. To do so.Chapter 8 Speech Recognition with HMMs ˜ Now that we are able to compute the likelihood functions p(XΛv ).1. we can perform the maximum likelihood classiﬁcation (7.5) as described in section 7.1: The model structure for isolated word recognition Now we adapt the Viterbi algorithm we used in section 7.1 Isolated Word Recognition First of all. In the following. we will build a superHMM.4 accordingly: 60 . the state index will be extended by the additional index v to identify the model Λv a state s(θ. Fig.1 shows the structure of the superHMM. a 00 a s0 a 11 a s1 a 22 a s2 a 33 a s3 a 44 01 12 23 34 s4 a 02 a 13 a 24 a 00 a 01 s 0 a 11 a 12 s 1 a 22 a 23 s 2 a 33 a 34 s 3 a 44 s 4 Start a 02 a 13 a 24 End a 00 a s0 a 11 a s1 a 22 a s2 a 33 a s3 a 44 01 12 23 34 s4 a 02 a 13 a 24 Figure 8.
For simplicity.CHAPTER 8. t). Of course. the Viterbi algorithm allows a real–time implementation. (TX − 1)) (8.g.v) ) 1 are set to zero instead of log( V ). e. By constructing the Super HMM framework we were able to solve the classiﬁcation task by ﬁnding the optimal state sequence. the Viterbi iteration (7. as e. TX ) contains the best path among all possible word ends to send and is therefore the desired classiﬁcation result according to the maximum likelihood approach.[XHH93]. we simply set the accumulated score for sstart to zero and the backtracking variable ψ(sstart . the state transition scores d(sstart . . For the end state of the super HMM.v) . To initialize the Viterbi algorithm. We will see in the following section.g.2) The word index v contained in ψ(send . we perform the last path recombination for the virtual time index TX : • Optimization step: δ(send . These super models can even contain additional knowledge sources. . it is ensured that the ﬁrst state of each model will be initialized properly for time t = 0. Now. SPEECH RECOGNITION WITH HMMS 61 The virtual start state will not emit any vectors but will only provide the ﬁrst states of each HMM with a predecessor for the virtual time slot t = −1. the set of predecessor states Ξ s(θ. −1) to −1.v) . as we did in in chapter 5.v) for all HMMs Λv and for all time indices t = 0.18) is performed for all states s(θ.v) is deﬁned to contain only the state sstart . (TX −1).. To construct large HMMs which reﬂect the speech recognition task at hand from smaller HMMs and performing the recognition task at hand by searching the optimum state sequence for that HMM is one of the most important issues in the stochastic modelling framework. To terminate the Viterbi algorithm. that the extension of that framework to connected word recognition is straightforward. 1. they are also called uniﬁed stochastic engines. v) have to be stored in ψ(θ. The transition scores d(s((Nv −1). If backtracking is desired. TX ) = arg max (i. TX ) = • Backtracking: ψ(send . a so–called language model which models the a–priori probabilities of sequences of words. Because with these large HMMs those knowledge sources can be fully integrated into the framework (even into the training procedures). .v) of each HMM Λv . Due to the deﬁnition of the variables for our start state as given above. now both the state and the model index (i.17) and backtracking procedure (7.1) δ(s(i.v) ∈ Ξ(send ) (i. .v) ∈ Ξ(send ) max δ(s(i. the set of predecessors Ξ (send ) will contain all end states s((Nv −1). . For each ﬁrst state s(θ. (TX − 1)) (8.v) . send ) are set to zero.v) . as in that chapter. s(θ.
This is an easy thing to do: we simply take our model for isolated word recognition and introduce a ”loop” transition from the last state back to the ﬁrst state of the model. those two virtual states will not emit any vectors and therefore will not consume any time index. in analogy to the use of vector indices i and partition indices v we had in chapter 6.2: The model structure for connected word recognition Note that this extension is in total analogy to what we did in chapter 6: At each time index t all word models may make a transition from their last state to the virtual ”R” state for path recombination.3). To identify the individual states sθ of the word models. After the path recombination a virtual transition to state ”S” is made and state ”S” may be used as a possible predecessor state for the ﬁrst states of all word models.2) to simplify the between–word path recombination (6. for between–word path recombination. we will have to construct a super HMM capable of generating an unlimited sequence of words from our vocabulary.2 8. the ﬁrst grid point of each word was allowed to select either the the ”horizontal” transition or the best word end as its predecessor grid end. The resulting model is shown in Fig. we did in fact introduce a virtual state like SR for path recombination. 8. In chapter 6.2. By introducing the variable γ(j − 1) as in (6. As before. a 00 a a 11 a a 22 a a 33 a a 44 01 s1 12 s2 23 s3 34 s4 s0 a 02 a 13 a 24 a 00 a a 11 a a 22 a a 33 a a 44 01 s 1 12 s 2 23 s 3 34 s 4 s 0 S a 02 a 13 a 24 R a 00 a a 11 a a 22 a a 33 a a 44 01 s 1 12 s 2 23 s 3 34 s 4 s 0 a 02 a 13 a 24 Figure 8. They are just used to simplify the the ”bookkeeping” for the path recombinations after leaving the last states of the individual word models and the transition into the ﬁrst states of the word models.2. SPEECH RECOGNITION WITH HMMS 62 8.CHAPTER 8. . we use again the model index v as an additional index. To deﬁne the dimensions of our search space is very simple: It is spanned by all states of all word models Λv in one dimension and by the vectors of the vector ˜ sequence X in the other dimension.1 Connected Word Recognition Deﬁnition of the Search Space If we want to perform connected word recognition as in chapter 6.
Note that the predecessor states are identiﬁed by the tupel (i.v ) + δ(si. . (t − 1)) (8.k) ∈ Ξ(s(θ. t) = (i.k) . Now that we are used to it.2.4. t) (8. .3) (i.CHAPTER 8. all we have to to is to reformulate the equations for path recombination and backtracking given in section 7. we must initialize the ﬁrst columns for all models.k) .3) holds also for the ﬁrst states of each model due to the deﬁnition of the set of predecessors.4) To allow backtracking of the word transitions made during the between–word path recombination (8.4. the Viterbi algorithm for connected word recognition with HMMs can be deﬁned as follows: • Initialization: For the ﬁrst time index t = 0. Ξ(sR ) will contain the state indices of all last states s(Nv −1).v) ) + max d(s(i.4).2. in fact merging the states sS and sR into sR : For all ﬁrst states of all models Λv .4 and again we must forward the start time of the word model during the local path recombination. of course.4. Since it is used only as a predecessor for the ﬁrst states of the word HMMs. sθ. the recombination of the paths at the word ends is performed for the time index t.v of all models Λv . (V − 1): .v) . While the ﬁrst states of all models will be initialized with the log score of emitting the ﬁrst vector of the utterance.v) of word model Λv . we need the backtracking arrays w(t) and ts (t) as in section 6. we just keep in mind that the recombination for the state sR has to be tha last action for time index t: δ(sR .k) ∈ Ξ(sR ) max δ(si. For the virtual state sR . all other states will be initialized with the default value −∞. k) consisting of the state index i and the model index k. The state sS has only the predecessor sR and will not emit any vector. Now.2 The Viterbi Algorithm for Connected Word Recognition Now. This has to be done after all the local path recombinations to ensure that all local paths leading to word ends are already recombined correctly before we are selecting the best word end.v) ) The equation for the local path recombination (8. To emphasize this fact in section 6. We start with the local path recombination in analogy to (7.v) be the set of predecessor states of state s(θ.14): δ(s(θ.v ) will contain only the the state index of sR and the state index for the state s0.k) . .v itself to allow the self–transition . s(θ. t) = b(t. – Initialize ﬁrst columns: For all v = 0. . 1. SPEECH RECOGNITION WITH HMMS 63 8. The word start time for all models at that time index is t = 0. we explicitly made this step at the beginning of time index j using the index of the old values j − 1. We extend the set of predecessors Ξ to consider also the two states sR and sS : For sR . we can deﬁne SR as their predecessor state directly. Let Ξ s(θ. Ξ(s0.4 to ﬁt to the new problem.
v) ) (8. . v). t) (8. .k) . k) = arg max (θ.v) . t) (8.v) ) d(s(i.v) ∈ Ξ(sR ) max δ(s((θ. .v) .v ) + δ(s(i.11) – Backtracking at word level: Store the word index and word start time of the best word end at time t: ∗ Get best word end state (i. . SPEECH RECOGNITION WITH HMMS 64 δ(s(θ.7) max d(s(i. . k). k) = arg max (i. t).8) – Backtracking at state level: Forward the word start time associated with the path ending in δ(s(θ. θ = 0 −∞ .k) . 2. 2. t) = b(t.v) ) .k) . sθ. θ=0 −1 . .v) . 0) = 0 .10) – Path recombination at the word ends: Select the best word end at time t and enter the score into δ(sR . v).v) .k) ∈ Ξ(s(θ. sθ. w(0) = −1 • Iteration for t = 1. . . .k) ∈ Ξ(s(θ.v) )+ (i. t − 1) (8.CHAPTER 8. (TX − 1): – Local path recombination: δ(s(θ.5) – Initialize state–level backtracking variables: For all v = 0.v) ∈ Ξ(sR ) δ(s((θ. t) (8.v) .13) ts (t) = ξ((i. t) = (θ. k).k) . (t − 1)) (8. (Nv − 1) (8.v ) + δ(s(i. θ = 1. . 0) = b(0. δ(sR . θ = 1. . (Nv − 1) (8. .9) ∗ Forward the start time: ξ((θ. k) at time t: (i. . (V − 1): ξ((θ. . s(0. 1. . . (t − 1)) (8.12) ∗ Store word index: ∗ Store word start time: w(t) = k (8.6) – Init word level backtracking: ts (0) = 0 . 2. t) = ξ((i. t): ∗ Get predecessor grid point: (i. s(θ.14) .
• The connected word recognizer we described is unable to either provide us with a conﬁdence measure for the accuracy of the recognized word sequence or detect ”out of vocabulary” words. SPEECH RECOGNITION WITH HMMS 65 • Termination: At t = (TX − 1). the accumulated score of the utterance can be found in δ(sR .CHAPTER 8. The ﬂip side of the coin is that this simple approach has several disadvantages. Of course. our algorithm will always produce a sequence of words out of our vocabulary.3 Summary Now you know the principles of the algorithms used for simple speech recognition tasks like the recognition of connected digits. (TX − 1)). real implementations will deal with additional problems like the reduction of the computational efforts during the Viterbi search or the incorporation of silence models into the search process. the index of the best word end (the last word of the utterance) can be found in w(TX − 1) and the start time of that word in ts (TX − 1) • Backtracking Procedure: Beginning with time index (TX − 1).15) 8. go backwards in time through the backtracking variables: – Initialization: t = (TX − 1) – Iteration: While t > 0: wt = w(t) . (8. no other ”higher–level” knowledge sources are exploited. of which at least some should be mentioned here: • No matter what the utterance is. The algorithms we used were quite straightforward and had a very regular structure. only the acoustic knowledge represented by the HMMs is used. We know now how we can use the Maximum Likelihood classiﬁcation for matching a given utterance against a predeﬁned vocabulary represented by HMMs. t = ts (t). . • During the recognition process. • We will get only the best match for the word sequence and no alternative ”nbest” word sequences are presented. • It has no knowledge about any semantic restrictions of the word sequence and is in no way capable of really ”understanding” the utterance.16) (8.
How can we build word models from smaller units which can be reliably estimated? • How can we search ”nbest” word sequences instead of the best word sequence only? • How can we integrate additional knowledge sources like ﬁnite state grammars or stochastic language models into the recognition process? • How can we reduce the computational complexity of the search algorithm so that we are able to meet real time constraints ? We will learn more about some of these topics in the successor of this textbook which will be entitled ”Speech Recognition: Selected Topics” and will be available soon ! 66 . We also pointed out the disadvantages of these primitive algorithms. Among them are the following: • We know now how the basic algorithm for connected word recognition with HMMs works. — But how does the training procedure for HMMs look like? • If we have large vocabularies. there are so many open topics we did not even come close to deal with.Chapter 9 Conclusion This short introduction covered the basics of speech recognition up to the principles of connected word recognition. However. we will not have enough training samples for the creation of word models.
H. [HL92] X. The use of a onestage dynamic programming algorithm for connected word recognition. MIT Press. Laface and R. 77(2):257– 286. Ney. Recent Advances. Ed. Phonene classiﬁcation using semicontinuous hidden markov models.R. Pieraccini. Speech Recognition and Understanding. 64(6):1251–1270. February 1989. Rabiner and Frank K. H. 1992. B.H. Speech and Signal Processing. Oran Brigham. 40(5):1962–1067. Sondhi. 64(6):1211–1234. Some properties of continuous hidden markov model representations.R. August 1989. Levinson L. Lee and R. E. Juang and M. Jelinek. AT & T Technical Journal. IEEE Transactions on Acoustics. 67 [Jel98] [LRS85a] [LRS85b] [LRS89] [Ney84] [Ney93] [Rab89] . Modeling and search in continuous speech recognition. pages 491–498. De Mori.R. J. L. Sondhi. Statistical Methods for Speech Recognition. Huang and K. The Fourier Transform and its Applications.. Ney. on Signal Processessing. N. G. JulyAugust 1985. E. F. Levinson L. May 1992. IEEE Trans. Speech and Signal Processing. 2000. IEEE Transactions on Acoustics. Oldenbourg Verlag. D. A tutorial on hidden markov models and selected applications in speech recognition. 37(8):1214–1225. McGrawHill. JulyAugust 1985. Wilpon L. Speaker Independent Continuous Speech Recognition Using Continuous Density Hidden Markov Models. AT & T Technical Journal. M. Rabiner. [CHLP92] L. 1987. Rabiner. H. M. Rabiner C. volume F75 of NATO ASI Series. SOONG. F. Bracewell. April 1984. ASSP32(2):263–271. R. by P. 1993.H. Schnelle Fourier Transformation. S.R. S. Berlin Heidelberg. Juang and M. 1998. Proceedings of EUROSPEECH 1993. High performance connected digit recognition using hidden markov models. Proceedings of the IEEE. Lee. Rabiner. E. Springer Verlag. B. Recognition of isolated digits using hidden markov models with continuous mixture densities.Bibliography [Bra00] [Bri87] R.
April 1990. Young. Statistical modelling in continuous speech recognition. Hwang.H. G. SchukatTalamazzini. Mexico. J. 1994. S. IEEE Signal Processing Magazine. Proceedings ICASSP 1993. Hon. S. Odell S. IEEE ASSP Magazine. Huang. Juang. W. The general use of tying in phonemebased hmm speech recognisers. u Oldenbourg Verlag. Automatische Spracherkennung. Rabiner and B. Ruske. 1996. pages 4–16. Plainsboro NJ. Proc. Treebased state tying for high accuracy acoustic modelling.. Albuquerque. Proceedings of ICASSP 1992. E. [You92] [You96] [You01] . F. Morgan Kaufman Publishers Inc. Proc. 1995. [XDHH90] K. Young and P. pages 689–692. An introduction to hidden markov models. II:636–639. Lee X. Alleva X.R. [XHH93] F. J. Huang and H. 1992. Belin and M. Uniﬁed stochastic engine (use) for speech recognition. M¨ nchen. Proceedings ICASSP 1990. Automatische Spracherkennung. 13(5):45–57. Vieweg Verlag. 1993. Young. Woodland. 1988. Conference on Uncertainity in Artiﬁcial Intelligence. Human Language Technology Workshop. Seattle.BIBLIOGRAPHY 68 [RJ86] [Rus88] [ST95] [SYW94] L. pages 307–312. M. Large vocabulary continuous speech recognition: A review. D.C. January 1986. Young. S.J.. On semicontinuous hidden markov modeling. I(2):569–572. G. Int. WA. August 2001.
This action might not be possible to undo. Are you sure you want to continue?