The Application of
Hidden Markov Models
in Speech Recognition
Mark Gales
Cambridge University Engineering Department
Cambridge
CB2 1PZ
UK
mjfg@eng.cam.ac.uk
Steve Young
Cambridge University Engineering Department
Cambridge
CB2 1PZ
UK
sjy@eng.cam.ac.uk
Boston – Delft
Foundations and Trends® in Signal Processing
Published, sold and distributed by:
now Publishers Inc.
PO Box 1024
Hanover, MA 02339
USA
Tel. +17819854510
www.nowpublishers.com
sales@nowpublishers.com
Outside North America:
now Publishers Inc.
PO Box 179
2600 AD Delft
The Netherlands
Tel. +31651115274
The preferred citation for this publication is M. Gales and S. Young, The Application of Hidden Markov Models in Speech Recognition, Foundations and Trends® in Signal Processing, vol. 1, no. 3, pp. 195–304, 2007.
ISBN: 978-1-60198-120-2
© 2008 M. Gales and S. Young
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, mechanical, photocopying, recording
or otherwise, without prior written permission of the publishers.
Photocopying. In the USA: This journal is registered at the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC). The ‘services’ for users can be found on the internet at: www.copyright.com
For those organizations that have been granted a photocopy license, a separate system of payment has been arranged. Authorization does not extend to other kinds of copying, such as that for general distribution, for advertising or promotional purposes, for creating new collective works, or for resale. In the rest of the world: Permission to photocopy must be obtained from the copyright owner. Please apply to now Publishers Inc., PO Box 1024, Hanover, MA 02339, USA; Tel. +1-781-871-0245; www.nowpublishers.com; sales@nowpublishers.com
now Publishers Inc. has an exclusive license to publish this material worldwide. Permission
to use this content must be obtained from the copyright license holder. Please apply to now
Publishers, PO Box 179, 2600 AD Delft, The Netherlands, www.nowpublishers.com; email:
sales@nowpublishers.com
Foundations and Trends® in Signal Processing
Volume 1, Issue 3, 2007
Editorial Board
Editor-in-Chief:
Robert M. Gray
Dept of Electrical Engineering
Stanford University
350 Serra Mall
Stanford, CA 94305
USA
rmgray@stanford.edu
Editors
Abeer Alwan (UCLA)
John Apostolopoulos (HP Labs)
Pamela Cosman (UCSD)
Michelle Effros (California Institute of Technology)
Yonina Eldar (Technion)
Yariv Ephraim (George Mason University)
Sadaoki Furui (Tokyo Institute of Technology)
Vivek Goyal (MIT)
Sinan Gunturk (Courant Institute)
Christine Guillemot (IRISA)
Sheila Hemami (Cornell)
Lina Karam (Arizona State University)
Nick Kingsbury (Cambridge University)
Alex Kot (Nanyang Technical University)
Jelena Kovacevic (CMU)
B.S. Manjunath (UCSB)
Urbashi Mitra (USC)
Thrasos Pappas (Northwestern University)
Mihaela van der Schaar (UCLA)
Luis Torres (Technical University of Catalonia)
Michael Unser (EPFL)
P.P. Vaidyanathan (California Institute of Technology)
Rabab Ward (University of British Columbia)
Susie Wee (HP Labs)
Clifford J. Weinstein (MIT Lincoln Laboratories)
Min Wu (University of Maryland)
Josiane Zerubia (INRIA)
Editorial Scope
Foundations and Trends® in Signal Processing will publish survey and tutorial articles on the foundations, algorithms, methods, and applications of signal processing including the following topics:
• Adaptive signal processing
• Audio signal processing
• Biological and biomedical signal processing
• Complexity in signal processing
• Digital and multirate signal processing
• Distributed and network signal processing
• Image and video processing
• Linear and nonlinear filtering
• Multidimensional signal processing
• Multimodal signal processing
• Multiresolution signal processing
• Nonlinear signal processing
• Randomized algorithms in signal processing
• Sensor and multiple source signal processing, source separation
• Signal decompositions, subband and transform methods, sparse representations
• Signal processing for communications
• Signal processing for security and forensic analysis, biometric signal processing
• Signal quantization, sampling, analog-to-digital conversion, coding and compression
• Signal reconstruction, digital-to-analog conversion, enhancement, decoding and inverse problems
• Speech/audio/image/video compression
• Speech and spoken language processing
• Statistical/machine learning
• Statistical signal processing
• Classification and detection
• Estimation and regression
• Tree-structured methods
Information for Librarians
Foundations and Trends® in Signal Processing, 2007, Volume 1, 4 issues. ISSN paper version 1932-8346. ISSN online version 1932-8354. Also available as a combined paper and online subscription.
Foundations and Trends® in Signal Processing
Vol. 1, No. 3 (2007) 195–304
© 2008 M. Gales and S. Young
DOI: 10.1561/2000000004
The Application of Hidden Markov Models in Speech Recognition

Mark Gales¹ and Steve Young²

¹ Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK, mjfg@eng.cam.ac.uk
² Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK, sjy@eng.cam.ac.uk
Abstract
Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs.
Whereas the basic principles underlying HMM-based LVCSR are rather straightforward, the approximations and simplifying assumptions involved in a direct implementation of these principles would result in a system which has poor accuracy and unacceptable sensitivity to changes in operating environment. Thus, the practical application of HMMs in modern systems involves considerable sophistication.
The aim of this review is first to present the core architecture of an HMM-based LVCSR system and then describe the various refinements which are needed to achieve state-of-the-art performance. These refinements include feature projection, improved covariance modelling, discriminative parameter estimation, adaptation and normalisation, noise compensation and multi-pass system combination. The review concludes with a case study of LVCSR for Broadcast News and Conversation transcription in order to illustrate the techniques described.
Contents
1 Introduction
2 Architecture of an HMM-Based Recogniser
2.1 Feature Extraction
2.2 HMM Acoustic Models (Basic–Single Component)
2.3 N-gram Language Models
2.4 Decoding and Lattice Generation
3 HMM Structure Refinements
3.1 Dynamic Bayesian Networks
3.2 Gaussian Mixture Models
3.3 Feature Projections
3.4 Covariance Modelling
3.5 HMM Duration Modelling
3.6 HMMs for Speech Generation
4 Parameter Estimation
4.1 Discriminative Training
4.2 Implementation Issues
4.3 Lightly Supervised and Unsupervised Training
5 Adaptation and Normalisation
5.1 Feature-Based Schemes
5.2 Linear Transform-Based Schemes
5.3 Gender/Cluster Dependent Models
5.4 Maximum a Posteriori (MAP) Adaptation
5.5 Adaptive Training
6 Noise Robustness
6.1 Feature Enhancement
6.2 Model-Based Compensation
6.3 Uncertainty-Based Approaches
7 Multi-Pass Recognition Architectures
7.1 Example Architecture
7.2 System Combination
7.3 Complementary System Generation
7.4 Example Application — Broadcast News Transcription
Conclusions
Acknowledgments
Notations and Acronyms
References
1 Introduction
Automatic continuous speech recognition (CSR) has many potential
applications including command and control, dictation, transcription
of recorded speech, searching audio documents and interactive spoken
dialogues. The core of all speech recognition systems consists of a set
of statistical models representing the various sounds of the language to
be recognised. Since speech has temporal structure and can be encoded
as a sequence of spectral vectors spanning the audio frequency range,
the hidden Markov model (HMM) provides a natural framework for
constructing such models [13].
HMMs lie at the heart of virtually all modern speech recognition systems and although the basic framework has not changed significantly in the last decade or more, the detailed modelling techniques developed within this framework have evolved to a state of considerable sophistication (e.g. [40, 117, 163]). The result has been steady and significant progress and it is the aim of this review to describe the main techniques by which this has been achieved.
The foundations of modern HMM-based continuous speech recognition technology were laid down in the 1970s by groups at Carnegie Mellon and IBM who introduced the use of discrete density HMMs [11, 77, 108], and then later at Bell Labs [80, 81, 99] where continuous density HMMs were introduced.¹ An excellent tutorial covering the basic HMM technologies developed in this period is given in [141].
Reflecting the computational power of the time, initial development in the 1980s focussed on either discrete word speaker dependent large vocabulary systems (e.g. [78]) or whole word small vocabulary speaker independent applications (e.g. [142]). In the early 1990s, attention switched to continuous speaker-independent recognition. Starting with the artificial 1000 word Resource Management task [140], the technology developed rapidly and by the mid-1990s, reasonable accuracy was being achieved for unrestricted speaker independent dictation. Much of this development was driven by a series of DARPA and NSA programmes [188] which set ever more challenging tasks culminating most recently in systems for multilingual transcription of broadcast news programmes [134] and for spontaneous telephone conversations [62].

Many research groups have contributed to this progress, and each will typically have its own architectural perspective. For the sake of logical coherence, the presentation given here is somewhat biased towards the architecture developed at Cambridge University and supported by the HTK Software Toolkit [189].²
The review is organised as follows. Firstly, in Architecture of an HMM-Based Recogniser the key architectural ideas of a typical HMM-based recogniser are described. The intention here is to present an overall system design using very basic acoustic models. In particular, simple single Gaussian diagonal covariance HMMs are assumed. The following section HMM Structure Refinements then describes the various ways in which the limitations of these basic HMMs can be overcome, for example by transforming features and using more complex HMM output distributions. A key benefit of the statistical approach to speech recognition is that the required models are trained automatically on data.

¹ This very brief historical perspective is far from complete and out of necessity omits many other important contributions to the early years of HMM-based speech recognition.
² Available for free download at htk.eng.cam.ac.uk. This includes a recipe for building a state-of-the-art recogniser for the Resource Management task which illustrates a number of the approaches described in this review.
The section Parameter Estimation discusses the different objective functions that can be optimised in training and their effects on performance. Any system designed to work reliably in real-world applications must be robust to changes in speaker and the environment. The section on Adaptation and Normalisation presents a variety of generic techniques for achieving robustness. The following section Noise Robustness then discusses more specialised techniques for specifically handling additive and convolutional noise. The section Multi-Pass Recognition Architectures returns to the topic of the overall system architecture and explains how multiple passes over the speech signal using different model combinations can be exploited to further improve performance. This final section also describes some actual systems built for transcribing English, Mandarin and Arabic in order to illustrate the various techniques discussed in the review. The review concludes in Conclusions with some general observations and conclusions.
2 Architecture of an HMM-Based Recogniser
The principal components of a large vocabulary continuous speech
recogniser are illustrated in Figure 2.1. The input audio waveform from
a microphone is converted into a sequence of fixed size acoustic vectors $Y_{1:T} = y_1, \ldots, y_T$ in a process called feature extraction. The decoder then attempts to find the sequence of words $w_{1:L} = w_1, \ldots, w_L$ which is most likely to have generated $Y$, i.e. the decoder tries to find

$$\hat{w} = \arg\max_{w} \{P(w|Y)\}. \tag{2.1}$$

However, since $P(w|Y)$ is difficult to model directly,¹ Bayes' Rule is used to transform (2.1) into the equivalent problem of finding:

$$\hat{w} = \arg\max_{w} \{p(Y|w)P(w)\}. \tag{2.2}$$

The likelihood $p(Y|w)$ is determined by an acoustic model and the prior $P(w)$ is determined by a language model.²

¹ There are some systems that are based on discriminative models [54] where $P(w|Y)$ is modelled directly, rather than using generative models, such as HMMs, where the observation sequence is modelled, $p(Y|w)$.
² In practice, the acoustic model is not normalised and the language model is often scaled by an empirically determined constant and a word insertion penalty is added, i.e., in the log domain the total likelihood is calculated as $\log p(Y|w) + \alpha \log(P(w)) + \beta |w|$, where $\alpha$ is typically in the range 8–20 and $\beta$ is typically in the range 0 to $-20$.

Fig. 2.1 Architecture of an HMM-based recogniser.

The basic unit of sound represented by the acoustic model is the phone. For example, the word “bat” is composed of three phones /b/ /ae/ /t/. About 40 such phones are required for English.
For any given w, the corresponding acoustic model is synthesised by concatenating phone models to make words as defined by a pronunciation dictionary. The parameters of these phone models are estimated from training data consisting of speech waveforms and their orthographic transcriptions. The language model is typically an N-gram model in which the probability of each word is conditioned only on its N − 1 predecessors. The N-gram parameters are estimated by counting N-tuples in appropriate text corpora. The decoder operates by searching through all possible word sequences, using pruning to remove unlikely hypotheses, thereby keeping the search tractable. When the end of the utterance is reached, the most likely word sequence is output. Alternatively, modern decoders can generate lattices containing a compact representation of the most likely hypotheses.

The following sections describe these processes and components in more detail.
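Footnote 2 above describes how decoders actually combine the acoustic and language model scores in the log domain. The following sketch illustrates that scoring rule; the hypothesis scores and the particular α and β values are hypothetical, chosen only to show how the language model scale and word insertion penalty trade off against the acoustic likelihood.

```python
import math

def total_log_score(log_acoustic, log_lm, num_words, alpha=12.0, beta=-10.0):
    """Combined decoder score from footnote 2: log p(Y|w) + alpha*log P(w) + beta*|w|.
    alpha is the LM scale (typically 8-20); beta is the word insertion
    penalty (typically 0 to -20)."""
    return log_acoustic + alpha * log_lm + beta * num_words

# Two hypothetical competing hypotheses for the same audio Y.
hyps = {
    "recognise speech":   {"log_ac": -1250.0, "log_lm": math.log(1e-4), "n": 2},
    "wreck a nice beach": {"log_ac": -1245.0, "log_lm": math.log(1e-7), "n": 4},
}
scored = {w: total_log_score(h["log_ac"], h["log_lm"], h["n"]) for w, h in hyps.items()}
best = max(scored, key=scored.get)
```

Here the second hypothesis has a slightly better acoustic score, but once the scaled language model prior and the insertion penalty are included the first hypothesis wins, which is exactly the balancing act the scale and penalty are tuned to achieve.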
2.1 Feature Extraction
The feature extraction stage seeks to provide a compact representation of the speech waveform. This form should minimise the loss of information that discriminates between words, and provide a good match with the distributional assumptions made by the acoustic models. For example, if diagonal covariance Gaussian distributions are used for the state-output distributions then the features should be designed to be Gaussian and uncorrelated.
Feature vectors are typically computed every 10 ms using an overlapping analysis window of around 25 ms. One of the simplest and most widely used encoding schemes is based on mel-frequency cepstral coefficients (MFCCs) [32]. These are generated by applying a truncated discrete cosine transformation (DCT) to a log spectral estimate computed by smoothing an FFT with around 20 frequency bins distributed non-linearly across the speech spectrum. The non-linear frequency scale used is called a mel scale and it approximates the response of the human ear. The DCT is applied in order to smooth the spectral estimate and approximately decorrelate the feature elements. After the cosine transform the first element represents the average of the log-energy of the frequency bins. This is sometimes replaced by the log-energy of the frame, or removed completely.
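The MFCC pipeline just described can be sketched in a few lines of NumPy. This is a simplified illustration rather than the exact HTK front-end: the FFT size, the 20-filter/12-coefficient configuration, and the particular mel formula below are common choices but are assumptions of this sketch.

```python
import numpy as np

def mel(f):       # Hz -> mel, using a common mel-scale formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):   # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=20, n_ceps=12):
    """MFCCs for one 25 ms frame: windowed FFT -> ~20 mel-spaced triangular
    filters -> log filterbank energies -> truncated DCT."""
    n_fft = 512
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Triangular filters with edges spaced uniformly on the mel scale
    edges = mel_inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, len(spec)))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_e = np.log(fbank @ spec + 1e-10)
    # Truncated DCT-II smooths the log spectral estimate and decorrelates it
    k = np.arange(n_ceps)[:, None] * (np.arange(n_filters) + 0.5)[None, :]
    dct = np.cos(np.pi * k / n_filters)
    return dct @ log_e

frame = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)  # 25 ms of a 440 Hz tone
c = mfcc_frame(frame)
```

Keeping only the first 12 or so DCT coefficients is what makes the representation both compact and approximately decorrelated, as the text notes.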
Further psychoacoustic constraints are incorporated into a related encoding called perceptual linear prediction (PLP) [74]. PLP computes linear prediction coefficients from a perceptually weighted non-linearly compressed power spectrum and then transforms the linear prediction coefficients to cepstral coefficients. In practice, PLP can give small improvements over MFCCs, especially in noisy environments, and hence it is the preferred encoding for many systems [185].
In addition to the spectral coefficients, first-order (delta) and second-order (delta–delta) regression coefficients are often appended in a heuristic attempt to compensate for the conditional independence assumption made by the HMM-based acoustic models [47]. If the original (static) feature vector is $y^s_t$, then the delta parameter, $\Delta y^s_t$, is given by

$$\Delta y^s_t = \frac{\sum_{i=1}^{n} w_i \left( y^s_{t+i} - y^s_{t-i} \right)}{2 \sum_{i=1}^{n} w_i^2} \tag{2.3}$$
where n is the window width and $w_i$ are the regression coefficients.³ The delta–delta parameters, $\Delta^2 y^s_t$, are derived in the same fashion, but using differences of the delta parameters. When concatenated together these form the feature vector $y_t$,

$$y_t = \left[\, y_t^{s\,T}\ \ \Delta y_t^{s\,T}\ \ \Delta^2 y_t^{s\,T} \,\right]^{T}. \tag{2.4}$$

The final result is a feature vector whose dimensionality is typically around 40 and which has been partially but not fully decorrelated.
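Equation (2.3), together with the edge-frame replication mentioned in footnote 3, can be implemented directly. The regression weights $w_i = i$ used below are a common convention and an assumption of this sketch.

```python
import numpy as np

def add_deltas(static, n=2):
    """Append delta and delta-delta regression coefficients per Eq. (2.3),
    with w_i = i and edge frames replicated to keep the frame count fixed."""
    T, d = static.shape
    w = np.arange(1, n + 1, dtype=float)
    denom = 2.0 * np.sum(w ** 2)

    def deltas(x):
        # Replicate the first and last frames to fill the regression window
        padded = np.concatenate([x[:1].repeat(n, axis=0), x, x[-1:].repeat(n, axis=0)])
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            c = t + n  # index of frame t in the padded sequence
            out[t] = sum(w[i - 1] * (padded[c + i] - padded[c - i])
                         for i in range(1, n + 1)) / denom
        return out

    d1 = deltas(static)        # delta parameters
    d2 = deltas(d1)            # delta-delta: same formula on the deltas
    return np.hstack([static, d1, d2])   # y_t = [y^s_t, dy^s_t, d2y^s_t]

feats = add_deltas(np.random.RandomState(0).randn(50, 13))
```

With 13 static coefficients this yields the familiar 39-dimensional feature vector, close to the "around 40" dimensionality mentioned above.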
2.2 HMM Acoustic Models (Basic–Single Component)
As noted above, each spoken word w is decomposed into a sequence of $K_w$ basic sounds called base phones. This sequence is called its pronunciation $q^{(w)}_{1:K_w} = q_1, \ldots, q_{K_w}$. To allow for the possibility of multiple pronunciations, the likelihood $p(Y|w)$ can be computed over multiple pronunciations⁴

$$p(Y|w) = \sum_{Q} p(Y|Q)P(Q|w), \tag{2.5}$$

where the summation is over all valid pronunciation sequences for w, Q is a particular sequence of pronunciations,

$$P(Q|w) = \prod_{l=1}^{L} P(q^{(w_l)}|w_l), \tag{2.6}$$

and where each $q^{(w_l)}$ is a valid pronunciation for word $w_l$. In practice, there will only be a very small number of alternative pronunciations for each $w_l$, making the summation in (2.5) easily tractable.

³ In HTK, to ensure that the same number of frames is maintained after adding delta and delta–delta parameters, the start and end elements are replicated to fill the regression window.
⁴ Recognisers often approximate this by a max operation so that alternative pronunciations can be decoded as though they were alternative word hypotheses.

Fig. 2.2 HMM-based phone model.

Each base phone q is represented by a continuous density HMM of the form illustrated in Figure 2.2 with transition probability parameters $\{a_{ij}\}$ and output observation distributions $\{b_j()\}$. In operation, an HMM makes a transition from its current state to one of its connected states every time step. The probability of making a particular transition from state $s_i$ to state $s_j$ is given by the transition probability $a_{ij}$. On entering a state, a feature vector is generated using the distribution associated with the state being entered, $b_j()$. This form of process yields the standard conditional independence assumptions for an HMM:

• states are conditionally independent of all other states given the previous state;
• observations are conditionally independent of all other observations given the state that generated them.

For a more detailed discussion of the operation of an HMM see [141].
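The generative process just described — choose a successor state via $a_{ij}$, then emit a vector from the entered state's output distribution $b_j()$ — can be made concrete by sampling from a toy left-to-right phone model with non-emitting entry and exit states. All parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 3-emitting-state left-to-right phone model with 1-D Gaussian outputs.
# State 0 is the non-emitting entry state, state 4 the non-emitting exit state.
A = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # entry -> state 1
    [0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.4],   # state 3 -> exit
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
means, variances = np.array([0., 3., -2.]), np.array([1., 0.5, 2.])

def sample(max_len=100):
    """Generate (states, observations): at each time step pick a successor
    state via a_ij, then emit from that state's output Gaussian b_j()."""
    state, states, obs = 0, [], []
    for _ in range(max_len):
        state = rng.choice(5, p=A[state])
        if state == 4:           # reached the non-emitting exit state
            break
        states.append(state)
        obs.append(rng.normal(means[state - 1], np.sqrt(variances[state - 1])))
    return states, obs

states, obs = sample()
```

Because the topology is strictly left-to-right, every sampled state sequence is monotone, and each self-loop re-enters the same state and emits another vector — which is exactly the duration behaviour the transition probabilities encode.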
For now, single multivariate Gaussians will be assumed for the output distribution:

$$b_j(y) = \mathcal{N}(y; \mu^{(j)}, \Sigma^{(j)}), \tag{2.7}$$

where $\mu^{(j)}$ is the mean of state $s_j$ and $\Sigma^{(j)}$ is its covariance. Since the dimensionality of the acoustic vector y is relatively high, the covariances are often constrained to be diagonal. Later in HMM Structure Refinements, the benefits of using mixtures of Gaussians will be discussed.

Given the composite HMM Q formed by concatenating all of the constituent base phones $q^{(w_1)}, \ldots, q^{(w_L)}$, the acoustic likelihood is given by

$$p(Y|Q) = \sum_{\theta} p(\theta, Y|Q), \tag{2.8}$$

where $\theta = \theta_0, \ldots, \theta_{T+1}$ is a state sequence through the composite model and

$$p(\theta, Y|Q) = a_{\theta_0 \theta_1} \prod_{t=1}^{T} b_{\theta_t}(y_t)\, a_{\theta_t \theta_{t+1}}. \tag{2.9}$$

In this equation, $\theta_0$ and $\theta_{T+1}$ correspond to the non-emitting entry and exit states shown in Figure 2.2. These are included to simplify the process of concatenating phone models to make words. For simplicity in what follows, these non-emitting states will be ignored and the focus will be on the state sequence $\theta_1, \ldots, \theta_T$.
The acoustic model parameters $\lambda = [\{a_{ij}\}, \{b_j()\}]$ can be efficiently estimated from a corpus of training utterances using the forward–backward algorithm [14], which is an example of expectation-maximisation (EM) [33]. For each utterance $Y^{(r)}$, $r = 1, \ldots, R$, of length $T^{(r)}$, the sequence of baseforms, the HMMs that correspond to the word sequence in the utterance, is found and the corresponding composite HMM constructed. In the first phase of the algorithm, the E-step, the forward probability $\alpha^{(rj)}_t = p(Y^{(r)}_{1:t}, \theta_t = s_j; \lambda)$ and the backward probability $\beta^{(ri)}_t = p(Y^{(r)}_{t+1:T^{(r)}} | \theta_t = s_i; \lambda)$ are calculated via the following recursions

$$\alpha^{(rj)}_t = \left[ \sum_i \alpha^{(ri)}_{t-1} a_{ij} \right] b_j\!\left( y^{(r)}_t \right) \tag{2.10}$$

$$\beta^{(ri)}_t = \sum_j a_{ij}\, b_j\!\left( y^{(r)}_{t+1} \right) \beta^{(rj)}_{t+1}, \tag{2.11}$$

where i and j are summed over all states. When performing these recursions underflow can occur for long speech segments, hence in practice the log-probabilities are stored and log arithmetic is used to avoid this problem [89].⁵

⁵ This is the technique used in HTK [189].

Given the forward and backward probabilities, the probability of the model occupying state $s_j$ at time t for any given utterance r is just

$$\gamma^{(rj)}_t = P(\theta_t = s_j | Y^{(r)}; \lambda) = \frac{1}{P^{(r)}}\, \alpha^{(rj)}_t \beta^{(rj)}_t, \tag{2.12}$$
where $P^{(r)} = p(Y^{(r)}; \lambda)$. These state occupation probabilities, also called occupation counts, represent a soft alignment of the model states to the data, and it is straightforward to show that the new set of Gaussian parameters defined by [83]

$$\hat{\mu}^{(j)} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T^{(r)}} \gamma^{(rj)}_t\, y^{(r)}_t}{\sum_{r=1}^{R} \sum_{t=1}^{T^{(r)}} \gamma^{(rj)}_t} \tag{2.13}$$

$$\hat{\Sigma}^{(j)} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T^{(r)}} \gamma^{(rj)}_t\, (y^{(r)}_t - \hat{\mu}^{(j)})(y^{(r)}_t - \hat{\mu}^{(j)})^{T}}{\sum_{r=1}^{R} \sum_{t=1}^{T^{(r)}} \gamma^{(rj)}_t} \tag{2.14}$$

maximise the likelihood of the data given these alignments. A similar re-estimation equation can be derived for the transition probabilities

$$\hat{a}_{ij} = \frac{\sum_{r=1}^{R} \frac{1}{P^{(r)}} \sum_{t=1}^{T^{(r)}} \alpha^{(ri)}_t\, a_{ij}\, b_j\!\left( y^{(r)}_{t+1} \right) \beta^{(rj)}_{t+1}}{\sum_{r=1}^{R} \sum_{t=1}^{T^{(r)}} \gamma^{(ri)}_t}. \tag{2.15}$$

This is the second or M-step of the algorithm. Starting from some initial estimate of the parameters, $\lambda^{(0)}$, successive iterations of the EM algorithm yield parameter sets $\lambda^{(1)}, \lambda^{(2)}, \ldots$ which are guaranteed to improve the likelihood up to some local maximum. A common choice for the initial parameter set $\lambda^{(0)}$ is to assign the global mean and covariance of the data to the Gaussian output distributions and to set all transition probabilities to be equal. This gives a so-called flat start model.
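To make the E- and M-steps concrete, here is a minimal sketch of one EM iteration for a two-state HMM with scalar Gaussian outputs, implementing Eqs. (2.10)–(2.14) directly. The toy data, initial parameters and fixed transition matrix are illustrative assumptions, and probabilities are kept in the linear domain for brevity rather than the log domain used in practice.

```python
import numpy as np

def gauss(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_step(Y, A, mu, var):
    """One EM iteration (Gaussian parameters only) following Eqs. (2.10)-(2.14)."""
    T, N = len(Y), len(mu)
    B = np.array([[gauss(y, mu[j], var[j]) for j in range(N)] for y in Y])  # b_j(y_t)
    alpha, beta = np.zeros((T, N)), np.zeros((T, N))
    alpha[0] = np.full(N, 1.0 / N) * B[0]          # uniform initial occupancy
    for t in range(1, T):                           # forward recursion, Eq. (2.10)
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):                  # backward recursion, Eq. (2.11)
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    P = alpha[T - 1].sum()                          # p(Y; lambda)
    gamma = alpha * beta / P                        # occupation counts, Eq. (2.12)
    occ = gamma.sum(axis=0)
    mu_new = (gamma * Y[:, None]).sum(axis=0) / occ                   # Eq. (2.13)
    var_new = (gamma * (Y[:, None] - mu_new) ** 2).sum(axis=0) / occ  # Eq. (2.14)
    return mu_new, var_new, P

# Toy data drawn from two well-separated regimes; all values hypothetical.
rng = np.random.default_rng(1)
Y = np.concatenate([rng.normal(-2, 1, 40), rng.normal(2, 1, 40)])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
mu, var = np.array([-1.0, 1.0]), np.array([1.0, 1.0])
Ps = []
for _ in range(5):
    mu, var, P = em_step(Y, A, mu, var)
    Ps.append(P)
```

Each iteration is guaranteed not to decrease the likelihood, and the means move towards the two regimes in the data — the soft alignment by the occupation counts is what makes the closed-form updates possible.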
This approach to acoustic modelling is often referred to as the beads-on-a-string model, so-called because all speech utterances are represented by concatenating a sequence of phone models together. The major problem with this is that decomposing each vocabulary word into a sequence of context-independent base phones fails to capture the very large degree of context-dependent variation that exists in real speech. For example, the base form pronunciations for “mood” and “cool” would use the same vowel for “oo,” yet in practice the realisations of “oo” in the two contexts are very different due to the influence of the preceding and following consonant. Context independent phone models are referred to as monophones.

A simple way to mitigate this problem is to use a unique phone model for every possible pair of left and right neighbours. The resulting
models are called triphones and if there are N base phones, there are $N^3$ potential triphones. To avoid the resulting data sparsity problems, the complete set of logical triphones L can be mapped to a reduced set of physical models P by clustering and tying together the parameters in each cluster. This mapping process is illustrated in Figure 2.3 and the parameter tying is illustrated in Figure 2.4, where the notation x − q + y denotes the triphone corresponding to phone q spoken in the context of a preceding phone x and a following phone y. Each base phone pronunciation q is derived by simple lookup from the pronunciation dictionary; these are then mapped to logical phones according to the context; finally the logical phones are mapped to physical models.

Fig. 2.3 Context dependent phone modelling.
Fig. 2.4 Formation of tied-state phone models.
Notice that the context-dependence spreads across word boundaries and this is essential for capturing many important phonological processes. For example, the /p/ in “stop that” has its burst suppressed by the following consonant.
The clustering of logical to physical models typically operates at the state-level rather than the model level since it is simpler and it allows a larger set of physical models to be robustly estimated. The choice of which states to tie is commonly made using decision trees [190]. Each state position⁶ of each phone q has a binary tree associated with it. Each node of the tree carries a question regarding the context. To cluster state i of phone q, all states i of all of the logical models derived from q are collected into a single pool at the root node of the tree. Depending on the answer to the question at each node, the pool of states is successively split until all states have trickled down to leaf nodes. All states in each leaf node are then tied to form a physical model. The questions at each node are selected from a predetermined set to maximise the likelihood of the training data given the final set of state-tyings. If the state output distributions are single component Gaussians and the state occupation counts are known, then the increase in likelihood achieved by splitting the Gaussians at any node can be calculated simply from the counts and model parameters without reference to the training data. Thus, the decision trees can be grown very efficiently using a greedy iterative node splitting algorithm. Figure 2.5 illustrates this tree-based clustering. In the figure, the logical phones s-aw+n and t-aw+n will both be assigned to leaf node 3 and hence they will share the same central state of the representative physical model.⁷

The partitioning of states using phonetically driven decision trees has several advantages. In particular, logical models which are required but were not seen at all in the training data can be easily synthesised. One disadvantage is that the partitioning can be rather coarse. This problem can be reduced using so-called soft-tying [109]. In this scheme, a post-processing stage groups each state with its one or two nearest neighbours and pools all of their Gaussians. Thus, the single Gaussian models are converted to mixture Gaussian models whilst holding the total number of Gaussians in the system constant.

⁶ Usually each phone model has three states.
⁷ The total number of tied-states in a large vocabulary speaker independent system typically ranges between 1000 and 10,000 states.

Fig. 2.5 Decision tree clustering.
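The claim above — that split likelihood gains can be computed from counts and model parameters alone — can be verified directly: under the single-Gaussian assumption, the log-likelihood of a pool of tied states depends only on their occupancy counts and sufficient statistics. The states, the context question and all statistics below are hypothetical, and the 1-D feature is a deliberate simplification.

```python
import numpy as np

def pooled_stats(states):
    """Merge per-state sufficient statistics (count, sum, sum of squares)."""
    n = sum(s["n"] for s in states)
    sx = sum(s["sx"] for s in states)
    sxx = sum(s["sxx"] for s in states)
    return n, sx, sxx

def pool_loglik(states):
    """Log-likelihood of a pool under its pooled ML Gaussian,
    L(S) = -gamma(S)/2 * (log(2*pi*var) + 1), from stats alone."""
    n, sx, sxx = pooled_stats(states)
    var = sxx / n - (sx / n) ** 2
    return -0.5 * n * (np.log(2 * np.pi * var) + 1.0)

def split_gain(pool, answer_yes):
    """Likelihood gain of splitting the pool by one yes/no context question."""
    yes = [s for s in pool if answer_yes(s)]
    no = [s for s in pool if not answer_yes(s)]
    if not yes or not no:
        return 0.0
    return pool_loglik(yes) + pool_loglik(no) - pool_loglik(pool)

def stats(n, mean, var):   # build sufficient stats from summary values
    return {"n": n, "sx": n * mean, "sxx": n * (var + mean ** 2)}

# Hypothetical centre states of four logical triphones of phone "aw".
pool = [
    {"name": "s-aw+n", "left_sibilant": True,  **stats(200, 1.0, 1.0)},
    {"name": "t-aw+n", "left_sibilant": False, **stats(150, -1.2, 1.0)},
    {"name": "z-aw+n", "left_sibilant": True,  **stats(100, 0.9, 1.0)},
    {"name": "d-aw+n", "left_sibilant": False, **stats(120, -1.0, 1.0)},
]
gain = split_gain(pool, lambda s: s["left_sibilant"])
```

The greedy tree-growing algorithm simply evaluates `split_gain` for every candidate question at a node and applies the best one, which is why no pass over the training data is needed once the occupancy statistics have been accumulated.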
To summarise, the core acoustic models of a modern speech recogniser typically consist of a set of tied three-state HMMs with Gaussian output distributions. This core is commonly built in the following steps [189, Ch. 3]:

(1) A flat-start monophone set is created in which each base phone is a monophone single-Gaussian HMM with means and covariances equal to the mean and covariance of the training data.
(2) The parameters of the Gaussian monophones are re-estimated using 3 or 4 iterations of EM.
(3) Each single Gaussian monophone q is cloned once for each distinct triphone x − q + y that appears in the training data.
(4) The resulting set of training-data triphones is again re-estimated using EM and the state occupation counts of the last iteration are saved.
(5) A decision tree is created for each state in each base phone, the training-data triphones are mapped into a smaller set of tied-state triphones and iteratively re-estimated using EM.

The final result is the required tied-state context-dependent acoustic model set.
2.3 N-gram Language Models
The prior probability of a word sequence $w = w_1, \ldots, w_K$ required in (2.2) is given by

$$P(w) = \prod_{k=1}^{K} P(w_k | w_{k-1}, \ldots, w_1). \tag{2.16}$$

For large vocabulary recognition, the conditioning word history in (2.16) is usually truncated to N − 1 words to form an N-gram language model

$$P(w) = \prod_{k=1}^{K} P(w_k | w_{k-1}, w_{k-2}, \ldots, w_{k-N+1}), \tag{2.17}$$

where N is typically in the range 2–4. Language models are often assessed in terms of their perplexity, H, which is defined as

$$H = -\lim_{K \to \infty} \frac{1}{K} \log_2\!\left( P(w_1, \ldots, w_K) \right) \approx -\frac{1}{K} \sum_{k=1}^{K} \log_2\!\left( P(w_k | w_{k-1}, w_{k-2}, \ldots, w_{k-N+1}) \right),$$

where the approximation is used for N-gram language models with a finite length word sequence.
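The finite-sequence approximation above is straightforward to compute; perplexity is conventionally quoted as $2^H$. In this sketch the model is a hypothetical uniform distribution over a 100-word vocabulary, for which the perplexity is exactly the vocabulary size.

```python
import math

def perplexity(sentence, prob):
    """Perplexity 2^H where H = -(1/K) * sum_k log2 P(w_k | history),
    per the finite-sequence approximation. `prob` maps (history, word) -> P."""
    K = len(sentence)
    H = -sum(math.log2(prob(sentence[:k], sentence[k])) for k in range(K)) / K
    return 2 ** H

# A hypothetical model that is uniform over a 100-word vocabulary:
# every word is equally surprising, so the perplexity equals 100.
ppl = perplexity(["we", "like", "speech"] * 10, lambda hist, w: 1.0 / 100)
```

Intuitively, a perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 equally likely next words.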
The N-gram probabilities are estimated from training texts by counting N-gram occurrences to form maximum likelihood (ML) parameter estimates. For example, let $C(w_{k-2} w_{k-1} w_k)$ represent the number of occurrences of the three words $w_{k-2} w_{k-1} w_k$ and similarly for $C(w_{k-2} w_{k-1})$; then

$$P(w_k | w_{k-1}, w_{k-2}) \approx \frac{C(w_{k-2} w_{k-1} w_k)}{C(w_{k-2} w_{k-1})}. \tag{2.18}$$
The major problem with this simple ML estimation scheme is data sparsity. This can be mitigated by a combination of discounting and backing-off. For example, using so-called Katz smoothing [85]

$$P(w_k | w_{k-1}, w_{k-2}) = \begin{cases} d\, \dfrac{C(w_{k-2} w_{k-1} w_k)}{C(w_{k-2} w_{k-1})} & \text{if } 0 < C \le C' \\[1.5ex] \dfrac{C(w_{k-2} w_{k-1} w_k)}{C(w_{k-2} w_{k-1})} & \text{if } C > C' \\[1.5ex] \alpha(w_{k-1}, w_{k-2})\, P(w_k | w_{k-1}) & \text{otherwise,} \end{cases} \tag{2.19}$$

where $C'$ is a count threshold, C is shorthand for $C(w_{k-2} w_{k-1} w_k)$, d is a discount coefficient and α is a normalisation constant. Thus, when the N-gram count exceeds some threshold, the ML estimate is used. When the count is small the same ML estimate is used but discounted slightly. The discounted probability mass is then distributed to the unseen N-grams, which are approximated by a weighted version of the corresponding bigram. This idea can be applied recursively to estimate any sparse N-gram in terms of a set of back-off weights and (N−1)-grams. The discounting coefficient is based on the Turing-Good estimate $d = (r + 1)n_{r+1}/(r\, n_r)$, where $n_r$ is the number of N-grams that occur exactly r times in the training data. There are many variations on this approach [25]. For example, when training data is very sparse, Kneser–Ney smoothing is particularly effective [127].
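A minimal back-off model in the spirit of Eq. (2.19) can be sketched one order down — bigrams backing off to unigrams — so that it fits a toy corpus. Counts above a threshold use the ML estimate, small counts are discounted by a fixed d (an illustrative simplification, rather than the Turing-Good estimate), and the freed mass is spread over unseen words via a normalised α. The corpus here is hypothetical.

```python
from collections import Counter

def katz_bigram(counts_bi, counts_uni, unigram_p, C_thresh=1, d=0.5):
    """Bigram analogue of Eq. (2.19): ML / discounted ML / back-off."""
    def p_seen(w, h):
        c, ch = counts_bi[(h, w)], counts_uni[h]
        return c / ch if c > C_thresh else d * c / ch

    def p(w, h):
        c, ch = counts_bi[(h, w)], counts_uni[h]
        if c > C_thresh:                 # ML estimate for well-observed bigrams
            return c / ch
        if c > 0:                        # discounted ML for rare bigrams
            return d * c / ch
        # Back-off: alpha(h) redistributes the discounted mass over unseen words
        seen = [v for (hh, v) in counts_bi if hh == h]
        left = 1.0 - sum(p_seen(v, h) for v in seen)
        norm = 1.0 - sum(unigram_p[v] for v in seen)
        return left / norm * unigram_p[w]

    return p

corpus = "the cat sat on the mat the cat ran".split()
counts_uni = Counter(corpus)
counts_bi = Counter(zip(corpus, corpus[1:]))
total = sum(counts_uni.values())
unigram_p = {w: counts_uni[w] / total for w in counts_uni}
p = katz_bigram(counts_bi, counts_uni, unigram_p)
# The distribution over all next words for a given history must sum to one.
mass = sum(p(w, "the") for w in counts_uni)
```

The normalisation constant α is exactly what makes the discounted-plus-backed-off probabilities sum to one for each history, which is the property the assertion below checks.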
An alternative approach to robust language model estimation is to
use classbased models in which for every word w
k
there is a corre
sponding class c
k
[21, 90]. Then,
P(w) =
K
k=1
P(w
k
c
k
)p(c
k
c
k−1
, . . . , c
k−N+1
). (2.20)
As for word-based models, the class N-gram probabilities are estimated using ML, but since there are far fewer classes (typically a few hundred) data sparsity is much less of an issue. The classes themselves are chosen to optimise the likelihood of the training set assuming a bigram class model. It can be shown that when a word is moved from one class to another, the change in perplexity depends only on the counts of a relatively small number of bigrams. Hence, an iterative algorithm can be implemented which repeatedly scans through the vocabulary, testing each word to see if moving it to some other class would increase the likelihood [115].
In practice it is found that for reasonably sized training sets,^8 an effective language model for large vocabulary applications consists of a smoothed word-based 3-gram or 4-gram interpolated with a class-based trigram.
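The class-based decomposition (2.20) can be sketched minimally for N = 2. The word-to-class map and all probabilities below are invented purely for illustration; a real model estimates them by ML from a class-labelled corpus.

```python
import math

# Hypothetical word-to-class map and class-conditional statistics.
word_class = {"monday": "DAY", "tuesday": "DAY", "on": "FN", "see": "V"}
p_word_given_class = {("monday", "DAY"): 0.5, ("tuesday", "DAY"): 0.5,
                      ("on", "FN"): 1.0, ("see", "V"): 1.0}
p_class_bigram = {("<s>", "V"): 0.6, ("V", "FN"): 0.5, ("FN", "DAY"): 0.8}

def class_lm_logprob(words):
    """log P(w) under a class bigram model, i.e. (2.20) with N = 2:
    each word contributes P(w_k | c_k) * p(c_k | c_{k-1})."""
    logp, prev = 0.0, "<s>"
    for w in words:
        c = word_class[w]
        logp += math.log(p_word_given_class[(w, c)])
        logp += math.log(p_class_bigram[(prev, c)])
        prev = c
    return logp
```

Because probabilities are shared across all words in a class, "see on tuesday" gets the same class-sequence score as "see on monday", which is exactly the generalisation that combats sparsity.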
2.4 Decoding and Lattice Generation
As noted in the introduction to this section, the most likely word sequence \hat{w} given a sequence of feature vectors Y_{1:T} is found by searching all possible state sequences arising from all possible word sequences for the sequence which was most likely to have generated the observed data Y_{1:T}. An efficient way to solve this problem is to use dynamic programming. Let \phi_t^{(j)} = \max_{\theta} \{ p(Y_{1:t}, \theta_t = s_j; \lambda) \}, i.e., the maximum probability of observing the partial sequence Y_{1:t} and then being in state s_j at time t given the model parameters \lambda. This probability can be efficiently computed using the Viterbi algorithm [177]
  \phi_t^{(j)} = \max_i \left\{ \phi_{t-1}^{(i)} a_{ij} \right\} b_j(y_t).   (2.21)
It is initialised by setting \phi_0^{(j)} to 1 for the initial, non-emitting, entry state and 0 for all other states. The probability of the most likely word sequence is then given by \max_j \{ \phi_T^{(j)} \} and, if every maximisation decision is recorded, a trace-back will yield the required best matching state/word sequence.
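The recursion (2.21) and the traceback just described can be sketched as follows, working in the log domain as practical decoders do. The non-emitting entry state is folded into an initial log-probability vector, an assumption made here to keep the sketch compact.

```python
import numpy as np

def viterbi(log_pi, log_a, log_b):
    """Viterbi recursion (2.21) in the log domain.

    log_pi: (N,) log probability of entering each emitting state at t=0
            (the non-emitting entry state is folded in here);
    log_a:  (N, N) log transition matrix; log_b: (T, N) log output probs.
    Returns the best state sequence and its log probability."""
    T, N = log_b.shape
    phi = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    phi[0] = log_pi + log_b[0]
    for t in range(1, T):
        scores = phi[t - 1][:, None] + log_a   # scores[i, j] = phi_i + log a_ij
        back[t] = np.argmax(scores, axis=0)    # record each maximisation decision
        phi[t] = scores[back[t], np.arange(N)] + log_b[t]
    path = [int(np.argmax(phi[-1]))]           # trace back from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(phi[-1]))
```

The `back` array is exactly the "record of every maximisation decision" mentioned in the text; the trace-back simply follows it from the best final state.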
In practice, a direct implementation of the Viterbi algorithm becomes unmanageably complex for continuous speech where the topology of the models, the language model constraints and the need to

^8 i.e., 10^7 words.
bound the computation must all be taken into account. N-gram language models and cross-word triphone contexts are particularly problematic since they greatly expand the search space. To deal with this, a number of different architectural approaches have evolved. For Viterbi decoding, the search space can either be constrained by maintaining multiple hypotheses in parallel [173, 191, 192] or it can be expanded dynamically as the search progresses [7, 69, 130, 132]. Alternatively, a completely different approach can be taken where the breadth-first approach of the Viterbi algorithm is replaced by a depth-first search. This gives rise to a class of recognisers called stack decoders. These can be very efficient; however, because they must compare hypotheses of different lengths, their run-time search characteristics can be difficult to control [76, 135]. Finally, recent advances in weighted finite-state transducer technology enable all of the required information (acoustic models, pronunciation, language model probabilities, etc.) to be integrated into a single, very large, but highly optimised network [122]. This approach offers both flexibility and efficiency and is therefore extremely useful for both research and practical applications.
Although decoders are designed primarily to find the solution to (2.21), in practice it is relatively simple to generate not just the most likely hypothesis but the N-best set of hypotheses, where N is usually in the range 100–1000. This is extremely useful since it allows multiple passes over the data without the computational expense of repeatedly solving (2.21) from scratch. A compact and efficient structure for storing these hypotheses is the word lattice [144, 167, 187].
A word lattice consists of a set of nodes representing points in time
and a set of spanning arcs representing word hypotheses. An example is
shown in Figure 2.6 part (a). In addition to the word IDs shown in the
ﬁgure, each arc can also carry score information such as the acoustic
and language model scores.
Lattices are extremely flexible. For example, they can be rescored by using them as an input recognition network, and they can be expanded to allow rescoring by a higher order language model. They can also be compacted into a very efficient representation called a confusion network [42, 114]. This is illustrated in Figure 2.6 part (b), where arcs labelled with the null symbol indicate null transitions.

Fig. 2.6 Example lattice and confusion network.

In a confusion network, the nodes no longer correspond to discrete points in time; instead they simply enforce
word sequence constraints. Thus, parallel arcs in the confusion network do not necessarily correspond to the same acoustic segment. However, it is assumed that most of the time the overlap is sufficient to enable parallel arcs to be regarded as competing hypotheses. A confusion network has the property that for every path through the original lattice, there exists a corresponding path through the confusion network. Each arc in the confusion network carries the posterior probability of the corresponding word w. This is computed by finding the link probability of w in the lattice using a forward–backward procedure, summing over all occurrences of w and then normalising so that all competing word arcs in the confusion network sum to one. Confusion networks can be used for minimum word-error decoding [165] (an example of minimum Bayes' risk (MBR) decoding [22]), to provide confidence scores and for merging the outputs of different decoders [41, 43, 63, 72] (see Multi-Pass Recognition Architectures).
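The final normalisation step described above, summing lattice link posteriors over all occurrences of a word and renormalising each slot, might look as follows. The arc posteriors are assumed to have already been computed by the forward–backward procedure over the lattice; the values here are invented.

```python
from collections import defaultdict

def slot_posteriors(arc_posteriors):
    """Collapse competing word arcs in one confusion-network slot.

    arc_posteriors: list of (word, lattice_link_posterior) pairs; a word may
    occur on several lattice links, whose posteriors are summed. The slot is
    then normalised so the competing word posteriors sum to one."""
    totals = defaultdict(float)
    for word, p in arc_posteriors:
        totals[word] += p
    z = sum(totals.values())
    return {w: p / z for w, p in totals.items()}
```

Picking the highest-posterior word in each slot is then one simple route to the minimum word-error decoding mentioned in the text.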
3 HMM Structure Refinements
In the previous section, basic acoustic HMMs and their use in ASR systems have been explained. Although these simple HMMs may be adequate for small vocabulary and similar limited complexity tasks, they do not perform well when used for more complex, larger vocabulary tasks such as broadcast news transcription and dictation. This section describes some of the extensions that have been used to improve the performance of ASR systems and allow them to be applied to these more interesting and challenging domains.
The use of dynamic Bayesian networks to describe possible extensions is first introduced. Some of these extensions are then discussed in detail. In particular, the use of Gaussian mixture models, efficient covariance models and feature projection schemes are presented. Finally, the use of HMMs for generating, rather than recognising, speech is briefly discussed.
3.1 Dynamic Bayesian Networks
Fig. 3.1 Typical phone HMM topology (left) and dynamic Bayesian network (right).

In Architecture of an HMM-Based Recogniser, the HMM was described as a generative model which for a typical phone has three emitting states, as shown again in Figure 3.1. Also shown in Figure 3.1 is an
alternative, complementary, graphical representation called a dynamic Bayesian network (DBN) which emphasises the conditional dependencies of the model [17, 200] and which is particularly useful for describing a variety of extensions to the basic HMM structure. In the DBN notation used here, squares denote discrete variables; circles continuous variables; shading indicates an observed variable; and no shading an unobserved variable. The lack of an arc between variables shows conditional independence. Thus Figure 3.1 shows that the observations generated by an HMM are conditionally independent given the unobserved, hidden, state that generated them.
One of the desirable attributes of using DBNs is that it is simple to show extensions in terms of how they modify the conditional independence assumptions of the model. There are two approaches, which may be combined, to extending the HMM structure in Figure 3.1: adding additional unobserved variables; and adding additional dependency arcs between variables.
Fig. 3.2 Dynamic Bayesian networks for buried Markov models and HMMs with vector linear predictors.

An example of adding additional arcs, in this case between observations, is shown in Figure 3.2. Here the observation distribution is dependent on the previous two observations in addition to the state that generated it. This DBN describes HMMs with explicit temporal correlation modelling [181], vector predictors [184], and buried Markov models [16]. Although an interesting direction for refining an HMM, this approach has not yet been adopted in mainstream state-of-the-art systems.

Fig. 3.3 Dynamic Bayesian networks for Gaussian mixture models (left) and factor-analysed models (right).
Instead of adding arcs, additional unobserved, latent or hidden, variables may be added. To compute probabilities, these hidden variables must then be marginalised out in the same fashion as the unobserved state sequence for HMMs. For continuous variables this requires a continuous integral over all values of the hidden variable, and for discrete variables a summation over all values. Two possible forms of latent variable are shown in Figure 3.3. In both cases the temporal dependencies of the model, where the discrete states are conditionally independent of all other states given the previous state, are unaltered, allowing the use of standard Viterbi decoding routines. In the first case, the observations are dependent on a discrete latent variable. This is the DBN for an HMM with Gaussian mixture model (GMM) state-output distributions. In the second case, the observations are dependent on a continuous latent variable. This is the DBN for an HMM with factor-analysed covariance matrices. Both of these are described in more detail below.
3.2 Gaussian Mixture Models
One of the most commonly used extensions to standard HMMs is to model the state-output distribution as a mixture model. In Architecture of an HMM-Based Recogniser, a single Gaussian distribution was used to model the state-output distribution. This model therefore assumed that the observed feature vectors are symmetric and unimodal. In practice this is seldom the case. For example, speaker, accent and gender differences tend to create multiple modes in the data. To address this problem, the single Gaussian state-output distribution may be replaced by a mixture of Gaussians, which is a highly flexible distribution able to model, for example, asymmetric and multi-modal distributed data.
As described in the previous section, mixture models may be viewed as adding an additional discrete latent variable to the system. The likelihood for state s_j is now obtained by summing over all the component likelihoods weighted by their priors

  b_j(y) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(y; \mu^{(jm)}, \Sigma^{(jm)}),   (3.1)
where c_{jm} is the prior probability for component m of state s_j. These priors satisfy the standard constraints for a valid probability mass function (PMF)

  \sum_{m=1}^{M} c_{jm} = 1, \qquad c_{jm} \ge 0.   (3.2)

This is an example of the general technique of mixture modelling. Each of the M components of the mixture model is a Gaussian probability density function (PDF).
Since an additional latent variable has been added to the acoustic model, the form of the EM algorithm previously described needs to be modified [83]. Rather than considering the complete dataset in terms of state–observation pairings, state/component–observation pairings are used. The estimation of the model parameters then follows the same form as for the single component case. For the mean of Gaussian component m of state s_j, s_{jm},

  \hat{\mu}^{(jm)} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T^{(r)}} \gamma_t^{(rjm)} y_t^{(r)}}{\sum_{r=1}^{R} \sum_{t=1}^{T^{(r)}} \gamma_t^{(rjm)}},   (3.3)

where \gamma_t^{(rjm)} = P(\theta_t = s_{jm} \mid Y^{(r)}; \lambda) is the probability that component m of state s_j generated the observation at time t of sequence Y^{(r)}.
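The mean update (3.3) can be sketched directly, assuming the component posteriors \gamma_t^{(rjm)} have already been computed in the E-step (for instance by the forward–backward procedure); the toy arrays below stand in for real statistics.

```python
import numpy as np

def update_component_means(gammas, observations):
    """Mean re-estimation (3.3) for one state's mixture components.

    gammas: list over training sequences r of (T_r, M) posterior arrays
    holding gamma_t^{(r j m)}; observations: list of matching (T_r, d)
    feature arrays. Returns the (M, d) matrix of updated component means."""
    M = gammas[0].shape[1]
    d = observations[0].shape[1]
    num = np.zeros((M, d))
    den = np.zeros(M)
    for g, y in zip(gammas, observations):
        num += g.T @ y        # sum_t gamma_t^{(m)} y_t for each component m
        den += g.sum(axis=0)  # sum_t gamma_t^{(m)}, the component occupancy
    return num / den[:, None]
```

The same accumulators (weighted sums and occupancies) also drive the covariance and prior updates, which is why EM training is usually organised around accumulating exactly these statistics.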
R is the number of training sequences. The component priors are estimated in a similar fashion to the transition probabilities.
When using GMMs to model the state-output distribution, variance flooring is often applied.^1 This prevents the variances of the system becoming too small, for example when a component models a very small number of tightly packed observations. This improves generalisation.
Using GMMs increases the computational overhead since, when using log arithmetic, a series of log-additions is required to compute the GMM likelihood. To improve efficiency, only components which provide a "reasonable" contribution to the total likelihood are included. To further decrease the computational load, the GMM likelihood can be approximated simply by the maximum over all the components (weighted by the priors).
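The exact log-add computation of (3.1) and the max approximation just described can be compared directly. This sketch assumes diagonal covariance matrices, the usual case in practice.

```python
import numpy as np

def gmm_log_likelihood(y, log_weights, means, variances, use_max=False):
    """State-output log likelihood for a diagonal-covariance GMM (3.1).

    y: (d,) observation; log_weights: (M,) log component priors;
    means, variances: (M, d). With use_max=True the log-add over
    components is replaced by the single best weighted component."""
    d = y.shape[0]
    # Per-component diagonal Gaussian log densities.
    log_dens = -0.5 * (d * np.log(2 * np.pi)
                       + np.sum(np.log(variances), axis=1)
                       + np.sum((y - means) ** 2 / variances, axis=1))
    scores = log_weights + log_dens
    if use_max:
        return float(np.max(scores))
    m = np.max(scores)                    # log-add via log-sum-exp for stability
    return float(m + np.log(np.sum(np.exp(scores - m))))
```

The max approximation is always a lower bound on the true log likelihood, and in decoding the two are typically close because one component dominates.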
In addition, it is necessary to determine the number of components per state in the system. A number of approaches, including discriminative schemes, have been adopted for this [27, 106]. The simplest is to use the same number of components for all states in the system and use held-out data to determine the optimal number. A popular alternative approach is based on the Bayesian information criterion (BIC) [27]. Another approach is to make the number of components assigned to a state a function of the number of observations assigned to that state [56].
3.3 Feature Projections
In Architecture of an HMM-Based Recogniser, dynamic first and second differential parameters, the so-called delta and delta–delta parameters, were added to the static feature parameters to overcome the limitations of the conditional independence assumption associated with HMMs. Furthermore, the DCT was assumed to approximately decorrelate the feature vector to improve the diagonal covariance approximation and reduce the dimensionality of the feature vector.

^1 See HTK for a specific implementation of variance flooring [189].

It is also possible to use data-driven approaches to decorrelate and reduce the dimensionality of the features. The standard approach is to
use linear transformations, that is

  y_t = A_{[p]} \tilde{y}_t,   (3.4)

where A_{[p]} is a p \times d linear transform, d is the dimension of the source feature vector \tilde{y}_t, and p is the size of the transformed feature vector y_t.
The detailed form of a feature vector transformation such as this depends on a number of choices. Firstly, the construction of \tilde{y}_t must be decided and, when required, the class labels that are to be used, for example phone, state or Gaussian component labels. Secondly, the dimensionality p of the projected data must be determined. Lastly, the criterion used to estimate A_{[p]} must be specified.
An important issue when estimating a projection is whether the class labels of the observations are to be used or not, i.e., whether the projection should be estimated in a supervised or unsupervised fashion. In general, supervised approaches yield better projections since it is then possible to estimate a transform to improve the discrimination between classes. In contrast, in unsupervised approaches only general attributes of the observations, such as the variances, may be used. However, supervised schemes are usually computationally more expensive than unsupervised schemes: statistics such as the means and covariance matrices are needed for each of the class labels, whereas unsupervised schemes effectively only ever have one class.
The features to be projected are normally based on the standard MFCC, or PLP, feature vectors. However, the treatment of the delta and delta–delta parameters can vary. One approach is to splice neighbouring static vectors together to form a composite vector of typically 9 frames [10]

  \tilde{y}_t = \left[ y_{t-4}^{sT} \; \cdots \; y_t^{sT} \; \cdots \; y_{t+4}^{sT} \right]^T.   (3.5)
Another approach is to expand the static, delta and delta–delta parameters with third-order dynamic parameters (the difference of delta–deltas) [116]. Both have been used for large vocabulary speech recognition systems. The dimensionality of the projected feature vector is often determined empirically since there is only a single parameter to tune. In [68] a number of possible class definitions, phone, state or component, were compared for supervised projection schemes. The choice of class is important as the transform tries to ensure that each of these classes is as separable as possible from all others. The study showed that performance using state and component level classes was similar, and both were better than higher level labels such as phones or words. In practice, the vast majority of systems use component level classes since, as discussed later, this improves the assumption of diagonal covariance matrices.
The simplest form of criterion for estimating the transform is principal component analysis (PCA). This is an unsupervised projection scheme, so no use is made of class labels. The PCA transform is found by selecting the p rows of the orthonormal matrix A that maximise

  \mathcal{F}_{\mathrm{pca}}(\lambda) = \log \left| A_{[p]} \tilde{\Sigma}_g A_{[p]}^T \right|,   (3.6)
where \tilde{\Sigma}_g is the total, or global, covariance matrix of the original data \tilde{y}_t. This selects the orthogonal projections of the data that maximise the total variance in the projected subspace. Simply selecting subspaces that yield large variances does not necessarily yield subspaces that discriminate between the classes. To address this problem, supervised approaches such as the linear discriminant analysis (LDA) criterion [46] can be used. In LDA the objective is to increase the ratio of the between-class variance to the average within-class variance for each dimension. This criterion may be expressed as
  \mathcal{F}_{\mathrm{lda}}(\lambda) = \log \frac{\left| A_{[p]} \tilde{\Sigma}_b A_{[p]}^T \right|}{\left| A_{[p]} \tilde{\Sigma}_w A_{[p]}^T \right|},   (3.7)
where \tilde{\Sigma}_b is the between-class covariance matrix and \tilde{\Sigma}_w the average within-class covariance matrix, where each distinct Gaussian component is usually assumed to be a separate class. This criterion yields an orthonormal transform such that the average within-class covariance matrix is diagonalised, which should improve the diagonal covariance matrix assumption.
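An LDA estimation in the spirit of (3.7) reduces to a generalised eigenproblem on the between- and within-class scatter matrices. The sketch below works on raw labelled observations for clarity; production systems accumulate the same statistics from per-Gaussian alignments instead, and the toy data in the usage test is invented.

```python
import numpy as np

def lda_projection(X, labels, p):
    """Estimate a p x d LDA transform: eigenvectors of Sw^{-1} Sb, which
    maximise the between-class to within-class variance ratio of (3.7)."""
    classes = np.unique(labels)
    mu_g = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))       # accumulated within-class scatter
    Sb = np.zeros((d, d))       # accumulated between-class scatter
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - mu_g)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Generalised eigenproblem Sb v = lambda Sw v; keep the top p directions.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:p]].T    # A_[p], shape (p, d)
```

Note that at most (number of classes − 1) eigenvalues are non-zero, which bounds the useful dimensionality of a plain LDA projection.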
The LDA criterion may be further refined by using the actual class covariance matrices rather than the averages. One such form is heteroscedastic discriminant analysis (HDA) [149], where the following criterion is maximised

  \mathcal{F}_{\mathrm{hda}}(\lambda) = \sum_{m} \gamma^{(m)} \log \frac{\left| A_{[p]} \tilde{\Sigma}_b A_{[p]}^T \right|}{\left| A_{[p]} \tilde{\Sigma}^{(m)} A_{[p]}^T \right|},   (3.8)
where \gamma^{(m)} is the total posterior occupation probability for component m and \tilde{\Sigma}^{(m)} is its covariance matrix in the original space given by \tilde{y}_t. In this transformation, the matrix A is not constrained to be orthonormal, nor does this criterion help with decorrelating the data associated with each Gaussian component. It is thus often used in conjunction with a global semi-tied transform [51] (also known as a maximum likelihood linear transform (MLLT) [65]) described in the next section. An alternative extension to LDA is heteroscedastic LDA (HLDA) [92]. This modifies the LDA criterion in (3.7) in a similar fashion to HDA, but now a transform for the complete feature-space is estimated, rather than for just the dimensions to be retained. The parameters of the HLDA transform can be found in an ML fashion, as if they were model parameters, as discussed in Architecture of an HMM-Based Recogniser. An important extension of this transform is to ensure that the distributions for all dimensions to be removed are constrained to be the same. This is achieved by tying the parameters associated with these dimensions to ensure that they are identical. These dimensions will thus yield no discriminatory information and so need not be retained for recognition. The HLDA criterion can then be expressed as
  \mathcal{F}_{\mathrm{hlda}}(\lambda) = \sum_{m} \gamma^{(m)} \log \frac{|A|^2}{\left| \mathrm{diag}\!\left( A_{[d-p]} \tilde{\Sigma}_g A_{[d-p]}^T \right) \right| \, \left| \mathrm{diag}\!\left( A_{[p]} \tilde{\Sigma}^{(m)} A_{[p]}^T \right) \right|},   (3.9)

where

  A = \begin{bmatrix} A_{[p]} \\ A_{[d-p]} \end{bmatrix}.   (3.10)
There are two important differences between HLDA and HDA. Firstly, HLDA yields the best projection whilst simultaneously generating the best transform for improving the diagonal covariance matrix approximation. In contrast, for HDA a separate decorrelating transform must be added. Furthermore, HLDA yields a model for the complete feature space, whereas HDA only models the useful (non-projected) dimensions. This means that multiple subspace projections can be used with HLDA, but not with HDA [53].
Though schemes like HLDA outperform approaches like LDA, they are more computationally expensive and require more memory. Full covariance matrix statistics for each component are required to estimate an HLDA transform, whereas only the average within- and between-class covariance matrices are required for LDA. This makes HLDA projections from large dimensional feature spaces with large numbers of components impractical. One compromise that is sometimes used is to follow an LDA projection by a decorrelating global semi-tied transform [163] described in the next section.
3.4 Covariance Modelling
One of the motivations for using the DCT transform, and some of the projection schemes discussed in the previous section, is to decorrelate the feature vector so that the diagonal covariance matrix approximation becomes reasonable. This is important as, in general, the use of full covariance Gaussians in large vocabulary systems would be impractical due to the sheer size of the model set.^2 Even with small systems, training data limitations often preclude the use of full covariances. Furthermore, the computational cost of using full covariance matrices is O(d^2), compared to O(d) for the diagonal case, where d is the dimensionality of the feature vector. To address both of these problems, structured covariance and precision (inverse covariance) matrix representations have been developed. These allow covariance modelling to be improved with very little overhead in terms of memory and computational load.
3.4.1 Structured Covariance Matrices
^2 Although given enough data, and some smoothing, it can be done [163].

A standard form of structured covariance matrix arises from factor analysis, and the DBN for this was shown in Figure 3.3. Here the covariance matrix for each Gaussian component m is represented by (the dependency on the state has been dropped for clarity)

  \Sigma^{(m)} = A_{[p]}^{(m)T} A_{[p]}^{(m)} + \Sigma_{\mathrm{diag}}^{(m)},   (3.11)
where A_{[p]}^{(m)} is the component-specific, p \times d, loading matrix and \Sigma_{\mathrm{diag}}^{(m)} is a component-specific diagonal covariance matrix. The above expression is based on factor analysis, which allows each of the Gaussian components to be estimated separately using EM [151]. Factor-analysed HMMs [146] generalise this to support both tying the loading matrices over multiple components and using GMMs to model the latent variable space of z in Figure 3.3.
Although structured covariance matrices of this form reduce the number of parameters needed to represent each covariance matrix, the computation of likelihoods depends on the inverse covariance matrix, i.e., the precision matrix, and this will still be a full matrix. Thus, factor-analysed HMMs still incur the decoding cost associated with full covariances.^3
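The structure of (3.11), and the point just made that the implied precision matrix is full even though the covariance is compactly parameterised, can be checked numerically; the loading matrices below are toy values.

```python
import numpy as np

def factor_analysed_covariance(loading, psi_diag):
    """Structured covariance (3.11): Sigma = A^T A + Psi_diag, where
    loading is the p x d loading matrix and psi_diag the d-vector of
    diagonal variances. Returns the covariance and its precision,
    illustrating that the precision is in general a full matrix."""
    sigma = loading.T @ loading + np.diag(psi_diag)
    precision = np.linalg.inv(sigma)
    return sigma, precision
```

With only p(d + 1) parameters per component instead of d(d + 1)/2, the representation is compact, but the full precision is why decoding costs as much as full covariances unless the matrix inversion lemma is exploited.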
3.4.2 Structured Precision Matrices
A more computationally efficient approach to covariance structuring is to model the inverse covariance matrix, and a general form for this is [131, 158]

  \Sigma^{(m)-1} = \sum_{i=1}^{B} \nu_i^{(m)} S_i,   (3.12)

where \nu^{(m)} is a Gaussian component-specific weight vector that specifies the contribution from each of the B global positive semi-definite matrices S_i. One of the desirable properties of this form of model is that it is computationally efficient during decoding, since
  \log \mathcal{N}(y; \mu^{(m)}, \Sigma^{(m)}) = -\frac{1}{2} \log\!\left( (2\pi)^d \left| \Sigma^{(m)} \right| \right) - \frac{1}{2} \sum_{i=1}^{B} \nu_i^{(m)} (y - \mu^{(m)})^T S_i (y - \mu^{(m)})

    = -\frac{1}{2} \log\!\left( (2\pi)^d \left| \Sigma^{(m)} \right| \right) - \frac{1}{2} \sum_{i=1}^{B} \nu_i^{(m)} \left[ y^T S_i y - 2 \mu^{(m)T} S_i y + \mu^{(m)T} S_i \mu^{(m)} \right].   (3.13)

^3 Depending on p, the matrix inversion lemma, also known as the Sherman–Morrison–Woodbury formula, can be used to make this more efficient.
Thus, rather than incurring the computational cost of a full covariance matrix, it is possible to cache terms such as S_i y and y^T S_i y, which do not depend on the Gaussian component.
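The caching trick behind (3.13) can be sketched as follows. The basis matrices and weights in the test are toy values; a real system would also precompute \mu^{(m)T} S_i \mu^{(m)} per component (here it is recomputed for brevity).

```python
import numpy as np

def precompute_basis_terms(bases, y):
    """Cache the component-independent terms S_i y and y^T S_i y of (3.13)."""
    Sy = np.stack([S @ y for S in bases])      # (B, d), one S_i y per basis
    ySy = np.array([y @ s for s in Sy])        # (B,), one y^T S_i y per basis
    return Sy, ySy

def structured_log_gauss(y, mu, nu, bases, Sy, ySy):
    """log N(y; mu, Sigma) with precision sum_i nu_i S_i, per (3.13)."""
    d = y.shape[0]
    prec = np.einsum("i,ijk->jk", nu, np.stack(bases))
    sign, logdet = np.linalg.slogdet(prec)     # log|Sigma| = -log|precision|
    quad = np.sum(nu * (ySy - 2 * (Sy @ mu)
                        + np.array([mu @ S @ mu for S in bases])))
    return -0.5 * (d * np.log(2 * np.pi) - logdet + quad)
```

After the per-frame cache is built, each Gaussian costs only O(Bd) for the \mu^{(m)T} S_i y terms rather than the O(d^2) of an explicit full-covariance quadratic form.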
One of the simplest and most effective structured covariance representations is the semi-tied covariance matrix (STC) [51]. This is a specific form of precision matrix model, where the number of bases is equal to the number of dimensions (B = d) and the bases are symmetric and have rank 1 (thus they can be expressed as S_i = a_{[i]}^T a_{[i]}, where a_{[i]} is the ith row of A). In this case the component likelihoods can be computed by

  \mathcal{N}(y; \mu^{(m)}, \Sigma^{(m)}) = |A| \, \mathcal{N}(Ay; A\mu^{(m)}, \Sigma_{\mathrm{diag}}^{(m)}),   (3.14)

where \Sigma_{\mathrm{diag}}^{(m)-1} is a diagonal matrix formed from \nu^{(m)} and

  \Sigma^{(m)-1} = \sum_{i=1}^{d} \nu_i^{(m)} a_{[i]}^T a_{[i]} = A^T \Sigma_{\mathrm{diag}}^{(m)-1} A.   (3.15)
The matrix A is sometimes referred to as the semi-tied transform. One of the reasons that STC systems are simple to use is that the weight for each dimension is simply the inverse variance for that dimension. The training procedure for STC systems is (after accumulating the standard EM full-covariance matrix statistics):

(1) initialise the transform A^{(0)} = I; set \Sigma_{\mathrm{diag}}^{(m0)} equal to the current model covariance matrices; and set k = 0;
(2) estimate A^{(k+1)} given \Sigma_{\mathrm{diag}}^{(mk)};
(3) set \Sigma_{\mathrm{diag}}^{(m(k+1))} to the component variance for each dimension using A^{(k+1)};
(4) go to (2) unless converged or the maximum number of iterations is reached.
During recognition, decoding is very efficient since A\mu^{(m)} is stored for each component, and the transformed features Ay_t are cached for each time instance. There is thus almost no increase in decoding time compared to standard diagonal covariance matrix systems.
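Per-frame evaluation under (3.14) might look like the sketch below, with the projected means precomputed and the projected feature shared across all components, as the text describes. The sketch assumes A has positive determinant so the Jacobian term is simply log det A.

```python
import numpy as np

def stc_log_likelihood(y, A, mus_proj, vars_diag):
    """Semi-tied component log likelihoods via (3.14):
    N(y; mu, Sigma) = |A| N(Ay; A mu, Sigma_diag).

    mus_proj: (M, d) precomputed projected means A mu^{(m)};
    vars_diag: (M, d) diagonal variances. Returns (M,) log likelihoods."""
    d = y.shape[0]
    Ay = A @ y                           # computed once per frame, shared by all M
    sign, logdetA = np.linalg.slogdet(A) # Jacobian of the feature transform
    diff2 = (Ay - mus_proj) ** 2
    return (logdetA
            - 0.5 * (d * np.log(2 * np.pi)
                     + np.sum(np.log(vars_diag), axis=1)
                     + np.sum(diff2 / vars_diag, axis=1)))
```

Since `logdetA` is a global constant and `Ay` is cached per frame, the per-component cost is identical to a plain diagonal-covariance system.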
STC systems can be made more powerful by using multiple semi-tied transforms [51]. In addition to STC, other types of structured covariance modelling include subspace constrained precisions and means (SPAM) [8], mixtures of inverse covariances [175], and extended maximum likelihood linear transforms (EMLLT) [131].
3.5 HMM Duration Modelling
The probability d_j(t) of remaining in state s_j for t consecutive observations in a standard HMM is given by

  d_j(t) = a_{jj}^{t-1} (1 - a_{jj}),   (3.16)

where a_{jj} is the self-transition probability of state s_j. Thus, the standard HMM models the probability of state occupancy as decreasing exponentially with time, and clearly this is a poor model of duration.
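The geometric decay of (3.16) is easy to inspect numerically; note that the mode is always at t = 1, regardless of a_{jj}, which is the poor fit to real phone durations referred to above.

```python
import numpy as np

def hmm_duration_pmf(a_jj, max_t):
    """State-occupancy distribution (3.16): d_j(t) = a_jj^{t-1} (1 - a_jj),
    evaluated for t = 1 .. max_t. Strictly decreasing for any 0 < a_jj < 1."""
    t = np.arange(1, max_t + 1)
    return a_jj ** (t - 1) * (1 - a_jj)
```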
Given this weakness of the standard HMM, an obvious refinement is to introduce an explicit duration model such that the self-transition loops in Figure 2.2 are replaced by an explicit probability distribution d_j(t). Suitable choices for d_j(t) are the Poisson distribution [147] and the Gamma distribution [98].
Whilst the use of these explicit distributions can undoubtedly model state segment durations more accurately, they add a very considerable overhead to the computational complexity. This is because the resulting models no longer have the Markov property, and this prevents efficient search over possible state alignments. For example, in these so-called hidden semi-Markov models (HSMMs), the forward probability given in (2.10) becomes

  \alpha_t^{(rj)} = \sum_{\tau} \sum_{i, i \ne j} \alpha_{t-\tau}^{(ri)} a_{ij} d_j(\tau) \prod_{l=1}^{\tau} b_j\!\left( y_{t-\tau+l}^{(r)} \right).   (3.17)
The need to sum over all possible state durations increases the complexity by a factor of O(D^2), where D is the maximum allowed state duration, and the backward probability and Viterbi decoding suffer the same increase in complexity.
In practice, explicit duration modelling of this form offers some small improvements in performance in small vocabulary applications, and minimal improvements in large vocabulary applications. A typical experience in the latter is that as the accuracy of the phone-output distributions is increased, the improvements gained from duration modelling become negligible. For this reason, and because of its computational complexity, duration modelling is rarely used in current systems.
3.6 HMMs for Speech Generation
HMMs are generative models and, although HMM-based acoustic models were developed primarily for speech recognition, it is relevant to consider how well they can actually generate speech. This is not only of direct interest for synthesis applications, where the flexibility and compact representation of HMMs offer considerable benefits [168], but it can also provide further insight into their use in recognition [169].
The key issue of interest here is the extent to which the dynamic delta and delta–delta terms can be used to enforce accurate trajectories for the static parameters. If an HMM is used to directly model the speech features, then given the conditional independence assumptions from Figure 3.1 and a particular state sequence \theta, the trajectory will be piecewise stationary, where the time segment corresponding to each state simply adopts the mean value of that state. This would clearly be a poor fit to real speech, where the spectral parameters vary much more smoothly.
The HMM-trajectory model^4 aims to generate realistic feature trajectories by finding the sequence of d-dimensional static observations which maximises the likelihood of the complete feature vector (i.e., statics + deltas) with respect to the parameters of a standard HMM model. The trajectory of these static parameters will no longer be piecewise stationary since the associated delta parameters also contribute to the likelihood and must therefore be consistent with the HMM parameters.

^4 An implementation of this is available as an extension to HTK at hts.sp.nitech.ac.jp.

To see how these trajectories can be computed, consider the case of using a feature vector comprising static parameters and "simple-difference" delta parameters. The observation at time t is a function of the static features at times t - 1, t and t + 1
  y_t = \begin{bmatrix} y_t^s \\ \Delta y_t^s \end{bmatrix} = \begin{bmatrix} 0 & I & 0 \\ -I & 0 & I \end{bmatrix} \begin{bmatrix} y_{t-1}^s \\ y_t^s \\ y_{t+1}^s \end{bmatrix},   (3.18)

where 0 is a d \times d zero matrix and I is a d \times d identity matrix, y_t^s is the static element of the feature vector at time t and \Delta y_t^s are the delta features.^5
Now if a complete sequence of features is considered, the following relationship can be obtained between the 2Td features used by the standard HMM and the Td static features

  Y_{1:T} = A Y^s.   (3.19)
To illustrate the form of the 2Td \times Td matrix A, consider the observation vectors at times t - 1, t and t + 1 in the complete sequence. These may be expressed as

  \begin{bmatrix} \vdots \\ y_{t-1} \\ y_t \\ y_{t+1} \\ \vdots \end{bmatrix} =
  \begin{bmatrix}
    \cdots & 0 & I & 0 & 0 & 0 & \cdots \\
    \cdots & -I & 0 & I & 0 & 0 & \cdots \\
    \cdots & 0 & 0 & I & 0 & 0 & \cdots \\
    \cdots & 0 & -I & 0 & I & 0 & \cdots \\
    \cdots & 0 & 0 & 0 & I & 0 & \cdots \\
    \cdots & 0 & 0 & -I & 0 & I & \cdots
  \end{bmatrix}
  \begin{bmatrix} \vdots \\ y_{t-2}^s \\ y_{t-1}^s \\ y_t^s \\ y_{t+1}^s \\ y_{t+2}^s \\ \vdots \end{bmatrix}.   (3.20)
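The band structure of (3.20) can be built explicitly for small T. Boundary handling is an assumption here: the missing neighbours at the sequence edges are simply dropped (treated as zero), whereas a real system would pad or shorten the difference window.

```python
import numpy as np

def delta_expansion_matrix(T, d):
    """Build the 2Td x Td matrix A of (3.19) for static plus simple-difference
    delta features, following the band structure of (3.20):
    static row -> I at frame t; delta row -> +I at t+1 and -I at t-1."""
    A = np.zeros((2 * T * d, T * d))
    I = np.eye(d)
    for t in range(T):
        A[2 * t * d:(2 * t + 1) * d, t * d:(t + 1) * d] = I       # static block
        if t + 1 < T:
            A[(2 * t + 1) * d:(2 * t + 2) * d, (t + 1) * d:(t + 2) * d] = I
        if t - 1 >= 0:
            A[(2 * t + 1) * d:(2 * t + 2) * d, (t - 1) * d:t * d] = -I
    return A
```

Multiplying this matrix by a stacked static sequence Y^s reproduces exactly the interleaved static/delta observation stream Y_{1:T} that the standard HMM models.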
The features modelled by the standard HMM are thus a linear transform of the static features. It should therefore be possible to derive the probability distribution of the static features if the parameters of the standard HMM are known.

^5 Here a non-standard version of delta parameters is considered to simplify notation. In this case, d is the size of the static parameters, not the complete feature vector.

The process of finding the distribution of the static features is simplified by hypothesising an appropriate state/component sequence \theta for the observations to be generated. In practical synthesis systems, state duration models are often estimated as in the previous section. In this
case the simplest approach is to set the duration of each state equal to its average (note only single-component state-output distributions can be used). Given \theta, the distribution of the associated static sequences Y^s will be Gaussian. The likelihood of a static sequence can then be expressed in terms of the standard HMM features as

  \frac{1}{Z} p(A Y^s \mid \theta; \lambda) = \frac{1}{Z} \mathcal{N}(Y_{1:T}; \mu_\theta, \Sigma_\theta) = \mathcal{N}(Y^s; \mu_\theta^s, \Sigma_\theta^s),   (3.21)
where Z is a normalisation term. The following relationships exist between the standard model parameters and the static parameters

  \Sigma_\theta^{s-1} = A^T \Sigma_\theta^{-1} A   (3.22)
  \Sigma_\theta^{s-1} \mu_\theta^s = A^T \Sigma_\theta^{-1} \mu_\theta   (3.23)
  \mu_\theta^{sT} \Sigma_\theta^{s-1} \mu_\theta^s = \mu_\theta^T \Sigma_\theta^{-1} \mu_\theta,   (3.24)
where the segment mean, \mu_\theta, and variances, \Sigma_\theta, (along with the corresponding versions for the static parameters only, \mu_\theta^s and \Sigma_\theta^s) are

  \mu_\theta = \begin{bmatrix} \mu^{(\theta_1)} \\ \vdots \\ \mu^{(\theta_T)} \end{bmatrix}, \qquad
  \Sigma_\theta = \begin{bmatrix} \Sigma^{(\theta_1)} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \Sigma^{(\theta_T)} \end{bmatrix}.   (3.25)
Though the covariance matrix of the HMM features, Σ_θ, has the block-diagonal structure expected from the HMM, the covariance matrix of the static features, Σ^s_θ in (3.22), does not have a block-diagonal structure since A is not block diagonal. Thus, using the delta parameters to obtain the distribution of the static parameters does not imply the same conditional independence assumptions as the standard HMM assumes in modelling the static features.
The maximum likelihood static feature trajectory, Ŷ^s, can now be found using (3.21). This will simply be the static parameter segment mean, μ^s_θ, as the distribution is Gaussian. To find μ^s_θ, the relationships in (3.22) to (3.24) can be rearranged to give

$$
\hat{Y}^s = \mu^s_\theta = \left( A^{\mathsf T} \Sigma^{-1}_\theta A \right)^{-1} A^{\mathsf T} \Sigma^{-1}_\theta \mu_\theta. \qquad (3.26)
$$
These are all known, as μ_θ and Σ_θ are based on the standard HMM parameters and the matrix A is given in (3.20). Note that, given the highly structured nature of A, efficient implementations of the matrix inversion may be found.
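As an illustrative numerical sketch (not part of the original text), the solution of (3.26) can be written directly with NumPy. The construction of A below assumes the simplified delta definition used in (3.20), i.e. the delta at time t is y^s_{t+1} − y^s_{t−1} with zero-padding at the sequence boundaries, and a diagonal Σ_θ; function names and dimensions are illustrative.

```python
import numpy as np

def build_delta_matrix(T, d):
    """Build the 2Td x Td matrix A of (3.20): each observation stacks the
    static vector y_t^s and a simplified delta y_{t+1}^s - y_{t-1}^s
    (sequence boundaries are zero-padded, an assumption for this sketch)."""
    I = np.eye(d)
    A = np.zeros((2 * T * d, T * d))
    for t in range(T):
        A[2*t*d:(2*t+1)*d, t*d:(t+1)*d] = I              # static row
        if t + 1 < T:
            A[(2*t+1)*d:(2*t+2)*d, (t+1)*d:(t+2)*d] = I  # +y_{t+1}^s
        if t - 1 >= 0:
            A[(2*t+1)*d:(2*t+2)*d, (t-1)*d:t*d] = -I     # -y_{t-1}^s
    return A

def ml_static_trajectory(A, mu, sigma_diag):
    """Solve (3.26): (A^T Sigma^-1 A)^-1 A^T Sigma^-1 mu, assuming the
    segment covariance Sigma_theta is diagonal (given by sigma_diag)."""
    s_inv = 1.0 / sigma_diag
    lhs = A.T @ (s_inv[:, None] * A)   # A^T Sigma^-1 A
    rhs = A.T @ (s_inv * mu)           # A^T Sigma^-1 mu
    return np.linalg.solve(lhs, rhs)
```

A banded Cholesky solver would exploit the structured nature of A noted above; the dense solve here is only for clarity.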
The basic HMM synthesis model described has been refined in a couple of ways. If the overall length of the utterance to synthesise is known, then an ML estimate of θ can be found in a manner similar to Viterbi. Rather than simply using the average state durations for the synthesis process, it is also possible to search for the most likely state/component sequence. Here

$$
\left( \hat{Y}^s, \hat{\theta} \right) = \arg\max_{Y^s,\, \theta} \left\{ \frac{1}{Z}\, p(A Y^s \mid \theta; \lambda)\, P(\theta; \lambda) \right\}. \qquad (3.27)
$$
In addition, though the standard HMM model parameters, λ, can be trained in the usual fashion, for example using ML, improved performance can be obtained by training them so that when used to generate sequences of static features they are a "good" model of the training data. ML-based estimates for this can be derived [168]. Just as regular HMMs can be trained discriminatively, as described later, HMM synthesis models can also be trained discriminatively [186]. One of the major issues with both these improvements is that the Viterbi algorithm cannot be directly used with the model to obtain the state sequence. A frame-delayed version of the Viterbi algorithm can be used [196] to find the state sequence. However, this is still more expensive than the standard Viterbi algorithm.
HMM trajectory models can also be used for speech recognition [169]. This again makes use of the equalities in (3.21). The recognition output for feature vector sequence Y_{1:T} is now based on

$$
\hat{w} = \arg\max_{w} \left\{ \max_{\theta} \left\{ \frac{1}{Z}\, \mathcal{N}(Y_{1:T};\, \mu_\theta, \Sigma_\theta)\, P(\theta \mid w; \lambda)\, P(w) \right\} \right\}. \qquad (3.28)
$$

As with synthesis, one of the major issues with this form of decoding is that the Viterbi algorithm cannot be used. In practice N-best list rescoring is often implemented instead. Though an interesting research direction, this form of speech recognition system is not in widespread use.
4 Parameter Estimation
In Architecture of an HMM-Based Recogniser, the estimation of the HMM model parameters, λ, was briefly described based on maximising the likelihood that the models generate the training sequences. Thus, given training data Y^{(1)}, ..., Y^{(R)}, the maximum likelihood (ML) training criterion may be expressed as

$$
\mathcal{F}_{\text{ml}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \log\left( p(Y^{(r)} \mid w^{(r)}_{\text{ref}}; \lambda) \right), \qquad (4.1)
$$

where Y^{(r)} is the rth training utterance with transcription w^{(r)}_ref. This
optimisation is normally performed using EM [33]. However, for ML
to be the “best” training criterion, the data and models would need to
satisfy a number of requirements, in particular, training data sufficiency and model correctness [20]. Since in general these requirements are not
satisﬁed when modelling speech data, alternative criteria have been
developed.
This section describes the use of discriminative criteria for training HMM model parameters. In these approaches, the goal is not to estimate the parameters that are most likely to generate the training data. Instead, the objective is to modify the model parameters so that hypotheses generated by the recogniser on the training data more closely "match" the correct word sequences, whilst generalising
to unseen test data.
One approach would be to explicitly minimise the classiﬁcation error
on the training data. Here, a strict 1/0 loss function is needed as the
measure to optimise. However, the differential of such a loss function would be discontinuous, preventing the use of gradient descent based schemes. Hence various approximations described in the next section
have been explored. Compared to ML training, the implementation
of these training criteria is more complex and they have a tendency
to overtrain on the training data, i.e., they do not generalise well to
unseen data. These issues are also discussed.
One of the major costs incurred in building speech recognition systems lies in obtaining sufficient acoustic data with accurate transcriptions. Techniques which allow systems to be trained with rough, approximate transcriptions, or even no transcriptions at all, are therefore of growing interest, and work in this area is briefly summarised at
the end of this section.
4.1 Discriminative Training
For speech recognition, three main forms of discriminative training have been examined, all of which can be expressed in terms of the posterior of the correct sentence. Using Bayes' rule this posterior may be expressed as

$$
P(w^{(r)}_{\text{ref}} \mid Y^{(r)}; \lambda) = \frac{ p(Y^{(r)} \mid w^{(r)}_{\text{ref}}; \lambda)\, P(w^{(r)}_{\text{ref}}) }{ \sum_{w} p(Y^{(r)} \mid w; \lambda)\, P(w) }, \qquad (4.2)
$$

where the summation is over all possible word sequences. By expressing the posterior in terms of the HMM generative model parameters, it is possible to estimate "generative" model parameters with discriminative criteria.
The language model (or class prior), P(w), is not normally trained in conjunction with the acoustic model (though there has been some work in this area [145]), since typically the amount of text training data for the language model is far greater (by orders of magnitude) than the available acoustic training data. Note, however, that unlike the ML case, the language model is included in the denominator of the objective function in the discriminative training case.

In addition to the criteria discussed here, there has been work on maximum-margin based estimation schemes [79, 100, 156]. Also, interest is growing in using discriminative models for speech recognition, where P(w^{(r)}_ref | Y^{(r)}; λ) is directly modelled [54, 67, 93, 95], rather than using Bayes' rule as above to obtain the posterior of the correct word sequence. In addition, these discriminative criteria can be used to train feature transforms [138, 197] and model parameter transforms [159] that are dependent on the observations, making them time-varying.
4.1.1 Maximum Mutual Information
One of the first discriminative training criteria to be explored was the maximum mutual information (MMI) criterion [9, 125]. Here, the aim is to maximise the mutual information between the word sequence, w, and the information extracted by a recogniser with parameters λ from the associated observation sequence, Y, I(w, Y; λ). As the joint distribution of the word sequences and observations is unknown, it is approximated by the empirical distributions over the training data. This can be expressed as [20]

$$
I(w, Y; \lambda) \approx \frac{1}{R} \sum_{r=1}^{R} \log\left( \frac{ P(w^{(r)}_{\text{ref}}, Y^{(r)}; \lambda) }{ P(w^{(r)}_{\text{ref}})\, p(Y^{(r)}; \lambda) } \right). \qquad (4.3)
$$

As only the acoustic model parameters are trained, P(w^{(r)}_ref) is fixed; this is equivalent to finding the model parameters that maximise the average log-posterior probability of the correct word sequence.¹ Thus

¹ Given that the class priors, the language model probabilities, are fixed, this should really be called conditional entropy training. When this form of training criterion is used with discriminative models it is also known as conditional maximum likelihood (CML) training.
the following criterion is maximised:

$$
\mathcal{F}_{\text{mmi}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \log\left( P(w^{(r)}_{\text{ref}} \mid Y^{(r)}; \lambda) \right) \qquad (4.4)
$$

$$
\mathcal{F}_{\text{mmi}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \log\left( \frac{ p(Y^{(r)} \mid w^{(r)}_{\text{ref}}; \lambda)\, P(w^{(r)}_{\text{ref}}) }{ \sum_{w} p(Y^{(r)} \mid w; \lambda)\, P(w) } \right). \qquad (4.5)
$$
Intuitively, the numerator is the likelihood of the data given the correct word sequence w^{(r)}_ref, whilst the denominator is the total likelihood of the data given all possible word sequences w. Thus, the objective function is maximised by making the correct model sequence likely and all other model sequences unlikely.
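The per-utterance MMI objective can be sketched numerically. The fragment below (illustrative names, not from the original text) approximates the denominator's sum over all word sequences by an N-best list of hypotheses, combining acoustic log-likelihoods and LM log-probabilities with a log-sum-exp for numerical stability.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x))) over a list of log-scores."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mmi_log_posterior(ref_index, acoustic_llh, lm_logp):
    """log P(w_ref | Y) as in (4.5): the numerator is the joint score of
    the reference hypothesis; the denominator sums the joint scores of
    all hypotheses (an N-best approximation to the sum over all word
    sequences)."""
    joint = [a + l for a, l in zip(acoustic_llh, lm_logp)]
    return joint[ref_index] - log_sum_exp(joint)
```

Averaging this quantity over the R training utterances gives the criterion F_mmi of (4.4).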
4.1.2 Minimum Classification Error

Minimum classification error (MCE) is a smooth measure of the error [82, 29]. This is normally based on a smooth function of the difference between the log-likelihood of the correct word sequence and all other competing sequences, and a sigmoid is often used for this purpose. The MCE criterion may be expressed in terms of the posteriors as

$$
\mathcal{F}_{\text{mce}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \left( 1 + \left( \frac{ P(w^{(r)}_{\text{ref}} \mid Y^{(r)}; \lambda) }{ \sum_{w \neq w^{(r)}_{\text{ref}}} P(w \mid Y^{(r)}; \lambda) } \right)^{\varrho} \right)^{-1}. \qquad (4.6)
$$

There are some important differences between MCE and MMI. The first is that the denominator term does not include the correct word sequence. Secondly, the posteriors (or log-likelihoods) are smoothed with a sigmoid function, which introduces an additional smoothing term ϱ. When ϱ = 1 then
$$
\mathcal{F}_{\text{mce}}(\lambda) = 1 - \frac{1}{R} \sum_{r=1}^{R} P(w^{(r)}_{\text{ref}} \mid Y^{(r)}; \lambda). \qquad (4.7)
$$
This is a speciﬁc example of the minimum Bayes risk criterion discussed
next.
4.1.3 Minimum Bayes' Risk

In minimum Bayes' risk (MBR) training, rather than trying to model the correct distribution, as in the MMI criterion, the expected loss during recognition is minimised [22, 84]. To approximate the expected loss during recognition, the expected loss estimated on the training data is used. Here

$$
\mathcal{F}_{\text{mbr}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \sum_{w} P(w \mid Y^{(r)}; \lambda)\, L(w, w^{(r)}_{\text{ref}}), \qquad (4.8)
$$

where L(w, w^{(r)}_ref) is the loss function of word sequence w against the reference for sequence r, w^{(r)}_ref.
There are a number of loss functions that have been examined.

• 1/0 loss: For continuous speech recognition this is equivalent to a sentence-level loss function

$$
L(w, w^{(r)}_{\text{ref}}) = \begin{cases} 1; & w \neq w^{(r)}_{\text{ref}} \\ 0; & w = w^{(r)}_{\text{ref}}. \end{cases}
$$

When ϱ = 1, MCE and MBR training with a sentence cost function are the same.

• Word: The loss function directly related to minimising the expected word error rate (WER). It is normally computed by minimising the Levenshtein edit distance.

• Phone: For large vocabulary speech recognition not all word sequences will be observed. To assist generalisation, the loss function is often computed between phone sequences, rather than word sequences. In the literature this is known as minimum phone error (MPE) training [136, 139].

• Phone frame error: When using the phone loss-function, the number of possible errors to be corrected is reduced compared to the number of frames. This can cause generalisation issues. To address this, minimum phone frame error (MPFE) may be used, where the phone loss is weighted by the number of frames associated with each phone [198]. This is the same as the Hamming distance described in [166].
It is also possible to base the loss function on the specific task for which the classifier is being built [22].
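The word-level loss above is the Levenshtein edit distance between two word sequences. A minimal dynamic-programming sketch (not tied to any particular toolkit):

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, insertions and deletions needed
    to turn hyp into ref -- the word-level loss L(w, w_ref) used when
    training towards the expected word error rate."""
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                       # delete everything
    for j in range(H + 1):
        d[0][j] = j                       # insert everything
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[R][H]
```

Dividing by the reference length gives the familiar WER figure; for MPE the same recursion is applied to phone sequences instead.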
A comparison of the MMI, MCE and MPE criteria on the Wall Street Journal (WSJ) task and a general framework for discriminative training is given in [111]. Though all the criteria significantly outperformed ML training, MCE and MPE were found to outperform MMI on this task. In [198], MPFE is shown to give small but consistent gains over MPE.
4.2 Implementation Issues
In Architecture of an HMM-Based Recogniser, the use of the EM algorithm to train the parameters of an HMM using the ML criterion was described. This section briefly discusses some of the implementation issues that need to be addressed when using discriminative training criteria. Two aspects are considered. Firstly, the form of algorithm used to estimate the model parameters is described. Secondly, since these discriminative criteria can overtrain, various techniques for improving generalisation are detailed.
4.2.1 Parameter Estimation
EM cannot be used to estimate parameters when discriminative criteria are used. To see why this is the case, note that the MMI criterion can be expressed as the difference of two log-likelihood expressions:

$$
\mathcal{F}_{\text{mmi}}(\lambda) = \frac{1}{R} \sum_{r=1}^{R} \left( \log\left( p(Y^{(r)} \mid w^{(r)}_{\text{ref}}; \lambda)\, P(w^{(r)}_{\text{ref}}) \right) - \log\left( \sum_{w} p(Y^{(r)} \mid w; \lambda)\, P(w) \right) \right). \qquad (4.9)
$$
The first term is referred to as the numerator term and the second, the denominator term. The numerator term is identical to the standard ML criterion (the language model term, P(w^{(r)}_ref), does not influence the ML estimate). The denominator may also be rewritten in a similar fashion to the numerator by producing a composite HMM with parameters λ_den. Hence, although auxiliary functions may be computed for each of these, the difference of two lower bounds is not itself a lower bound and so standard EM cannot be used. To handle this problem, the extended Baum-Welch (EBW) criterion was proposed [64, 129]. In this case, standard EM-like auxiliary functions are defined for the numerator and denominator, but stability during re-estimation is achieved by adding scaled current model parameters to the numerator statistics. The weighting of these parameters is specified by a constant D. A large enough value of D can be shown to guarantee that the criterion does not decrease. For MMI, the update of the mean parameters then becomes
$$
\hat{\mu}^{(jm)} = \frac{ \sum_{r=1}^{R} \sum_{t=1}^{T^{(r)}} \left( \gamma^{(rjm)}_{\text{num},t} - \gamma^{(rjm)}_{\text{den},t} \right) y^{(r)}_t + D \mu^{(jm)} }{ \sum_{r=1}^{R} \sum_{t=1}^{T^{(r)}} \left( \gamma^{(rjm)}_{\text{num},t} - \gamma^{(rjm)}_{\text{den},t} \right) + D }, \qquad (4.10)
$$
where μ^{(jm)} are the current model parameters, γ^{(rjm)}_{num,t} is the same as the posterior used in ML estimation, and

$$
\gamma^{(rjm)}_{\text{den},t} = P(\theta_t = s_{jm} \mid Y^{(r)}; \lambda_{\text{den}}), \qquad (4.11)
$$

where λ_den is the composite HMM representing all competing paths.
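The EBW mean update of (4.10) is straightforward to sketch for a single Gaussian component; the fragment below is an illustrative implementation (function name and array layout are assumptions), taking per-frame numerator and denominator occupancies for one utterance.

```python
import numpy as np

def ebw_mean_update(num_gamma, den_gamma, obs, mu_old, D):
    """Extended Baum-Welch mean update of (4.10) for one Gaussian
    component: numerator and denominator occupancies weight the
    observations, and D * mu_old stabilises the re-estimate.
    num_gamma, den_gamma: per-frame occupancies (length T);
    obs: T x d observation matrix; mu_old: current mean (length d)."""
    diff = num_gamma - den_gamma
    numer = (diff[:, None] * obs).sum(axis=0) + D * mu_old
    denom = diff.sum() + D
    return numer / denom
```

With D = 0 and zero denominator occupancies this reduces to the ML weighted mean; as D grows the update shrinks towards the current mean, which is how stability is bought at the price of slower convergence.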
One issue with EBW is the speed of convergence, which may be very slow. This may be made faster by making D Gaussian-component specific [136]. A similar expression can be derived from a weak-sense auxiliary function perspective [136]. Here an auxiliary function is defined which, rather than being a strict lower bound, simply has the same gradient as the criterion at the current model parameters. Similar expressions can be derived for the MPE criterion, and recently large vocabulary training with MCE has also been implemented [119].
For large vocabulary speech recognition systems, where thousands of hours of training data may be used, it is important that the training procedure be as efficient as possible. For discriminative techniques one of the major costs is the accumulation of the statistics for the denominator, since this computation is equivalent to recognising the training data. To improve efficiency, it is common to use lattices as a compact representation of the most likely competing word sequences and only accumulate statistics at each iteration for paths in the lattice [174].
4.2.2 Generalisation

Compared to ML training, it has been observed that discriminative criteria tend to generalise less well. To mitigate this, a number of techniques have been developed that improve the robustness of the estimates.

As a consequence of the conditional independence assumptions, the posterior probabilities computed using the HMM likelihoods tend to have a very large dynamic range, and typically one of the hypotheses dominates. To address this problem, the acoustic model likelihoods are often raised to a fractional power, referred to as acoustic deweighting [182]. Thus when accumulating the statistics, the posteriors are based on
$$
P(w^{(r)}_{\text{ref}} \mid Y^{(r)}; \lambda) = \frac{ p(Y^{(r)} \mid w^{(r)}_{\text{ref}}; \lambda)^{\alpha}\, P(w^{(r)}_{\text{ref}})^{\beta} }{ \sum_{w} p(Y^{(r)} \mid w; \lambda)^{\alpha}\, P(w)^{\beta} }. \qquad (4.12)
$$
In practice β is often set to one and α is set to the inverse of the language model scale-factor described in footnote 2 in Architecture of an HMM-Based Recogniser.
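The effect of the deweighting in (4.12) can be demonstrated with a small sketch (illustrative, over an N-best list rather than all word sequences): raising the acoustic log-likelihoods to a fractional power α flattens the otherwise near-degenerate posteriors.

```python
import math

def scaled_posteriors(acoustic_llh, lm_logp, alpha=1.0, beta=1.0):
    """Posteriors of (4.12): acoustic log-likelihoods scaled by alpha,
    LM log-probabilities by beta, normalised over the hypothesis list
    with a stable softmax."""
    scores = [alpha * a + beta * l for a, l in zip(acoustic_llh, lm_logp)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

With a 100-nat acoustic gap between two hypotheses, the unscaled posterior of the best one is essentially 1; at α = 0.01 the same gap yields a much softer distribution, so competing paths still contribute denominator statistics.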
The form of the language model used in training should in theory match the form used for recognition. However, it has been found that using simpler models, unigrams or heavily pruned bigrams, for training despite using trigrams or fourgrams in decoding improves performance [152]. By weakening the language model, the number of possible confusions is increased, allowing more complex models to be trained given a fixed quantity of training data.

To improve generalisation, "robust" parameter priors may be used when estimating the models. These priors may either be based on the ML parameter estimates [139] or, for example when using MPE training, the MMI estimates [150]. For MPE training, using these priors has been found to be essential for achieving performance gains [139].
4.3 Lightly Supervised and Unsupervised Training

One of the major costs in building an acoustic training corpus is the production of accurate orthographic transcriptions. For some situations, such as broadcast news (BN) which has associated closed captions, approximate transcriptions may be derived. These transcriptions are known to be error-full and thus not suitable for direct use when training detailed acoustic models. However, a number of lightly supervised training techniques have been developed to overcome this [23, 94, 128].

The general approach to lightly supervised training is exemplified by the procedure commonly used for training on BN data using closed captions (CC). This procedure consists of the following stages:

(1) Construct a language model (LM) using only the CC data. This CC LM is then interpolated with a general BN LM using interpolation weights heavily weighted towards the CC LM. For example in [56] the interpolation weights were 0.9 and 0.1.² This yields a biased language model.
(2) Recognise the audio data using an existing acoustic model and the biased LM trained in (1).

(3) Optionally post-process the data. For example, only use segments from the training data where the recognition output from (2) is consistent to some degree with the CCs, or only use segments with high confidence in the recognition output.

(4) Use the selected segments for acoustic model training with the hypothesised transcriptions from (2).
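Stage (3) can be sketched as a simple selection filter. The agreement measure and threshold below are illustrative placeholders only: a real system would align the caption and hypothesis word sequences (e.g. with an edit-distance alignment) and typically combine this with confidence scores.

```python
def select_segments(segments, max_disagreement=0.2):
    """Sketch of stage (3): keep segments whose hypothesised
    transcription (stage 2 output) agrees well enough with the closed
    captions. Agreement is measured crudely here by word overlap;
    `segments` is a list of (closed_caption, hypothesis) word-sequence
    pairs, and the 0.2 threshold is an arbitrary illustration."""
    selected = []
    for cc, hyp in segments:
        overlap = len(set(cc) & set(hyp)) / max(len(set(hyp)), 1)
        if 1.0 - overlap <= max_disagreement:
            selected.append((cc, hyp))
    return selected
```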
For acoustic data which has no transcriptions at all, unsupervised training techniques must be used [86, 110]. Unsupervised training is similar to lightly supervised training except that a general recognition language model is used, rather than the biased language model for lightly supervised training. This reduces the accuracy of the transcriptions [23], but has been successfully applied to a range of tasks [56, 110]. However, the gains from unsupervised approaches can decrease dramatically when discriminative training such as MPE is used. If the mismatch between the supervised training data and unsupervised data is large, the accuracy of the numerator transcriptions becomes very poor. The gains obtained from using the unannotated data then become small [179]. Thus currently there appears to be a limit to how poor the transcriptions can be whilst still obtaining worthwhile gains from discriminative training. This problem can be partly overcome by using the recognition output to guide the selection of data to manually transcribe [195].

² The interpolation weights were found to be relatively insensitive to the type of source data when tuned on held-out data.
5 Adaptation and Normalisation
A fundamental idea in statistical pattern classification is that the training data should adequately represent the test data, otherwise a mismatch will occur and recognition accuracy will be degraded. In the case of speech recognition, there will always be new speakers who are poorly represented by the training data, and new hitherto unseen environments. The solution to these problems is adaptation. Adaptation allows a small amount of data from a target speaker to be used to transform an acoustic model set to make it more closely match that speaker. It can be used both in training to make more specific or more compact recognition sets, and it can be used in recognition to reduce mismatch and the consequent recognition errors.

There are varying styles of adaptation which affect both the possible applications and the method of implementation. Firstly, adaptation can be supervised, in which case accurate transcriptions are available for all of the adaptation data, or it can be unsupervised, in which case the required transcriptions must be hypothesised. Secondly, adaptation can be incremental or batch-mode. In the former case, adaptation data becomes available in stages, for example, as is the case for a spoken dialogue system when a new caller comes on the line. In batch-mode, all of the adaptation data is available from the start, as is the case in offline transcription.

A range of adaptation and normalisation approaches have been proposed. This section gives an overview of some of the more common schemes. First feature-based approaches, which only depend on the acoustic features, are described. Then maximum a posteriori adaptation and cluster-based approaches are described. Finally, linear transformation-based approaches are discussed. Many of these approaches can be used within an adaptive training framework, which is the final topic.
5.1 Feature-Based Schemes

Feature-based schemes only depend on the acoustic features. This is only strictly true of the first two approaches: mean and variance normalisation, and Gaussianisation. The final approach, vocal tract length normalisation, can be implemented in a range of ways, many of which do not fully fit within this definition.
5.1.1 Mean and Variance Normalisation

Cepstral mean normalisation (CMN) removes the average value of the feature vector from each observation. Since cepstra, in common with most front-end feature sets, are derived from log spectra, this has the effect of reducing sensitivity to channel variation, provided that the channel effects do not significantly vary with time or the amplitude of the signal. Cepstral variance normalisation (CVN) scales each individual feature coefficient to have unit variance, and empirically this has been found to reduce sensitivity to additive noise [71].

For transcription applications where multiple passes over the data are possible, the necessary mean and variance statistics should be computed over the longest possible segment of speech for which the speaker and environment conditions are constant. For example, in BN transcription this will be a speaker segment and in telephone transcription it will be a whole side of a conversation. Note that for real-time systems which operate in a single continuous pass over the data, the mean and variance statistics must be computed as running averages.
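The batch-mode form of CMN/CVN is a one-line computation per segment; a minimal sketch (illustrative function name, with a small floor on the standard deviation to avoid division by zero for constant coefficients):

```python
import numpy as np

def cmn_cvn(features):
    """Per-segment cepstral mean and variance normalisation: each
    coefficient of the T x d feature matrix is shifted to zero mean
    and scaled to unit variance over the segment."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / np.maximum(std, 1e-8)
```

For the single-pass real-time case mentioned above, the mean and variance would instead be maintained as exponentially decaying running averages and applied frame by frame.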
5.1 FeatureBased Schemes 49
5.1.2 Gaussianisation

Given that normalising the first and second-order statistics yields improved performance, an obvious extension is to normalise the higher-order statistics for the data from, for example, a specific speaker, so that the overall distribution of all the features from that speaker is Gaussian. This so-called Gaussianisation is performed by finding a transform y = φ(ỹ) that yields

$$
y \sim \mathcal{N}(0, I), \qquad (5.1)
$$

where 0 is the d-dimensional zero vector and I is a d-dimensional identity matrix. In general, performing Gaussianisation on the complete feature vector is highly complex. However, if each element of the feature vector is treated independently, then there are a number of schemes that may be used to define φ(·).
One standard approach is to estimate a cumulative distribution function (CDF) for each feature element ỹ_i, F(ỹ_i). This can be estimated using all the data points from, for example, a speaker in a histogram approach. Thus for dimension i of base observation ỹ

$$
F(\tilde{y}_i) \approx \frac{1}{T} \sum_{t=1}^{T} h(\tilde{y}_i - \tilde{y}_{ti}) = \operatorname{rank}(\tilde{y}_i)/T, \qquad (5.2)
$$
where ỹ_1, ..., ỹ_T are the data, h(·) is the step function and rank(ỹ_i) is the rank of element i of ỹ when the data are sorted. The transformation required such that the transformed dimension i of observation y is approximately Gaussian is

$$
y_i = \Phi^{-1}\left( \frac{\operatorname{rank}(\tilde{y}_i)}{T} \right), \qquad (5.3)
$$

where Φ(·) is the CDF of a Gaussian [148]. Note, however, that although each dimension will be approximately Gaussian distributed, the complete feature vector will not necessarily satisfy (5.1) since the elements may be correlated with one another.
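The rank-based mapping of (5.2)-(5.3) can be sketched for one feature dimension using the standard-library inverse Gaussian CDF. The half-rank offset below is an assumption of this sketch, used so that the extreme ranks do not map to ±∞:

```python
from statistics import NormalDist

def gaussianise(values):
    """Rank-based Gaussianisation of one feature dimension, as in
    (5.2)-(5.3): map each value through the empirical CDF, then through
    the inverse Gaussian CDF. Ranks are offset by 0.5 to keep the
    inverse CDF finite at the extremes."""
    T = len(values)
    order = sorted(range(T), key=lambda i: values[i])
    ranks = [0] * T
    for r, i in enumerate(order):
        ranks[i] = r                      # rank of each original element
    nd = NormalDist()
    return [nd.inv_cdf((r + 0.5) / T) for r in ranks]
```

The mapping is monotonic, so the ordering of the data is preserved while the marginal distribution is forced towards N(0, 1); the GMM-based CDF of (5.4) smooths exactly this empirical CDF estimate.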
One difficulty with this approach is that when the normalisation data set is small, the CDF estimate, F(ỹ_i), can be noisy. An alternative approach is to estimate an M-component GMM on the data and then use this to approximate the CDF [26], that is

$$
y_i = \Phi^{-1}\left( \int_{-\infty}^{\tilde{y}_i} \sum_{m=1}^{M} c_m\, \mathcal{N}(z;\, \mu^{(m)}_i, \sigma^{(m)2}_i)\, dz \right), \qquad (5.4)
$$

where μ^{(m)}_i and σ^{(m)2}_i are the mean and variance of the mth GMM component of dimension i trained on all the data. This results in a smoother and more compact representation of the Gaussianisation transformation [55].
Both these forms of Gaussianisation have assumed that the dimensions are independent of one another. To reduce the impact of this assumption, Gaussianisation is often applied after some form of decorrelating transform. For example, a global semi-tied transform can be applied prior to the Gaussianisation process.
5.1.3 Vocal Tract Length Normalisation

Variations in vocal tract length cause formant frequencies to shift in frequency in an approximately linear fashion. Thus, one simple form of normalisation is to linearly scale the filter bank centre frequencies within the front-end feature extractor to approximate a canonical formant frequency scaling [96]. This is called vocal tract length normalisation (VTLN).

To implement VTLN two issues need to be addressed: definition of the scaling function and estimation of the appropriate scaling function parameters for each speaker. Early attempts at VTLN used a simple linear mapping, but as shown in Figure 5.1(a) this results in a problem at high frequencies, where female voices have no information in the upper frequency band and male voices have the upper frequency band truncated. This can be mitigated by using a piecewise linear function of the form shown in Figure 5.1(b) [71]. Alternatively, a bilinear transform can be used [120]. Parameter estimation is performed using a grid search plotting log likelihoods against parameter values. Once the optimal values for all training speakers have been computed, the training data is normalised and the acoustic models re-estimated. This is repeated until the VTLN parameters have stabilised.

Fig. 5.1 Vocal tract length normalisation.

Note here that when comparing log likelihoods resulting from differing VTLN transformations, the Jacobian of the transform should be included. This is however very complex to estimate, and since the application of mean and variance normalisation will reduce the effect of this approximation, it is usually ignored.
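A piecewise-linear warping of the kind shown in Figure 5.1(b) can be sketched as follows. The cut-off frequency and the normalisation of frequencies to [0, 1] are assumptions of this sketch, not values from the text; the key property is that the full band is preserved for any warp factor.

```python
def vtln_warp(f, alpha, f_cut=0.8, f_max=1.0):
    """Piecewise-linear VTLN warping in the style of Figure 5.1(b):
    frequencies below the cut-off are scaled by the warp factor alpha;
    above it, a second linear segment maps the remainder onto
    [alpha * f_cut, f_max] so that the band edge is preserved.
    Frequencies are normalised to [0, 1]; f_cut = 0.8 is illustrative."""
    if f <= f_cut:
        return alpha * f
    slope = (f_max - alpha * f_cut) / (f_max - f_cut)
    return alpha * f_cut + slope * (f - f_cut)
```

In a grid search, a recogniser would evaluate the adaptation data's log likelihood for each candidate alpha (typically around 0.88 to 1.12) and pick the best; the warp is then applied to the filter bank centre frequencies.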
For very large systems, the overhead incurred from iteratively computing the optimal VTLN parameters can be considerable. An alternative is to approximate the effect of VTLN by a linear transform. The advantage of this approach is that the optimal transformation parameters can be determined from the auxiliary function in a single pass over the data [88].

VTLN is particularly effective for telephone speech where speakers can be clearly identified. It is less effective for other applications such as broadcast news transcription where speaker changes must be inferred from the data.
5.2 Linear Transform-Based Schemes

For cases where adaptation data is limited, linear transformation based schemes are currently the most effective form of adaptation. These differ from feature-based approaches in that they use the acoustic model parameters and they require a transcription of the adaptation data.
5.2.1 Maximum Likelihood Linear Regression

In maximum likelihood linear regression (MLLR), a set of linear transforms is used to map an existing model set into a new adapted model set such that the likelihood of the adaptation data is maximised. MLLR is generally very robust and well suited to unsupervised incremental adaptation. This section presents MLLR in the form of a single global linear transform for all Gaussian components. The multiple transform case, where different transforms are used depending on the Gaussian component to be adapted, is discussed later.

There are two main variants of MLLR: unconstrained and constrained [50, 97]. In unconstrained MLLR, separate transforms are trained for the means and variances:

$$
\hat{\mu}^{(sm)} = A^{(s)} \mu^{(m)} + b^{(s)}; \qquad \hat{\Sigma}^{(sm)} = H^{(s)} \Sigma^{(m)} H^{(s)\mathsf T}, \qquad (5.5)
$$
where s indicates the speaker. Although (5.5) suggests that the likelihood calculation is expensive to compute unless H is constrained to be diagonal, it can in fact be made efficient using the following equality:

$$
\mathcal{N}\left( y;\, \hat{\mu}^{(sm)}, \hat{\Sigma}^{(sm)} \right) = \frac{1}{|H^{(s)}|}\, \mathcal{N}\left( H^{(s)-1} y;\; H^{(s)-1} A^{(s)} \mu^{(m)} + H^{(s)-1} b^{(s)},\; \Sigma^{(m)} \right). \qquad (5.6)
$$

If the original covariances are diagonal (as is common), then by appropriately caching the transformed observations and means, the likelihood can be calculated at the same cost as when using the original diagonal covariance matrices.
For MLLR there are no constraints between the adaptation applied to the means and the covariances. If the two matrix transforms are constrained to be the same, then a linear transform related to the feature-space transforms described earlier may be obtained. This is constrained MLLR (CMLLR):

$$
\hat{\mu}^{(sm)} = \tilde{A}^{(s)} \mu^{(m)} + \tilde{b}^{(s)}; \qquad \hat{\Sigma}^{(sm)} = \tilde{A}^{(s)} \Sigma^{(m)} \tilde{A}^{(s)\mathsf T}. \qquad (5.7)
$$
In this case, the likelihood can be expressed as

$$
\mathcal{N}(y;\, \hat{\mu}^{(sm)}, \hat{\Sigma}^{(sm)}) = |A^{(s)}|\, \mathcal{N}(A^{(s)} y + b^{(s)};\; \mu^{(m)}, \Sigma^{(m)}), \qquad (5.8)
$$

where

$$
A^{(s)} = \tilde{A}^{(s)-1}; \qquad b^{(s)} = -\tilde{A}^{(s)-1} \tilde{b}^{(s)}. \qquad (5.9)
$$
Thus with this constraint, the actual model parameters are not transformed, making this form of transform efficient if the speaker (or acoustic environment) changes rapidly. CMLLR is the form of linear transform most often used for adaptive training, discussed later.
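The feature-space evaluation of (5.8) can be sketched directly; the fragment below (illustrative names) assumes diagonal model covariances, as is common, and adds log|A| as the Jacobian term.

```python
import numpy as np

def cmllr_log_likelihood(y, A, b, mu, var):
    """Log-likelihood under CMLLR, following (5.8): transform the
    observation by A y + b, evaluate the unadapted Gaussian with
    diagonal covariance `var`, and add log|A| as the Jacobian term."""
    x = A @ y + b
    d = len(mu)
    ll = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var))
                 + np.sum((x - mu) ** 2 / var))
    return ll + np.log(abs(np.linalg.det(A)))
```

Because only the observation is transformed, the same (A, b) pair serves every Gaussian in the system, which is why the transform is cheap to swap when the speaker or environment changes.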
For both forms of linear transform, the matrix transformation may be full, block-diagonal, or diagonal. For a given amount of adaptation data, more diagonal transforms may be reliably estimated than full ones. However, in practice, full transforms normally outperform larger numbers of diagonal transforms [126]. Hierarchies of transforms of different complexities may also be used [36].
5.2.2 Parameter Estimation

The maximum likelihood estimation formulae for the various forms of linear transform are given in [50, 97]. Whereas there are closed-form solutions for unconstrained mean MLLR, the constrained and unconstrained variance cases are similar to the semi-tied covariance transform discussed in HMM Structure Refinements and they require an iterative solution.

Both forms of linear transform require transcriptions of the adaptation data in order to estimate the model parameters. For supervised adaptation, the transcription is known and may be directly used without further consideration. When used in unsupervised mode, the transcription must be derived from the recogniser output and in this case, MLLR is normally applied iteratively [183] to ensure that the best hypothesis for estimating the transform parameters is used. First, the unknown speech is recognised, then the hypothesised transcription is used to estimate MLLR transforms. The unknown speech is then re-recognised using the adapted models. This is repeated until convergence is achieved.

Using this approach, all words within the hypothesis are treated as equally probable. A refinement is to use recognition lattices in place of the 1-best hypothesis to accumulate the adaptation statistics. This approach is more robust to recognition errors and avoids the need to re-recognise the data since the lattice can be simply rescored [133]. An alternative use of lattices is to obtain confidence scores, which may then be used for confidence-based MLLR [172]. A different approach using N-best lists was proposed in [118, 194] whereby a separate transform was estimated for each hypothesis and used to rescore only that hypothesis. In [194], this is shown to be a lower-bound approximation where the transform parameters are considered as latent variables.
The initial development of transform-based adaptation methods used the ML criterion; this was then extended to include maximum a posteriori estimation [28]. Linear transforms can also be estimated using discriminative criteria [171, 178, 180]. For supervised adaptation, any of the standard approaches may be used. However, if unsupervised adaptation is used, for example in BN transcription systems, then there is an additional concern. Since discriminative training schemes attempt to modify the parameters so that the posterior of the transcription (or a function thereof) is improved, they are more sensitive to errors in the transcription hypotheses than ML estimation. This is the same issue as was observed for unsupervised discriminative training and, in practice, discriminative unsupervised adaptation is not commonly used.
5.2.3 Regression Class Trees
A powerful feature of linear transform-based adaptation is that it allows all the acoustic models to be adapted using a variable number of transforms. When the amount of adaptation data is limited, a global transform can be shared across all Gaussians in the system. As the amount of data increases, the HMM state components can be grouped into regression classes, with each class having its own transform, for example A^{(r)} for regression class r. As the amount of data increases further, the number of classes and therefore transforms can be increased correspondingly to give better and better adaptation [97].
The number of transforms to use for any specific set of adaptation data can be determined automatically using a regression class tree
Fig. 5.2 A regression class tree.
as illustrated in Figure 5.2. Each node represents a regression class, i.e., a set of Gaussian components which will share a single transform. The total occupation count associated with any node in the tree can easily be computed since the counts are known at the leaf nodes. Then, for a given set of adaptation data, the tree is descended and the most specific set of nodes is selected for which there is sufficient data (for example, the shaded nodes in Figure 5.2). Regression class trees may either be specified using expert knowledge or, more commonly, built automatically by assuming that Gaussian components that are "close" to one another are transformed using the same linear transform [97].
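The descent just described can be sketched as follows. The node representation, the occupancy threshold, and the rule of splitting only when every child has enough data are illustrative simplifications, not the exact algorithm of [97].

```python
# Hypothetical sketch of regression-class selection. A node is a dict with
# either a 'gaussians' list (leaf) or a 'children' list (internal node).
# The occupation-count threshold and the split rule used here (descend only
# when every child has enough data) are simplifying assumptions.

def select_regression_classes(node, counts, threshold):
    """Return the most specific nodes with sufficient adaptation data."""
    def total(n):
        # Occupation count of a node = sum over the Gaussians beneath it.
        if 'gaussians' in n:
            return sum(counts.get(g, 0.0) for g in n['gaussians'])
        return sum(total(c) for c in n['children'])

    def descend(n):
        kids = n.get('children', [])
        if kids and all(total(c) >= threshold for c in kids):
            selected = []
            for c in kids:
                selected.extend(descend(c))
            return selected
        # Not enough data to split further: share one transform at this node.
        return [n]

    return descend(node) if total(node) >= threshold else []
```

With plentiful data this selects many leaf classes; as the data shrinks, transforms are shared at progressively higher nodes, degrading gracefully to a single global transform.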
5.3 Gender/Cluster Dependent Models
The models described above have treated all the training data as a single block, with no information about the individual segments of training data. In practice it is possible to record, or infer, information about each segment, for example the gender of the speaker. If this gender information is then used in training, the result is referred to as a gender-dependent (GD) system. One issue with training these GD models is that during recognition the gender of the test speaker must be determined. As this is a single binary latent variable, it can normally be rapidly and robustly estimated.
It is also possible to extend the number of clusters beyond simple gender-dependent models. For example, speakers can be automatically grouped to form clusters [60].
Rather than considering a single cluster to represent the speaker (and possibly acoustic environment), multiple clusters or so-called eigenvoices may be combined. Two approaches have been adopted to combining the clusters: likelihood combination and parameter combination. In addition to the combination method, the form of representation of each cluster must be decided.
The simplest approach is to use a mixture of eigenvoices. Here

p(Y_{1:T} | ν^{(s)}; λ) = Σ_{k=1}^{K} ν_k^{(s)} p(Y_{1:T}; λ^{(k)}),    (5.10)
where λ^{(k)} are the model parameters associated with the kth cluster, s indicates the speaker, and

Σ_{k=1}^{K} ν_k^{(s)} = 1,    ν_k^{(s)} ≥ 0.    (5.11)
Thus gender-dependent models are a specific case where ν_k^{(s)} = 1 for the "correct" gender and zero otherwise. One problem with this approach is that all K cluster likelihoods must be computed for each observation sequence. The training of these forms of model can be implemented using EM. This form of interpolation can also be combined with linear transforms [35].
Rather than interpolating likelihoods, the model parameters themselves can be interpolated. Thus

p(Y_{1:T} | ν^{(s)}; λ) = p( Y_{1:T}; Σ_{k=1}^{K} ν_k^{(s)} λ^{(k)} ).    (5.12)
Current approaches limit interpolation to the mean parameters, and the remaining variances, priors and transition probabilities are tied over all the clusters:

μ̂^{(sm)} = Σ_{k=1}^{K} ν_k^{(s)} μ^{(km)}.    (5.13)
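The interpolation in (5.13) is just a weighted sum over the cluster axis; a minimal sketch, with illustrative array shapes:

```python
import numpy as np

def interpolate_means(cluster_means, weights):
    """Eigenvoice-style mean interpolation, eq (5.13).

    cluster_means -- (K, M, D) array: K clusters, M Gaussians, dimension D
    weights       -- (K,) speaker weight vector nu^{(s)}
    Returns the (M, D) adapted means mu_hat^{(sm)}.
    """
    cluster_means = np.asarray(cluster_means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # mu_hat^{(sm)} = sum_k nu_k^{(s)} mu^{(km)}
    return np.tensordot(weights, cluster_means, axes=1)
```

Setting the weight vector to a one-hot vector recovers a gender- or cluster-dependent model as a special case.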
Various forms of this model have been investigated: reference speaker weighting (RSW) [73], EigenVoices [91], and cluster adaptive training (CAT) [52]. All use the same interpolation of the means, but differ in how the eigenvoices are estimated and how the interpolation weight vector, ν^{(s)}, is estimated. RSW is the simplest approach, using each speaker as an eigenvoice. In the original implementation of EigenVoices, the cluster means were determined using PCA on the extended mean supervector, where all the means of the models for each speaker were concatenated together. In contrast, CAT uses ML to estimate the cluster parameters. Note that CAT parameters are estimated using an adaptive training style which is discussed in more detail below.
Having estimated the cluster parameters in training, the interpolation weights, ν^{(s)}, are estimated for the test speaker. For both EigenVoices and CAT, the ML-based smoothing approach in [49] is used. In contrast, RSW uses ML with the added constraint that the weights satisfy (5.11). This complicates the optimisation criterion and hence it is not often used.
A number of extensions to this basic parameter interpolation framework have been proposed. In [52], linear transforms were used to represent each cluster for CAT; this is also referred to as EigenMLLR [24] when used in an EigenVoice-style approach. Kernelised versions of both EigenVoices [113] and EigenMLLR [112] have also been studied. Finally, the use of discriminative criteria to obtain the cluster parameters has been derived [193].
5.4 Maximum a Posteriori (MAP) Adaptation
Rather than hypothesising a form of transformation to represent the differences between speakers, it is possible to use standard statistical approaches to obtain robust parameter estimates. One common approach is maximum a posteriori (MAP) adaptation where, in addition to the adaptation data, a prior over the model parameters, p(λ), is used to estimate the model parameters. Given some adaptation data Y^{(1)}, ..., Y^{(R)} and a model with parameters λ, MAP-based parameter
estimation seeks to maximise the following objective function:

F_{map}(λ) = F_{ml}(λ) + (1/R) log p(λ)
           = [ (1/R) Σ_{r=1}^{R} log p(Y^{(r)} | w_{ref}^{(r)}; λ) ] + (1/R) log p(λ).    (5.14)
Comparing this with the ML objective function given in (4.1), it can be seen that the likelihood is weighted by the prior. Note that MAP can also be used with other training criteria such as MPE [137]. The choice of distribution for this prior is problematic since there is no conjugate prior density for a continuous Gaussian mixture HMM. However, if the mixture weights and Gaussian component parameters are assumed to be independent, then a finite mixture density of the form p_D(c_j) Π_m p_W(μ^{(jm)}, Σ^{(jm)}) can be used, where p_D(·) is a Dirichlet distribution over the vector of mixture weights c_j and p_W(·) is a normal-Wishart density. It can then be shown that this leads to parameter estimation formulae of the form [61]
μ̂^{(jm)} = ( τ μ^{(jm)} + Σ_{r=1}^{R} Σ_{t=1}^{T^{(r)}} γ_t^{(rjm)} y_t^{(r)} ) / ( τ + Σ_{r=1}^{R} Σ_{t=1}^{T^{(r)}} γ_t^{(rjm)} ),    (5.15)
where μ^{(jm)} is the prior mean and τ is a parameter of p_W(·) which is normally determined empirically. Similar, though rather more complex, formulae can be derived for the variances and mixture weights [61].
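A minimal sketch of the mean update in (5.15) for a single Gaussian component; the shapes and names are illustrative:

```python
import numpy as np

def map_mean(prior_mean, tau, gammas, obs):
    """MAP update of a Gaussian mean, eq (5.15).

    prior_mean -- (D,) prior mean mu^{(jm)}
    tau        -- scalar prior weight (empirically chosen)
    gammas     -- (T,) posterior occupancies gamma_t for this component
    obs        -- (T, D) adaptation observations y_t
    """
    gammas = np.asarray(gammas, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # Interpolate the prior mean with the data statistics, weighted by tau.
    num = tau * np.asarray(prior_mean, dtype=float) + gammas @ obs
    return num / (tau + gammas.sum())
```

With no adaptation data the estimate stays at the prior mean; as the occupancy counts grow, it tends to the ML estimate from the adaptation data, which is exactly the interpolation behaviour discussed below.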
Comparing (5.15) with (2.13), it can be seen that MAP adaptation effectively interpolates the original prior parameter values with those that would be obtained from the adaptation data alone. As the amount of adaptation data increases, the parameters tend asymptotically to the adaptation domain. This is a desirable property and it makes MAP especially useful for porting a well-trained model set to a new domain for which there is only a limited amount of data.
A major drawback of MAP adaptation is that every Gaussian component is updated individually. If the adaptation data is sparse, then many of the model parameters will not be updated. Various attempts have been made to overcome this (e.g. [4, 157]) but MAP nevertheless remains ill-suited for rapid incremental adaptation.
5.5 Adaptive Training
Ideally, an acoustic model set should encode just those dimensions which allow the different classes to be discriminated. However, in the case of speaker independent (SI) speech recognition, the training data necessarily includes a large number of speakers. Hence, acoustic models trained directly on this data will have to "waste" a large number of parameters encoding the variability between speakers rather than the variability between spoken words, which is the true aim. One approach to handling this problem is to use adaptation transforms during training. This is referred to as speaker adaptive training (SAT) [5].
The concept behind SAT is illustrated in Figure 5.3. For each training speaker, a transform is estimated, and then the canonical model is estimated given all of these speaker transforms. The complexity of performing adaptive training depends on the nature of the adaptation/normalisation transform. These may be split into three groups.

• Model independent: These schemes do not make explicit use of any model information. CMN/CVN and Gaussianisation belong to this group. VTLN may also be independent of the model, for example if estimated using attributes of the signal [96]. In these cases the features are transformed and the models trained as usual.
Fig. 5.3 Adaptive training.
• Feature transformation: These transforms also act on the features but are derived, normally using ML estimation, from the current estimate of the model set. Common versions of these feature transforms are VTLN using ML-style estimation, or the linear approximation [88], and constrained MLLR [50].

• Model transformation: The model parameters, means and possibly variances, are transformed. Common schemes are SAT using MLLR [5] and CAT [52].
For schemes where the transforms are dependent on the model parameters, the estimation is performed by an iterative process:

(1) initialise the canonical model using a speaker-independent system and set the transform for each speaker to be an identity transform (A^{(s)} = I, b^{(s)} = 0);
(2) estimate a transform for each speaker using only the data for that speaker;
(3) estimate the canonical model given each of the speaker transforms;
(4) go to (2) unless converged or some maximum number of iterations is reached.
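The iterative procedure above can be sketched as a generic loop. The two estimation routines are passed in as callables since their exact form depends on the transform used (MLLR, CMLLR, CAT); everything here, including the toy bias-only example in the test, is illustrative.

```python
# Sketch of the speaker-adaptive-training loop. Step (1), initialising with
# identity transforms, is implicit in starting from the SI model; the loop
# then alternates steps (2) and (3) for a fixed number of iterations.

def speaker_adaptive_training(data_by_speaker, init_model,
                              estimate_transform, estimate_canonical,
                              n_iters=4):
    model = init_model
    transforms = {}
    for _ in range(n_iters):
        # (2) one transform per speaker, from that speaker's data only
        transforms = {s: estimate_transform(model, d)
                      for s, d in data_by_speaker.items()}
        # (3) canonical model given all the speaker transforms
        model = estimate_canonical(transforms, data_by_speaker)
    return model, transforms
```

With a scalar "model" and a bias-only "transform", the loop reduces to repeatedly removing per-speaker offsets before pooling the data, which is the essence of why the canonical model wastes fewer parameters on inter-speaker variability.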
When MLLR is used, the estimation of the canonical model is more expensive [5] than standard SI training. This is not the case for CAT, although additional parameters must be stored. The most common version of adaptive training uses CMLLR, since it is the simplest to implement. Using CMLLR, the canonical mean is estimated by
μ̂^{(jm)} = ( Σ_{s=1}^{S} Σ_{t=1}^{T^{(s)}} γ_t^{(sjm)} ( A^{(s)} y_t^{(s)} + b^{(s)} ) ) / ( Σ_{s=1}^{S} Σ_{t=1}^{T^{(s)}} γ_t^{(sjm)} ),    (5.16)
where S denotes the number of speakers and γ_t^{(sjm)} is the posterior occupation probability of component m of state s_j generating the observation at time t, based on the data for speaker s. In practice, multiple forms of adaptive training can be combined, for example using Gaussianisation with CMLLR-based SAT [55].
Finally, note that SAT-trained systems incur the problem that they can only be used once transforms have been estimated for the test data. Thus, an SI model set is typically retained to generate the initial hypothesised transcription or lattice needed to compute the first set of transforms. Recently, a scheme for dealing with this problem using transform priors has been proposed [194].
6 Noise Robustness
In Adaptation and Normalisation, a variety of schemes were described for adapting a speech recognition system to changes in speaker and environment. One very common and often extreme form of environment change is caused by ambient noise. Although the general adaptation techniques described so far can reduce the effects of noise, specific noise compensation algorithms can be more effective, especially with very limited adaptation data.
Noise compensation algorithms typically assume that a clean speech signal x_t in the time domain is corrupted by additive noise n_t and a stationary convolutive channel impulse response h_t to give the observed noisy signal y_t [1]. That is,

y_t = x_t ⊗ h_t + n_t,    (6.1)

where ⊗ denotes convolution.
In practice, as explained in Architecture of an HMM-Based Recogniser, speech recognisers operate on feature vectors derived from a transformed version of the log spectrum. Thus, for example, in the mel-cepstral domain, (6.1) leads to the following relationship between the static clean speech, noise and noise corrupted speech observations:^1
y_t^s = x_t^s + h + C log( 1 + exp( C^{-1}( n_t^s − x_t^s − h ) ) )
      = x_t^s + f(x_t^s, n_t^s, h),    (6.2)
where C is the DCT matrix. For a given set of noise conditions, the observed (static) speech vector y_t^s is a highly non-linear function of the underlying clean (static) speech signal x_t^s. In broad terms, a small amount of additive Gaussian noise adds a further mode to the speech distribution. As the signal-to-noise ratio falls, the noise mode starts to dominate, the noisy speech means move towards the noise and the variances shrink. Eventually, the noise dominates and there is little or no information left in the signal. The net effect is that the probability distributions trained on clean data become very poor estimates of the data distributions observed in the noisy environment, and recognition error rates increase rapidly.
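The mismatch function in (6.2) can be implemented directly. For illustration, C is assumed here to be square and invertible, a simplification since in practice the cepstral truncation makes C^{-1} only approximate.

```python
import numpy as np

def mismatch(x, n, h, C):
    """Noisy static cepstrum y = x + h + C log(1 + exp(C^{-1}(n - x - h))),
    i.e. eq (6.2). x, n, h are static cepstral vectors; C is assumed to be
    a square, invertible DCT matrix (an illustrative simplification)."""
    Cinv = np.linalg.inv(C)
    # log1p(exp(.)) is the element-wise log(1 + exp(.)) of the footnote
    return x + h + C @ np.log1p(np.exp(Cinv @ (n - x - h)))
```

With C = I and h = 0 this reduces to y = log(e^x + e^n), the log-domain sum of the speech and noise energies, which makes the high- and low-SNR limits described above easy to verify.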
Of course, the ideal solution to this problem would be to design a feature representation which is immune to noise. For example, PLP features do exhibit some measure of robustness [74] and further robustness can be obtained by band-pass filtering the features over time in a process called relative spectral filtering to give so-called RASTA-PLP coefficients [75]. Robustness to noise can also be improved by training on speech recorded in a variety of noise conditions; this is called multi-style training. However, there is a limit to the effectiveness of these approaches and in practice SNRs below 15 dB or so require active noise compensation to maintain performance.
There are two main approaches to achieving noise robust recognition and these are illustrated in Figure 6.1, which shows the separate clean training and noisy recognition environments. Firstly, in feature compensation, the noisy features are compensated, or enhanced, to remove the effects of the noise. Recognition and training then take place in the clean environment. Secondly, in model compensation, the clean acoustic models are compensated to match the noisy environment. In this latter case, training takes place in the clean environment, but the clean models are then mapped so that recognition can take place directly in the noisy environment. In general, feature compensation is simpler and more efficient to implement, but model compensation has the potential for greater robustness since it can make direct use of the very detailed knowledge of the underlying clean speech signal encoded in the acoustic models.

Fig. 6.1 Approaches to noise robust recognition.

In addition to the above, both feature and model compensation can be augmented by assigning a confidence or uncertainty to each feature. The decoding process is then modified to take account of this additional information. These uncertainty-based methods include missing feature decoding and uncertainty decoding. The following sections discuss each of the above in more detail.

^1 In this section, log(·) and exp(·) applied to a vector perform an element-wise logarithm or exponential over all elements of the vector, and thus yield a vector.
6.1 Feature Enhancement
The basic goal of feature enhancement is to remove the effects of noise from the measured features. One of the simplest approaches to this problem is spectral subtraction (SS) [18]. Given an estimate of the noise spectrum in the linear frequency domain, n^f, a clean speech spectrum is computed simply by subtracting n^f from the observed spectra. Any desired feature vector, such as mel-cepstral coefficients, can then be derived in the normal way from the compensated spectrum. Although spectral subtraction is widely used, it nevertheless has several problems. Firstly, speech/non-speech detection is needed to isolate segments of the spectrum from which a noise estimate can be made, typically the first second or so of each utterance. This procedure is prone to errors. Secondly, smoothing may lead to an underestimate of the noise and occasionally errors lead to an overestimate. Hence, a practical SS scheme takes the form:
x̂_i^f = { y_i^f − α n_i^f,   if x̂_i^f > β n_i^f
        { β n_i^f,           otherwise,    (6.3)

where the f superscript indicates the frequency domain, α is an empirically determined over-subtraction factor and β sets a floor to avoid the clean spectrum going negative [107].
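A minimal sketch of the practical SS rule in (6.3); the α and β values here are illustrative.

```python
import numpy as np

def spectral_subtract(y, n, alpha=2.0, beta=0.1):
    """Practical spectral subtraction, eq (6.3): over-subtract alpha * n
    and floor the result at beta * n. y and n are noisy-speech and
    estimated-noise magnitude spectra in the linear frequency domain;
    the default alpha and beta are illustrative values only."""
    y = np.asarray(y, dtype=float)
    n = np.asarray(n, dtype=float)
    x = y - alpha * n
    # Bins driven negative (or too small) by over-subtraction are floored.
    return np.where(x > beta * n, x, beta * n)
```

The flooring is what keeps the compensated spectrum usable for cepstral analysis, at the cost of the "musical noise" artefacts familiar from SS-based enhancement.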
A more robust approach to feature compensation is to compute a minimum mean square error (MMSE) estimate^2

x̂_t = E{ x_t | y_t }.    (6.4)
The distribution of the noisy speech is usually represented as an N-component Gaussian mixture

p(y_t) = Σ_{n=1}^{N} c_n N( y_t; μ_y^{(n)}, Σ_y^{(n)} ).    (6.5)
If x_t and y_t are then assumed to be jointly Gaussian within each mixture component n, then

E{ x_t | y_t, n } = μ_x^{(n)} + Σ_{xy}^{(n)} ( Σ_y^{(n)} )^{-1} ( y_t − μ_y^{(n)} )
                  = A^{(n)} y_t + b^{(n)}    (6.6)
and hence the required MMSE estimate is just a weighted sum of linear predictions

x̂_t = Σ_{n=1}^{N} P( n | y_t ) ( A^{(n)} y_t + b^{(n)} ),    (6.7)
where the posterior component probability is given by

P( n | y_t ) = c_n N( y_t; μ_y^{(n)}, Σ_y^{(n)} ) / p(y_t).    (6.8)

^2 The derivation given here considers the complete feature vector y_t when predicting the clean speech vector x̂_t. It is also possible to predict just the static clean means x̂_t^s and then derive the delta parameters from these estimates.
A number of different algorithms have been proposed for estimating the parameters A^{(n)}, b^{(n)}, μ_y^{(n)} and Σ_y^{(n)} [1, 123]. If so-called stereo data, i.e., simultaneous recordings of noisy and clean data, is available, then the parameters can be estimated directly. For example, in the SPLICE algorithm [34, 39], the noisy speech GMM parameters μ_y^{(n)} and Σ_y^{(n)} are trained directly on the noisy data using EM. For simplicity, the matrix transform A^{(n)} is assumed to be identity and
b^{(n)} = ( Σ_{t=1}^{T} P( n | y_t ) ( x_t − y_t ) ) / ( Σ_{t=1}^{T} P( n | y_t ) ).    (6.9)
A further simplification in the SPLICE algorithm is to find n^* = argmax_n { P( n | y_t ) } and then x̂_t = y_t + b^{(n^*)}.
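The hard-decision variant can be sketched as follows, assuming diagonal covariances; all parameter values are illustrative.

```python
import numpy as np

def splice_enhance(y, weights, means, variances, biases):
    """Hard-decision SPLICE sketch: pick the most likely component n* of a
    front-end GMM trained on noisy speech, then x_hat = y_t + b^{(n*)}.
    Diagonal covariances are assumed and all parameters here are
    illustrative stand-ins for trained values."""
    y = np.asarray(y, dtype=float)
    scores = []
    for w, m, v in zip(weights, means, variances):
        m, v = np.asarray(m, dtype=float), np.asarray(v, dtype=float)
        # log c_n + log N(y; mu_y^(n), Sigma_y^(n)), diagonal case
        ll = -0.5 * np.sum(np.log(2.0 * np.pi * v) + (y - m) ** 2 / v)
        scores.append(np.log(w) + ll)
    n_star = int(np.argmax(scores))
    return y + np.asarray(biases[n_star], dtype=float)
```

Replacing the argmax with a posterior-weighted sum of all N biases recovers the soft MMSE form of (6.7).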
If stereo data is not available, then the conditional statistics required to estimate A^{(n)} and b^{(n)} can be derived using (6.2). One approach to dealing with the non-linearity is to use a first-order vector Taylor series (VTS) expansion [124] to relate the static speech, noise and speech+noise parameters:
y^s = x^s + f(x_0^s, n_0^s, h_0) + (x^s − x_0^s) ∂f/∂x^s + (n^s − n_0^s) ∂f/∂n^s + (h − h_0) ∂f/∂h,    (6.10)
where the partial derivatives are evaluated at x_0^s, n_0^s, h_0. The convolutional term h is assumed to be a constant channel effect. Given a clean speech GMM and an initial estimate of the noise parameters, the above function is expanded around the mean vector of each Gaussian and the parameters of the corresponding noisy speech Gaussian are estimated. Then a single iteration of EM is used to update the noise parameters. This ability to estimate the noise parameters without having to explicitly identify noise-only segments of the signal is a major advantage of the VTS algorithm.
6.2 Model-Based Compensation
Feature compensation methods are limited by the need to use a very simple model of the speech signal, such as a GMM. Given that the acoustic models within the recogniser provide a much more detailed representation of the speech, better performance can be obtained by transforming these models to match the current noise conditions.
Model-based compensation is usually based on combining the clean speech models with a model of the noise, and for most situations a single-Gaussian noise model is sufficient. However, for some situations with highly varying noise conditions, a multi-state, or multi-Gaussian-component, noise model may be useful. This impacts the computational load in both compensation and decoding. If the speech models have M Gaussian components and the noise models have N components, then the combined models will have MN components. The same state expansion occurs when combining multiple-state noise models, although in this case the decoding framework must also be extended to independently track the state alignments in both the speech and noise models [48, 176]. For most purposes, however, single-Gaussian noise models are sufficient and in this case the complexity of the resulting models and decoder is unchanged.
The aim of model compensation can be viewed as obtaining the
parameters of the speech+noise distribution from the clean speech
model and the noise model. Most model compensation methods assume
that if the speech and noise models are Gaussian then the combined
noisy model will also be Gaussian.
Initially, only the static parameter means and variances will be considered. The parameters of the corrupted speech distribution, N(μ_y, Σ_y), can be found from

μ_y = E{ y^s }    (6.11)
Σ_y = E{ y^s y^{sT} } − μ_y μ_y^T,    (6.12)

where y^s are the corrupted speech "observations" obtained from a particular speech component combined with noise "observations" from the noise model. There is no simple closed-form solution to the expectation in these equations, so various approximations have been proposed.
One approach to model-based compensation is parallel model combination (PMC) [57]. Standard PMC assumes that the feature vectors are linear transforms of the log spectrum, such as MFCCs, and that the noise is primarily additive (see [58] for the convolutive noise case). The basic idea of the algorithm is to map the Gaussian means and variances in the cepstral domain back into the linear domain where the noise is additive, compute the means and variances of the new combined speech+noise distribution, and then map back to the cepstral domain.
The mapping of a Gaussian N(μ, Σ) from the cepstral to the log-spectral domain (indicated by the l superscript) is

μ^l = C^{-1} μ    (6.13)
Σ^l = C^{-1} Σ ( C^{-1} )^T,    (6.14)
where C is the DCT. The mapping from the log-spectral to the linear domain (indicated by the f superscript) is

μ_i^f = exp{ μ_i^l + σ_{ii}^l / 2 }    (6.15)
σ_{ij}^f = μ_i^f μ_j^f ( exp{ σ_{ij}^l } − 1 ).    (6.16)
Combining the speech and noise parameters in this linear domain is now straightforward, as the speech and noise are independent:

μ_y^f = μ_x^f + μ_n^f    (6.17)
Σ_y^f = Σ_x^f + Σ_n^f.    (6.18)
The distributions in the linear domain will be log-normal and, unfortunately, the sum of two log-normal distributions is not log-normal. Standard PMC ignores this and assumes that the combined distribution is also log-normal. This is referred to as the log-normal approximation. In this case, the above mappings are simply inverted to map μ_y^f and Σ_y^f back into the cepstral domain. The log-normal approximation is rather poor, especially at low SNRs. A more accurate mapping can be achieved using Monte Carlo techniques to sample the speech and noise distributions, combine the samples using (6.2), and then estimate the new distribution [48]. However, this is computationally infeasible for large systems. A simpler, faster PMC approximation is to ignore the variances of the speech and noise distributions. This is the log-add approximation, here
μ_y = μ_x + f( μ_x, μ_n, μ_h )    (6.19)
Σ_y = Σ_x,    (6.20)

where f(·) is defined in (6.2). This is much faster than standard PMC since the mapping of the covariance matrix to and from the cepstral domain is not required.
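A sketch of the log-add compensation of a single Gaussian mean; as in the earlier sketches, C is assumed square and invertible for illustration.

```python
import numpy as np

def pmc_log_add(mu_x, mu_n, mu_h, C):
    """Log-add PMC, eqs (6.19)-(6.20): only the mean is compensated, via
    the mismatch function f of (6.2); the variance is left unchanged.
    C is assumed to be a square, invertible DCT matrix (a simplification)."""
    Cinv = np.linalg.inv(C)
    # f(mu_x, mu_n, mu_h) = mu_h + C log(1 + exp(C^{-1}(mu_n - mu_x - mu_h)))
    f = mu_h + C @ np.log1p(np.exp(Cinv @ (mu_n - mu_x - mu_h)))
    return mu_x + f
```

At high SNR the compensated mean stays close to the clean speech mean, while in noise-dominated conditions it moves onto the noise mean, mirroring the qualitative behaviour described earlier.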
An alternative approach to model compensation is to use the VTS approximation described earlier in (6.10) [2]. The expansion points for the vector Taylor series are the means of the speech, noise and convolutional noise. Thus, for the mean of the corrupted speech distribution,

μ_y = E{ μ_x + f( μ_x, μ_n, μ_h ) + ( x^s − μ_x ) ∂f/∂x^s + ( n^s − μ_n ) ∂f/∂n^s + ( h − μ_h ) ∂f/∂h }.    (6.21)
The partial derivatives above can be expressed in terms of partial derivatives of y^s with respect to x^s, n^s and h, evaluated at μ_x, μ_n, μ_h. These have the form

∂y^s/∂x^s = ∂y^s/∂h = A    (6.22)
∂y^s/∂n^s = I − A,    (6.23)
where A = C F C^{-1} and F is a diagonal matrix whose inverse has leading diagonal elements given by ( 1 + exp( C^{-1}( μ_n − μ_x − μ_h ) ) ). It follows that the means and variances of the noisy speech are given by

μ_y = μ_x + f( μ_x, μ_n, μ_h )    (6.24)
Σ_y = A Σ_x A^T + A Σ_h A^T + ( I − A ) Σ_n ( I − A )^T.    (6.25)
It can be shown empirically that this VTS approximation is in general more accurate than the PMC log-normal approximation. Note that the compensation of the mean is exactly the same as in the PMC log-add approximation.
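A sketch of the static-parameter compensation in (6.22)-(6.25), assuming diagonal covariances and Σ_h = 0; both are illustrative simplifications.

```python
import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n, mu_h, C):
    """First-order VTS compensation of one Gaussian, eqs (6.22)-(6.25),
    assuming diagonal covariances and Sigma_h = 0 (illustrative
    simplifications). Returns the compensated mean and diagonal variance."""
    Cinv = np.linalg.inv(C)
    # F^{-1} has leading diagonal 1 + exp(C^{-1}(mu_n - mu_x - mu_h))
    f_inv = 1.0 + np.exp(Cinv @ (mu_n - mu_x - mu_h))
    A = C @ np.diag(1.0 / f_inv) @ Cinv                # A = C F C^{-1}
    mu_y = mu_x + mu_h + C @ np.log(f_inv)             # log-add mean (6.24)
    I_A = np.eye(len(mu_x)) - A
    # Variance (6.25) with the Sigma_h term dropped
    sigma_y = A @ np.diag(var_x) @ A.T + I_A @ np.diag(var_n) @ I_A.T
    return mu_y, np.diag(sigma_y)
```

In the high-SNR limit A tends to the identity and the clean Gaussian passes through unchanged; in the noise-dominated limit A tends to zero and the compensated Gaussian collapses onto the noise model, matching the mode behaviour described at the start of this section.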
For model-based compensation schemes, an interesting issue is how to compensate the delta and delta–delta parameters [59, 66]. A common approximation is the continuous time approximation [66]. Here the dynamic parameters, which are a discrete time estimate of the gradient, are approximated by the instantaneous derivative with respect to time:

Δy_t^s = ( Σ_{i=1}^{n} w_i ( y_{t+i}^s − y_{t−i}^s ) ) / ( 2 Σ_{i=1}^{n} w_i^2 ) ≈ ∂y_t^s/∂t.    (6.26)
It is then possible to show that, for example, the mean of the delta parameters, μ_{Δy}, can be expressed as

μ_{Δy} = A μ_{Δx}.    (6.27)

The variances and delta–delta parameters can also be compensated in this fashion.
The compensation schemes described above have assumed that the noise model parameters, μ_n, Σ_n and μ_h, are known. If this is not the case, ML estimates of the noise parameters may be found using VTS [101, 123], in a similar fashion to the feature-enhancement case.
6.3 Uncertainty-Based Approaches
As noted above, whilst feature compensation schemes are computationally efficient, they are limited by the very simple speech models that can be used. On the other hand, model compensation schemes can take full advantage of the very detailed speech model within the recogniser, but they are computationally very expensive. The uncertainty-based techniques discussed in this section offer a compromise position.
The effects of additive noise can be represented by a DBN as shown in Figure 6.2. In this case, the state likelihood of a noisy speech feature vector can be written as

p( y_t | θ_t; λ, λ̆ ) = ∫ p( y_t | x_t; λ̆ ) p( x_t | θ_t; λ ) dx_t    (6.28)
Fig. 6.2 DBN of combined speech and noise model.
where

p( y_t | x_t; λ̆ ) = ∫ p( y_t | x_t, n_t ) p( n_t; λ̆ ) dn_t    (6.29)
and λ denotes the clean speech model and λ̆ the noise model.
The conditional probability in (6.29) can be regarded as a measure of the uncertainty in y_t as a noisy estimate of x_t.^3 At the extremes, this could be regarded as a switch: if the probability is one, then x_t is known with certainty; otherwise it is completely unknown, or missing. This leads to a set of techniques called missing feature techniques. Alternatively, if p( y_t | x_t; λ̆ ) can be expressed parametrically, it can be passed directly to the recogniser as a measure of uncertainty in the observed feature vector. This leads to an approach called uncertainty decoding.
6.3.1 Missing Feature Techniques
Missing feature techniques grew out of work on auditory scene analysis, motivated by the intuition that humans can recognise speech in noise by identifying strong speech events in the time–frequency plane and ignoring the rest of the signal [30, 31]. In the spectral domain, a sequence of feature vectors Y_{1:T} can be regarded as a spectrogram indexed by t = 1, ..., T in the time dimension and by k = 1, ..., K in the frequency dimension. Each feature element y_{tk} has associated with it a bit which determines whether or not that "pixel" is reliable. In aggregate, these bits form a noise mask which determines the regions of the spectrogram that are hidden by noise and therefore "missing." Speech recognition within this framework therefore involves two problems: finding the mask, and then computing the likelihoods given that mask [105, 143].
The problem of finding the mask is the most difficult aspect of missing feature techniques. A simple approach is to estimate the SNR and then apply a floor. This can be improved by combining it with perceptual criteria such as harmonic peak information, energy ratios, etc. [12]. More robust performance can then be achieved by training a Bayesian classifier on clean speech which has been artificially corrupted with noise [155].

^3 In [6] this uncertainty is made explicit and it is assumed that p( y_t | x_t; λ̆ ) ≈ p( x_t | y_t; λ̆ ). This is sometimes called feature uncertainty.
Once the noise mask has been determined, likelihoods are calculated either by feature-vector imputation or by marginalising out the unreliable features. In the former case, the missing feature values can be determined by a MAP estimate, assuming that the vector was drawn from a Gaussian mixture model of the clean speech (see [143] for details and other methods).
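For a diagonal-covariance GMM, marginalising the unreliable dimensions simply drops them from the per-component likelihood; a sketch, with illustrative parameters:

```python
import numpy as np

def marginal_log_likelihood(y, mask, weights, means, variances):
    """Missing-feature marginalisation sketch: score spectral vector y
    against a diagonal GMM using only the dimensions flagged reliable in
    `mask` (True = reliable). For a diagonal Gaussian, integrating out the
    unreliable dimensions amounts to dropping them. All parameters here
    are illustrative."""
    y = np.asarray(y, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    comp = []
    for w, m, v in zip(weights, means, variances):
        m = np.asarray(m, dtype=float)[mask]
        v = np.asarray(v, dtype=float)[mask]
        ll = -0.5 * np.sum(np.log(2.0 * np.pi * v) + (y[mask] - m) ** 2 / v)
        comp.append(np.log(w) + ll)
    return np.logaddexp.reduce(comp)
```

Imputation would instead fill the masked dimensions from the model before scoring, which is what allows the subsequent cepstral transformation discussed below.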
Empirical results show that marginalisation gives better performance than imputation. However, when imputation is used, the reconstructed feature vectors can be transformed into the cepstral domain, and this outweighs the performance advantage of marginalisation. An alternative approach is to map the spectrographic mask directly into cepstral vectors with an associated variance [164]. This then becomes a form of the uncertainty decoding discussed next.
6.3.2 Uncertainty Decoding
In uncertainty decoding, the goal is to express the conditional probability in (6.29) in a parametric form which can be passed to the decoder and then used to compute (6.28) efficiently. An important decision is the form that the conditional probability takes. The original forms of uncertainty decoding made the form of the conditional distribution dependent on the region of acoustic space. An alternative approach is to make the form of the conditional distribution dependent on the Gaussian component, in a similar fashion to the regression classes in linear transform-based adaptation described in Adaptation and Normalisation.
A natural choice for approximating the conditional is to use an N-component GMM to model the acoustic space. Depending on the form of uncertainty being used, this GMM can be used in a number of ways to model the conditional distribution, p( y_t | x_t; λ̆ ) [38, 102]. For example, in [102],

p( y_t | x_t; λ̆ ) ≈ Σ_{n=1}^{N} P( n | x_t; λ̆ ) p( y_t | x_t, n; λ̆ ).    (6.30)
If the component posterior is approximated by

P( n | x_t; λ̆ ) ≈ P( n | y_t; λ̆ )    (6.31)

it is possible to construct a GMM based on the corrupted speech observation, y_t, rather than the unseen clean speech observation, x_t. If the
conditional distribution is further approximated by choosing just the
most likely component n
∗
, then
p(y
t
x
t
;
˘
λ) ≈ p(y
t
x
t
, ˘ c
n
∗;
˘
λ). (6.32)
The conditional likelihood in (6.32) has been reduced to a single Gaussian, and since the acoustic model mixture components are also Gaussian, the state/component likelihood in (6.28) becomes the convolution of two Gaussians, which is also Gaussian. It can be shown that in this case, the form of (6.28) reduces to
\[
p(y_t \mid \theta_t; \lambda, \breve{\lambda}) \;\propto\; \sum_{m \in \theta_t} c_m\, \mathcal{N}\!\left(A^{(n^*)} y_t + b^{(n^*)};\; \mu^{(m)},\; \Sigma^{(m)} + \Sigma_b^{(n^*)}\right). \tag{6.33}
\]
Hence, the input feature vector is linearly transformed in the front-end, and at run-time the only additional cost is that a global bias must be passed to the recogniser and added to the model variances. The same form of compensation scheme is obtained using SPLICE with uncertainty [38], although the way that the compensation parameters, $A^{(n)}$, $b^{(n)}$ and $\Sigma_b^{(n)}$, are computed differs. An interesting advantage of these schemes is that the cost of calculating the compensation parameters is a function of the number of components used to model the feature-space. In contrast, for model-compensation techniques such as PMC and VTS, the compensation cost depends on the number of components in the recogniser. This decoupling yields a useful level of flexibility when designing practical systems.
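The compensated likelihood of (6.33) is simple to implement once the front-end parameters are available. The following is a minimal sketch, assuming diagonal covariances and a diagonal transform (the parameter values in the usage below are illustrative, not taken from any published system): the observation is transformed once, and each acoustic model component is then evaluated with the bias variance added to its own.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
               for xi, mu, v in zip(x, mean, var))

def compensated_likelihood(y, A_diag, b, var_bias, components):
    """State/component likelihood of (6.33), up to the proportionality
    constant: transform y in the front-end, then evaluate each model
    component (weight, mean, var) with the bias variance added."""
    x_hat = [a * yi + bi for a, yi, bi in zip(A_diag, y, b)]  # A y + b
    return sum(w * math.exp(log_gauss_diag(
                   x_hat, mean, [v + vb for v, vb in zip(var, var_bias)]))
               for w, mean, var in components)
```

With a zero bias variance the expression reduces to an unadapted state likelihood; a very large bias variance flattens all component densities, which is the behaviour that leads to the loss of discrimination discussed next.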
Although uncertainty decoding based on a front-end GMM can yield good performance, in low SNR conditions it can cause all compensated models to become identical, thus removing all discriminatory power from the recogniser [103]. To overcome this problem, the conditional distribution can alternatively be linked with the recogniser model components. This is similar to using regression classes in adaptation
schemes such as MLLR. In this case
\[
p(y_t \mid \theta_t; \lambda, \breve{\lambda}) \;=\; \sum_{m \in \theta_t} |A^{(r_m)}|\, c_m\, \mathcal{N}\!\left(A^{(r_m)} y_t + b^{(r_m)};\; \mu^{(m)},\; \Sigma^{(m)} + \Sigma_b^{(r_m)}\right), \tag{6.34}
\]
where $r_m$ is the regression class that component $m$ belongs to. Note that for these schemes, the Jacobian of the transform, $|A^{(r_m)}|$, can vary from component to component, in contrast to the fixed value used in (6.33). The calculation of the compensation parameters follows a similar fashion to the front-end schemes. An interesting aspect of this approach is that as the number of regression classes tends to the number of components in the recogniser, so the performance tends to that of the model-based compensation schemes. Uncertainty decoding of this form can also be incorporated into an adaptive training framework [104].
7 Multi-Pass Recognition Architectures
The previous sections have reviewed some of the basic techniques available for both training and adapting an HMM-based recognition system. In general, any particular combination of model set and adaptation technique will have different characteristics and make different errors. Furthermore, if the outputs of these systems are converted to confusion networks, then it is straightforward to combine the confusion networks and select the word sequence with the overall maximum posterior likelihood. Thus, modern transcription systems typically utilise multiple model sets and make multiple passes over the data. This section starts by describing a typical multi-pass combination framework. This is followed by discussion of how multiple systems may be combined together, as well as how complementary systems may be generated. Finally, a description is given of how these approaches, along with the schemes described earlier, are used for a BN transcription task in three languages: English, Mandarin and Arabic.
7.1 Example Architecture
A typical architecture is shown in Figure 7.1. A first pass is made over the data using a relatively simple SI model set. The 1-best output
Fig. 7.1 Multi-pass/system combination architectures.
from this pass is used to perform a first round of adaptation. The adapted models are then used to generate lattices using a basic bigram or trigram word-based language model. Once the lattices have been generated, a range of more complex models and adaptation techniques can be applied in parallel to provide K candidate output confusion networks, from which the best word sequence is extracted. These third-pass models may include ML and MPE trained systems, SI and SAT trained models, triphone and quinphone models, lattice-based MLLR, CMLLR, 4-gram language models interpolated with class n-grams, and many other variants. For examples of recent large-scale transcription systems see [55, 160, 163].
When using these combination frameworks, the form of combination scheme must be determined and the type of individual model set designed. Both these topics will be discussed in the following sections.
7.2 System Combination
Each of the systems, or branches, in Figure 7.1 produces a number of outputs that can be used for combination. Example outputs are: the hypothesis from each system, $w^{(k)}$ for the $k$th system; the score from the systems, $p(Y_{1:T}; \lambda^{(k)})$; as well as the possibility of using the hypotheses and lattices for adaptation and/or rescoring. Log-linear combination of the scores may also be used [15], but this is not common practice in current multi-pass systems.
7.2.1 Consensus Decoding
One approach is to combine either the 1-best hypotheses or the confusion network outputs from the systems. This may be viewed as an extension of the MBR decoding strategy mentioned earlier. Here
\[
\hat{w} = \arg\min_{\tilde{w}} \left\{ \sum_{w} P(w \mid Y_{1:T}; \lambda^{(1)}, \ldots, \lambda^{(K)})\, \mathcal{L}(w, \tilde{w}) \right\}, \tag{7.1}
\]
where $\mathcal{L}(w, \tilde{w})$ is the loss, usually measured in words, between the two word sequences. In contrast to the standard decoding strategy, the word posterior is a function of multiple systems, $\lambda^{(1)}, \ldots, \lambda^{(K)}$, not a single one (the parameters of this ensemble will be referred to as $\lambda$ in this section).
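The decision rule in (7.1) can be sketched as follows: given word-sequence posteriors already combined across systems, the candidate minimising the expected word-level edit distance is selected. The hypothesis set and posterior values used below are purely illustrative; real systems restrict the arg min to a lattice or N-best list.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        cur = [i]
        for j, hw in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (rw != hw)))
        prev = cur
    return prev[-1]

def mbr_decode(posteriors):
    """Pick the hypothesis minimising the expected loss, as in (7.1).
    `posteriors` maps word-sequence tuples to the combined posterior
    P(w | Y; lambda(1..K)); the search is restricted to that same set."""
    return min(posteriors, key=lambda cand: sum(
        p * edit_distance(w, cand) for w, p in posteriors.items()))
```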
There are two forms of system combination commonly used: recogniser output voting error reduction (ROVER) [43]; and confusion network combination (CNC) [42]. The difference between the two schemes is that ROVER uses the 1-best system output, whereas CNC uses confusion networks. Both schemes use a two-stage process:
(1) Align hypotheses/confusion networks: The output from each of the systems is aligned against one another, minimising for example the Levenshtein edit distance (and for CNC the expected edit distance). Figure 7.2 shows the process for combining four system outputs to form a single word transition network (WTN), similar in structure to a confusion network.
For ROVER, confidence scores may also be incorporated into the combination scheme (this is essential if only two systems are to be combined).

Fig. 7.2 Example of output combination.
(2) Vote/maximum posterior: Having obtained the WTN (or composite confusion network), decisions need to be made for the final output. As a linear network has been obtained, it is possible to select words for each node pairing independently. For ROVER, the selected word is based on the maximum posterior, where for the $i$th word
\[
P(w_i \mid Y_{1:T}; \lambda) \;\approx\; \frac{1}{K} \sum_{k=1}^{K} \left[ (1 - \alpha) P(w_i \mid Y_{1:T}; \lambda^{(k)}) + \alpha D_i^{(k)}(w_i) \right], \tag{7.2}
\]
where $D_i^{(k)}(w_i) = 1$ when the $i$th word in the 1-best output of the $k$th system is $w_i$, and $\alpha$ controls the balance between the 1-best decision and the confidence score.
CNC has been shown to yield small gains over ROVER
combination [42].
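The vote of (7.2) for a single node pairing of the WTN can be sketched as below, assuming that a system's confidence for a word it did not itself hypothesise is taken as zero (a common simplification; the alignment stage is not shown):

```python
def rover_select(slot, alpha):
    """One WTN node pairing: `slot` holds (word, confidence) for each
    of the K systems' 1-best outputs. Returns the word maximising the
    approximate posterior of (7.2)."""
    K = len(slot)
    def score(w):
        # (1 - alpha) * confidence term + alpha * 1-best indicator
        return sum((1.0 - alpha) * (conf if hyp == w else 0.0)
                   + alpha * (hyp == w) for hyp, conf in slot) / K
    return max({w for w, _ in slot}, key=score)
```

With alpha = 1 this reduces to simple majority voting over the 1-best outputs; with alpha = 0 the most confident word wins, which is why confidence scores are essential when only two systems are combined.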
7.2.2 Implicit Combination
The schemes described above explicitly combine the scores and hypotheses from the various systems. Alternatively, it is possible to implicitly combine the information from one system with another. There are two approaches that are commonly used: lattice or N-best rescoring, and cross-adaptation.
Lattice [7, 144] and N-best [153] rescoring, as discussed in Architecture of an HMM-Based Recogniser, are standard approaches to speeding up decoding by reducing the search-space. Though normally described as approaches to improve efficiency, they have a side-effect of implicit system combination. By restricting the set of possible errors that can be made, the performance of the rescoring system may be improved, as it cannot make errors that are not in the N-best list or lattices. This is particularly true if the lattices or N-best lists are heavily pruned.
Another common implicit combination scheme is cross-adaptation. Here one speech recognition system is used to generate the transcription hypotheses for a second speech recognition system to use for adaptation. The level of information propagated between the two systems is determined by the complexity of the adaptation transforms being used, for example, in MLLR, the number of linear transforms.
7.3 Complementary System Generation
For the combination schemes above, it is assumed that systems that make different errors are available to combine. Many sites adopt an ad hoc approach to determining the systems to be combined. Initially a number of candidates are constructed, for example: triphone and quinphone models; SAT and GD models; MFCC, PLP, MLP-based posterior [199] and Gaussianised front-ends; multiple and single pronunciation dictionaries; and random decision trees [162]. On a held-out test set, the performance of the combinations is evaluated and the "best" (possibly taking into account decoding costs) is selected and used. An alternative approach is to train systems that are designed to be complementary to one another.
Boosting is a standard machine learning approach that allows complementary systems to be generated. Initially this was applied to binary classification tasks, for example AdaBoost [44], and later to multi-class tasks, e.g., [45]. The general approach in boosting is to define a distribution, or weight, over the training data, so that correctly classified data is de-weighted and incorrectly classified data is more heavily weighted. A classifier is then trained using this weighted data and a new distribution specified, combining the new classifier with all the other classifiers.
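The weight-update step at the heart of this procedure can be sketched as a generic AdaBoost-style update on per-sample weights (the precise updates of [44, 45] differ in detail):

```python
import math

def boost_reweight(weights, correct):
    """One boosting round: compute the weighted error of the current
    classifier, derive its vote weight beta, then down-weight correctly
    classified samples and up-weight the rest, renormalising to sum 1."""
    err = sum(w for w, c in zip(weights, correct) if not c)
    err = min(max(err, 1e-10), 1.0 - 1e-10)     # guard against 0/1 error
    beta = 0.5 * math.log((1.0 - err) / err)    # classifier vote weight
    new = [w * math.exp(-beta if c else beta)
           for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], beta
```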
There are a number of issues which make the application of boosting, or boosting-like, algorithms to speech recognition difficult. In ASR, the forms of classifier used are highly complex and the data is dynamic. Also, a simple weighted voting scheme is not directly applicable given the potentially vast number of classes in speech recognition. Despite these issues, there have been a number of applications of boosting to speech recognition [37, 121, 154, 201]. An alternative approach, again based on weighting the data, is to modify the form of the decision tree so that leaves are concentrated in regions where the classifier performs poorly [19].
Although initial attempts to explicitly generate complementary systems are promising, the majority of multi-pass systems currently deployed are built using the ad hoc approach described earlier.
7.4 Example Application — Broadcast News Transcription
This section describes a number of systems built at Cambridge University for transcribing BN in three languages: English, Mandarin and Arabic. These illustrate many of the design issues that must be considered when building large-vocabulary multi-pass combination systems. Though the three languages use the same acoustic model structure, there are language-specific aspects that are exploited to improve performance, or to yield models suitable for combination. In addition to BN transcription, the related task of broadcast conversation (BC) transcription is also discussed.
For all the systems described, the front-end processing from the CU-HTK 2003 BN system [87] was used. Each frame of speech is represented by 13 PLP coefficients based on a linear prediction order of 12, with first, second and third-order derivatives appended and then projected down to 39 dimensions using HLDA [92], optimised using the efficient iterative approach described in [51]. Cepstral mean normalisation is applied at the segment level, where each segment is a contiguous block of data that has been assigned to the same speaker/acoustic condition.
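Segment-level cepstral mean normalisation amounts to subtracting the per-dimension mean computed over the whole segment. A minimal sketch, operating on plain lists of frames rather than a real feature pipeline:

```python
def cepstral_mean_normalise(segment):
    """Subtract the per-dimension mean over one homogeneous segment,
    where each frame is a list of cepstral coefficients."""
    n, dim = len(segment), len(segment[0])
    means = [sum(frame[d] for frame in segment) / n for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in segment]
```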
All models were built using the HTK toolkit [189]. All phones have three emitting states with a left-to-right no-skip topology. Context dependence was provided by using state-clustered cross-word triphone
models [190]. In addition to the phone models, two silence models were used: sil represents longer silences that break phonetic context; and sp represents short silences across which phonetic context is maintained. Gaussian mixture models with an average of 36 Gaussian components per state were used for all systems. The distribution of components over the states was determined based on the state occupancy count [189]. Models were built initially using ML training and then discriminatively trained using MPE training [139, 182]. Gender-dependent versions of these models were then built using discriminative MPE–MAP adaptation [137]. As some BN data, for example telephone interviews, are transmitted over bandwidth-limited channels, wideband and narrowband spectral analysis variants of each model set were also trained.
The following sections describe how the above general procedure is applied to BN transcription in three different languages (English, Mandarin and Arabic) and to BC transcription in Arabic and Mandarin. For each of these tasks the training data, phone set and any front-end specific processing are described. In addition, examples are given of how a complementary system may be generated.
The language models used for the transcription systems are built from a range of data sources. The first is the transcriptions of the audio data. These will be closely matched to the word sequences of the test data but are typically limited in size, normally less than 10 million words. In addition, text corpora from newswire and newspapers are used to increase the quantity of data to around a billion words. Given the nature of the training data, the resulting LMs are expected to be more closely matched to BN-style rather than the more conversational BC-style data.
7.4.1 English BN Transcription
The training data for the English BN transcription system comprised two blocks of data. The first block of about 144 hours had detailed manual transcriptions. The second block of about 1200 hours of data had either closed-caption or similar rough transcriptions provided. This latter data was therefore used in lightly supervised fashion. The dictionary for the English system was based on 45 phones, and a 59K wordlist
was used, which gave out-of-vocabulary (OOV) rates of less than 1%. Two dictionaries were constructed. The first was a multiple pronunciation dictionary (MPron), having about 1.1 pronunciations per word on average. The second was a single pronunciation dictionary (SPron) built using the approach described in [70]. This is an example of using a modified dictionary to build complementary systems. HMMs were built using each of these dictionaries with about 9000 distinct states. A four-gram language model was used for all the results quoted here. This was trained using 1.4 billion words of data, which included the acoustic training data references and closed-captions.
In order to assess the performance of the English BN transcription system, a range of development and evaluation data sets were used. Results are given here on three separate test sets. dev04 and dev04f were each 3 hours in size. dev04f was expected to be harder to recognise, as it contains more challenging data with high levels of background noise/music and non-native speakers. eval04 was a 6 hour test set evenly split between data similar to dev04 and dev04f.
Table 7.1 shows the performance of the MPron and the SPron systems using the architecture shown in Figure 7.1. Here a segmentation and clustering supplied by the Computer Sciences Laboratory for Mechanics and Engineering Sciences (LIMSI), a French CNRS laboratory, was used in the decoding experiments. There are a number of points illustrated by these results. First, the P3c branch is actually an example of cross-adaptation, since the MPron hypotheses from the P2 stage are used in the initial transform estimation for the SPron system. By themselves the SPron and MPron systems perform about
Table 7.1 %WER using MPron and SPron acoustic models and the LIMSI segmentation/clustering for English BN transcription.

                              %WER
System              dev04   dev04f   eval04
P2   MPron           11.1    15.9     13.6
P3b  MPron           10.6    15.4     13.4
P3c  SPron           10.4    15.2     13.0
P2+P3b      CNC      10.8    15.3     13.3
P2+P3c      CNC      10.3    14.8     12.8
P2+P3b+P3c  CNC      10.5    15.1     12.9
the same [56], but the P3c branch is consistently better than the P3b branch. This is due to cross-adaptation. Combining the P3c branch with the P2 branch using CNC (P2+P3c) shows further gains of 0.2%–0.4%. This shows the advantage of combining systems with different acoustic models. Finally, combining the P2, P3b and P3c branches together gave no gains; indeed, slight degradations compared to the P2+P3c combination can be seen. This illustrates one of the issues with combining systems together. The confidence scores used are not accurate representations of whether the word is correct or not, thus combining systems may degrade performance. This is especially true when the systems have very different performance, or two of the systems being combined are similar to one another.
It is interesting to note that the average performance over the various test sets varies from 10.5% (dev04) to 15.1% (dev04f). This reflects the more challenging nature of the dev04f data. Thus, even though only test data classified as BN is being used, the error rates vary by about 50% between the test sets.
In addition to using multiple forms of acoustic model, it is also possible to use multiple forms of segmentation [56]. The first stage in processing the audio is to segment the audio data into homogeneous blocks.¹ These blocks are then clustered together so that all the data from the same speaker/environment are in the same cluster. This process, sometimes referred to as diarisation, may be scored as a task in itself [170]. However, the interest here is how this segmentation and clustering affects the system combination performance. CNC could be used to combine the output from the two segmentations, but here ROVER was used. To perform segmentation combination, separate branches from the initial P1–P2 stage are required. This architecture is shown in Figure 7.3.
The results using this dual architecture are shown in Table 7.2. The two segmentations used were: the LIMSI segmentation and clustering used to generate the results in Table 7.1; and the segmentation generated at Cambridge University Engineering Department (CUED)

¹ Ideally, contiguous blocks of data that have the same speaker and the same acoustic/channel conditions.
Fig. 7.3 Multi-pass/system cross-segmentation combination architecture.
Table 7.2 %WER of the dual segmentation system for English BN transcription.

Segmentation/                   %WER
Clustering         dev04   dev04f   eval04
LIMSI               10.3    14.9     12.8
CUED                10.4    15.2     12.9
LIMSI ⊕ CUED         9.8    14.7     12.4
using different algorithms [161]. Thus, the systems differ in both the segmentation and the clustering fed into the recognition system [56].
For both segmentations and clusterings, the equivalent of the P2+P3c branch in Table 7.1 was used. Although the segmentations
have similar average lengths, 13.6 and 14.2 seconds for LIMSI and CUED respectively, they have different numbers of clusters and diarisation scores. The slight difference in performance for the LIMSI segmentations was because slightly tighter beams for the P2 and P3c branches were used to keep the overall run-time approximately the same. Combining the two segmentations together gave gains of between 0.2% and 0.5%.
7.4.2 Mandarin BN and BC Transcription
About 1390 hours of acoustic training data was used to build the Mandarin BN transcription system. In contrast to the English BN system, the Mandarin system was required to recognise both BN and BC data. The training data was approximately evenly split between these two types. For this task, manual transcriptions were available for all the training data. As Mandarin is a tonal language, fundamental frequency (F0) was added to the feature vector after the application of the HLDA transform. Delta and delta–delta F0 features were also added, to give a 42-dimensional feature vector. The phone set used for these systems was derived from the 60 toneless phones used in the 44K dictionary supplied by the Linguistic Data Consortium (LDC). This set is the same as the one used in the CU-HTK 2004 Mandarin CTS system [160]. It consists of 46 toneless phones, obtained from the 60-phone LDC set by applying mappings of the form "[aeiu]n→[aeiu] n", "[aeiou]ng→[aeiou] ng" and "u:e→ue". In addition to the F0 features, the influence of tone on the acoustic models can be incorporated into the acoustic model units. Most dictionaries for Mandarin give tone markers for the vowels. This tone information may be used either by treating each tonal vowel as a separate acoustic unit, or by including tone questions about both the centre and surround phones in the decision trees used for phonetic context clustering. The latter approach is adopted in these experiments.
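The phone-set mappings above can be expressed as simple rewrite rules. A sketch covering only the three mapping patterns quoted (the full 60-phone LDC inventory is not reproduced here):

```python
import re

def map_ldc_phone(phone):
    """Map one toneless LDC phone to the reduced set: split final
    nasals off the vowel ([aeiu]n -> [aeiu] n, [aeiou]ng -> [aeiou] ng)
    and merge u:e into ue. Unaffected phones pass through unchanged."""
    phone = phone.replace("u:e", "ue")
    m = re.fullmatch(r"([aeiou])(ng)", phone)
    if m is None:
        m = re.fullmatch(r"([aeiu])(n)", phone)
    return [m.group(1), m.group(2)] if m else [phone]
```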
One issue in language modelling for the Chinese language is that there are no natural word boundaries in normal texts. A string of characters may be partitioned into "words" in a range of ways, and there are multiple valid partitions that may be used. As in the CU-HTK Mandarin CTS system [160], the LDC character-to-word segmenter
was used. This implements a simple deepest-first search method to determine the word boundaries. Any Chinese characters that are not present in the word list are treated as individual words. The Mandarin wordlist required for this part of the process was based on the 44K LDC Mandarin dictionary. As some English words were present in the acoustic training data transcriptions, all English words and single-character Mandarin words not in the LDC list were added to the wordlist, to yield a total of about 58K words. Note that the Mandarin OOV rate is therefore effectively zero, since pronunciations were available for all single characters. A four-gram language model was constructed at the word level using 1.3 billion "words" of data.
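The deepest-first segmentation can be read as greedy longest-match: at each position take the longest wordlist entry that matches, and fall back to a single character otherwise. A sketch, using Latin letters to stand in for Chinese characters:

```python
def segment_text(text, wordlist):
    """Greedy longest-match ('deepest first') word segmentation;
    characters not covered by the wordlist become single-character words."""
    max_len = max((len(w) for w in wordlist), default=1)
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in wordlist:
                words.append(cand)
                i += length
                break
    return words
```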
The performance of the Mandarin system was evaluated on three test sets. bnmdev06 consists of 3 hours of BN-style data, whereas bcmdev05 comprises about 3 hours of BC-style data. eval06 is a mixture of the two styles. As word boundaries are not usually given in Mandarin, performance is assessed in terms of the character error rate (CER). For this task the average number of characters per word is about 1.6.
A slightly modified version of the architecture shown in Figure 7.1 was used for these experiments. The P2 output was not combined with the P3 branches to produce the final output; instead, two P3 branches in addition to the HLDA system (P3b) were generated. A SAT system using CMLLR transforms to represent each speaker (HLDA-SAT) was used for branch P3a. Gaussianisation was applied after the HLDA transform, using the GMM approach with 4 components for each dimension, to give a GAUSS system for branch P3c. The same adaptation stages were used for all P3 branches, irrespective of the form of model.
Table 7.3 shows the performance of the various stages. For all systems, the performance on the BC data is, as expected, worse than the performance on the BN data. The BC data is more conversational in style and thus inherently harder to recognise, as well as not being as well matched to the language model training data. The best performing single branch was the GAUSS branch. For this task the performance of the HLDA-SAT system was disappointing, giving only small gains if
Table 7.3 %CER using a multi-pass combination scheme for Mandarin BN and BC transcription.

                              %CER
System             bnmdev06   bcmdev05   eval06
P2   HLDA             8.3       18.8      18.7
P3a  HLDA-SAT         8.0       18.1      17.9
P3b  HLDA             7.9       18.2      18.1
P3c  GAUSS            7.9       17.7      17.4
P3a+P3c  CNC          7.7       17.6      17.3
any. On a range of tasks, SAT training, though theoretically interesting, has not been found to yield large gains in current configurations. In common with the English system, however, small consistent gains were obtained by combining multiple branches together.
7.4.3 Arabic BN and BC Transcription
Two forms of Arabic system are commonly constructed: graphemic and phonetic systems. In a graphemic system, a dictionary is generated using one-to-one letter-to-sound rules for each word. Note that normally the Arabic text is romanised and the word and letter order swapped to be left-to-right. Thus, the dictionary entry for the word "book" in Arabic is
ktAb /k/ /t/ /A/ /b/.
This scheme yields 28 consonants, four alif variants (madda and hamza above and below), ya and wa variants (hamza above), ta-marbuta and hamza. Thus, the total number of graphemes is 36 (excluding silence). This simple graphemic representation allows a pronunciation to be automatically generated for any word in the dictionary. In Arabic, the short vowels (fatha /a/, kasra /i/ and damma /u/) and diacritics (shadda, sukun) are commonly not marked in texts. Additionally, nunation can result in a word-final nun (/n/) being added to nouns and adjectives in order to indicate that they are unmarked for definiteness. In graphemic systems, the acoustic models are required to model the implied pronunciation variations implicitly. An alternative approach is to use a phonetic system, where the pronunciations for each word
explicitly include the short vowels and nun. In this case, for example, the lexicon entry for the word "book" becomes
ktAb /k/ /i/ /t/ /A/ /b/.
A common approach to obtaining these phonetic pronunciations is to use the Buckwalter Morphological Analyser (version 2.0)² [3]. Buckwalter yields pronunciations for about 75% of the words in the 350K dictionary used in the experiments. This means that pronunciations must be added by hand, or rules used to derive them, or the word removed from the wordlist. In this work, only a small number of words were added by hand. In order to reduce the effect of inconsistent diacritic markings (and to make the phonetic system differ from the graphemic system), the variants of alif, ya and wa were mapped to their simple forms. This gives a total of 32 "phones" excluding silence.
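The one-to-one letter-to-sound rule of the graphemic system is trivially automatable, which is its main attraction: any romanised word receives a pronunciation with no lexicon work. A sketch, treating each romanised letter as one grapheme (real romanisation schemes also normalise some letter variants, as described above):

```python
def graphemic_pronunciation(romanised):
    """One acoustic unit per romanised letter, as in ktAb -> /k/ /t/ /A/ /b/."""
    return ["/{}/".format(ch) for ch in romanised]
```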
With this phonetic system there are an average of about 4.3 pronunciations per word, compared to English with about 1.1 pronunciations. This means that the use of pronunciation probabilities is important. Two wordlists were used: a 350K wordlist for the graphemic system, derived using word frequency counts only; and a 260K subset of the graphemic wordlist, comprising words for which phonetic pronunciations could be found. Thus two language models were built, the first, LM1, using the 350K graphemic wordlist, and the second, LM2, using the 260K phonetic wordlist. About 1 billion words of language model training data were used, and the acoustic models were trained on about 1000 hours of acoustic data, using the same features as for the English system.
An additional complication is that Arabic is a highly inflected agglutinative language. This results in a potentially very large vocabulary, as words are often formed by attaching affixes to triconsonantal roots, which may result in a large OOV rate compared to an equivalent English system, for example. Using a 350K vocabulary, the OOV rate on bcat06 was 2.0%, higher than the OOV rate in English with a 59K vocabulary.
² Available at http://www.qamus.org/index.html.
Table 7.4 %WER using a multi-pass combination scheme for Arabic BN and BC transcription.

                             WER (%)
System                 bnat06   bcat06   eval06
P2a-LM1  Graphemic      21.1     27.6     23.9
P2b-LM2  Graphemic      21.3     27.8     23.8
P3a-LM1  Graphemic      20.4     27.2     23.3
P3b-LM2  Phonetic       20.0     27.3     22.4
P3a+P3b  CNC            18.7     25.3     21.2
Three test sets defined by BBN Technologies (referred to as BBN) were used for evaluating the systems. The first, bnat06, consists of about 3 hours of BN-style data. The second, bcat06, consists of about 3 hours of BC-style data. eval06 is approximately 3 hours of a mixture of the two styles.
Table 7.4 shows the performance of the Arabic system using the multi-pass architecture in Figure 7.3. However, rather than using two different segmentations, in this case graphemic and phonetic branches were used. Given the large number of pronunciations and the wordlist size, the lattices needed for the P3 rescoring were generated using the graphemic acoustic models for both branches. Thus, the phonetic system has an element of cross-adaptation between the graphemic and the phonetic systems. Even with the cross-adaptation gains, the graphemic system is still slightly better than the phonetic system on the BC data. However, the most obvious effect is that the performance gain obtained from combining the phonetic and graphemic systems using CNC is large, around a 6% relative reduction over the best individual branch.
7.4.4 Summary
The above experiments have illustrated a number of points about large vocabulary continuous speech recognition.
In terms of the underlying speech recognition technology, the same broad set of techniques can be used for multiple languages, in this case English, Arabic and Mandarin. However, to enable the technology to work well for different languages, modifications beyond specifying the phone units and dictionary are required. For example, in Mandarin,
character-to-word segmentation must be considered, as well as the inclusion of tonal features.
The systems described above have all been based on HLDA feature projection/decorrelation. This, along with schemes such as global STC, is a standard approach used by many of the research groups developing large vocabulary transcription systems. Furthermore, all the acoustic models described have been trained using discriminative (MPE) training. These discriminative approaches are also becoming standard, since typical gains of around 10% relative compared to ML can be obtained [56].
Though no explicit contrasts have been given, all the systems described make use of a combination of feature normalisation and linear adaptation schemes. For large vocabulary speech recognition tasks, where the test data may comprise a range of speaking styles and accents, these techniques are essential.
As the results have shown, significant further gains can be obtained by combining multiple systems together. For example, in Arabic, by combining graphemic and phonetic systems, gains of 6% relative over the best individual branch were obtained. All the systems described here were built at CUED. Interestingly, even greater gains can be obtained by using cross-site system combination [56].
Though BN and BC transcription is a good task with which to illustrate many of the schemes discussed in this review, rather different tasks would be needed to properly illustrate the various noise robustness techniques described, and space prevented inclusion of these. Some model-based schemes have been applied to transcription tasks [104]; however, the gains have been small, since the levels of background noise encountered in this type of transcription are typically low.
The experiments described have also avoided issues of computational efficiency. For practical systems this is an important consideration, but it is beyond the scope of this review.
Conclusions
This review has presented the core architecture of an HMM-based CSR
system and outlined the major areas of refinement incorporated into
modern-day systems, including those designed to meet the demanding
task of large vocabulary transcription.
HMM-based speech recognition systems depend on a number of
assumptions: that speech signals can be adequately represented by a
sequence of spectrally derived feature vectors; that this sequence can
be modelled using an HMM; that the feature distributions can be
modelled exactly; that there is sufficient training data; and that the training
and test conditions are matched. In practice, all of these assumptions
must be relaxed to some extent and it is the extent to which the
resulting approximations can be minimised which determines the eventual
performance.
Starting from a base of simple single-Gaussian diagonal covariance
continuous density HMMs, a range of refinements have been described
including distribution and covariance modelling, discriminative
parameter estimation, and algorithms for adaptation and noise compensation.
All of these aim to reduce the inherent approximation in the basic HMM
framework and hence improve performance.
In many cases, a number of alternative approaches have been
described amongst which there was often no clear winner. Usually,
the various options involve a trade-off between factors such as the
availability of training or adaptation data, run-time computation,
system complexity and target application. The consequence of this is that
building a successful large vocabulary transcription system is not just
about finding elegant solutions to statistical modelling and
classification problems, it is also a challenging system integration problem.
As illustrated by the final presentation of multi-pass architectures and
actual example configurations, the most challenging problems require
complex solutions.
HMM-based speech recognition is a maturing technology and, as
evidenced by the rapidly increasing commercial deployment,
performance has already reached a level which can support viable
applications. Progress is continuous and error rates continue to fall. For
example, error rates on the transcription of conversational telephone
speech were around 50% in 1995. Today, with the benefit of more data
and the refinements described in this review, error rates are now well
below 20%.
Nevertheless, there is still much to do. As shown by the example
performance figures given for transcription of broadcast news and
conversations, results are impressive but still fall short of human
capability. Furthermore, even the best systems are vulnerable to spontaneous
speaking styles, non-native or highly accented speech and high ambient
noise. Fortunately, as indicated in this review, there are many ideas
which have yet to be fully explored and many more still to be conceived.
Finally, it should be acknowledged that despite their dominance and
the continued rate of improvement, many argue that the HMM
architecture is fundamentally flawed and that performance must asymptote
at some point. This is undeniably true; however, no good alternative
to the HMM has been found yet. In the meantime, the authors of this
review believe that the performance asymptote for HMM-based speech
recognition is still some way away.
Acknowledgments
The authors would like to thank all research students and staff, past
and present, who have contributed to the development of the HTK
toolkit and HTK-derived speech recognition systems and algorithms.
In particular, the contribution of Phil Woodland to all aspects of this
research and development is gratefully acknowledged. Without all of
these people, this review would not have been possible.
Notations and Acronyms
The following notation has been used throughout this review.
y_t            speech observation at time t
Y_{1:T}        observation sequence y_1, ..., y_T
x_t            "clean" speech observation at time t
n_t            "noise" observation at time t
y^s_t          the "static" elements of feature vector y_t
∆y^s_t         the "delta" elements of feature vector y_t
∆^2 y^s_t      the "delta-delta" elements of feature vector y_t
Y^{(r)}        rth observation sequence y^{(r)}_1, ..., y^{(r)}_{T^{(r)}}
c_m            the prior of Gaussian component m
µ^{(m)}        the mean of Gaussian component m
µ^{(m)}_i      element i of mean vector µ^{(m)}
Σ^{(m)}        the covariance matrix of Gaussian component m
σ^{(m)2}_i     ith leading diagonal element of covariance matrix Σ^{(m)}
N(µ, Σ)        Gaussian distribution with mean µ and covariance matrix Σ
N(y_t; µ, Σ)   likelihood of observation y_t being generated by N(µ, Σ)
A              transformation matrix
A^T            transpose of A
A^{-1}         inverse of A
A_{[p]}        top p rows of A
|A|            determinant of A
a_{ij}         element i, j of A
b              bias column vector
λ              set of HMM parameters
p(Y_{1:T}; λ)  likelihood of observation sequence Y_{1:T} given model parameters λ
θ_t            state/component occupied at time t
s_j            state j of the HMM
s_{jm}         component m of state j of the HMM
γ^{(j)}_t      P(θ_t = s_j | Y_{1:T}; λ)
γ^{(jm)}_t     P(θ_t = s_{jm} | Y_{1:T}; λ)
α^{(j)}_t      forward likelihood, p(y_1, ..., y_t, θ_t = s_j; λ)
β^{(j)}_t      backward likelihood, p(y_{t+1}, ..., y_T | θ_t = s_j; λ)
d_j(t)         probability of occupying state s_j for t time steps
q              a phone such as /a/, /t/, etc.
q^{(w)}        a pronunciation for word w, q_1, ..., q_{K_w}
P(w_{1:L})     probability of word sequence w_{1:L}
w_{1:L}        word sequence, w_1, ..., w_L
Acronyms
BN broadcast news
BC broadcast conversations
CAT cluster adaptive training
CDF cumulative distribution function
CMN cepstral mean normalisation
CML conditional maximum likelihood
CMLLR constrained maximum likelihood linear regression
CNC confusion network combination
CTS conversation transcription system
CVN cepstral variance normalisation
DBN dynamic Bayesian network
EBW extended Baum-Welch
EM expectation maximisation
EMLLT extended maximum likelihood linear transform
FFT fast Fourier transform
GAUSS Gaussianisation
GD gender dependent
GMM Gaussian mixture model
HDA heteroscedastic discriminant analysis
HLDA heteroscedastic linear discriminant analysis
HMM hidden Markov model
HTK hidden Markov model toolkit
(see http://htk.eng.cam.ac.uk)
LDA linear discriminant analysis
LM language model
MAP maximum a posteriori
MBR minimum Bayes risk
MCE minimum classification error
MFCC mel-frequency cepstral coefficients
ML maximum likelihood
MLLR maximum likelihood linear regression
MLLT maximum likelihood linear transform
MLP multilayer perceptron
MMI maximum mutual information
MMSE minimum mean square error
MPE minimum phone error
PLP perceptual linear prediction
PMC parallel model combination
PMF probability mass function
ROVER recogniser output voting error reduction
RSW reference speaker weighting
SAT speaker adaptive training
SI speaker independent
SNR signal to noise ratio
SPAM subspace constrained precision and means
VTLN vocal tract length normalisation
VTS vector Taylor series
WER word error rate
WTN word transition network
References
[1] A. Acero, Acoustical and Environmental Robustness in Automatic Speech
Recognition. Kluwer Academic Publishers, 1993.
[2] A. Acero, L. Deng, T. Kristjansson, and J. Zhang, “HMM adaptation using
vector Taylor series for noisy speech recognition,” in Proceedings of ICSLP,
Beijing, China, 2000.
[3] M. Aﬁfy, L. Nguyen, B. Xiang, S. Abdou, and J. Makhoul, “Recent progress in
Arabic broadcast news transcription at BBN,” in Proceedings of Interspeech,
Lisbon, Portugal, September 2005.
[4] S. M. Ahadi and P. C. Woodland, “Combined Bayesian and predictive techniques for rapid speaker adaptation of continuous density hidden Markov models,” Computer Speech and Language, vol. 11, no. 3, pp. 187–206, 1997.
[5] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, “A compact
model for speaker adaptive training,” in Proceedings of ICSLP, Philadelphia,
1996.
[6] J. A. Arrowood, Using Observation Uncertainty for Robust Speech Recognition.
PhD thesis, Georgia Institute of Technology, 2003.
[7] X. Aubert and H. Ney, “Large vocabulary continuous speech recognition using
word graphs,” in Proceedings of ICASSP, vol. 1, pp. 49–52, Detroit, 1995.
[8] S. Axelrod, R. Gopinath, and P. Olsen, “Modelling with a subspace constraint
on inverse covariance matrices,” in Proceedings of ICSLP, Denver, CO, 2002.
[9] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “Maximum mutual
information estimation of hidden Markov model parameters for speech recognition,” in Proceedings of ICASSP, pp. 49–52, Tokyo, 1986.
[10] L. R. Bahl, P. V. de Souza, P. S. Gopalakrishnan, D. Nahamoo, and M. A.
Picheny, “Robust methods for using context-dependent features and models
in a continuous speech recognizer,” in Proceedings of ICASSP, pp. 533–536,
Adelaide, 1994.
[11] J. K. Baker, “The Dragon system — An overview,” IEEE Transactions on
Acoustics Speech and Signal Processing, vol. 23, no. 1, pp. 24–29, 1975.
[12] J. Barker, M. Cooke, and P. D. Green, “Robust ASR based on clean speech
models: an evaluation of missing data techniques for connected digit recognition in noise,” in Proceedings of Eurospeech, pp. 213–216, Aalborg, Denmark,
2001.
[13] L. E. Baum and J. A. Eagon, “An inequality with applications to statistical
estimation for probabilistic functions of Markov processes and to a model for
ecology,” Bulletin of American Mathematical Society, vol. 73, pp. 360–363,
1967.
[14] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximisation technique occurring in the statistical analysis of probabilistic functions of Markov
chains,” Annals of Mathematical Statistics, vol. 41, pp. 164–171, 1970.
[15] P. Beyerlein, “Discriminative model combination,” in Proceedings of ASRU,
Santa Barbara, 1997.
[16] J. A. Bilmes, “Buried Markov models: A graphical-modelling approach
to automatic speech recognition,” Computer Speech and Language, vol. 17,
no. 2–3, 2003.
[17] J. A. Bilmes, “Graphical models and automatic speech recognition,” in Mathematical Foundations of Speech and Language Processing, Institute of Mathematical Analysis Volumes in Mathematics Series, Springer-Verlag, 2003.
[18] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics Speech and Signal Processing, vol. 27,
pp. 113–120, 1979.
[19] C. Breslin and M. J. F. Gales, “Complementary system generation using
directed decision trees,” in Proceedings of ICSLP, Toulouse, 2006.
[20] P. F. Brown, The Acoustic-Modelling Problem in Automatic Speech Recognition. PhD thesis, Carnegie Mellon, 1987.
[21] P. F. Brown, V. J. Della Pietra, P. V. de Souza, J. C. Lai, and R. L. Mercer,
“Class-based N-gram models of natural language,” Computational Linguistics,
vol. 18, no. 4, pp. 467–479, 1992.
[22] W. Byrne, “Minimum Bayes risk estimation and decoding in large vocabulary
continuous speech recognition,” IEICE Transactions on Information and Systems: Special Issue on Statistical Modelling for Speech Recognition, vol. E89-D(3), pp. 900–907, 2006.
[23] H. Y. Chan and P. C. Woodland, “Improving broadcast news transcription by
lightly supervised discriminative training,” in Proceedings of ICASSP, Montreal, Canada, March 2004.
[24] K. T. Chen, W. W. Liao, H. M. Wang, and L. S. Lee, “Fast speaker adaptation
using eigenspace-based maximum likelihood linear regression,” in Proceedings
of ICSLP, Beijing, China, 2000.
[25] S. F. Chen and J. Goodman, “An empirical study of smoothing techniques for
language modelling,” Computer Speech and Language, vol. 13, pp. 359–394,
1999.
[26] S. S. Chen and R. Gopinath, “Gaussianization,” in NIPS 2000, Denver, CO,
2000.
[27] S. S. Chen and R. A. Gopinath, “Model selection in acoustic modelling,” in
Proceedings of Eurospeech, pp. 1087–1090, Rhodes, Greece, 1997.
[28] W. Chou, “Maximum a posteriori linear regression with elliptical symmetric
matrix variants,” in Proceedings of ICASSP, pp. 1–4, Phoenix, USA, 1999.
[29] W. Chou, C. H. Lee, and B. H. Juang, “Minimum error rate training based on
N-best string models,” in Proceedings of ICASSP, pp. 652–655, Minneapolis,
1993.
[30] M. Cooke, P. D. Green, and M. D. Crawford, “Handling missing data in speech
recognition,” in Proceedings of ICSLP, pp. 1555–1558, Yokohama, Japan,
1994.
[31] M. Cooke, A. Morris, and P. D. Green, “Missing data techniques for robust
speech recognition,” in Proceedings of ICASSP, pp. 863–866, Munich, Germany, 1997.
[32] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,”
IEEE Transactions on Acoustics Speech and Signal Processing, vol. 28, no. 4,
pp. 357–366, 1980.
[33] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from
incomplete data via the EM algorithm,” Journal of the Royal Statistical Society Series B, vol. 39, pp. 1–38, 1977.
[34] L. Deng, A. Acero, M. Plumpe, and X. D. Huang, “Large-vocabulary speech
recognition under adverse acoustic environments,” in Proceedings of ICSLP,
pp. 806–809, Beijing, China, 2000.
[35] V. Diakoloukas and V. Digalakis, “Maximum likelihood stochastic transformation adaptation of hidden Markov models,” IEEE Transactions on Speech
and Audio Processing, vol. 7, no. 2, pp. 177–187, 1999.
[36] V. Digalakis, H. Collier, S. Berkowitz, A. Corduneanu, E. Bocchieri, A. Kannan, C. Boulis, S. Khudanpur, W. Byrne, and A. Sankar, “Rapid speech recognizer adaptation to new speakers,” in Proceedings of ICASSP, pp. 765–768,
Phoenix, USA, 1999.
[37] C. Dimitrakakis and S. Bengio, “Boosting HMMs with an application to speech
recognition,” in Proceedings of ICASSP, Montreal, Canada, 2004.
[38] J. Droppo, A. Acero, and L. Deng, “Uncertainty decoding with SPLICE for
noise robust speech recognition,” in Proceedings of ICASSP, Orlando, FL,
2002.
[39] J. Droppo, L. Deng, and A. Acero, “Evaluation of the SPLICE algorithm on
the Aurora 2 database,” in Proceedings of Eurospeech, pp. 217–220, Aalborg,
Denmark, 2001.
[40] G. Evermann, H. Y. Chan, M. J. F. Gales, T. Hain, X. Liu, D. Mrva, L. Wang,
and P. Woodland, “Development of the 2003 CU-HTK conversational
telephone speech transcription system,” in Proceedings of ICASSP, Montreal,
Canada, 2004.
[41] G. Evermann and P. C. Woodland, “Large vocabulary decoding and confidence estimation using word posterior probabilities,” in Proceedings of ICASSP,
pp. 1655–1658, Istanbul, Turkey, 2000.
[42] G. Evermann and P. C. Woodland, “Posterior probability decoding, confidence
estimation and system combination,” in Proceedings of Speech Transcription
Workshop, Baltimore, 2000.
[43] J. Fiscus, “A post-processing system to yield reduced word error rates: Recogniser output voting error reduction (ROVER),” in Proceedings of IEEE ASRU
Workshop, pp. 347–352, Santa Barbara, 1997.
[44] Y. Freund and R. Schapire, “Experiments with a new boosting algorithm,”
Proceedings of the Thirteenth International Conference on Machine Learning,
1996.
[45] Y. Freund and R. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System
Sciences, pp. 119–139, 1997.
[46] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press,
1972.
[47] S. Furui, “Speaker independent isolated word recognition using dynamic features of speech spectrum,” IEEE Transactions ASSP, vol. 34, pp. 52–59, 1986.
[48] M. J. F. Gales, Model-based Techniques for Noise Robust Speech Recognition.
PhD thesis, Cambridge University, 1995.
[49] M. J. F. Gales, “Transformation smoothing for speaker and environmental
adaptation,” in Proceedings of Eurospeech, pp. 2067–2070, Rhodes, Greece,
1997.
[50] M. J. F. Gales, “Maximum likelihood linear transformations for HMM-based
speech recognition,” Computer Speech and Language, vol. 12, pp. 75–98, 1998.
[51] M. J. F. Gales, “Semi-tied covariance matrices for hidden Markov models,”
IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 272–281,
1999.
[52] M. J. F. Gales, “Cluster adaptive training of hidden Markov models,” IEEE
Transactions on Speech and Audio Processing, vol. 8, pp. 417–428, 2000.
[53] M. J. F. Gales, “Maximum likelihood multiple subspace projections for hidden
Markov models,” IEEE Transactions on Speech and Audio Processing, vol. 10,
no. 2, pp. 37–47, 2002.
[54] M. J. F. Gales, “Discriminative models for speech recognition,” in ITA Workshop, University San Diego, USA, February 2007.
[55] M. J. F. Gales, B. Jia, X. Liu, K. C. Sim, P. Woodland, and K. Yu, “Development of the CU-HTK 2004 RT04 Mandarin conversational telephone speech
transcription system,” in Proceedings of ICASSP, Philadelphia, PA, 2005.
[56] M. J. F. Gales, D. Y. Kim, P. C. Woodland, D. Mrva, R. Sinha, and S. E.
Tranter, “Progress in the CU-HTK broadcast news transcription system,”
IEEE Transactions on Speech and Audio Processing, vol. 14, no. 5, September
2006.
[57] M. J. F. Gales and S. J. Young, “Cepstral parameter compensation for
HMM recognition in noise,” Speech Communication, vol. 12, no. 3, pp. 231–
239, 1993.
[58] M. J. F. Gales and S. J. Young, “Robust speech recognition in additive and
convolutional noise using parallel model combination,” Computer Speech and
Language, vol. 9, no. 4, pp. 289–308, 1995.
[59] M. J. F. Gales and S. J. Young, “Robust continuous speech recognition using
parallel model combination,” IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352–359, 1996.
[60] Y. Gao, M. Padmanabhan, and M. Picheny, “Speaker adaptation based on
pre-clustering training speakers,” in Proceedings of EuroSpeech, pp. 2091–2094,
Rhodes, Greece, 1997.
[61] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation of multivariate Gaussian mixture observations of Markov chains,” IEEE Transactions on
Speech and Audio Processing, vol. 2, no. 2, pp. 291–298, 1994.
[62] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD,” in Proceedings of ICASSP, vol. 1, pp. 517–520, San Francisco, 1992.
[63] V. Goel, S. Kumar, and B. Byrne, “Segmental minimum Bayes-risk ASR voting strategies,” in Proceedings of ICSLP, Beijing, China, 2000.
[64] P. S. Gopalakrishnan, D. Kanevsky, A. Nadas, and D. Nahamoo, “A generalisation of the Baum algorithm to rational objective functions,” in Proceedings
of ICASSP, vol. 12, pp. 631–634, Glasgow, 1989.
[65] R. Gopinath, “Maximum likelihood modelling with Gaussian distributions for
classification,” in Proceedings of ICASSP, pp. II–661–II–664, Seattle, 1998.
[66] R. A. Gopinath, M. J. F. Gales, P. S. Gopalakrishnan, S. Balakrishnan-Aiyer, and M. A. Picheny, “Robust speech recognition in noise: performance of the IBM continuous speech recognizer on the ARPA noise spoke task,” in Proceedings of ARPA Workshop on Spoken Language System Technology, Austin,
TX, 1999.
[67] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt, “Hidden conditional
random fields for phone classification,” in Proceedings of Interspeech, Lisbon,
Portugal, September 2005.
[68] R. Haeb-Umbach and H. Ney, “Linear discriminant analysis for improved large vocabulary continuous speech recognition,” in Proceedings of ICASSP, pp. 13–16, Tokyo, 1992.
[69] R. Haeb-Umbach and H. Ney, “Improvements in time-synchronous beam search for 10000-word continuous speech recognition,” IEEE Transactions on
Speech and Audio Processing, vol. 2, pp. 353–356, 1994.
[70] T. Hain, “Implicit pronunciation modelling in ASR,” in ISCA ITRW PMLA,
2002.
[71] T. Hain, P. C. Woodland, T. R. Niesler, and E. W. D. Whittaker, “The 1998
HTK System for transcription of conversational telephone speech,” in Proceedings of ICASSP, pp. 57–60, Phoenix, 1999.
[72] D. Hakkani-Tur, F. Bechet, G. Riccardi, and G. Tur, “Beyond ASR 1-best: Using word confusion networks in spoken language understanding,” Computer
Speech and Language, vol. 20, no. 4, October 2006.
[73] T. J. Hazen and J. Glass, “A comparison of novel techniques for instantaneous
speaker adaptation,” in Proceedings of Eurospeech, pp. 2047–2050, 1997.
[74] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” Journal of Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[75] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, 1994.
[76] F. Jelinek, “A fast sequential decoding algorithm using a stack,” IBM Journal
on Research and Development, vol. 13, 1969.
[77] F. Jelinek, “Continuous speech recognition by statistical methods,” Proceedings of IEEE, vol. 64, no. 4, pp. 532–556, 1976.
[78] F. Jelinek, “A discrete utterance recogniser,” Proceedings of IEEE, vol. 73,
no. 11, pp. 1616–1624, 1985.
[79] H. Jiang, X. Li, and X. Liu, “Large margin hidden Markov models for speech
recognition,” IEEE Transactions on Audio, Speech and Language Processing,
vol. 14, no. 5, pp. 1584–1595, September 2006.
[80] B.-H. Juang, “On the hidden Markov model and dynamic time warping for speech recognition — A unified view,” AT and T Technical Journal, vol. 63,
no. 7, pp. 1213–1243, 1984.
[81] B.-H. Juang, “Maximum-likelihood estimation for mixture multivariate
stochastic observations of Markov chains,” AT and T Technical Journal,
vol. 64, no. 6, pp. 1235–1249, 1985.
[82] B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Transactions on Signal Processing, vol. 14, no. 4, 1992.
[83] B.-H. Juang, S. E. Levinson, and M. M. Sondhi, “Maximum likelihood estimation for multivariate mixture observations of Markov chains,” IEEE Transactions on Information Theory, vol. 32, no. 2, pp. 307–309, 1986.
[84] J. Kaiser, B. Horvat, and Z. Kacic, “A novel loss function for the overall risk
criterion based discriminative training of HMM models,” in Proceedings of
ICSLP, Beijing, China, 2000.
[85] S. M. Katz, “Estimation of probabilities from sparse data for the language
model component of a speech recogniser,” IEEE Transactions on ASSP,
vol. 35, no. 3, pp. 400–401, 1987.
[86] T. Kemp and A. Waibel, “Unsupervised training of a speech recognizer: Recent
experiments,” in Proceedings of EuroSpeech, pp. 2725–2728, September 1999.
[87] D. Y. Kim, G. Evermann, T. Hain, D. Mrva, S. E. Tranter, L. Wang, and P. C.
Woodland, “Recent advances in broadcast news transcription,” in Proceedings
of IEEE ASRU Workshop, (St. Thomas, U.S. Virgin Islands), pp. 105–110,
November 2003.
[88] D. Y. Kim, S. Umesh, M. J. F. Gales, T. Hain, and P. Woodland, “Using VTLN
for broadcast news transcription,” in Proceedings of ICSLP, Jeju, Korea, 2004.
[89] N. G. Kingsbury and P. J. W. Rayner, “Digital filtering using logarithmic
arithmetic,” Electronics Letters, vol. 7, no. 2, pp. 56–58, 1971.
[90] R. Kneser and H. Ney, “Improved clustering techniques for class-based statistical language modelling,” in Proceedings of Eurospeech, pp. 973–976, Berlin,
1993.
[91] R. Kuhn, L. Nguyen, J.-C. Junqua, L. Goldwasser, N. Niedzielski, S. Finke, K. Field, and M. Contolini, “Eigenvoices for speaker adaptation,” in Proceedings of ICSLP, Sydney, 1998.
[92] N. Kumar and A. G. Andreou, “Heteroscedastic discriminant analysis and
reduced rank HMMs for improved speech recognition,” Speech Communication, vol. 26, pp. 283–297, 1998.
[93] H.-K. Kuo and Y. Gao, “Maximum entropy direct models for speech recognition,” IEEE Transactions on Audio Speech and Language Processing, vol. 14,
no. 3, pp. 873–881, 2006.
[94] L. Lamel and J.-L. Gauvain, “Lightly supervised and unsupervised acoustic
model training,” Computer Speech and Language, vol. 16, pp. 115–129, 2002.
[95] M. I. Layton and M. J. F. Gales, “Augmented statistical models for speech
recognition,” in Proceedings of ICASSP, Toulouse, 2006.
[96] L. Lee and R. C. Rose, “Speaker normalisation using efficient frequency warping procedures,” in Proceedings of ICASSP, Atlanta, 1996.
[97] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for
speaker adaptation of continuous density hidden Markov models,” Computer
Speech and Language, vol. 9, no. 2, pp. 171–185, 1995.
[98] S. E. Levinson, “Continuously variable duration hidden Markov models for
automatic speech recognition,” Computer Speech and Language, vol. 1, pp. 29–
45, 1986.
[99] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, “An introduction to the
application of the theory of probabilistic functions of a Markov process to
automatic speech recognition,” Bell Systems Technical Journal, vol. 62, no. 4,
pp. 1035–1074, 1983.
[100] J. Li, M. Siniscalchi, and C.-H. Lee, “Approximate test risk minimization
through soft margin training,” in Proceedings of ICASSP, Honolulu, USA,
2007.
[101] H. Liao, Uncertainty Decoding For Noise Robust Speech Recognition. PhD
thesis, Cambridge University, 2007.
[102] H. Liao and M. J. F. Gales, “Joint uncertainty decoding for noise robust speech
recognition,” in Proceedings of Interspeech, pp. 3129–3132, Lisbon, Portugal,
2005.
[103] H. Liao and M. J. F. Gales, “Issues with uncertainty decoding for noise robust
speech recognition,” in Proceedings of ICSLP, Pittsburgh, PA, 2006.
[104] H. Liao and M. J. F. Gales, “Adaptive training with joint uncertainty decoding
for robust recognition of noisy data,” in Proceedings of ICASSP, Honolulu,
USA, 2007.
[105] R. P. Lippmann and B. A. Carlson, “Using missing feature theory to actively
select features for robust speech recognition with interruptions, filtering and
noise,” in Proceedings of Eurospeech, pp. KN37–KN40, Rhodes, Greece, 1997.
[106] X. Liu and M. J. F. Gales, “Automatic model complexity control using
marginalized discriminative growth functions,” IEEE Transactions on Audio,
Speech and Language Processing, vol. 15, no. 4, pp. 1414–1424, May 2007.
[107] P. Lockwood and J. Boudy, “Experiments with a nonlinear spectral subtractor
(NSS), hidden Markov models and the projection, for robust speech recognition in cars,” Speech Communication, vol. 11, no. 2, pp. 215–228, 1992.
[108] B. T. Lowerre, The Harpy Speech Recognition System. PhD thesis, Carnegie
Mellon, 1976.
[109] X. Luo and F. Jelinek, “Probabilistic classification of HMM states for large vocabulary,” in Proceedings of ICASSP, pp. 2044–2047, Phoenix, USA, 1999.
[110] J. Ma, S. Matsoukas, O. Kimball, and R. Schwartz, “Unsupervised training on
large amount of broadcast news data,” in Proceedings of ICASSP, pp. 1056–
1059, Toulouse, May 2006.
[111] W. Macherey, L. Haferkamp, R. Schlüter, and H. Ney, “Investigations on error
minimizing training criteria for discriminative training in automatic speech
recognition,” in Proceedings of Interspeech, Lisbon, Portugal, September 2005.
[112] B. Mak and R. Hsiao, “Kernel eigenspace-based MLLR adaptation,” IEEE
Transactions Speech and Audio Processing, March 2007.
[113] B. Mak, J. T. Kwok, and S. Ho, “Kernel eigenvoices speaker adaptation,”
IEEE Transactions Speech and Audio Processing, vol. 13, no. 5, pp. 984–992,
September 2005.
[114] L. Mangu, E. Brill, and A. Stolcke, “Finding consensus among words: Lattice-based word error minimisation,” Computer Speech and Language, vol. 14,
no. 4, pp. 373–400, 2000.
[115] S. Martin, J. Liermann, and H. Ney, “Algorithms for bigram and trigram word
clustering,” in Proceedings of Eurospeech, vol. 2, pp. 1253–1256, Madrid, 1995.
[116] A. Matsoukas, T. Colthurst, F. Richardson, O. Kimball, C. Quillen,
A. Solomonoff, and H. Gish, “The BBN 2001 English conversational speech
system,” in Presentation at 2001 NIST Large Vocabulary Conversational
Speech Workshop, 2001.
[117] S. Matsoukas, J.-L. Gauvain, A. Adda, T. Colthurst, C. I. Kao, O. Kimball,
L. Lamel, F. Lefevre, J. Z. Ma, J. Makhoul, L. Nguyen, R. Prasad, R. Schwartz,
H. Schwenk, and B. Xiang, “Advances in transcription of broadcast news and
conversational telephone speech within the combined EARS BBN/LIMSI system,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14,
no. 5, pp. 1541–1556, September 2006.
[118] T. Matsui and S. Furui, “N-best unsupervised speaker adaptation for speech
recognition,” Computer Speech and Language, vol. 12, pp. 41–50, 1998.
[119] E. McDermott, T. J. Hazen, J. Le Roux, A. Nakamura, and K. Katagiri,
“Discriminative training for large-vocabulary speech recognition using minimum classification error,” IEEE Transactions on Audio Speech and Language
Processing, vol. 15, no. 1, pp. 203–223, 2007.
[120] J. McDonough, W. Byrne, and X. Luo, “Speaker normalisation with all-pass transforms,” in Proceedings of ICSLP, Sydney, 1998.
[121] C. Meyer, “Utterance-level boosting of HMM speech recognisers,” in Proceedings of ICASSP, Orlando, FL, 2002.
[122] M. Mohri, F. Pereira, and M. Riley, “Weighted finite state transducers in
speech recognition,” Computer Speech and Language, vol. 16, no. 1, pp. 69–
88, 2002.
[123] P. J. Moreno, Speech recognition in noisy environments. PhD thesis, Carnegie
Mellon University, 1996.
[124] P. J. Moreno, B. Raj, and R. Stern, “A vector Taylor series approach for
environment-independent speech recognition,” in Proceedings of ICASSP,
pp. 733–736, Atlanta, 1996.
[125] A. Nadas, “A decision theoretic formulation of a training problem in speech
recognition and a comparison of training by unconditional versus conditional maximum likelihood,” IEEE Transactions on Acoustics Speech and Signal Processing, vol. 31, no. 4, pp. 814–817, 1983.
[126] L. R. Neumeyer, A. Sankar, and V. V. Digalakis, “A comparative study of
speaker adaptation techniques,” in Proceedings of Eurospeech, pp. 1127–1130,
Madrid, 1995.
[127] H. Ney, U. Essen, and R. Kneser, “On structuring probabilistic dependences in
stochastic language modelling,” Computer Speech and Language, vol. 8, no. 1,
pp. 1–38, 1994.
[128] L. Nguyen and B. Xiang, “Light supervision in acoustic model training,” in
Proceedings of ICASSP, Montreal, Canada, March 2004.
[129] Y. Normandin, “An improved MMIE training algorithm for speaker independent, small vocabulary, continuous speech recognition,” in Proceedings of
ICASSP, Toronto, 1991.
[130] J. J. Odell, V. Valtchev, P. C. Woodland, and S. J. Young, “A one-pass decoder design for large vocabulary recognition,” in Proceedings of Human Language Technology Workshop, pp. 405–410, Plainsboro NJ, Morgan Kaufman Publishers Inc., 1994.
[131] P. Olsen and R. Gopinath, “Modelling inverse covariance matrices by basis
expansion,” in Proceedings of ICSLP, Denver, CO, 2002.
[132] S. Ortmanns, H. Ney, and X. Aubert, “A word graph algorithm for large
vocabulary continuous speech recognition,” Computer Speech and Language,
vol. 11, no. 1, pp. 43–72, 1997.
[133] M. Padmanabhan, G. Saon, and G. Zweig, “Lattice-based unsupervised MLLR for speaker adaptation,” in Proceedings of ITRW ASR2000: ASR Challenges for the New Millenium, pp. 128–132, Paris, 2000.
[134] D. S. Pallet, J. G. Fiscus, J. Garofolo, A. Martin, and M. Przybocki, “1998
broadcast news benchmark test results: English and non-English word error
rate performance measures,” Tech. Rep., National Institute of Standards and
Technology (NIST), 1998.
[135] D. B. Paul, “Algorithms for an optimal A* search and linearizing the search
in the stack decoder,” in Proceedings of ICASSP, pp. 693–696, Toronto, 1991.
[136] D. Povey, Discriminative Training for Large Vocabulary Speech Recognition.
PhD thesis, Cambridge University, 2004.
[137] D. Povey, M. J. F. Gales, D. Y. Kim, and P. C. Woodland, “MMI-MAP and MPE-MAP for acoustic model adaptation,” in Proceedings of EuroSpeech,
Geneva, Switzerland, September 2003.
[138] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, “fMPE:
Discriminatively trained features for speech recognition,” in Proceedings of
ICASSP, Philadelphia, 2005.
[139] D. Povey and P. Woodland, “Minimum phone error and I-smoothing for improved discriminative training,” in Proceedings of ICASSP, Orlando, FL,
2002.
[140] P. J. Price, W. Fisher, J. Bernstein, and D. S. Pallet, “The DARPA 1000-word Resource Management database for continuous speech recognition,” Proceedings of ICASSP, vol. 1, pp. 651–654, New York, 1988.
[141] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications
in speech recognition,” Proceedings of IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[142] L. R. Rabiner, B.-H. Juang, S. E. Levinson, and M. M. Sondhi, “Recognition of isolated digits using HMMs with continuous mixture densities,” AT&T Technical Journal, vol. 64, no. 6, pp. 1211–1233, 1985.
[143] B. Raj and R. Stern, “Missing feature approaches in speech recognition,”
IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 101–116, 2005.
[144] F. Richardson, M. Ostendorf, and J. R. Rohlicek, “Lattice-based search strategies for large vocabulary recognition,” in Proceedings of ICASSP, vol. 1, pp. 576–579, Detroit, 1995.
[145] B. Roark, M. Saraclar, and M. Collins, “Discriminative N-gram language modelling,” Computer Speech and Language, vol. 21, no. 2, 2007.
[146] A. V. Rosti and M. Gales, “Factor analysed hidden Markov models for speech
recognition,” Computer Speech and Language, vol. 18, no. 2, pp. 181–200, 2004.
[147] M. J. Russell and R. K. Moore, “Explicit modelling of state occupancy in
hidden Markov models for automatic speech recognition,” in Proceedings of
ICASSP, pp. 5–8, Tampa, FL, 1985.
[148] G. Saon, A. Dharanipragada, and D. Povey, “Feature space Gaussianization,”
in Proceedings of ICASSP, Montreal, Canada, 2004.
[149] G. Saon, M. Padmanabhan, R. Gopinath, and S. Chen, “Maximum likelihood
discriminant feature spaces,” in Proceedings of ICASSP, Istanbul, 2000.
[150] G. Saon, D. Povey, and G. Zweig, “CTS decoding improvements at IBM,” in
EARS STT workshop, St. Thomas, U.S. Virgin Islands, December 2003.
[151] L. K. Saul and M. G. Rahim, “Maximum likelihood and minimum classification error factor analysis for automatic speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 115–125, 2000.
[152] R. Schluter, B. Muller, F. Wessel, and H. Ney, “Interdependence of language
models and discriminative training,” in Proceedings of IEEE ASRU Workshop,
pp. 119–122, Keystone, CO, 1999.
[153] R. Schwartz and Y.-L. Chow, “A comparison of several approximate algorithms for finding multiple (N-best) sentence hypotheses,” in Proceedings of ICASSP, pp. 701–704, Toronto, 1991.
[154] H. Schwenk, “Using boosting to improve a hybrid HMM/Neural-Network speech recogniser,” in Proceedings of ICASSP, Phoenix, 1999.
[155] M. Seltzer, B. Raj, and R. Stern, “A Bayesian framework for spectrographic mask estimation for missing feature speech recognition,” Speech Communication, vol. 43, no. 4, pp. 379–393, 2004.
[156] F. Sha and L. K. Saul, “Large margin Gaussian mixture modelling for automatic speech recognition,” in Advances in Neural Information Processing Systems, pp. 1249–1256, 2007.
[157] K. Shinoda and C. H. Lee, “Structural MAP speaker adaptation using hierarchical priors,” in Proceedings of ASRU’97, Santa Barbara, 1997.
[158] K. C. Sim and M. J. F. Gales, “Basis superposition precision matrix models for
large vocabulary continuous speech recognition,” in Proceedings of ICASSP,
Montreal, Canada, 2004.
[159] K. C. Sim and M. J. F. Gales, “Discriminative semi-parametric trajectory models for speech recognition,” Computer Speech and Language, vol. 21, pp. 669–687, 2007.
[160] R. Sinha, M. J. F. Gales, D. Y. Kim, X. Liu, K. C. Sim, and P. C. Woodland, “The CU-HTK Mandarin broadcast news transcription system,” in Proceedings of ICASSP, Toulouse, 2006.
[161] R. Sinha, S. E. Tranter, M. J. F. Gales, and P. C. Woodland, “The Cambridge University March 2005 speaker diarisation system,” in Proceedings of InterSpeech, Lisbon, Portugal, September 2005.
[162] O. Siohan, B. Ramabhadran, and B. Kingsbury, “Constructing ensembles of
ASR systems using randomized decision trees,” in Proceedings of ICASSP,
Philadelphia, 2005.
[163] H. Soltau, B. Kingsbury, L. Mangu, D. Povey, G. Saon, and G. Zweig, “The IBM 2004 conversational telephony system for rich transcription,” in Proceedings of ICASSP, Philadelphia, PA, 2005.
[164] S. Srinivasan and D. Wang, “Transforming binary uncertainties for robust speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 7, pp. 2130–2140, 2007.
[165] A. Stolcke, E. Brill, and M. Weintraub, “Explicit word error minimization in N-Best list rescoring,” in Proceedings of EuroSpeech, Rhodes, Greece, 1997.
[166] B. Taskar, Learning Structured Prediction Models: A Large Margin Approach. PhD thesis, Stanford University, 2004.
[167] H. Thompson, “Best-first enumeration of paths through a lattice — An active chart parsing solution,” Computer Speech and Language, vol. 4, no. 3, pp. 263–274, 1990.
[168] K. Tokuda, H. Zen, and A. W. Black, Text to speech synthesis: New paradigms and advances. Chapter: HMM-Based Approach to Multilingual Speech Synthesis, Prentice Hall, 2004.
[169] K. Tokuda, H. Zen, and T. Kitamura, “Reformulating the HMM as a trajectory model,” in Proceedings of Beyond HMM Workshop, Tokyo, December 2004.
[170] S. E. Tranter and D. A. Reynolds, “An overview of automatic speaker diarisation systems,” IEEE Transactions on Speech and Audio Processing, vol. 14, no. 5, September 2006.
[171] S. Tsakalidis, V. Doumpiotis, and W. J. Byrne, “Discriminative linear transforms for feature normalisation and speaker adaptation in HMM estimation,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 367–376, 2005.
[172] L. F. Uebel and P. C. Woodland, “Improvements in linear transform based
speaker adaptation,” in Proceedings of ICASSP, pp. 49–52, Seattle, 2001.
[173] V. Valtchev, J. Odell, P. C. Woodland, and S. J. Young, “A novel decoder
design for large vocabulary recognition,” in Proceedings of ICSLP, Yokohama,
Japan, 1994.
[174] V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young, “MMIE training of large vocabulary recognition systems,” Speech Communication, vol. 22, pp. 303–314, 1997.
[175] V. Vanhoucke and A. Sankar, “Mixtures of inverse covariances,” in Proceedings
of ICASSP, Montreal, Canada, 2003.
[176] A. P. Varga and R. K. Moore, “Hidden Markov model decomposition of speech and noise,” in Proceedings of ICASSP, pp. 845–848, Albuquerque, 1990.
[177] A. J. Viterbi, “Error bounds for convolutional codes and asymptotically optimum decoding algorithm,” IEEE Transactions on Information Theory, vol. 13, pp. 260–269, 1967.
[178] F. Wallhoff, D. Willett, and G. Rigoll, “Frame-discriminative and confidence-driven adaptation for LVCSR,” in Proceedings of ICASSP, pp. 1835–1838, Istanbul, 2000.
[179] L. Wang, M. J. F. Gales, and P. C. Woodland, “Unsupervised training for
Mandarin broadcast news and conversation transcription,” in Proceedings of
ICASSP, Honolulu, USA, 2007.
[180] L. Wang and P. Woodland, “Discriminative adaptive training using the MPE
criterion,” in Proceedings of ASRU, St Thomas, U.S. Virgin Islands, 2003.
[181] C. J. Wellekens, “Explicit time correlation in hidden Markov models for speech
recognition,” in Proceedings of ICASSP, pp. 384–386, Dallas, TX, 1987.
[182] P. Woodland and D. Povey, “Large scale discriminative training of hidden
Markov models for speech recognition,” Computer Speech and Language,
vol. 16, pp. 25–47, 2002.
[183] P. Woodland, D. Pye, and M. J. F. Gales, “Iterative unsupervised adaptation using maximum likelihood linear regression,” in Proceedings of ICSLP’96, pp. 1133–1136, Philadelphia, 1996.
[184] P. C. Woodland, “Hidden Markov models using vector linear predictors and discriminative output distributions,” in Proceedings of ICASSP, San Francisco, 1992.
[185] P. C. Woodland, M. J. F. Gales, D. Pye, and S. J. Young, “The development of
the 1996 HTK broadcast news transcription system,” in Proceedings of Human
Language Technology Workshop, 1997.
[186] Y.-J. Wu and R.-H. Wang, “Minimum generation error training for HMM-based speech synthesis,” in Proceedings of ICASSP, Toulouse, 2006.
[187] S. J. Young, “Generating multiple solutions from connected word DP recognition algorithms,” in Proceedings of IOA Autumn Conference, vol. 6, pp. 351–354, 1984.
[188] S. J. Young and L. L. Chase, “Speech recognition evaluation: A review of the
US CSR and LVCSR programmes,” Computer Speech and Language, vol. 12,
no. 4, pp. 263–279, 1998.
[189] S. J. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland, The HTK Book (for HTK Version 3.4). University of Cambridge, http://htk.eng.cam.ac.uk, December 2006.
[190] S. J. Young, J. J. Odell, and P. C. Woodland, “Tree-based state tying for high accuracy acoustic modelling,” in Proceedings of Human Language Technology Workshop, pp. 307–312, Plainsboro NJ, Morgan Kaufman Publishers Inc, 1994.
[191] S. J. Young, N. H. Russell, and J. H. S. Thornton, “Token passing: A simple conceptual model for connected speech recognition systems,” Tech. Rep. CUED/F-INFENG/TR38, Cambridge University Engineering Department, 1989.
[192] S. J. Young, N. H. Russell, and J. H. S. Thornton, “The use of syntax and
multiple alternatives in the VODIS voice operated database inquiry system,”
Computer Speech and Language, vol. 5, no. 1, pp. 65–80, 1991.
[193] K. Yu and M. J. F. Gales, “Discriminative cluster adaptive training,” IEEE
Transactions on Speech and Audio Processing, vol. 14, no. 5, pp. 1694–1703,
2006.
[194] K. Yu and M. J. F. Gales, “Bayesian adaptive inference and adaptive training,” IEEE Transactions on Speech and Audio Processing, vol. 15, no. 6, pp. 1932–1943, August 2007.
[195] K. Yu, M. J. F. Gales, and P. C. Woodland, “Unsupervised training with
directed manual transcription for recognising Mandarin broadcast audio,” in
Proceedings of InterSpeech, Antwerp, 2007.
[196] H. Zen, K. Tokuda, and T. Kitamura, “A Viterbi algorithm for a trajectory model derived from HMM with explicit relationship between static and dynamic features,” in Proceedings of ICASSP, Montreal, Canada, 2004.
[197] B. Zhang, S. Matsoukas, and R. Schwartz, “Discriminatively trained region-dependent feature transforms for speech recognition,” in Proceedings of ICASSP, Toulouse, 2006.
[198] J. Zheng and A. Stolcke, “Improved discriminative training using phone lattices,” in Proceedings of InterSpeech, Lisbon, Portugal, September 2005.
[199] Q. Zhu, Y. Chen, and N. Morgan, “On using MLP features in LVCSR,” in
Proceedings of ICSLP, Jeju, Korea, 2004.
[200] G. Zweig, Speech Recognition with Dynamic Bayesian Networks. PhD thesis,
University of California, Berkeley, 1998.
[201] G. Zweig and M. Padmanabhan, “Boosting Gaussian Mixtures in an LVCSR
System,” in Proceedings of ICASSP, Istanbul, 2000.
Foundations and Trends in Signal Processing, Vol. 1, No. 3 (2007) 195–304. © 2008 M. Gales and S. Young. DOI: 10.1561/2000000004

The Application of Hidden Markov Models in Speech Recognition

Mark Gales (Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK, mjfg@eng.cam.ac.uk)
Steve Young (Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK, sjy@eng.cam.ac.uk)

Abstract

Hidden Markov Models (HMMs) provide a simple and effective framework for modelling time-varying spectral vector sequences. As a consequence, almost all present day large vocabulary continuous speech recognition (LVCSR) systems are based on HMMs. Whereas the basic principles underlying HMM-based LVCSR are rather straightforward, the approximations and simplifying assumptions involved in a direct implementation of these principles would result in a system which has poor accuracy and unacceptable sensitivity to changes in operating environment. Thus, the practical application of HMMs in modern systems involves considerable sophistication. The aim of this review is first to present the core architecture of a HMM-based LVCSR system and then describe the various refinements which are needed to achieve state-of-the-art performance. These refinements include feature projection, improved covariance modelling, discriminative parameter estimation, adaptation and normalisation, noise compensation and multi-pass system combination. The review concludes with a case study of LVCSR for Broadcast News and Conversation transcription in order to illustrate the techniques described.
Contents

1 Introduction
2 Architecture of an HMM-Based Recogniser
2.1 Feature Extraction
2.2 HMM Acoustic Models (Basic-Single Component)
2.3 N-gram Language Models
2.4 Decoding and Lattice Generation
3 HMM Structure Refinements
3.1 Dynamic Bayesian Networks
3.2 Gaussian Mixture Models
3.3 Feature Projections
3.4 Covariance Modelling
3.5 HMM Duration Modelling
3.6 HMMs for Speech Generation
4 Parameter Estimation
4.1 Discriminative Training
4.2 Implementation Issues
4.3 Lightly Supervised and Unsupervised Training
5 Adaptation and Normalisation
5.1 Feature-Based Schemes
5.2 Linear Transform-Based Schemes
5.3 Gender/Cluster Dependent Models
5.4 Maximum a Posteriori (MAP) Adaptation
5.5 Adaptive Training
6 Noise Robustness
6.1 Feature Enhancement
6.2 Model-Based Compensation
6.3 Uncertainty-Based Approaches
7 Multi-Pass Recognition Architectures
7.1 Example Architecture
7.2 System Combination
7.3 Complementary System Generation
7.4 Example Application — Broadcast News Transcription
Conclusions
Acknowledgments
Notations and Acronyms
References
1 Introduction

Automatic continuous speech recognition (CSR) has many potential applications including command and control, dictation, transcription of recorded speech, searching audio documents and interactive spoken dialogues. The core of all speech recognition systems consists of a set of statistical models representing the various sounds of the language to be recognised. Since speech has temporal structure and can be encoded as a sequence of spectral vectors spanning the audio frequency range, the hidden Markov model (HMM) provides a natural framework for constructing such models [13].

HMMs lie at the heart of virtually all modern speech recognition systems and although the basic framework has not changed significantly in the last decade or more, the detailed modelling techniques developed within this framework have evolved to a state of considerable sophistication (e.g. [40, 117, 163]). The result has been steady and significant progress and it is the aim of this review to describe the main techniques by which this has been achieved.

The foundations of modern HMM-based continuous speech recognition technology were laid down in the 1970's by groups at Carnegie-Mellon and IBM who introduced the use of discrete density HMMs [11, 77, 108], and then later at Bell Labs [80, 81, 99] where continuous density HMMs were introduced.¹ An excellent tutorial covering the basic HMM technologies developed in this period is given in [141].

Reflecting the computational power of the time, initial development in the 1980's focussed on either discrete word speaker dependent large vocabulary systems (e.g. [78]) or whole word small vocabulary speaker independent applications (e.g. [142]). In the early 90's, attention switched to continuous speaker-independent recognition. Starting with the artificial 1000 word Resource Management task [140], the technology developed rapidly and by the mid-1990's, reasonable accuracy was being achieved for unrestricted speaker independent dictation. Much of this development was driven by a series of DARPA and NSA programmes [188] which set ever more challenging tasks culminating most recently in systems for multilingual transcription of broadcast news programmes [134] and for spontaneous telephone conversations [62]. A key benefit of the statistical approach to speech recognition is that the required models are trained automatically on data.

Many research groups have contributed to this progress, and each will typically have its own architectural perspective. For the sake of logical coherence, the presentation given here is somewhat biassed towards the architecture developed at Cambridge University and supported by the HTK Software Toolkit [189].²

The review is organised as follows. Firstly, in Architecture of a HMM-Based Recogniser the key architectural ideas of a typical HMM-based recogniser are described. The intention here is to present an overall system design using very basic acoustic models. In particular, simple single Gaussian diagonal covariance HMMs are assumed. The following section HMM Structure Refinements then describes the various ways in which the limitations of these basic HMMs can be overcome, for example by transforming features and using more complex HMM output distributions. The section Parameter Estimation discusses the different objective functions that can be optimised in training and their effects on performance. Any system designed to work reliably in real-world applications must be robust to changes in speaker and the environment. The section on Adaptation and Normalisation presents a variety of generic techniques for achieving robustness. The following section Noise Robustness then discusses more specialised techniques for specifically handling additive and convolutional noise. The section Multi-Pass Recognition Architectures returns to the topic of the overall system architecture and explains how multiple passes over the speech signal using different model combinations can be exploited to further improve performance. This final section also describes some actual systems built for transcribing English, Mandarin and Arabic in order to illustrate the various techniques discussed in the review. The review concludes in Conclusions with some general observations and conclusions.

1 This very brief historical perspective is far from complete and out of necessity omits many other important contributions to the early years of HMM-based speech recognition.
2 Available for free download at htk.eng.cam.ac.uk. This includes a recipe for building a state-of-the-art recogniser for the Resource Management task which illustrates a number of the approaches described in this review.
2 Architecture of an HMM-Based Recogniser

The principal components of a large vocabulary continuous speech recogniser are illustrated in Figure 2.1. The input audio waveform from a microphone is converted into a sequence of fixed size acoustic vectors Y_1:T = y_1, ..., y_T in a process called feature extraction. The decoder then attempts to find the sequence of words w_1:L = w_1, ..., w_L which is most likely to have generated Y, i.e. the decoder tries to find

    ŵ = arg max_w {P(w|Y)}.    (2.1)

However, since P(w|Y) is difficult to model directly,¹ Bayes' Rule is used to transform (2.1) into the equivalent problem of finding:

    ŵ = arg max_w {p(Y|w)P(w)}.    (2.2)

The likelihood p(Y|w) is determined by an acoustic model and the prior P(w) is determined by a language model.² The basic unit of sound represented by the acoustic model is the phone. For example, the word “bat” is composed of three phones /b/ /ae/ /t/. About 40 such phones are required for English.

Fig. 2.1 Architecture of a HMM-based Recogniser.

For any given w, the corresponding acoustic model is synthesised by concatenating phone models to make words as defined by a pronunciation dictionary. The parameters of these phone models are estimated from training data consisting of speech waveforms and their orthographic transcriptions. The language model is typically an N-gram model in which the probability of each word is conditioned only on its N−1 predecessors. The N-gram parameters are estimated by counting N-tuples in appropriate text corpora. The decoder operates by searching through all possible word sequences using pruning to remove unlikely hypotheses thereby keeping the search tractable. When the end of the utterance is reached, the most likely word sequence is output. Alternatively, modern decoders can generate lattices containing a compact representation of the most likely hypotheses. The following sections describe these processes and components in more detail.

1 There are some systems that are based on discriminative models [54] where P(w|Y) is modelled directly, rather than using generative models, such as HMMs, where the observation sequence is modelled, i.e. p(Y|w).
2 In practice, the acoustic model is not normalised and the language model is often scaled by an empirically determined constant and a word insertion penalty is added, i.e. in the log domain the total likelihood is calculated as log p(Y|w) + α log(P(w)) + β|w|, where α is typically in the range 8–20 and β is typically in the range 0 to −20.

2.1 Feature Extraction

The feature extraction stage seeks to provide a compact representation of the speech waveform. This form should minimise the loss of information that discriminates between words, and provide a good match with the distributional assumptions made by the acoustic models. For example, if diagonal covariance Gaussian distributions are used for the state-output distributions then the features should be designed to be Gaussian and uncorrelated.

Feature vectors are typically computed every 10 ms using an overlapping analysis window of around 25 ms. One of the simplest and most widely used encoding schemes is based on mel-frequency cepstral coefficients (MFCCs) [32]. These are generated by applying a truncated discrete cosine transformation (DCT) to a log spectral estimate computed by smoothing an FFT with around 20 frequency bins distributed non-linearly across the speech spectrum. The non-linear frequency scale used is called a mel scale and it approximates the response of the human ear. The DCT is applied in order to smooth the spectral estimate and approximately decorrelate the feature elements. After the cosine transform the first element represents the average of the log-energy of the frequency bins. This is sometimes replaced by the log-energy of the frame, or removed completely.

Further psychoacoustic constraints are incorporated into a related encoding called perceptual linear prediction (PLP) [74]. PLP computes linear prediction coefficients from a perceptually weighted non-linearly compressed power spectrum and then transforms the linear prediction coefficients to cepstral coefficients. In practice, PLP can give small improvements over MFCCs, especially in noisy environments and hence it is the preferred encoding for many systems [185].

In addition to the spectral coefficients, first order (delta) and second-order (delta–delta) regression coefficients are often appended in a heuristic attempt to compensate for the conditional independence assumption made by the HMM-based acoustic models [47]. If the original (static) feature vector is y^s_t, then the delta parameter, ∆y^s_t, is given by

    ∆y^s_t = ( Σ_{i=1}^{n} w_i (y^s_{t+i} − y^s_{t−i}) ) / ( 2 Σ_{i=1}^{n} w_i² ),    (2.3)

where n is the window width and w_i are the regression coefficients.³ The delta–delta parameters, ∆²y^s_t, are derived in the same fashion, but using differences of the delta parameters. When concatenated together these form the feature vector y_t,

    y_t = [ y^{sT}_t  ∆y^{sT}_t  ∆²y^{sT}_t ]^T.    (2.4)

The final result is a feature vector whose dimensionality is typically around 40 and which has been partially but not fully decorrelated.

3 In HTK, to ensure that the same number of frames is maintained after adding delta and delta–delta parameters, the start and end elements are replicated to fill the regression window.

2.2 HMM Acoustic Models (Basic-Single Component)

As noted above, each spoken word w is decomposed into a sequence of K_w basic sounds called base phones. This sequence is called its pronunciation q^{(w)}_{1:K_w} = q_1, ..., q_{K_w}. To allow for the possibility of multiple pronunciations, the likelihood p(Y|w) can be computed over multiple pronunciations⁴

    p(Y|w) = Σ_Q p(Y|Q)P(Q|w),    (2.5)

where the summation is over all valid pronunciation sequences for w, Q is a particular sequence of pronunciations,

    P(Q|w) = Π_{l=1}^{L} P(q^{(w_l)}|w_l),    (2.6)

and where each q^{(w_l)} is a valid pronunciation for word w_l. In practice, there will only be a very small number of alternative pronunciations for each w_l, making the summation in (2.5) easily tractable.

4 Recognisers often approximate this by a max operation so that alternative pronunciations can be decoded as though they were alternative word hypotheses.

Each base phone q is represented by a continuous density HMM of the form illustrated in Figure 2.2 with transition probability parameters {a_ij} and output observation distributions {b_j()}. In operation, an HMM makes a transition from its current state to one of its connected states every time step. The probability of making a particular transition from state s_i to state s_j is given by the transition probability a_ij. On entering a state, a feature vector is generated using the distribution associated with the state being entered. This form of process yields the standard conditional independence assumptions for an HMM:

• states are conditionally independent of all other states given the previous state;
• observations are conditionally independent of all other observations given the state that generated it.

For a more detailed discussion of the operation of an HMM see [141].

Fig. 2.2 HMM-based phone model.

For now, single multivariate Gaussians will be assumed for the output distribution:

    b_j(y) = N(y; µ^(j), Σ^(j)),    (2.7)

where µ^(j) is the mean of state s_j and Σ^(j) is its covariance. Since the dimensionality of the acoustic vector y is relatively high, the covariances are often constrained to be diagonal. Later in HMM Structure Refinements, the benefits of using mixtures of Gaussians will be discussed.

Given the composite HMM Q formed by concatenating all of the constituent base phones q^{(w_1)}, ..., q^{(w_L)}, then the acoustic likelihood is given by

    p(Y|Q) = Σ_θ p(θ, Y|Q),    (2.8)
Architecture of an HMMBased Recogniser
where θ = θ0 , . . . , θT +1 is a state sequence through the composite model and
T
p(θ, Y Q) = aθ0 θ1
t=1
bθt (y t )aθt θt+1 .
(2.9)
In this equation, θ_0 and θ_{T+1} correspond to the non-emitting entry and exit states shown in Figure 2.2. These are included to simplify the process of concatenating phone models to make words. For simplicity in what follows, these non-emitting states will be ignored and the focus will be on the state sequence θ_1, . . . , θ_T.

The acoustic model parameters λ = [{a_ij}, {b_j(·)}] can be efficiently estimated from a corpus of training utterances using the forward–backward algorithm [14], which is an example of expectation-maximisation (EM) [33]. For each utterance Y^{(r)}, r = 1, . . . , R, of length T^{(r)}, the sequence of baseforms, the HMMs that correspond to the word-sequence in the utterance, is found and the corresponding composite HMM constructed. In the first phase of the algorithm, the E-step, the forward probability α_t^{(rj)} = p(Y_{1:t}^{(r)}, θ_t = s_j; λ) and the backward probability β_t^{(ri)} = p(Y_{t+1:T^{(r)}}^{(r)} | θ_t = s_i; λ) are calculated via the following recursions

  α_t^{(rj)} = [ Σ_i α_{t−1}^{(ri)} a_ij ] b_j(y_t^{(r)}),   (2.10)

  β_t^{(ri)} = Σ_j a_ij b_j(y_{t+1}^{(r)}) β_{t+1}^{(rj)},   (2.11)

where i and j are summed over all states. When performing these recursions, underflow can occur for long speech segments; hence in practice the log-probabilities are stored and log arithmetic is used to avoid this problem [89].⁵ Given the forward and backward probabilities, the probability of the model occupying state s_j at time t for any given utterance r is just

  γ_t^{(rj)} = P(θ_t = s_j | Y^{(r)}; λ) = (1 / P^{(r)}) α_t^{(rj)} β_t^{(rj)},   (2.12)

⁵ This is the technique used in HTK [189].
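The E-step recursions (2.10)–(2.12) and the occupation-weighted mean update (2.13) can be sketched for a toy single-Gaussian HMM with scalar observations, using log arithmetic as just described. All function and variable names below are illustrative, not part of any particular toolkit:

```python
import math

def log_gauss(y, mu, var):
    # log N(y; mu, var) for a scalar observation
    return -0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)

def logsumexp(xs):
    # stable log(sum(exp(x))) used in place of probability additions
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_backward(Y, log_a, means, varis):
    """E-step: log alpha/beta recursions and occupation probabilities
    gamma for one utterance; the utterance is assumed to start in state 0."""
    N, T = len(means), len(Y)
    log_b = [[log_gauss(Y[t], means[j], varis[j]) for j in range(N)]
             for t in range(T)]
    log_alpha = [[-math.inf] * N for _ in range(T)]
    log_alpha[0][0] = log_b[0][0]
    for t in range(1, T):
        for j in range(N):
            log_alpha[t][j] = logsumexp(
                [log_alpha[t - 1][i] + log_a[i][j] for i in range(N)]) + log_b[t][j]
    # beta recursion; any state may end the utterance
    log_beta = [[0.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            log_beta[t][i] = logsumexp(
                [log_a[i][j] + log_b[t + 1][j] + log_beta[t + 1][j]
                 for j in range(N)])
    log_P = logsumexp(log_alpha[T - 1])      # total data log-likelihood
    gamma = [[math.exp(log_alpha[t][j] + log_beta[t][j] - log_P)
              for j in range(N)] for t in range(T)]
    return gamma, log_P

def update_means(Y, gamma, N):
    """M-step: new means as occupation-count-weighted averages."""
    return [sum(g[j] * y for g, y in zip(gamma, Y)) /
            sum(g[j] for g in gamma) for j in range(N)]
```

A real trainer additionally accumulates the covariance and transition statistics across all R utterances before updating, exactly as in the re-estimation formulae that follow.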
where P^{(r)} = p(Y^{(r)}; λ). These state occupation probabilities, also called occupation counts, represent a soft alignment of the model states to the data, and it is straightforward to show that the new set of Gaussian parameters defined by [83]

  μ̂^{(j)} = [ Σ_{r=1}^{R} Σ_{t=1}^{T^{(r)}} γ_t^{(rj)} y_t^{(r)} ] / [ Σ_{r=1}^{R} Σ_{t=1}^{T^{(r)}} γ_t^{(rj)} ],   (2.13)

  Σ̂^{(j)} = [ Σ_{r=1}^{R} Σ_{t=1}^{T^{(r)}} γ_t^{(rj)} (y_t^{(r)} − μ̂^{(j)})(y_t^{(r)} − μ̂^{(j)})^T ] / [ Σ_{r=1}^{R} Σ_{t=1}^{T^{(r)}} γ_t^{(rj)} ],   (2.14)

maximise the likelihood of the data given these alignments. A similar re-estimation equation can be derived for the transition probabilities

  â_ij = [ Σ_{r=1}^{R} (1 / P^{(r)}) Σ_{t=1}^{T^{(r)}} α_t^{(ri)} a_ij b_j(y_{t+1}^{(r)}) β_{t+1}^{(rj)} ] / [ Σ_{r=1}^{R} Σ_{t=1}^{T^{(r)}} γ_t^{(ri)} ].   (2.15)
This is the second, or M-step, of the algorithm. Starting from some initial estimate of the parameters, λ^{(0)}, successive iterations of the EM algorithm yield parameter sets λ^{(1)}, λ^{(2)}, . . . which are guaranteed to improve the likelihood up to some local maximum. A common choice for the initial parameter set λ^{(0)} is to assign the global mean and covariance of the data to the Gaussian output distributions and to set all transition probabilities to be equal. This gives a so-called flat start model.

This approach to acoustic modelling is often referred to as the beads-on-a-string model, so-called because all speech utterances are represented by concatenating a sequence of phone models together. The major problem with this is that decomposing each vocabulary word into a sequence of context-independent base phones fails to capture the very large degree of context-dependent variation that exists in real speech. For example, the base form pronunciations for "mood" and "cool" would use the same vowel for "oo," yet in practice the realisations of "oo" in the two contexts are very different due to the influence of the preceding and following consonant. Context-independent phone models are referred to as monophones. A simple way to mitigate this problem is to use a unique phone model for every possible pair of left and right neighbours. The resulting
models are called triphones and, if there are N base phones, there are N³ potential triphones. To avoid the resulting data sparsity problems, the complete set of logical triphones L can be mapped to a reduced set of physical models P by clustering and tying together the parameters in each cluster. This mapping process is illustrated in Figure 2.3 and the parameter tying is illustrated in Figure 2.4, where the notation x−q+y denotes the triphone corresponding to phone q spoken in the context of a preceding phone x and a following phone y. Each base
Fig. 2.3 Context dependent phone modelling.
Fig. 2.4 Formation of tiedstate phone models.
phone pronunciation q is derived by simple lookup from the pronunciation dictionary, these are then mapped to logical phones according to the context, and finally the logical phones are mapped to physical models. Notice that the context-dependence spreads across word boundaries and this is essential for capturing many important phonological processes. For example, the /p/ in "stop that" has its burst suppressed by the following consonant.

The clustering of logical to physical models typically operates at the state-level rather than the model level, since it is simpler and it allows a larger set of physical models to be robustly estimated. The choice of which states to tie is commonly made using decision trees [190]. Each state position⁶ of each phone q has a binary tree associated with it, and each node of the tree carries a question regarding the context. To cluster state i of phone q, all states i of all of the logical models derived from q are collected into a single pool at the root node of the tree. Depending on the answer to the question at each node, the pool of states is successively split until all states have trickled down to leaf nodes. All states in each leaf node are then tied to form a physical model. The questions at each node are selected from a predetermined set to maximise the likelihood of the training data given the final set of state-tyings. If the state output distributions are single component Gaussians and the state occupation counts are known, then the increase in likelihood achieved by splitting the Gaussians at any node can be calculated simply from the counts and model parameters, without reference to the training data. Thus, the decision trees can be grown very efficiently using a greedy iterative node splitting algorithm. Figure 2.5 illustrates this tree-based clustering. In the figure, the logical phones s-aw+n and t-aw+n will both be assigned to leaf node 3 and hence they will share the same central state of the representative physical model.

Fig. 2.5 Decision tree clustering.

The partitioning of states using phonetically driven decision trees has several advantages. In particular, logical models which are required but were not seen at all in the training data can be easily synthesised. One disadvantage is that the partitioning can be rather coarse. This problem can be reduced using so-called soft-tying [109]. In this scheme, a post-processing stage groups each state with its one or two nearest neighbours and pools all of their Gaussians; the single Gaussian models are thus converted to mixture Gaussian models whilst holding the total number of Gaussians in the system constant.

To summarise, the core acoustic models of a modern speech recogniser typically consist of a set of tied three-state HMMs with Gaussian output distributions.⁷ This core is commonly built in the following steps [189, Ch. 3]:

(1) A flat-start monophone set is created in which each base phone is a monophone single-Gaussian HMM with means and covariances equal to the mean and covariance of the training data.

⁶ Usually each phone model has three states.
⁷ The total number of tied-states in a large vocabulary speaker independent system typically ranges between 1000 and 10,000 states.
(2. w1 ). . . wK required in (2. (3) Each single Gaussian monophone q is cloned once for each distinct triphone x − q + y that appears in the training data. which is deﬁned as H = − lim ≈− 1 K 1 log2 (P (w1 . .16) For large vocabulary recognition. .16) is usually truncated to N − 1 words to form an N gram language model K P (w) = k=1 P (wk wk−1 . .17) where N is typically in the range 2–4.2.2) is given by K P (w) = k=1 P (wk wk−1 . . . . wk−N +1 )). . . . . the conditioning word history in (2. H.3 N gram Language Models The prior probability of a word sequence w = w1 . . . 2. wk−N +1 ). (5) A decision tree is created for each state in each base phone.3 N gram Language Models 15 (2) The parameters of the Gaussian monophones are reestimated using 3 or 4 iterations of EM. The ﬁnal result is the required tiedstate contextdependent acoustic model set. k=1 where the approximation is used for N gram language models with a ﬁnite length word sequence. . Language models are often assessed in terms of their perplexity. . . . the trainingdata triphones are mapped into a smaller set of tiedstate triphones and iteratively reestimated using EM. (2. wk−2 . wk−2 . wK )) K→∞ K K log2 (P (wk wk−1 . . . . (4) The resulting set of trainingdata triphones is again reestimated using EM and the state occupation counts of the last iteration are saved.
The N-gram probabilities are estimated from training texts by counting N-gram occurrences to form maximum likelihood (ML) parameter estimates. For example, let C(w_{k−2} w_{k−1} w_k) represent the number of occurrences of the three words w_{k−2} w_{k−1} w_k, and similarly for C(w_{k−2} w_{k−1}). Then

  P(w_k | w_{k−1}, w_{k−2}) ≈ C(w_{k−2} w_{k−1} w_k) / C(w_{k−2} w_{k−1}).   (2.18)

The major problem with this simple ML estimation scheme is data sparsity. This can be mitigated by a combination of discounting and backing-off, for example, using so-called Katz smoothing [85]

  P(w_k | w_{k−1}, w_{k−2}) =
    d C(w_{k−2} w_{k−1} w_k) / C(w_{k−2} w_{k−1})   if 0 < C ≤ C′,
    C(w_{k−2} w_{k−1} w_k) / C(w_{k−2} w_{k−1})     if C > C′,
    α(w_{k−1}, w_{k−2}) P(w_k | w_{k−1})            otherwise,   (2.19)

where C′ is a count threshold, C is shorthand for C(w_{k−2} w_{k−1} w_k), d is a discount coefficient and α is a normalisation constant. Thus, when the N-gram count exceeds some threshold, the ML estimate is used. When the count is small, the same ML estimate is used but discounted slightly. The discounted probability mass is then distributed to the unseen N-grams, which are approximated by a weighted version of the corresponding bigram. This idea can be applied recursively to estimate any sparse N-gram in terms of a set of back-off weights and (N−1)-grams. The discounting coefficient is based on the Turing–Good estimate d = (r+1) n_{r+1} / (r n_r), where n_r is the number of N-grams that occur exactly r times in the training data. There are many variations on this approach [25]. For example, Kneser–Ney smoothing is particularly effective [127].

An alternative approach to robust language model estimation is to use class-based models in which for every word w_k there is a corresponding class c_k [21, 90]. Then,

  P(w) = ∏_{k=1}^{K} P(w_k | c_k) p(c_k | c_{k−1}, . . . , c_{k−N+1}).   (2.20)

As for word based models, the class N-gram probabilities are estimated using ML, but since there are far fewer classes (typically a few hundred), data sparsity is much less of an issue.
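The three cases of the Katz back-off estimate in (2.19) can be sketched as follows for trigrams. This is a toy illustration with hypothetical count tables: the back-off weights `alpha` and the count-of-count statistics `n` are assumed to be given, whereas a real implementation must also compute the α values so that each conditional distribution sums to one:

```python
def turing_good_discount(r, n):
    # d = (r+1) * n_{r+1} / (r * n_r), where n[r] is the number of
    # N-grams that occur exactly r times in the training data
    return (r + 1) * n.get(r + 1, 0) / (r * n[r])

def katz_trigram(w1, w2, w3, tri, bi, bigram_prob, alpha, n, threshold=5):
    """Sketch of eq. (2.19): ML estimate above the count threshold,
    discounted ML for small counts, weighted bigram back-off if unseen."""
    c3 = tri.get((w1, w2, w3), 0)
    c2 = bi.get((w1, w2), 0)
    if c3 > threshold:                       # C > C': plain ML estimate
        return c3 / c2
    if c3 > 0:                               # 0 < C <= C': discounted ML
        return turing_good_discount(c3, n) * c3 / c2
    # unseen trigram: back off to the bigram, weighted by alpha
    return alpha.get((w1, w2), 1.0) * bigram_prob(w2, w3)
```

Applied recursively (trigram to bigram to unigram), this is exactly the scheme by which the discounted probability mass of seen N-grams is redistributed to unseen ones.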
The classes themselves are chosen to optimise the likelihood of the training set assuming a bigram class model. It can be shown that when a word is moved from one class to another, the change in perplexity depends only on the counts of a relatively small number of bigrams. Hence, an iterative algorithm can be implemented which repeatedly scans through the vocabulary, testing each word to see if moving it to some other class would increase the likelihood [115]. In practice it is found that, for reasonably sized training sets,⁸ an effective language model for large vocabulary applications consists of a smoothed word-based 3 or 4-gram interpolated with a class-based trigram.

⁸ i.e., of the order of 10⁷ words.

2.4 Decoding and Lattice Generation

As noted in the introduction to this section, the most likely word sequence ŵ given a sequence of feature vectors Y_{1:T} is found by searching all possible state sequences arising from all possible word sequences for the sequence which was most likely to have generated the observed data Y_{1:T}. An efficient way to solve this problem is to use dynamic programming. Let φ_t^{(j)} = max_θ {p(Y_{1:t}, θ_t = s_j; λ)}, i.e., the maximum probability of observing the partial sequence Y_{1:t} and then being in state s_j at time t given the model parameters λ. This probability can be efficiently computed using the Viterbi algorithm [177]

  φ_t^{(j)} = max_i { φ_{t−1}^{(i)} a_ij } b_j(y_t).   (2.21)

It is initialised by setting φ_0^{(j)} to 1 for the initial, non-emitting, entry state and 0 for all other states. The probability of the most likely word sequence is then given by max_j {φ_T^{(j)}}, and if every maximisation decision is recorded, a traceback will yield the required best matching state/word sequence.

In practice, however, a direct implementation of the Viterbi algorithm becomes unmanageably complex for continuous speech, where the topology of the models, the language model constraints and the need to bound the computation must all be taken into account. N-gram language models and cross-word triphone contexts are particularly problematic since they greatly expand the search space. To deal with this, a number of different architectural approaches have evolved. For Viterbi decoding, the search space can either be constrained by maintaining multiple hypotheses in parallel [173, 187, 192], or it can be expanded dynamically as the search progresses [7, 69, 130, 135]. Alternatively, a completely different approach can be taken where the breadth-first approach of the Viterbi algorithm is replaced by a depth-first search. This gives rise to a class of recognisers called stack decoders. These can be very efficient; however, because they must compare hypotheses of different lengths, their run-time search characteristics can be difficult to control [76, 132]. Finally, recent advances in weighted finite-state transducer technology enable all of the required information (acoustic models, pronunciation, language model probabilities, etc.) to be integrated into a single very large but highly optimised network [122]. This approach offers both flexibility and efficiency and is therefore extremely useful for both research and practical applications.

Although decoders are designed primarily to find the solution to (2.21), in practice it is relatively simple to generate not just the most likely hypothesis but the N-best set of hypotheses; N is usually in the range 100–1000. This is extremely useful since it allows multiple passes over the data without the computational expense of repeatedly solving (2.21) from scratch. A compact and efficient structure for storing these hypotheses is the word lattice [144, 167, 191]. A word lattice consists of a set of nodes representing points in time and a set of spanning arcs representing word hypotheses. An example is shown in Figure 2.6 part (a). In addition to the word IDs shown in the figure, each arc can also carry score information such as the acoustic and language model scores. Lattices are extremely flexible. For example, they can be rescored by using them as an input recognition network, and they can be expanded to allow rescoring by a higher order language model. They can also be compacted into a very efficient representation called a confusion network [42, 114]. This is illustrated in Figure 2.6 part (b), where the "−" arc labels indicate null transitions.

Fig. 2.6 Example lattice and confusion network.

In a confusion network, the nodes no longer correspond to discrete points in time; instead they simply enforce word sequence constraints. Thus, parallel arcs in the confusion network do not necessarily correspond to the same acoustic segment. However, it is assumed that most of the time the overlap is sufficient to enable parallel arcs to be regarded as competing hypotheses. A confusion network has the property that for every path through the original lattice, there exists a corresponding path through the confusion network. Each arc in the confusion network carries the posterior probability of the corresponding word w. This is computed by finding the link probability of w in the lattice using a forward–backward procedure, summing over all occurrences of w and then normalising so that all competing word arcs in the confusion network sum to one. Confusion networks can be used for minimum word-error decoding [165] (an example of minimum Bayes' risk (MBR) decoding [22]), to provide confidence scores and for merging the outputs of different decoders [41, 43, 63, 72] (see Multi-Pass Recognition Architectures).
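The Viterbi recursion (2.21), with every maximisation decision recorded and a final traceback, can be sketched as follows in the log domain. This is a toy illustration over a fixed state network; real decoders additionally apply beam pruning and the network constraints discussed above:

```python
import math

def viterbi(log_b, log_a, start_state=0):
    """phi_t(j) = max_i{phi_{t-1}(i) a_ij} b_j(y_t), in log arithmetic.
    log_b[t][j] is the log output probability of state j for frame t;
    log_a[i][j] is the log transition probability from i to j."""
    T, N = len(log_b), len(log_b[0])
    phi = [[-math.inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]       # recorded decisions
    phi[0][start_state] = log_b[0][start_state]
    for t in range(1, T):
        for j in range(N):
            best_i = max(range(N), key=lambda i: phi[t - 1][i] + log_a[i][j])
            phi[t][j] = phi[t - 1][best_i] + log_a[best_i][j] + log_b[t][j]
            back[t][j] = best_i
    # traceback from the most likely final state
    state = max(range(N), key=lambda j: phi[T - 1][j])
    path = [state]
    for t in range(T - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path)), max(phi[T - 1])
```

The recorded `back` pointers are exactly the "maximisation decisions" referred to in the text; following them from the best final state recovers the best matching state sequence.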
3
HMM Structure Reﬁnements
In the previous section, basic acoustic HMMs and their use in ASR systems have been explained. Although these simple HMMs may be adequate for small vocabulary and similar limited complexity tasks, they do not perform well when used for more complex, and larger vocabulary tasks such as broadcast news transcription and dictation. This section describes some of the extensions that have been used to improve the performance of ASR systems and allow them to be applied to these more interesting and challenging domains. The use of dynamic Bayesian networks to describe possible extensions is ﬁrst introduced. Some of these extensions are then discussed in detail. In particular, the use of Gaussian mixture models, eﬃcient covariance models and feature projection schemes are presented. Finally, the use of HMMs for generating, rather than recognising, speech is brieﬂy discussed.
3.1
Dynamic Bayesian Networks
In Architecture of an HMMBased Recogniser, the HMM was described as a generative model which for a typical phone has three emitting
21
22
HMM Structure Reﬁnements
Fig. 3.1 Typical phone HMM topology (left) and dynamic Bayesian network (right).
states, as shown again in Figure 3.1. Also shown in Figure 3.1 is an alternative, complementary, graphical representation called a dynamic Bayesian network (DBN) which emphasises the conditional dependencies of the model [17, 200] and which is particularly useful for describing a variety of extensions to the basic HMM structure. In the DBN notation used here, squares denote discrete variables; circles continuous variables; shading indicates an observed variable; and no shading an unobserved variable. The lack of an arc between variables shows conditional independence. Thus Figure 3.1 shows that the observations generated by an HMM are conditionally independent given the unobserved, hidden, state that generated it. One of the desirable attributes of using DBNs is that it is simple to show extensions in terms of how they modify the conditional independence assumptions of the model. There are two approaches, which may be combined, to extend the HMM structure in Figure 3.1: adding additional unobserved variables; and adding additional dependency arcs between variables. An example of adding additional arcs, in this case between observations is shown in Figure 3.2. Here the observation distribution is
Fig. 3.2 Dynamic Bayesian networks for buried Markov models and HMMs with vector linear predictors.
3.2 Gaussian Mixture Models
23
Fig. 3.3 Dynamic Bayesian networks for Gaussian mixture models (left) and factoranalysed models (right).
dependent on the previous two observations in addition to the state that generated it. This DBN describes HMMs with explicit temporal correlation modelling [181], vector predictors [184], and buried Markov models [16]. Although an interesting direction for reﬁning an HMM, this approach has not yet been adopted in mainstream stateoftheart systems. Instead of adding arcs, additional unobserved, latent or hidden, variables may be added. To compute probabilities, these hidden variables must then be marginalised out in the same fashion as the unobserved state sequence for HMMs. For continuous variables this requires a continuous integral over all values of the hidden variable and for discrete variables a summation over all values. Two possible forms of latent variable are shown in Figure 3.3. In both cases the temporal dependencies of the model, where the discrete states are conditionally independent of all other states given the previous state, are unaltered allowing the use of standard Viterbi decoding routines. In the ﬁrst case, the observations are dependent on a discrete latent variable. This is the DBN for an HMM with Gaussian mixture model (GMM) stateoutput distributions. In the second case, the observations are dependent on a continuous latent variable. This is the DBN for an HMM with factoranalysed covariance matrices. Both of these are described in more detail below.
3.2
Gaussian Mixture Models
One of the most commonly used extensions to standard HMMs is to model the stateoutput distribution as a mixture model. In Architecture of an HMMBased Recogniser, a single Gaussian distribution was used
to model the state-output distribution. This model therefore assumed that the observed feature vectors are symmetric and unimodal. In practice this is seldom the case. For example, speaker, accent and gender differences tend to create multiple modes in the data. To address this problem, the single Gaussian state-output distribution may be replaced by a mixture of Gaussians, which is a highly flexible distribution able to model, for example, asymmetric and multimodal distributed data.

Each of the M components of the mixture model is a Gaussian probability density function (PDF). The likelihood for state s_j is now obtained by summing over all the component likelihoods weighted by their priors

  b_j(y) = Σ_{m=1}^{M} c_jm N(y; μ^{(jm)}, Σ^{(jm)}),   (3.1)

where c_jm is the prior probability for component m of state s_j. These priors satisfy the standard constraints for a valid probability mass function (PMF)

  Σ_{m=1}^{M} c_jm = 1,  c_jm ≥ 0.   (3.2)

This is an example of the general technique of mixture modelling. As described in the previous section, mixture models may be viewed as adding an additional discrete latent variable to the system. Since an additional latent variable has been added to the acoustic model, the form of the EM algorithm previously described needs to be modified [83]. Rather than considering the complete dataset in terms of state-observation pairings, state/component-observations are used. The estimation of the model parameters then follows the same form as for the single component case. For the mean of Gaussian component m of state s_j,

  μ̂^{(jm)} = [ Σ_{r=1}^{R} Σ_{t=1}^{T^{(r)}} γ_t^{(rjm)} y_t^{(r)} ] / [ Σ_{r=1}^{R} Σ_{t=1}^{T^{(r)}} γ_t^{(rjm)} ],   (3.3)

where γ_t^{(rjm)} = P(θ_t = s_jm | Y^{(r)}; λ) is the probability that component m of state s_j generated the observation at time t of sequence Y^{(r)}.
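The mixture likelihood (3.1) is usually evaluated in the log domain, so the sum over components becomes a series of log-additions. A minimal sketch for scalar observations (names illustrative):

```python
import math

def log_gauss(y, mu, var):
    # log of a scalar Gaussian density N(y; mu, var)
    return -0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)

def gmm_log_likelihood(y, priors, means, varis):
    """log b_j(y) for the mixture in (3.1): the log of the sum of
    prior-weighted component likelihoods, via log-sum-exp."""
    comps = [math.log(c) + log_gauss(y, m, v)
             for c, m, v in zip(priors, means, varis)]
    top = max(comps)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(x - top) for x in comps))
```

Approximating this by `max(comps)` alone (the best component, weighted by its prior) gives the cheaper decoding approximation mentioned later in the text.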
R is the number of training sequences. The component priors are estimated in a similar fashion to the transition probabilities.

When using GMMs to model the state-output distribution, it is necessary to determine the number of components per state in the system. A number of approaches, including discriminative schemes, have been adopted for this [27, 106]. The simplest is to use the same number of components for all states in the system and use held-out data to determine the optimal number. Another approach is to make the number of components assigned to a state a function of the number of observations assigned to that state [56]. A popular alternative approach is based on the Bayesian information criterion (BIC) [27].

When using GMMs, variance flooring is often applied.¹ This prevents the variances of the system becoming too small, for example when a component models a very small number of tightly packed observations, and so improves generalisation. Using GMMs also increases the computational overhead since, when using log arithmetic, a series of log-additions are required to compute the GMM likelihood. To improve efficiency, only components which provide a "reasonable" contribution to the total likelihood are included. To further decrease the computational load, the GMM likelihood can be approximated simply by the maximum over all the components (weighted by the priors).

¹ See HTK for a specific implementation of variance flooring [189].

3.3 Feature Projections

In Architecture of an HMM-Based Recogniser, dynamic first and second differential parameters, the so-called delta and delta–delta parameters, were added to the static feature parameters to overcome the limitations of the conditional independence assumption associated with HMMs. In addition, the DCT was assumed to approximately decorrelate the feature vector to improve the diagonal covariance approximation and to reduce the dimensionality of the feature vector. It is also possible to use data-driven approaches to decorrelate and reduce the dimensionality of the features. The standard approach is to
state or component. when required. i. The features to be projected are normally based on the standard MFCC. featurevectors. (3. may be used. in unsupervised approaches only general attributes of the observations. the dimensionality p of the projected data must be determined. for example phone. delta and delta–delta parameters with thirdorder dynamic parameters (the diﬀerence of delta–deltas) [116]. d is the dimension of the source ˜ feature vector y t . were compared for supervised projection . In contrast. One approach is to splice neighbouring static vectors together to form a composite vector of typically 9 frames [10] ˜ y t = y sT · · · y sT · · · y sT t t−4 t+4 T . An important issue when estimating a projection is whether the class labels of the observations are to be used. phone. state or Gaussian component labels. Secondly.4) where A[p] is a p × d lineartransform. In [68] a comparison of a number of possible class deﬁnitions. (3.5) Another approach is to expand the static. Firstly. The detailed form of a feature vector transformation such as this ˜ depends on a number of choices. Both have been used for large vocabulary speech recognition systems. supervised schemes are usually computationally more expensive than unsupervised schemes. and p is the size of the transformed feature vector y t . whether the projection should be estimated in a supervised or unsupervised fashion.26 HMM Structure Reﬁnements use linear transformations. the criterion used to estimate A[p] must be speciﬁed. However. such as the variances. Lastly. Statistics. whereas unsupervised schemes eﬀectively only ever have one class. the treatment of the delta and delta–delta parameters can vary. the construction of y t must be decided and. that is ˜ y t = A[p] y t . However. such as the means and covariance matrices are needed for each of the class labels. the class labels that are to be used. 
The dimensionality of the projected feature vector is often determined empirically since there is only a single parameter to tune. In general supervised approaches yield better projections since it is then possible to estimate a transform to improve the discrimination between classes.e. or PLP.. or not.
and both were better than higher level labels such as phones or words. In LDA the objective is to increase the ratio of the between class variance to the average within class variance for each dimension. supervised approaches such as the linear discriminant analysis (LDA) criterion [46] can be used. The choice of class is important as the transform tries to ensure that each of these classes is as separable as possible from all others. The simplest form of criterion for estimating the transform is principal component analysis (PCA).3. ˜ y t . This criterion may be expressed as Flda (λ) = log ˜ A[p] Σb AT  [p] ˜ A[p] Σw AT  [p] . The study showed that performance using state and component level classes was similar. This is an unsupervised projection scheme. The LDA criterion may be further reﬁned by using the actual class covariance matrices rather than using the averages. The estimation of the PCA transform can be found by ﬁnding the p rows of the orthonormal matrix A that maximises ˜ Fpca (λ) = log A[p] Σg AT  .3 Feature Projections 27 schemes. the vast majority of systems use component level classes since. so no use is made of class labels. Simply selecting subspaces that yield large variances does not necessarily yield subspaces that discriminate between the classes. this improves the assumption of diagonal covariance matrices.7) ˜ ˜ where Σb is the betweenclass covariance matrix and Σw the average withinclass covariance matrix where each distinct Gaussian component is usually assumed to be a separate class. as discussed later.6) ˜ where Σg is the total. covariance matrix of the original data. In practice. or global. [p] (3. (3. One such form is heteroscedastic discriminant analysis (HDA) [149] where the following . This criterion yields an orthonormal transform such that the average withinclass covariance matrix is diagonalised. This selects the orthogonal projections of the data that maximises the total variance in the projected subspace. 
To address this problem. which should improve the diagonal covariance matrix assumption.
In contrast. A[d−p] (3.28 HMM Structure Reﬁnements criterion is maximised Fhda (λ) = m γ (m) log ˜ A[p] Σb AT  [p] ˜ (m) A[p] Σ AT  [p] . Firstly. for HDA a separate decorrelating transform must . An alternative extension to LDA is heteroscedastic LDA (HLDA) [92]. and nor does this criterion help with decorrelating the data associated with each Gaussian component. HLDA yields the best projection whilst simultaneously generating the best transform for improving the diagonal covariance matrix approximation.8) where γ (m) is the total posterior occupation probability for compo˜ (m) is its covariance matrix in the original space given nent m and Σ ˜ by y t . as if they were model parameters as discussed in Architecture for an HMMBased Recogniser. The HLDA criterion can then be expressed as A2 . It is thus often used in conjunction with a global semitied transform [51] (also known as a maximum likelihood linear transform (MLLT) [65]) described in the next section. rather than for just the dimensions to be retained. In this transformation. The parameters of the HLDA transform can be found in an ML fashion. the matrix A is not constrained to be orthonormal.9) where A= A[p] . This modiﬁes the LDA criterion in (3. Fhlda (λ) = γ (m) log ˜ g AT  diag A[p] Σ(m)AT ˜ diag A[d−p] Σ m [d−p] [p] (3.10) There are two important diﬀerences between HLDA and HDA. An important extension of this transform is to ensure that the distributions for all dimensions to be removed are constrained to be the same.7) in a similar fashion to HDA. This is achieved by tying the parameters associated with these dimensions to ensure that they are identical. but now a transform for the complete featurespace is estimated. (3. These dimensions will thus yield no discriminatory information and so need not be retained for recognition.
whereas HDA only models the useful (nonprojected) dimensions. One compromise that is sometimes used is to follow an LDA projection by a decorrelating global semitied transform [163] described in the next section. it can be done [163]. Furthermore. 3. Here the covariance matrix for each Gaussian component m is represented by 2 Although given enough data. This means that multiple subspace projections can be used with HLDA.2 Even with small systems. These allow covariance modelling to be improved with very little overhead in terms of memory and computational load. the computational cost of using full covariance matrices is O(d2 ) compared to O(d) for the diagonal case where d is the dimensionality of the feature vector.3. is to decorrelate the feature vector so that the diagonal covariance matrix approximation becomes reasonable.3. Though schemes like HLDA outperform approaches like LDA they are more computationally expensive and require more memory. Fullcovariance matrix statistics for each component are required to estimate an HLDA transform. HLDA yields a model for the complete featurespace. and some of the projection schemes discussed in the previous section. This makes HLDA projections from large dimensional features spaces with large numbers of components impractical.4 Covariance Modelling One of the motivations for using the DCT transform.4.1 Structured Covariance Matrices A standard form of structured covariance matrix arises from factor analysis and the DBN for this was shown in Figure 3. Furthermore. To address both of these problems. This is important as in general the use of full covariance Gaussians in large vocabulary systems would be impractical due to the sheer size of the model set.4 Covariance Modelling 29 be added. but not with HDA [53]. 3. whereas only the average within and between class covariance matrices are required for LDA. training data limitations often preclude the use of full covariances. . and some smoothing. 
structured covariance and precision (inverse covariance) matrix representations have been developed.
12) where ν (m) is a Gaussian component speciﬁc weight vector that speciﬁes the contribution from each of the B global positive semideﬁnite matrices. One of the desirable properties of this form of model is that it is computationally eﬃcient during decoding. Σ(m) ) 1 1 = − log((2π)d Σ(m) ) − 2 2 3 Depending B νi i=1 (m) (y − µ(m) )T Si (y − µ(m) ) on p.3 3.2 Structured Precision Matrices A more computationally eﬃcient approach to covariance structuring is to model the inverse covariance matrix and a general form for this is [131.. factoranalysed HMMs still incur the decoding cost associated with full covariances.30 HMM Structure Reﬁnements (the dependency on the state has been dropped for clarity) Σ(m) = A[p] (m) (m)T A[p] + Σdiag . (3. also known as the ShermanMorrissonWoodbury formula. the computation of likelihoods depends on the inverse covariance matrix. i. Factoranalysed HMMs [146] generalise this to support both tying over multiple components of the loading matrices and using GMMs to model the latent variablespace of z in Figure 3. µ(m) . Thus. The above expression is based on factor analysis which allows each of the Gaussian components to be estimated separately using EM [151].4. 158] B Σ (m)−1 = i=1 νi (m) Si .e. Although structured covariance matrices of this form reduce the number of parameters needed to represent each covariance matrix.11) (m) where A[p] is the componentspeciﬁc. Si . since log N (y. the precision matrix and this will still be a full matrix. can be used to make this more eﬃcient.3. . the matrix inversion lemmas. loading matrix and Σdiag is a component speciﬁc diagonal covariance matrix. (m) (m) (3. p × d.
Expanding the quadratic term gives

  log N(y; µ^{(m)}, Σ^{(m)}) = -(1/2) log((2π)^d |Σ^{(m)}|) - (1/2) Σ_{i=1}^{B} ν_i^{(m)} ( y^T S_i y - 2µ^{(m)T} S_i y + µ^{(m)T} S_i µ^{(m)} ).   (3.13)

Thus, rather than incurring the computational cost of a full covariance matrix, it is possible to cache terms such as S_i y and y^T S_i y, which do not depend on the Gaussian component.

One of the simplest and most effective structured covariance representations is the semi-tied covariance matrix (STC) [51]. This is a specific form of precision matrix model in which the number of bases is equal to the number of dimensions (B = d) and the bases are symmetric and have rank 1 (thus they can be expressed as S_i = a_{[i]}^T a_{[i]}, where a_{[i]} is the ith row of A). In this case the component likelihoods can be computed as

  N(y; µ^{(m)}, Σ^{(m)}) = |A| N(Ay; Aµ^{(m)}, Σ_diag^{(m)}),   (3.14)

where Σ_diag^{(m)-1} is a diagonal matrix formed from ν^{(m)} and

  Σ^{(m)-1} = Σ_{i=1}^{d} ν_i^{(m)} a_{[i]}^T a_{[i]} = A^T Σ_diag^{(m)-1} A.   (3.15)

The matrix A is sometimes referred to as the semi-tied transform. One of the reasons that STC systems are simple to use is that the weight for each dimension is simply the inverse variance for that dimension. The training procedure for STC systems is (after accumulating the standard EM full-covariance matrix statistics):

(1) initialise the transform A^{(0)} = I, set Σ_diag^{(m,0)} equal to the current model covariance matrices, and set k = 0;
(2) estimate A^{(k+1)} given Σ_diag^{(m,k)};
(3) set Σ_diag^{(m,k+1)} to the component variance for each dimension, using A^{(k+1)};
(4) go to (2) unless converged or the maximum number of iterations is reached.

During recognition, decoding is very efficient since Aµ^{(m)} is stored for each component, and the transformed features Ay_t are cached for each time instance. There is thus almost no increase in decoding time compared to standard diagonal covariance matrix systems.
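The decoding efficiency of eq (3.14) is easy to see in code. The following is a minimal sketch (NumPy; the function name and interface are illustrative, not from the original text): the shared transform A is applied once per observation, after which every component only needs a diagonal Gaussian evaluation plus the cached log |A|.

```python
import numpy as np

def stc_log_likelihood(y, mu, A, diag_var):
    """log N(y; mu, Sigma) via eq (3.14): log|A| + log N(Ay; A mu, Sigma_diag),
    where the precision is structured as Sigma^-1 = A^T Sigma_diag^-1 A."""
    d = len(y)
    z = A @ y - A @ mu  # in practice A @ mu is stored per component and A @ y cached per frame
    log_det_A = np.linalg.slogdet(A)[1]
    return (log_det_A
            - 0.5 * d * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.log(diag_var))
            - 0.5 * np.sum(z * z / diag_var))
```

With A = I this reduces to a standard diagonal-covariance Gaussian, which is one quick sanity check; more generally it agrees with evaluating the full-covariance Gaussian whose precision is A^T Σ_diag^{-1} A.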
STC systems can be made more powerful by using multiple semi-tied transforms [51]. In addition to STC, other types of structured covariance modelling include subspace constrained precision and means (SPAM) [8], mixtures of inverse covariances [175], and extended maximum likelihood linear transforms (EMLLT) [131].

3.5 HMM Duration Modelling

The probability d_j(t) of remaining in state s_j for t consecutive observations in a standard HMM is given by

  d_j(t) = a_jj^{t-1} (1 - a_jj),   (3.16)

where a_jj is the self-transition probability of state s_j. Thus, the standard HMM models the probability of state occupancy as decreasing exponentially with time, and clearly this is a poor model of duration. Given this weakness of the standard HMM, an obvious refinement is to introduce an explicit duration model such that the self-transition loops in Figure 2.2 are replaced by an explicit probability distribution d_j(t). Suitable choices for d_j(t) are the Poisson distribution [147] and the Gamma distribution [98]. Whilst the use of these explicit distributions can undoubtedly model state-segment durations more accurately, they add a very considerable overhead to the computational complexity. This is because the resulting models no longer have the Markov property, and this prevents efficient search over possible state alignments. For example, in these so-called hidden semi-Markov models (HSMMs), the forward probability given in (2.10) becomes

  α_t^{(rj)} = Σ_τ Σ_{i≠j} α_{t-τ}^{(ri)} a_ij d_j(τ) Π_{l=1}^{τ} b_j(y^{(r)}_{t-τ+l}).   (3.17)

The need to sum over all possible state durations increases the complexity by a factor of O(D^2), where D is the maximum allowed state duration; the backward probability and Viterbi decoding suffer the same increase in complexity.
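The contrast between the implicit geometric duration model of eq (3.16) and an explicit alternative can be made concrete. Below is a minimal sketch (Python; the shifted Poisson parameterisation, defined so that t >= 1, is an assumption of this sketch, not taken from the text):

```python
import math

def hmm_duration(t, a_jj):
    """d_j(t) of eq (3.16): t - 1 self-loops followed by an exit (geometric decay)."""
    return a_jj ** (t - 1) * (1.0 - a_jj)

def poisson_duration(t, lam):
    """Explicit Poisson duration model, one of the choices cited for d_j(t),
    shifted so the minimum duration is one frame."""
    return math.exp(-lam) * lam ** (t - 1) / math.factorial(t - 1)
```

The geometric form is always maximal at t = 1, whereas the Poisson form peaks near its mean, which is why explicit distributions fit observed segment durations more accurately.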
In practice, explicit duration modelling of this form offers some small improvements to performance in small vocabulary applications, and minimal improvements in large vocabulary applications. A typical experience in the latter is that as the accuracy of the phone-output distributions is increased, the improvements gained from duration modelling become negligible. For this reason, and because of its computational complexity, duration modelling is rarely used in current systems.

3.6 HMMs for Speech Generation

HMMs are generative models, and although HMM-based acoustic models were developed primarily for speech recognition, it is relevant to consider how well they can actually generate speech. This is not only of direct interest for synthesis applications, where the flexibility and compact representation of HMMs offer considerable benefits [168], but it can also provide further insight into their use in recognition [169]. The key issue of interest here is the extent to which the dynamic delta and delta-delta terms can be used to enforce accurate trajectories for the static parameters. The HMM-trajectory model4 aims to generate realistic feature trajectories by finding the sequence of d-dimensional static observations which maximises the likelihood of the complete feature vector (i.e., statics + deltas) with respect to the parameters of a standard HMM.

4 An implementation of this is available as an extension to HTK at hts.sp.nitech.ac.jp.
If an HMM is used to directly model the speech features, then given the conditional independence assumptions from Figure 3.1 and a particular state sequence, the trajectory will be piecewise stationary, where the time segment corresponding to each state simply adopts the mean-value of that state. This would clearly be a poor fit to real speech, where the spectral parameters vary much more smoothly. When delta parameters are also modelled, however, the trajectory of the static parameters will no longer be piecewise stationary, since the associated delta parameters also contribute to the likelihood and must therefore be consistent with the HMM parameters. To see how these trajectories can be computed, consider the case of using a feature vector comprising static parameters and "simple difference" delta parameters.
These may be expressed as

  y_t = [ y^s_t ; Δy^s_t ] = [ 0 I 0 ; -I 0 I ] [ y^s_{t-1} ; y^s_t ; y^s_{t+1} ],   (3.18)

where y^s_t is the static element of the feature vector at time t, Δy^s_t are the delta features, 0 is a d x d zero matrix, I is a d x d identity matrix, and d is the size of the static parameters.5 The observation at time t is thus a function of the static features at times t - 1, t and t + 1. The features modelled by the standard HMM are therefore a linear transform of the static features. It should accordingly be possible to derive the probability distribution of the static features if the parameters of the standard HMM are known. Now if a complete sequence of features is considered, the following relationship can be obtained between the 2Td features used by the standard HMM and the Td static features:

  Y_{1:T} = A Y^s.   (3.19)

To illustrate the form of the 2Td x Td matrix A, consider the observation vectors at times t - 1, t and t + 1 in the complete sequence:

  [ ... ; y^s_{t-1} ; Δy^s_{t-1} ; y^s_t ; Δy^s_t ; y^s_{t+1} ; Δy^s_{t+1} ; ... ] =

    [ ...  0  I  0  0  0 ... ]
    [ ... -I  0  I  0  0 ... ]
    [ ...  0  0  I  0  0 ... ]  [ ... y^s_{t-2} y^s_{t-1} y^s_t y^s_{t+1} y^s_{t+2} ... ]^T.   (3.20)
    [ ...  0 -I  0  I  0 ... ]
    [ ...  0  0  0  I  0 ... ]
    [ ...  0  0 -I  0  I ... ]

The process of finding the distribution of the static features is simplified by hypothesising an appropriate state/component-sequence θ for the observations to be generated. In practical synthesis systems, state duration models are often estimated as in the previous section.

5 Here a non-standard version of the delta parameters is considered to simplify notation.
In this case the simplest approach is to set the duration of each state equal to its average (note that only single-component state-output distributions can be used). Given θ, the likelihood of a static sequence can then be expressed in terms of the standard HMM features as

  p(A Y^s | θ, λ) = (1/Z) N(Y_{1:T}; µ_θ, Σ_θ) = (1/Z) N(Y^s; µ^s_θ, Σ^s_θ),   (3.21)

where Z is a normalisation term. The segment mean and covariance of the HMM features (along with the corresponding versions for the static parameters only, µ^s_θ and Σ^s_θ) are

  µ_θ = [ µ^{(θ_1)} ; ... ; µ^{(θ_T)} ],   Σ_θ = blockdiag( Σ^{(θ_1)}, ..., Σ^{(θ_T)} ).   (3.22)

Though the covariance matrix of the HMM features, Σ_θ, has the block-diagonal structure expected from the HMM, the covariance matrix of the static features, Σ^s_θ, does not have a block-diagonal structure, since A is not block-diagonal. Thus, using the delta parameters to obtain the distribution of the static parameters does not imply the same conditional independence assumptions as the standard HMM assumes when modelling the static features.

The following relationships exist between the standard model parameters and the static parameters:

  Σ^s_θ^{-1} = A^T Σ_θ^{-1} A,   (3.23)
  Σ^s_θ^{-1} µ^s_θ = A^T Σ_θ^{-1} µ_θ,   (3.24)
  µ^s_θ^T Σ^s_θ^{-1} µ^s_θ = µ_θ^T Σ_θ^{-1} µ_θ.   (3.25)

Given θ, as the distribution is Gaussian, the distribution of the associated static sequences Y^s will also be Gaussian. The maximum likelihood static feature trajectory, Ŷ^s, will therefore simply be the static parameter segment mean, µ^s_θ. To find µ^s_θ, the relationships in (3.23) to (3.25) can be rearranged to give

  Ŷ^s = µ^s_θ = ( A^T Σ_θ^{-1} A )^{-1} A^T Σ_θ^{-1} µ_θ.   (3.26)

These are all known, as µ_θ and Σ_θ are based on the standard HMM parameters and the matrix A is given in (3.20). Note that, given the highly structured nature of A, efficient implementations of the matrix inversion may be found.
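The trajectory solution above can be realised in a few lines. The following is a minimal sketch (NumPy, scalar statics so d = 1; the clamping of the simple-difference delta at the segment edges, and the function names, are assumptions of this sketch):

```python
import numpy as np

def build_A(T):
    """2T x T matrix mapping static features to interleaved [static; delta] rows,
    with the simple-difference delta of eq (3.18): delta_t = s_{t+1} - s_{t-1}
    (indices clamped at the segment edges -- an assumption of this sketch)."""
    A = np.zeros((2 * T, T))
    for t in range(T):
        A[2 * t, t] = 1.0                        # static row
        A[2 * t + 1, min(t + 1, T - 1)] += 1.0   # delta row: +s_{t+1}
        A[2 * t + 1, max(t - 1, 0)] -= 1.0       # delta row: -s_{t-1}
    return A

def ml_trajectory(mu, var, T):
    """ML static trajectory: Y = (A^T S^-1 A)^-1 A^T S^-1 mu,
    for diagonal HMM covariances (var holds the 2T per-row variances)."""
    A = build_A(T)
    Sinv = np.diag(1.0 / var)
    return np.linalg.solve(A.T @ Sinv @ A, A.T @ Sinv @ mu)
```

When the static and delta means are mutually consistent, the recovered trajectory reproduces them exactly; when they conflict, the deltas smooth the piecewise-constant state means.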
The basic HMM synthesis model described here has been refined in a couple of ways. First, rather than simply using the average state durations for the synthesis process, ML-based estimates for these can be derived [168]. Second, if the overall length of the utterance to synthesise is known, it is also possible to search for the most likely state/component-sequence. Here

  (Ŷ^s, θ̂) = arg max_{Y^s, θ} { (1/Z) p(A Y^s | θ, λ) P(θ | λ) }.   (3.27)

This again makes use of the equalities in (3.21). One of the major issues with both these improvements is that the Viterbi algorithm cannot be directly used with the model to obtain the state-sequence. A frame-delayed version of the Viterbi algorithm can be used [196] to find the state-sequence, though this is still more expensive than the standard Viterbi algorithm. In practice, N-best list rescoring is often implemented instead.

Just as regular HMMs can be trained discriminatively, as described later, HMM-synthesis models can also be trained discriminatively [186]. Though the standard HMM model parameters, λ, can be trained in the usual fashion, for example using ML, improved performance can be obtained by training them so that, when used to generate sequences of static features, they are a "good" model of the training data.

HMM trajectory models can also be used for speech recognition [169]. The recognition output for feature vector sequence Y_{1:T} is then based on

  ŵ = arg max_w max_θ { (1/Z) N(Y_{1:T}; µ_θ, Σ_θ) P(θ | w, λ) P(w) }.   (3.28)

As with synthesis, one of the major issues with this form of decoding is that the Viterbi algorithm cannot be used. Though an interesting research direction, this form of speech recognition system is not in widespread use.
4 Parameter Estimation

In Architecture of an HMM-Based Recogniser, the estimation of the HMM model parameters, λ, was briefly described based on maximising the likelihood that the models generate the training sequences. Thus, given training data Y^{(1)}, ..., Y^{(R)}, the maximum likelihood (ML) training criterion may be expressed as

  F_ml(λ) = (1/R) Σ_{r=1}^{R} log p(Y^{(r)} | w_ref^{(r)}, λ),   (4.1)

where Y^{(r)} is the rth training utterance with transcription w_ref^{(r)}. This optimisation is normally performed using EM [33]. However, for ML to be the "best" training criterion, the data and models would need to satisfy a number of requirements, in particular, training-data sufficiency and model-correctness [20]. Since in general these requirements are not satisfied when modelling speech data, alternative criteria have been developed. This section describes the use of discriminative criteria for training HMM model parameters. In these approaches, the goal is not to estimate the parameters that are most likely to generate the training
data. Instead, the objective is to modify the model parameters so that hypotheses generated by the recogniser on the training data more closely "match" the correct word-sequences, whilst generalising to unseen test data. One approach would be to explicitly minimise the classification error on the training data. Here, a strict 1/0 loss function is needed as the measure to optimise. However, the differential of such a loss function is discontinuous, preventing the use of gradient-descent-based schemes. Hence various approximations, described in the next section, have been explored. Compared to ML training, the implementation of these training criteria is more complex and they have a tendency to over-train on the training data, i.e., they do not generalise well to unseen data. These issues are also discussed.

One of the major costs incurred in building speech recognition systems lies in obtaining sufficient acoustic data with accurate transcriptions. Techniques which allow systems to be trained with rough, approximate transcriptions, or even no transcriptions at all, are therefore of growing interest, and work in this area is briefly summarised at the end of this section.

4.1 Discriminative Training

For speech recognition, three main forms of discriminative training have been examined, all of which can be expressed in terms of the posterior of the correct sentence. Using Bayes' rule this posterior may be expressed as

  P(w_ref^{(r)} | Y^{(r)}, λ) = p(Y^{(r)} | w_ref^{(r)}, λ) P(w_ref^{(r)}) / Σ_w p(Y^{(r)} | w, λ) P(w),   (4.2)

where the summation is over all possible word-sequences. By expressing the posterior in terms of the HMM generative model parameters, it is possible to estimate "generative" model parameters with discriminative criteria. The language model (or class prior), P(w), is not normally trained in conjunction with the acoustic model (though there has been some work in this area [145]), since typically the amount of text training
data for the language model is far greater (orders of magnitude) than the available acoustic training data. In addition to the criteria discussed here, there has been work on maximum-margin based estimation schemes [79, 93, 100, 156]. Also, these discriminative criteria can be used to train feature transforms [138, 197] and model parameter transforms [159] that are dependent on the observations, making them time-varying. Furthermore, interest is growing in using discriminative models for speech recognition, where P(w_ref^{(r)} | Y^{(r)}, λ) is directly modelled [54, 67, 95, 125].

4.1.1 Maximum Mutual Information

One of the first discriminative training criteria to be explored was the maximum mutual information (MMI) criterion [9]. Here, rather than using Bayes' rule as above to obtain the posterior of the correct word sequence, the aim is to maximise the mutual information, I(w, Y, λ), between the word-sequence, w, and the information extracted by a recogniser with parameters λ from the associated observation sequence, Y. As the joint distribution of the word-sequences and observations is unknown, it is approximated by the empirical distributions over the training data. This can be expressed as [20]

  I(w, Y, λ) ≈ (1/R) Σ_{r=1}^{R} log [ P(w_ref^{(r)}, Y^{(r)} | λ) / ( P(w_ref^{(r)}) p(Y^{(r)} | λ) ) ].   (4.3)

As only the acoustic model parameters are trained, the language model probabilities, P(w_ref^{(r)}), are fixed, and this is equivalent to finding the model parameters that maximise the average log-posterior probability of the correct word sequence.1 Note, however, that unlike the ML case, the language model is included in the denominator of the objective function in the discriminative training case. When this form of training criterion is used with discriminative models, it is also known as conditional maximum likelihood (CML) training.

1 Given that the class priors, the language model probabilities, are fixed, this should really be called conditional entropy training.
Thus the following criterion is maximised:

  F_mmi(λ) = (1/R) Σ_{r=1}^{R} log P(w_ref^{(r)} | Y^{(r)}, λ)   (4.4)
           = (1/R) Σ_{r=1}^{R} log [ p(Y^{(r)} | w_ref^{(r)}, λ) P(w_ref^{(r)}) / ( Σ_w p(Y^{(r)} | w, λ) P(w) ) ].   (4.5)

Intuitively, the numerator is the likelihood of the data given the correct word sequence, w_ref^{(r)}, whilst the denominator is the total likelihood of the data given all possible word sequences w. Thus, the objective function is maximised by making the correct model sequence likely and all other model sequences unlikely.

4.1.2 Minimum Classification Error

Minimum classification error (MCE) is a smooth measure of the error [82, 29]. This is normally based on a smooth function of the difference between the log-likelihood of the correct word sequence and that of all other competing sequences, and a sigmoid is often used for this purpose. The MCE criterion may be expressed in terms of the posteriors as

  F_mce(λ) = (1/R) Σ_{r=1}^{R} [ 1 + ( P(w_ref^{(r)} | Y^{(r)}, λ) / Σ_{w≠w_ref^{(r)}} P(w | Y^{(r)}, λ) )^ϱ ]^{-1}.   (4.6)

When ϱ = 1, this becomes

  F_mce(λ) = 1 - (1/R) Σ_{r=1}^{R} P(w_ref^{(r)} | Y^{(r)}, λ),   (4.7)

which is a specific example of the minimum Bayes risk criterion discussed next. There are some important differences between MCE and MMI. The first is that the denominator term does not include the correct word sequence. Secondly, the posteriors (or log-likelihoods) are smoothed with a sigmoid function, which introduces an additional smoothing term, ϱ.
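Per utterance, the MMI criterion of eqs (4.4) and (4.5) is just a log-posterior: the joint log-score of the reference minus the log-sum of the joint scores of all hypotheses. Below is a minimal sketch (Python; in real systems the denominator sum runs over a lattice rather than an explicit hypothesis list, and the names here are illustrative):

```python
import math

def log_sum_exp(xs):
    """Numerically stable log of a sum of exponentiated log-scores."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mmi_utterance(ref_score, all_scores):
    """One utterance's contribution to eq (4.4).  ref_score is the joint
    log-score log p(Y|w_ref) + log P(w_ref); all_scores holds the joint
    scores of every hypothesis, reference included (it appears in the
    denominator sum)."""
    return ref_score - log_sum_exp(all_scores)
```

The value is always <= 0, reaching 0 only when the reference takes all of the posterior mass, which matches the intuition that MMI raises the correct sequence and suppresses competitors.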
4.1.3 Minimum Bayes' Risk

In minimum Bayes' risk (MBR) training, rather than trying to model the correct distribution, the expected loss during recognition is minimised [22, 84]. Here

  F_mbr(λ) = (1/R) Σ_{r=1}^{R} Σ_w P(w | Y^{(r)}, λ) L(w, w_ref^{(r)}),   (4.8)

where L(w, w_ref^{(r)}) is the loss function of word sequence w against the reference for sequence r, w_ref^{(r)}. To approximate the expected loss during recognition, the expected loss estimated on the training data is used. There are a number of loss functions that have been examined:

• 1/0 loss: For continuous speech recognition this is equivalent to a sentence-level loss function,

  L(w, w_ref^{(r)}) = 1 if w ≠ w_ref^{(r)}, 0 if w = w_ref^{(r)}.

When ϱ = 1, MCE and MBR training with a sentence cost function are the same.

• Word: The loss function directly related to minimising the expected word error rate (WER). It is normally computed by minimising the Levenshtein edit distance.

• Phone: For large vocabulary speech recognition, not all word sequences will be observed. This can cause generalisation issues. To assist generalisation, the loss function is often computed between phone sequences, rather than word sequences. In the literature this is known as minimum phone error (MPE) training [136, 139].

• Phone frame error: When using the phone loss-function, the number of possible errors to be corrected is reduced compared to the number of frames. To address this, minimum phone frame error (MPFE) may be used, where the phone loss is weighted by the number of frames associated with each phone [198]. This is the same as the Hamming distance described in [166].
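The word-level loss above is the Levenshtein edit distance, and the MBR objective of eq (4.8) is a posterior-weighted sum of such losses. A minimal sketch follows (Python; a fixed N-best list with precomputed posteriors stands in for the lattice-based computation used in practice):

```python
def levenshtein(ref, hyp):
    """Edit distance between two word sequences -- the 'word' MBR loss."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1]

def expected_loss(posteriors, hyps, ref):
    """Inner sum of eq (4.8): sum over hypotheses of P(w|Y) * L(w, w_ref)."""
    return sum(p * levenshtein(ref, h) for p, h in zip(posteriors, hyps))
```

Swapping the word-level distance for a phone-level one turns the same computation into the MPE-style loss.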
It is also possible to base the loss function on the specific task for which the classifier is being built [22]. A comparison of the MMI, MCE and MPE criteria on the Wall Street Journal (WSJ) task, and a general framework for discriminative training, is given in [111]. Though all the criteria significantly outperformed ML training, MCE and MPE were found to outperform MMI on this task. In [198], MPFE is shown to give small but consistent gains over MPE.

4.2 Implementation Issues

In Architecture of an HMM-Based Recogniser, the use of the EM algorithm to train the parameters of an HMM using the ML criterion was described. This section briefly discusses some of the implementation issues that need to be addressed when using discriminative training criteria. Two aspects are considered. Firstly, the form of algorithm used to estimate the model parameters is described. Secondly, since these discriminative criteria can over-train, various techniques for improving generalisation are detailed.

4.2.1 Parameter Estimation

EM cannot be used to estimate parameters when discriminative criteria are used. To see why this is the case, note that the MMI criterion can be expressed as the difference of two log-likelihood expressions:

  F_mmi(λ) = (1/R) Σ_{r=1}^{R} [ log ( p(Y^{(r)} | w_ref^{(r)}, λ) P(w_ref^{(r)}) ) - log ( Σ_w p(Y^{(r)} | w, λ) P(w) ) ].   (4.9)

The first term is referred to as the numerator term, and the second, the denominator term. The numerator term is identical to the standard ML criterion (the language model term, P(w_ref^{(r)}), does not influence the ML-estimate). The denominator may also be rewritten in a similar fashion to the numerator by producing a composite HMM with parameters λ_den. Hence, although auxiliary functions may be computed for
each of these, the difference of two lower-bounds is not itself a lower-bound, and so standard EM cannot be used. To handle this problem, the extended Baum-Welch (EBW) criterion was proposed [64, 129]. In this case, standard EM-like auxiliary functions are defined for the numerator and denominator, but stability during re-estimation is achieved by adding scaled current model parameters to the numerator statistics. The weighting of these parameters is specified by a constant D; a large enough value of D can be shown to guarantee that the criterion does not decrease. For MMI, the update of the mean parameters then becomes

  µ̂^{(jm)} = [ Σ_{r=1}^{R} Σ_{t=1}^{T^{(r)}} ( γ_num,t^{(rjm)} - γ_den,t^{(rjm)} ) y_t^{(r)} + D µ^{(jm)} ] / [ Σ_{r=1}^{R} Σ_{t=1}^{T^{(r)}} ( γ_num,t^{(rjm)} - γ_den,t^{(rjm)} ) + D ],   (4.10)

where µ^{(jm)} are the current model parameters, γ_num,t^{(rjm)} is the same as the posterior used in ML estimation, and

  γ_den,t^{(rjm)} = P(θ_t = s_jm | Y^{(r)}, λ_den),   (4.11)

where λ_den is the composite HMM representing all competing paths. A similar expression can be derived from a weak-sense auxiliary function perspective [136]. Here an auxiliary function is defined which, rather than being a strict lower-bound, simply has the same gradient as the criterion at the current model parameters. Similar expressions can be derived for the MPE criterion.

One issue with EBW is the speed of convergence, which may be very slow. This may be made faster by making D Gaussian-component specific [136]. For large vocabulary speech recognition systems, where thousands of hours of training data may be used, it is important that the training procedure be as efficient as possible. For discriminative techniques, one of the major costs is the accumulation of the statistics for the denominator, since this computation is equivalent to recognising the training data. To improve efficiency, it is common to use lattices as a compact representation of the most likely competing word sequences and only accumulate statistics at each iteration for paths in the lattice [174]. Recently, large vocabulary training with MCE has also been implemented [119].
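The smoothing role of D in eq (4.10) is straightforward to sketch (Python; one Gaussian, one feature dimension, with the sufficient statistics already accumulated; the function name is illustrative):

```python
def ebw_mean_update(num_stats, den_stats, num_occ, den_occ, mu_old, D):
    """Extended Baum-Welch mean update of eq (4.10) for one component/dimension.

    num_stats, den_stats -- sum_t gamma_t * y_t from numerator/denominator passes
    num_occ, den_occ     -- sum_t gamma_t (state-component occupancies)
    D                    -- smoothing constant stabilising the re-estimation
    """
    return (num_stats - den_stats + D * mu_old) / (num_occ - den_occ + D)
```

With no denominator statistics and D = 0 this collapses to the ML mean; as D grows, the update is pulled back towards the current mean, which is the stability mechanism described above.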
4.2.2 Generalisation

Compared to ML training, it has been observed that discriminative criteria tend to generalise less well. To mitigate this, a number of techniques have been developed that improve the robustness of the estimates.

As a consequence of the conditional independence assumptions, the posterior probabilities computed using the HMM likelihoods tend to have a very large dynamic range, and typically one of the hypotheses dominates. To address this problem, the acoustic model likelihoods are often raised to a fractional power, referred to as acoustic deweighting [182]. Thus, when accumulating the statistics, the posteriors are based on

  P(w_ref^{(r)} | Y^{(r)}, λ) = p(Y^{(r)} | w_ref^{(r)}, λ)^α P(w_ref^{(r)})^β / Σ_w p(Y^{(r)} | w, λ)^α P(w)^β.   (4.12)

In practice, β is often set to one and α is set to the inverse of the language model scale-factor described in footnote 2 in Architecture of an HMM-Based Recogniser.

The form of the language model used in training should in theory match the form used for recognition. However, it has been found that using simpler models, unigrams or heavily pruned bigrams, for training, despite using trigrams or four-grams in decoding, improves performance [152]. By weakening the language model, the number of possible confusions is increased, allowing more complex models to be trained given a fixed quantity of training data.

To improve generalisation, "robust" parameter priors may be used when estimating the models. These priors may either be based on the ML parameter estimates [139] or, for example when using MPE training, the MMI estimates [150]. For MPE training, using these priors has been found to be essential for achieving performance gains [139].

4.3 Lightly Supervised and Unsupervised Training

One of the major costs in building an acoustic training corpus is the production of accurate orthographic transcriptions.
For some situations, such as broadcast news (BN), which has associated closed-captions (CC), approximate transcriptions may be derived. These transcriptions are known to be error-full [23] and thus not suitable for direct use when training detailed acoustic models. However, a number of lightly supervised training techniques have been developed to overcome this [23, 128]. For acoustic data which has no transcriptions at all, unsupervised training techniques must be used [86, 110]; this is harder, but has been successfully applied to a range of tasks [56, 94, 110].

The general approach to lightly supervised training is exemplified by the procedure commonly used for training on BN data using closed-captions. This procedure consists of the following stages:

(1) Construct a language model (LM) using only the CC data. This CC LM is then interpolated with a general BN LM using interpolation weights heavily weighted towards the CC LM.2 This yields a biased language model.
(2) Recognise the audio data using an existing acoustic model and the biased LM trained in (1).
(3) Optionally post-process the data. For example, only use segments from the training data where the recognition output from (2) is consistent to some degree with the CCs, or only use segments with high confidence in the recognition output.
(4) Use the selected segments for acoustic model training with the hypothesised transcriptions from (2).

Unsupervised training is similar to lightly supervised training, except that a general recognition language model is used rather than the biased language model of the lightly supervised case. However, the gains from unsupervised approaches can decrease dramatically when discriminative training such as MPE is used.

2 The interpolation weights were found to be relatively insensitive to the type of source data when tuned on held-out data. For example, in [56] the interpolation weights were 0.9 and 0.1.
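The biased LM of stage (1) is a simple linear interpolation. The following is a minimal sketch (Python; a word-level unigram is used purely for illustration, since real systems interpolate n-gram models, and the 0.9/0.1 default follows the example weights from [56]):

```python
def interpolate_lm(p_cc, p_gen, w_cc=0.9):
    """Linearly interpolate a closed-caption LM with a general BN LM,
    heavily weighted towards the CC LM.  p_cc and p_gen map words to
    probabilities; missing words get probability zero in that model."""
    vocab = set(p_cc) | set(p_gen)
    return {w: w_cc * p_cc.get(w, 0.0) + (1.0 - w_cc) * p_gen.get(w, 0.0)
            for w in vocab}
```

Because the interpolation is convex, the result remains a valid distribution whenever both inputs are, while staying strongly biased towards the caption text.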
If the mismatch between the supervised training data and the unsupervised data is large, the accuracy of the numerator transcriptions becomes very poor, and the gains obtained from using the unannotated data become small [179]. Thus, currently there appears to be a limit to how poor the transcriptions can be whilst still obtaining worthwhile gains from discriminative training. This problem can be partly overcome by using the recognition output to guide the selection of data to manually transcribe [195].
5 Adaptation and Normalisation

A fundamental idea in statistical pattern classification is that the training data should adequately represent the test data; otherwise a mismatch will occur and recognition accuracy will be degraded. In the case of speech recognition, there will always be new speakers who are poorly represented by the training data, and new, hitherto unseen, environments. The solution to these problems is adaptation. Adaptation allows a small amount of data from a target speaker to be used to transform an acoustic model set to make it more closely match that speaker. It can be used both in training, to make more specific or more compact recognition sets, and in recognition, to reduce mismatch and the consequent recognition errors.

There are varying styles of adaptation which affect both the possible applications and the method of implementation. Firstly, adaptation can be supervised, in which case accurate transcriptions are available for all of the adaptation data, or it can be unsupervised, in which case the required transcriptions must be hypothesised. Secondly, adaptation can be incremental or batch-mode. In the former case, adaptation data becomes available in stages, for example, when a new caller comes on the line, as is the case for a spoken dialogue system. In batch-mode,
all of the adaptation data is available from the start, as is the case in off-line transcription. This section gives an overview of some of the more common schemes. First, feature-based approaches, which only depend on the acoustic features, are described. Then linear transformation-based approaches are discussed. Then maximum a posteriori adaptation and cluster-based approaches are described. The final approach, adaptive training, is a framework within which many of these approaches can be used, and is the final topic.

5.1 Feature-Based Schemes

Feature-based schemes only depend on the acoustic features. A range of adaptation and normalisation approaches have been proposed, many of which do not fully fit within this definition; it is only strictly true of the first two approaches below, mean and variance normalisation, and Gaussianisation. For example, vocal tract length normalisation can be implemented in a range of ways.

5.1.1 Mean and Variance Normalisation

Cepstral mean normalisation (CMN) removes the average feature value of the feature-vector from each observation. Since cepstra, in common with most front-end feature sets, are derived from log spectra, this has the effect of reducing sensitivity to channel variation, provided that the channel effects do not significantly vary with time or the amplitude of the signal. For transcription applications where multiple passes over the data are possible, the necessary mean and variance statistics should be computed over the longest possible segment of speech for which the speaker and environment conditions are constant. For example, in BN transcription this will be a speaker segment, and in telephone transcription it will be a whole side of a conversation. Note that for real-time systems which operate in a single continuous pass over the data, the mean and variance statistics must be computed as running averages.
Cepstral variance normalisation (CVN) scales each individual feature coefficient to have unit variance; empirically, this has been found to reduce sensitivity to additive noise [71].
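Per-segment CMN and CVN amount to standardising each coefficient over the segment. A minimal sketch follows (pure Python; frames are lists of cepstral coefficients, and the sketch assumes no coefficient is constant over the segment):

```python
def cmn_cvn(features):
    """Per-segment cepstral mean and variance normalisation: subtract the
    segment mean of each coefficient and scale it to unit variance."""
    T = len(features)
    d = len(features[0])
    means = [sum(f[i] for f in features) / T for i in range(d)]
    vars_ = [sum((f[i] - means[i]) ** 2 for f in features) / T for i in range(d)]
    return [[(f[i] - means[i]) / vars_[i] ** 0.5 for i in range(d)]
            for f in features]
```

For incremental systems, the same means and variances would instead be maintained as running averages, as noted above.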
5.1.2 Gaussianisation

Given that normalising the first and second-order statistics yields improved performance, an obvious extension is to normalise the higher-order statistics for the data from, for example, a specific speaker, so that the overall distribution of all the features from that speaker is Gaussian. This so-called Gaussianisation is performed by finding a transform y = φ(ỹ) that yields

  y ~ N(0, I),   (5.1)

where 0 is the d-dimensional zero vector and I is a d-dimensional identity matrix. In general, performing Gaussianisation on the complete feature vector is highly complex. However, if each element of the feature vector is treated independently, there are a number of schemes that may be used to define φ(·). One standard approach is to estimate a cumulative distribution function (CDF), F(ỹ_i), for each feature element. This can be estimated using all the data points from, for example, a speaker in a histogram approach. Thus for dimension i of base observation ỹ,

  F(ỹ_i) ≈ (1/T) Σ_{t=1}^{T} h(ỹ_i - ỹ_ti) = rank(ỹ_i)/T,   (5.2)

where ỹ_1, ..., ỹ_T are the data, h(·) is the step function and rank(ỹ_i) is the rank of element i of ỹ when the data are sorted. The transformation required such that the transformed dimension i of observation y is approximately Gaussian is then

  y_i = Φ^{-1}( rank(ỹ_i)/T ),   (5.3)

where Φ(·) is the CDF of a Gaussian [148]. Note that, although each dimension will be approximately Gaussian distributed, the complete feature vector will not necessarily satisfy (5.1), since the elements may be correlated with one another. One difficulty with this approach is that when the normalisation data set is small, the CDF estimate, F(ỹ_i), can be noisy.
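The rank-based transform of eq (5.3) is only a few lines in practice. A minimal sketch follows (Python standard library; rank/(T+1) is used instead of rank/T so that the largest value does not map to Φ^{-1}(1) = infinity, which is an assumption of this sketch, not taken from the text):

```python
from statistics import NormalDist

def gaussianise(values):
    """Rank-based Gaussianisation of one feature dimension, after eq (5.3):
    y = Phi^-1(rank / (T + 1)), applied independently per dimension."""
    T = len(values)
    inv_cdf = NormalDist().inv_cdf          # Phi^-1 for the standard Gaussian
    order = sorted(range(T), key=lambda i: values[i])
    out = [0.0] * T
    for rank, idx in enumerate(order, start=1):
        out[idx] = inv_cdf(rank / (T + 1))
    return out
```

The transform is monotone, so it preserves the ordering of the data while forcing the marginal towards a standard Gaussian.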
1(b) [71].50 Adaptation and Normalisation use this to approximate the CDF [26].1(a) this results in a problem at high frequencies where female voices have no information in the upper frequency band and male voices have the upper frequency band truncated. that is yi = Φ (m) −1 yi ˜ M −∞ m=1 (m)2 cm N (z. This is repeated until the VTLN parameters have stabilised.1.4) where µi and σi are the mean and variance of the mth GMM component of dimension i trained on all the data. Once the optimal values for all training speakers have been computed. Both these forms of Gaussianisation have assumed that the dimensions are independent of one another. Parameter estimation is performed using a grid search plotting log likelihoods against parameter values. This can be mitigated by using a piecewise linear function of the form shown in Figure 5. For example. Thus. a bilinear transform can be used [120]. Note here that . To reduce the impact of this assumption Gaussianisation is often applied after some form of decorrelating transform. one simple form of normalisation is to linearly scale the ﬁlter bank centre frequencies within the frontend feature extractor to approximate a canonical formant frequency scaling [96]. Early attempts at VTLN used a simple linear mapping. This is called vocal tract length normalisation (VTLN). Alternatively. µi (m) . To implement VTLN two issues need to be addressed: deﬁnition of the scaling function and estimation of the appropriate scaling function parameters for each speaker. the training data is normalised and the acoustic models reestimated. 5. σi (m)2 )dz .3 Vocal Tract Length Normalisation Variations in vocal tract length cause formant frequencies to shift in frequency in an approximately linear fashion. a global semitied transform can be applied prior to the Gaussianisation process. but as shown in Figure 5. (5. This results in a smoother and more compact representation of the Gaussianisation transformation [55].
it is usually ignored. the overhead incurred from iteratively computing the optimal VTLN parameters can be considerable. For very large systems. This is however very complex to estimate and since the application of mean and variance normalisation will reduce the aﬀect of this approximation. 5. An alternative is to approximate the eﬀect of VTLN by a linear transform.5.2 Linear TransformBased Schemes For cases where adaptation data is limited. 5. These diﬀer from featurebased approaches in that they use the acoustic model parameters and they require a transcription of the adaptation data. The advantage of this approach is that the optimal transformation parameters can be determined from the auxiliary function in a single pass over the data [88]. VTLN is particularly eﬀective for telephone speech where speakers can be clearly identiﬁed. then the Jacobian of the transform should be included. when comparing log likelihoods resulting from diﬀering VTLN transformations. .1 Vocal tract length normalisation. It is less eﬀective for other applications such as broadcast news transcription where speaker changes must be inferred from the data.2 Linear TransformBased Schemes 51 Fig. linear transformation based schemes are currently the most eﬀective form of adaptation.
5.2.1 Maximum Likelihood Linear Regression

In maximum likelihood linear regression (MLLR), a set of linear transforms are used to map an existing model set into a new adapted model set such that the likelihood of the adaptation data is maximised. MLLR is generally very robust and well suited to unsupervised incremental adaptation. This section presents MLLR in the form of a single global linear transform for all Gaussian components. The multiple transform case, where different transforms are used depending on the Gaussian component to be adapted, is discussed later.

There are two main variants of MLLR: unconstrained and constrained [50, 97]. In unconstrained MLLR, separate transforms are trained for the means and variances

    µ̂^(sm) = A^(s) µ^(m) + b^(s),   Σ̂^(sm) = H^(s) Σ^(m) H^(s)T,   (5.5)

where s indicates the speaker. Although (5.5) suggests that the likelihood calculation is expensive to compute unless H^(s) is constrained to be diagonal, it can in fact be made efficient using the following equality

    N(y; µ̂^(sm), Σ̂^(sm)) = (1/|H^(s)|) N(H^(s)−1 y; H^(s)−1 A^(s) µ^(m) + H^(s)−1 b^(s), Σ^(m)).   (5.6)

If the original covariances are diagonal (as is common), then by appropriately caching the transformed observations and means, the likelihood can be calculated at the same cost as when using the original diagonal covariance matrices.

For MLLR there are no constraints between the adaptation applied to the means and the covariances. If the two matrix transforms are constrained to be the same, then a linear transform related to the feature-space transforms described earlier may be obtained. This is constrained MLLR (CMLLR)

    µ̂^(sm) = Ã^(s) µ^(m) + b̃^(s),   Σ̂^(sm) = Ã^(s) Σ^(m) Ã^(s)T.   (5.7)

In this case, the likelihood can be expressed as

    N(y; µ̂^(sm), Σ̂^(sm)) = |A^(s)| N(A^(s) y + b^(s); µ^(m), Σ^(m)),   (5.8)

where

    Ã^(s) = A^(s)−1,   b̃^(s) = −A^(s)−1 b^(s).   (5.9)

Thus, with this constraint, the actual model parameters are not transformed, making this form of transform efficient if the speaker (or acoustic environment) changes rapidly. CMLLR is the form of linear transform most often used for adaptive training, discussed later.

5.2.2 Parameter Estimation

The maximum likelihood estimation formulae for the various forms of linear transform are given in [50, 97]. Whereas there are closed-form solutions for unconstrained mean MLLR, the constrained and unconstrained variance cases are similar to the semi-tied covariance transform discussed in HMM Structure Refinements, and they require an iterative solution. For both forms of linear transform, the matrix transformation may be full, block-diagonal, or diagonal. For a given amount of adaptation data, more diagonal transforms may be reliably estimated than full ones. However, in practice, full transforms normally outperform larger numbers of diagonal transforms [126]. Hierarchies of transforms of different complexities may also be used [36].

Both forms of linear transform require transcriptions of the adaptation data in order to estimate the model parameters. For supervised adaptation, the transcription is known and may be directly used without further consideration. When used in unsupervised mode, the transcription must be derived from the recogniser output, and in this case all words within the hypothesis are treated as equally probable. MLLR is normally applied iteratively [183] to ensure that the best hypothesis for estimating the transform parameters is used. First, the unknown speech is recognised, then the hypothesised transcription is used to estimate MLLR transforms. The unknown speech is then re-recognised using the adapted models. This is repeated until convergence is achieved. A refinement is to use recognition lattices in place of the 1-best hypothesis to accumulate the adaptation statistics.
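To make the CMLLR relationship in (5.8) concrete, the sketch below numerically checks that adapting the model parameters with Ã^(s), b̃^(s) as in (5.7) and (5.9) gives exactly the same likelihood as transforming the observation with A^(s), b^(s) and scaling by the Jacobian |A^(s)|. The dimensions and parameter values are illustrative assumptions.

```python
import numpy as np

def gauss(y, mu, cov):
    """Multivariate Gaussian density N(y; mu, cov)."""
    d = len(mu)
    diff = y - mu
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm)

rng = np.random.RandomState(1)
d = 3
A = np.eye(d) + 0.1 * rng.randn(d, d)   # feature-space transform A^(s)
b = rng.randn(d)                        # feature-space bias b^(s)
A_tilde = np.linalg.inv(A)              # model-space transform, (5.9)
b_tilde = -A_tilde @ b

mu = rng.randn(d)                       # component mean mu^(m)
cov = np.diag(rng.rand(d) + 0.5)        # diagonal component covariance
y = rng.randn(d)                        # an observation

# Left of (5.8): likelihood under the CMLLR-adapted model of (5.7).
lhs = gauss(y, A_tilde @ mu + b_tilde, A_tilde @ cov @ A_tilde.T)

# Right of (5.8): transform the feature instead and scale by |A^(s)|.
rhs = abs(np.linalg.det(A)) * gauss(A @ y + b, mu, cov)
```

The second form is the one used in practice: the Gaussian parameters stay untouched and only the features (and a single log-determinant term) change, which is what makes CMLLR cheap when the speaker changes often.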
This approach is more robust to recognition errors and avoids the need to re-recognise the data since the lattice can be simply rescored [133]. An alternative use of lattices is to obtain confidence scores, which may then be used for confidence-based MLLR [172]. A different approach using N-best lists was proposed in [118, 194], whereby a separate transform was estimated for each hypothesis and used to rescore only that hypothesis. In [194], this is shown to be a lower-bound approximation where the transform parameters are considered as latent variables.

The initial development of transform-based adaptation methods used the ML criterion; this was then extended to include maximum a posteriori estimation [28]. Linear transforms can also be estimated using discriminative criteria [171, 178, 180]. For supervised adaptation, any of the standard approaches may be used. However, if unsupervised adaptation is used, then there is an additional concern. As discriminative training schemes attempt to modify the parameters so that the posterior of the transcription (or a function thereof) is improved, discriminative estimation is more sensitive to errors in the transcription hypotheses than ML estimation. This is the same issue as was observed for unsupervised discriminative training and, in practice, discriminative unsupervised adaptation is not commonly used, for example in BN transcription systems.

5.2.3 Regression Class Trees

A powerful feature of linear transform-based adaptation is that it allows all the acoustic models to be adapted using a variable number of transforms. When the amount of adaptation data is limited, a global transform can be shared across all Gaussians in the system. As the amount of data increases, the HMM state components can be grouped into regression classes, with each class having its own transform, for example A^(r) for regression class r. As the amount of data increases further, the number of classes and therefore transforms can be increased correspondingly to give better and better adaptation [97].

The number of transforms to use for any specific set of adaptation data can be determined automatically using a regression class tree, as illustrated in Figure 5.2. Each node represents a regression class, i.e., a set of Gaussian components which will share a single transform. The total occupation count associated with any node in the tree can easily be computed, since the counts are known at the leaf nodes. Then, for a given set of adaptation data, the tree is descended and the most specific set of nodes is selected for which there is sufficient data (for example, the shaded nodes in Figure 5.2). Regression class trees may either be specified using expert knowledge, or more commonly by automatically training the tree by assuming that Gaussian components that are "close" to one another are transformed using the same linear transform [97].

Fig. 5.2 A regression class tree.

5.3 Gender/Cluster Dependent Models

The models described above have treated all the training data as a single block, with no information about the individual segments of training data. In practice it is possible to record, or infer, information about the segment, for example the gender of the speaker. If this gender information is then used in training, the result is referred to as a gender-dependent (GD) system. One issue with training these GD models is that during recognition the gender of the test speaker must be determined. As this is a single binary latent variable, it can normally be rapidly and robustly estimated.
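The regression class tree selection described above can be sketched as a simple recursive descent. The tree layout, occupation counts and threshold below are illustrative assumptions, and the split policy used (descend only while every child has sufficient occupation) is one simple variant of the "most specific nodes with sufficient data" rule, not necessarily the exact policy of any particular toolkit.

```python
# Minimal sketch of regression-class-tree transform selection.
# Each node is a dict with an occupation "count" and optional "children".

def select_classes(node, threshold):
    """Return the most specific regression classes for which every
    child has at least `threshold` frames of adaptation data;
    otherwise keep the parent as a single shared class."""
    children = node.get("children", [])
    if children and all(c["count"] >= threshold for c in children):
        selected = []
        for c in children:
            selected.extend(select_classes(c, threshold))
        return selected
    return [node["name"]]

# Hypothetical tree: counts at internal nodes are the sums of their leaves.
tree = {"name": "global", "count": 1000, "children": [
    {"name": "silence", "count": 250},
    {"name": "speech", "count": 750, "children": [
        {"name": "vowels", "count": 600},
        {"name": "consonants", "count": 150}]}]}

# With 200 frames required per transform, "speech" is not split further
# because "consonants" has too little data.
selected = select_classes(tree, threshold=200)
```

As the amount of adaptation data grows, the same tree yields progressively more, and more specific, transforms, exactly as described in the text.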
It is also possible to extend the number of clusters beyond simple gender dependent models. For example, speakers can be automatically clustered together to form clusters [60]. Rather than considering a single cluster to represent the speaker (and possibly acoustic environment), multiple clusters or so-called eigenvoices may be combined together. In addition to deciding how the clusters are combined, the form of representation of the cluster must be decided.

There are two approaches that have been adopted to combining the clusters together: likelihood combination, and parameter combination. The simplest approach is to use a mixture of eigenvoices. Here

    p(Y_1:T; ν^(s), λ) = Σ_{k=1}^{K} ν_k^(s) p(Y_1:T; λ^(k)),   (5.10)

where λ^(k) are the model parameters associated with the kth cluster, s indicates the speaker, and

    ν_k^(s) ≥ 0,   Σ_{k=1}^{K} ν_k^(s) = 1.   (5.11)

Thus gender dependent models are a specific case where ν_k^(s) = 1 for the "correct" gender and zero otherwise. One problem with this approach is that all K cluster likelihoods must be computed for each observation sequence.

Rather than interpolating likelihoods, the model parameters themselves can be interpolated

    p(Y_1:T; ν^(s), λ) = p(Y_1:T; Σ_{k=1}^{K} ν_k^(s) λ^(k)).   (5.12)

Current approaches limit interpolation to the mean parameters, and the remaining variances, priors and transition probabilities are tied over all the clusters

    µ̂^(sm) = Σ_{k=1}^{K} ν_k^(s) µ^(km).   (5.13)

This form of interpolation can also be combined with linear transforms [35]. The training of these forms of model can be implemented using EM.

Various forms of this model have been investigated: reference speaker weighting (RSW) [73], EigenVoices [91], and cluster adaptive training (CAT) [52]. All use the same interpolation of the means, but differ in how the eigenvoices are estimated and how the interpolation weight vector, ν^(s), is estimated. RSW is the simplest approach, using each speaker as an eigenvoice. In the original implementation of EigenVoices, the cluster means were determined using PCA on the extended mean supervector, where all the means of the models for each speaker were concatenated together. In contrast, CAT uses ML to estimate the cluster parameters. Having estimated the cluster parameters in training, the interpolation weights, ν^(s), are estimated for the test speaker. For both EigenVoices and CAT, the weights are estimated using ML. In contrast, RSW uses ML with the added constraint that the weights satisfy the constraints in (5.11); this complicates the optimisation criterion and hence it is not often used.

A number of extensions to this basic parameter interpolation framework have been proposed. In [52], linear transforms were used to represent each cluster for CAT, also referred to as Eigen-MLLR [24] when used in an Eigenvoice-style approach. Kernelised versions of both EigenVoices [113] and Eigen-MLLR [112] have also been studied. Finally, the use of discriminative criteria to obtain the cluster parameters has been derived [193]. Note that CAT parameters are estimated using an adaptive training style, which is discussed in more detail below.
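The parameter interpolation of (5.13) is just a weighted sum of cluster means under the weight constraints of (5.11). A minimal sketch with hypothetical cluster parameters:

```python
import numpy as np

# Hypothetical example: K = 3 clusters, a single Gaussian component
# with 4-dimensional means. Row k of cluster_means is mu^(km).
cluster_means = np.array([
    [0.0, 1.0, 2.0, 3.0],    # cluster 1
    [1.0, 2.0, 3.0, 4.0],    # cluster 2
    [0.5, 0.5, 0.5, 0.5]])   # cluster 3

# Speaker-specific interpolation weights nu^(s): non-negative and
# summing to one, as required by (5.11).
nu = np.array([0.2, 0.7, 0.1])

# Adapted component mean for this speaker, following (5.13).
mu_adapted = nu @ cluster_means

# A gender-dependent model is the special case nu = (1, 0, 0), etc.
mu_cluster1 = np.array([1.0, 0.0, 0.0]) @ cluster_means
```

Only the means change per speaker; variances, priors and transition probabilities stay tied across clusters, which is what keeps the number of speaker-specific parameters down to the K weights.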
5.4 Maximum a Posteriori (MAP) Adaptation

Rather than hypothesising a form of transformation to represent the differences between speakers, it is possible to use standard statistical approaches to obtain robust parameter estimates. One common approach is maximum a posteriori (MAP) adaptation where, in addition to the adaptation data, a prior over the model parameters, p(λ), is used to estimate the model parameters. Given some adaptation data Y^(1), ..., Y^(R) and a model with parameters λ, MAP-based parameter estimation seeks to maximise the following objective function

    F_map(λ) = F_ml(λ) + (1/R) log(p(λ))
             = (1/R) Σ_{r=1}^{R} log p(Y^(r)|w_ref^(r); λ) + (1/R) log(p(λ)).   (5.14)

Comparing this with the ML objective function given in (4.1), it can be seen that the likelihood is weighted by the prior. Note that MAP can also be used with other training criteria such as MPE [137].

The choice of distribution for this prior is problematic, since there is no conjugate prior density for a continuous Gaussian mixture HMM. However, if the mixture weights and Gaussian component parameters are assumed to be independent, then a finite mixture density of the form p_D(c_j) Π_m p_W(µ^(jm), Σ^(jm)) can be used, where p_D(·) is a Dirichlet distribution over the vector of mixture weights c_j and p_W(·) is a normal-Wishart density. It can then be shown that this leads to parameter estimation formulae of the form [61]

    µ̂^(jm) = ( τ µ^(jm) + Σ_{r=1}^{R} Σ_{t=1}^{T^(r)} γ_t^(rjm) y_t^(r) ) / ( τ + Σ_{r=1}^{R} Σ_{t=1}^{T^(r)} γ_t^(rjm) ),   (5.15)

where µ^(jm) is the prior mean and τ is a parameter of p_W(·) which is normally determined empirically. Similar, though rather more complex, formulae can be derived for the variances and mixture weights [61].

Comparing (5.15) with (2.1), it can be seen that MAP adaptation effectively interpolates the original prior parameter values with those that would be obtained from the adaptation data alone. As the amount of adaptation data increases, the parameters tend asymptotically to the adaptation domain. This is a desirable property, and it makes MAP especially useful for porting a well-trained model set to a new domain for which there is only a limited amount of data. A major drawback of MAP adaptation is that every Gaussian component is updated individually. If the adaptation data is sparse, then many of the model parameters will not be updated. Various attempts have been made to overcome this (e.g. [4, 157]), but MAP nevertheless remains ill-suited for rapid incremental adaptation.
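The MAP mean update of (5.15) interpolates the prior mean with the data average, with τ controlling how much data is needed before the estimate moves away from the prior. A minimal numerical sketch (scalar observations and made-up counts, purely for illustration):

```python
import numpy as np

def map_mean(prior_mean, gammas, obs, tau):
    """MAP update of a single component mean, following (5.15).

    prior_mean: prior (e.g. speaker-independent) mean mu^(jm)
    gammas:     per-frame posterior occupation counts gamma_t^(rjm)
    obs:        the corresponding observations y_t^(r)
    tau:        prior weight, normally set empirically
    """
    gammas = np.asarray(gammas, dtype=float)
    obs = np.asarray(obs, dtype=float)
    num = tau * prior_mean + np.sum(gammas * obs)
    den = tau + np.sum(gammas)
    return num / den

prior = 0.0
# With little adaptation data the estimate stays close to the prior ...
little = map_mean(prior, gammas=[1.0] * 5, obs=[2.0] * 5, tau=20.0)
# ... and with much more data it tends towards the data mean (2.0).
much = map_mean(prior, gammas=[1.0] * 5000, obs=[2.0] * 5000, tau=20.0)
```

This is the asymptotic behaviour described in the text: as the occupation counts grow, the τ-weighted prior term is swamped and the estimate converges to the adaptation-data statistics.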
5.5 Adaptive Training

Ideally, an acoustic model set should encode just those dimensions which allow the different classes to be discriminated. However, in the case of speaker independent (SI) speech recognition, the training data necessarily includes a large number of speakers. Hence, acoustic models trained directly on this set will have to "waste" a large number of parameters encoding the variability between speakers, rather than the variability between spoken words, which is the true aim. One approach to handling this problem is to use adaptation transforms during training. This is referred to as speaker adaptive training (SAT) [5]. The concept behind SAT is illustrated in Figure 5.3. For each training speaker, a transform is estimated, and then the canonical model is estimated given all of these speaker transforms.

Fig. 5.3 Adaptive training.

The complexity of performing adaptive training depends on the nature of the adaptation/normalisation transform. These may be split into three groups.

• Model independent: These schemes do not make explicit use of any model information. In these cases the features are transformed and the models trained as usual. CMN/CVN and Gaussianisation belong to this group. VTLN may be independent of the model, for example if estimated using attributes of the signal [96].

• Feature transformation: These transforms also act on the features but are derived, normally using ML estimation, using the current estimate of the model set. Common versions of these feature transforms are VTLN using ML-style estimation, or the linear approximation [88], and constrained MLLR [50].

• Model transformation: The model parameters, means and possibly variances, are transformed. Common schemes are SAT using MLLR [5] and CAT [52]. When MLLR is used, the estimation of the canonical model is more expensive [5] than standard SI training. This is not the case for CAT, although additional parameters must be stored.

The most common version of adaptive training uses CMLLR, since it is the simplest to implement. Using CMLLR, the canonical mean is estimated by

    µ̂^(jm) = ( Σ_{s=1}^{S} Σ_{t=1}^{T^(s)} γ_t^(sjm) (A^(s) y_t^(s) + b^(s)) ) / ( Σ_{s=1}^{S} Σ_{t=1}^{T^(s)} γ_t^(sjm) ),   (5.16)

where S denotes the number of speakers and γ_t^(sjm) is the posterior occupation probability of component m of state s_j generating the observation at time t, based on data for speaker s.

For schemes where the transforms are dependent on the model parameters, the estimation is performed by an iterative process:

(1) initialise the canonical model using a speaker-independent system and set the transforms for each speaker to be an identity transform (A^(s) = I, b^(s) = 0) for each speaker;
(2) estimate a transform for each speaker using only the data for that speaker, using the current estimate of the model set;
(3) estimate the canonical model given each of the speaker transforms;
(4) goto (2) unless converged or some maximum number of iterations is reached.

In practice, multiple forms of adaptive training can be combined together, for example using Gaussianisation with CMLLR-based SAT [55].

Finally, note that SAT trained systems incur the problem that they can only be used once transforms have been estimated for the test data. Thus, an SI model set is typically retained to generate the initial hypothesised transcription or lattice needed to compute the first set of transforms. Recently, a scheme for dealing with this problem using transform priors has been proposed [194].
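The CMLLR-based canonical mean update of (5.16) simply pools speaker-transformed observations, weighted by their occupation posteriors, over all training speakers. A minimal sketch; the per-speaker data and transforms below are hypothetical (identity transforms, so the update reduces to a plain weighted average):

```python
import numpy as np

def sat_canonical_mean(speaker_data):
    """Estimate a canonical component mean as in (5.16).

    speaker_data: list of (A, b, gammas, Y) tuples, one per speaker,
    where (A, b) is that speaker's CMLLR feature transform, Y is a
    (T, d) array of observations and gammas a length-T array of
    occupation posteriors for this component.
    """
    num, den = 0.0, 0.0
    for A, b, gammas, Y in speaker_data:
        transformed = Y @ A.T + b                   # A^(s) y_t + b^(s)
        num = num + (gammas[:, None] * transformed).sum(axis=0)
        den = den + gammas.sum()
    return num / den

# Hypothetical data: three speakers, identity transforms, unit posteriors.
rng = np.random.RandomState(0)
speakers = [(np.eye(2), np.zeros(2), np.ones(5), rng.randn(5, 2) + 1.0)
            for _ in range(3)]
mu_canonical = sat_canonical_mean(speakers)
```

In a real SAT loop this update would alternate with re-estimating each speaker's (A^(s), b^(s)), as in the iterative procedure listed above.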
6 Noise Robustness

In Adaptation and Normalisation, a variety of schemes were described for adapting a speech recognition system to changes in speaker and environment. One very common and often extreme form of environment change is caused by ambient noise and, although the general adaptation techniques described so far can reduce the effects of noise, specific noise compensation algorithms can be more effective, especially with very limited adaptation data.

Noise compensation algorithms typically assume that a clean speech signal x_t in the time domain is corrupted by additive noise n_t and a stationary convolutive channel impulse response h_t to give the observed noisy signal y_t [1]. That is

    y_t = x_t ⊗ h_t + n_t,   (6.1)

where ⊗ denotes convolution. In practice, as explained in Architecture of an HMM-Based Recogniser, speech recognisers operate on feature vectors derived from a transformed version of the log spectrum, for example, in the mel-cepstral domain. Thus, (6.1) leads to the following relationship between
the static clean speech, noise and noise corrupted speech observations¹

    y_t^s = x_t^s + h + C log(1 + exp(C^{−1}(n_t^s − x_t^s − h))) = x_t^s + f(x_t^s, n_t^s, h),   (6.2)

where C is the DCT matrix.

¹ In this section log(·) and exp(·), when applied to a vector, perform an element-wise logarithm or exponential function on all elements of the vector.

Thus, the observed (static) speech vector y_t^s is a highly non-linear function of the underlying clean (static) speech signal x_t^s. For a given set of noise conditions, a small amount of additive Gaussian noise adds a further mode to the speech distribution. As the signal to noise ratio falls, the noise mode starts to dominate: the noisy speech means move towards the noise and the variances shrink. Eventually, the noise dominates and there is little or no information left in the signal. The net effect is that the probability distributions trained on clean data become very poor estimates of the data distributions observed in the noisy environment, and recognition error rates increase rapidly.

Of course, the ideal solution to this problem would be to design a feature representation which is immune to noise. PLP features do exhibit some measure of robustness [74], and further robustness can be obtained by band-pass filtering the features over time in a process called relative spectral filtering, to give so-called RASTA-PLP coefficients [75]. Robustness to noise can also be improved by training on speech recorded in a variety of noise conditions; this is called multistyle training. However, there is a limit to the effectiveness of these approaches, and in practice SNRs below 15 dB or so require active noise compensation to maintain performance.

There are two main approaches to achieving noise robust recognition, and these are illustrated in Figure 6.1, which shows the separate clean training and noisy recognition environments. In broad terms, in feature compensation, the noisy features are compensated or enhanced to remove the effects of the noise. Recognition and training then take place in the clean environment. In model compensation, training takes place in the clean environment, but the clean models are then mapped so that recognition can take place directly in the noisy environment. In general, feature compensation is simpler and more efficient to implement, but model compensation has the potential for greater robustness since it can make direct use of the very detailed knowledge of the underlying clean speech signal encoded in the acoustic models.

Fig. 6.1 Approaches to noise robust recognition.

In addition to the above, both feature and model compensation can be augmented by assigning a confidence or uncertainty to each feature. The decoding process is then modified to take account of this additional information. These uncertainty-based methods include missing feature decoding and uncertainty decoding. The following sections discuss each of the above in more detail.

6.1 Feature Enhancement

The basic goal of feature enhancement is to remove the effects of noise from measured features. One of the simplest approaches to this problem is spectral subtraction (SS) [18]. Given an estimate of the noise spectrum in the linear frequency domain, n^f, a clean speech spectrum is computed simply by subtracting n^f from the observed spectra. Any desired feature vector, such as mel-cepstral coefficients, can then be derived in the normal way from the compensated spectrum.

Although spectral subtraction is widely used, it nevertheless has several problems. Firstly, speech/non-speech detection is needed to isolate segments of the spectrum from which a noise estimate can be made, typically the first second or so of each utterance. This procedure is prone to errors. Secondly, smoothing may lead to an underestimate of the noise, and occasionally errors lead to an overestimate. Hence, a practical SS scheme takes the form

    x̂_i^f = y_i^f − α n_i^f   if y_i^f − α n_i^f > β n_i^f,
    x̂_i^f = β n_i^f           otherwise,   (6.3)

where the f superscript indicates the frequency domain, α is an empirically determined over-subtraction factor and β sets a floor to avoid the clean spectrum going negative [107].

A more robust approach to feature compensation is to compute a minimum mean square error (MMSE) estimate²

    x̂_t = E{x_t | y_t}.   (6.4)

The distribution of the noisy speech is usually represented as an N component Gaussian mixture

    p(y_t) = Σ_{n=1}^{N} c_n N(y_t; µ_y^(n), Σ_y^(n)).   (6.5)

If x_t and y_t are then assumed to be jointly Gaussian within each mixture component n, then

    E{x_t | y_t, n} = µ_x^(n) + Σ_xy^(n) (Σ_y^(n))^{−1} (y_t − µ_y^(n)) = A^(n) y_t + b^(n),   (6.6)

and hence the required MMSE estimate is just a weighted sum of linear predictions

    x̂_t = Σ_{n=1}^{N} P(n|y_t) (A^(n) y_t + b^(n)),   (6.7)

where the posterior component probability is given by

    P(n|y_t) = c_n N(y_t; µ_y^(n), Σ_y^(n)) / p(y_t).   (6.8)

² The derivation given here considers the complete feature vector y_t to predict the clean speech vector x̂_t. It is also possible to just predict the static clean means x̂^s and then derive the delta parameters from these estimates.
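The MMSE front-end of (6.5)–(6.8) is just a posterior-weighted sum of per-component linear predictions. The following sketch uses a hypothetical two-component noisy-speech model with diagonal covariances and made-up predictor parameters:

```python
import numpy as np

def gauss_diag(y, mu, var):
    """Diagonal-covariance Gaussian density."""
    return np.exp(-0.5 * np.sum((y - mu) ** 2 / var)) / \
        np.sqrt(np.prod(2.0 * np.pi * var))

# Hypothetical 2-component noisy-speech GMM in a 2-d feature space,
# with a per-component linear predictor (A^(n), b^(n)) of clean speech.
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
variances = [np.ones(2), np.ones(2)]
A = [np.eye(2), np.eye(2)]
b = [np.array([0.5, 0.5]), np.array([-1.0, -1.0])]

def mmse_enhance(y):
    # Component posteriors P(n|y_t), (6.8).
    post = np.array([c * gauss_diag(y, m, v)
                     for c, m, v in zip(weights, means, variances)])
    post = post / post.sum()
    # Weighted sum of linear predictions, (6.7).
    return sum(p * (An @ y + bn) for p, An, bn in zip(post, A, b))

# Near the first component, the first predictor dominates the estimate.
x_hat = mmse_enhance(np.array([0.1, -0.1]))
```

Because the posterior weights are soft, observations between the components get a blended correction rather than a hard switch, which is what distinguishes (6.7) from the simpler nearest-component variant mentioned for SPLICE below.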
A number of different algorithms have been proposed for estimating the parameters A^(n), b^(n), µ_y^(n) and Σ_y^(n) [1, 123]. If so-called stereo data, i.e., simultaneous recordings of noisy and clean data, is available, then the parameters can be estimated directly. For example, in the SPLICE algorithm [34, 39], the matrix transform A^(n) is assumed to be identity and

    b^(n) = ( Σ_{t=1}^{T} P(n|y_t)(x_t − y_t) ) / ( Σ_{t=1}^{T} P(n|y_t) ).   (6.9)

A further simplification in the SPLICE algorithm is to find n* = arg max_n {P(n|y_t)} and then set x̂_t = y_t + b^(n*). If stereo data is not available, the noisy speech GMM parameters µ_y^(n) and Σ_y^(n) are trained directly on the noisy data using EM.

One approach to dealing with the non-linearity is to use a first-order vector Taylor series (VTS) expansion [124] to relate the static speech, noise and speech+noise parameters

    y^s = x^s + f(x_0^s, n_0^s, h_0) + (x^s − x_0^s) ∂f/∂x^s + (n^s − n_0^s) ∂f/∂n^s + (h − h_0) ∂f/∂h,   (6.10)

where the partial derivatives are evaluated at x_0^s, n_0^s, h_0. For simplicity, h is assumed to be a constant convolutional channel effect. Given a clean speech GMM and an initial estimate of the noise parameters, the conditional statistics required to estimate A^(n) and b^(n) can be derived using (6.2). Then a single iteration of EM is used to update the noise parameters. This ability to estimate the noise parameters without having to explicitly identify noise-only segments of the signal is a major advantage of the VTS algorithm.

6.2 Model-Based Compensation

Feature compensation methods are limited by the need to use a very simple model of the speech signal, such as a GMM. Given that the acoustic models within the recogniser provide a much more detailed
representation of the speech, better performance can be obtained by transforming these models to match the current noise conditions. The aim of model compensation can be viewed as obtaining the parameters of the speech+noise distribution from the clean speech model and the noise model.

Model-based compensation is usually based on combining the clean-speech models with a model of the noise, and for most situations a single Gaussian noise model is sufficient. However, for some situations with highly varying noise conditions, a multi-state, or multi-Gaussian component, noise model may be useful. If the speech models are M component Gaussians and the noise models are N component Gaussians, then the combined models will have MN components. This impacts the computational load in both compensation and decoding. The same state expansion occurs when combining multiple-state noise models, although in this case the decoding framework must also be extended to independently track the state alignments in both the speech and noise models [48, 176]. For most purposes, single Gaussian noise models are sufficient, and in this case the complexity of the resulting models and decoder is unchanged.

One approach to model-based compensation is parallel model combination (PMC) [57]. Standard PMC assumes that the feature vectors are linear transforms of the log spectrum, such as MFCCs, and that the noise is primarily additive (see [58] for the convolutive noise case). Initially, only the static parameter means and variances will be considered. Most model compensation methods assume that if the speech and noise models are Gaussian then the combined noisy model will also be Gaussian. The parameters of the corrupted speech distribution, N(µ_y, Σ_y), can be found from

    µ_y = E{y^s},   (6.11)
    Σ_y = E{y^s y^sT} − µ_y µ_y^T,   (6.12)

where y^s are the corrupted speech "observations" obtained from a particular speech component combined with noise "observations" from the noise model. There is no simple closed-form solution to the expectation in these equations, so various approximations have been proposed.

The basic idea of the PMC algorithm is to map the Gaussian means and variances in the cepstral domain back into the linear domain where the noise is additive, compute the means and variances of the new combined speech+noise distribution, and then map back to the cepstral domain. The mapping of a Gaussian N(µ, Σ) from the cepstral to the log-spectral domain (indicated by the l superscript) is

    µ^l = C^{−1} µ,   (6.13)
    Σ^l = C^{−1} Σ (C^{−1})^T,   (6.14)

where C is the DCT. The mapping from the log-spectral to the linear domain (indicated by the f superscript) is

    µ_i^f = exp{µ_i^l + σ_ii^l/2},   (6.15)
    σ_ij^f = µ_i^f µ_j^f (exp{σ_ij^l} − 1).   (6.16)

Combining the speech and noise parameters in this linear domain is now straightforward, as the speech and noise are independent

    µ_y^f = µ_x^f + µ_n^f,   (6.17)
    Σ_y^f = Σ_x^f + Σ_n^f.   (6.18)

The distributions in the linear domain will be log-normal; unfortunately, the sum of two log-normal distributions is not log-normal. Standard PMC ignores this and assumes that the combined distribution is also log-normal. This is referred to as the log-normal approximation. In this case, the above mappings are simply inverted to map µ_y^f and Σ_y^f back into the cepstral domain. A more accurate mapping can be achieved using Monte Carlo techniques to sample the speech and noise distributions, combine the samples using (6.2) and then estimate the new distribution [48]. However, this is computationally infeasible for large systems.

The log-normal approximation is rather poor, especially at low SNRs. A simpler, faster, PMC approximation is to ignore the variances of the speech and noise distributions. Here

    µ_y = µ_x + f(µ_x, µ_n, µ_h),   (6.19)
    Σ_y = Σ_x,   (6.20)

where f(·) is defined in (6.2). This is the log-add approximation. This is much faster than standard PMC, since the mapping of the covariance matrix to-and-from the cepstral domain is not required.
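The log-normal PMC pipeline of (6.13)–(6.18) can be sketched directly: map both Gaussians from cepstral to log-spectral to linear domain, add, and invert the mapping. The feature dimension and the clean-speech/noise parameter values below are illustrative assumptions; an orthonormal DCT-II matrix is used so that C^{−1} = C^T.

```python
import numpy as np

d = 4  # illustrative feature dimension

# Orthonormal DCT-II matrix C, so that C^{-1} = C^T.
k = np.arange(d)
C = np.sqrt(2.0 / d) * np.cos(np.pi * (k[None, :] + 0.5) * k[:, None] / d)
C[0, :] /= np.sqrt(2.0)
C_inv = C.T

def cepstral_to_linear(mu_c, Sigma_c):
    """(6.13)-(6.16): cepstral -> log-spectral -> linear domain."""
    mu_l = C_inv @ mu_c
    Sigma_l = C_inv @ Sigma_c @ C_inv.T
    mu_f = np.exp(mu_l + 0.5 * np.diag(Sigma_l))
    Sigma_f = np.outer(mu_f, mu_f) * (np.exp(Sigma_l) - 1.0)
    return mu_f, Sigma_f

def linear_to_cepstral(mu_f, Sigma_f):
    """Inverse mappings, applied after combining speech and noise."""
    Sigma_l = np.log(Sigma_f / np.outer(mu_f, mu_f) + 1.0)
    mu_l = np.log(mu_f) - 0.5 * np.diag(Sigma_l)
    return C @ mu_l, C @ Sigma_l @ C.T

# Hypothetical clean-speech and noise cepstral Gaussians.
mu_x, Sigma_x = np.array([1.0, 0.2, 0.1, 0.0]), 0.05 * np.eye(d)
mu_n, Sigma_n = np.array([0.5, 0.0, 0.0, 0.0]), 0.02 * np.eye(d)

# (6.17), (6.18): add in the linear domain, then map back under the
# log-normal approximation (the sum is simply assumed log-normal).
mu_xf, Sigma_xf = cepstral_to_linear(mu_x, Sigma_x)
mu_nf, Sigma_nf = cepstral_to_linear(mu_n, Sigma_n)
mu_y, Sigma_y = linear_to_cepstral(mu_xf + mu_nf, Sigma_xf + Sigma_nf)
```

The forward and inverse mappings are exact inverses of each other; the approximation enters only in the final step, where the summed linear-domain distribution is treated as log-normal.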
An alternative approach to model compensation is to use the VTS approximation described earlier in (6.10) [2]. The expansion points for the vector Taylor series are the means of the speech, noise and convolutional noise, µ_x, µ_n, µ_h. Thus, for the mean of the corrupted speech distribution

    µ_y = E{ µ_x + f(µ_x, µ_n, µ_h) + (x^s − µ_x) ∂f/∂x^s + (n^s − µ_n) ∂f/∂n^s + (h − µ_h) ∂f/∂h }.   (6.21)

The partial derivatives above can be expressed in terms of partial derivatives of y^s with respect to x^s, n^s and h, evaluated at µ_x, µ_n, µ_h. These have the form

    ∂y^s/∂x^s = ∂y^s/∂h = A,   (6.22)
    ∂y^s/∂n^s = I − A,   (6.23)

where A = CFC^{−1} and F is a diagonal matrix whose inverse has leading diagonal elements given by (1 + exp(C^{−1}(µ_n − µ_x − µ_h))). It follows that the means and variances of the noisy speech are given by

    µ_y = µ_x + f(µ_x, µ_n, µ_h),   (6.24)
    Σ_y = AΣ_x A^T + AΣ_h A^T + (I − A)Σ_n (I − A)^T.   (6.25)

Note that the compensation of the mean is exactly the same as the PMC log-add approximation. It can be shown empirically that this VTS approximation is in general more accurate than the PMC log-normal approximation.

For model-based compensation schemes, an interesting issue is how to compensate the delta and delta–delta parameters [59, 66]. A common approximation is the continuous time approximation [66]. Here the dynamic parameters, which are a discrete time estimate of the gradient, are approximated by the instantaneous derivative with respect to time

    Δy_t^s = ( Σ_{i=1}^{n} w_i (y_{t+i}^s − y_{t−i}^s) ) / ( 2 Σ_{i=1}^{n} w_i^2 ) ≈ ∂y_t^s/∂t.   (6.26)
It is then possible to show that, for example, the mean of the delta parameters µ_∆y can be expressed as

µ_∆y = A µ_∆x  (6.27)
The variances and delta–delta parameters can also be compensated in this fashion.

The compensation schemes described above have assumed that the noise model parameters, µ_n, Σ_n and µ_h, are known. If this is not the case, then, in a similar fashion to estimating the parameters for feature enhancement, ML estimates of the noise parameters may be found using VTS [101, 123].
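As a concrete illustration, the VTS compensation of (6.24) and (6.25) can be sketched as below. This is a minimal sketch, not the implementation of any particular system; the mismatch function is assumed to take the standard additive-plus-convolutional form y = x + h + C ln(1 + exp(C⁻¹(n − x − h))), and all function and variable names are illustrative.

```python
import numpy as np

def vts_compensate(mu_x, Sigma_x, mu_n, Sigma_n, mu_h, Sigma_h, C):
    """First-order VTS compensation of a clean-speech Gaussian for additive
    noise (mu_n, Sigma_n) and convolutional noise (mu_h, Sigma_h).
    All parameters are cepstral; C is the DCT matrix."""
    Cinv = np.linalg.inv(C)
    # exp(C^-1 (mu_n - mu_x - mu_h)), evaluated at the expansion point
    g = np.exp(Cinv @ (mu_n - mu_x - mu_h))
    # F is diagonal; its inverse has leading diagonal elements (1 + g)
    F = np.diag(1.0 / (1.0 + g))
    A = C @ F @ Cinv
    I = np.eye(len(mu_x))
    # mean compensation (6.24): mu_x + f(mu_x, mu_n, mu_h)
    mu_y = mu_x + mu_h + C @ np.log1p(g)
    # variance compensation (6.25)
    Sigma_y = (A @ Sigma_x @ A.T + A @ Sigma_h @ A.T
               + (I - A) @ Sigma_n @ (I - A).T)
    return mu_y, Sigma_y
```

With equal speech and noise means the corrupted mean rises by C ln 2, and the variance is an A-weighted blend of the speech and noise variances, matching the behaviour described in the text.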
6.3 Uncertainty-Based Approaches
As noted above, whilst feature compensation schemes are computationally efficient, they are limited by the very simple speech models that can be used. On the other hand, model compensation schemes can take full advantage of the very detailed speech model within the recogniser, but they are computationally very expensive. The uncertainty-based techniques discussed in this section offer a compromise position.

The effects of additive noise can be represented by a DBN as shown in Figure 6.2. In this case, the state likelihood of a noisy speech feature vector can be written as

p(y_t|θ_t; λ, λ̆) = ∫ p(y_t|x_t; λ̆) p(x_t|θ_t; λ) dx_t  (6.28)
Fig. 6.2 DBN of combined speech and noise model.
where

p(y_t|x_t; λ̆) = ∫ p(y_t|x_t, n_t) p(n_t; λ̆) dn_t  (6.29)
and λ denotes the clean speech model and λ̆ denotes the noise model. The conditional probability in (6.29) can be regarded as a measure of the uncertainty in y_t as a noisy estimate of x_t.³ At the extremes, this could be regarded as a switch, such that if the probability is one then x_t is known with certainty, otherwise it is completely unknown or missing. This leads to a set of techniques called missing feature techniques. Alternatively, if p(y_t|x_t; λ̆) can be expressed parametrically, it can be passed directly to the recogniser as a measure of uncertainty in the observed feature vector. This leads to an approach called uncertainty decoding.

6.3.1 Missing Feature Techniques
Missing feature techniques grew out of work on auditory scene analysis, motivated by the intuition that humans can recognise speech in noise by identifying strong speech events in the time-frequency plane and ignoring the rest of the signal [30, 31]. In the spectral domain, a sequence of feature vectors Y_{1:T} can be regarded as a spectrogram indexed by t = 1, ..., T in the time dimension and by k = 1, ..., K in the frequency dimension. Each feature element y_tk has associated with it a bit which determines whether or not that "pixel" is reliable. In aggregate, these bits form a noise mask which determines regions of the spectrogram that are hidden by noise and therefore "missing." Speech recognition within this framework therefore involves two problems: finding the mask, and then computing the likelihoods given that mask [105, 143]. The problem of finding the mask is the most difficult aspect of missing feature techniques. A simple approach is to estimate the SNR and then apply a floor. This can be improved by combining it with perceptual criteria such as harmonic peak information, energy ratios, etc. [12]. More robust performance can then be achieved by training a
³ In [6] this uncertainty is made explicit and it is assumed that p(y_t|x_t; λ̆) ≈ p(x_t|y_t; λ̆). This is sometimes called feature uncertainty.
Bayesian classifier on clean speech which has been artificially corrupted with noise [155].

Once the noise mask has been determined, likelihoods are calculated either by feature-vector imputation or by marginalising out the unreliable features. In the former case, the missing feature values can be determined by a MAP estimate, assuming that the vector was drawn from a Gaussian mixture model of the clean speech (see [143] for details and other methods). Empirical results show that marginalisation gives better performance than imputation. However, when imputation is used, the reconstructed feature vectors can be transformed into the cepstral domain, and this outweighs the performance advantage of marginalisation. An alternative approach is to map the spectrographic mask directly into cepstral vectors with an associated variance [164]. This then becomes a form of uncertainty decoding, discussed next.

6.3.2 Uncertainty Decoding
In uncertainty decoding, the goal is to express the conditional probability in (6.29) in a parametric form which can be passed to the decoder and then used to compute (6.28) efficiently. An important decision is the form that the conditional probability takes. The original forms of uncertainty decoding made the form of the conditional distribution dependent on the region of acoustic space. An alternative approach is to make the form of conditional distribution dependent on the Gaussian component, in a similar fashion to the regression classes in linear transform-based adaptation described in Adaptation and Normalisation.

A natural choice for approximating the conditional is to use an N-component GMM to model the acoustic space. Depending on the form of uncertainty being used, this GMM can be used in a number of ways to model the conditional distribution, p(y_t|x_t; λ̆) [38, 102]. For example, in [102]

p(y_t|x_t; λ̆) ≈ Σ_{n=1}^N P(n|x_t; λ̆) p(y_t|x_t, n; λ̆)  (6.30)
If the component posterior is approximated by

P(n|x_t; λ̆) ≈ P(n|y_t; λ̆)  (6.31)

it is possible to construct a GMM based on the corrupted speech observation, y_t, rather than the unseen clean speech observation, x_t. If the conditional distribution is further approximated by choosing just the most likely component, n*, then

p(y_t|x_t; λ̆) ≈ p(y_t|x_t, n*; λ̆)  (6.32)

The conditional likelihood in (6.32) has been reduced to a single Gaussian, and since the acoustic model mixture components are also Gaussian, the form of (6.28) becomes the convolution of two Gaussians, which is also Gaussian. It can be shown that in this case the state/component likelihood in (6.28) reduces to

p(y_t|θ_t; λ, λ̆) ∝ Σ_{m∈θ_t} c_m N(A^(n*) y_t + b^(n*); µ^(m), Σ^(m) + Σ_b^(n*))  (6.33)

Hence, the input feature vector is linearly transformed in the front-end and, at run-time, the only additional cost is that a global bias must be passed to the recogniser and added to the model variances. The same form of compensation scheme is obtained using SPLICE with uncertainty [38], although the way that the compensation parameters, A^(n), b^(n) and Σ_b^(n), are computed differs.

An interesting advantage of these schemes is that the cost of calculating the compensation parameters is a function of the number of components used to model the feature-space. In contrast, for model-compensation techniques such as PMC and VTS, the compensation cost depends on the number of components in the recogniser. This decoupling yields a useful level of flexibility when designing practical systems.

Although uncertainty decoding based on a front-end GMM can yield good performance, in low SNR conditions it can cause all compensated models to become identical, thus removing all discriminatory power from the recogniser [103]. To overcome this problem, the conditional distribution can alternatively be linked with the recogniser model components. This is similar to using regression classes in adaptation schemes such as MLLR. In this case

p(y_t|θ_t; λ, λ̆) = Σ_{m∈θ_t} |A^(r_m)| c_m N(A^(r_m) y_t + b^(r_m); µ^(m), Σ^(m) + Σ_b^(r_m))  (6.34)

where r_m is the regression class that component m belongs to. Note that for these schemes the Jacobian of the transform, |A^(r_m)|, can vary from component to component, in contrast to the fixed value used in (6.33). The calculation of the compensation parameters follows a similar fashion to the front-end schemes. An interesting aspect of this approach is that, as the number of regression classes tends to the number of components in the recogniser, the performance tends to that of the model-based compensation schemes. Uncertainty decoding of this form can also be incorporated into an adaptive training framework [104].
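The front-end uncertainty decoding likelihood of (6.33) amounts to transforming the observation and adding a bias to every model variance. A minimal sketch, assuming diagonal covariances and illustrative names throughout:

```python
import numpy as np

def ud_state_loglik(y, A, b, Sigma_b, weights, means, covs):
    """State log-likelihood under front-end uncertainty decoding (6.33):
    the observation is linearly transformed once, and the uncertainty bias
    Sigma_b (a diagonal, given here as a vector) is added to every model
    variance. weights/means/covs describe the state's Gaussian mixture."""
    z = A @ y + b
    ll = -np.inf
    for c, mu, var in zip(weights, means, covs):
        v = var + Sigma_b              # model variance plus uncertainty bias
        d = z - mu
        comp = np.log(c) - 0.5 * np.sum(np.log(2 * np.pi * v) + d * d / v)
        ll = np.logaddexp(ll, comp)    # accumulate mixture in the log domain
    return ll
```

Note that as Σ_b grows large, the likelihoods of components with different means converge, which is exactly the low-SNR loss of discrimination discussed above.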
7 Multi-Pass Recognition Architectures

The previous sections have reviewed some of the basic techniques available for both training and adapting an HMM-based recognition system. In general, any particular combination of model set and adaptation technique will have different characteristics and make different errors. Furthermore, if the outputs of these systems are converted to confusion networks, then it is straightforward to combine the confusion networks and select the word sequence with the overall maximum posterior likelihood. Thus, modern transcription systems typically utilise multiple model sets and make multiple passes over the data. This section starts by describing a typical multi-pass combination framework. This is followed by discussion of how multiple systems may be combined together, as well as how complementary systems may be generated. Finally, a description is given of how these approaches, along with the schemes described earlier, are used for a BN transcription task in three languages: English, Mandarin and Arabic.

7.1 Example Architecture

A typical architecture is shown in Figure 7.1. A first pass is made over the data using a relatively simple SI model set. The 1-best output from this pass is used to perform a first round of adaptation. The adapted models are then used to generate lattices using a basic bigram or trigram word-based language model. Once the lattices have been generated, a range of more complex models and adaptation techniques can be applied in parallel to provide K candidate output confusion networks from which the best word sequence is extracted. These 3rd-pass models may include ML and MPE trained systems, SI and SAT trained models, triphone and quinphone models, CMLLR, lattice-based MLLR, 4-gram language models interpolated with class-ngrams and many other variants. For examples of recent large-scale transcription systems see [55, 160, 163]. When using these combination frameworks, the form of combination scheme must be determined and the type of individual model set designed. Both these topics will be discussed in the following sections.

Fig. 7.1 Multi-pass/system combination architectures.
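The control flow of the architecture in Figure 7.1 can be caricatured as below. This is purely an illustrative sketch: every callable and attribute here is a hypothetical placeholder, not a real toolkit API.

```python
def multipass_transcribe(audio, p1_model, adapt, p3_branches, combine):
    """Sketch of a multi-pass architecture (cf. Figure 7.1).

    p1_model    : callable, fast SI decoding of the audio (hypothetical)
    adapt       : callable, returns an adapted model given supervision
                  hypotheses (hypothetical)
    p3_branches : list of callables, each rescoring the lattices with a more
                  complex model and returning a confusion network
    combine     : callable, merges the K confusion networks (e.g. CNC)
    """
    hyp = p1_model(audio)                      # P1: simple SI first pass
    adapted = adapt(hyp)                       # P2: first round of adaptation
    lattice = adapted.generate_lattice(audio)  # bigram/trigram lattices
    cns = [branch(lattice, hyp) for branch in p3_branches]  # P3 in parallel
    return combine(cns)                        # pick the best word sequence
```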
7.2 System Combination

Each of the systems, or branches, in Figure 7.1 produces a number of outputs that can be used for combination. Example outputs that can be used are: the hypothesis from each system, w^(k) for the kth system; the score from the systems, p(Y_{1:T}, w; λ^(k)); as well as the possibility of using the hypotheses and lattices for adaptation and/or rescoring. Log-linear combination of the scores may also be used [15], but this is not common practice in current multi-pass systems.

7.2.1 Consensus Decoding

One approach is to combine either the 1-best hypotheses or confusion network outputs from the systems together to form a single word transition network (WTN), similar in structure to a confusion network. In contrast to the standard decoding strategy, the word-posterior is a function of multiple systems, not a single one (the parameters of this ensemble will be referred to as λ̃ in this section). This may be viewed as an extension of the MBR decoding strategy mentioned earlier. Here

ŵ = argmin_w Σ_{w̃} P(w̃|Y_{1:T}; λ^(1), ..., λ^(K)) L(w, w̃)  (7.1)

where L(w, w̃) is the loss, usually measured in words, between the two word sequences. There are two forms of system combination commonly used: recogniser output voting error reduction (ROVER) [43], and confusion network combination (CNC) [42]. The difference between the two schemes is that ROVER uses the 1-best system output, whereas CNC uses confusion networks. Both schemes use a two-stage process:

(1) Align hypotheses/confusion networks: The output from each of the systems is aligned against one another, minimising for example the Levenshtein edit distance (and for CNC the expected edit distance). Figure 7.2 shows the process for combining four system outputs.

(2) Vote/maximum posterior: Having obtained the WTN (or composite confusion network), decisions need to be made for the final output. As a linear network has been obtained, it is possible to select words for each node-pairing independently. For ROVER, the selected word is based on the maximum posterior, where for the ith word

P(w_i|Y_{1:T}; λ̃) ≈ (1/K) Σ_{k=1}^K ( (1 − α) P(w_i|Y_{1:T}; λ^(k)) + α D_i^(k)(w_i) )  (7.2)

D_i^(k)(w_i) = 1 when the ith word in the 1-best output is w_i for the kth system, and α controls the balance between the 1-best decision and the confidence score. Alternatively, confidence scores may also be incorporated into the combination scheme (this is essential if only two systems are to be combined). CNC has been shown to yield small gains over ROVER combination [42].

Fig. 7.2 Example of output combination.

7.2.2 Implicit Combination

The schemes described above explicitly combine the scores and hypotheses from the various systems. Alternatively, it is possible to implicitly combine the information from one system with another. There are two approaches that are commonly used, lattice or N-best rescoring, and cross-adaptation.
Lattice [7, 144] and N-best [153] rescoring, as discussed in Architecture of an HMM-Based Recogniser, are standard approaches to speeding up decoding by reducing the search-space. Though normally described as approaches to improve efficiency, they have a side-effect of implicit system combination. By restricting the set of possible errors that can be made, the performance of the rescoring system may be improved, as it cannot make errors that are not in the N-best list or lattices. This is particularly true if the lattices or N-best lists are heavily pruned. Another common implicit combination scheme is cross-adaptation. Here one speech recognition system is used to generate the transcription hypotheses for a second speech recognition system to use for adaptation. The level of information propagated between the two systems is determined by the complexity of the adaptation transforms being used, for example in MLLR the number of linear transforms.

7.3 Complementary System Generation

For the combination schemes above, it is assumed that systems that make different errors are available to combine. Many sites adopt an ad-hoc approach to determining the systems to be combined. Initially a number of candidates are constructed, for example: triphone and quinphone models; SAT and GD models; multiple and single pronunciation dictionaries; random decision trees [162]; MFCC, PLP, MLP-based posteriors [199] and Gaussianised front-ends. On a held-out test-set, the performance of the combinations is evaluated and the "best" (possibly taking into account decoding costs) is selected and used.

An alternative approach is to train systems that are designed to be complementary to one another. Boosting is a standard machine learning approach that allows complementary systems to be generated. The general approach in boosting is to define a distribution, or weight, over the training data, so that correctly classified data is de-weighted and incorrectly classified data is more heavily weighted. A classifier is then trained using this weighted data and a new distribution specified combining the new classifier with all the other classifiers. Initially this was applied to binary classification tasks, for example AdaBoost [44], and later to multi-class tasks [45].
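The reweighting step described above can be sketched as a single AdaBoost-style round. This is a generic machine-learning sketch, not a speech-specific recipe; the function name and the particular weight-update rule (the classic beta = err/(1 − err) form) are illustrative.

```python
import numpy as np

def adaboost_reweight(weights, correct, eps=1e-12):
    """One round of AdaBoost-style reweighting: down-weight correctly
    classified examples, leave errors at their current weight, renormalise.

    weights : current distribution over training examples (sums to 1)
    correct : boolean array, True where the new classifier was right
    """
    err = np.sum(weights[~correct])          # weighted error of the classifier
    err = min(max(err, eps), 1.0 - eps)      # guard against degenerate cases
    beta = err / (1.0 - err)                 # < 1 when better than chance
    w = np.where(correct, weights * beta, weights)
    return w / w.sum()                       # new distribution for next round
```

After the update, the relative weight of misclassified data has grown, so the next classifier concentrates on the examples the previous ones got wrong.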
There are a number of issues which make the application of boosting, or boosting-like, algorithms to speech recognition difficult. In ASR, the forms of classifier used are highly complex and the data is dynamic. Also, a simple weighted voting scheme is not directly applicable given the potentially vast number of classes in speech recognition. Despite these issues, there have been a number of applications of boosting to speech recognition [37, 121, 154, 201]. An alternative approach, again based on weighting the data, is to modify the form of the decision tree so that leaves are concentrated in regions where the classifier performs poorly [19], or yield models suitable for combination. Although initial attempts to explicitly generate complementary systems are promising, the majority of multi-pass systems currently deployed are built using the ad-hoc approach described earlier.

7.4 Example Application — Broadcast News Transcription

This section describes a number of systems built at Cambridge University for transcribing BN in three languages: English, Mandarin and Arabic. In addition to BN transcription, the related task of broadcast conversation (BC) transcription is also discussed. These illustrate many of the design issues that must be considered when building large-vocabulary multi-pass combination systems. Though the three languages use the same acoustic model structure, there are language-specific aspects that are exploited to improve performance.

For all the systems described, the front-end processing from the CU-HTK 2003 BN system [87] was used. Each frame of speech is represented by 13 PLP coefficients based on a linear prediction order of 12, with first, second and third-order derivatives appended and then projected down to 39 dimensions using HLDA [92], optimised using the efficient iterative approach described in [51]. Cepstral mean normalisation is applied at the segment level, where each segment is a contiguous block of data that has been assigned to the same speaker/acoustic condition. All models were built using the HTK toolkit [189]. All phones have three emitting states with a left-to-right no-skip topology. Context dependence was provided by using state-clustered cross-word triphone
models [190]. Gaussian mixture models with an average of 36 Gaussian components per state were used for all systems; the distribution of components over the states was determined based on the state occupancy count [189]. In addition to the phone models, two silence models were used: sil represents longer silences that break phonetic context, and sp represents short silences across which phonetic context is maintained. Models were built initially using ML training and then discriminatively trained using MPE training [139, 182]. Gender-dependent versions of these models were then built using discriminative MPE–MAP adaptation [137]. As some BN data, for example telephone interviews, are transmitted over bandwidth-limited channels, wideband and narrowband spectral analysis variants of each model set were also trained.

The language models used for the transcription systems are built from a range of data sources. The first are the transcriptions from the audio data. This will be closely matched to the word-sequences of the test data but is typically limited in size, normally less than 10 million words. In addition, text corpora from newswire and newspapers are used to increase the quantity of data to around a billion words. Given the nature of the training data, the resulting LMs are expected to be more closely matched to BN-style rather than the more conversational BC-style data.

The following sections describe how the above general procedure is applied to BN transcription in three different languages, English, Mandarin and Arabic, and BC transcription in Arabic and Mandarin. For each of these tasks the training data, phoneset and any front-end specific processing are described. In addition, examples are given of how a complementary system may be generated.

7.4.1 English BN Transcription

The training data for the English BN transcription system comprised two blocks of data. The first block of about 144 hours had detailed manual transcriptions. The second block of about 1200 hours of data had either closed-caption or similar rough transcriptions provided. This latter data was therefore used in lightly supervised fashion. The dictionary for the English system was based on 45 phones and a 59K wordlist
By themselves the SPron and MPron systems perform about Table 7. This was trained using 1. which included the acoustic training data references and closedcaptions. having about 1. The ﬁrst was a multiple pronunciation dictionary (MPron).4 15.0 13. Results are given here on three separate test sets.3 10.6 10.8 10. The second was a single pronunciation dictionary (SPron) built using the approach described in [70].2 15. eval04 was a 6 hour test set evenly split between data similar to dev04 and dev04f. dev04f was expected to be harder to recognise as it contains more challenging data with high levels of background noise/music and nonnative speakers.8 15.1 shows the performance of the MPron and the SPron systems using the architecture shown in Figure 7. This is an example of using a modiﬁed dictionary to build complementary systems. Here a segmentation and clustering supplied by the Computer Sciences Laboratory for Mechanics and Engineering Sciences (LIMSI). Table 7. In order to assess the performance of the English BN transcription system. First. There are a number of points illustrated by these results.8 12.84 MultiPass Recognition Architectures was used. a range of development and evaluation data sets were used.4 13. System P2 P3b P3c P2+P3b P2+P3c P2+P3b+P3c dev04 11.1 %WER of using MPron and SPron acoustic models and the LIMSI segmentation/clustering for English BN transcription.1 pronunciations per word on average.9 MPron MPron SPron CNC CNC CNC .3 12. dev04 and dev04f were each 3 hours in size.6 13.9 15. a French CNRS laboratory.5 %WER dev04f 15. HMMs were built using each of these dictionaries with about 9000 distinct states.1. which gave outofvocabulary (OOV) rates of less than 1%.4 10. was used in the decoding experiments.1 10.4 billion words of data.1 eval04 13.3 14. A fourgram language model was used for all the results quoted here. 
the same [56]. However, the P3c branch is consistently better than the P3b branch. This is due to cross-adaptation: the P3c branch is actually an example of cross-adaptation, since the MPron hypotheses from the P2 stage are used in the initial transform estimation for the SPron system. Combining the P3c branch with the P2 branch using CNC (P2+P3c) shows further gains of 0.2%–0.4%. This shows the advantage of combining systems with different acoustic models. However, combining the P2, P3b and P3c branches together gave no gains; indeed, slight degradations compared to the P2+P3c combination can be seen. This illustrates one of the issues with combining systems together: the confidence scores used are not accurate representations of whether the word is correct or not, thus combining systems may degrade performance. This is especially true when the systems have very different performance, or two of the systems being combined are similar to one another. Finally, it is interesting to note that the error rates vary by about 50% between the test sets, even though only test data classified as BN is being used. This reflects the more challenging nature of the dev04f data.

In addition to using multiple forms of acoustic model, it is also possible to use multiple forms of segmentation [56]. The first stage in processing the audio is to segment the audio data into homogeneous blocks.¹ These blocks are then clustered together so that all the data from the same speaker/environment are in the same cluster. This process, sometimes referred to as diarisation, may be scored as a task in itself [170]. However, the interest here is how this segmentation and clustering affects the system combination performance. To perform segmentation combination, separate branches from the initial P1–P2 stage are required. The two segmentations used were: the LIMSI segmentation and clustering used to generate the results in Table 7.1, and the segmentation generated at Cambridge University Engineering Department (CUED). CNC could be used to combine the output from the two segmentations, but here ROVER was used. This architecture is shown in Figure 7.3. The results using this dual architecture are shown in Table 7.2.

¹ Ideally, contiguous blocks of data that have the same speaker and the same acoustic/channel conditions.
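The ROVER voting rule of (7.2) used for this combination can be sketched for a single aligned WTN slot as below. The slot representation, one (word, confidence) pair per system, and the function name are assumptions for illustration only.

```python
def rover_vote(slot, alpha=0.5):
    """Select the word for one aligned WTN slot under the voting rule (7.2).

    slot  : list of (word, confidence) pairs, one per system; each system
            contributes its own 1-best word for this slot (so D = 1 for it)
    alpha : balance between the 1-best vote and the confidence score
    """
    K = len(slot)
    scores = {}
    for word, conf in slot:
        # (1 - alpha) * confidence + alpha * 1-best indicator, averaged over K
        scores[word] = scores.get(word, 0.0) + ((1 - alpha) * conf + alpha) / K
    return max(scores, key=scores.get)
```

With alpha = 1 the scheme reduces to simple majority voting; with alpha = 0 it relies entirely on the confidence scores.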
4 using diﬀerent algorithms [161]. Segmentation/ Clustering LIMSI CUED LIMSI ⊕ CUED dev04 10. the systems diﬀer in both the segmentation and clustering fed into the recognition system [56].4 9.86 MultiPass Recognition Architectures Fig. For both segmentation and clusterings.7 eval04 12. Thus.2 %WER of the dual segmentation system for English BN transcription).2 14.8 12. 7.9 12. the equivalent of the P2+P3c branch in Table 7.1 were used.8 %WER dev04f 14.9 15.3 10. Table 7.3 Multipass/system cross segmentation combination architecture. Although the segmentations .
have similar average lengths, they have different numbers of clusters and diarisation scores. Combining the two segmentations together gave gains of between 0.2% and 0.5%. The slight difference in performance for the LIMSI segmentations was because slightly tighter beams for the P2 and P3c branches were used to keep the overall run-time approximately the same.

7.4.2 Mandarin BN and BC Transcription

About 1390 hours of acoustic training data was used to build the Mandarin BN transcription system. In contrast to the English BN system, the Mandarin system was required to recognise both BN and BC data. The training data was approximately evenly split between these two types, and manual transcriptions were available for all the training data.

The phoneset used for these systems was derived from the 60 toneless phones used in the 44K dictionary supplied by the Linguistic Data Consortium (LDC). It consists of 46 toneless phones, obtained from the 60 phone LDC set by applying mappings of the form "[aeiu]n→[aeiu] n", "[aeiou]ng→[aeiou] ng" and "u:e→ue". This set is the same as the one used in the CU-HTK 2004 Mandarin CTS system [160].

As Mandarin is a tonal language, the influence of tone on the acoustic models can be incorporated into the acoustic model units. Most dictionaries for Mandarin give tone-markers for the vowels. This tone information may be used either by treating each tonal vowel as a separate acoustic unit, or by including tone questions about both the centre and surround phones into the decision trees used for phonetic context clustering. The latter approach is adopted in these experiments. As in the CU-HTK Mandarin CTS system [160], fundamental frequency (F0) was added to the feature vector after the application of the HLDA transform. Delta and delta–delta F0 features were also added, to give a 42-dimensional feature vector. In addition to the F0 features, Gaussianisation was applied after the HLDA transform, using the GMM approach with 4 components for each dimension, to give a GAUSS system for branch P3c.

One issue in language modelling for the Chinese language is that there are no natural word boundaries in normal texts. A string of characters may be partitioned into "words" in a range of ways and there are multiple valid partitions that may be used. As word boundaries are not usually given in Mandarin, the LDC character-to-word segmenter was used. This implements a simple deepest-first search method to determine the word boundaries. The Mandarin wordlist required for this part of the process was based on the 44K LDC Mandarin dictionary. Any Chinese characters that are not present in the word list are treated as individual words. As some English words were present in the acoustic training data transcriptions, all English words and single-character Mandarin words not in the LDC list were added to the wordlist to yield a total of about 58K words. Note that the Mandarin OOV rate is therefore effectively zero, since pronunciations were available for all single characters. A four-gram language model was constructed at the word level using 1.3 billion "words" of data. For this task the average number of characters per word is about 1.6.

The performance of the Mandarin system was evaluated on three test sets: bnmdev06 consists of 3 hours of BN-style data, whereas bcmdev05 comprises about 3 hours of BC-style data; eval06 is a mixture of the two styles. For all systems, performance is assessed in terms of the character error rate (CER).

A slightly modified version of the architecture shown in Figure 7.1 was used for these experiments: two P3 branches in addition to the HLDA system (P3b) were generated. A SAT system using CMLLR transforms to represent each speaker (HLDA-SAT) was used for branch P3a. The same adaptation stages were used for all P3 branches irrespective of the form of model. The P2 output was not combined with the P3 branches to produce the final output. Table 7.3 shows the performance of the various stages. The BC data is more conversational in style and thus inherently harder to recognise, as well as not being as well matched to the language model training data. Thus, the performance on the BC data is, as expected, worse than the performance on the BN data. The best performing single branch was the GAUSS branch. For this task the performance of the HLDA-SAT system was disappointing, giving only small gains if any.
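The deepest-first (greedy longest-match) character-to-word segmentation described above can be sketched as below. The treatment of unknown characters as single-character words follows the description in the text, while the function name and the maximum word length are illustrative assumptions.

```python
def segment_chars(text, vocab, max_len=8):
    """Greedy longest-match ("deepest first") segmentation of a character
    string into words from `vocab`. Characters that do not begin any known
    word are emitted as single-character words."""
    words, i = [], 0
    while i < len(text):
        # try the longest candidate first, falling back to one character
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words
```

For example, with a vocabulary containing "abc" and "d", the string "abcd" is segmented as ["abc", "d"] rather than character by character.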
7 17. This scheme yields 28 consonants. In a graphemic system. 7. In graphemic systems the acoustic models are required to model the implied pronunciation variations implicitly. however. An alternative approach is to use a phonetic system where the pronunciations for each word .3 HLDA HLDASAT HLDA GAUSS CNC any.4 Example Application — Broadcast News Transcription 89 Table 7.0 7. nunation can result in a wordﬁnal nun (/n/) being added to nouns and adjectives in order to indicate that they are unmarked for deﬁniteness.9 7.4. a dictionary is generated using onetoone lettertosound rules for each word. the short vowels (fatha /a/.2 17.7. tamarbuta and hamza.4 17. System P2 P3a P3b P3c P3a+P3c bnmdev06 8.1 18. On a range of tasks. Thus.9 7. In Arabic. This simple graphemic representation allows a pronunciation to be automatically generated for any word in the dictionary. Thus. Note that normally the Arabic text is romanised and the word and letter order swapped to be lefttoright.6 eval06 18. Additionally. four alif variants (madda and hamza above and below).1 17. has not been found to yield large gains in current conﬁgurations. though theoretically interesting. the total number of graphemes is 36 (excluding silence). ya and wa variants (hamza above).7 17. In common with the English system. sukun) are commonly not marked in texts.3 8.3 Arabic BN and BC Transcription Two forms of Arabic system are commonly constructed: graphemic and phonetic systems.7 %CER bcmdev05 18. small consistent gains were obtained by combining multiple branches together.9 18. SAT training.8 18.3 %CER using a multipass combination scheme for Mandarin BN and BC transcription. the dictionary entry for the word “book” in Arabic is ktAb /k/ /t/ /A/ /b/. kasra /i/ and damma /u/) and diacritics (shadda.
0%. compared to English with about 1. and the second using the 260 K phonetic wordlist.html. Buckwalter yields pronunciations for about 75% of the words in the 350 K dictionary used in the experiments. Using a 350 K vocabulary. In order to reduce the eﬀect of inconsistent diacritic markings (and to make the phonetic system diﬀer from the graphemic system) the variants of alef.90 MultiPass Recognition Architectures explicitly include the short vowels and nun. the ﬁrst using the 350 K graphemic wordlist. the lexicon entry for the word book becomes ktAb /k/ /i/ /t/ /A/ /b/. derived using word frequency counts only. This means that pronunciations must be added by hand. LM1. This means that the use of pronunciation probabilities is important. Two wordlists were used: a 350 K wordlist for the graphemic system.qamus. the OOV rate on bcat06 was 2.0)2 [3]. This gives a total of 32 “phones” excluding silence. for example. This results in a potentially very large vocabulary. or the word removed from the wordlist. In this work. only a small number of words were added by hand. . and a 260 K subset of the graphemic wordlist comprising words for which phonetic pronunciations could be found. A common approach to obtaining these phonetic pronunciations is to use the Buckwalter Morphological Analyser (version 2. ya and wa were mapped to their simple forms. In this case. or rules used to derive them.org/index. Thus two language models were built. which may result in a large OOV rate compared to an equivalent English system for example.1 pronunciations. using the same features as for the English system.3 pronunciations per word. higher than the OOV rate in English with a 59 K vocabulary. About 1 billion words of language model training data were used and the acoustic models were trained on about 1000 hours of acoustic data. An additional complication is that Arabic is a highly inﬂected agglutinative language. 2 Available at http://www. 
With this phonetic system there are an average of about 4.1 pronunciations per word, compared to English with about 1.1 pronunciations per word. This means that the use of pronunciation probabilities is important.
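The graphemic lexicon generation described above can be sketched as a simple mapping from romanised letters to "phones". This is only an illustrative sketch: the Buckwalter-style symbols in the variant table and the function name are assumptions for the example, not the exact rules used in the systems described.

```python
# Sketch of graphemic lexicon generation for Arabic (illustrative only).
# Words are assumed to be in a Buckwalter-style romanisation, matching the
# example entry in the text:  ktAb -> /k/ /t/ /A/ /b/.

# Hypothetical normalisation table: map alif/ya/wa variants (hamza and
# madda forms) to their simple base letters, as done for the phonetic
# system to reduce the effect of inconsistent diacritic markings.
VARIANT_MAP = {
    "|": "A",  # alif with madda above   (assumed Buckwalter symbols)
    ">": "A",  # alif with hamza above
    "<": "A",  # alif with hamza below
    "}": "y",  # ya with hamza above
    "&": "w",  # wa with hamza above
}

def graphemic_pronunciation(word: str, normalise_variants: bool = False) -> list[str]:
    """One-to-one letter-to-sound rule: each letter becomes one 'phone'."""
    return [VARIANT_MAP.get(c, c) if normalise_variants else c for c in word]

print(graphemic_pronunciation("ktAb"))                            # ['k', 't', 'A', 'b']
print(graphemic_pronunciation(">slAm", normalise_variants=True))  # ['A', 's', 'l', 'A', 'm']
```

Because every word receives exactly one automatically generated pronunciation, no hand-crafted lexicon is needed; the cost, as noted above, is that the acoustic models must absorb the unmarked short vowels and other variation implicitly.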
Three test sets defined by BBN Technologies (referred to as BBN) were used for evaluating the systems. The first, bnat06, consists of about 3 hours of BN-style data. The second, bcat06, consists of about 3 hours of BC-style data. The eval06 set is approximately 3 hours of a mixture of the two styles.

Table 7.4 shows the performance of the Arabic system using the multi-pass architecture in Figure 7.2. In this case graphemic and phonetic branches were used. Given the large number of pronunciations and the wordlist size, the lattices needed for the P3 rescoring were generated using the graphemic acoustic models for both branches, rather than using two different segmentations. Thus, the phonetic system has an element of cross-adaptation between the graphemic and the phonetic systems.

Table 7.4 %WER using a multi-pass combination scheme for Arabic BN and BC transcription (systems P2a-LM1, P2b-LM2 and P3a-LM1 with graphemic models, P3b-LM2 with phonetic models, plus the CNC combination P3a+P3b; test sets bnat06, bcat06 and eval06).

The most obvious effect is that the performance gain obtained from combining the phonetic and graphemic systems using CNC is large, around a 6% relative reduction over the best individual branch. Even with the cross-adaptation gains, the graphemic system is still slightly better than the phonetic system on the BC data.
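The CNC combination used for the P3a+P3b result merges confusion networks and weights word alternatives by their posterior probabilities. A much-simplified, ROVER-style sketch of the underlying idea, plain majority voting over already-aligned hypotheses with made-up system outputs and no confidence weighting, is:

```python
from collections import Counter

def vote(aligned_hyps):
    """ROVER-style word voting (simplified: assumes hypotheses are already
    aligned to equal length, and ignores word posteriors/confidences that
    a real CNC combination would use)."""
    return [Counter(words).most_common(1)[0][0] for words in zip(*aligned_hyps)]

# Hypothetical outputs from three systems for the same utterance:
hyps = [
    ["the", "cat", "sat"],   # system A
    ["the", "mat", "sat"],   # system B
    ["the", "cat", "sat"],   # system C
]
print(vote(hyps))  # ['the', 'cat', 'sat']
```

The benefit, as in the results above, comes from the systems making complementary errors: a mistake made by one branch is outvoted by the others.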
7.4.4 Summary

The above experiments have illustrated a number of points about large vocabulary continuous speech recognition. In terms of the underlying speech recognition technology, the same broad set of techniques can be used for multiple languages, in this case English, Arabic and Mandarin. However, to enable the technology to work well for different languages, modifications beyond specifying the phone units and dictionary are required. For example, in Mandarin, character-to-word segmentation must be considered, as well as the inclusion of tonal features.

The systems described above have all been based on HLDA feature projection/decorrelation. This, along with schemes such as global STC, is a standard approach used by many of the research groups developing large vocabulary transcription systems. Furthermore, all the acoustic models described have been trained using discriminative, MPE, training. These discriminative approaches are also becoming standard, since typical gains of around 10% relative compared to ML can be obtained [56].

Though no explicit contrasts have been given, all the systems described make use of a combination of feature normalisation and linear adaptation schemes. For large vocabulary speech recognition tasks where the test data may comprise a range of speaking styles and accents, these techniques are essential.

As the results have shown, significant further gains can be obtained by combining multiple systems together. For example, in Arabic, by combining graphemic and phonetic systems together, gains of 6% relative over the best individual branch were obtained. All the systems described here were built at CUED. Interestingly, even greater gains can be obtained by using cross-site system combination [56].

Though BN and BC transcription is a good task to illustrate many of the schemes discussed in this review, rather different tasks would be needed to properly illustrate the various noise robustness techniques described, and space prevented inclusion of these. Some model-based schemes have been applied to transcription tasks [104]; however, the gains have been small, since the levels of background noise encountered in this type of transcription are typically low. The experiments described have also avoided issues of computational efficiency. For practical systems this is an important consideration, but beyond the scope of this review.
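The "relative" gains quoted in this summary (6% from system combination, around 10% from discriminative training) are fractions of the baseline error rate, not absolute percentage points. A minimal illustration, using made-up error rates:

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative error-rate reduction, as a percentage of the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Hypothetical figures: improving a 21.0% WER baseline to 19.7% WER is a
# 1.3 point absolute gain, but roughly a 6% relative reduction.
print(round(relative_reduction(21.0, 19.7), 1))  # 6.2
```

Quoting relative reductions makes gains comparable across tasks whose baseline error rates differ widely, such as BN versus BC transcription.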
Conclusions

This review has presented the core architecture of an HMM-based CSR system and outlined the major areas of refinement incorporated into modern-day systems, including those designed to meet the demanding task of large vocabulary transcription. Starting from a base of simple single-Gaussian diagonal covariance continuous density HMMs, a range of refinements have been described, including distribution and covariance modelling, discriminative parameter estimation, and algorithms for adaptation and noise compensation. All of these aim to reduce the inherent approximation in the basic HMM framework and hence improve performance.

HMM-based speech recognition systems depend on a number of assumptions: that speech signals can be adequately represented by a sequence of spectrally derived feature vectors; that this sequence can be modelled using an HMM; that the feature distributions can be modelled exactly; that there is sufficient training data; and that the training and test conditions are matched. In practice, all of these assumptions must be relaxed to some extent, and it is the extent to which the resulting approximations can be minimised which determines the eventual performance.

In many cases, the various options involve a trade-off between factors such as the availability of training or adaptation data, run-time computation, system complexity and target application. Furthermore, a number of alternative approaches have been described, amongst which there was often no clear winner. Usually, the most challenging problems require complex solutions. As illustrated by the final presentation of multi-pass architectures and actual example configurations, building a successful large vocabulary transcription system is not just about finding elegant solutions to statistical modelling and classification problems; it is also a challenging system integration problem.

HMM-based speech recognition is a maturing technology and, as evidenced by the rapidly increasing commercial deployment, performance has already reached a level which can support viable applications. Progress is continuous and error rates continue to fall. For example, error rates on the transcription of conversational telephone speech were around 50% in 1995. Today, with the benefit of more data and the refinements described in this review, error rates are now well below 20%.

Nevertheless, there is still much to do. As shown by the example performance figures given for transcription of broadcast news and conversations, results are impressive but still fall short of human capability. Furthermore, even the best systems are vulnerable to spontaneous speaking styles, non-native or highly accented speech, and high ambient noise.

Finally, it should be acknowledged that despite their dominance and the continued rate of improvement, many argue that the HMM architecture is fundamentally flawed and that performance must asymptote at some point. This is undeniably true; however, no good alternative to the HMM has been found yet. Fortunately, the authors of this review believe that, in the meantime, the performance asymptote for HMM-based speech recognition is still some way away: there are many ideas which have yet to be fully explored and many more still to be conceived.
Acknowledgments

The authors would like to thank all research students and staff, past and present, who have contributed to the development of the HTK toolkit and HTK-derived speech recognition systems and algorithms. In particular, the contribution of Phil Woodland to all aspects of this research and development is gratefully acknowledged. Without all of these people, this review would not have been possible.
Notations and Acronyms

The following notation has been used throughout this review.

y_t : speech observation at time t
Y_{1:T} : observation sequence y_1, ..., y_T
x_t : "clean" speech observation at time t
n_t : "noise" observation at time t
y_t^s : the "static" elements of feature vector y_t
∆y_t^s : the "delta" elements of feature vector y_t
∆²y_t^s : the "delta-delta" elements of feature vector y_t
Y^(r) : the rth observation sequence y_1^(r), ..., y_T^(r)
c_m : the prior of Gaussian component m
µ^(m) : the mean of Gaussian component m
µ_i^(m) : element i of mean vector µ^(m)
Σ^(m) : the covariance matrix of Gaussian component m
σ_i^(m)2 : ith leading diagonal element of covariance matrix Σ^(m)
N(µ, Σ) : Gaussian distribution with mean µ and covariance matrix Σ
N(y_t; µ, Σ) : likelihood of observation y_t being generated by N(µ, Σ)
A : transformation matrix
A^T : transpose of A
A^{-1} : inverse of A
A_[p] : top p rows of A
|A| : determinant of A
a_ij : element i, j of A
b : bias column vector
λ : set of HMM parameters
p(Y_{1:T}; λ) : likelihood of observation sequence Y_{1:T} given model parameters λ
θ_t : state/component occupied at time t
s_j : state j of the HMM
s_jm : component m of state j of the HMM
γ_t^(j) : P(θ_t = s_j | Y_{1:T}; λ)
γ_t^(jm) : P(θ_t = s_jm | Y_{1:T}; λ)
α_t^(j) : forward likelihood, p(y_1, ..., y_t, θ_t = s_j; λ)
β_t^(j) : backward likelihood, p(y_{t+1}, ..., y_T | θ_t = s_j; λ)
d_j(t) : probability of occupying state s_j for t time steps
q : a phone such as /a/, /t/, etc.
q^(w) : a pronunciation for word w, q_1, ..., q_{K_w}
w_{1:L} : word sequence w_1, ..., w_L
P(w_{1:L}) : probability of word sequence w_{1:L}

Acronyms

BN : broadcast news
BC : broadcast conversations
CAT : cluster adaptive training
CDF : cumulative density function
CMN : cepstral mean normalisation
CML : conditional maximum likelihood
CMLLR : constrained maximum likelihood linear regression
CNC : confusion network combination
CTS : conversation transcription system
CVN : cepstral variance normalisation
DBN : dynamic Bayesian network
EBW : extended Baum-Welch
EM : expectation maximisation
EMLLT : extended maximum likelihood linear transform
FFT : fast Fourier transform
GAUSS : Gaussianisation
GD : gender dependent
GMM : Gaussian mixture model
HDA : heteroscedastic discriminant analysis
HLDA : heteroscedastic linear discriminant analysis
HMM : hidden Markov model
HTK : hidden Markov model toolkit (see http://htk.eng.cam.ac.uk)
LDA : linear discriminant analysis
LM : language model
MAP : maximum a posteriori
MBR : minimum Bayes risk
MCE : minimum classification error
MFCC : mel-frequency cepstral coefficients
ML : maximum likelihood
MLLR : maximum likelihood linear regression
MLLT : maximum likelihood linear transform
MLP : multi-layer perceptron
MMI : maximum mutual information
MMSE : minimum mean square error
MPE : minimum phone error
PLP : perceptual linear prediction
PMC : parallel model combination
RSW : reference speaker weighting
PMF : probability mass function
ROVER : recogniser output voting error reduction
SAT : speaker adaptive training
SI : speaker independent
SNR : signal to noise ratio
SPAM : subspace constrained precision and means
VTLN : vocal tract length normalisation
VTS : vector Taylor series
WER : word error rate
WTN : word transition network
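The forward and backward likelihoods α_t^(j) and β_t^(j) and the posteriors γ_t^(j) defined in the notation above can be made concrete with a toy implementation of the standard recursions. All parameter values below are made up, and no scaling or log-arithmetic is used, so this is a sketch for short sequences only.

```python
# Toy 2-state HMM with discrete outputs (all parameter values are made up).
A = [[0.7, 0.3],   # a_ij: transition probability from state i to state j
     [0.4, 0.6]]
B = [[0.9, 0.1],   # B[j][y]: probability of emitting symbol y in state j
     [0.2, 0.8]]
pi = [0.5, 0.5]    # initial state distribution
Y = [0, 1, 0]      # observation sequence y_1 .. y_T

T, N = len(Y), len(pi)

# Forward recursion: alpha[t][j] = p(y_1..y_t, theta_t = s_j; lambda)
alpha = [[0.0] * N for _ in range(T)]
for j in range(N):
    alpha[0][j] = pi[j] * B[j][Y[0]]
for t in range(1, T):
    for j in range(N):
        alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][Y[t]]

# Backward recursion: beta[t][j] = p(y_{t+1}..y_T | theta_t = s_j; lambda)
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    for j in range(N):
        beta[t][j] = sum(A[j][i] * B[i][Y[t + 1]] * beta[t + 1][i] for i in range(N))

# Total likelihood p(Y_{1:T}; lambda), and the state posteriors
# gamma_t(j) = alpha_t(j) * beta_t(j) / p(Y_{1:T}; lambda), which sum to 1 at each t.
likelihood = sum(alpha[T - 1][j] for j in range(N))
gamma = [[alpha[t][j] * beta[t][j] / likelihood for j in range(N)] for t in range(T)]

print(likelihood)
print(gamma[0])
```

The same α/β quantities drive both Baum-Welch (EM) statistics accumulation and the occupation probabilities γ_t^(jm) used throughout the review; practical implementations replace the raw probabilities with scaled or log-domain values to avoid underflow on long utterances.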
References

[1] A. Acero, Acoustical and Environmental Robustness in Automatic Speech Recognition, Kluwer Academic Publishers, 1993.
[2] A. Acero, L. Deng, T. Kristjansson, and J. Zhang, "HMM adaptation using vector Taylor series for noisy speech recognition," in Proceedings of ICSLP, Beijing, China, 2000.
[3] M. Afify, L. Nguyen, B. Xiang, S. Abdou, and J. Makhoul, "Recent progress in Arabic broadcast news transcription at BBN," in Proceedings of Interspeech, Lisbon, Portugal, September 2005.
[4] S. M. Ahadi and P. C. Woodland, "Combined Bayesian and predictive techniques for rapid speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 11, no. 3, pp. 187–206, 1997.
[5] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, "A compact model for speaker adaptive training," in Proceedings of ICSLP, Philadelphia, 1996.
[6] J. Arrowood, Using Observation Uncertainty for Robust Speech Recognition, PhD thesis, Georgia Institute of Technology, 2003.
[7] X. Aubert and H. Ney, "Large vocabulary continuous speech recognition using word graphs," in Proceedings of ICASSP, Detroit, vol. 1, pp. 49–52, 1995.
[8] S. Axelrod, R. Gopinath, and P. Olsen, "Modelling with a subspace constraint on inverse covariance matrices," in Proceedings of ICSLP, Denver, CO, 2002.
[9] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," in Proceedings of ICASSP, Tokyo, pp. 49–52, 1986.
[10] L. R. Bahl, P. V. de Souza, P. S. Gopalakrishnan, D. Nahamoo, and M. Picheny, "Robust methods for using context-dependent features and models in a continuous speech recognizer," in Proceedings of ICASSP, Adelaide, 1994.
[11] J. Baker, "The Dragon system - An overview," IEEE Transactions on Acoustics Speech and Signal Processing, vol. 23, no. 1, pp. 24–29, 1975.
[12] J. Barker, M. Cooke, and P. Green, "Robust ASR based on clean speech models: An evaluation of missing data techniques for connected digit recognition in noise," in Proceedings of Eurospeech, Aalborg, Denmark, pp. 213–216, 2001.
[13] L. E. Baum and J. A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bulletin of American Mathematical Society, vol. 73, pp. 360–363, 1967.
[14] L. E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximisation technique occurring in the statistical analysis of probabilistic functions of Markov chains," Annals of Mathematical Statistics, vol. 41, pp. 164–171, 1970.
[15] P. Beyerlein, "Discriminative model combination," in Proceedings of ASRU, Santa Barbara, 1997.
[16] J. Bilmes, "Buried Markov models: A graphical-modelling approach to automatic speech recognition," Computer Speech and Language, vol. 17, no. 2–3, 2003.
[17] J. Bilmes, "Graphical models and automatic speech recognition," in Mathematical Foundations of Speech and Language: Processing Institute of Mathematical Analysis Volumes in Mathematics Series, Springer-Verlag, 2003.
[18] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics Speech and Signal Processing, vol. 27, pp. 113–120, 1979.
[19] C. Breslin and M. J. F. Gales, "Complementary system generation using directed decision trees," in Proceedings of ICASSP, 2006.
[20] P. F. Brown, The Acoustic-Modelling Problem in Automatic Speech Recognition, PhD thesis, Carnegie Mellon, 1987.
[21] P. F. Brown, V. J. Della Pietra, P. V. de Souza, J. C. Lai, and R. L. Mercer, "Class-based N-gram models of natural language," Computational Linguistics, vol. 18, no. 4, pp. 467–479, 1992.
[22] W. Byrne, "Minimum Bayes risk estimation and decoding in large vocabulary continuous speech recognition," IEICE Transactions on Information and Systems: Special Issue on Statistical Modelling for Speech Recognition, vol. E89-D(3), pp. 900–907, 2006.
[23] H. Y. Chan and P. C. Woodland, "Improving broadcast news transcription by lightly supervised discriminative training," in Proceedings of ICASSP, Montreal, Canada, 2004.
[24] K. T. Chen, W. W. Liau, H. M. Wang, and L. S. Lee, "Fast speaker adaptation using eigenspace-based maximum likelihood linear regression," in Proceedings of ICSLP, Beijing, China, 2000.
[25] S. F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modelling," Computer Speech and Language, vol. 13, no. 4, pp. 359–394, 1999.
[26] S. S. Chen and R. A. Gopinath, "Gaussianization," in NIPS 2000, Denver, CO, 2000.
[27] S. Chen and R. Gopinath, "Model selection in acoustic modelling," in Proceedings of Eurospeech, 1999.
[28] W. Chou, "Maximum a posteriori linear regression with elliptical symmetric matrix variants," in Proceedings of ICASSP, Phoenix, 1999.
[29] W. Chou, C. H. Lee, and B. H. Juang, "Minimum error rate training based on N-best string models," in Proceedings of ICASSP, Minneapolis, pp. 652–655, 1993.
[30] M. Cooke, P. Green, and M. Crawford, "Handling missing data in speech recognition," in Proceedings of ICSLP, Yokohama, Japan, 1994.
[31] M. Cooke, A. Morris, and P. Green, "Missing data techniques for robust speech recognition," in Proceedings of ICASSP, Munich, Germany, 1997.
[32] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[33] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society Series B, vol. 39, pp. 1–38, 1977.
[34] L. Deng, A. Acero, M. Plumpe, and X. Huang, "Large-vocabulary speech recognition under adverse acoustic environments," in Proceedings of ICSLP, Beijing, China, pp. 806–809, 2000.
[35] V. Diakoloukas and V. Digalakis, "Maximum likelihood stochastic transformation adaptation of hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 2, pp. 177–187, 1999.
[36] V. Digalakis, S. Berkowitz, E. Bocchieri, C. Boulis, W. Byrne, H. Collier, A. Corduneanu, A. Kannan, S. Khudanpur, and A. Sankar, "Rapid speech recognizer adaptation to new speakers," in Proceedings of ICASSP, Phoenix, 1999.
[37] C. Dimitrakakis and S. Bengio, "Boosting HMMs with an application to speech recognition," in Proceedings of ICASSP, Montreal, Canada, 2004.
[38] J. Droppo, A. Acero, and L. Deng, "Uncertainty decoding with SPLICE for noise robust speech recognition," in Proceedings of ICASSP, Orlando, FL, 2002.
[39] J. Droppo, L. Deng, and A. Acero, "Evaluation of the SPLICE algorithm on the Aurora 2 database," in Proceedings of Eurospeech, Aalborg, Denmark, 2001.
[40] G. Evermann, H. Y. Chan, M. J. F. Gales, T. Hain, X. Liu, D. Mrva, L. Wang, and P. C. Woodland, "Development of the 2003 CU-HTK conversational telephone speech transcription system," in Proceedings of ICASSP, Montreal, Canada, 2004.
[41] G. Evermann and P. C. Woodland, "Large vocabulary decoding and confidence estimation using word posterior probabilities," in Proceedings of ICASSP, Istanbul, Turkey, pp. 1655–1658, 2000.
[42] G. Evermann and P. C. Woodland, "Posterior probability decoding, confidence estimation and system combination," in Proceedings of Speech Transcription Workshop, Baltimore, 2000.
[43] J. Fiscus, "A post-processing system to yield reduced word error rates: Recogniser output voting error reduction (ROVER)," in Proceedings of IEEE ASRU Workshop, Santa Barbara, pp. 347–352, 1997.
[44] Y. Freund and R. Schapire, "A decision theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[45] Y. Freund and R. Schapire, "Experiments with a new boosting algorithm," in Proceedings of the Thirteenth International Conference on Machine Learning, 1996.
[46] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1972.
[47] S. Furui, "Speaker independent isolated word recognition using dynamic features of speech spectrum," IEEE Transactions on Acoustics Speech and Signal Processing, vol. 34, pp. 52–59, 1986.
[48] M. J. F. Gales, Model-based Techniques for Noise Robust Speech Recognition, PhD thesis, Cambridge University, 1995.
[49] M. J. F. Gales, "Cepstral parameter compensation for HMM recognition in noise," Speech Communication, vol. 12, no. 3, pp. 231–239, 1993.
[50] M. J. F. Gales, "Transformation smoothing for speaker and environmental adaptation," in Proceedings of Eurospeech, Rhodes, Greece, pp. 2067–2070, 1997.
[51] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, pp. 75–98, 1998.
[52] M. J. F. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 272–281, 1999.
[53] M. J. F. Gales, "Cluster adaptive training of hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 417–428, 2000.
[54] M. J. F. Gales, "Maximum likelihood multiple subspace projections for hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 2, pp. 37–47, 2002.
[55] M. J. F. Gales, "Discriminative models for speech recognition," in ITA Workshop, February 2007.
[56] M. J. F. Gales, D. Y. Kim, P. C. Woodland, H. Y. Chan, D. Mrva, R. Sinha, and S. E. Tranter, "Progress in the CU-HTK broadcast news transcription system," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, September 2006.
[57] M. J. F. Gales, B. Jia, X. Liu, K. C. Sim, P. C. Woodland, and K. Yu, "Development of the CU-HTK 2004 RT04 Mandarin conversational telephone speech transcription system," in Proceedings of ICASSP, Philadelphia, PA, 2005.
[58] M. J. F. Gales and S. J. Young, "Robust continuous speech recognition using parallel model combination," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352–359, 1996.
[59] M. J. F. Gales and S. J. Young, "Robust speech recognition in additive and convolutional noise using parallel model combination," Computer Speech and Language, vol. 9, pp. 289–308, 1995.
[60] J. L. Gauvain and C. H. Lee, "Maximum a posteriori estimation of multivariate Gaussian mixture observations of Markov chains," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 291–298, 1994.
[61] J. Godfrey, E. Holliman, and J. McDaniel, "SWITCHBOARD: Telephone speech corpus for research and development," in Proceedings of ICASSP, San Francisco, pp. 517–520, 1992.
[62] V. Goel and W. Byrne, "Minimum Bayes-risk automatic speech recognition," Computer Speech and Language, vol. 14, no. 2, 2000.
[63] V. Goel, S. Kumar, and W. Byrne, "Segmental minimum Bayes-risk ASR voting strategies," in Proceedings of ICSLP, Beijing, China, 2000.
[64] P. S. Gopalakrishnan, D. Kanevsky, A. Nadas, and D. Nahamoo, "A generalisation of the Baum algorithm to rational objective functions," in Proceedings of ICASSP, Glasgow, pp. 631–634, 1989.
[65] R. A. Gopinath, "Maximum likelihood modelling with Gaussian distributions for classification," in Proceedings of ICASSP, Seattle, pp. 661–664, 1998.
[66] R. A. Gopinath, M. J. F. Gales, P. S. Gopalakrishnan, S. Balakrishnan-Aiyer, and M. A. Picheny, "Robust speech recognition in noise - performance of the IBM continuous speech recognizer on the ARPA noise spoke task," in Proceedings of ARPA Workshop on Spoken Language System Technology, Austin, TX, 1995.
[67] A. Gunawardana, M. Mahajan, A. Acero, and J. Platt, "Hidden conditional random fields for phone classification," in Proceedings of Interspeech, Lisbon, Portugal, September 2005.
[68] R. Haeb-Umbach and H. Ney, "Linear discriminant analysis for improved large vocabulary continuous speech recognition," in Proceedings of ICASSP, San Francisco, vol. 1, pp. 13–16, 1992.
[69] R. Haeb-Umbach and H. Ney, "Improvements in time-synchronous beam search for 10000-word continuous speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 353–356, 1994.
[70] T. Hain, "Implicit pronunciation modelling in ASR," in ISCA ITRW PMLA, 2002.
[71] T. Hain, P. C. Woodland, T. R. Niesler, and E. W. D. Whittaker, "The 1998 HTK system for transcription of conversational telephone speech," in Proceedings of ICASSP, Phoenix, pp. 57–60, 1999.
[72] D. Hakkani-Tur, F. Bechet, G. Riccardi, and G. Tur, "Beyond ASR 1-best: Using word confusion networks in spoken language understanding," Computer Speech and Language, vol. 20, no. 4, 2006.
[73] T. Hazen and J. Glass, "A comparison of novel techniques for instantaneous speaker adaptation," in Proceedings of Eurospeech, pp. 2047–2050, 1997.
[74] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[75] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578–589, 1994.
[76] F. Jelinek, "Continuous speech recognition by statistical methods," Proceedings of IEEE, vol. 64, no. 4, pp. 532–556, 1976.
[77] F. Jelinek, "A fast sequential decoding algorithm using a stack," IBM Journal on Research and Development, vol. 13, 1969.
[78] H. Jiang, X. Li, and C. Liu, "Large margin hidden Markov models for speech recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1584–1595, September 2006.
[79] B. H. Juang, "Maximum-likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT and T Technical Journal, vol. 64, no. 6, pp. 1235–1249, 1985.
[80] B. H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, vol. 40, no. 12, 1992.
[81] B. H. Juang, S. E. Levinson, and M. M. Sondhi, "Maximum likelihood estimation for multivariate mixture observations of Markov chains," IEEE Transactions on Information Theory, vol. 32, no. 2, pp. 307–309, 1986.
[82] B. H. Juang and L. R. Rabiner, "On the hidden Markov model and dynamic time warping for speech recognition - A unified view," AT and T Technical Journal, vol. 63, no. 7, pp. 1213–1243, 1984.
[83] J. Kaiser, B. Horvat, and Z. Kacic, "A novel loss function for the overall risk criterion based discriminative training of HMM models," in Proceedings of ICSLP, Beijing, China, 2000.
[84] S. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recogniser," IEEE Transactions on Acoustics Speech and Signal Processing, vol. 35, no. 3, pp. 400–401, 1987.
[85] T. Kemp and A. Waibel, "Unsupervised training of a speech recognizer: Recent experiments," in Proceedings of Eurospeech, pp. 2725–2728, 1999.
[86] D. Y. Kim, S. Umesh, M. J. F. Gales, T. Hain, and P. C. Woodland, "Using VTLN for broadcast news transcription," in Proceedings of ICSLP, Jeju, Korea, 2004.
[87] D. Y. Kim, G. Evermann, T. Hain, D. Mrva, S. E. Tranter, L. Wang, and P. C. Woodland, "Recent advances in broadcast news transcription," in Proceedings of IEEE ASRU Workshop, St. Thomas, U.S. Virgin Islands, pp. 105–110, November 2003.
[88] N. Kingsbury and P. Rayner, "Digital filtering using logarithmic arithmetic," Electronics Letters, vol. 7, pp. 56–58, 1971.
[89] R. Kneser and H. Ney, "Improved clustering techniques for class-based statistical language modelling," in Proceedings of Eurospeech, Berlin, pp. 973–976, 1993.
[90] R. Kuhn, P. Nguyen, J.-C. Junqua, L. Goldwasser, N. Niedzielski, S. Fincke, K. Field, and M. Contolini, "Eigenvoices for speaker adaptation," in Proceedings of ICSLP, Sydney, 1998.
[91] N. Kumar and A. Andreou, "Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition," Speech Communication, vol. 26, pp. 283–297, 1998.
[92] H.-K. Kuo and Y. Gao, "Maximum entropy direct models for speech recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 3, pp. 873–881, May 2006.
[93] L. Lamel, J. L. Gauvain, and G. Adda, "Lightly supervised and unsupervised acoustic model training," Computer Speech and Language, vol. 16, no. 1, pp. 115–129, 2002.
[94] M. Layton and M. J. F. Gales, "Augmented statistical models for speech recognition," in Proceedings of ICASSP, Toulouse, 2006.
[95] L. Lee and R. Rose, "Speaker normalisation using efficient frequency warping procedures," in Proceedings of ICASSP, Atlanta, 1996.
[96] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 9, pp. 171–185, 1995.
[97] S. E. Levinson, "Continuously variable duration hidden Markov models for automatic speech recognition," Computer Speech and Language, vol. 1, no. 1, pp. 29–45, 1986.
[98] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition," Bell Systems Technical Journal, vol. 62, pp. 1035–1074, 1983.
[99] J. Li, S. M. Siniscalchi, and C.-H. Lee, "Approximate test risk minimization through soft margin training," in Proceedings of ICASSP, Honolulu, USA, 2007.
[100] H. Liao and M. J. F. Gales, "Joint uncertainty decoding for noise robust speech recognition," in Proceedings of Interspeech, Lisbon, Portugal, 2005.
[101] H. Liao and M. J. F. Gales, "Issues with uncertainty decoding for noise robust speech recognition," in Proceedings of ICSLP, Pittsburgh, PA, 2006.
[102] H. Liao, Uncertainty Decoding For Noise Robust Speech Recognition, PhD thesis, Cambridge University, 2007.
[103] R. Lippmann and B. Carlson, "Using missing feature theory to actively select features for robust speech recognition with interruptions, filtering and noise," in Proceedings of Eurospeech, Rhodes, Greece, pp. KN37–KN40, 1997.
[104] H. Liao and M. J. F. Gales, "Adaptive training with joint uncertainty decoding for robust recognition of noise data," in Proceedings of ICASSP, Honolulu, USA, 2007.
[105] X. Liu and M. J. F. Gales, "Automatic model complexity control using marginalized discriminative growth functions," in Proceedings of ASRU, 2003.
[106] P. Lockwood and J. Boudy, "Experiments with a nonlinear spectral subtractor (NSS), hidden Markov models and the projection, for robust speech recognition in cars," Speech Communication, vol. 11, no. 2–3, pp. 215–228, 1992.
[107] B. Lowerre, The Harpy Speech Recognition System, PhD thesis, Carnegie Mellon, 1976.
[108] X. Luo and F. Jelinek, "Probabilistic classification of HMM states for large vocabulary continuous speech recognition," in Proceedings of ICASSP, Phoenix, USA, pp. 2044–2047, 1999.
[109] J. Ma, S. Matsoukas, O. Kimball, and R. Schwartz, "Unsupervised training on large amounts of broadcast news data," in Proceedings of ICASSP, Toulouse, 2006.
[110] W. Macherey, L. Haferkamp, R. Schlüter, and H. Ney, "Investigations on error minimizing training criteria for discriminative training in automatic speech recognition," in Proceedings of Interspeech, Lisbon, Portugal, September 2005.
[111] B. Mak and R. Hsiao, "Kernel eigenspace-based MLLR adaptation," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 3, March 2007.
[112] B. Mak, J. Kwok, and S. Ho, "Kernel eigenvoice speaker adaptation," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 984–992, September 2005.
[113] L. Mangu, E. Brill, and A. Stolcke, "Finding consensus among words: Lattice-based word error minimisation," Computer Speech and Language, vol. 14, no. 4, pp. 373–400, 2000.
[114] S. Martin, J. Liermann, and H. Ney, "Algorithms for bigram and trigram word clustering," Speech Communication, vol. 24, no. 1, pp. 19–37, 1998.
[115] S. Matsoukas, J.-L. Gauvain, G. Adda, T. Colthurst, C.-L. Kao, O. Kimball, L. Lamel, F. Lefevre, J. Z. Ma, J. Makhoul, L. Nguyen, R. Prasad, R. Schwartz, H. Schwenk, and B. Xiang, "Advances in transcription of broadcast news and conversational telephone speech within the combined EARS BBN/LIMSI system," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1541–1556, September 2006.
[116] S. Matsoukas, T. Colthurst, O. Kimball, A. Solomonoff, and H. Gish, "The BBN 2001 English conversational speech system," Presentation at 2001 NIST Large Vocabulary Conversational Speech Workshop, 2001.
[117] T. Matsui and S. Furui, "N-best unsupervised speaker adaptation for speech recognition," Computer Speech and Language, vol. 12, pp. 41–50, 1998.
[118] E. McDermott, T. Hazen, J. Le Roux, A. Nakamura, and S. Katagiri, "Discriminative training for large-vocabulary speech recognition using minimum classification error," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 1, pp. 203–223, 2007.
[119] J. McDonough, W. Byrne, and X. Luo, "Speaker normalisation with all-pass transforms," in Proceedings of ICSLP, Sydney, 1998.
[120] C. Meyer, "Utterance-level boosting of HMM speech recognisers," in Proceedings of ICASSP, Orlando, FL, 2002.
[121] M. Mohri, F. Pereira, and M. Riley, "Weighted finite state transducers in speech recognition," Computer Speech and Language, vol. 16, no. 1, pp. 69–88, 2002.
Macherey. [120] J. H. Furui. Matsoukas.” IEEE Transactions Speech and Audio Processing. McDermott. “Unsupervised training on large amount of broadcast news data.” Computer Speech and Language. Kimball. Ma. Xiang. 2002. Schwartz. vol. Schl¨ter. vol. C. L. [111] W. 1056– 1059. 5. Kimball. [119] E. and K. L. McDonough. J. 16. Prasad. pp. “Utterancelevel boosting of HMM speech recognisers. 984–992.” in Proceedings of Eurospeech. “Finding consensus among words: Latticebased word error minimisation.” in Proceedings of ICSLP. 5. 69– 88. Carnegie Mellon University. Ma. Brill. 1995. Kimball. Mangu. 2000. A. Mak. A.” in Presentation at 2001 NIST Large Vocabulary Conversational Speech Workshop. Hsiao. and A. no.” in Proceedings of ICASSP. Gish. vol. Mak and R. Adda. 14. September 2005. [114] L. vol. 1996. T. Pereira. Le Roux. A. March 2007. pp. [113] B. Kwok. “Advances in transcription of broadcast news and conversational telephone speech within the combined EARS BBN/LIMSI system. Orlando. Madrid. Portugal.
H. and G. Plainsboro NJ. continuous speech recognition. Mangu. and M. V. G. Y.” in Proceedings of Human Language Technology Workshop. J. Orlando. pp. no. vol. P. 1. D. Nguyen and B. Padmanabhan. “Minimum phone error and Ismoothing for improved discriminative training. 2002. “An improved MMIE training algorithm for speaker independent. 1998. “A comparative study of speaker adaptation techniques. 1–38. 43–72. Montreal. and P. J.. Ortmanns. 1994. Woodland. Aubert. Xiang.” in Proceedings of ICASSP. 1991. Switzerland. M. pp. 1988.” Computer Speech and Language. Cambridge University. G. Canada. pp. Gales. “fMPE: Discriminatively trained features for speech recognition.” Computer Speech and Language. Discriminative Training for Large Vocabulary Speech Recognition. Madrid. M. and V. 1983. D. S. Essen. Odell. 31. National Institute of Standards and Technology (NIST). 2000. Garofolo. 1995. March 2004. Woodland. CO. Povey and P.” in Proceedings of Eurospeech. Saon. L. Kneser. Toronto. September 2003. H. F.” Proceedings of ICASSP. Price. W. Y. Paul.References 109 [126] [127] [128] [129] [130] [131] [132] [133] [134] [135] [136] [137] [138] [139] [140] maximum likelihood. B. pp. Toronto.” in Proceedings of ICASSP. vol. Przybocki.” in Proceedings of ICASSP. Kingsbury. P. Normandin. FL. and S. 1.” IEEE Transactions on Acoustics Speech and Signal Processing. Povey. no.” in Proceedings of ICSLP. H. S. Ney. L. D. Soltau. Woodland.” in Proceedings of ICASSP. PhD thesis. New York. and X. C.” in Proceedings of ITRW ASR2000: ASR Challenges for the New Millenium. Sankar. C. Digalakis. D. pp. 128–132. Zweig. “The DARPA 1000word Resource Management database for continuous speech recognition. “Latticebased unsupervised MLLR for speaker adaptation. . Fisher. Young. Geneva.” in Proceedings of EuroSpeech. Rep.” in Proceedings of ICASSP. “Modelling inverse covariance matrices by basis expansion. Kim. J. pp. J. D. B. A. “Light supervision in acoustic model training. Martin. pp. Paris. 
small vocabulary. and R. “A onepass decoder design for large vocabulary recognition. Saon. Gopinath. and G. 2004. 1127–1130. pp. vol. U. P.. D. L. 651–654. J. no. Valtchev. A. 2002. 1991. 1994. 814–817. Povey. V. 693–996. J. Morgan Kaufman Publishers Inc. “MMIMAP and MPEMAP for acoustic model adaptation. 1997. G. J. Olsen and R. vol. Pallet.” Tech. Pallet. Zweig. 2005. 11. Bernstein. “Algorithms for an optimal A* search and linearizing the search in the stack decoder. J. Ney. Povey. and D. 8. Fiscus. “On structuring probabilistic dependences in stochastic language modelling. Philadelphia. 405–410. Neumeyer. “1998 broadcast news benchmark test results: English and nonEnglish word error rate performance measures. S. “A word graph algorithm for large vocabulary continuous speech recognition. 4. R. Denver. D. 1.
” in Proceedings of ICASSP.” Speech Communication. Zweig.L. Ostendorf. Moore. Muller. [146] A. pp. Povey. 2004. “Maximum likelihood and minimum classiﬁcation error factor analysis for automatic speech recognition. 1999. [144] F. vol. 2. S. 43. Saon. and S. pp. K. Montreal. 181–200. pp. Keystone. 8. no.” in Proceedings of IEEE ASRU Workshop. M. and J. and H. Thomas. Dharanipragada. J. 64. R. CO. [156] F.” Computer Speech and Language. Rabiner. 1. [145] B.” Computer Speech and Language. Juang. E. J. 2004. Toronto. Wessel. Povey.” in Advances in Neural Information Processing Systems. [142] L. no. Ney. Saraclar. Sondhi. Shinoda and C.S. “Interdependence of language models and discriminative training.” in Proceedings of ICASSP. F. 257–286. 1991. H. Stern. Istanbul. and G. Chow. vol. Detroit.H. C. Santa Barbara.” in Proceedings of ICASSP. [148] G. 22. [157] K. 2007. 1985. Saon. “CTS decoding improvements at IBM. and R. Gopinath. Canada. R. 1995. [147] M. Padmanabhan. Rohlicek. Seltzer. B. St. Roark. no. FL. 101–116. R. Rahim. “Recognition of isolated digits using HMMs with continuous mixture densities. 1999. Gales.” in Proceedings of ICASSP.” in Proceedings of ICASSP. 119–122. [155] M. Canada. Sim and M. “Discriminative N gram language modelling. and M. pp. Collins. and D. M. Schluter. Richardson. Raj and R. Stern. vol. K. 1997. and M. 77. Rabiner. 2004. “A Bayesian framework for spectrographic mask estimation for missing feature speech recognition. 1211–1233. 1249–1256. 2. U. [152] R. V. pp. Tampa. vol. Raj. 6. [150] G. [158] K. pp.” AT and T Technical Journal. “Using boosting to improve a hybrid HMM/NeuralNetwork speech recogniser. Saul and M. 2. “Latticebased search strategies for large vocabulary recognition. 2005. Saul. 5. Lee. Chen. A. 1985. Montreal. F. pp. 18. “Basis superposition precision matrix models for large vocabulary continuous speech recognition. B. “Structural MAP speaker adaptation using hierarchical priors. Saon. [149] G. 
“Missing feature approaches in speech recognition. R. D. 2000.” in Proceedings of ICASSP. “Maximum likelihood discriminant feature spaces. 379–393. 5–8.” IEEE Signal Processing Magazine. Russell and R. “Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition. 4. no. 2000. Virgin Islands. December 2003.110 References [141] L. pp. no. Sha and L.” IEEE Transactions on Speech and Audio Processing. B. vol. [153] R. G. K. . pp. 576–579.” in Proceedings of ICASSP. M. 2007.” in EARS STT workshop. pp. 115–125. M. “A tutorial on hidden Markov models and selected applications in speech recognition. Rosti and M. vol. Levinson. “Feature space Gaussianization. Phoenix. 701–704. vol. [154] H. “Factor analysed hidden Markov models for speech recognition. vol. Schwenk. 2004. [151] L.” Proceedings of IEEE. 21. “Large margin Gaussian mixture modelling for automatic speech recognition.” in Proceedings of ASRU’97. “A comparison of several approximate algorithms for ﬁnding multiple (N best) sentence hypotheses. 1989. no. pp. [143] B. Schwartz and Y. Gales.
PA. 49–52. Woodland. September 2006. C. Prentice Hall. “The CUHTK Mandarin broadcast news transcription system. Soltau. Rhodes. 14. H. 2004. [173] V. [170] S. [166] B. and T.” in Proceedings of ICASSP. V. F.” in Proceedings of ICSLP. no. “The Cambridge University March 2005 speaker diarisation system. Gales. C. and B.” Speech Communication. W. PhD thesis. [174] V. vol. Brill. and S. Lisbon. P. C. Saon. C. Portugal. “Bestﬁrst enumeration of paths through a lattice — An active chart parsing solution. E. J. J. M. Kim. [171] S. “Improvements in linear transform based speaker adaptation. 22. J. Woodland. “Explicit word error minimization in N Best list rescoring. Odell. Thompson. D. Sinha. E. Gales. pp. Weintraub. Toulouse. Gales. Zweig. Tsakalidis. vol. Tranter and D. F. [164] S. A. “Constructing ensembles of ASR systems using randomized decision trees. [169] K. Yokohama. and G. [161] R. 2006. Woodland. Y. Doumpiotis. Philadelphia. [165] A. 1990. no. Uebel and P. pp. C. Woodland. Kingsbury. “A novel decoder design for large vocabulary recognition. 5. K. J. Ramabhadran. 263– 274. 2005. Japan. “Discriminative linear transforms for feature normalisation and speaker adaptation in HMM estimation. and P. Povey. Valtchev. Reynolds. P. B. Young. Woodland. Text to speech synthesis: New paradigms and advances. 15.” in Proceedings of ICASSP. 3. “Discriminative semiparametric trajectory models for speech recognition.” IEEE Transactions of Audio.” IEEE Transactions on Speech and Audio Processing. and M.” Computer Speech and Language. Sinha. 3. Tranter. M. Stolcke. 2005. pp. Kitamura. [172] L. Speech and Language Processing. F. “Reformulating the HMM as a trajectory model. J. [162] O. 669–687. Tokuda. L. Odell. E.” in Proceedings Beyond HMM Workshop. September 2005. 2007. F. pp. Zen. B. J. Sim. and W. vol. vol. Philadelphia.” in Proceedings of ICASSP. Learning Structured Prediction Models: A Large Margin Approach. and P. 2001. D. 
Chapter HMMBased Approach to Multilingual Speech Synthesis. Tokuda. “Transforming binary uncertainties for robust speech recognition. no. and S. 1994. Greece. 2130–2140. Young.” in Proceedings of ICASSP. H. Liu. J. Srinivasan and D. G. [167] H. 1997.” Computer Speech and Language. J. 1997. 7. Stanford University. vol. C. . J. [160] R. Tasker. December 2004. Seattle. vol. C. Valtchev. Byrne. pp.” IEEE Transactions Speech and Audio Processing. 303–314. Black. 21. pp. X. Kingsbury. Wang. Sim and M. Mangu.” in Proceedings of InterSpeech. 2004. “An overview of automatic speaker diarisation systems. Zen. 367– 376. S. 2007. Tokyo.” in Proceedings of EuroSpeech. 4. 13. no. “MMIE training of large vocabulary recognition systems. Siohan. 2005.References 111 [159] K. [163] H. “The IBM 2004 conversational telephony system for rich transcription. and A. [168] K.
Woodland and D. J. Virgin Islands. C. P. D. pp. The HTK Book (for HTK Version 3. 1982. V. 2007. Gales. Vanhoucke and A. 4. J. vol. 2000.ac.cam. D. [187] S. 2002. J. Woodland. J. J. “Mixtures of inverse covariances. G.” in Proceedings of ICASSP. and P. D.” in Proceedings of ASRU. Istanbul. pp. . L. Honolulu.” Computer Speech and Language. “Framediscriminative and conﬁdencedriven adaptation for LVCSR. [183] P. U. Plainsboro NJ. Young and L.” in Proceedings of ICASSP.” in Proceedings of ICASSP. Young. Wellekens. J. University of Cambridge. [190] S. F. Povey. Pye. “Unsupervised training for Mandarin broadcast news and conversation transcription. 25–47. Povey. J. and P. “Discriminative adaptive training using the MPE criterion. 1996. X. [177] A. Albuqerque. 1998. Woodland. Odell. 1992. pp. 16. [178] F. Rigoll.” in Proceedings of Human Language Technology Workshop. and P. “Speech recognition evaluation: A review of the US CSR and LVCSR programmes. Woodland. [188] S. and M. Kershaw. Dallas.eng. USA. “Generating multiple solutions from connected word DP recognition algorithms. 1133–1136. and G. Moore. J. 2006.” in Proceedings of ICASSP. [184] P. 6. “Error bounds for convolutional codes and asymptotically optimum decoding algorithm. Woodland. Wang. 263–279. Toulouse. [176] A. 845–848. D. J. 351– 354. TX. 2003. [189] S. pp.” in Proceedings of IOA Autumn Conference. 1994. F. Wallhof. J. [179] L.S. M.uk. 1984.” in Proceedings of Human Language Technology Workshop. Chase. “Treebased state tying for high accuracy acoustic modelling. Young. F. Wang. “Minimum generation error training for HMMbased speech synthesis. J.” in Proceedings of ICASSP. 13. [186] Y. no. vol.” IEEE Transactions on Information Theory. pp. 1835–1838. http://htk. Hain. 1997. G. “Hidden Markov models using vector linear predictors and discriminative output distributions.” Computer Speech and Language. Valtchev. Gales. pp. San Francisco. D. 
“Explicit time correlation in hidden Markov models for speech recognition. Woodland. pp. [182] P. M. C. Evermann. Willett. Moore. Gales. 1990. St Thomas.” in Proceedings of ICASSP. C. “Large scale discriminative training of hidden Markov models for speech recognition. vol. Gales. T. 260–269. J.H. 307–312. D. “Hidden Markov model decomposition of speech and noise. [180] L. 384–386. Ollason. C. “The development of the 1996 HTK broadcast news transcription system.” in Proceedings of ICASSP. J. Young. Woodland. [181] C.” in Proceedings of ICSLP’96. Wang and P. M. Pye. Odell. and S. Morgan Kaufman Publishers Inc. Young. 1987. K. Liu. Sankar. Viterbi. Montreal. Philadelphia.J. Wu and R. [185] P. pp. December 2006. Canada. J.4). vol. F. “Iterative unsupervised adaptation using maximum likelihood linear regression. 12. 2003. C.112 References [175] V. pp. Varga and R. Woodland.
Istanbul.References 113 [191] S. Berkeley. 2006.” IEEE Transactions on Speech and Audio Processing. Padmanabhan. Russell. 1989. J. S. Russell. vol. Lisbon. no. 6. August 2007. “Discriminatively trained region dependent feature transforms for speech recognition. vol. “Unsupervised training with directed manual transcription for recognising Mandarin broadcast audio. [193] K. vol. Montreal. J. Young. Yu and M. [195] K. 1998. . Portugal. F. 2004. Gales.” Tech. Chen. [196] H. [194] K. “Discriminative cluster adaptive training.” in Proceedings of ICSLP. Zweig and M. H. Jeju.” IEEE Transactions Speech and Audio Processing. 2000. Zhang. Tokuda. 2007. J. Y. Stolcke. “Bayesian adaptive inference and adaptive training. 14. CUED/FINFENG/TR38. [192] S. pp. [199] Q. H. Schwartz. [200] G. N. and T. Canada. [197] B. and J.” in Proceedings of InterSpeech. H. PhD thesis. “The use of syntax and multiple alternatives in the VODIS voice operated database inquiry system. “Improved discriminative training using phone lattices. Yu and M. 1991. Woodland. 5. C. F. Gales. [201] G.” in Proceedings of ICASSP. no. [198] J. Zweig. Kitamura. September 2005. University of California. Antwerp. and P. J. Young. and J. “A Viterbi algorithm for a trajectory model derived from HMM with explicit relationship between static and dynamic features. and N. 1694–1703.” in Proceedings of InterSpeech. 15. 1. “Token passing: A simple conceptual model for connected speech recognition systems. J. Thornton. H. “On using MLP features in LVCSR. Zheng and A. Rep. S. Matsoukas. Toulouse. F. and R. S. Korea. Zhu. pp. Speech Recognition with Dynamic Bayesian Networks. no. Gales. Cambridge University Engineering Department.” in Proceedings of ICASSP. Thornton.” in Proceedings of ICASSP. M. N. 5. Morgan. 1932–1943. K. Zen. 2006. 2004. pp. Yu. 65–80. “Boosting Gaussian Mixtures in an LVCSR System.” Computer Speech and Language.