Signals and Communication Technology

For further volumes:
http://www.springer.com/series/4748
Tobias Herbig, Franz Gerl, and
Wolfgang Minker
Self-Learning Speaker
Identification
A System for Enhanced Speech Recognition
Authors
Tobias Herbig
Institute of Information Technology
University of Ulm
Albert-Einstein-Allee 43
89081 Ulm
Germany
E-mail: tobias.herbig@uni-ulm.de
Dr. Franz Gerl
Harman Becker Automotive
Systems GmbH
Dept. Speech Recognition
Söflinger Str. 100
89077 Ulm
Germany
E-mail: franz.gerl@harman.com
Wolfgang Minker
Institute of Information Technology
University of Ulm
Albert-Einstein-Allee 43
89081 Ulm
Germany
E-mail: wolfgang.minker@uni-ulm.de
ISBN 978-3-642-19898-4 e-ISBN 978-3-642-19899-1
DOI 10.1007/978-3-642-19899-1
Library of Congress Control Number: 2011927891
© 2011 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the mate-
rial is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Dupli-
cation of this publication or parts thereof is permitted only under the provisions of the German
Copyright Law of September 9, 1965, in its current version, and permission for use must always
be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
Typeset: Scientific Publishing Services Pvt. Ltd., Chennai, India.
Printed on acid-free paper
springer.com
Self-Learning Speaker Identification for
Enhanced Speech Recognition
Summary. A self-learning speech controlled system has been developed for
unsupervised speaker identification and speech recognition. The benefits of a
speech controlled device which identifies its main users by their voice char-
acteristics are obvious: The human-computer interface may be personalized.
New ways for interacting with a speech controlled system may be developed
to simplify the handling of the device. Furthermore, speech recognition accu-
racy may be significantly improved if knowledge about the current user can
be employed. The speech modeling of a speech recognizer can be improved
for particular speakers by adapting the statistical models on speaker specific
data. The adaptation scheme presented captures speaker characteristics after
only a few utterances and transitions smoothly to a highly specific speaker mod-
eling. Speech recognition accuracy may be continuously improved. Therefore
it is quite natural to employ speaker adaptation for a fixed number of users.
Optimal performance may be achieved when each user attends a supervised
enrollment and identifies himself whenever the speaker changes or the system
is reset. A more convenient human-computer communication may be achieved
if users can be identified by their voices. Whenever a new speaker profile is
initialized a fast yet robust information retrieval is required to identify the
speaker on successive utterances. Such a scenario presents a unique challenge
for speaker identification. It has to perform well on short utterances, e.g.
commands, even in adverse environments. Since the speech recognizer has a
very detailed knowledge about speech, it seems to be reasonable to employ
its speech modeling to improve speaker identification. A unified approach
has been developed for simultaneous speech recognition and speaker identi-
fication. This combination allows the system to keep long-term adaptation
profiles in parallel for a limited group of users. New users shall not be forced to
attend an inconvenient and time-consuming enrollment. Instead new users
should be detected in an unsupervised manner while operating the device.
New speaker profiles have to be initialized based on the first occurrences of a
speaker. Experiments on the evolution of such a system were carried out on a
subset of the SPEECON database. The results show that in the long run the
system produces adaptation profiles which give continuous improvements in
speech recognition and speaker identification rate. A variety of applications
may benefit from a system that adapts individually to several users.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Fundamentals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Speech Production. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Front-End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Speaker Change Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Bayesian Information Criterion. . . . . . . . . . . . . . . . . . . . 18
2.4 Speaker Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.1 Motivation and Overview. . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.2 Gaussian Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.3 Detection of Unknown Speakers . . . . . . . . . . . . . . . . . . . 29
2.5 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.2 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5.3 Implementation of an Automated Speech
Recognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Speaker Adaptation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6.2 Applications for Speaker Adaptation in Speaker
Identification and Speech Recognition . . . . . . . . . . . . . . 39
2.6.3 Maximum A Posteriori . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.6.4 Maximum Likelihood Linear Regression . . . . . . . . . . . . 44
2.6.5 Eigenvoices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.6.6 Extended Maximum A Posteriori . . . . . . . . . . . . . . . . . . 51
2.7 Feature Vector Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.7.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.7.2 Stereo-based Piecewise Linear Compensation for
Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.7.3 Eigen-Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3 Combining Self-Learning Speaker Identification and
Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.1 Audio Signal Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2 Multi-Stage Speaker Identification and Speech
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3 Phoneme Based Speaker Identification . . . . . . . . . . . . . . . . . . . 65
3.4 First Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4 Combined Speaker Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.1 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.2 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5 Unsupervised Speech Controlled System with
Long-Term Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Joint Speaker Identification and Speech Recognition . . . . . . . 85
5.2.1 Speaker Specific Speech Recognition . . . . . . . . . . . . . . . 86
5.2.2 Speaker Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2.3 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 Reference Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.1 Speaker Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3.2 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4 Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.4.1 Evaluation of Joint Speaker Identification and
Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4.2 Evaluation of the Reference Implementation . . . . . . . . 105
5.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6 Evolution of an Adaptive Unsupervised Speech
Controlled System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.2 Posterior Probability Depending on the Training Level . . . . . 116
6.2.1 Statistical Modeling of the Likelihood Evolution . . . . . 117
6.2.2 Posterior Probability Computation at Run-Time . . . . 122
6.3 Closed-Set Speaker Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4 Open-Set Speaker Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.5 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.6 Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.6.1 Closed-Set Speaker Identification . . . . . . . . . . . . . . . . . . 134
6.6.2 Open-Set Speaker Identification . . . . . . . . . . . . . . . . . . . 139
6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7 Summary and Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
A Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
A.1 Expectation Maximization Algorithm . . . . . . . . . . . . . . . . . . . . 153
A.2 Bayesian Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
A.3 Evaluation Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Nomenclature
ADC Analog to Digital Converter
ASR Automated Speech Recognition
BIC Bayesian Information Criterion
CDHMM Continuous Density HMM
CMS Cepstral Mean Subtraction
cos cosine
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
EM Expectation Maximization
EMAP Extended Maximum A Posteriori
EV Eigenvoice
FFT Fast Fourier Transform
GMM Gaussian Mixture Model
HMM Hidden Markov Model
Hz Hertz
ID Identity
iid independent and identically distributed
kHz kilo Hertz
LDA Linear Discriminant Analysis
LLR Log Likelihood Ratio
log logarithm
LR Lip Radiation
MAP Maximum A Posteriori
MFCC Mel Frequency Cepstral Coefficients
min minute
ML Maximum Likelihood
MLLR Maximum Likelihood Linear Regression
MLP Multilayer Perceptron
MMSE Minimum Mean Squared Error
NN Neural Networks
PCA Principal Component Analysis
RBF Radial Basis Functions
ROC Receiver Operator Characteristics
SCHMM Semi-Continuous HMM
sec second
SPEECON Speech-Driven Interfaces for Consumer Devices
SPLICE Stereo-based Piecewise Linear Compensation for Environments
STFT Short Time Fourier Transform
UBM Universal Background Model
VAD Voice Activity Detection
VQ Vector Quantization
VT Vocal Tract
VTLN Vocal Tract Length Normalization
WA Word Accuracy
WER Word Error Rate
1
Introduction
Automatic speech recognition has attracted various research activities since
the 1950s. To achieve a high degree of user-friendliness, a natural and easy
to use human-computer interface is targeted for technical applications. Since
speech is the most important means of interpersonal communication, ma-
chines and computers can be operated more conveniently with the help of
automated speech recognition and understanding [Furui, 2009].
The history of speech recognition and speaker identification is character-
ized by steady progress. Whereas in the 1950s the first realizations were
mainly built on heuristic approaches, more sophisticated statistical tech-
niques have been established and continuously developed since the 1980s
[Furui, 2009]. Today a high level of recognition accuracy has been achieved
which allows applications of increasing complexity to be controlled by speech.
Speech recognition has attracted attention for a variety of applications
such as office systems, manufacturing, telecommunication, medical reports
or infotainment systems in automobiles [Rabiner and Juang, 1993]. Speech
recognition can increase labor efficiency in call-centers or for dictation tasks
of special occupational groups with extensive documentation duty. For in-
car applications both the usability and security can be increased for a wide
variety of users. The driver can be supported to safely participate in road
traffic and to operate technical devices, e.g. navigation systems or hands-free
sets.
Despite significant advances during the last few decades, there still exist
some deficiencies that limit the wide-spread application of speech recognition
in technical applications [Furui, 2009]. For example, recognition accuracy
can be negatively affected by changing environments, speaker variability and
natural language input [Junqua, 2000].
The goal of this book is to contribute to a natural human-computer
communication. An automatic personalization through a self-learning speech
controlled system is targeted. An integrated implementation of speech recog-
nition and speaker identification has been developed which adapts indi-
vidually to several users. Significant progress in speech recognition is
exemplified for in-car applications.
1.1 Motivation
Infotainment systems with speech recognition, in general navigation, telephone
or music control, are typically not personalized to a single user. The
speech signal may be degraded by varying engine, wind and tire noises, or
transient events such as passing cars or babble noise. For embedded systems
computational efficiency and memory consumption are important design pa-
rameters. Nevertheless, a very large vocabulary, e.g. city or street names,
needs to be accurately recognized.
Speaker independent speech recognizers are trained on a large set of speak-
ers. Obviously, the trained speech pattern does not fit the voice of each
speaker perfectly [Zavaliagkos et al., 1995]. To achieve high recognition rates
in large vocabulary dictation systems, the statistical models can be perfectly
trained to a particular user [Thelen, 1996]. However, the user is known in
this case and the environment in general does not change.
In a system without speaker tracking all information acquired about a
particular speaker is lost with each speaker turn. However, often one can ex-
pect that a device is only used by a small number of users, e.g. 5 recurring
speakers. Therefore, it seems to be natural to employ speaker adaptation
separately for several speakers and to integrate speaker identification. In ad-
dition to an enhanced speech recognition accuracy by incremental speaker
adaptation, the usability can be improved by tracking personal preferences
and habits.
A simple implementation would be to require the user to identify himself
whenever the system is initialized or the speaker changes. A more natural
human-computer communication will be achieved by identifying the current
user in an unsupervised way. No additional intervention of the user should
be required.
Automatic personalization through a self-learning speech controlled sys-
tem is targeted in this book. An integrated implementation comprising speech
recognition, speaker identification, detection of new users and speaker adap-
tation has been developed. Speech recognition is enhanced by individually
adapting a speech recognizer to 5-10 recurring users in an unsupervised man-
ner. Several tasks have to be accomplished by such a system:
The various speaker adaptation schemes which have been developed, e.g.
by Gauvain and Lee [1994]; Kuhn et al. [2000]; Stern and Lasry [1987], mainly
differ in the number of parameters which have to be estimated. In general,
a higher number of parameters allows more individual representation of the
speaker’s characteristics leading to improved speech recognition results. How-
ever, such approaches are only successful when a sufficient amount of utter-
ances proportional to the number of adaptation parameters can be reliably
attributed to the correct speaker. Especially during the initialization of a
new speaker profile, fast speaker adaptation that can converge within a few
utterances is essential to provide robust statistical models for efficient speech
recognition and reliable speaker identification. When prior knowledge about
speaker variability can be used only a few parameters are required to effi-
ciently adapt the statistical models of a speech recognizer. However, the ca-
pability to track individual characteristics can be strongly limited. In the long
run speaker characteristics have to be optimally captured. Thus, a balanced
strategy of fast and individual adaptation has to be developed by combining
the respective advantages of both approaches.
Speech recognition has to be extended so that several individually trained
speaker profiles can be applied in parallel. Optimal recognition accuracy
may be achieved for each user by an efficient profile selection during speech
recognition. Recognition accuracy and computational complexity have to be
addressed.
Speaker identification is essential to track recurring users across speaker
turns to enable optimal long-term adaptation. Speakers have to be identified
reliably despite limited adaptation data to guarantee long-term stability and
to support the speech recognizer to build up well trained speaker profiles.
Unknown speakers have to be detected quickly so that new profiles can be
initialized in an unsupervised way without any intervention of the user. An
enrollment to train new profiles should be avoided for the sake of a more
natural operation of speech controlled devices.
Since applications for embedded systems are examined, the system shall
be designed to efficiently retrieve and represent speech and speaker character-
istics, especially for real-time computation. Multiple recognitions of speaker
identity and spoken text have to be avoided. A unified statistical modeling of
speech and speaker related characteristics will be presented as an extension
of common speech recognizer technologies.
Beyond the scope of this book, the knowledge about the speaker is not
only useful for speech recognition. The human-computer interface may be
personalized in different ways:
Feedback of the speaker identity to the speech dialog engine allows special
habits and knowledge of the user about the speech controlled device to be
taken into consideration. For instance, two operator modes for beginners and
more skilled users are possible. Beginners can be offered a comprehensive
introduction or assistance concerning the handling of the device. Advanced
users can be pointed to more advanced features.
Knowledge about the speaker identity enables the system to prefer or pre-
select speaker specific parameter sets such as an address book for hands-free
telephony or a list of frequently selected destinations for navigation. Confu-
sions of the speech recognizer can be reduced and in doubt the user can be
offered a list of reasonable alternatives leading to an increased usability.
Barge-in detection is another application where techniques for speaker
identification and speech recognition can be of help as shown by Ittycheriah
and Mammone [1999]; Ljolje and Goffin [2007]. Barge-in allows the user to
interrupt speech prompts, e.g. originating from a Text-To-Speech (TTS) sys-
tem, without pressing a push-to-talk button. A more natural and efficient
control of automated human-computer interfaces may be achieved.
1.2 Overview
First, the fundamentals relevant for the scope of this book are discussed.
Speech production is explained by a simple model. It provides a first overview
of speech and speaker variability. The core components of a speech controlled
system, namely signal processing and feature extraction, speaker change de-
tection, speaker identification, speech recognition and speaker adaptation,
are introduced and discussed in detail. Step by step more sophisticated
strategies and statistical models are explained to handle speech and speaker
characteristics.
Then a brief survey of more complex approaches and systems concerning
the intersection of speaker change detection, speaker identification, speech
recognition and speaker adaptation is given. Speech and speaker character-
istics can be involved in several ways to identify the current user of a speech
controlled system and to understand the content of the spoken phrase. The
target system of this book is sketched and the main aspects are emphasized.
The first component of the target system is a speaker adaptation scheme
which combines short-term adaptation as well as individual adjustments of
the speaker profiles in the long run. This speaker adaptation method is able to
capture speaker characteristics after a few utterances and transitions smoothly
to a highly specific speaker modeling. It is used subsequently to initialize and
continuously adapt the speaker profiles of the target system and forms the
identification basis of an integrated approach for speaker identification and
speech recognition. Remarkable improvements for speech recognition accu-
racy may be achieved under favorable conditions.
Another important component of the target system is the unified realiza-
tion of speaker identification and speech recognition which allows a compact
statistical representation for both tasks. Speaker identification enables the
system to continuously adapt the speaker profiles of its main users. Speech
recognition accuracy is continuously improved. Experiments have been con-
ducted to show the benefit for speech recognition and speaker identification.
The basic system can be extended by long-term speaker tracking. Speaker
tracking across several utterances significantly increases speaker identification
accuracy and ensures long-term stability of the system. The evolution of the
adaptive system can be taken into consideration so that a strictly unsuper-
vised system may be achieved. New users may be detected in an unsupervised
way.
Finally, a summary and conclusion are given which repeat the main aspects
and results. Several future applications and extensions are discussed in an
outlook.
2
Fundamentals
The fundamentals of speech production, automatic speech processing and the
components of a complete system for self-learning speaker identification and
speech recognition are introduced in this chapter.
In the first section the discussion starts with speech production to gain insight
into speech and the problem of speaker variability. A simple yet effective
model is provided to simplify the understanding of the algorithms handling
speaker variability and speech characteristics.
Then feature extraction is considered as the inverse process. A technique
is introduced which extracts the relevant speech features from the recorded
speech signal that can be used for further automated processing.
Based on these features the modules speaker change detection, speaker
identification and speech recognition are described. They are essential for a
speech controlled system operated in an unsupervised manner. Starting from
a basic statistical model, each module introduces an additional extension
for a better representation of speech signals motivated by speech production
theory.
Finally, two competing approaches are presented to handle unseen
situations such as speaker variability or acoustic environments. Speaker adap-
tation allows modifying the statistical models for speech and speaker charac-
teristics whereas feature vector normalization or enhancement compensates
mismatches on a feature level. A continuous improvement of the overall per-
formance is targeted.
Based on the fundamentals, several realizations of complete systems known
from literature are presented in the next chapter.
2.1 Speech Production
Fant's model [Fant, 1960] defines speech production as a source-filter model.
(The subsequent description of speech production follows O'Shaughnessy [2000]
if not indicated otherwise.)
The air stream originating from the lungs flows through the vocal cords and
generates the excitation [Campbell, 1997]. The opening of the glottis deter-
mines the type of excitation and whether voiced or unvoiced speech is pro-
duced. Unvoiced speech is caused by turbulences in the vocal tract whereas
voiced speech is due to a quasi-periodic excitation [Schukat-Talamazzini,
1995]. The periodicity of the excitation is called fundamental frequency F0
or pitch. It depends on speaker and gender specific properties of the vocal
cords, e.g. length, tension and mass [Campbell, 1997]. The fundamental fre-
quencies of male speakers fall in the range between 80 Hz and 160 Hz and
are on average about 132 Hz. Female speakers have an average fundamental
frequency of 223 Hz. Children use even higher frequencies.
The Vocal Tract (VT) generally comprises all articulators above the vocal
cords that are involved in the speech production process [Campbell, 1997]. It
can be roughly approximated by a series of acoustic tubes producing charac-
teristic resonances known as formant frequencies [Campbell, 1997; Rabiner
and Juang, 1993]. As speakers differ anatomically in shape and length of their
vocal tract, formant frequencies are speaker dependent [Campbell, 1997]. The
length of the vocal tract is given by the distance between the glottis and the
lips. Male speakers have an average length of 17 cm whereas the vocal tract
of female speakers is 13 cm long on average.
The modeling of speech production is based on the three main compo-
nents comprising excitation source, vocal tract and lip radiation [Schukat-
Talamazzini, 1995]. The characteristics of glottis, vocal tract and lip
radiation are modeled by filters with the impulse responses h_G, h_VT and h_LR.
In this context u(τ) denotes the excitation signal at time instant τ. The speech
signal s(τ) is given by the convolution

s(τ) = u(τ) ∗ h_G(τ) ∗ h_VT(τ) ∗ h_LR(τ)   (2.1)

u(τ) ∗ h(τ) = ∫_{−∞}^{∞} u(τ̃) · h(τ − τ̃) dτ̃.   (2.2)
Fig. 2.1 depicts the associated block diagram of the source-filter model. A
simplified model can be given by
s(τ) = u(τ) ∗ h_VT(τ)   (2.3)
if the glottis and the lip radiation are neglected [Wendemuth, 2004].
Additive noise and channel characteristics, e.g. the speech transmission
from the speaker’s mouth to the microphone, can be integrated by n(τ) and
a further room impulse response h_Ch. The speech signal s_Mic recorded by the
microphone is given by

s_Mic(τ) = u(τ) ∗ h_VT(τ) ∗ h_Ch(τ) + n(τ),   (2.4)

where the convolution of the vocal tract and channel impulse responses is
abbreviated by h(τ).
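The source-filter relation (2.1)–(2.4) can be illustrated numerically. The
following Python fragment is a toy sketch, not part of the system described in
this book: the vocal tract and channel impulse responses and the noise level
are invented for illustration, and discrete convolution replaces the continuous
integral of (2.2).

import numpy as np

fs = 11025                        # sampling rate in Hz, as used in this book
t = np.arange(0, 0.1, 1.0 / fs)   # 100 ms of signal

# Quasi-periodic excitation u(tau): pulse train at F0 = 132 Hz,
# the average fundamental frequency of male speakers quoted above
f0 = 132.0
u = np.zeros_like(t)
u[(np.arange(len(t)) % int(fs / f0)) == 0] = 1.0

# Hypothetical vocal tract impulse response h_VT: a single damped
# resonance standing in for the formant structure
n = np.arange(200)
h_vt = np.exp(-n / 40.0) * np.sin(2.0 * np.pi * 700.0 * n / fs)

# Hypothetical channel impulse response h_Ch and additive noise n(tau)
h_ch = np.zeros(50)
h_ch[0], h_ch[30] = 1.0, 0.3      # direct path plus one weak echo
noise = 0.01 * np.random.randn(len(t))

# Discrete counterpart of (2.4): s_Mic = u * h_VT * h_Ch + n
s_mic = np.convolve(np.convolve(u, h_vt), h_ch)[:len(t)] + noise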
Speaker variability can be viewed as a result of gender and speaker de-
pendent excitation, anatomical differences in the vocal tract and acquired
2.1 Speech Production 7
speaking habits [Campbell, 1997]. Low-level acoustic information is related to
the vocal apparatus whereas higher-level information is attributed to learned
habits and style, e.g. prosody, word usage and conversational style [Reynolds
et al., 2003].
Noisy environments can have an additional influence on the manner of ar-
ticulation which is known as Lombard effect [Rabiner and Juang, 1993]. Since
speakers try to improve communication intelligibility, significant changes in
their voice patterns can occur degrading automated speech recognition and
speaker identification [Goldenberg et al., 2006]. The effects are highly speaker-
dependent and cause an increase in volume, fundamental frequency and vowel
duration as well as a shift of the first two formants and energy distribu-
tion [Junqua, 1996].
Speaker change detection and speaker identification have to provide ap-
propriate statistical models and techniques which optimally capture the in-
dividual speaker characteristics.
Fig. 2.1 Source-filter model for voiced and unvoiced speech production as found
by Rabiner and Juang [1993]; Schukat-Talamazzini [1995]. The speech production
comprises either a quasi-periodic or noise-like excitation and three filters which
represent the characteristics of the glottis, vocal tract and the lip radiation.
The focus of speech recognition is to understand the content of the spoken
phrase. Therefore further aspects of speech are important. Speech can be
decomposed into phonemes, the smallest linguistic units of a language; many
languages can be described by 20 − 40 phonemes, and the physical sound
generated by the articulation of a phoneme is called a phone [O'Shaughnessy,
2000]. Phonemes can be discriminated by their place of articulation as
exemplified for vowels, fricatives, nasal consonants and plosives:
• Vowels are produced by a quasi-periodic excitation at the vocal cords and
are voiced. They are characterized by line spectra located at multiples of
the fundamental frequency and their intensity exceeds that of other phonemes.
• Nasal consonants are characterized by a total constriction of the vocal
tract and a glottal excitation [Rabiner and Juang, 1993]. The nasal cavity
is excited and attenuates the sound signals so that nasals are less intensive
than vowels.
• Fricatives are produced by constrictions in the vocal tract that cause a
turbulent noisy airflow. The constriction produces a loss of energy so that
fricatives are less intensive. They are characterized by energy located at
higher frequencies.
• Plosives (stops) are produced by an explosive release of an occlusion of the
vocal tract followed by a turbulent noisy air-stream.
The phoneme based speaker identification discussed in Sect. 3.3 exploits the
knowledge about the speaker variability of phonemes to enhance the identifi-
cation accuracy. For example, no information about the vocal tract is present
if the place of articulation is given by the teeth or lips as in /f/ or /p/. In
consequence, little speaker discrimination is expected.
In the next section an automated method for feature extraction is pre-
sented. The goal is to extract the relevant spectral properties of speech which
can be used for speaker change detection, speaker identification and speech
recognition.
2.2 Front-End
The front-end is the first step in the speech recognition or speaker identifi-
cation processing chain. In this book it consists of a signal processing part,
feature extraction and post-processing as displayed in Fig. 2.2.
The signal processing applies a sampling and preprocessing to the recorded
microphone signals. The feature extraction transforms discrete time signals
into a vector representation which is more feasible for pattern recognition
algorithms. The following post-processing comprises feature vector normal-
ization to compensate for channel characteristics. In the case of speech recog-
nition a discriminative mapping is used in addition. The three parts of the
front-end are described in more detail.
Fig. 2.2 Block diagram of a front-end for speech recognition and speaker identifica-
tion. The recorded microphone signal is preprocessed to reduce background noises.
A vector representation is extracted from the speech signal. The feature vectors are
normalized in the post processing to compensate for channel characteristics.
2.2 Front-End 9
Signal Processing
The signal processing receives an input signal from the microphone and deliv-
ers a digitized and enhanced speech signal to feature extraction. The principle
block diagram is displayed in Fig. 2.3.
Fig. 2.3 Block diagram of the signal processing in the front-end. The recorded
speech signal is sampled at discrete time instances and noise reduction is performed.
Speech pauses are excluded for subsequent speech processing.
The Analog to Digital Converter (ADC) samples the incoming time signals
at discrete time instances
s^Mic_l = s_Mic(l / f_s),   l = 0, 1, 2, 3, . . .   (2.5)

and performs a quantization. In this book a sampling rate of f_s = 11.025 kHz
and a 16 bit quantization are used. In this context l denotes the discrete time
index.
Especially in automotive applications, speech signals are superimposed
with background noises affecting the discrimination of different speakers or
speech decoding. Noise reduction aims to minimize environmental influences,
which is essential for reliable recognition accuracy.
Noise reduction can be achieved by a spectral decomposition and a spectral
weighting as described by Vary et al. [1998]. The time signal is split into its
spectral components by an analysis filter bank. A noise estimation algorithm
calculates the noise spectrum as found by Cohen [2003]; Cohen and Berdugo
[2002]. In combination with an estimate of the disturbed speech spectrum,
the power of the undisturbed or clean speech signal is estimated by a spec-
tral weighting. The Wiener filter, spectral subtraction and Ephraim-Malah
are well-known weighting techniques which are described by Cappé [1994];
Ephraim and Malah [1984] in more detail. The phase of the speech signals
remains unchanged since phase distortions seem to be less critical for human
hearing [Vary et al., 1998]. After spectral weighting the temporal speech sig-
nals are synthesized by a synthesis filter bank. The enhanced speech signal
is employed in the subsequent feature extraction. Fig. 2.4 displays the noise
reduction setup as described above.
As noise reduction algorithms do not play an important role in this book,
only the widely-used Wiener filter is applied. Interested readers are referred
to Benesty et al. [2005]; Hänsler and Schmidt [2004, 2008] for further detailed
information.
Fig. 2.4 Block diagram of the noise reduction. The signal is split into its spectral
components by an analysis filter bank. Noise and speech power are estimated to
calculate a spectral weighting. The power of the clean speech signal is estimated by
a spectral weighting of the disturbed speech spectrum. The enhanced speech signal
is synthesized by the synthesis filter bank.
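As a rough illustration of the spectral weighting described above, the following
sketch applies a Wiener gain per frequency bin. It is a minimal example under
simplifying assumptions: the noise power spectrum is assumed to have been
estimated beforehand (e.g. from a speech-free stretch of the signal), whereas
the cited estimation algorithms [Cohen, 2003; Cohen and Berdugo, 2002] track it
continuously.

import numpy as np

def wiener_weighting(stft_frames, noise_psd, gain_floor=0.1):
    """Apply a Wiener gain to complex STFT frames.

    stft_frames: complex spectra of shape (num_frames, num_bins)
    noise_psd:   estimated noise power per bin, shape (num_bins,)
    gain_floor:  lower bound on the gain to limit musical noise
    """
    power = np.abs(stft_frames) ** 2
    # Crude clean-speech power estimate by spectral subtraction
    speech_psd = np.maximum(power - noise_psd, 1e-12)
    gain = speech_psd / (speech_psd + noise_psd)    # Wiener filter gain
    gain = np.maximum(gain, gain_floor)
    # The phase is left unchanged, as discussed above
    return gain * stft_frames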
Voice Activity Detection (VAD) or speech segmentation excludes speech
pauses so that the subsequent processing is done on portions of the micro-
phone signal that are assumed to comprise only speech utterances. The com-
putational load and the probability of false classifications can be reduced for
speaker identification and speech recognition. Conventional speech segmenta-
tion algorithms rely on speech properties such as the zero-crossing rate, rising
or falling energy E{(s^Mic_l)^2}, where E{·} denotes the expectation value
[Kwon and Narayanan, 2005]. Furthermore, the
harmonic structure of voiced speech segments may be used, especially for vow-
els. Adverse environments complicate the end-pointing because background
noises mask regions of low speech activity. Finally, speech segmentation can
shift the detected boundaries of the beginning and end of an utterance to
guarantee that the entire phrase is enclosed for speech recognition. For fur-
ther reading, the interested reader is referred to literature such as Espi et al.
[2010]; Ramírez et al. [2007].
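A minimal energy-based segmentation along the lines sketched above might look
as follows; the frame length, threshold and hangover are illustrative values,
not those of the cited systems.

import numpy as np

def energy_vad(s, frame_len=220, threshold_db=-40.0, hangover=5):
    """Mark frames as speech when the short-term energy exceeds a threshold.

    s: speech signal scaled to [-1, 1]; returns one boolean per frame.
    The hangover keeps a few frames active after the last detection so
    that word endings are not clipped.
    """
    n_frames = len(s) // frame_len
    frames = s[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    active = energy_db > threshold_db
    out = active.copy()
    count = 0
    for i, a in enumerate(active):
        count = hangover if a else max(count - 1, 0)
        out[i] = a or count > 0
    return out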
Feature Extraction
Feature extraction transforms the enhanced speech signals into a vector
representation which reflects discriminatory properties of speech. The Mel
Frequency Cepstral Coefficients (MFCC) are frequently used in speech recog-
nition and speaker identification [O’Shaughnessy, 2000; Quatieri, 2002].
MFCCs are physiologically motivated by human hearing. (For a comparison of
human and machine in speaker identification and speech recognition, see
Hautamäki et al. [2010]; Liu et al. [1997]; O'Shaughnessy [2000] and Meyer
et al. [2006].)
The reader may wonder why the MFCC features are also used for speaker
identification. Even though MFCCs are expected to cause a loss of speaker
information, they seem to capture enough spectral information for speaker
identification. The vocal tract structure may be considered as the dominant
physiological feature to distinguish speakers. This feature is reflected by the
speech spectrum [Reynolds and Rose, 1995]. Kinnunen [2003] provides
a thorough investigation of several spectral features for speaker identification
and concludes that the best candidates perform at a comparable level. However,
features directly computed by a filter bank such as the MFCCs are obviously
better suited for noisy applications [Reynolds and Rose, 1995]. MFCCs
enable a compact representation and are suitable for statistical modeling by
Gaussian mixtures [Campbell, 1997] as explained later in Sect. 2.4. Reynolds
[1995a,b] obtains excellent speaker identification rates even for large speaker
populations. According to Quatieri [2002] the mel-cepstrum can be regarded
as one of the most effective features for speech-related pattern recognition.
Hence, the description in this book is restricted to the extraction of the
MFCC features. High-level features [Quatieri, 2002; Reynolds et al., 2003]
are not considered since a command and control scenario is investigated. An
exemplary MFCC block diagram is shown in Fig. 2.6.
First, the Short Time Fourier Transform (STFT) divides the sequence of
speech samples in frames of predetermined length, applies a window function
and splits each frame into its spectral components. Common window func-
tions are the Hamming and Hann window [Vary et al., 1998]. In this book
the window function h̃ is realized by the Hann window

h̃_l = (1/2) · (1 + cos(2πl / N_Frame)),   l = −N_Frame/2, . . . , N_Frame/2,   (2.6)

where the length of one frame is given by N_Frame. Subsequently, t denotes the
discrete frame index. The frame shift N_shift describes the time which elapses
between two frames. The frame length is 20 ms and the frame shift is given as
half of the frame length. The windowing

s̃^w_{t,l} = s^Mic_{t·N_shift + l} · h̃_l,   t > 1,   (2.7)

combined with the Discrete Fourier Transform (DFT) or Fast Fourier Transform
(FFT) realizes a filter bank [Vary et al., 1998]. (The FFT represents an
efficient implementation of the DFT; details on DFT and FFT are given by
Kammeyer and Kroschel [1998]; Oppenheim and Schafer [1975], for example.)
The filter characteristics are determined by the window's transfer function
shifted in the frequency domain [Hänsler and Schmidt, 2004; Vary et al., 1998].
The incoming microphone signals are split into N_FFT narrow band signals
S(f_b, t) [Hänsler and Schmidt, 2004; Rabiner and Juang, 1993]:

S(f_b, t) = FFT{s̃^w_{t,l}} = Σ_{l=−N_Frame/2}^{N_Frame/2} s̃^w_{t,l} · exp(−i 2π f_b l / N_FFT),   N_FFT ≥ N_Frame.   (2.8)
In this context the index f_b denotes the frequency bin and i is the imaginary
unit. The FFT order is selected as N_FFT = 256 and zero-padding [Oppenheim
and Schafer, 1975] is applied.
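Equations (2.6)–(2.8) translate directly into a few lines of code. The sketch
below uses the parameters stated in the text (20 ms frames at 11.025 kHz,
half-frame shift, N_FFT = 256 with zero-padding); for simplicity the window
index runs over non-negative l, which merely shifts the window of (2.6) by
half a frame.

import numpy as np

FS = 11025
FRAME_LEN = int(0.020 * FS)        # 20 ms -> 220 samples (N_Frame)
FRAME_SHIFT = FRAME_LEN // 2       # half of the frame length (N_shift)
N_FFT = 256                        # N_FFT >= N_Frame, zero-padding applied

def stft_power(s):
    """Split s into Hann-windowed frames and return |S(f_b, t)|^2."""
    l = np.arange(FRAME_LEN)
    window = 0.5 * (1.0 - np.cos(2.0 * np.pi * l / FRAME_LEN))  # Hann window
    n_frames = 1 + (len(s) - FRAME_LEN) // FRAME_SHIFT
    spectra = np.empty((n_frames, N_FFT // 2 + 1))
    for t in range(n_frames):
        frame = s[t * FRAME_SHIFT : t * FRAME_SHIFT + FRAME_LEN] * window
        spectrum = np.fft.rfft(frame, n=N_FFT)    # zero-padded FFT, cf. (2.8)
        spectra[t] = np.abs(spectrum) ** 2        # phase is discarded
    return spectra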
Applying the STFT to speech signals presumes that speech can be regarded
as stationary for short time periods. Stationarity claims that the statistical
properties of the random variable are independent from the investigated time
instances [Hänsler, 2001]. Irrespective of voiced or unvoiced speech, the spec-
tral amplitudes of speech can be regarded as quasi-stationary for tens of mil-
liseconds [O’Shaughnessy, 2000]. For the feature extraction this assumption
is approximately true if the frame length is shorter than 30 ms [Schukat-
Talamazzini, 1995].
In the next step of the feature extraction the magnitude |S(f_b, t)| and the
local energy |S(f_b, t)|^2 are calculated. The phase information is discarded.
The human auditory system cannot resolve frequencies linearly [Logan,
2000; Wendemuth, 2004]. The critical-band rate or Bark scale reflects the
relationship between the actual frequency and the perceived frequency
resolution, which is approximated by the mel filter bank [O'Shaughnessy, 2000].
The mel filter bank
comprises a non-linear frequency warping and a dimension reduction. The
relation between linear frequencies f and warped mel-frequencies f_mel is de-
rived from psychoacoustic experiments and can be mathematically described by

f_mel = 2595 · log_10(1 + f / 700 Hz)   (2.9)

as found by O'Shaughnessy [2000]. Up to 1 kHz the linear and warped frequen-
cies are approximately equal whereas higher frequencies are logarithmically
compressed.
In the warped frequency domain an average energy is computed by an
equally spaced filter bank [Mammone et al., 1996]. In the linear frequency
domain this corresponds to filters with increasing bandwidth for higher fre-
quencies. Triangular filters can be employed as found by Davis and Mermel-
stein [1980]. The spectral resolution decreases with increasing frequency. In
Fig. 2.5 an example is shown. Further details can be found by Rabiner and
Juang [1993].
The number of frequency bins is reduced here by this filter bank from
N_FFT/2 + 1 to 19. The result can be grouped into the vectors x^MEL_t
representing all energy values of frame index t. For convenience, all vectors
are viewed as column vectors.
The logarithm is applied element by element to x^MEL_t and returns x^LOG_t.
The dynamics of the mel energies are compressed and products of spectra are
deconvolved [O’Shaughnessy, 2000]. For example, speech characteristics and
scaling factors are separated into additive components in the case of clean
speech.
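The warping (2.9) and the averaging filter bank can be sketched as follows,
assuming the 19 triangular filters used in this book; an actual front-end may
differ in the placement of the filter edges.

import numpy as np

def mel_filterbank(n_filters=19, n_fft=256, fs=11025):
    """Triangular filters equally spaced on the mel scale, cf. (2.9)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)

    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

# x_MEL and x_LOG per frame, given the power spectra of the STFT sketch:
# x_mel = stft_power(s) @ mel_filterbank().T
# x_log = np.log(x_mel + 1e-12)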
Fig. 2.5 Example for a mel filter bank. An equally spaced filter bank is used in the
warped frequency domain to compute an averaged energy. In the linear frequency
domain the bandwidth of the corresponding filters increases for higher frequencies.
Triangular filters can be applied, for example.

The Discrete Cosine Transform (DCT) is a well-known algorithm from
image processing because of its low computational complexity and decorrela-
tion property. The DCT approximates the Karhunen-Loève Transform (KLT)
in the case of Markov random signals of first order [Britanak et al., 2006]
which are described later in Sect. 2.5.2. The KLT, also known as Principal
Component Analysis (PCA), projects input vectors onto directions efficient
for representation [Duda et al., 2001]. The DCT can be combined with a
dimension reduction by omitting the higher dimensions.
The DCT in (2.10) is applied to the logarithmic mel-energies. The coefficients
x^DCT_{l,t} of the resulting vector x^DCT_t are given by the formula

x^DCT_{l,t} = Σ_{l_1=1}^{d_LOG} x^LOG_{l_1,t} · cos(l · (l_1 − 1/2) · π / d_LOG),   l = 1, . . . , d_DCT,   (2.10)

as found by O'Shaughnessy [2000]. d denotes the dimensionality of the specified
vector. The first coefficient (l = 0) is typically not used because it
represents an average energy of x^LOG_t and therefore depends on the scaling
of the input signal [O'Shaughnessy, 2000]. It is replaced in this book by a
logarithmic energy value which is derived from x^MEL_t. The second coefficient
of the DCT can be interpreted as the ratio of the energy located at low
and high frequencies [O'Shaughnessy, 2000]. The resulting vector is denoted
by x^MFCC_t. A dimension reduction is used for all experiments in this book
and the remaining 11 MFCC features are employed.
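The DCT step (2.10) with the dimension reduction to 11 coefficients amounts to
a single matrix product. A sketch following the conventions of the text, with
the l = 0 term discarded (its replacement by a logarithmic energy value is
omitted here):

import numpy as np

def mfcc_from_log_mel(x_log, n_keep=11):
    """Apply the DCT of (2.10) to log mel energies and truncate.

    x_log: array of shape (num_frames, d_LOG), d_LOG = 19 here.
    Returns (num_frames, n_keep) vectors for l = 1, ..., n_keep.
    """
    d_log = x_log.shape[1]
    l = np.arange(1, n_keep + 1)[:, None]
    l1 = np.arange(1, d_log + 1)[None, :]
    dct_matrix = np.cos(l * (l1 - 0.5) * np.pi / d_log)
    return x_log @ dct_matrix.T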
Post Processing
In post processing a normalization of the developed feature vectors x^MFCC_t
is performed and a Linear Discriminant Analysis (LDA) is applied. Fig. 2.7
displays the corresponding block diagram comprising the Cepstral Mean Sub-
traction (CMS) and LDA.

Fig. 2.6 Block diagram of the MFCC feature extraction. The STFT is applied to
the windowed speech signal and the spectral envelope is further processed by the
mel filter bank. The discrete cosine transform is applied to the logarithm of the mel
energies. The MFCC feature vectors x^MFCC_t are usually computed by discarding
the first coefficient of x^DCT_t. A dimension reduction can be applied.

Fig. 2.7 Block diagram of the post processing in the front-end. The iteratively
estimated cepstral mean vector is subtracted from the MFCC vectors. In the case
of a speech recognizer an LDA is employed to increase the discrimination of
clusters in feature space, e.g. phonemes.
Mean subtraction is an elementary approach to compensate for channel
characteristics [Mammone et al., 1996]. It improves the robustness of speaker
identification and speech recognition [Reynolds et al., 2000; Young et al.,
2006]. For common speech recognizers it has been established as a standard
normalization method [Buera et al., 2007].
Noise reduction as a pre-processing diminishes the influence of background
noises on the speech signal whereas the room impulse response h_Ch in (2.4)
remains relatively unaffected. The logarithm of feature extraction splits the
speech signal and the room impulse response into additive components if the
channel transfer function is spectrally flat within the mel filters and if clean
speech is considered. When the room impulse response varies only slowly
in time relative to the speech signal, the long-term average of feature vec-
tor μ^MFCC is approximately based on the unwanted channel characteristics.
The subtraction of this mean vector from the feature vectors compensates
for unwanted channel characteristics such as the microphone transfer function
and leaves the envelope of the speech signal relatively unaffected. In addition,
the inter-speaker variability is reduced by mean normalization [Häb-Umbach,
1999].
To avoid latencies caused by the re-processing of entire utterances the
mean can be continuously adapted with each recorded utterance to track the
long-term average. A simple iterative mean computation is given by
2.2 Front-End 15
μ^MFCC_t = ((n_MFCC − 1) / n_MFCC) · μ^MFCC_{t−1} + (1 / n_MFCC) · x^MFCC_t,   t > 1   (2.11)

μ^MFCC_1 = x^MFCC_1   (2.12)

where the total number of feature vectors is denoted by n_MFCC.
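The recursion (2.11)–(2.12) amounts to a running average; a sketch is given
below. The limit on n_MFCC mentioned in the following paragraph is included as
a hypothetical parameter n_max.

import numpy as np

class CepstralMeanSubtraction:
    """Iterative cepstral mean estimate following (2.11) and (2.12)."""

    def __init__(self, n_max=1000):
        self.mean = None
        self.n = 0
        self.n_max = n_max    # limiting n yields a constant learning rate

    def process(self, x_mfcc):
        self.n = min(self.n + 1, self.n_max)
        if self.mean is None:
            self.mean = x_mfcc.copy()                                 # (2.12)
        else:
            self.mean = ((self.n - 1) * self.mean + x_mfcc) / self.n  # (2.11)
        return x_mfcc - self.mean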
Class et al. [1993, 1994] describe an advanced method to iteratively adapt
the mean subtraction. The mean of x^MFCC_t is realized by a recursion of first
order. A decay factor is chosen so that a faster adaptation during the enroll-
ment of new speakers is achieved. A slower constant learning rate is guaran-
teed afterwards.
In this book, the mean vector μ^MFCC_t is estimated according to (2.11).
μ^MFCC_1 and n_MFCC are initialized in an off-line training. Furthermore, n_MFCC
is limited to guarantee a constant learning rate in the long run. Mean nor-
malization is not applied to the first coefficient (l = 0) which is replaced by
a logarithmic energy estimate as explained above. A peak tracker detecting
the maxima of the energy estimate is used for normalization.
Further normalization techniques can be found by Barras and Gauvain
[2003]; Segura et al. [2004].
MFCCs capture only static speech features since the variations of one
frame are considered [Young et al., 2006]. The dynamic behavior of speech
can be partially captured by the first and second order time derivatives of
the feature vectors. These features are also referred to as delta features x^Δ_t
and delta-delta features x^{Δ²}_t and are known to significantly enhance speech
recognition accuracy [Young et al., 2006]. Delta features are computed by the
weighted sum of feature vector differences within a time window of 2 · N_Δ
frames given by

x^Δ_t = Σ_{l=1}^{N_Δ} l · (x^MFCC_{t+l} − x^MFCC_{t−l}) / (2 · Σ_{l=1}^{N_Δ} l²)   (2.13)
as found by Young et al. [2006]. The same algorithm can be applied to the
delta features to compute delta-delta features. The feature vectors used for
speech recognition may comprise static and dynamic features stacked in a
supervector.
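A direct transcription of (2.13) is given below; padding by repeating the first
and last frame is one common edge convention and an assumption here, not taken
from the text.

import numpy as np

def delta_features(x, n_delta=2):
    """Weighted feature differences of (2.13) over 2 * n_delta frames.

    x: feature matrix of shape (num_frames, dim).
    """
    padded = np.concatenate([x[:1].repeat(n_delta, axis=0), x,
                             x[-1:].repeat(n_delta, axis=0)])
    denom = 2.0 * sum(l * l for l in range(1, n_delta + 1))
    delta = np.zeros(x.shape)
    for l in range(1, n_delta + 1):
        delta += l * (padded[n_delta + l : n_delta + l + len(x)]
                      - padded[n_delta - l : n_delta - l + len(x)])
    return delta / denom

The same function applied to its own output yields the delta-delta features.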
Delta or delta-delta features are not computed in the speech recognizer
used for the experiments of this book. Instead, the preceding and subse-
quent 4 vectors are stacked together with the actual feature vector in a long
supervector. The dynamics of delta and delta-delta coefficients have been
incorporated into the LDA using a bootstrap training [Class et al., 1993].
An LDA is applied in order to obtain a better discrimination of clusters
in feature space, e.g. phonemes. It is a linear transform realized by a matrix
multiplication. Input vectors are projected onto those directions which are
efficient for discrimination [Duda et al., 2001]. The result is a compact rep-
resentation of each cluster with an improved spatial discrimination with re-
spect to other clusters. Furthermore, a dimension reduction can be achieved.
16 2 Fundamentals
Further details can be found by Duda et al. [2001]. The speech decoding
works on the resulting vectors x^LDA_t as explained later in Sect. 2.5.
An unsupervised system for speaker identification usually encounters difficulties
in defining reasonable clusters. The most obvious strategy is to cluster
the feature vectors of each speaker separately. Since the pool of speakers is
generally unknown a priori, speaker identification usually does not use an
LDA.
2.3 Speaker Change Detection
The feature extraction described in the preceding section provides a compact
representation of speech that allows further automated processing.
One application is speaker change detection which performs a segmentation
of an incoming speech signal into homogeneous parts in an unsupervised way.
The goal is to obtain precise boundaries of the recorded utterances so that
always only one speaker is enclosed. Then speaker identification and speech
recognition can be applied to utterances containing only one speaker.
In Sect. 3.1 several strategies for audio signal segmentation are outlined.
Many algorithms work on the result of a speaker change detection as the first
part of an unsupervised speaker tracking.
First, the motivation of speaker change detection is introduced and the cor-
responding problem is mathematically described. Then a detailed description
of one representative which is well-known from literature is presented.
2.3.1 Motivation
The task of speaker change detection is to investigate a stream of audio
data for possible speaker turns. For example, a buffered sequence of a data
stream is divided into segments of fixed length and each boundary has to be
examined for speaker turns. Thus, a representation of speech data suitable for
speaker change detection has to be found and a strategy has to be developed
to confirm or reject hypothetical speaker changes.
The front-end introduced in Sect. 2.2 extracts the relevant features from
the time signal and provides a continuous stream of MFCC feature vectors
for speech recognition and speaker identification. Ajmera et al. [2004] demon-
strate that MFCCs can also be used to detect speaker changes. Thus, only
MFCCs are considered to limit the computational complexity of the tar-
get system including self-learning speaker identification and speech recog-
nition. However, other features could be used as well in the subsequent
considerations.
In the following, a data stream is iteratively investigated frame by frame or
in data blocks for speaker turns. In Fig. 2.8 an example is shown. A sequence
of T frames is given in a buffer. At time instance t_Ch a speaker change is
hypothesized and the algorithm has to decide whether part I and part II
originate from one or two speakers. This procedure can be repeated for all
possible time instances 1 < t_Ch < T. In Sect. 3.1 several strategies are listed
to select hypothetical speaker changes.

Fig. 2.8 Problem of speaker change detection. A continuous data stream has to be
examined for all candidates where a speaker turn can occur. The basic procedure
checks a given time interval by iteratively applying metric-based, model-based or
decoder-guided techniques.
Here it is assumed that a hypothetical speaker turn at an arbitrary time
instance t_Ch has to be examined and that a binary hypothesis test has to be
done. Subsequently, the shorthand notation x_{1:t_Ch} is used for the sequence
of feature vectors {x_1, . . . , x_{t_Ch}}.
• Hypothesis H_0 assumes no speaker change so that both sequences of feature
vectors x_{1:t_Ch} and x_{t_Ch+1:T} originate from one speaker.
• Hypothesis H_1 presumes a speaker turn so that two speakers are responsible
for x_{1:t_Ch} and x_{t_Ch+1:T}.
The optimal decision of binary hypothesis tests can be realized by the
Bayesian framework. In the case of two competing hypotheses H_0 and H_1,
H_0 is accepted by the Bayesian criterion if the following inequality is valid
[Hänsler, 2001]:

p(x_{1:T} | H_0) / p(x_{1:T} | H_1) ≥ ((c_10 − c_11) · p(H_1)) / ((c_01 − c_00) · p(H_0)).   (2.14)

Here the parameters c_01, c_00, c_10 and c_11 denote the costs for falsely ac-
cepted H_1, truly detected H_0, falsely accepted H_0 and correctly detected H_1,
respectively. The probabilities p(H_0) and p(H_1) stand for the prior probabil-
ities of the specified hypothesis.
The left side of (2.14) is called the likelihood ratio. The right side contains
the prior knowledge and acts as a threshold. If the costs for correct decisions
equal zero and those for false decisions are equal, the threshold only depends
on the prior probabilities. For equal prior probabilities the criterion chooses
the hypothesis with the highest likelihood.
A statistical model is required to compute the likelihoods p(x_{1:T} | H_0)
and p(x_{1:T} | H_1). The existing algorithms for finding speaker turns can be
classified into metric-based, model-based and decoder-guided approaches as
found by Chen and Gopalakrishnan [1998].
In this context only one prominent representative of the model-based algo-
rithms is introduced and discussed. For further details, especially to metric-
based or decoder-guided algorithms, the interested reader is referred to the
literature listed in Sect. 3.1.
2.3.2 Bayesian Information Criterion
The so-called Bayesian Information Criterion (BIC) belongs to the most
commonly applied techniques for speaker change detection [Ajmera et al.,
2004]. There exist several BIC variants and therefore only the key issues are
discussed in this section.
Fig. 2.8 denotes the first part before the assumed speaker turn as I and the
second part as II. The binary hypothesis test requires the statistical models
for part I and II as well as for the conjunction of both parts I + II. The
statistical models for part I and II cover the case of two origins, e.g. two
speakers, and thus reflect the assumption of hypothesis H_1. The model built
on the conjunction presumes only one origin as is the case in H_0.
Parameter Estimation
Each part is modeled by one multivariate Gaussian distribution which can be
robustly estimated even on few data. Subsequently, μ, Σ, |Σ| and d denote
the mean vector, covariance matrix, the determinant of the covariance ma-
trix and the dimension of the feature vectors, respectively. The superscript
indices T and −1 are the transpose and inverse of a matrix. Θ is the param-
eter set of a multivariate Gaussian distribution comprising the mean vector
and covariance. The probability density function is given by
\[
p(x_t|\Theta) = \mathcal{N}\{x_t|\mu, \Sigma\}
= \frac{1}{(2\pi)^{0.5d}\,|\Sigma|^{0.5}}\cdot e^{-\frac{1}{2}(x_t-\mu)^T\cdot\Sigma^{-1}\cdot(x_t-\mu)}. \tag{2.15}
\]
In addition, all features are usually assumed to be independent and iden-
tically distributed (iid) random variables. Even though successive frames of
the speech signal are not statistically independent as shown in Sect. 2.5, this
assumption simplifies parameter estimation and likelihood computation but
is still efficient enough to capture speaker changes. Since the statistical de-
pendencies of the speech trajectory are neglected, the basic statistical models
can be easily estimated for each part of the data stream.
The mean vector and covariance matrix are unknown a priori and can be
determined by the Maximum Likelihood (ML) estimates [Bishop, 2007; Lehn
and Wegmann, 2000]. An ergodic random process is assumed so that the
expectation value can be replaced by the time average [Hänsler, 2001]. The
parameters for the statistical model of the hypothesis $H_0$ are given by
\[
\mu_{\mathrm{I+II}} = E_x\{x_t\} \approx \frac{1}{T}\sum_{t=1}^{T} x_t \tag{2.16}
\]
\[
\Sigma_{\mathrm{I+II}} = E_x\{(x_t-\mu_{\mathrm{I+II}})\cdot(x_t-\mu_{\mathrm{I+II}})^T\} \tag{2.17}
\]
\[
\approx \frac{1}{T}\sum_{t=1}^{T}(x_t-\mu_{\mathrm{I+II}})\cdot(x_t-\mu_{\mathrm{I+II}})^T. \tag{2.18}
\]
In the case of a speaker turn, the parameters are estimated separately on
part I and II as given by
\[
\mu_{\mathrm{I}} = E_x\{x_t\} \approx \frac{1}{t_{\mathrm{Ch}}}\sum_{t=1}^{t_{\mathrm{Ch}}} x_t \tag{2.19}
\]
\[
\Sigma_{\mathrm{I}} = E_x\{(x_t-\mu_{\mathrm{I}})\cdot(x_t-\mu_{\mathrm{I}})^T\} \tag{2.20}
\]
\[
\approx \frac{1}{t_{\mathrm{Ch}}}\sum_{t=1}^{t_{\mathrm{Ch}}}(x_t-\mu_{\mathrm{I}})\cdot(x_t-\mu_{\mathrm{I}})^T. \tag{2.21}
\]

The parameters $\mu_{\mathrm{II}}$ and $\Sigma_{\mathrm{II}}$ are estimated on part II in analogy to $\mu_{\mathrm{I}}$
and $\Sigma_{\mathrm{I}}$.
The likelihood computation of feature vector sequences becomes feasible
due to the iid assumption. The log-likelihood can be realized by the sum of
log-likelihoods of each time step:

\[
\log\left(p(x_{1:T}|\Theta)\right) = \sum_{t=1}^{T}\log\left(p(x_t|\Theta)\right). \tag{2.22}
\]
Now the likelihood values $p(x_{1:T}|H_0)$ and $p(x_{1:T}|H_1)$ can be given for both
hypotheses. For convenience, only log-likelihoods are examined:

\[
\log\left(p(x_{1:T}|H_0)\right) = \sum_{t=1}^{T}\log\left(\mathcal{N}\{x_t|\mu_{\mathrm{I+II}}, \Sigma_{\mathrm{I+II}}\}\right) \tag{2.23}
\]
\[
\log\left(p(x_{1:T}|H_1)\right) = \sum_{t=1}^{t_{\mathrm{Ch}}}\log\left(\mathcal{N}\{x_t|\mu_{\mathrm{I}}, \Sigma_{\mathrm{I}}\}\right)
+ \sum_{t=t_{\mathrm{Ch}}+1}^{T}\log\left(\mathcal{N}\{x_t|\mu_{\mathrm{II}}, \Sigma_{\mathrm{II}}\}\right), \tag{2.24}
\]

where the parameter set $\Theta$ is omitted.
Hypothesis Test
The hypothesis test can now be realized by using these likelihoods and a
threshold $\theta_{\mathrm{th}}$:

\[
\theta_{\mathrm{th}} \;\lessgtr\; \log\left(p(x_{1:T}|H_1)\right) - \log\left(p(x_{1:T}|H_0)\right). \tag{2.25}
\]
$\theta_{\mathrm{th}}$ replaces the costs and prior probabilities in (2.14) and has to be determined experimentally. This approach is known as the Log Likelihood Ratio (LLR) algorithm [Ajmera et al., 2004].

The BIC criterion extends this technique by introducing a penalty
term and does not need a threshold $\theta_{\mathrm{th}}$. The hypothesis $H_1$ assumes a speaker
turn and therefore applies two Gaussians in contrast to $H_0$. The models
of $H_1$ are expected to lead to a better representation of the observed speech
data since the number of model parameters is doubled. The penalty term $P$
compensates this inequality and guarantees an unbiased test [Ajmera et al.,
2004]. A speaker turn is assumed if the following inequality is valid:

\[
\log\left(p(x_{1:T}|H_1)\right) - \log\left(p(x_{1:T}|H_0)\right) - P \geq 0, \tag{2.26}
\]
where the penalty

\[
P = \frac{1}{2}\cdot\left(d + \frac{1}{2}\,d(d+1)\right)\cdot\log T \tag{2.27}
\]

comprises the number of parameters which are used to estimate the mean
and covariance. $P$ may be multiplied by a tuning parameter $\lambda$ which should
theoretically be $\lambda = 1$ as found by Ajmera et al. [2004].
The training and test data of the Gaussian distributions are identical and
permit a more efficient likelihood computation. The BIC criterion only requires computing the covariances of both parts and their conjunction as
found by Zhu et al. [2005]. Thus, the inequality

\[
T\cdot\log|\Sigma_{\mathrm{I+II}}| - t_{\mathrm{Ch}}\cdot\log|\Sigma_{\mathrm{I}}| - (T-t_{\mathrm{Ch}})\cdot\log|\Sigma_{\mathrm{II}}| - \lambda\cdot P \leq 0 \tag{2.28}
\]

is also known as the variance BIC [Nishida and Kawahara, 2005]. If the
factor $\lambda$ is equal to zero, equation (2.28) describes the LLR algorithm [Ajmera
et al., 2004].
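Putting (2.26)-(2.28) together, a hypothesized speaker turn can be checked in a few lines. The following sketch builds on the segment_stats helper above and evaluates the log-determinant form of the likelihood difference together with the penalty (2.27); with lam set to zero the decision reduces to an LLR test against a zero threshold.

```python
import numpy as np

def bic_speaker_turn(x, t_ch, lam=1.0):
    """BIC test for a speaker turn at frame t_ch, cf. (2.26)-(2.28).

    A sketch under the assumptions of this section: one full-covariance
    Gaussian per part and iid frames; x is a (T, d) array.
    """
    T, d = x.shape
    stats = segment_stats(x, t_ch)   # sketch from above
    sig_all = stats["I+II"][1]
    sig_1, sig_2 = stats["I"][1], stats["II"][1]
    # Log-likelihood difference of H1 and H0: for ML covariances the
    # Mahalanobis terms cancel, leaving only the log-determinants.
    delta = 0.5 * (T * np.linalg.slogdet(sig_all)[1]
                   - t_ch * np.linalg.slogdet(sig_1)[1]
                   - (T - t_ch) * np.linalg.slogdet(sig_2)[1])
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(T)  # (2.27)
    return delta - lam * penalty >= 0.0   # speaker turn assumed, cf. (2.26)
```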
The requirement to estimate robust parameters on limited data does not
allow using more complex statistical models. Even though Gaussian distri-
butions only consist of one mean vector and one covariance matrix, short
utterances, e.g. ≤ 2 sec, seem not to provide enough information for a reli-
able estimation as the results of Zhou and Hansen [2000] suggest.
The complexity of the statistical models can be further reduced by using
diagonal covariances. The elements of diagonal covariances are equal to zero
except the main diagonal. $\Sigma_{\mathrm{diag}}(l_r, l_c)$ denotes the matrix element of row $l_r$
and column $l_c$:
\[
\Sigma_{\mathrm{diag}}(l_r, l_c) = \delta_K(l_r, l_c)\cdot\sigma_{l_r}, \qquad
\delta_K(l_r, l_c) = \begin{cases} 0 & \text{if } l_r \neq l_c \\ 1 & \text{if } l_r = l_c \end{cases}. \tag{2.29}
\]
$\sigma$ and $\delta_K$ denote the scalar variance and the Kronecker delta, respectively.
This assumption is a trade-off between the reduced number of parameters
and statistical modeling accuracy.
Zhou and Hansen [2000] suggest using the Hotelling measure instead of a
Gaussian distribution for very short utterances. Their algorithm estimates
only the mean vectors $\mu_{\mathrm{I}}$, $\mu_{\mathrm{II}}$ and $\mu_{\mathrm{I+II}}$ separately. The covariance $\Sigma_{\mathrm{I+II}}$
replaces $\Sigma_{\mathrm{I}}$ and $\Sigma_{\mathrm{II}}$ and simplifies the covariance estimation.
In this context it should be emphasized that the use case of this book is a
command and control application. The average duration can be less than 2 sec
which limits the use of complex algorithms for speaker change detection. Since
short utterances do not cover all phonemes sufficiently, text-dependency can
be a problem. The variation of the utterance duration is expected to be
high so that the comparison of such utterances seems to be difficult [Nishida
and Kawahara, 2005]. A single Gaussian distribution might not capture
the variations sufficiently [Nishida and Kawahara, 2005], which is
undesirable with respect to the limited training data. Alternatively, more
sophisticated statistical models incorporating prior knowledge about speech
and speaker characteristics can be used as found by Malegaonkar et al. [2007].
Adverse environments such as in-car applications contain time-variant
background noises. Due to the low complexity of the statistical models previ-
ously introduced, it seems to be difficult to distinguish properly between the
changes of the acoustic environment and speaker changes in an unsupervised
manner.
2.4 Speaker Identification
The discussion of speaker changes is now extended to speaker identifica-
tion and the question who is speaking is addressed in this section. Speaker
identification is an essential part of the target system since speakers should
be tracked across several speaker changes. Speech recognition can then be
enhanced for a particular speaker by individual long-term adaptation as ex-
plained in the following sections.
Furthermore, a technique is necessary to detect whether a speaker is known
to the system. Only in combination with a robust detection of new users can
a completely unsupervised system be achieved which is able to enroll speakers
without any additional training.
In this section a short introduction to speaker identification is given based on
a selection of the relevant literature. The problem of speaker identification
is formulated and a brief survey of the main applications and strategies is
provided in Sect. 2.4.1. The main representative of the statistical models
used for speaker identification is explained in Sect. 2.4.2. The training of the
statistical model and speaker identification at run-time are described in more
detail. Finally, some techniques are discussed to detect unknown speakers.
2.4.1 Motivation and Overview
Speaker identification has found its way into a variety of speech controlled
applications such as automatic transcription of meetings and conferences,
information retrieval systems, security applications, access control and law
enforcement [Gish and Schmidt, 1994]. The main task is to identify speakers
by their voice characteristics.
Probabilistic approaches determine statistical models for each speaker so
that this problem can be solved in a statistical framework. Then speaker
identification can be realized by the Maximum A Posteriori (MAP) crite-
rion [Gish and Schmidt, 1994; Reynolds and Rose, 1995]. Speaker identifica-
tion has to find the speaker who maximizes the posterior probability of being
the origin of the recorded utterance:
\[
i_{\mathrm{MAP}} = \arg\max_{i}\{p(i|x_{1:T})\}. \tag{2.30}
\]

The variables $i$ and $i_{\mathrm{MAP}}$ denote the speaker index and the corresponding MAP
estimate.
The current utterance is characterized by a sequence of feature vectors $x_{1:T}$. For example, MFCC features or mean normalized MFCCs computed by the front-end may be employed for speaker identification [Reynolds,
1995a; Reynolds et al., 2000]. Delta features may be used in addition [Reynolds et al., 2000].
The MAP criterion can be expressed depending on the likelihood $p(x_{1:T}|i)$
and the prior probability $p(i)$ instead of the posterior probability $p(i|x_{1:T})$
by applying Bayes' theorem.
Bayes’ theorem [Bronstein et al., 2000] relates the joint probability p(a, b),
conditional probabilities p(a|b) and p(b|a) as well as the prior probabili-
ties p(a) and p(b) for two arbitrary random variables a and b. It can be
expressed by a multiplication of prior and conditional probabilities:
\[
p(a, b) = p(a|b)\cdot p(b) = p(b|a)\cdot p(a). \tag{2.31}
\]
The combination of (2.30) and (2.31) likewise describes the identification
problem depending on the likelihood. The density function of the spoken
phrase $p(x_{1:T})$ is speaker independent and can be discarded. The resulting
criterion is based on a multiplication of the likelihood, which describes the
match between the speaker model and the observed data, and the prior probability for a particular speaker:

\[
i_{\mathrm{MAP}} = \arg\max_{i}\{p(x_{1:T}|i)\cdot p(i)\}. \tag{2.32}
\]
Depending on the application different approaches concerning the statis-
tical models and training algorithms exist:
Speaker identification can be separated from speaker verification [Camp-
bell, 1997]. Speaker identification has to recognize the voice of a current user
without any prior knowledge about his identity whereas speaker verification
accepts or rejects a claimed identity. Thus, speaker verification is typically
applied to authentication problems [Campbell, 1997]. Speech plays an impor-
tant role in biometric systems such as financial transactions or access control
in security relevant areas [Campbell, 1997]. Speech can serve as a biometric key
to secure access to computer networks and to ensure information security [Reynolds and Carlson, 1995]. The user enters an identity claim and is
asked to speak a prompted text or a password [Campbell, 1997]. For example,
speaker verification compares the observed voice characteristics with stored
voice patterns of the claimed speaker identity and speech recognition checks
the spoken password.
Speaker identification enables the personalization and optimization of
speech controlled systems by accounting for user specific habits. Usability
can be increased by speaker-adapted user-machine interfaces. Reynolds et al.
[2000] provide an introduction into speaker verification which widely applies
to speaker identification as well.
Speaker identification can be further subdivided into closed-set and open-
set systems [Gish and Schmidt, 1994]. In closed-set scenarios the current
user has to be assigned to one speaker out of a pool of enrolled speakers. The
criterion for speaker selection can be given by the MAP criterion in (2.30).
In open-set scenarios identification has not only to determine the speaker’s
identity but has also to decide whether the current user is an enrolled speaker
or an unknown speaker.
A further discrimination is given by the general type of statistical model
such as non-parametric and parametric models [Gish and Schmidt, 1994].
Non-parametric techniques usually admit any kind of distribution function to
represent the measured distribution. Clustering algorithms such as k-means
or Linde Buzo Gray (LBG) [Linde et al., 1980] are applied to capture speaker
specific centers of gravity in the feature space. At run-time speaker identifi-
cation algorithms, e.g. the Vector Quantization (VQ), rely on distance mea-
sures to find the speaker model with the highest match as found by Gish and
Schmidt [1994].
Parametric approaches presume a special kind of statistical model a priori
and only have to estimate the corresponding parameters. Gish and Schmidt
[1994] provide a general description of different speaker identification tech-
niques and refer to parametric and non-parametric identification.
Several strategies exist for the training of statistical models. The decision
in (2.30) can be directly realized by discriminative speaker models [Bishop,
2007]. The goal is to learn how to separate enrolled speakers without the need
to train individual statistical models for each speaker. However, training has
to be re-computed when a new speaker is enrolled. The Multilayer Perceptron (MLP), as a special case of the Neural Networks (NN) framework^6, is
a well-known discriminative model. For example, MLPs have been used for
speaker identification as found by Genoud et al. [1999]; Morris et al. [2005].
Reynolds and Rose [1995] employ Radial Basis Functions (RBF) as a ref-
erence implementation for their proposed speaker identification framework
based on generative models.
In contrast to discriminative models, generative speaker models individu-
ally represent the characteristics of the target speaker. They are independent
from non-target speakers. The pool of enrolled speakers can be extended it-
eratively without re-training of all speaker models. The speaker whose model
yields the best match with the observed feature vectors is considered as the
target speaker. Reynolds et al. [2000]; Reynolds and Rose [1995] investigate
the statistical modeling of speaker characteristics with Gaussian mixture
models, which is one of the standard approaches for speaker identification
based on generative speaker modeling.
The final design of a speaker identification implementation depends on
the application. For verification the current user might be asked to read a
prompted text so that the content of the utterance is known. Another exam-
ple may be limited vocabulary, e.g. digits, in special applications. This con-
straint is known as text-dependent speaker identification or verification [Gish
and Schmidt, 1994]. Kimball et al. [1997]; Reynolds and Carlson [1995] de-
scribe two approaches for text-dependent realizations to verify a speaker’s
identity. Text-prompted systems [Che et al., 1996; Furui, 2009], e.g. for access
control, can ask the user to speak an utterance that is prompted on a dis-
play. This minimizes the risk that an impostor plays back an utterance via
loudspeaker. A larger vocabulary has to be recognized but the transcription
is still known. In contrast, text-independent identification has to handle ar-
bitrary speech input [Gish and Schmidt, 1994; Kwon and Narayanan, 2005].
This identification task requires different statistical approaches.
The scope of this book requires a complete system characterized by a
high flexibility to integrate new speakers and to handle individually trained
speaker models. Users are not restricted to special applications or a given
vocabulary so that speaker identification has to be operated in a text-
independent mode. Furthermore, applications in an embedded system impose
constraints concerning memory consumption and computational load. These
requirements suggest a speaker identification approach based on generative
speaker models. Thus, only the most dominant generative statistical model is
introduced in Sect. 2.4.2. For a more detailed survey on standard techniques
for speaker identification or verification, the reader is referred to Campbell [1997];
Gish and Schmidt [1994].
^6 Bishop [1996, 2007] provides an extensive investigation of the NN framework and MLPs, in particular.
2.4.2 Gaussian Mixture Models
Gaussian Mixture Models (GMMs) have emerged as the dominating
generative statistical model in the state-of-the-art of speaker identifica-
tion [Reynolds et al., 2000]. GMMs are attractive statistical models because
they can represent various probability density functions if a sufficient number
of parameters can be estimated [Reynolds et al., 2000].
GMMs comprise a set of N multivariate Gaussian density functions subse-
quently denoted by the index k. The resulting probability density function of
a particular speaker model $i$ is a convex combination of all density functions.
GMMs are constructed on standard multivariate Gaussian densities as
given in (2.15) but introduce the component index k as a latent variable with
the discrete probability $p(k|i)$. The weights

\[
w_k^i = p(k|i) \tag{2.33}
\]

characterize the prior contribution of the corresponding component to the
density function of a GMM and fulfill the condition

\[
\sum_{k=1}^{N} w_k^i = 1. \tag{2.34}
\]
Each Gaussian density represents a conditional density function $p(x_t|k, i)$.
According to Bayes' theorem, the joint probability density function $p(x_t, k|i)$
is given by the multiplication of both. The sum over all densities results in
the multi-modal probability density of GMMs:

\[
p(x_t|\Theta_i) = \sum_{k=1}^{N} p(k|\Theta_i)\cdot p(x_t|k, \Theta_i) \tag{2.35}
\]
\[
= \sum_{k=1}^{N} w_k^i\cdot\mathcal{N}\{x_t|\mu_k^i, \Sigma_k^i\}. \tag{2.36}
\]
Each component density is completely determined by its mean vector $\mu_k$
and covariance matrix $\Sigma_k$. The parameter set

\[
\Theta_i = \left\{w_1^i, \ldots, w_N^i,\; \mu_1^i, \ldots, \mu_N^i,\; \Sigma_1^i, \ldots, \Sigma_N^i\right\} \tag{2.37}
\]

contains the weighting factors, mean vectors and covariance matrices for a
particular speaker model $i$. $x_t$ denotes the feature vector.
Diagonal covariance matrices in contrast to full matrices are advanta-
geous because of their computational efficiency [Reynolds et al., 2000]. Multi-
variate Gaussian distributions with diagonal covariance matrices comprise
uncorrelated univariate distributions and even imply statistical indepen-
dence [Hänsler, 2001]. The loss of information can be compensated by a larger
set of multivariate Gaussian distributions so that correlated features can be
accurately modeled by GMMs [Reynolds et al., 2000]. Because of the DCT
a low correlation of the MFCC features can be assumed [Quatieri, 2002]. Fi-
nally, GMMs based on diagonal covariances have been found to outperform
realizations with full covariance matrices [Quatieri, 2002; Reynolds et al.,
2000].
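With diagonal covariances the GMM likelihood (2.36) can be evaluated without any matrix inversion. The following sketch, with illustrative parameter names and numpy arrays assumed, computes the log-likelihood of a single frame and uses the log-sum-exp trick for numerical stability.

```python
import numpy as np

def gmm_log_likelihood_diag(x_t, weights, means, variances):
    """Log-likelihood log p(x_t|i) of one frame under a diagonal GMM.

    A sketch of (2.36): weights has shape (N,), means and variances
    shape (N, d); all names are illustrative.
    """
    d = x_t.shape[0]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x_t - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp   # N weighted components
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))   # stable log-sum-exp
```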
Fig. 2.9 exemplifies the likelihood function for a GMM comprising 4 Gaus-
sian distributions with diagonal covariance matrices.
Fig. 2.9 Likelihood function of a GMM comprising 4 Gaussian densities. For convenience, two-dimensional mean and feature vectors are chosen. $x_1$ and $x_2$ denote
the elements of the feature vector.
The training of GMMs and speaker identification at run-time are described
in the following.
Training
Subsequently, it is assumed that speaker specific data can be employed to
train a GMM for each speaker. The goal of the training is to determine
a parameter set Θ that optimally represents the voice characteristics of a
particular speaker. For convenience, the speaker index is omitted.
If no prior knowledge about the parameter set is available, the optimization
problem is to find a parameter set $\Theta_{\mathrm{ML}}$ which maximizes the likelihood

\[
\left.\frac{d}{d\Theta}\,p(x_{1:T}|\Theta)\right|_{\Theta=\Theta_{\mathrm{ML}}} = 0 \tag{2.38}
\]
for a given training data set $x_{1:T}$. Due to the latent variable $k$, this equation
cannot be solved analytically for the parameters of $p(x_{1:T}|\Theta)$, as described in
Sect. A.1.
The standard approximation is known as the Expectation Maximiza-
tion (EM) algorithm [Dempster et al., 1977] which is characterized by an
iterative procedure. Each iteration provides a new parameter set $\Theta$ starting
from the parameter set of the preceding iteration $\bar{\Theta}$ or an initial set. The
likelihood increases with each iteration until a maximum is reached. The
likelihood acts as stopping condition. Each iteration consists of the following
two steps:
• In the E-step all feature vectors are assigned to each Gaussian distribution
by computing the posterior probability $p(k|x_t, \bar{\Theta})$.
• In the M-step new ML estimates of the parameters are calculated based
on the assignment of the E-step. An improved statistical modeling is therefore achieved by each iteration.
The EM algorithm is equivalently described by maximizing the auxiliary
function

\[
Q_{\mathrm{ML}}(\Theta, \bar{\Theta}) = E_{k_{1:T}}\left\{\log\left(p(x_{1:T}, k_{1:T}|\Theta)\right)\,\middle|\,x_{1:T}, \bar{\Theta}\right\}, \qquad k_{1:T} = \{k_1, \ldots, k_T\} \tag{2.39}
\]
with respect to $\Theta$ as found by Bishop [2007]; Gauvain and Lee [1994]. Under
the iid assumption, maximizing

\[
Q_{\mathrm{ML}}(\Theta, \bar{\Theta}) = \sum_{t=1}^{T}\sum_{k=1}^{N} p(k|x_t, \bar{\Theta})\cdot\log\left(p(x_t, k|\Theta)\right) \tag{2.40}
\]
leads to a locally optimal solution. However, it does not need to be the global
optimum since the EM algorithm cannot distinguish between local and global
maxima [Bishop, 2007]. Therefore different initialization strategies are de-
scribed in the literature. Reynolds and Rose [1995] show that randomized
initializations yield results comparable to more sophisticated strategies.
In the E-step each feature vector is assigned to all Gaussian densities by
the posterior probability

\[
p(k|x_t, \bar{\Theta}) = \frac{\bar{w}_k\cdot\mathcal{N}\{x_t|\bar{\mu}_k, \bar{\Sigma}_k\}}{\sum_{l=1}^{N}\bar{w}_l\cdot\mathcal{N}\{x_t|\bar{\mu}_l, \bar{\Sigma}_l\}} \tag{2.41}
\]
according to the importance for the training of a particular Gaussian density
as found by Reynolds and Rose [1995]. The number of softly assigned
feature vectors is denoted by

\[
n_k = \sum_{t=1}^{T} p(k|x_t, \bar{\Theta}). \tag{2.42}
\]
In the M-step the mean vectors, covariance matrices and weights are recalculated

\[
\mu_k^{\mathrm{ML}} = \frac{1}{n_k}\sum_{t=1}^{T} p(k|x_t, \bar{\Theta})\cdot x_t \tag{2.43}
\]
\[
\Sigma_k^{\mathrm{ML}} = \frac{1}{n_k}\sum_{t=1}^{T} p(k|x_t, \bar{\Theta})\cdot x_t\cdot x_t^T - \mu_k^{\mathrm{ML}}\cdot(\mu_k^{\mathrm{ML}})^T \tag{2.44}
\]
\[
w_k^{\mathrm{ML}} = \frac{n_k}{T} \tag{2.45}
\]
based on the result of the E-step as found by Reynolds and Rose [1995]. If the
covariance matrices are realized by diagonal matrices, only the main diagonal
in (2.44) is retained. A lower bound should be assigned to avoid singularities
in the likelihood function leading to infinite likelihood values [Bishop, 2007].
A more detailed description of GMMs and training algorithms can be found
by Bishop [2007].
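A single EM iteration for a diagonal-covariance GMM can be sketched as follows; the variance floor reflects the lower bound mentioned above, and all parameter names are illustrative.

```python
import numpy as np

def em_iteration(x, weights, means, variances, var_floor=1e-3):
    """One EM iteration for a diagonal-covariance GMM, cf. (2.41)-(2.45).

    A simplified sketch: x is a (T, d) array of feature vectors.
    """
    T, d = x.shape
    # E-step: posterior p(k|x_t) for every frame and component, cf. (2.41).
    log_comp = (np.log(weights)
                - 0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
                - 0.5 * np.sum((x[:, None, :] - means) ** 2 / variances, axis=2))
    log_comp -= log_comp.max(axis=1, keepdims=True)
    post = np.exp(log_comp)
    post /= post.sum(axis=1, keepdims=True)            # (T, N)
    # M-step: re-estimate weights, means and variances, cf. (2.42)-(2.45).
    n_k = post.sum(axis=0)                             # (2.42)
    new_means = post.T @ x / n_k[:, None]              # (2.43)
    second_moment = post.T @ (x ** 2) / n_k[:, None]
    new_vars = np.maximum(second_moment - new_means ** 2, var_floor)  # (2.44)
    new_weights = n_k / T                              # (2.45)
    return new_weights, new_means, new_vars
```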
Evaluation
Speaker identification has to decide which user is speaking. Again, each utterance is characterized by a sequence of feature vectors $x_{1:T}$ and statistical
dependencies between successive time instances are neglected by the iid assumption. For each speaker $i$ the log-likelihood
\[
\log\left(p(x_{1:T}|i)\right) = \sum_{t=1}^{T}\log\left(\sum_{k=1}^{N} w_k^i\cdot\mathcal{N}\{x_t|\mu_k^i, \Sigma_k^i\}\right) \tag{2.46}
\]
is calculated and the MAP criterion given by (2.32) is applied. The parameter
set $\Theta$ is omitted. The speaker with the highest posterior probability $p(i|x_{1:T})$
is identified. If the prior probability for the enrolled speakers is unknown or
uniformly distributed, the ML criterion

\[
i_{\mathrm{ML}} = \arg\max_{i}\{p(x_{1:T}|i)\} \tag{2.47}
\]

selects the speaker model with the highest match [Reynolds and Rose, 1995].
Under optimal conditions excellent speaker identification rates can be obtained as demonstrated by Reynolds [1995a].
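A minimal scoring loop for closed-set identification, assuming uniform priors so that the ML criterion (2.47) applies, might look as follows. It reuses the gmm_log_likelihood_diag sketch from above and accumulates per-frame log-likelihoods according to (2.46).

```python
def identify_speaker(x, speaker_models):
    """ML speaker identification over a pool of GMMs, cf. (2.46)-(2.47).

    A sketch: speaker_models maps a speaker index to the tuple
    (weights, means, variances) of a trained diagonal GMM.
    """
    scores = {}
    for i, (weights, means, variances) in speaker_models.items():
        # iid assumption: sum the per-frame log-likelihoods, cf. (2.46).
        scores[i] = sum(gmm_log_likelihood_diag(x_t, weights, means, variances)
                        for x_t in x)
    return max(scores, key=scores.get), scores
```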
Knowing the speaker identity enables speaker specific speech recognition
as described in Sect. 3.3 or offers the possibility to further adapt the corre-
sponding GMMs as explained in Sect. 2.6.
2.4.3 Detection of Unknown Speakers
The detection of unknown speakers is a critical issue for open-set speaker
identification. A generative statistical model for a closed set of enrolled speak-
ers was explained in Sect. 2.4.2. Each speaker participates in a training pro-
cedure in which speaker specific utterances are used to train speaker models.
Detecting unknown or out-of-set speakers is difficult since unknown speakers
cannot be modeled explicitly. Thus, another strategy is necessary to extend
the GMM based speaker identification for out-of-set speakers.
A simple way is to introduce a threshold $\theta_{\mathrm{th}}$ for the absolute log-likelihood
values given by (2.46) as found by Fortuna et al. [2005]. The likelihood describes the match between a given statistical model and the observed data.
If the speaker's identity does not correspond to a particular speaker model, a
low likelihood value is expected. If all log-likelihoods of the enrolled speakers
fall below a predetermined threshold

\[
\log\left(p(x_{1:T}|\Theta_i)\right) \leq \theta_{\mathrm{th}}, \quad \forall i, \tag{2.48}
\]
an unknown speaker has to be assumed.
However, in-car applications suffer from adverse environmental conditions
such as engine noises. High fluctuations of the absolute likelihood are probable
and may affect the threshold decision.
Advanced techniques may use normalization techniques comprising a Uni-
versal Background Model (UBM) as found by Angkititrakul and Hansen
[2007]; Fortuna et al. [2005] or cohort models [Markov and Nakagawa, 1996].
The UBM is a speaker independent model which is trained on a large group of
speakers. Instead of a threshold for log-likelihoods, the log-likelihood ratios of
the speaker models and UBM can be examined for out-of-set detection [Fortuna et al., 2005]. If the following inequality

\[
\log\left(p(x_{1:T}|\Theta_i)\right) - \log\left(p(x_{1:T}|\Theta_{\mathrm{UBM}})\right) \leq \theta_{\mathrm{th}}, \quad \forall i \tag{2.49}
\]

is valid for all speaker models, an unknown speaker is likely. Alternatively,
the ML criterion can be applied [Zhang et al., 2000].
The advanced approach has the advantage of lowering the influence of
events that affect all statistical models in a similar way. For example, phrases
spoken in an adverse environment may cause a mismatch between the speaker
models and the audio signal due to background noises. Furthermore, text-
dependent fluctuations in a spoken phrase, e.g. caused by unseen data or
the training conditions, can be reduced [Fortuna et al., 2005; Markov and
Nakagawa, 1996]. In those cases the log-likelihood ratio appears to be more
robust than absolute likelihoods.
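The ratio test (2.49) then only requires the scores of the enrolled models and of the UBM. A minimal sketch, assuming the score dictionary returned by the identification sketch above:

```python
def detect_out_of_set(scores, ubm_score, theta_th):
    """Open-set decision via log-likelihood ratios against a UBM, cf. (2.49).

    scores maps speaker indices to log-likelihoods; ubm_score is the UBM
    log-likelihood of the same frames; names are illustrative.
    """
    llr = {i: s - ubm_score for i, s in scores.items()}
    if all(r <= theta_th for r in llr.values()):
        return None                     # unknown speaker likely
    return max(llr, key=llr.get)        # best enrolled speaker
```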
Speaker identification for known and unknown speakers can be implemented
as a two-stage approach. In the first stage the most probable enrolled speaker
is determined. Then the speaker identity is accepted or rejected by speaker ver-
ification based on the techniques described above [Angkititrakul and Hansen,
2007].
2.5 Speech Recognition
In the preceding sections several strategies were discussed to track different
speakers. The algorithms presented try to recognize either a speaker turn or
the speaker identity. Now the question has to be answered how a spoken
utterance can be decoded and understood, since the target system
shall be operated by speech commands. In this context the influence of the
speaker on speech recognition is temporarily neglected. Speaker variability
will be explicitly addressed in the next section.
First, speech recognition is introduced from a general point of view and the
problem of speech decoding is described mathematically. Then a statistical
model is presented that has been established as a widely-used approach in
Automated Speech Recognition (ASR) [Bishop, 2007]. Finally, the basic components and the architecture of the speech recognizer used for the experiments
of this book, and applied throughout the following chapters, are considered
in more detail.
2.5.1 Motivation
For a speech controlled system it is essential to communicate with the user.
On the one hand information has to be presented visually or acoustically to
the user. On the other hand the user should be able to control the device
via spoken commands. Hands-free telephony or the operation of a navigation
system are in-car applications where speech recognition enables the user to
speak a sequence of digits, select an entry from an address book or to enter
city names, for example.
Speech recognition can be formulated as a MAP estimation problem. A
word sequence $W_{1:N_W} = \{W_1, W_2, \ldots, W_{N_W}\}$ has to be assigned to an utterance or equivalently a transcription of the spoken phrase has to be generated. $N_W$ denotes the variable number of words within a single utterance.
The most probable word sequence

\[
W_{1:N_W}^{\mathrm{MAP}} = \arg\max_{W_{1:N_W}}\{p(x_{1:T}|W_{1:N_W})\cdot p(W_{1:N_W})\} \tag{2.50}
\]
has to be determined out of all possible sequences [Schukat-Talamazzini,
1995; Setiawan et al., 2009]. The speech characteristics are represented by
a sequence of feature vectors $x_{1:T}$ which is extracted in the front-end of the
speech recognizer. For convenience, the LDA features $x_t^{\mathrm{LDA}}$ introduced in
Sect. 2.2 are used for speech recognition. The index LDA is omitted. The
likelihood $p(x_{1:T}|W_{1:N_W})$ determines the match between the acoustic speech
signal and the statistical model of a particular word sequence. The probability $p(W_{1:N_W})$ represents the speech recognizer's prior knowledge about
given word sequences. This prior probability reflects how often these word
given word sequences. This prior probability reflects how often these word
sequences occur in a given language and whether this sequence is reasonable
with respect to the grammar of a particular language model.
2.5.2 Hidden Markov Models
As discussed so far, speech characteristics and speaker variability were rep-
resented by particular time instances. Only the static features are tracked by
GMMs. Statistical dependencies of successive time instances are usually not
modeled because of the iid assumption. The dynamic of speech is therefore
neglected.
Speech signals can be decomposed into phonemes as the smallest possible
entity [Schukat-Talamazzini, 1995]. During speech production the vocal tract
and the articulatory organs cannot change arbitrarily fast [O’Shaughnessy,
2000]. Phonemes can be defined by a characterizing sequence of states and
transitions [O’Shaughnessy, 2000]. This becomes evident from Sect. 2.1 since
vowels, nasal consonants, fricatives and plosives are produced by a specific
configuration of the articulatory organs and excitation.
For example, 3 states can be used to represent the context to the prior
phoneme, the steady-state of the actual phoneme and the context to the
successor [O’Shaughnessy, 2000]. This motivates integrating the dynamic
behavior of speech into the statistical model.
GMMs can be extended by an underlying hidden statistical model to rep-
resent states and transitions. In a first step this underlying model is described
and in a second step both models are combined. The result is the so-called
Hidden Markov Model (HMM).
Markov Model
The Markov model is the statistical description of a system which can assume
different states. It is characterized by a temporal limitation of the statistical
dependencies [Hänsler, 2001]. Markov models are also called Markov models
of first order if each transition only depends on the preceding and current
state [Hänsler, 2001]. Statistical dependencies across multiple time instances
are ignored.
A Markov model is defined by a set of states $s$, transition probabilities and
an initialization. The joint probability $p(s_t, s_{t-1})$ is split by Bayes' theorem
into a transition probability $p(s_t|s_{t-1})$ and the probability of the preceding
state $p(s_{t-1})$. Finally, $p(s_t, s_{t-1})$ is reduced to the probability $p(s_t)$ by the
sum over all previous states $s_{t-1}$:
\[
p(s_t) = \sum_{s_{t-1}=1}^{M} p(s_t|s_{t-1})\cdot p(s_{t-1}), \quad t > 1, \qquad
a_{j_1 j_2} = p(s_t = j_2|s_{t-1} = j_1), \tag{2.51}
\]
where $M$ denotes the number of states. The transition probability is $a_{j_1 j_2}$
and the initial probability of a particular state is given by $p(s_1)$.
Fig. 2.10 displays a Markov chain as a feed-forward realization. The states
are arranged in a chain and transitions can only occur either back to the
original state or to one of the next states in a forward direction.
Fig. 2.10 Example for a Markov chain comprising 3 states organized as a feed-forward chain. The transition probabilities $a_{j_1 j_2}$ denote the transitions from state $j_1$
to state $j_2$.
Speech production of single phonemes can be modeled, e.g. by a feed-
forward chain as given in Fig. 2.10. State 2 denotes the steady state of a
phoneme whereas the states 1 and 3 represent the preceding and subse-
quent phoneme, respectively. However, variations of the pronunciation are not
captured.
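For illustration, the feed-forward topology of Fig. 2.10 corresponds to an upper-triangular transition matrix. The probability values below are made up and merely satisfy the row-sum constraint.

```python
import numpy as np

# Transition matrix of the 3-state feed-forward chain in Fig. 2.10:
# only self-loops and forward transitions carry probability mass.
A = np.array([[0.6, 0.3, 0.1],    # a11, a12, a13
              [0.0, 0.7, 0.3],    # a22, a23
              [0.0, 0.0, 1.0]])   # a33 (final state)
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a distribution
```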
Fusion of Markov Model and GMM
An HMM fuses a Markov model and one or more GMMs. The Markov model
represents the hidden unobserved random process and its states introduce
the latent variable $s_t$. The emission probability represents the distribution of
the observations belonging to the state $s_t$. It is modeled by a GMM which
is subsequently called codebook. Bayes' theorem enables the combination of
both random processes in analogy to Sect. 2.4.2. The final probability density
function is composed by the superposition of all states of the Markov model
and the Gaussian density functions of the GMM:
\[
p(x_t|\Theta) = \sum_{s_t=1}^{M}\sum_{k=1}^{N} p(x_t|k, s_t, \Theta)\cdot p(k, s_t|\Theta) \tag{2.52}
\]
\[
= \sum_{s_t=1}^{M}\sum_{k=1}^{N} p(x_t|k, s_t, \Theta)\cdot p(k|s_t, \Theta)\cdot p(s_t|\Theta). \tag{2.53}
\]
$N$ and $k$ denote the number of Gaussian densities and the corresponding index. $\Theta$ contains all parameters of the HMM including the initial state probabilities, state transitions and the GMM parameters as defined in (2.37). For
convenience, the following notation omits the parameter set.

The conditional discrete probability $p(k|s_t)$ corresponds to the weights of
the GMM and the likelihood function $p(x_t|k, s_t)$ is realized by a Gaussian
density. The state probability $p(s_t)$ originates from the Markov model given
by (2.51).
One realization is the Semi-Continuous HMM (SCHMM) [Huang and
Jack, 1989; Rabiner and Juang, 1993; Schukat-Talamazzini, 1995]. It has only
one GMM parameter set at its disposal. All states share the mean vectors
and covariances of one GMM and only differ in their weighting factors:

\[
p_{\mathrm{SCHMM}}(x_t) = \sum_{s_t=1}^{M}\sum_{k=1}^{N} w_k^{s_t}\cdot\mathcal{N}\{x_t|\mu_k, \Sigma_k\}\cdot p(s_t). \tag{2.54}
\]
SCHMMs benefit from the ability of GMMs to approximate arbitrary
probability density functions. They achieve high recognition accuracy
while being efficient in terms of memory and computational complexity [Huang and Jack, 1989; Schukat-Talamazzini, 1995]. The latter is
essential especially for embedded systems. Furthermore, speaker adaptation
is straightforward due to the possibility to define codebooks which can be
easily modified [Rieck et al., 1992; Schukat-Talamazzini, 1995]. Speaker adap-
tation is usually realized by a shift of mean vectors whereas covariances are
left unchanged as described in the next section. When using full covariance
matrices, SCHMMs split the parameter set into mean vectors of low com-
plexity and more complex covariances. This separation enables high speech
recognition accuracies and makes adaptation efficient.
A further realization is given by the Continuous Density HMM (CDHMM)
which supplies a complete GMM parameter set for each state:

\[
p_{\mathrm{CDHMM}}(x_t) = \sum_{s_t=1}^{M}\sum_{k=1}^{N} w_k^{s_t}\cdot\mathcal{N}\{x_t|\mu_k^{s_t}, \Sigma_k^{s_t}\}\cdot p(s_t). \tag{2.55}
\]
The superscript index $s_t$ indicates the state dependence of the weight factor,
mean vector and covariance matrix. CDHMMs enable an accurate statistical
mean vector and covariance matrix. CDHMMs enable an accurate statistical
model for each state at the expense of a higher number of Gaussian distribu-
tions. The covariance matrices may be realized by diagonal matrices to limit
the number of parameters.
Training and Evaluation
Speech can be modeled by HMMs. For instance, a phoneme can be repre-
sented by an HMM comprising 3 states. They reflect the steady state and
the transitions from the preceding and to the subsequent phoneme within
a spoken phrase. The corresponding GMMs represent the variability of the
pronunciation.
The concatenation of models enables speech modeling on a word level. The
corpus of all possible words is given by a lexicon. It determines the utterances
to be recognized. The prior probabilities of word sequences are given by the
language model [Schukat-Talamazzini, 1995].
HMMs can be trained by the Baum-Welch algorithm. In contrast to GMM
training, state and transition probabilities have to be included. A detailed
description can be found by Rabiner and Juang [1993]; Rabiner [1989].
The evaluation of HMMs has to determine the optimal sequence of models.
Subsequently, the decoding problem is exemplified for only one HMM. The
temporal sequence of states and transitions has to be determined so as to
provide the highest match between model and the observed data.
The current state can be estimated by the forward algorithm [O'Shaughnessy, 2000; Rabiner and Juang, 1993; Schukat-Talamazzini, 1995] based on
the feature vectors $x_{1:t}$ observed so far and the initial state probability $p(s_1)$.
The following notation is used: The density function $p(x_{1:t}, s_t = j|\Theta)$ is
denoted by $\alpha_t(j)$ for all states $j = 1, \ldots, M$. The conditional probability
density function $p(x_t|s_t = j)$ is given by $b_j(x_t)$ and transitions of the Markov
model are represented by $a_{lj} = p(s_t = j|s_{t-1} = l)$. The initialization is given
by $\pi_j = p(s_1 = j)$. The forward algorithm can be iteratively calculated. The
recursion is given by
recursion is given by
α
t
(j) = b
j
(x
t
) ·
M

l=1
α
t−1
(l) · a
lj
, t > 1 (2.56)
and the initialization
α
1
(j) = π
j
· b
j
(x
1
). (2.57)
According to Bayes' theorem, the posterior of state $s_t$ given the history of observations $x_{1:t}$ can be obtained by a normalization of the forward algorithm:

\[
p(s_t|x_{1:t}, \Theta) = \frac{p(x_{1:t}, s_t|\Theta)}{p(x_{1:t}|\Theta)} \propto p(x_{1:t}, s_t|\Theta). \tag{2.58}
\]
In addition, the backward algorithm [O'Shaughnessy, 2000; Rabiner and
Juang, 1993; Schukat-Talamazzini, 1995] can be applied when the complete utterance of length $T$ is buffered. An algorithm similar to the forward algorithm is applied in reversed temporal order to calculate the likelihood $\beta_t(j) = p(x_{t+1:T}|s_t = j, \Theta)$ for the successive observations $x_{t+1:T}$ given
the current state $s_t = j$. The recursion is given by
\[
\beta_t(j) = \sum_{l=1}^{M} a_{jl}\cdot b_l(x_{t+1})\cdot\beta_{t+1}(l), \quad 0 < t < T \tag{2.59}
\]

and the initialization

\[
\beta_T(j) = 1. \tag{2.60}
\]
The posterior probability $p(s_t = j|x_{1:T})$ can be calculated by combining the
forward and backward algorithm

\[
p(s_t = j|x_{1:T}) = \frac{\alpha_t(j)\cdot\beta_t(j)}{\sum_{l=1}^{M}\alpha_t(l)\cdot\beta_t(l)}, \quad 0 < t \leq T \tag{2.61}
\]
as described by O'Shaughnessy [2000]; Rabiner and Juang [1993]; Schukat-Talamazzini [1995]. The most probable state

\[
s_t^{\mathrm{MAP}} = \arg\max_{s_t}\{p(s_t|x_{1:T})\} \tag{2.62}
\]

can then be determined according to the MAP criterion.
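The forward and backward recursions translate directly into code. The following sketch omits the rescaling a practical implementation would need for long utterances; b, A and pi are assumed to be numpy arrays holding the emission likelihoods $b_j(x_t)$, the transition probabilities $a_{lj}$ and the initial probabilities.

```python
import numpy as np

def forward_backward(b, A, pi):
    """State posteriors p(s_t = j|x_{1:T}), cf. (2.56)-(2.61).

    A sketch without rescaling: b is a (T, M) array of emission
    likelihoods, A the (M, M) transition matrix, pi the initial
    state probabilities.
    """
    T, M = b.shape
    alpha = np.zeros((T, M))
    beta = np.ones((T, M))                       # beta_T(j) = 1, cf. (2.60)
    alpha[0] = pi * b[0]                         # (2.57)
    for t in range(1, T):
        alpha[t] = b[t] * (alpha[t - 1] @ A)     # (2.56)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])   # (2.59)
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)  # (2.61)
```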
To determine the best path in an HMM, which means the most likely sequence of states, the Viterbi algorithm is usually employed [Forney, 1973;
O'Shaughnessy, 2000; Rabiner and Juang, 1993; Schukat-Talamazzini, 1995].
In contrast to the forward algorithm, the density function $p(x_{1:t}, s_{1:t}|\Theta)$
with $s_t = j$ has to be optimized:

\[
\vartheta_t(j) = \max_{s_{1:t-1}}\left\{p(x_{1:t}, s_{1:t}|\Theta)\,|\,s_t = j\right\}. \tag{2.63}
\]
The Viterbi algorithm is recursively calculated by

\[
\vartheta_t(j) = \max_{l}\{\vartheta_{t-1}(l)\cdot a_{lj}\cdot b_j(x_t)\}, \quad t > 1 \tag{2.64}
\]
\[
\vartheta_1(j) = \pi_j\cdot b_j(x_1). \tag{2.65}
\]
Backtracking is required to determine the best sequence of states as de-
scribed by Schukat-Talamazzini [1995]. The Viterbi algorithm is faster than
the forward-backward algorithm since only the best sequence of states is con-
sidered instead of the sum over all paths [O’Shaughnessy, 2000].
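A corresponding Viterbi sketch, operating in the log domain to avoid numerical underflow and using the same array conventions as the forward-backward sketch above, including the backtracking step:

```python
import numpy as np

def viterbi(b, A, pi):
    """Most likely state sequence, cf. (2.63)-(2.65), with backtracking."""
    T, M = b.shape
    log_A = np.log(A + 1e-300)          # tiny offset guards log(0)
    log_b = np.log(b + 1e-300)
    theta = np.zeros((T, M))
    back = np.zeros((T, M), dtype=int)
    theta[0] = np.log(pi + 1e-300) + log_b[0]       # (2.65)
    for t in range(1, T):
        trans = theta[t - 1][:, None] + log_A       # theta_{t-1}(l) + log a_lj
        back[t] = trans.argmax(axis=0)
        theta[t] = trans.max(axis=0) + log_b[t]     # (2.64)
    path = np.zeros(T, dtype=int)
    path[-1] = theta[-1].argmax()
    for t in range(T - 2, -1, -1):                  # backtracking
        path[t] = back[t + 1, path[t + 1]]
    return path
```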
For speech recognition as described by (2.50) this algorithm has to be
extended so that transitions between HMMs can be tracked. Path search algorithms such as the Viterbi algorithm can be used to determine the recognition result given by the optimal word sequence $W_{1:N_W}^{\mathrm{opt}} = \{W_1^{\mathrm{opt}}, \ldots, W_{N_W}^{\mathrm{opt}}\}$.
Further details about the evaluation of HMMs can be found by Rabiner and
Juang [1993].
2.5.3 Implementation of an Automated Speech
Recognizer
In this book, a speech recognizer based on SCHMMs is used. Fig. 2.11 shows
the speech recognizer setup applied in the following chapters. It can be sub-
divided into three parts:
Front-end → $x_t$ → Codebook → $q_t$ → Decoder
Fig. 2.11 Block diagram of the SCHMM-based speech recognizer used for the
experiments of this book. A feature vector representation of the speech signal is
extracted in the front-end of the speech recognizer. Each feature vector is compared
with all Gaussian densities of the codebook. The result is used for speech decoding
to generate a transcription of the spoken phrase. Figure is taken from [Herbig et al.,
2010c].
• Front-end. In the front-end a noise reduction is performed, the MFCC
features are extracted from the audio signal, a cepstral mean subtraction
is calculated and an LDA is applied. For each time instance the vector
representation $x_t$ of the relevant speech characteristics is computed.
• Codebook. The feature vectors are evaluated by a speaker independent codebook subsequently called standard codebook. It consists of about
1000 multivariate Gaussian densities. The likelihood computation

\[
p(x_t|k) = \mathcal{N}\{x_t|\mu_k, \Sigma_k\} \tag{2.66}
\]

of each Gaussian density with index $k$ does not require a state alignment
since only the weighting factors are state dependent. The soft quantization

\[
q_t = \left(q_1(x_t), \ldots, q_k(x_t), \ldots, q_N(x_t)\right)^T \tag{2.67}
\]
\[
q_k(x_t) = \frac{\mathcal{N}\{x_t|\mu_k, \Sigma_k\}}{\sum_{l=1}^{N}\mathcal{N}\{x_t|\mu_l, \Sigma_l\}} \tag{2.68}
\]

contains the matching result represented by the normalized likelihoods of
all Gaussian densities [Fink, 2003]; a minimal sketch of this computation is
given after this list. The weights are evaluated in the speech decoder.
• Speech Decoder. The recognition task in (2.50) is solved by speech decoding based on $q_{1:T}$. The acoustic model is realized by Markov chains.
In addition, special models are used to capture garbage words and short
speech pauses not removed by the speech segmentation. Garbages might
contain speech which does not contribute to the speech recognition result.
For example, a speaker hesitates within an utterance, mispronounces single words, coughs or clears his throat. In combination with the lexicon and
language model, the transcription of the spoken utterance is obtained as
depicted in Fig. 2.12. Utterances can be rejected when the confidence of the
recognition result, e.g. given by the posterior probability $p(W_{1:N_W}|x_{1:T})$,
does not exceed a pre-determined threshold.
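The soft quantization announced in the codebook step above can be sketched as follows. A full-covariance codebook is assumed and the names are illustrative; a production system would evaluate the roughly 1000 densities far more efficiently.

```python
import numpy as np

def soft_quantization(x_t, means, covariances):
    """Normalized codebook likelihoods q_t, cf. (2.66)-(2.68).

    means is a (N, d) array, covariances a (N, d, d) array.
    """
    N, d = means.shape
    log_lik = np.empty(N)
    for k in range(N):
        diff = x_t - means[k]
        _, log_det = np.linalg.slogdet(covariances[k])
        maha = diff @ np.linalg.solve(covariances[k], diff)
        log_lik[k] = -0.5 * (d * np.log(2 * np.pi) + log_det + maha)  # (2.66)
    log_lik -= log_lik.max()            # stabilize the normalization
    q_t = np.exp(log_lik)
    return q_t / q_t.sum()              # (2.68)
```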
Fig. 2.12 Components of the speech decoding comprising the acoustic models,
lexicon and language models.
2.6 Speaker Adaptation
In Sect. 2.2 speech production was discussed as a source-filter model. Speaker
variability has its origins in a combination of gender and speaker specific
excitation source, physical anatomy of the vocal tract and acquired speaking
habits, e.g. speaking rate [Campbell, 1997]. Speech characteristics were also
briefly described which are important for speech decoding.
In the preceding sections several techniques were introduced to deal with
speaker and speech variability. In practical applications the problem of unseen
conditions or time-variant speaker and speech characteristics occur. Since a
re-training is usually not feasible at run-time, speaker adaptation provides
an alternative method to adjust or initialize statistical models. Thus, speaker
adaptation is an essential part of the speech controlled target system of this
book.
In this section several adaptation strategies are discussed. They can be
equally applied for speaker identification and speech recognition:
First, the application of speaker adaptation is motivated and possible
sources of speech variations are given. The latter can affect speaker iden-
tification and speech recognition. Several applications for speaker adaptation
in speaker identification and speech recognition are described. The differences
between the adaptation of GMMs and HMMs are emphasized. A simplified
notation is introduced to describe both adaptation scenarios.
Then two of the main representatives for long-term speaker adaptation
are introduced. They enable the estimation of a high number of adaptation
parameters. For each speaker an individually adapted statistical model can
be given.
In contrast to these representatives, an adaptation algorithm is described
that excels in fast convergence. This algorithm is suitable for adaptation on
limited training data since it only requires a few parameters. At the same
time the minimal number of adaptation parameters restricts the algorithm’s
accuracy for long-term adaptation [Botterweck, 2001].
Finally, an adaptation strategy is introduced which integrates short-term
and long-term adaptation capabilities. The optimal transition between short-
term and long-term adaptation is achieved at the expense of a higher com-
putational load [Jon et al., 2001; Stern and Lasry, 1987].
2.6.1 Motivation
In the preceding sections several statistical models and training algorithms
were presented to handle speech and speaker related information. Speaker
adaptation provides an alternative way to create adjusted statistical models.
The practical realization of the statistical models described in the preced-
ing sections often suffers from mismatches between training and test con-
ditions when the training has to be done on incomplete data that do not
represent the actual test situation. Speaker adaptation offers the oppor-
tunity to moderate those mismatches and to obtain improved statistical
representations.
In general, pattern recognition on speech encounters several sources which
are responsible for a mismatch. Equation (2.4) reveals that the speaker’s vocal
tract and environmental influences may cause variations. Among those, there
are the channel impulse response as well as background noises.
Even though the noise reduction of the front-end reduces background
noises and smooths the power spectrum, the residual noise and channel dis-
tortions can affect the speaker identification and speech recognition accuracy.
Enhancement or normalization techniques, e.g. cepstral mean normal-
ization, can be applied to reduce unwanted variations such as channel
characteristics.
In the case of speaker specific variations speaker adaptation integrates
them into the statistical modeling to initialize or enhance a speaker specific
model. It starts from a preceding or initial parameter set $\bar{\Theta}$ and provides
an optimized parameter set $\Theta$ for each speaker based on training data. The
speaker index is omitted. Optimization can be achieved by the ML or MAP
criterion which increase either the likelihood or posterior probability for the
recorded utterance $x_{1:T}$. The problem can be formulated by

\[
\Theta_{\mathrm{ML}} = \arg\max_{\Theta}\{p(x_{1:T}|\Theta)\} \tag{2.69}
\]
\[
\Theta_{\mathrm{MAP}} = \arg\max_{\Theta}\{p(x_{1:T}|\Theta)\cdot p(\Theta)\} \tag{2.70}
\]
as found by Gauvain and Lee [1994]. The same problem has to be solved as
for the training algorithm in Sect. 2.4.2. Incomplete data have to be handled
because of latent variables. This can be implemented by the EM algorithm
as described in Sect. A.1.
In contrast to the realization of the EM algorithm in Sect. 2.4.2, just
one iteration or only a few ones are performed. The first iterations usually
contribute most to the final solution so that a reduction of the computational
complexity is achieved by this restriction. Furthermore, the risk of over-fitting
can be reduced. Equations (A.5) and (A.3) form the basis of the following
algorithms.
The capability of adaptation algorithms depends on the number of avail-
able parameters. With a higher number of parameters a more individual and
accurate representation of the voice pattern of a particular speaker can be
achieved. A high number of parameters, however, requires a large amount of
adaptation data. Otherwise over-fitting can occur since the statistical model
starts to represent insignificant properties and loses the ability to general-
ize [Duda et al., 2001].
Thus, either different approaches are needed for special applications de-
pending on the amount of data or an integrated strategy has to be found.
2.6.2 Applications for Speaker Adaptation in Speaker
Identification and Speech Recognition
GMMs were introduced in Sect. 2.4.2 as common statistical models to capture
speaker characteristics. They are trained by the EM algorithm
to obtain an optimal statistical representation. However, in the literature
there exist alternative methods to obtain speaker specific GMMs. According
to Nishida and Kawahara [2005] adapting a Universal Background Model
(UBM) to a particular speaker can be viewed as one of the most frequently
applied techniques to build up speaker specific GMMs.
In the literature a further important application is updating the codebook
of a speech recognizer [Zavaliagkos et al., 1995]. In Sect. 2.5 speech recognition
was introduced from a speaker independent point of view. Many speech rec-
ognizers are extensively trained on a large group of speakers and thus do not
represent the true speech characteristics of particular speakers [Zavaliagkos
et al., 1995]. In realistic applications speaker adaptation can enhance the ro-
bustness of the ASR. The recognition rate can be significantly increased as
demonstrated by Gauvain and Lee [1994]; Kuhn et al. [2000], for example.
The preceding sections showed that GMMs are a special case of HMMs.
Therefore similar techniques can be applied to adapt GMMs and HMMs.
Subsequently, only speech recognizers based on SCHMMs are investigated in
this book.
Speaker adaptation for speaker identification purely aims at a better rep-
resentation of the speaker’s static voice characteristics with respect to the
likelihood or posterior probability.
Codebook adaptation is intended to optimize the recognition of spoken
phrases and therefore includes the temporal speech dynamics as a further
aspect. Codebooks can be optimized by taking advantage of the speech
recognizer's knowledge about the spoken utterance. Since speaker adapta-
tion follows the speech decoding in the realization of this book, a two-stage
approach becomes feasible. Hence, the optimization problem can be solved
in analogy to the segmental MAP algorithm
\[
\Theta_{\mathrm{seg\,MAP}} = \arg\max_{\Theta}\left\{\max_{s_{1:T}}\{p(x_{1:T}, s_{1:T}|\Theta)\cdot p(\Theta)\}\right\} \tag{2.71}
\]

or equivalently by the following recursion

\[
\hat{s}_{1:T} = \arg\max_{s_{1:T}}\left\{p(x_{1:T}, s_{1:T}|\bar{\Theta})\right\} \tag{2.72}
\]
\[
\hat{\Theta} = \arg\max_{\Theta}\{p(x_{1:T}, \hat{s}_{1:T}|\Theta)\cdot p(\Theta)\} \tag{2.73}
\]

as found by Gauvain and Lee [1994]. The state sequence of an observed utterance is given by $s_{1:T} = \{s_1, s_2, \ldots, s_T\}$. $\bar{\Theta}$ and $\hat{\Theta}$ denote the parameter set
of the preceding iteration and the adapted parameter set. Several iterations
may be performed to calculate $\Theta_{\mathrm{seg\,MAP}}$.
One problem of such a two-stage approach is the fact that transcription
errors can negatively affect the speaker adaptation accuracy, especially when
a large number of free parameters have to be estimated. Thus confidence
measures may be employed on a frame or word level to increase the robustness
against speech recognition errors [Gollan and Bacchiani, 2008].
The following approximation is used in this book. During speech decoding
an optimal state sequence $\hat{s}_{1:T}$ is determined by the Viterbi algorithm. The
speech recognition result is integrated in codebook optimization by the state
dependent weights of the shared GMM since only SCHMMs are considered.
The auxiliary functions of the EM algorithm in (A.3) and (A.5) have to be
re-written

\[
Q_{\mathrm{seg\,MAP}}(\Theta, \bar{\Theta}) = Q_{\mathrm{seg\,ML}}(\Theta, \bar{\Theta}) + \log\left(p(\Theta)\right) \tag{2.74}
\]
\[
Q_{\mathrm{seg\,ML}}(\Theta, \bar{\Theta}) = \sum_{t=1}^{T}\sum_{k=1}^{N} p(k|\hat{s}_t, x_t, \bar{\Theta})\cdot\log\left(p(x_t, k|\hat{s}_t, \Theta)\right) \tag{2.75}
\]
\[
= \sum_{t=1}^{T}\sum_{k=1}^{N} p(k|\hat{s}_t, x_t, \bar{\Theta})\cdot\log\left(p(x_t|k, \hat{s}_t, \Theta)\right)
+ \sum_{t=1}^{T}\sum_{k=1}^{N} p(k|\hat{s}_t, x_t, \bar{\Theta})\cdot\log\left(p(k|\hat{s}_t, \Theta)\right) \tag{2.76}
\]

to integrate the knowledge of the speech transcription in the codebook
adaptation.
When speaker adaptation is restricted to adapting only the mean vectors $\mu_l$, $l = 1, \ldots, N$, the optimization of the auxiliary function can be
simplified for SCHMMs:

\[
\frac{d}{d\mu_l}\,Q_{\mathrm{seg\,ML}}(\Theta, \bar{\Theta}) = \sum_{t=1}^{T}\sum_{k=1}^{N} p(k|\hat{s}_t, x_t, \bar{\Theta})\cdot\frac{d}{d\mu_l}\log\left(p(x_t|k, \Theta)\right). \tag{2.77}
\]
The likelihood $p(x_t|k, s_t, \Theta)$ of a particular Gaussian distribution does not
depend on the current state $s_t$ and the prior probability $p(k|s_t, \Theta) = w_k^{s_t}$ is
independent from the mean vectors $\mu_l$. The reader is referred to Gauvain and
Lee [1994]; Matrouf et al. [2001] for adaptation scenarios modifying also the
covariance matrices of an HMM.
The difference between the adaptation of GMMs and codebooks of SCHMMs becomes now obvious. The posterior probability $p(k|x_t, \bar{\Theta})$ determines
the assignment of the observed feature vector $x_t$ to the Gaussian distributions
of a GMM. The posterior probability $p(k|\hat{s}_t, x_t, \bar{\Theta})$ integrating the transcription of the spoken phrase is used for codebook adaptation. It provides more
information about the speech signal compared to $p(k|x_t, \bar{\Theta})$. Thus, the E-step does not only check whether a feature vector is assigned to a particular
Gaussian distribution. In addition, the contribution to the recognition result
gains importance for speaker adaptation.
Speaker adaptation normally modifies only the codebooks of a speech recognizer. Adapting the transitions of the Markov chain is usually discarded
because the emission densities are considered to have a higher effect on
the speech recognition rate compared to the transition probabilities [Dobler
and Rühl, 1995]. The transition probabilities reflect the durational information [O'Shaughnessy, 2000]. For example, they contain the speech rate and
average duration between the start and end state of a feed-forward chain as
shown in Fig. 2.10. The reader is referred to Dobler and Rühl [1995]; Gauvain and Lee
[1994] for adapting the Markov transitions.
Subsequently, the discussion is focused on GMM and codebook adapta-
tion. Because of the two-staged approach a simplified notation can be used
for both applications in speaker identification and speech recognition. The
state variable is omitted in the case of a speech recognizer’s codebook to ob-
tain a compact representation. Thus, codebooks and GMMs can be equally
considered when adaptation is described in this book.
2.6.3 Maximum A Posteriori
Maximum A Posteriori and Maximum Likelihood Linear Regression can be
viewed as standard adaptation algorithms [Kuhn et al., 1999, 2001]. They
both belong to the most frequently used techniques to establish speaker
specific GMMs by adapting a UBM on speaker specific data [Nishida and
Kawahara, 2005]. Codebook adaptation of a speech recognizer is a further
important application as already discussed in Sect. 2.6.2.
MAP adaptation, also known as Bayesian learning [Gauvain and Lee, 1994],
is a standard approach. It is directly derived from the extended auxiliary
function

\[
Q_{\mathrm{MAP}}(\Theta, \bar{\Theta}) = Q_{\mathrm{ML}}(\Theta, \bar{\Theta}) + \log\left(p(\Theta)\right), \tag{2.78}
\]

where $\bar{\Theta}$ denotes an initial or speaker specific parameter set. It integrates
prior knowledge $p(\Theta)$ and the ML estimates $\Theta_{\mathrm{ML}}$ based on the observed
data [Stern and Lasry, 1987]. This algorithm is characterized by individual
adaptation of each Gaussian density [Gauvain and Lee, 1994; Jon et al., 2001;
Reynolds et al., 2000].
Gauvain and Lee [1994] provide a detailed derivation and the mathematical
background of the Bayesian learning algorithm so that only the results are
given here. More details can be found in Sect. A.2 where codebook adaptation
is exemplified.
MAP adaptation starts from a speaker independent UBM as the initial parameter set $\Theta_{\mathrm{UBM}}$ or alternatively from a trained speaker specific GMM. The
parameters $\mu_k^{\mathrm{UBM}}$, $\Sigma_k^{\mathrm{UBM}}$ and $w_k^{\mathrm{UBM}}$ subsequently contain the prior knowledge about the mean vectors, covariance matrices and the weights, respectively.
Equation (2.78) has to be optimized with respect to the modified parameter set $\Theta$. The feature vectors are viewed in the context of the iid assumption and statistical dependencies among the Gaussian densities are neglected.
Thus, $p(\Theta)$ can be factorized [Zavaliagkos et al., 1995]. This allows individual
adaptation equations for each Gaussian distribution. The following two steps
are performed:

First, a new parameter set of ML estimates $\Theta_{\mathrm{ML}}$ is calculated by applying
one E-step and one M-step of the EM algorithm. The ML estimates are denoted by the superscript ML. $N$ and $n_k$ denote the total number of Gaussian
densities and the number of feature vectors softly assigned to a particular
Gaussian density. For convenience, the equations of the ML estimation in
Sect. 2.4 are modified
\[
p(k|x_t, \Theta_{\mathrm{UBM}}) = \frac{w_k^{\mathrm{UBM}}\cdot\mathcal{N}\{x_t|\mu_k^{\mathrm{UBM}}, \Sigma_k^{\mathrm{UBM}}\}}{\sum_{l=1}^{N} w_l^{\mathrm{UBM}}\cdot\mathcal{N}\{x_t|\mu_l^{\mathrm{UBM}}, \Sigma_l^{\mathrm{UBM}}\}} \tag{2.79}
\]
\[
n_k = \sum_{t=1}^{T} p(k|x_t, \Theta_{\mathrm{UBM}}) \tag{2.80}
\]
\[
\mu_k^{\mathrm{ML}} = \frac{1}{n_k}\sum_{t=1}^{T} p(k|x_t, \Theta_{\mathrm{UBM}})\cdot x_t \tag{2.81}
\]
\[
\Sigma_k^{\mathrm{ML}} = \frac{1}{n_k}\sum_{t=1}^{T} p(k|x_t, \Theta_{\mathrm{UBM}})\cdot(x_t-\mu_k^{\mathrm{ML}})\cdot(x_t-\mu_k^{\mathrm{ML}})^T \tag{2.82}
\]
\[
w_k^{\mathrm{ML}} = \frac{n_k}{T} \tag{2.83}
\]
2.6 Speaker Adaptation 43
so that the UBM determines the feature vector assignment and acts as the ini-
tial model. When an iterative procedure is applied, Θ
UBM
has to be replaced
by
¯
Θ which represents the parameter set of the preceding iteration.
After the ML estimation two sets of parameters exist. The ML estimates $\Theta^{ML}$ represent the observed data whereas the initial parameters $\Theta^{UBM}$ comprise the prior distribution. For adaptation a smooth transition between $\Theta^{UBM}$ and $\Theta^{ML}$ has to be found depending on these observations.

This problem is solved in the second step by a convex combination of the initial parameters and the ML estimates resulting in the optimized parameter set $\Theta^{MAP}$. The number of softly assigned feature vectors $n_k$ controls the weighting of this convex combination for each Gaussian density. The influence of the new estimates $\Theta^{ML}$ increases with higher $n_k$. Small values of $n_k$ cause the MAP estimate to resemble the prior parameters. The associated equations

$\alpha_k = \dfrac{n_k}{n_k + \eta}$ (2.84)

$\mu_k^{MAP} = (1 - \alpha_k) \cdot \mu_k^{UBM} + \alpha_k \cdot \mu_k^{ML}$ (2.85)

$\Sigma_k^{MAP} = (1 - \alpha_k) \cdot \Sigma_k^{UBM} + \alpha_k \cdot \Sigma_k^{ML}$ (2.86)

$w_k^{MAP} \propto (1 - \alpha_k) \cdot w_k^{UBM} + \alpha_k \cdot w_k^{ML}, \qquad \sum_{l=1}^{N} w_l^{MAP} = 1$ (2.87)

can be derived from (2.78) and can be found in Reynolds et al. [2000].
The constant $\eta$ contains prior knowledge about the densities of the GMM [Gauvain and Lee, 1994]. In practical applications, however, $\eta$ can be set without such knowledge: Reynolds et al. [2000] show that speaker verification performance is not sensitive to the choice of this constant, which can be selected within a relatively wide range. The weight computation $w_k^{MAP}$ introduces a constant factor to ensure that the resulting weights are valid probabilities and sum up to unity.
Adapting the mean vectors seems to have the major effect on the speaker verification results [Reynolds et al., 2000]. Thus, only the mean vectors and, if applicable, the weights may be adapted as a compromise between an enhanced identification rate and computational load.
Fig. 2.13(a) and Fig. 2.13(b) illustrate the procedure of the MAP adap-
tation in analogy to Reynolds et al. [2000]: Fig. 2.13(a) displays the original
GMM as ellipses. The centers represent mean vectors and the main axes of
the ellipses correspond to variances. The crosses indicate the observed feature
vectors. Fig. 2.13(b) shows the adapted Gaussian densities as filled ellipses
and the original density functions as unfilled ellipses. The Gaussian densities
at the top are individually adapted with respect to the assigned feature vec-
tors. The resulting ellipses cover the feature vectors and parts of the original
ellipses indicating the convex combination in (2.85). One density function
remains unchanged since no feature vectors are assigned.
Fig. 2.13 Example for the MAP adaptation of a GMM comprising 3 Gaussian densities. (a) Ellipses display the original Gaussian densities of the GMM; the observed feature vectors are depicted as crosses. (b) Adapted Gaussian densities are depicted as filled ellipses whereas the initial Gaussian densities of the GMM are denoted by unfilled ellipses (dotted line).
Finally, the advantages and drawbacks of the MAP adaptation should be discussed: Individual adaptation is achieved on extensive adaptation data, which enables a high adaptation accuracy. If the constant $\eta$ is selected appropriately, MAP adaptation is robust against over-fitting caused by limited speaker specific data. However, an inefficient adaptation of untrained Gaussian densities has to be expected, resulting in slow convergence [Zavaliagkos et al., 1995]. For $n_k \to 0$ no adaptation occurs or only a few parameters are marginally adapted. Thus, MAP adaptation is important for long-term adaptation but suffers from inefficient adaptation on limited data [Kuhn et al., 2000].
2.6.4 Maximum Likelihood Linear Regression
Approaches for speaker adaptation such as MAP adaptation, where each
Gaussian distribution is adapted individually, require a sufficient amount of
training data to be efficient. In general the number of adaptation parameters
has to be balanced with the amount of speaker specific data.
One way is to estimate a set of model transformations for classes of
Gaussian distributions to capture speaker characteristics. Several publica-
tions such as Gales and Woodland [1996]; Leggetter and Woodland [1995a]
describe Maximum Likelihood Linear Regression (MLLR) which is widely
used in speaker adaptation. The key idea of MLLR is to represent model
adaptation by a linear transformation. For example, mean vector adaptation may be realized for each regression class $r$ by a multiplication of the original mean vector $\mu_{k_r}$ with a matrix $A_r$ and an offset vector $b_r$. The optimized mean vector $\mu_{k_r}^{MLLR}$ is given by

$\mu_{k_r}^{MLLR} = A_r^{opt} \cdot \mu_{k_r} + b_r^{opt}$ (2.88)

which can be rewritten as

$\mu_{k_r}^{MLLR} = W_r^{opt} \cdot \zeta_{k_r}$ (2.89)

$W_r^{opt} = \left[ b_r^{opt} \;\; A_r^{opt} \right]$ (2.90)

$\zeta_{k_r}^T = \left[ 1 \;\; \mu_{k_r}^T \right]$. (2.91)
The covariance matrices may be adapted in a similar way as described by Gales and Woodland [1996].

The goal of MLLR is to determine an optimal transformation matrix $W_r^{opt}$ in such a way that the new parameter set maximizes the likelihood of the training data. This problem may be solved by maximizing the auxiliary function $Q^{ML}(\Theta, \bar{\Theta})$ of the EM algorithm. In this context $\bar{\Theta}$ denotes the initial parameter set, e.g. the standard codebook $\Theta_0$, or the parameter set of the previous iteration when an iterative procedure is applied. The optimal transformation matrix $W_r^{opt}$ is given by the solution of the following set of equations:
$\sum_{t=1}^{T} \sum_{r=1}^{R} p(k_r|x_t, \bar{\Theta}) \cdot \Sigma_{k_r}^{-1} \cdot x_t \cdot \zeta_{k_r}^T = \sum_{t=1}^{T} \sum_{r=1}^{R} p(k_r|x_t, \bar{\Theta}) \cdot \Sigma_{k_r}^{-1} \cdot W_r^{opt} \cdot \zeta_{k_r} \cdot \zeta_{k_r}^T$ (2.92)

as found by Gales and Woodland [1996]. $p(k_r|x_t, \bar{\Theta})$ denotes the posterior probability that feature vector $x_t$ is assigned to a particular Gaussian density $k_r$ within a given regression class $r$.
In this book, MAP adaptation is viewed as an appropriate candidate for
long-term adaptation. A highly specific speaker modeling can be achieved
on extensive data. Furthermore, the combination of short-term and long-
term adaptation in Sect. 4.2 can be intuitively explained. Thus, MLLR is not
considered.
2.6.5 Eigenvoices
The MAP adaptation has its strength in long-term speaker adaptation whereas the Eigenvoice (EV) approach is advantageous in the case of limited training data, as shown by Kuhn et al. [2000]. This technique benefits from prior knowledge about the statistical dependencies between all Gaussian densities. Thus, this algorithm significantly reduces the number of adaptation parameters. Even very little training data, e.g. short command and control utterances, allows the estimation of a few scalar parameters. When prior knowledge about speaker variability can be employed, the codebook of a speech recognizer can be efficiently adapted with some 10 adaptation parameters [Kuhn et al., 2000].

About 25,000 parameters have to be adapted when the MAP adaptation is applied to the codebook of the speech recognizer introduced in Sect. 2.5.3. The EV technique in the realization of this book needs only 10 parameters to adapt codebooks efficiently. GMMs for speaker identification can be realized with significantly fewer parameters, as will be shown in Sect. 5.3.
In an off-line training step codebook adaptation is performed for a large pool of speakers. The main directions of variation in the feature space are extracted. They are represented by eigenvectors which are subsequently called eigenvoices. At run-time codebook adaptation can be efficiently implemented by a linear combination of all eigenvoices. Only the weighting factors have to be determined to modify the model parameters in an optimal manner. For convenience, this presentation only considers the mean vector adaptation of codebooks but may be extended to weights or covariance matrices. A more detailed investigation of the EV approach can be found in Kuhn et al. [2000]; Thyes et al. [2000].
Training
In an off-line training step $N_{Sp}$ speaker specific codebooks are trained. The origin of all speaker specific models is given here by the standard codebook with parameter set $\Theta_0$. Each speaker with index $i$ provides a large amount of training utterances so that an efficient long-term speaker adaptation is possible. For convenience, the MAP algorithm introduced in Sect. 2.6.3 is considered in the following. The result is a speaker specific set of mean vectors

$\mu_k^0 \;\xrightarrow{MAP}\; \left\{ \mu_k^1, \ldots, \mu_k^i, \ldots, \mu_k^{N_{Sp}} \right\}, \quad k = 1, \ldots, N,$ (2.93)
which is grouped by the component index k. In total, N Gaussian densities
are employed.
Subsequently, all mean vectors are arranged in supervectors

$\check{\mu}^i = \begin{pmatrix} \mu_1^i \\ \vdots \\ \mu_k^i \\ \vdots \\ \mu_N^i \end{pmatrix}$ (2.94)
Fig. 2.14 Example for the supervector notation. All mean vectors of a codebook are stacked into a long vector.
indicated by the superscript in $\check{\mu}$ and the missing index $k$. To obtain a compact notation, equation (2.93) is represented by supervectors

$\check{\mu}^0 \;\xrightarrow{MAP}\; \left\{ \check{\mu}^1, \check{\mu}^2, \ldots, \check{\mu}^{N_{Sp}} \right\}$. (2.95)

An example of the supervector notation is displayed in Fig. 2.14. In the following, a pool of diverse speakers is considered. For convenience, the mean over all speaker specific supervectors is assumed to be identical to the supervector of the speaker independent mean vectors given by the standard codebook:

$E_i\{\check{\mu}^i\} = \check{\mu}^0$. (2.96)
An additional offset vector is omitted here.
Now the training has to obtain the most important adaptation directions which can be observed in this speaker pool. The covariance matrix

$\check{\Sigma}^{EV} = E_i\left\{ \left( \check{\mu}^i - \check{\mu}^0 \right) \cdot \left( \check{\mu}^i - \check{\mu}^0 \right)^T \right\}$ (2.97)
is calculated and the eigenvectors of $\check{\Sigma}^{EV}$ are extracted using PCA, as described by Jolliffe [2002]; realizations of the PCA which are based on the correlation matrix also exist [Jolliffe, 2002; Kuhn et al., 2000]. The eigenvalues are denoted by $\lambda^{EV}$. The corresponding eigenvectors $\check{e}^{EV}$ fulfill the condition

$\check{\Sigma}^{EV} \cdot \check{e}_l^{EV} = \lambda_l^{EV} \cdot \check{e}_l^{EV}, \quad 1 \le l \le N_{Sp},$ (2.98)

and can be chosen as orthonormal

$(\check{e}_{l_1}^{EV})^T \cdot \check{e}_{l_2}^{EV} = \delta_K(l_1, l_2), \quad 1 \le l_1, l_2 \le N_{Sp},$ (2.99)

since the covariance matrix is symmetric, $\check{\Sigma}^{EV} = (\check{\Sigma}^{EV})^T$, by its definition [Bronstein et al., 2000; Meyberg and Vachenauer, 2003]. The number of eigenvoices is limited by the number of speakers $N_{Sp}$.
The eigenvalues characterize the variance of the observed data projected onto these eigenvectors if the eigenvectors are normalized [Jolliffe, 2002]. High values indicate the important directions in the sense of an efficient speaker adaptation. The eigenvectors $\check{e}^{EV}$ are sorted by decreasing eigenvalues and the following considerations focus only on the first $L$ eigenvectors.

The first eigenvector usually represents gender information and enables gender detection, as the results of Kuhn et al. [2000] suggest. This book makes use of only $L = 10$ eigenvoices, similar to Kuhn et al. [2000], because the corresponding eigenvalues diminish rapidly.
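The training can be summarized in a few lines. The following sketch assumes that the speaker specific mean vectors are already available, e.g. from MAP adaptation, stacks them into supervectors according to (2.94) and centers them according to (2.96). Since the covariance matrix (2.97) would be prohibitively large, the eigenvectors are extracted from an SVD of the centered data matrix, which yields the same eigenvoices; this shortcut is a choice of the example, not of the book.

import numpy as np

def train_eigenvoices(speaker_means, L=10):
    """Extract eigenvoices from speaker specific codebooks, cf. (2.93)-(2.98).

    speaker_means : (N_sp, N, D) adapted mean vectors of N_sp speakers
    Returns the mean supervector (N*D,) and the first L eigenvoices (L, N*D).
    """
    N_sp = speaker_means.shape[0]
    supervecs = speaker_means.reshape(N_sp, -1)   # supervector notation (2.94)
    mu0 = supervecs.mean(axis=0)                  # assumption (2.96)
    centered = supervecs - mu0
    # the right singular vectors equal the eigenvectors of (2.97) and the
    # squared singular values are proportional to the eigenvalues
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return mu0, vt[:L]      # eigenvoices sorted by decreasing eigenvalue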
Up to this point the principal directions of speaker adaptation are represented as supervectors. At run-time the EV speaker adaptation and especially the combination of short-term and long-term adaptation in Sect. 4.2 become more feasible when each Gaussian density is individually represented. Each supervector $\check{e}_l^{EV}$ separates into its elements

$\check{e}_l^{EV} = \begin{pmatrix} e_{1,l}^{EV} \\ \vdots \\ e_{k,l}^{EV} \\ \vdots \\ e_{N,l}^{EV} \end{pmatrix}, \quad l = 1, \ldots, L,$ (2.100)
where the index k indicates the assignment to a particular Gaussian density.
Adaptation
At run-time the optimal combination of the eigenvoices has to be determined to obtain a set of adapted parameters. The adapted mean vector $\check{\mu}^{EV}$ results from a linear combination of the original speaker independent mean vector $\check{\mu}^0$ and the eigenvoices $\check{e}^{EV}$. For convenience, the representation of the eigenvoices for each Gaussian density is used

$\mu_k^{EV} = \mu_k^0 + \sum_{l=1}^{L} \alpha_l \cdot e_{k,l}^{EV}$ (2.101)
as noted in (2.100). In order to obtain a comprehensive notation, the matrix

$M_k = \left( e_{k,1}^{EV}, \ldots, e_{k,L}^{EV} \right), \quad k = 1, \ldots, N,$ (2.102)

containing all eigenvoices is introduced. The weight vector $\alpha = (\alpha_1, \ldots, \alpha_L)^T$ is employed to represent (2.101) by a matrix product and an offset vector:

$\mu_k^{EV} = \mu_k^0 + M_k \cdot \alpha$. (2.103)
For adaptation the optimal scalar weighting factors $\alpha_l$ have to be determined. Standard approaches solve this problem by optimizing the auxiliary function of the EM algorithm

$Q^{ML}(\Theta, \Theta_0) = \sum_{t=1}^{T} \sum_{k=1}^{N} p(k|x_t, \Theta_0) \cdot \left[ \log(p(x_t|k, \Theta)) + \log(p(k|\Theta)) \right]$ (2.104)
as demonstrated by Kuhn et al. [1998], for example. In this context, $\Theta$ contains the weights $w_k^0$, the mean vectors $\mu_k^{EV}$ depending on $\alpha$ and the covariance matrices $\Sigma_k^0$; $w_k^0$ and $\Sigma_k^0$ are given by the standard codebook. When speaker adaptation is iteratively performed, $\Theta_0$ has to be replaced by the parameter set $\bar{\Theta}$ of the preceding iteration. Prior knowledge can be included by optimizing

$Q^{MAP}(\Theta, \Theta_0) = Q^{ML}(\Theta, \Theta_0) + \log(p(\Theta))$. (2.105)
$p(x_t|k, \Theta)$ and $p(k|\Theta)$ were already introduced in Sect. 2.4.2. They denote the likelihood function of a single Gaussian density and the weight of a particular density function as given by equations (2.35) and (2.36).

The derivations known from the literature differ in the choice of the prior knowledge $p(\Theta)$ concerning the parameter set, as found in Huang et al. [2004]; Kuhn et al. [1998], for example. In this book a uniform distribution is chosen so that the MAP criterion reduces to the ML estimation problem.
The optimal combination of the eigenvoices is obtained by inserting the linear combination (2.103) into the auxiliary function (2.104) and optimizing $Q^{ML}(\Theta, \Theta_0)$ with respect to the weighting factors $\alpha$. The computation is simplified since the weights are independent of the mean vectors:

$\dfrac{d}{d\alpha} \log(p(k|\Theta)) = 0$ (2.106)

$\dfrac{d}{d\alpha} \log(p(x_t|k, \Theta)) = \dfrac{d}{d\alpha} \log\left( \mathcal{N}\{x_t \,|\, \mu_k^{EV}(\alpha), \Sigma_k^0\} \right)$. (2.107)
Now the problem of the parameter computation can be given in a compact notation:

$\dfrac{d}{d\alpha} Q^{ML}(\Theta, \Theta_0) = -\dfrac{1}{2} \cdot \sum_{t=1}^{T} \sum_{k=1}^{N} p(k|x_t, \Theta_0) \cdot \dfrac{d}{d\alpha} d_k^{Mahal}(\alpha)$. (2.108)
The squared Mahalanobis distance

$d_k^{Mahal}(\alpha) = \left( x_t - \mu_k^{EV}(\alpha) \right)^T \cdot (\Sigma_k^0)^{-1} \cdot \left( x_t - \mu_k^{EV}(\alpha) \right)$ (2.109)

is given by the exponent of the Gaussian density excluding the scaling factor $-\frac{1}{2}$, as found in Campbell [1997]. $d_k^{Mahal}$ is a quadratic distance measure and therefore the derivative leads to linear expressions in terms of $\alpha$. Since $\Sigma_k^0$ is symmetric, the derivation rule

$\dfrac{d}{d\alpha} d_k^{Mahal}(\alpha) = \dfrac{d}{d\alpha} \left( x_t - \mu_k^0 - M_k \cdot \alpha \right)^T \cdot (\Sigma_k^0)^{-1} \cdot \left( x_t - \mu_k^0 - M_k \cdot \alpha \right)$ (2.110)

$= \dfrac{d}{d\alpha}\, \alpha^T \cdot M_k^T \cdot (\Sigma_k^0)^{-1} \cdot M_k \cdot \alpha \;-\; \dfrac{d}{d\alpha}\, \alpha^T \cdot M_k^T \cdot (\Sigma_k^0)^{-1} \cdot \left( x_t - \mu_k^0 \right) \;-\; \dfrac{d}{d\alpha} \left( x_t - \mu_k^0 \right)^T \cdot (\Sigma_k^0)^{-1} \cdot M_k \cdot \alpha$ (2.111)

$= 2 \cdot M_k^T \cdot (\Sigma_k^0)^{-1} \cdot M_k \cdot \alpha \;-\; 2 \cdot M_k^T \cdot (\Sigma_k^0)^{-1} \cdot \left( x_t - \mu_k^0 \right)$ (2.112)

$= -2 \cdot M_k^T \cdot (\Sigma_k^0)^{-1} \cdot \left( x_t - \mu_k^0 - M_k \cdot \alpha \right)$ (2.113)

is applicable, as found in Felippa [2004]; Petersen and Pedersen [2008].
The optimal parameter set $\alpha^{ML}$ has to fulfill the condition

$\left. \dfrac{d}{d\alpha} Q^{ML} \right|_{\alpha = \alpha^{ML}} = 0$ (2.114)
which leads to the following set of equations:

$0 = \left( \sum_{t=1}^{T} \sum_{k=1}^{N} p(k|x_t, \Theta_0) \cdot M_k^T \cdot (\Sigma_k^0)^{-1} \cdot M_k \right) \cdot \alpha^{ML} \;-\; \sum_{t=1}^{T} \sum_{k=1}^{N} p(k|x_t, \Theta_0) \cdot M_k^T \cdot (\Sigma_k^0)^{-1} \cdot \left( x_t - \mu_k^0 \right)$. (2.115)
The matrix $M_k$ and the covariance matrix $\Sigma_k^0$ are independent of the observations $x_t$. According to equation (2.43), the sum over all observed feature vectors weighted by the posterior probability is equal to the ML estimate $\mu_k^{ML}$ multiplied by the soft number of assigned feature vectors $n_k$. This allows the following modification:
$0 = \left( \sum_{k=1}^{N} n_k \cdot M_k^T \cdot (\Sigma_k^0)^{-1} \cdot M_k \right) \cdot \alpha^{ML} \;-\; \sum_{k=1}^{N} n_k \cdot M_k^T \cdot (\Sigma_k^0)^{-1} \cdot \left( \mu_k^{ML} - \mu_k^0 \right)$. (2.116)
The solution of this set of equations delivers the optimal weights $\alpha^{ML}$. The adapted mean vectors $\mu_k^{EV}$ result from the weighted sum of all eigenvoices in (2.101) given the optimal weights $\alpha^{ML}$.
In contrast to the MAP adaptation in Sect. 2.6.3 this approach is characterized by fast convergence even on limited training data due to prior knowledge about speaker variability. Only a few utterances are required. However, this prior knowledge also limits the adaptation accuracy in long-term adaptation [Botterweck, 2001]. Thus, EV adaptation is considered a promising candidate for short-term adaptation, especially for the use case of this book.
2.6.6 Extended Maximum A Posteriori
The preceding sections presented techniques for long-term and short-term adaptation. Now the question arises how to handle the transition between limited and extensive training data. One way is to use an experimentally determined threshold. However, such hard decisions require a reasonable choice of the associated thresholds and bear the risk of being suboptimal.

The Extended Maximum A Posteriori (EMAP) speaker adaptation presented by Stern and Lasry [1987] follows the strategy of integrating both aspects and offers an optimal transition. EMAP differs from the MAP adaptation in Sect. 2.6.3 since the parameters are regarded as random variables with statistical dependencies across all Gaussian densities [Stern and Lasry, 1987]. EMAP jointly adapts all Gaussian densities whereas MAP adaptation individually modifies each distribution.
The prior knowledge of how the Gaussian densities of a codebook depend on the adaptation of the remaining density functions is employed to efficiently adjust even those densities which are not or only insufficiently trained [Jon et al., 2001], as shown in (2.125) further below. This behavior is similar to the EV technique and is important when adaptation relies only on limited data such as short utterances.
As soon as more utterances are accumulated, e.g. with the help of a robust
speaker identification, each Gaussian density can be modified individually.
Thus, in the extreme case of large data sets EMAP behaves like the MAP
algorithm as shown later in (2.124).
First, some vectors and matrices are introduced so that EMAP adaptation can be described by the supervector notation introduced in the preceding section. All mean vectors $\mu_k$ are stacked into a supervector

$\check{\mu} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_k \\ \vdots \\ \mu_N \end{pmatrix}$ (2.117)
whereas the covariance matrix

$\check{\Sigma}^0 = \mathrm{diag}\left( \Sigma_1^0, \ldots, \Sigma_N^0 \right)$ (2.118)

represents a block matrix whose main diagonal consists of the covariance matrices $\Sigma_k^0$ of the standard codebook. The covariance matrix

$\check{\Sigma}^{EMAP} = E_i\left\{ (\check{\mu}^i - \check{\mu}^0) \cdot (\check{\mu}^i - \check{\mu}^0)^T \right\}$ (2.119)

denotes a full covariance matrix representing prior knowledge about the dependencies between all Gaussian densities in analogy to (2.97). The estimation of this matrix is done in an off-line training.
At run-time all ML estimates given by (2.81) are stacked into a long concatenated vector $\check{\mu}^{ML}$. The numbers of softly assigned feature vectors

$n_k = \sum_{t=1}^{T} p(k|x_t, \Theta_0)$ (2.120)

build the main diagonal of the matrix

$\check{C} = \mathrm{diag}\left( n_1, \ldots, n_N \right)$. (2.121)
Stern and Lasry [1987] describe the EMAP approach in detail. Therefore the derivation is omitted in this book and only the result is given. The adapted mean supervector $\check{\mu}^{EMAP}$ can be calculated by

$\check{\mu}^{EMAP} = \check{\mu}^0 + \check{\Sigma}^{EMAP} \cdot \left( \check{\Sigma}^0 + \check{C} \cdot \check{\Sigma}^{EMAP} \right)^{-1} \cdot \check{C} \cdot \left( \check{\mu}^{ML} - \check{\mu}^0 \right)$ (2.122)

as found in Jon et al. [2001], for example. Finally, the two extreme cases $n_k \to 0\ \forall k$ and $n_k \to \infty\ \forall k$ should be discussed:
• When the number of softly assigned feature vectors increases, $n_k \to \infty$, the term $\left( \check{\Sigma}^0 + \check{C} \cdot \check{\Sigma}^{EMAP} \right)^{-1}$ is dominated by $\check{C} \cdot \check{\Sigma}^{EMAP}$ so that the following approximation may be employed:

$\check{\mu}^{EMAP} \approx \check{\mu}^0 + \check{\Sigma}^{EMAP} \cdot \left( \check{C} \cdot \check{\Sigma}^{EMAP} \right)^{-1} \cdot \check{C} \cdot \left( \check{\mu}^{ML} - \check{\mu}^0 \right)$ (2.123)

$\mu_k^{EMAP} \approx \mu_k^{ML}, \quad \forall k$. (2.124)

Therefore, the MAP and EMAP estimates converge if a large data set can be used for adaptation.
• In the case of very limited data, $n_k \to 0\ \forall k$, the influence of $\check{C} \cdot \check{\Sigma}^{EMAP}$ decreases, which allows the approximation

$\check{\mu}^{EMAP} \approx \check{\mu}^0 + \check{\Sigma}^{EMAP} \cdot \left( \check{\Sigma}^0 \right)^{-1} \cdot \check{C} \cdot \left( \check{\mu}^{ML} - \check{\mu}^0 \right)$. (2.125)

The matrix $\check{\Sigma}^{EMAP} \cdot \left( \check{\Sigma}^0 \right)^{-1}$ is independent of the adaptation data and represents prior knowledge. Even limited adaptation data can be handled appropriately. The intention is therefore similar to the EV approach discussed in Sect. 2.6.5.
One disadvantage of the EMAP technique is its high computational complexity. Especially the matrix inversion in (2.122) is extremely demanding because the dimension of $\check{\Sigma}^0 + \check{C} \cdot \check{\Sigma}^{EMAP}$ is the product of the number of Gaussian densities $N$ and the dimension of the feature vectors. Since the inversion depends on the numbers of softly assigned feature vectors $n_k$, it has to be carried out at run-time [Rozzi and Stern, 1991]. Jon et al. [2001] approximate the covariance matrix $\check{\Sigma}^{EMAP}$ by some 25 eigenvectors. This approximation helps to significantly reduce the complexity of the matrix inversion.
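For illustration, the following sketch evaluates (2.122) directly and is therefore only practical for small models. The expansion of the soft counts along the feature dimension when building $\check{C}$ is an interpretation chosen for this example so that the matrix dimensions match the supervectors.

import numpy as np

def emap_adapt(mu0_sv, mu_ml_sv, n_k, Sigma0_sv, Sigma_emap, D):
    """EMAP mean update in supervector notation, cf. (2.122).

    mu0_sv     : (N*D,)      speaker independent mean supervector
    mu_ml_sv   : (N*D,)      stacked ML mean estimates
    n_k        : (N,)        soft counts of the N Gaussian densities
    Sigma0_sv  : (N*D, N*D)  block diagonal codebook covariance, cf. (2.118)
    Sigma_emap : (N*D, N*D)  inter-density covariance, cf. (2.119)
    D          : feature vector dimension
    """
    # each soft count is repeated D times on the diagonal, cf. (2.121)
    C = np.diag(np.repeat(n_k, D))
    # this (N*D) x (N*D) inversion is the run-time bottleneck discussed above
    gain = Sigma_emap @ np.linalg.inv(Sigma0_sv + C @ Sigma_emap)
    return mu0_sv + gain @ (C @ (mu_ml_sv - mu0_sv))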
2.7 Feature Vector Normalization and Enhancement
In this section the normalization and enhancement of feature vectors are introduced as alternative approaches to deal with speaker variabilities. In contrast to speaker adaptation, unwanted variations, e.g. caused by environmental effects or speakers [Song and Kim, 2005], are removed from the speech signal and are not integrated into the statistical modeling.

First, a motivation for feature normalization and enhancement as well as a brief classification of the existing techniques are given to illustrate this change in how speaker variabilities are handled. Then a selection of approaches known from the literature is presented.
2.7.1 Motivation
The main objective of speaker adaptation is to alleviate the mismatch between training and test situations. The source-filter theory in Sect. 2.1 identifies speaker specific characteristics, channel impulse responses and background noises, for example, as possible sources of such a mismatch.

Speaker adaptation seeks a better statistical representation of speech signals by integrating these discrepancies into the statistical model. Even though speaker adaptation is generally considered to produce the best results, it is computationally more demanding than speech normalization in the feature space [Buera et al., 2007].
Normalization and enhancement follow the idea of removing a mismatch between the test and training conditions, e.g. due to environmental disturbances or speaker dependencies, from the feature vectors. These fluctuations should not be visible to speech recognition.
Normalization and enhancement of feature vectors can be advantageous
when users operate speech controlled devices under various environmental
conditions and when speaker adaptation can adjust only one statistical model
for each speaker. Therefore, techniques for feature vector enhancement are
required which are able to handle different mismatch conditions [Song and
Kim, 2005].
Noise reduction and cepstral mean normalization join this procedure be-
cause they limit the effects of background noises or room impulse responses
on the feature vector space.
Vocal Tract Length Normalization (VTLN) may be applied in addition.
Inter-speaker variability due to physiological differences of the vocal tract
can be moderated by a speaker-specific frequency warping prior to cepstral
feature extraction [Garau et al., 2005; Kim et al., 2004], e.g. by modifying the
center frequencies of the mel filter bank [Garau et al., 2005; Häb-Umbach, 1999].
A variety of additional methods have been developed which perform a cor-
rection directly in the feature space. Feature normalization and enhancement
can be divided into three categories [Buera et al., 2007]:
• High-pass filtering comprises cepstral mean normalization and relative
spectral amplitude processing [Buera et al., 2007], for example.
• Model-based techniques presume a structural model to describe the
environmental degradation and apply the inverse operation for the com-
pensation at run-time [Buera et al., 2007].
• Empirical compensation applies statistical models which represent the relationship between the clean and disturbed speech feature vectors. They enable the correction or mapping of the observed disturbed feature vectors onto their corresponding clean or undisturbed representatives.
Two algorithms are presented in Sect. 2.7.2 and Sect. 2.7.3 which apply statistical models similar to those of the methods for speaker identification and speaker adaptation described in the preceding sections.
2.7.2 Stereo-based Piecewise Linear Compensation for Environments
The Stereo-based Piecewise Linear Compensation for Environ-
ments (SPLICE) technique is a well-known approach for feature vector
based compensation [Buera et al., 2007; Droppo et al., 2001; Moreno, 1996].
It represents a non-linear feature correction as a front-end of a common
speech recognizer [Droppo and Acero, 2005].
A statistical model is trained to learn the mapping of undisturbed feature vectors to disturbed observations. At run-time the inverse mapping is performed. The disturbed feature vectors $y_t$ have to be replaced by optimal estimates, e.g. given by the Minimum Mean Squared Error (MMSE) criterion.

MMSE estimates minimize, on average, the Euclidean distance between the correct clean feature vectors and the resulting estimates. The optimal estimate $x_t^{MMSE}$ is given by the expectation of the corresponding conditional probability density function $p(x_t|y_t)$ [Droppo and Acero, 2005; Hänsler, 2001; Moreno, 1996] as shown below:

$x_t^{MMSE} = E_{x|y}\{x_t\}$ (2.126)

$x_t^{MMSE} = \int_{x_t} x_t \cdot p(x_t|y_t)\, dx_t$. (2.127)
A latent variable may be introduced to represent multi-modal density functions so that a more precise modeling can be achieved. The sum over all density functions

$x_t^{MMSE} = \int_{x_t} x_t \cdot \sum_{k=1}^{N} p(x_t, k|y_t)\, dx_t$ (2.128)
reduces the joint density function $p(x_t, k|y_t)$ to the conditional density function $p(x_t|y_t)$. A further simplification is achieved by using Bayes' theorem:

$x_t^{MMSE} = \int_{x_t} x_t \cdot \sum_{k=1}^{N} p(x_t|k, y_t) \cdot p(k|y_t)\, dx_t$. (2.129)
Integral and sum can be interchanged

$x_t^{MMSE} = \sum_{k=1}^{N} p(k|y_t) \cdot \int_{x_t} x_t \cdot p(x_t|k, y_t)\, dx_t$ (2.130)

$= \sum_{k=1}^{N} p(k|y_t) \cdot E_{x|y,k}\{x_t\}$ (2.131)

due to the linearity of the expectation value.
Under the assumption that the undisturbed feature vector $x_t$ can be represented by the observation $y_t$ plus an offset vector [Moreno, 1996], equation (2.131) can be realized by

$x_t^{MMSE} = y_t + \sum_{k=1}^{N} p(k|y_t) \cdot r_k$. (2.132)

The offset vectors $r_k$ have to be trained on a stereo data set that comprises both the clean feature vectors $x$ and the corresponding disturbed feature vectors $y$ of the identical utterance.
The posterior probability $p(k|y_t)$ determines the influence of each offset vector. An additional GMM

$p(y_t) = \sum_{k=1}^{N} w_k \cdot \mathcal{N}\{y_t \,|\, \mu_k, \Sigma_k\}$ (2.133)

is introduced which represents the distribution of the observed feature vectors $y_t$, as found in Buera et al. [2007], for example. The posterior probability $p(k|y_t)$ can be calculated by combining (2.133) and (2.41).
The SPLICE technique described here only marks a basic system since a variety of variations and extensions exist. Buera et al. [2005] should be mentioned because they investigate similar techniques in the context of speaker verification and speaker identification.
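A sketch of the run-time correction is given below. It assumes a diagonal covariance auxiliary GMM and offset vectors $r_k$ already trained on stereo data; the posteriors $p(k|y_t)$ are computed from (2.133) and the estimate follows (2.132). Function name and array conventions are illustrative.

import numpy as np

def splice_enhance(Y, w, mu, var, r):
    """MMSE feature correction according to (2.132) and (2.133).

    Y   : (T, D) disturbed feature vectors y_t
    w   : (N,)   weights of the auxiliary GMM
    mu  : (N, D) mean vectors of the auxiliary GMM
    var : (N, D) diagonal covariances of the auxiliary GMM
    r   : (N, D) offset vectors trained on stereo data
    """
    # posterior p(k|y_t) of the auxiliary GMM, cf. (2.133) and (2.41)
    log_norm = -0.5 * np.log(2.0 * np.pi * var).sum(axis=1)
    diff = Y[:, None, :] - mu[None, :, :]
    log_post = np.log(w) + log_norm - 0.5 * (diff ** 2 / var).sum(axis=2)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                        # (T, N)
    # clean estimate: observation plus posterior weighted offsets, cf. (2.132)
    return Y + post @ r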
2.7.3 Eigen-Environment
The SPLICE algorithm is built on a mixture of conditional distributions $p(x_t|y_t, k)$, each contributing to the overall correction. The goal is to deliver an optimal estimate $x_t^{MMSE}$ of the clean feature vector.

Song and Kim [2005] describe an eigen-environment approach that is a combination of the SPLICE method and an eigenvector technique similar to the EV approach. The main issue is the computation or approximation of the offset vector $r_k$ by a linear combination of eigenvectors. The basic directions of the offset vectors $r_k$ and the average offset vector $e_{k,0}^{Env}$ have to be extracted in a training similar to the EV adaptation. Only the first $L$ eigenvectors $e_{k,l}^{Env}$ associated with the highest eigenvalues are employed.
At run-time a set of scalar weighting factors $\alpha_l$ has to be determined to obtain an optimal approximation of the offset vector

$r_k = e_{k,0}^{Env} + \sum_{l=1}^{L} \alpha_l \cdot e_{k,l}^{Env}$. (2.134)
Song and Kim [2005] estimate the parameters $\alpha_l$ with the help of the EM algorithm by maximizing the auxiliary function $Q^{ML}$ in a similar way as in the EV adaptation in Sect. 2.6.5. The estimate of the undisturbed feature vector can then be calculated by (2.132).

As discussed before, an auxiliary GMM is required to represent the distribution of the disturbed feature vectors. It enables the application of (2.41) to compute the posterior probability $p(k|y_t)$ which controls the influence of each offset vector.
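Given the trained eigenvectors and the estimated weights, the run-time computation of (2.134) reduces to a few lines. The following hypothetical helper uses the same array conventions as the SPLICE sketch above and returns one offset vector per Gaussian density, which can then be passed to the MMSE correction of (2.132).

import numpy as np

def eigen_env_offsets(e0, E, alpha):
    """Reconstruct the SPLICE offset vectors, cf. (2.134).

    e0    : (N, D)    average offset vectors e_{k,0}
    E     : (N, D, L) environment eigenvectors e_{k,l}
    alpha : (L,)      weights estimated via the EM algorithm
    """
    return e0 + E @ alpha   # one offset vector r_k per Gaussian density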
2.8 Summary
In this chapter the basic components of a complete system comprising speaker
change detection, speaker identification, speech recognition and speaker adap-
tation have been described.
The discussion started with the human speech production and introduced
the source-filter model. The microphone signal is explained by excitation,
vocal tract, channel characteristics and additional noise. Speaker specific ex-
citation and vocal tract as well as acquired speaker habits can contribute to
speaker variability [Campbell, 1997; O’Shaughnessy, 2000].
The front-end compensates for noise degradations and channel distortions
and provides a compact representation of the spectral properties suitable for
speech recognition and speaker identification.
Several statistical models were introduced to handle speech and speaker
variability for speaker change detection, speaker identification and speech
recognition:
BIC as a well-known method for speaker change detection uses multivari-
ate Gaussian distributions for the statistical modeling of hypothetical speaker
turns. Speaker identification extends this model by applying a mixture of
Gaussian distributions to appropriately capture speaker characteristics such
as the vocal tract. Speech recognition is usually not intended to model par-
ticular speaker characteristics but has to provide a transcription of speech
utterances. Dynamic speech properties have to be taken into consideration.
For this purpose HMMs combine GMMs with an underlying Markov process
to improve the modeling of speech.
Finally, speaker adaptation provides several strategies to modify GMMs and HMMs, aiming to reduce mismatch situations between training and test. The main difference between the techniques described in this chapter can be
seen in the degrees of freedom which have to be balanced with the amount
of available training data. Short-term and long-term adaptation have been
explained in detail.
Normalization or enhancement provides a different approach to deal with
speaker variability. Instead of learning mismatch conditions, they are removed
from the input signal. Even though speaker adaptation generally yields better
results [Buera et al., 2007], applications are possible which combine those
strategies. In fact, the combination of cepstral mean subtraction and speaker
adaptation already realizes a simple implementation.
More complex systems, e.g. speaker tracking or speaker specific speech
recognition, can be constructed based on these components as described in
the following chapters.
3 Combining Self-Learning Speaker Identification and Speech Recognition
Chapter 2 introduced the fundamentals of speaker change detection, speaker identification, speech recognition and speaker adaptation. The basic strategies relevant to the problem considered in this book were explained.
In this chapter a survey of the literature concerning the intersection of
speaker change detection, speaker identification, speaker adaptation and
speech recognition is given. The approaches are classified into three groups.
They mainly differ in the complexity of the speaker specific models and the
methods of involving speech and speaker characteristics.
A first conclusion with respect to the cited literature is drawn in the last
section. The advantages and drawbacks of the approaches are discussed and
the target system of this book is sketched.
3.1 Audio Signal Segmentation
A frequently investigated scenario can be seen in the broad class of audio
segmentation. A continuous stream of audio data has to be divided into ho-
mogeneous parts [Chen and Gopalakrishnan, 1998; Hain et al., 1998; Lu and
Zhang, 2002; Meinedo and Neto, 2003]. The criteria for homogeneity may
be the acoustic environment, channel properties, speaker identity or speaker
gender [Chen and Gopalakrishnan, 1998; Meinedo and Neto, 2003]. This pro-
cedure is known as segmentation, end-pointing or divisive segmentation.
The problem of speaker change detection is to find the beginning and end
of an utterance for each speaker. The distinction of speech and non-speech
parts can be an additional task [Meinedo and Neto, 2003]. In other words, the question to be solved is which person has spoken when [Johnson, 1999].
Speaker change detection has found its way into a broad range of appli-
cations. It can be used for broadcast news segmentation to track the anchor
speakers [Johnson, 1999], for example. Further applications are automatic
transcription systems for conferences or conversations [Lu and Zhang, 2002].
Automatic transcription systems aim at journalizing conversations and at
assigning the recognized utterances to the associated persons. Video content
analysis and audio/video retrieval can be enhanced by unsupervised speaker
change detection [Lu and Zhang, 2002]. The ability to track the speaker en-
ables more advanced speaker adaptation techniques and reduces the error
rate of unsupervised transcription systems [Kwon and Narayanan, 2002].
Speaker change detection has to handle data streams typically without
prior information about the beginning and end of a speaker’s utterance,
speaker identity or the number of speakers [Lu and Zhang, 2002]. A high
number of speaker changes has to be expected a priori for applications such
as meeting indexing. Real-time conditions can be a further challenge [Lu and
Zhang, 2002].
Divisive segmentation splits the data stream into segments that contain
only one speaker, for example. The given data segment is tested for all theo-
retical speaker turns as illustrated in Fig. 3.1(a). This can be realized by fixed-
length segmentation or by a variable window length [Kwon and Narayanan,
2002].
Fixed-length segmentation partitions the data stream into small segments
of predefined length by applying overlapping or non-overlapping window func-
tions. For example, Cheng and Wang [2003] use a first segmentation with a
window length of 12 sec and a refined window of 2 − 3 sec. Each bound-
ary is checked for a speaker change. If no change is found both blocks are
merged and the next boundary is tested. Small segments, however, may com-
plicate robust end-pointing, e.g. in the case of small speech pauses [Kwon
and Narayanan, 2002].
Alternatively, the window length can be dynamically enlarged depending
on the result of the speaker change detection [Ajmera et al., 2004]. This
enables the system to estimate more parameters. However, both the com-
putational load and time delays limit the application of complex statistical
models and extensive training algorithms for real-time applications [Lu and
Zhang, 2002].
This problem can be extended to speaker tracking. In this case an au-
tomatic speaker indexing has to be performed to retrieve all occurrences of
one speaker in a data stream [Meinedo and Neto, 2003]. Hence, agglomera-
tive clustering can be regarded as the continuation of divisive clustering. The
clustering can be solved by using a two step procedure. First, the data stream
is divided into homogeneous blocks. Then, those blocks which are assumed to
correspond to the same speaker are combined. For example, the same or sim-
ilar algorithms can be applied to the result of the speaker change detection to
test the hypothesis that non-adjacent segments originate from one speaker.
This strategy allows building larger clusters of utterances comprising only
one speaker. Thus, more robust and complex speaker specific models can be
trained [Mori and Nakagawa, 2001]. An example is given in Fig. 3.1(b).
Chen and Gopalakrishnan [1998]; Hain et al. [1998]; Johnson [1999];
Meinedo and Neto [2003]; Tritschler and Gopinath [1999]; Yella et al. [2010];
Fig. 3.1 Comparison of divisive segmentation (a) and agglomerative clustering (b). Divisive segmentation splits the continuous data stream into data blocks of predefined length which are iteratively tested for speaker turns. Based on this result the same or other algorithms are applied for agglomerative clustering to merge non-adjacent segments if they are assumed to originate from the same speaker.
Zhou and Hansen [2000] provide further detailed descriptions and several
implementations.
Segmentation and clustering can be structured into three groups [Chen
and Gopalakrishnan, 1998; Cheng and Wang, 2003; Kwon and Narayanan,
2002]:
• Metric-based techniques compute the distances of adjacent intervals of
the considered audio signal [Kwon and Narayanan, 2002]. Maxima of the
distance measure indicate possible speaker changes and a threshold de-
tection leads to the acceptance or rejection of the assumed speaker turns.
Metric-based techniques are characterized by low computational cost but
require experimentally determined thresholds [Cheng and Wang, 2003].
Furthermore, they usually span only short time intervals and thus may
not be suitable for robust distance measures [Cheng and Wang, 2003]. For example, the Euclidean distance or an information theoretic distance measure can be applied, as found in Kwon and Narayanan [2002]; Siegler et al. [1997]; Zochová and Radová [2005].
• Model-based algorithms such as BIC estimate statistical models for each segment and perform a hypothesis test; a sketch of such a test is given after this list. These models are locally trained on the segment under investigation. A low model complexity is required because of limited data. A detailed description of the BIC was given in Sect. 2.3 or can be found in Ajmera et al. [2004]; Chen and Gopalakrishnan [1998]; Tritschler and Gopinath [1999]; Zhou and Hansen [2000].
• Decoder-guided techniques extend this model-based procedure by using
pre-trained statistical models. When speaker models are trained off-line,
more sophisticated models of higher complexity and more extensive train-
ing algorithms can be applied. This approach circumvents the restriction
of locally optimized statistical modeling at the cost of an enrollment. At
run-time the Viterbi algorithm can be applied to determine the most likely
sequence of speaker occurrences in a data stream as found by Hain et al.
[1998]; Wilcox et al. [1994]. For example, the audio stream can be pro-
cessed by a speech recognizer. The transcription delivers the candidates
for speaker turns, e.g. speech pauses, which can be employed in the sub-
sequent speaker segmentation [Zochov´a and Radov´ a, 2005].
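To illustrate such a model-based hypothesis test, the following sketch evaluates a ΔBIC criterion for a hypothetical speaker turn at frame t within a window of feature vectors, in the spirit of Chen and Gopalakrishnan [1998]. Full covariance Gaussian models, the penalty weight lam corresponding to a tunable factor λ and the decision threshold of zero are assumptions of this example rather than the exact formulation of Sect. 2.3.

import numpy as np

def delta_bic(X, t, lam=1.0):
    """Test for a speaker turn at frame t; positive values indicate a change.

    X : (T, D) feature vectors of the window under investigation.
    Both segments should contain clearly more frames than dimensions.
    """
    def logdet_cov(Z):
        # log determinant of the sample covariance of a segment
        _, logdet = np.linalg.slogdet(np.cov(Z, rowvar=False))
        return logdet

    T, D = X.shape
    penalty = 0.5 * (D + 0.5 * D * (D + 1)) * np.log(T)
    return (0.5 * T * logdet_cov(X)
            - 0.5 * t * logdet_cov(X[:t])
            - 0.5 * (T - t) * logdet_cov(X[t:])
            - lam * penalty)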
Nishida and Kawahara [2005] present a multi-stage algorithm to cluster ut-
terances based on speaker change detection or statistical speaker model se-
lection. In the first implementation the variance BIC is applied for divisive
and agglomerative clustering. The second proposed framework operates ei-
ther on discrete or continuous speaker models depending on the amount of
training data. Robustness is gained by a simple discrete model during the ini-
tialization phase. The system switches in an unsupervised manner to GMM
modeling when a robust parameter estimation becomes feasible. The transi-
tion is realized by the Gaussian mixture size selection that is set up on the
BIC framework. Identification and verification methods enable the system to
merge the speaker clusters.
The result of a robust audio signal segmentation can be employed to realize
speaker specific speech recognition. As long as the current speaker does not
change the speech recognizer is able to continuously adapt the speech model
to the speaker’s characteristics.
Zhang et al. [2000] investigate a two-stage procedure of speaker change
detection and speech recognition. The first stage checks an audio signal for
speaker turns and the second stage decodes the speech signal with a refined
speaker adapted HMM. During the first stage one speaker independent HMM
and up to two speaker specific HMMs are evaluated in parallel. The ML
criterion is used to decide whether a new speaker has to be added to the
recognition system and whether a speaker change has to be assumed. If a
new speaker is indicated, a new HMM is initialized by adapting the speaker
independent HMM to the current utterance. The updated speaker specific
HMM is then used for re-recognition of the utterance and delivers the final
transcription result. In a second experiment the HMM recognition system
is extended by speaker specific GMMs and a further UBM to decrease the
computational effort. The UBM is realized by a GMM comprising 64 Gaussian
distributions. Speaker change detection is implemented comparably to the
detection of unknown speakers in Sect. 2.4.3. The most likely GMM is selected
according to the ML criterion and the associated HMM is used for speech
decoding.
3.2 Multi-Stage Speaker Identification and Speech Recognition
Many algorithms in audio signal segmentation such as BIC use simple sta-
tistical models to detect speaker changes even on few speech data. Some of
these algorithms were described in the preceding section.
Speaker identification usually relies on more complex models such as
GMMs which require a sufficient amount of training and test data. Some
publications employ the combination of both strategies as will be shown in
this section.
Figure 3.2(a) displays a possible realization of such an approach. The fea-
ture extraction transforms the audio signal into a vector representation com-
prising discriminatory features of the speech signal. VAD separates speech
and speech pauses so that only the speech signals have to be processed.
Speaker change detection clusters the feature vectors of an utterance and
assures that only one speaker is present. Speaker identification determines
the speaker identity and selects the corresponding speaker model. The sub-
sequent speaker adaptation applies common adaptation techniques to better
capture speaker characteristics. If speaker identification is combined with
speech recognition, the knowledge about the speaker identity also enables
speaker specific speech recognition.
Geiger et al. [2010] present a GMM based system for open-set on-line
speaker diarization. In the first stage audio segmentation is performed frame
by frame. A threshold criterion and a rule-based framework are applied to
determine the start and end points of speech and garbage words based on
the energy of the audio signal. MFCCs are extracted to be used for several
classification steps, e.g. speech or non-speech as well as male or female. Three
GMMs are trained off-line - male, female and garbage. They are employed
to classify each segment and to identify known speakers. When one of the
gender models yields the highest likelihood score, a new speaker is assumed
and a new model is initialized. MAP adaptation is used to initialize additional
GMMs and to continuously adapt the speaker models.
Wu et al. [2003] investigate speaker segmentation for real-time applica-
tions such as broadcast news processing. A two-stage approach is presented
comprising pre-segmentation and refinement. Pre-segmentation is realized
by a UBM which categorizes speech into three categories - reliable speaker-
related, doubtful speaker-related and unreliable speaker-related. Based on
reliable frames the dissimilarity between two segments is computed and po-
tential speaker changes are identified. In the case of an assumed speaker turn
this decision is verified by adapted GMMs. If no speaker change is found,
incremental speaker adaptation is applied to obtain a more precise speaker
model. Otherwise, the existing model is substituted by a new model initial-
ized from UBM.
Kwon and Narayanan [2005] address the issue of unsupervised speaker in-
dexing by developing a complete system combining speaker change detection,
speaker identification and speaker adaptation. Their algorithm presumes no
prior knowledge about speaker changes or the number of speakers. VAD fil-
ters the speech signal and rejects all non-speech parts such as background
noise. Only the speech signal is further processed by the speaker change de-
tection which applies model-based segmentation algorithms as introduced in
Sect. 3.1 to detect possible speaker turns. As soon as a different speaker
is assumed, the data between the last two speaker turns is clustered and a
pool of speaker models is incrementally built up and refined. The problem of
model initialization is solved by a set of reference speakers. The acoustic sim-
ilarity between particular speakers is exploited by a Monte Carlo algorithm
to construct an initial speaker model that can be further adapted to the
new speaker. This approach is called sample speaker model and is compared
to the UBM and Universal Gender Model (UGM) initialization method for
speaker adaptation. The final speaker adaptation is implemented by MAP
adaptation.
Further multi-stage procedures combining segmentation and clustering are
described by Hain et al. [1998], for example.
As soon as a confident guess concerning the speaker’s identity is available
speaker specific speech recognition becomes feasible as found by Mori and
Nakagawa [2001]. Nishida and Kawahara [2004] present an approach that
combines automatic speech recognition with a preceding speaker indexing to
obtain a higher recognition accuracy. Speaker identification controls speech
recognition so that the associated speaker specific speech models can be used
to enhance speech recognition.
These speaker indexing algorithms consist of a step-by-step processing.
VAD makes discrete decisions in speech and non-speech segments and the
speaker change detection controls the clustering and adaptation algorithm.
Therefore no confidence measure about previous decisions is available in the
subsequent processing. This loss of information can be circumvented when a
system is based on soft decisions and when the final discrete decisions, e.g.
speaker identity, are delayed until speaker adaptation.
Schmalenstroeer and Häb-Umbach [2007] describe an approach for speaker tracking. Speech segmentation, speaker change detection, speaker identification and spatial localization by a beamformer¹ are considered simultaneously.
All cues for speaker changes, speaker identity and localization can be kept
as soft quantities. This prevents a loss of information caused by threshold
detections. Each speaker represents a state of an HMM and the associated
GMM contains the speaker characteristics. The transitions of the Markov
chain can be controlled by a speaker change or tracking algorithm. The most
likely sequence of speaker identities is determined by the Viterbi algorithm
to achieve a more robust speaker tracking. This approach provides a method
of a complete speaker tracking system.
¹ Details on beamforming can be found in Krim and Viberg [1996]; Veen and Buckley [1988], for example.
3.3 Phoneme Based Speaker Identification
The goal of the algorithms in the preceding sections was to control speech
recognition by estimating the speaker identity. The opposite problem is also
known from literature. Speech recognition can enhance speaker identification
since some groups of phonemes differ in their discriminative information.
Figure 3.2(b) displays a simplified block diagram of phoneme based speaker
identification. It is pointed out that speaker identification and speech recog-
nition are interchanged compared to Fig. 3.2(a).
The configuration of the articulatory organs widely varies during speech
production as briefly described in Sect. 2.1. Hence, speaker specific voice char-
acteristics are not accurately modeled by averaged features. Acoustic discrim-
ination can improve the performance of speaker identification [Rodríguez-Liñares and García-Mateo, 1997].
Eatock and Mason [1994]; Harrag et al. [2005]; Pelecanos et al. [1999];
Thompson and Mason [1993] agree that several phoneme groups exhibit a
higher degree of speaker discriminative information. Thus, a phoneme recog-
nition before speaker identification can enhance the speaker identification
rate. These publications conclude that vowels and nasals contain the highest
speaker specific variability whereas fricatives and plosives have the lowest
impact on speaker identification.
This observation motivated Kinnunen [2002] to construct a speaker de-
pendent filter bank. The key issue is to emphasize specific frequency bands
which are characteristic for a considered group of phonemes to account for
speaker characteristics. The goal is an improved representation of the speaker
discriminative phonemes to increase the speaker identification rate.
Gutman and Bistritz [2002] apply phoneme-based GMMs to increase the
identification accuracy of GMMs. A stronger correlation of phonemes and
Gaussian mixtures is targeted. For each speaker one GMM is extensively
trained by standard training algorithms without using a phoneme classifica-
tion. Phoneme segmentation is then used to cluster the speaker’s training
data. The phonetic transcription of the TIMIT database and the Viterbi
algorithm are used for the classification. This scenario is similar to a text-
prompted speaker verification. Finally, a set of phoneme dependent GMMs
is trained for all speakers by adapting the phoneme independent GMM to
each phoneme cluster. A speaker independent background model is equally
constructed. For testing, the likelihoods of all speaker specific GMMs are
computed given the phoneme segmentation. They are normalized by the cor-
responding background model. The ratio of both likelihoods is evaluated to
accept or reject the identity claim of the current speaker. In their tests, a lower
error rate was achieved for speaker verification based on phoneme-adapted
GMMs compared to a standard GMM approach.
(a) Exemplary block diagram of a multi-stage speaker identification including speech recognition and speaker adaptation: feature extraction, speech/non-speech detection and speaker change detection precede speaker identification, which is followed by speaker adaptation and speech recognition.

(b) Exemplary block diagram of a phoneme based speaker identification with integrated speaker adaptation: feature extraction, speech/non-speech detection and speaker change detection precede speech recognition, which is followed by speaker identification and speaker adaptation.
Fig. 3.2 Comparison of two examples for multi-stage speaker identification and phoneme based speaker identification. In both cases the result of the speaker change detection triggers the recognition. Either the most probable speaker is identified to enable an enhanced speech recognition, or the spoken phrase is divided into phoneme groups by a speech recognizer to enhance speaker identification. In addition, speaker adaptation can be continuously applied to the speech and speaker models.
Gauvain et al. [1995] present a statistical modeling approach for text-
dependent and text-independent speaker verification. Phone-based HMMs are
applied to obtain a better statistical representation of speaker characteristics.
Each phone is modeled by a speaker-specific left-to-right HMM comprising
three states. 75 utterances are used for training which is realized by MAP
adaptation of an initial model comprising 35 speaker-independent context-
free phone models and 32 Gaussian distributions. For testing all speaker
models are run in parallel. For each speaker the phone-based likelihood is
computed by the Viterbi algorithm and the ML criterion is used to hypoth-
esize the speaker identity. For the text-independent verification task local
phonotactic constraints are imposed to limit the search space. For text-
dependent speaker verification the limitation is given by the concatenated
phone sequence. Both high-quality and telephone speech are examined.
Kimball et al. [1997] describe a two-stage approach for text-dependent
speaker verification. Voice signatures are investigated for electronic docu-
ments. A common speech recognizer categorizes the speech input into several
phonetic classes to train more discriminative statistical models for speaker
verification. This representation is also known as the broad phonetic category
model. The second approach applies supervised speaker adaptation to speaker
specific HMMs. The evaluation of the HMMs associated with the given text
is combined with cohort speaker normalization to solve the verification task.
Olsen [1997] also uses a two-stage approach for speaker identification. The
initial stage consists of an HMM for phone alignment whereas an RBF is
applied in the second stage for speaker identification given the phonetic
information.
Genoud et al. [1999] present a combination of speech recognition and
speaker identification with the help of a neural network. This approach offers
the opportunity of a mutual benefit because speech and speaker information
are modeled and evaluated simultaneously. The discriminative MLP receives
MFCC features at its input layer and is trained for both tasks. It comprises
two sets of output. Besides the speaker independent phoneme output, the pos-
terior probability of each phoneme is examined for several target speakers. In
their experiments 12 registered speakers are investigated in a closed-set sce-
nario. The generalization towards open-set systems including unsupervised
adaptation at the risk of error propagation is left for future research.
Park and Hazen [2002] also suggest to use a two-stage approach by combin-
ing common GMM based speaker identification and phoneme based speaker
identification. The latter relies on the phonetically structured statistical mod-
els introduced by Faltlhauser and Ruske [2001] or phonetic class GMMs.
The main issue is to use granular GMMs which resolve the phonetic content
more precisely compared to globally trained GMMs. In addition, the speaker
adaptive scoring allows the system to learn the speaker specific phoneme pro-
nunciation when sufficient speech material can be collected for a particular
speaker. The proposed identification framework hypothesizes the temporal
phoneme alignment by a speaker independent speech recognizer given a spo-
ken phrase. In parallel, GMMs are applied to pre-select a list of the best
matching speaker models which have to be processed by the second stage.
There, the phonetically refined speaker models re-score the subset of selected
speakers and lead to the final speaker identification. In summary, the speech
recognition result controls the selection of the speaker models and exploits
more information about the speech utterance than globally trained GMMs.
Rodríguez-Liñares and García-Mateo [1997] combine speech classification, speech segmentation and speaker identification by using an HMM system.
Speech signals are divided into voiced and unvoiced speech as well as the
associated transitions. In addition, speech pauses representing background
noises are modeled. These categories are modeled by four HMMs. At run-time the phonetic classification is implemented by the Viterbi algorithm. The
probabilities of all phonetic classes are accumulated and combined either by
equal weighting or by applying selective factors to investigate only the effect
of single classes. The speaker with the highest score is selected. In addition,
the speech segmentation is obtained as a by-product. In summary, speaker
specific information is mainly contained in voiced parts of an utterance. Thus,
the speaker identification rate on purely voiced parts is significantly higher
than for unvoiced speech.
Matsui and Furui [1993] consider speech recognition and speaker verification in a unified framework. The task is to verify the identity claim of
the actual user by his voice characteristics and the spoken utterance that
is prompted by the system. Matsui and Furui represent speakers by speaker
specific phoneme models which are obtained by adaptation during an enroll-
ment. At run-time the convex combination of both likelihoods of a phoneme
dependent and a phoneme independent speaker HMM enables the acceptance
or rejection of the claimed speaker identity. The result is a combined statis-
tical model for text-dependent speech recognition and speaker verification.
Reynolds and Carlson [1995] compare two approaches for text-dependent
speaker verification. The user is asked to speak 4 randomized combination
locks prompted by the system. It has to be decided either to accept or re-
ject the claimed identity. The first technique investigates a parallel com-
putation of text-independent speaker verification and speaker independent
speech recognition. The task is to verify both the speaker’s identity and the
uttered phrase. A decision logic combines the single scores of the decoupled
identifiers. The second method consists of parallel speaker specific speech
recognizers which perform both tasks. Each speaker undergoes an enrollment
phase of approximately 6 min and uses a limited vocabulary. Cohort speakers
serve as alternative models and are used for score normalization.
Nakagawa et al. [2006] propose a combined technique comprising two
statistical models. The first model is a common GMM known from the
speaker identification literature and the second one is an HMM well known
from speech recognition. Both models have shown their strengths in rep-
resenting speaker and speech information. The second model is motivated
by the fact that temporal transitions between different phonemes are not
modeled by GMMs. They can be captured by approaches coming from the
speech recognition research. Their statistical model incorporates the knowl-
edge about speech characteristics by adapting a syllable-based speaker inde-
pendent speech model to particular speakers. At run-time both systems run in
parallel and a convex combination of the independent scores integrates both
the text-independent speaker information and speech related knowledge.
3.4 First Conclusion
All approaches touched on in the preceding sections have advantages and
drawbacks with respect to the use case of this book.
The algorithms belonging to audio signal segmentation usually offer the
advantage of statistical models of low complexity which only require a small
number of parameters to be estimated. This makes short-term speaker adap-
tation feasible but also limits the capability to represent the individual prop-
erties of a speaker which could be modeled using long-term speaker adapta-
tion. Another strong point is the opportunity to omit training, which increases
the usability of speech controlled devices. However, most of these algorithms
do not easily support building up a more complex system. In general, it
does not seem possible to construct highly complex statistical models for speech
and speaker characteristics in an unsupervised manner without sophisticated
initial models [Kwon and Narayanan, 2005]. Furthermore, many of these algorithms bear the risk of being vulnerable to time-variant adverse environments, as is the case for in-car applications. Time latency is a further
critical issue for the use case of this book.
Multi-stage speaker identification overcomes the limitation of simple sta-
tistical models. For example, prior knowledge is provided in the form of an initial
model which can be adapted to particular speakers. Some techniques tackle
the problem of speaker change detection and speaker identification simulta-
neously and thus motivate a unified framework. However, a training phase for
each new speaker is usually obligatory. Furthermore, some implementations
apply a strict multi-step procedure with separate modules. Each module re-
ceives only the discrete results from the preceding stages. The final result is
often obtained by the ML criterion and does not provide confidence measures.
A further drawback is the multiple computation of speech and speaker infor-
mation even though similar statistical models and similar or even identical
input are applied.
Phoneme based speaker identification introduces a new aspect into speaker
identification. The identification of the actually spoken phonemes and
the knowledge that speaker identification is directly influenced by certain
phoneme groups allow constructing more precise speaker models at the ex-
pense of a more complex training phase. Once again the separate modeling
of speech and speakers and the application of multi-stage procedures may
be of disadvantage. This information flow is usually unidirectional. Discrete
decisions and the lack of confidence measures may be additional drawbacks.
The goal of this book is to present a solution which integrates speech
and speaker information into one statistical model. Furthermore, speech
and speaker shall be recognized simultaneously to allow real-time process-
ing on embedded devices. The intended functionality of the desired solution
is sketched in Fig. 3.3.
[Figure 3.3: block diagram. Feature vectors are extracted at the sub-symbolic level and passed to the symbolic level comprising speech, speaker and dialog; speaker adaptation either updates in-set codebooks or initializes out-of-set codebooks.]

Fig. 3.3 Draft of the target system's functionality. The speech controlled system shall comprise speech recognition, speaker identification and speaker adaptation. Interactions with a speech dialog should be possible. In the sub-symbolic level feature vectors are extracted to be used in the symbolic levels. Several profiles or codebooks are used for speech recognition. A speaker independent codebook (solid line) acts as template for further speaker specific codebooks (dashed line). The results of speaker identification and speech recognition trigger speaker adaptation either to adjust the codebooks of known speakers (in-set) or to initialize new codebooks (dotted line) in the case of a new speaker (out-of-set). Significantly higher speech recognition rates are expected compared to speaker independent speech recognition.

The critical issues are the self-organization of the complete system, detecting unknown speakers and avoiding a supervised training phase. The system
has to handle speaker models at highly different training levels and has to
provide a unified adaptation technique which permits both short-term and
long-term adaptation. The system does not start from scratch but is sup-
ported by a robust prior statistical model and adapts to particular speakers.
In the beginning, only a few adaptation parameters can be reliably estimated.
This number may be increased depending on the training level. Speaker iden-
tification has to be realized on different time scales. The entire information
is preserved by avoiding discrete decisions. Confidence measures in the form of posterior probabilities instead of likelihoods have to be established.
4
Combined Speaker Adaptation
The first component of the target system is a flexible yet robust speaker
adaptation to provide speaker specific statistical models. In this chapter a
balanced adaptation strategy is motivated. New speaker profiles can be ini-
tialized and existing profiles can be continuously adapted in an unsupervised
way. The algorithm is suitable for short-term adaptation based on a few ut-
terances and long-term adaptation when sufficient training data is available.
A first experiment has been conducted to investigate the improvement of a
speech recognizer’s performance when speakers are optimally tracked so that
the corresponding speaker models can be continuously adapted.
In the subsequent chapters, this adaptation technique will serve as the basis
for the target system comprising speaker identification, speech recognition
and speaker adaptation.
4.1 Motivation
Initializing and adapting the codebooks of the automatic speech recognizer
introduced in Sect. 2.5.3 is highly demanding since about 25,000 parameters
are involved in adapting mean vectors.
The key issue is to limit the effective number of adaptation parameters in
the case of limited data and to allow a more individual adjustment later on.
Thus, a balanced adaptation strategy relying on short-term adaptation at
the beginning and individual adaptation in the long run as well as a smooth
transition has to be implemented.
First, it is assumed that speaker identification and speech recognition de-
liver a reliable guess of the speaker identity and the transcription of the
processed speech utterance. Error propagation due to maladaptations is ne-
glected in this chapter and will be addressed later. The benefit of robust
speaker adaptation for speech recognition will be demonstrated by experi-
ments under the condition that the speaker identity is given.
4.2 Algorithm
In Sect. 2.6 an overview of several well-known adaptation techniques which
capture the voice characteristics of a particular speaker was given:
Short-term adaptation such as EV adaptation imposes a tied adaptation scheme for all Gaussian densities of a codebook. Long-term adaptation implemented by the MAP adaptation permits individual modifications in the case of a large data set. However, MAP adaptation is expected to be inefficient during the start or enrollment phase of a new speaker model as explained in Sect. 2.6.3. In Sect. 2.6.6 an optimal combination of short-term and long-term adaptation was discussed. However, the EMAP algorithm requires a costly matrix inversion and is considered to be too expensive for the purpose of this book.
In the following a simple interpolation of short-term adaptation realized by eigenvoices and long-term adaptation based on the MAP technique is investigated. Kuhn et al. [2000] present results mostly for supervised training for MAP using an EV estimate as prior. They do not report any tuning of the interpolation factor for the MAP estimate. This may explain their results for MLED=>MAP which performed worse on isolated letters compared to pure EVs in the unsupervised case. Since the transition from EV adaptation to MAP is crucial for the system presented in this work, it will be derived in closer detail.
Furthermore, the problem of efficient data acquisition for continuous adapta-
tion in an unsupervised system will be addressed. Details will be provided on
how to handle new speaker specific data and how to delay speaker adapta-
tion until a robust guess of the speaker identity is available. Robustness will
be gained by the initial modeling of a speech recognizer based on SCHMMs.
Eigenvoices have shown a very robust behavior in the experiments which
will be discussed later. Only a few parameters are required. Prior knowl-
edge about speaker adaptation effects, gender, channel impulse responses
and background noises contributes to the robustness of EV adaptation. The
key issue of the suggested adaptation scheme is to start from the prior distribution of $\mu_k^{\mathrm{EV}}$ and to derive the resulting data-driven adaptation strategy.
A two-stage procedure is applied which is comparable to the segmental
MAP algorithm in Sect. 2.6.2. Speech decoding provides an optimal state
alignment and determines the state dependent weights $w_k^s$ of a particular codebook. Then the codebook is adapted based on the training data $x_{1:T}$ of
the target speaker i. In the following derivation the auxiliary function of the
EM algorithm is extended by a term comprising prior knowledge:
$$
Q_{\mathrm{MAP}}(\Theta_i, \Theta_0) = Q_{\mathrm{ML}}(\Theta_i, \Theta_0) + \log p(\Theta_i) \tag{4.1}
$$
$$
= \sum_{t=1}^{T} \sum_{k=1}^{N} p(k \mid x_t, \Theta_0) \cdot \log p(x_t, k \mid \Theta_i) + \log p(\Theta_i) \tag{4.2}
$$
as given in (2.40) and (2.78). Only one iteration of the EM algorithm is
calculated. The state variable s is omitted. The initial parameter set
$$
\Theta_0 = \left\{ w_1^0, \ldots, w_N^0,\ \mu_1^0, \ldots, \mu_N^0,\ \Sigma_1^0, \ldots, \Sigma_N^0 \right\} \tag{4.3}
$$
4.2 Algorithm 73
is given by the standard codebook. Since only mean vectors are adapted, the
following notation is used for the speaker specific codebooks:
$$
\Theta_i = \left\{ w_1^0, \ldots, w_N^0,\ \mu_1^i, \ldots, \mu_N^i,\ \Sigma_1^0, \ldots, \Sigma_N^0 \right\}. \tag{4.4}
$$
Subsequently, the speaker index $i$ is omitted. For reasons of simplicity, $\mu_k$ and $\Sigma_k$ denote the speaker specific mean vectors to be optimized and the covariance matrices of the standard codebook.
For the following equations it is assumed that each Gaussian distribution
can be treated independently from the remaining distributions. Thus, the
prior distribution of the mean vector $\mu_k$ can be factorized or equivalently the logarithm is given by a sum of logarithms. A prior Gaussian distribution is assumed:
$$
\log p(\mu_1, \ldots, \mu_N) = \sum_{k=1}^{N} \log \mathcal{N}\left( \mu_k \mid \mu_k^{\mathrm{EV}}, \Sigma_k^{\mathrm{EV}} \right) \tag{4.5}
$$
$$
= -\frac{1}{2} \sum_{k=1}^{N} \left( \mu_k - \mu_k^{\mathrm{EV}} \right)^T (\Sigma_k^{\mathrm{EV}})^{-1} \left( \mu_k - \mu_k^{\mathrm{EV}} \right) - \frac{1}{2} \sum_{k=1}^{N} \log \left| \Sigma_k^{\mathrm{EV}} \right| - \frac{d}{2} \sum_{k=1}^{N} \log(2\pi) \tag{4.6}
$$
where the covariance matrix $\Sigma_k^{\mathrm{EV}}$ represents the uncertainty of the EV adaptation.
$Q_{\mathrm{MAP}}$ is maximized by taking the derivative with respect to the mean vector $\mu_k$ and by calculating the corresponding roots $\mu_k^{\mathrm{opt}}$. The derivation can be found in Sect. A.2. In this context, $\mu_k^{\mathrm{EV}}$ is employed as prior information about the optimized mean vector $\mu_k^{\mathrm{opt}}$. The optimization problem is solved by (A.26):
$$
n_k \cdot \Sigma_k^{-1} \cdot \left( \mu_k^{\mathrm{ML}} - \mu_k^{\mathrm{opt}} \right) = (\Sigma_k^{\mathrm{EV}})^{-1} \cdot \left( \mu_k^{\mathrm{opt}} - \mu_k^{\mathrm{EV}} \right). \tag{4.7}
$$
When $n_k$ approaches infinity, only ML estimates are applied to adjust codebooks. In the learning phase, which is characterized by limited training data, the prior knowledge incorporated into the eigenvoices still allows efficient codebook initialization and continuous adaptation.
In the next step both sides of (4.7) are multiplied by $\Sigma_k$. Equation (4.7) can be rewritten as
$$
\left( \Sigma_k \cdot (\Sigma_k^{\mathrm{EV}})^{-1} + n_k \cdot I \right) \cdot \mu_k^{\mathrm{opt}} = \Sigma_k \cdot (\Sigma_k^{\mathrm{EV}})^{-1} \cdot \mu_k^{\mathrm{EV}} + n_k \cdot \mu_k^{\mathrm{ML}} \tag{4.8}
$$
where $I$ denotes the identity matrix. The matrix $\Sigma_k \cdot (\Sigma_k^{\mathrm{EV}})^{-1}$ makes use of prior knowledge about statistical dependencies between the elements of each mean vector. However, the EV approach already contains prior knowledge as described in Sect. 2.6.5. It may be assumed that any matrix multiplication can be discarded. If the approximation
$$
\Sigma_k \cdot (\Sigma_k^{\mathrm{EV}})^{-1} = \lambda \cdot I, \qquad \lambda = \mathrm{const} \tag{4.9}
$$
is used, the solution of the optimization problem becomes quite easy to interpret [Herbig et al., 2010c]:
$$
\mu_k^{\mathrm{opt}} = (1 - \alpha_k) \cdot \mu_k^{\mathrm{EV}} + \alpha_k \cdot \mu_k^{\mathrm{ML}} \tag{4.10}
$$
$$
\alpha_k = \frac{n_k}{n_k + \lambda}. \tag{4.11}
$$
Since this adaptation scheme interpolates EV and ML estimates within the
MAP framework, it is called EV-MAP adaptation in the following.
When sufficient speaker specific data is available, the convex combination approaches ML estimates such as the MAP adaptation¹ described in Sect. 2.6.3.

¹ In the following, the term MAP adaptation refers to the implementation discussed in Sect. 2.6.3. EV-MAP adaptation denotes the adaptation scheme explained in this chapter.
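To make the update rule concrete, the following minimal sketch implements (4.10) and (4.11) for a single Gaussian density. Python/NumPy is assumed; the function name is illustrative, and the default λ = 12 simply reflects the best-performing value reported later in Sect. 4.3:

```python
import numpy as np

def ev_map_update(mu_ev, mu_ml, n_k, lam=12.0):
    """EV-MAP interpolation of one codebook mean vector, cf. (4.10), (4.11).

    mu_ev -- EV estimate (prior knowledge), shape (d,)
    mu_ml -- ML estimate accumulated from the assigned feature vectors, shape (d,)
    n_k   -- soft count of feature vectors assigned to Gaussian k
    lam   -- tuning parameter: lam -> 0 gives pure ML, lam -> infinity pure EV
    """
    alpha_k = n_k / (n_k + lam)                       # (4.11)
    return (1.0 - alpha_k) * mu_ev + alpha_k * mu_ml  # (4.10)
```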
Few data causes this adaptation scheme to resemble the EV estimate $\mu_k^{\mathrm{EV}}$. The latter is decomposed into a linear combination of the eigenvoices as defined in (2.101). The corresponding weights are calculated according to (2.116). Only the sufficient statistics of the standard codebook
$$
n_k = \sum_{t=1}^{T} p(k \mid x_t, \Theta_0) \tag{4.12}
$$
$$
\mu_k^{\mathrm{ML}} = \frac{1}{n_k} \sum_{t=1}^{T} p(k \mid x_t, \Theta_0) \cdot x_t \tag{4.13}
$$
are required. However, nearly untrained Gaussian densities do not provide reliable ML estimates. In fact, this computation may lead to diverging values of $\mu_k^{\mathrm{ML}}$ and may force extreme weighting factors in (2.116). This may reduce the performance of speaker identification and speech recognition. Thus, the ML estimates in (2.116) are replaced here by MAP estimates as given in (2.85). For sufficiently large $n_k$ there is no difference since $\mu_k^{\mathrm{MAP}}$ and $\mu_k^{\mathrm{ML}}$ converge. If $n_k$ is small for a particular Gaussian distribution, the MAP estimate guarantees that approximately the original mean vector is used, $\mu_k^{\mathrm{MAP}} \approx \mu_k^0$. Those mean vectors have only a limited influence on the result of the set of linear equations in (2.116). Furthermore, the EV adaptation was only calculated in the experiments carried out when sufficient speech data ($\geq 0.5$ sec) was available.
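A sketch of this safeguard, assuming the standard MAP smoothing of the mean in the spirit of (2.85); the function name and the value of η are illustrative:

```python
def map_safeguarded_mean(mu_0, mu_ml, n_k, eta=8.0):
    """MAP smoothing of an ML mean before the eigenvoice weight computation.

    For n_k >> eta the estimate approaches mu_ml; for n_k -> 0 it falls back
    to the standard codebook mean mu_0, so sparsely observed Gaussians cannot
    force extreme weighting factors in (2.116)."""
    return (eta * mu_0 + n_k * mu_ml) / (eta + n_k)
```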
The ML estimates have to be continuously updated so that $\mu_k^{\mathrm{EV}}$ and $\mu_k^{\mathrm{opt}}$ can be computed. Since the ML estimates $\mu_k^{\mathrm{ML}}$ are weighted sums of all observed feature vectors assigned to a particular Gaussian distribution, a recursive computation is possible. The updated ML estimates $\mu_k^{\mathrm{ML}}$ are weighted sums of the preceding estimates $\bar{\mu}_k^{\mathrm{ML}}$ and an innovation term:
$$
n_k = \bar{n}_k + \sum_{t=1}^{T} p(k \mid x_t, \Theta_0) \tag{4.14}
$$
$$
\mu_k^{\mathrm{ML}} = \frac{\bar{n}_k}{n_k} \cdot \bar{\mu}_k^{\mathrm{ML}} + \frac{1}{n_k} \cdot \sum_{t=1}^{T} p(k \mid x_t, \Theta_0) \cdot x_t \tag{4.15}
$$
where $\bar{n}_k$ denotes the number of the assigned feature vectors up to the last utterance. This allows efficient long-term speaker tracking which will be discussed later. Speaker specific adaptation data can be collected before a guess of the speaker identity is available.
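The recursion (4.14), (4.15) can be sketched as follows (Python/NumPy, illustrative names):

```python
import numpy as np

def update_sufficient_stats(n_bar, mu_bar, posteriors, X):
    """Recursive update of the sufficient statistics, cf. (4.14), (4.15).

    n_bar      -- soft count accumulated up to the last utterance
    mu_bar     -- previous ML mean estimate, shape (d,)
    posteriors -- p(k | x_t, Theta_0) for Gaussian k, shape (T,)
    X          -- feature vectors of the current utterance, shape (T, d)
    """
    n_k = n_bar + posteriors.sum()                            # (4.14)
    mu_ml = (n_bar / n_k) * mu_bar \
            + (posteriors[:, None] * X).sum(axis=0) / n_k     # (4.15)
    return n_k, mu_ml
```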
A concrete realization should apply an exponential weighting window for $\mu_k^{\mathrm{ML}}$ and $n_k$. A permanent storage may be undesirable, since speaker characteristics significantly change over time² [Furui, 2009]. In this book no window is applied because time-variance is not a critical issue for the investigated speech database.

² A study of intra-session and inter-session variability can be found by Godin and Hansen [2010].
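For illustration only, such a window could be realized by discounting the old statistics before each update; ρ is an assumed forgetting factor, not a value used in the experiments of this book:

```python
def update_with_forgetting(n_bar, mu_bar, posteriors, X, rho=0.98):
    """Variant of (4.14), (4.15) with an exponential weighting window:
    old statistics are scaled by rho < 1 so that the estimates can track
    slowly varying speaker characteristics. Sketch under assumptions."""
    n_k = rho * n_bar + posteriors.sum()
    mu_ml = (rho * n_bar / n_k) * mu_bar \
            + (posteriors[:, None] * X).sum(axis=0) / n_k
    return n_k, mu_ml
```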
Finally, the solution of (4.8) may be discussed with respect to the EMAP solution in (2.122). Equation (4.8) is equivalent to
$$
\left( \Sigma_k (\Sigma_k^{\mathrm{EV}})^{-1} + n_k I \right) \mu_k^{\mathrm{opt}} = \left( \Sigma_k (\Sigma_k^{\mathrm{EV}})^{-1} + n_k I \right) \mu_k^{\mathrm{EV}} + n_k \left( \mu_k^{\mathrm{ML}} - \mu_k^{\mathrm{EV}} \right). \tag{4.16}
$$
Both sides are multiplied by $\left( \Sigma_k (\Sigma_k^{\mathrm{EV}})^{-1} + n_k I \right)^{-1}$ to obtain the following term
$$
\mu_k^{\mathrm{opt}} = \mu_k^{\mathrm{EV}} + n_k \cdot \left( \Sigma_k (\Sigma_k^{\mathrm{EV}})^{-1} + n_k I \right)^{-1} \cdot \left( \mu_k^{\mathrm{ML}} - \mu_k^{\mathrm{EV}} \right) \tag{4.17}
$$
which is equivalent to
$$
\mu_k^{\mathrm{opt}} = \mu_k^{\mathrm{EV}} + n_k \cdot \Sigma_k^{\mathrm{EV}} \cdot \left( \Sigma_k + n_k \cdot \Sigma_k^{\mathrm{EV}} \right)^{-1} \cdot \left( \mu_k^{\mathrm{ML}} - \mu_k^{\mathrm{EV}} \right). \tag{4.18}
$$
A comparison with (2.122) identifies a similar structure of both adaptation terms. The matrix $\Sigma_k^{\mathrm{EV}}$ corresponds to $\check{\Sigma}_{\mathrm{EMAP}}$. Despite these structural similarities, there is a basic difference between EMAP and the adaptation scheme proposed here. EMAP requires the computation and especially the matrix inversion in the supervector space. The prior knowledge is represented by $\check{\Sigma}_{\mathrm{EMAP}}$. Combining EV and ML estimates benefits from statistical bindings in the EV estimation process. However, this approach works in the feature space which is characterized by a lower dimensionality.
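The matrix form (4.18) can be evaluated directly with full matrices; a sketch assuming NumPy and illustrative names. Note that all matrices are d × d, i.e. the computation stays in the feature space, whereas EMAP inverts a matrix in the supervector space:

```python
import numpy as np

def ev_map_full_cov(mu_ev, mu_ml, n_k, sigma_k, sigma_ev):
    """Full matrix form of the combined estimate, cf. (4.18).

    sigma_k  -- covariance of Gaussian k in the standard codebook, (d, d)
    sigma_ev -- uncertainty of the EV adaptation for Gaussian k, (d, d)
    """
    A = sigma_k + n_k * sigma_ev
    return mu_ev + n_k * sigma_ev @ np.linalg.solve(A, mu_ml - mu_ev)
```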
Botterweck [2001] describes a similar combination of MAP and EV estimates. The original equation is re-written as
$$
\check{\mu} = \check{\mu}_0 + \check{P} \cdot \left( \check{\mu}_{\mathrm{MAP}} - \check{\mu}_0 \right) + \sum_{l=1}^{L} \alpha_l \cdot \check{e}_l^{\mathrm{EV}} \tag{4.19}
$$
to agree with the notation of Sect. 2.6. Using the MAP framework, the coefficients $\alpha_l$ of the EVs have to be determined by optimizing the auxiliary function of the EM algorithm. $\check{\mu}_0$ represents all mean vectors of the standard codebook in supervector notation. The matrix $\check{P}$ guarantees that the standard MAP estimate only takes effect in the directions orthogonal to all eigenvoices. Adaptation is performed in the supervector space. Due to the matrix multiplication it is more expensive than the solution derived above.
4.3 Evaluation
Several experiments have been conducted with different speaker adaptation
methods to investigate the benefit which can be achieved by a robust speaker
identification. The results are presented and discussed in detail.
First, the database is introduced and the composition of the selected subset
is explained. This subset is subsequently used for all evaluations. The eval-
uation setup is described in more detail. It is employed to simulate realistic
tasks for self-learning speaker identification and speech recognition. Finally,
the results for closed-set experiments are presented.
4.3.1 Database
SPEECON³
is an extensive speech database which was collected by a consor-
tium comprising several industrial partners. The database comprises about
20 languages and is intended to support the development of voice-driven inter-
faces of consumer applications. Each language is represented by 600 speakers
comprising 550 adult speakers. The recordings were done under realistic con-
ditions and environments such as home, office, public places and in-car. Both
read and spontaneous speech are covered.
In this book, a subset of the US-SPEECON database is used for evaluation.
This subset comprises 73 speakers (50 male and 23 female speakers) recorded
in an automotive environment. The Lombard effect is considered. The sound
files were down-sampled from 16 kHz to 11.025 kHz. Only AKG microphone
recordings were used for the experiments. Colloquial utterances with more
than 4 words were removed to obtain a realistic command and control appli-
cation. Utterances containing mispronunciations were also discarded. Digit
and spelling loops were kept irrespective of their length. This results in at
least 250 utterances per speaker which are ordered in the sequence of the
recording session.
³ The general description of the SPEECON database follows Iskra et al. [2002].
4.3.2 Evaluation Setup
In the following the configuration of the test set, baseline and speaker adapta-
tion is explained. Furthermore, the evaluation measures for speech recognition
and speaker identification are introduced.
Test Set
The evaluation was performed on 60 sets. Each set comprises 5 enrolled speakers⁴ to ensure independence of the speaker set composition which is chosen randomly. The composition of female and male speakers in a group is not balanced. At the beginning of each speaker set an unsupervised enrollment takes place. It consists of 10 utterances for each speaker as shown in Fig. 4.1.

⁴ When speaker sets with 10 enrolled speakers are examined in Chapter 6, another evaluation set comprising 30 different speaker sets is used.
In Chapter 5 and 6 closed-set and open-set scenarios will be discussed.
For closed sets the system has to identify a speaker from a list of enrolled
speakers. Thus, the first two utterances of each new speaker are indicated
during enrollment. For the remaining 8 utterances the system has to identify
the speaker in an unsupervised manner. The corresponding speaker specific
codebook has to compete against existing speaker models. In an open-set
scenario the system has to decide when a new speaker model has to be added
to the existing speaker profiles.
After this enrollment at least 5 utterances are spoken before the speaker
changes. From one utterance to the next the probability of a speaker change
is approximately 10 %. The order of the speakers in each set is chosen randomly. For each set all utterances of each speaker are used only once. Thus, every set provides 1,250 utterances originating from 5 speakers. This results in 75,000 utterances used for evaluation. The speaker specific codebooks are
continuously adapted. The configuration of the test investigates the robust-
ness of the start phase, error propagation and long-term speaker adaptation.
No additional information concerning speaker changes and speaker identities
will be given in Chapter 5 and 6.
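The protocol can be illustrated by a small simulation. The sketch below (plain Python, all names illustrative) draws one speaker sequence with an enrollment block, a change probability of roughly 10 % per utterance and at least 5 utterances per turn; with the default values it yields the 1,250 utterances of one set:

```python
import random

def simulate_speaker_sequence(num_speakers=5, utts_per_speaker=250,
                              p_change=0.1, min_run=5, enroll=10, seed=0):
    """Draw a speaker order as in the evaluation setup (illustrative)."""
    rng = random.Random(seed)
    remaining = {s: utts_per_speaker - enroll for s in range(num_speakers)}
    # enrollment: 10 utterances per speaker, speakers in fixed order
    seq = [s for s in range(num_speakers) for _ in range(enroll)]
    cur, run = rng.choice(list(remaining)), 0
    while any(remaining.values()):
        # switch if the current speaker is exhausted or, after at least
        # min_run utterances, with probability p_change
        if remaining[cur] == 0 or (run >= min_run and rng.random() < p_change):
            choices = [s for s, r in remaining.items() if r > 0 and s != cur]
            if choices:
                cur, run = rng.choice(choices), 0
        seq.append(cur)
        remaining[cur] -= 1
        run += 1
    return seq   # length: num_speakers * utts_per_speaker = 1250
```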
Baseline
The speech recognizer without any speaker adaptation or speaker profiles acts
as the baseline for all subsequent experiments. The speech recognizer applies
grammars for digit and spelling loops as well as for numbers. Finally, a gram-
mar was generated which contains all remaining utterances (≈ 2,000 com-
mand and control utterances). For testing the corresponding grammar was
selected prior to each utterance. The speech recognizer was trained as de-
scribed by Class et al. [1993].
[Figure 4.1: two panels, (a) Closed-set and (b) Open-set. Each panel shows an enrollment block (Sp 1 to Sp 5) followed by a test block with randomly ordered speaker turns (e.g. Sp 3, Sp 5, Sp 2, Sp 1, ...).]
Fig. 4.1 Block diagram of the setup which is used for all experiments of this book.
The setup consists of two blocks. First, a start phase is used in which new speaker
specific codebooks are initialized. However, the speaker identity is only given in
the supervised parts (diagonal crossed boxes) of Fig. 4.1(a). For the remaining
utterances speaker identification is delegated to the speech recognizer or a com-
mon speaker identification system. Then speakers are randomly selected. At least
5 utterances are spoken by each speaker before a speaker turn takes place.
Speaker Adaptation
In an off-line training the main directions of speaker variability in the fea-
ture space were extracted. For this purpose codebooks were trained by MAP
adaptation for a large set of speakers of the USKCP⁵ development database. The mean vectors of each codebook were stacked into long vectors. The eigenvoices were calculated using PCA.

⁵ The USKCP is a speech database internally collected by TEMIC Speech Dialog Systems, Ulm, Germany. The USKCP comprises command and control utterances recorded in an automotive environment. The language is US-English.
To simulate continuous speaker adaptation without speaker identification,
short-term adaptation is implemented by a modified EV approach. A smooth-
ing of the weighting factors α is applied by introducing an exponential weight-
ing window in (4.14) and (4.15). It guarantees that speaker changes are
captured within approximately 5 utterances if no speaker identification is
employed. However, speech recognition is still affected by frequent speaker
changes.
Evaluation Measures
Speech recognition is evaluated by the Word Accuracy (WA)
$$
\mathrm{WA} = 100 \cdot \left( 1 - \frac{W_S + W_I + W_D}{W} \right) \%. \tag{4.20}
$$
$W$ represents the number of words in the reference string and $W_S$, $W_I$ and $W_D$ denote the number of substituted, inserted and deleted words in the recognized string [Boros et al., 1996]. Speaker identification is evaluated by the
identification rate. The typical error is given by a confidence interval assum-
ing independent data sets. Only the intervals of the best and the worst result
are given. When a test is repeated on an identical test set, a paired differ-
ence test may be more appropriate to evaluate the statistical significance. In
Sect. A.3 the evaluation measures are described in detail.
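For illustration, (4.20) in code; the error counts would come from a Levenshtein alignment of reference and recognized word strings, and all names are illustrative:

```python
def word_accuracy(w, w_s, w_i, w_d):
    """Word Accuracy in percent, cf. (4.20).

    w             -- number of words in the reference string
    w_s, w_i, w_d -- substituted, inserted and deleted words"""
    return 100.0 * (1.0 - (w_s + w_i + w_d) / w)

# Example: 10,000 reference words with 800 substitutions, 200 insertions
# and 150 deletions give WA = 100 * (1 - 1150 / 10000) = 88.5 %.
```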
4.3.3 Results
Before a realistic closed-set scenario is examined in detail, an upper bound
for unsupervised speaker adaptation is investigated [Herbig et al., 2010c]. It
determines the optimal WA of the speech recognizer which can be achieved by
a perfect speaker identification combined with several adaptation strategies.
In addition, feature extraction is externally controlled so that energy normal-
ization and mean subtraction are updated or reset before the next utterance
is processed. Nevertheless, unsupervised speaker adaptation has to handle er-
rors of speech recognition. Short-term adaptation and the speaker adaptation
scheme introduced in Sect. 4.2 are compared to the baseline system.
The tuning parameter λ in (4.11) is evaluated for specific values. For λ ≈ 0
the adaptation relies only on ML estimates. In contrast, λ → ∞ is the re-
verse extreme case since EV estimates are used irrespective of the amount
of speaker specific data. The latter case is expected to limit the improve-
ment of word accuracy. In addition, the performance of MAP adaptation
is tested for dedicated values of the tuning parameter η used in (2.84)
and (2.85).
In Table 4.1 and Fig. 4.2 the speech recognition rates are compared for the
complete evaluation test set under the condition that the speaker identity is
known. The results show a significant improvement of WA when the EV-
MAP speaker adaptation is compared to baseline and short-term adaptation.
Only about 6 % relative error rate reduction is achieved by short-term adapta-
tion compared to the baseline. EV adaptation combined with perfect speaker
identification yields about 20 % relative error rate reduction. 88.94 % WA
and 25 % relative error rate reduction with respect to the speaker independent baseline are achieved using λ = 12. Obviously, λ can be selected in a relatively wide range without negatively affecting the performance of the speech recognizer. The results for pure EV or MAP adaptation are worse as expected. MAP adaptation yields better results than the ML or EV approach [Herbig et al., 2010c].

Table 4.1 Comparison of different adaptation techniques under the condition of perfect speaker identification. The baseline is the speaker independent speech recognizer without any speaker adaptation. Short-term adaptation without speaker identification is realized by EV adaptation with an exponential weighting window. This adaptation is based only on a few utterances. EV-MAP speaker adaptation is tested for several dedicated values of the parameter λ. Speaker pools with 5 speakers are investigated. Table is taken from [Herbig et al., 2010c].

Speaker adaptation        WA [%]
Baseline                  85.23
Short-term adaptation     86.13
EV-MAP adaptation
  ML (λ ≈ 0)              88.44
  λ = 4                   88.87
  λ = 8                   88.86
  λ = 12                  88.94
  λ = 16                  88.86
  λ = 20                  88.90
  EV (λ → ∞)              88.19
MAP adaptation
  η = 4                   88.62
  η = 8                   88.62
  η = 12                  88.57
Typical errors
  min                     ±0.22
  max                     ±0.25
In Table 4.2 the evaluation is split into several evaluation bands repre-
senting different adaptation stages. The effect of speaker adaptation is eval-
uated for weakly, moderately and extensively trained codebooks. Since the
evaluation is done on different speech utterances the recognition rates and
improvements are only comparable within each evaluation band.
The implementations which interpolate EV and ML estimates yield con-
sistent results for all evaluation bands. On the evaluation bands I and III
the main difference between ML estimates and EV adaptation can be seen.
On the first utterances EV adaptation is superior to ML estimates which
will become obvious in the following chapters. In the long-run MAP or ML
estimates allow more individual codebook adaptation. They outperform the
EV approach in the long run, but suffer from poor performance during the
learning phase. Therefore the EV approach performs worse than ML estimates on the complete test set.

[Figure 4.2: bar chart of WA [%] over the adaptation methods BL, ML, λ = 4, 8, 12, 16, 20, EV and ST; vertical axis from 85 % to 90 %.]

Fig. 4.2 Results for speaker adaptation with predefined speaker identity. Furthermore, the speaker independent baseline (BL) and short-term adaptation (ST) are depicted.

Table 4.2 Comparison of different adaptation techniques under the condition of perfect speaker identification. The evaluation is split into several evaluation bands according to the adaptation stage in terms of recognized utterances. Speech recognition results are presented for several evaluation bands: I = [1; 50[, II = [50; 100[, III = [100; 250] utterances.

Speaker adaptation        I WA [%]   II WA [%]   III WA [%]
Baseline                  84.51      86.84       85.02
Short-term adaptation     85.89      87.22       85.87
EV-MAP adaptation
  ML (λ ≈ 0)              87.77      89.17       88.50
  λ = 4                   88.43      89.67       88.80
  λ = 8                   88.57      89.97       88.62
  λ = 12                  88.51      89.95       88.79
  λ = 16                  88.35      89.91       88.74
  λ = 20                  88.37      89.83       88.84
  EV (λ → ∞)              87.99      89.03       88.00
MAP adaptation
  η = 4                   87.98      89.36       88.67
  η = 8                   87.69      89.54       88.73
  η = 12                  87.56      89.42       88.75
Typical errors
  min                     ±0.51      ±0.48       ±0.29
  max                     ±0.58      ±0.54       ±0.33
It should be emphasized that this experiment was conducted under optimal
conditions. In the following chapters this experiment will be referred to as an upper bound for tests under realistic conditions where the speaker has to be identified in an unsupervised way. The ML approach will suffer from poor
performance when the number of utterances is smaller than ≈ 40 as will be
shown in Chapter 6. In this case the recognition accuracy is expected to be
limited in the long run.
4.4 Summary
A flexible adaptation scheme was presented and discussed. Speaker specific
codebooks can be initialized and continuously adapted based on the standard
codebook. The algorithm provides a data-driven smooth transition between
globally estimated mean vectors and locally optimized mean vectors permit-
ting individual adjustment. Computational complexity is negligible compared
to the EMAP algorithm. ML estimates do not have to be re-calculated as be-
ing already available from the EV adaptation. The interpolation of EV and
ML estimates only requires calculating the weighting factors $\alpha_k$ and the sum of the weighted estimates for each Gaussian density according to (4.10).
When the system has perfect knowledge of the identity of the current
speaker, a relative error rate reduction of 25 % can be achieved via the adapta-
tion techniques described here. Both the MAP and EV adaptation described
in Sect. 2.6.3 and Sect. 2.6.5 have performed worse than the combined ap-
proach since a fast convergence during the initialization, an individual training on extensive adaptation data and a smooth transition between both are required.
The conditions of this experiment may be realistic for some systems: A
system may ask each speaker to identify himself when he starts a speaker turn
or when the system is reset. However, this assumption is expected to limit
the application of some adaptation techniques to use cases where the speaker
has to be identified in an unsupervised manner. Thus, several techniques for
speaker identification are discussed in the next chapters to approach this
upper bound.
5
Unsupervised Speech Controlled
System with Long-Term Adaptation
Chapters 2 and 3 provided a general overview of problems related to the
scope of this book. The fundamentals and existing solutions as known from
literature were discussed in detail.
In the preceding chapter a combination of short-term and long-term adap-
tation was examined. Speaker profiles can be initialized and continuously
adapted under the constraint that the speaker has to be known a priori.
The experiments carried out showed a significant improvement for speech
recognition if speaker identity is known or can be at least reliably estimated.
In this chapter a flexible and efficient technique for speaker specific speech
recognition is presented. This technique does not force the user to identify
himself. Speaker variabilities are handled by an unsupervised speech con-
trolled system comprising speech recognition and speaker identification.
The discussion starts with the problem to be solved and motivates the
new approach. Then a unified speaker identification and speech recognition
method is presented to determine the speaker’s identity and to decode the
speech utterance simultaneously. A unified modeling is described which han-
dles both aspects of speaker and speech variability. The basic architecture of
the target system including the modules speaker identification, speech recog-
nition and speaker adaptation is briefly discussed.
A reference implementation is then described. It comprises a standard
speaker identification technique in parallel to the speaker specific speech
recognition as well as speaker adaptation of both statistical models. It is
used as the reference for the subsequent experiments.
Both implementations are evaluated for an in-car application. Finally, both
systems are discussed with respect to the problem of this book and an ex-
tension of the speaker identification method is motivated.
5.1 Motivation
In the preceding chapters the effects of speaker variability on speech recogni-
tion were described and several techniques which solve the problem of speaker
tracking and unsupervised adaptation were outlined. Each of the approaches
described covers a part of the use case in this book.
The main goal is now to construct a complete system which includes
speaker identification, speech recognition and speaker adaptation. Ideally,
it shall be operated in a completely unsupervised manner so that the user
can interact with the system without any additional intervention.
Simultaneous speaker identification and speech recognition can be described as a MAP estimation problem
$$
\left( W_{1:N_W}^{\mathrm{MAP}}, i^{\mathrm{MAP}} \right) = \operatorname*{arg\,max}_{W_{1:N_W},\, i} \left\{ p(W_{1:N_W}, i \mid x_{1:T}) \right\} \tag{5.1}
$$
$$
= \operatorname*{arg\,max}_{W_{1:N_W},\, i} \left\{ p(W_{1:N_W} \mid i, x_{1:T}) \cdot p(i \mid x_{1:T}) \right\} \tag{5.2}
$$
as found by Nakagawa et al. [2006]. The parameter set $\Theta_i$ of the speaker specific codebook is omitted. The MAP criterion is applied to the joint posterior probability of the spoken phrase $W_{1:N_W}$ and speaker identity $i$ given the observed feature vectors $x_{1:T}$.
An approach is subsequently presented which is characterized by a mutual
benefit between speaker identification and speech recognition. Computational
load and latency are reduced by an on-line estimation of the speaker iden-
tity. For each feature vector the most probable speaker is determined. The
associated speaker profile enables speech decoding. An efficient implementa-
tion with respect to computational complexity and memory consumption is
targeted.
Models used in speaker change detection, speaker identification and speech
recognition are of increasing complexity. BIC is based on single Gaussian dis-
tributions whereas the speech recognizer introduced in Sect. 2.5.3 is realized
by a mixture of Gaussian distributions with an underlying Markov model.
[Figure 5.1: the front-end extracts feature vectors $x_t$; speech recognition and speaker identification share one statistical model; both results drive speaker adaptation; the output is the transcription.]
Fig. 5.1 Unified speaker identification and speaker specific speech recognition. One
feature extraction and only one statistical model are employed for speaker identifi-
cation and speech recognition. The estimated speaker identity and the transcription
of the spoken phrase are used to initialize and continuously adapt speaker specific
models. Speaker identification and speech recognition of successive utterances are
enhanced.
The goal is to show in this book that the problem of speaker variability
can be solved by a compact statistical modeling which allows unsupervised
simultaneous speaker identification and speech recognition in a self-evolving
system [Herbig and Gerl, 2009; Herbig et al., 2010d].
The subsequent sections describe a unified statistical model for both tasks
as an extension of the speaker independent speech recognizer which was in-
troduced in Sect. 2.5.3. It is shown how a robust self-learning speaker iden-
tification with enhanced speech recognition can be achieved for a closed-set
scenario. The system structure is depicted in Fig. 5.1.
5.2 Joint Speaker Identification and Speech
Recognition
In the preceding chapter an appropriate speaker adaptation technique was
introduced to initialize and adapt codebooks of a speech recognizer. However,
the derivation of the adaptation scheme was done under the condition that
speaker identity and the spoken text are known. Now speaker specific speech
recognition and speaker identification have to be realized.
The implementation of an automated speech recognizer presented in
Sect. 2.5.3 is based on a speaker independent codebook. As shown in Fig. 2.11
the feature vectors extracted by the front-end are compared with a speaker
independent codebook. For each time instance the likelihood $p(x_t \mid k, \Theta_0)$ is computed for each Gaussian distribution with index $k$. The resulting vector $q_t$ given by (2.67) and (2.68) comprises all likelihoods and is used for the subsequent speech decoding.
When the speaker’s identity is given, speaker adaptation can provide in-
dividually adapted codebooks which increase the speech recognition rate for
each speaker. Therefore, the speech recognizer has to be able to handle several
speaker profiles in an efficient way.
In Fig. 5.2 a possible realization of an advanced speaker specific speech recognizer is shown. The speech recognizer is extended by $N_{\mathrm{Sp}}$ speaker specific codebooks which are operated in parallel to the speaker independent codebook. This approach allows accurate speech modeling for each speaker. However, the parallel computation significantly increases the computational load since the speech decoder has to process $N_{\mathrm{Sp}} + 1$ data streams in parallel. The Viterbi algorithm is more computationally complex than a codebook
selection strategy on a frame-level, especially for a large set of speaker specific codebooks operated in parallel.

[Figure 5.2: each of the $N_{\mathrm{Sp}}$ speaker specific codebooks feeds a soft quantization $q_t^i$ into a separate speech decoder.]

Fig. 5.2 Block diagram for a speaker specific ASR system. $N_{\mathrm{Sp}}$ speaker specific codebooks and speech decoders are operated in parallel to generate a transcription for each speaker profile.
Thus, it seems to be advantageous to continuously forward only the result $q_t^i$ of the codebook belonging to the current speaker $i$. A solution without parallel speech decoding for each codebook is targeted. Such an approach, however, requires the knowledge of the current speaker identity for an optimal codebook selection.
The computational complexity can be significantly decreased by a two-stage approach. The solution of (5.2) can be approximated by the following two steps. First, the most probable speaker is determined by
$$
i^{\mathrm{MAP}} = \operatorname*{arg\,max}_{i} \left\{ p(i \mid x_{1:T}) \right\} \tag{5.3}
$$
which can be implemented by standard speaker identification methods. In the second step, the complete utterance can be reprocessed. The corresponding codebook can then be employed in the speech decoder to generate a transcription. The MAP estimation problem
$$
W_{1:N_W}^{\mathrm{MAP}} = \operatorname*{arg\,max}_{W_{1:N_W}} \left\{ p(W_{1:N_W} \mid x_{1:T}, i^{\mathrm{MAP}}) \right\} \tag{5.4}
$$
is then solved given the speaker identity $i^{\mathrm{MAP}}$. For example, the most likely word string can be determined by the Viterbi algorithm which was described in Sect. 2.5.2. The disadvantages of such an approach are the need to buffer the entire utterance and the latency caused by reprocessing.
Thus, speaker identification is implemented here on two time scales. A
fast but probably less confident identification selects the optimal codebook
on a frame-level and enables speaker specific speech recognition. In addition,
speaker identification on an utterance-level determines the current speaker
and provides an improved guess of the speaker identity for speaker adapta-
tion. The latter is performed after speech decoding. The goal is to identify the
current user and to decode the spoken phrase in only one iteration [Herbig
et al., 2010d].
5.2.1 Speaker Specific Speech Recognition
An appropriate speech model has to be selected on a frame-level for speech
decoding. Since high latencies of the speech recognition result and multiple
recognitions have to be avoided, the speech recognizer has to be able to switch
rapidly between speaker profiles at the risk of not always applying the correct
speech model.
On the other hand, if the speech recognizer selects an improper speaker
specific codebook because two speakers yield similar pronunciation patterns,
the effect on speech recognition should be acceptable and no serious increase
in the error rate is expected. False speaker identifications lead to mixed code-
books when the wrong identity is employed in speaker adaptation. It seems to
be evident that this kind of error can have a more severe impact on the long-term stability of the speech controlled system than an incorrect codebook selection for a single utterance.

[Figure 5.3: the standard codebook determines the subset $\phi_t^0$ of the $N_b$ best Gaussian densities; all speaker specific codebooks are evaluated on this subset, and the soft quantization $q_t^{i_{\mathrm{fast}}}$ of the codebook selected by $i_t^{\mathrm{fast}} = \arg\max_{i_t} \{ p(i_t \mid x_{1:t}) \}$ is forwarded.]

Fig. 5.3 Implementation of the codebook selection for speaker specific speech recognition. Appropriate codebooks are selected and the corresponding soft quantization $q_t^{i_{\mathrm{fast}}}$ is forwarded to speech decoding. Figure is taken from [Herbig et al., 2010d].
Subsequently, a strategy for codebook selection is preferred. It is oriented at the match between codebooks and the observed feature vectors $x_t$. Since a codebook represents the speaker's pronunciation characteristics, this decision is expected to be correlated with the speaker identification result.
Class et al. [2003] describe a technique for automatic speaker change detec-
tion which is based on the evaluation of several speaker specific codebooks.
This approach is characterized by low computational complexity. It is ex-
tended in this book by speaker specific speech recognition and simultaneous
speaker tracking.
Fig. 5.3 displays the setup of parallel codebooks comprising one speaker independent and several speaker specific codebooks. The standard codebook with index $i = 0$ is used to compute the match $q_t^0$ between this codebook and the feature vector $x_t$. To avoid latencies caused by speech decoding, only codebooks are investigated and the underlying Markov models are not used. Therefore, the statistical models are reduced in this context to GMMs. Even in the case of an SCHMM the weighting factors $w_k^{s_t}$ are state dependent. The state alignment is obtained from the acoustic models during speech decoding which requires the result of the soft quantization $q_t^i$. The codebooks are further reduced to GMMs with uniform weighting factors
$$
w_k = \frac{1}{N} \tag{5.5}
$$
for reasons of simplicity [Herbig et al., 2010d].
Even though the parallel speech decoding of $N_{\mathrm{Sp}} + 1$ codebooks is avoided, the computational load for the evaluation of $N_{\mathrm{Sp}} + 1$ codebooks on a frame-level can be undesirable for embedded systems. Thus, a pruning technique is applied to the likelihood computation of speaker specific codebooks. Only those Gaussian densities are considered which generate the $N_b$ highest likelihoods $p(x_t \mid k, \Theta_0)$ for the standard codebook. For $N_b = 10$ no eminent differences between the resulting likelihood
$$
p(x_t \mid \Theta_i) = \frac{1}{N} \sum_{k=1}^{N} \mathcal{N}\left( x_t \mid \mu_k^i, \Sigma_k^0 \right) \tag{5.6}
$$
and
$$
p(x_t \mid \Theta_i) \approx \frac{1}{N} \sum_{k \in \phi_t^0} \mathcal{N}\left( x_t \mid \mu_k^i, \Sigma_k^0 \right) \tag{5.7}
$$
have been observed in the experiments which will be discussed later [Herbig et al., 2010d]. Subsequently, the vector
$$
\phi_t^0 = \left( k_{t,1}^0, \ldots, k_{t,N_b}^0 \right)^T \tag{5.8}
$$
represents the indices of the associated $N_b$ Gaussian densities relevant for the likelihood computation of the standard codebook. Speaker specific codebooks emerge from the standard codebook and speaker adaptation always employs the feature vector assignment of the standard codebook as described in Sect. 4.2. Thus, the subset $\phi_t^0$ can be expected to be a reasonable selection for all codebooks. Each codebook generates a separate vector
$$
q_t^i = \left( p(x_t \mid k = k_{t,1}^0, \Theta_i), \ldots, p(x_t \mid k = k_{t,N_b}^0, \Theta_i) \right)^T \tag{5.9}
$$
by evaluating this subset $\phi_t^0$. The complexity of the likelihood computation is reduced because only $N + N_{\mathrm{Sp}} \cdot N_b$ instead of $N + N_{\mathrm{Sp}} \cdot N$ Gaussian densities have to be considered. Finally, each codebook generates a vector $q_t^i$ and a scalar likelihood $p(x_t \mid \Theta_i)$ which determines the match between codebook and observation.
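A sketch of this pruned evaluation, assuming diagonal covariance matrices, NumPy and illustrative names:

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """log N(x | mu_k, diag(var_k)) for all K Gaussians at once."""
    d = means.shape[1]
    quad = ((x - means) ** 2 / variances).sum(axis=1)
    return -0.5 * (quad + np.log(variances).sum(axis=1) + d * np.log(2 * np.pi))

def evaluate_codebooks(x_t, mu_0, var_0, speaker_means, n_b=10):
    """Pruned codebook evaluation, cf. (5.6)-(5.9).

    mu_0, var_0   -- standard codebook means/variances, shape (N, d)
    speaker_means -- dict: speaker index -> adapted means (N, d); the
                     covariances are shared since only means are adapted
    Returns phi_t (5.8), the vectors q_t^i (5.9) and log p(x_t | Theta_i).
    """
    n = mu_0.shape[0]
    log_q0 = log_gauss_diag(x_t, mu_0, var_0)
    phi_t = np.argsort(log_q0)[-n_b:]                   # N_b best densities
    q = {0: log_q0[phi_t]}
    scores = {0: np.logaddexp.reduce(log_q0[phi_t]) - np.log(n)}  # approx. (5.7)
    for i, mu_i in speaker_means.items():
        q[i] = log_gauss_diag(x_t, mu_i[phi_t], var_0[phi_t])     # (5.9)
        scores[i] = np.logaddexp.reduce(q[i]) - np.log(n)         # approx. (5.7)
    return phi_t, q, scores
```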
A decision logic can follow several strategies to select an appropriate speaker specific codebook. Only the corresponding soft quantization $q_t^{i_{\mathrm{fast}}}$ is further processed by the speech decoder. Subsequently, three methods are described:
One possibility may be a decision on the actual likelihood value $p(x_t \mid \Theta_i)$. The ML criterion can be applied so that at each time instance the codebook with the highest likelihood value is selected:
$$
i_t^{\mathrm{fast}} = \operatorname*{arg\,max}_{i} \left\{ p(x_t \mid \Theta_i) \right\}. \tag{5.10}
$$
The optimization problem for speech recognition can then be approximated by
$$
W_{1:N_W}^{\mathrm{MAP}} = \operatorname*{arg\,max}_{W_{1:N_W}} \left\{ p(W_{1:N_W} \mid x_{1:T}, i_{1:T}^{\mathrm{fast}}) \right\} \tag{5.11}
$$
where $i_{1:T}^{\mathrm{fast}}$ describes the codebook selection given by the corresponding $q_{1:T}$.
This criterion has the drawback that the codebook selection may change frame by frame. In the experiments carried out, an unrealistically high number of speaker turns could be observed even within short utterances. This implementation seems to be inappropriate since the speech decoder is applied to a series of $q_t$ originating from different codebooks. The speech recognizer has to be able to react quickly to speaker changes, e.g. from male to female, especially when no re-processing is allowed. However, an unrealistic number of speaker turns within a single utterance has to be suppressed.
A more advanced strategy for a suitable decision criterion might be to apply an exponential weighting window. Again, the iid assumption is used for reasons of simplicity. The sum of the log-likelihood values enables standard approaches for speaker identification as described in Sect. 5.2.2. A local speaker identification may be implemented by
$$
l_t^i = l_{t-1}^i \cdot \alpha + \log p(x_t \mid \Theta_i) \cdot (1 - \alpha), \qquad 1 < t \leq T, \quad 0 < \alpha < 1 \tag{5.12}
$$
$$
l_1^i = \log p(x_1 \mid \Theta_i) \tag{5.13}
$$
where $l_t^i$ denotes the local log-likelihood of speaker $i$. Codebook selection may then be realized by
$$
i_t^{\mathrm{fast}} = \operatorname*{arg\,max}_{i} \left\{ l_t^i \right\} \tag{5.14}
$$
and speech recognition can be described again by (5.11).
Due to the knowledge about the temporal context, an unrealistic number
of speaker turns characterized by many switches between several codebooks
can be suppressed. However, the identification accuracy is expected to be
limited by the exponential weighting compared to (2.46).
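A sketch of this smoothed frame-level selection, cf. (5.12)-(5.14); NumPy is assumed and α is an illustrative smoothing constant:

```python
import numpy as np

def select_codebook_smoothed(frame_log_likes, alpha=0.9):
    """Exponentially weighted codebook selection, cf. (5.12)-(5.14).

    frame_log_likes -- array (T, N_sp + 1) with log p(x_t | Theta_i)
    Returns the selected codebook index per frame."""
    l = frame_log_likes[0].copy()                            # (5.13)
    selected = [int(np.argmax(l))]
    for t in range(1, frame_log_likes.shape[0]):
        l = alpha * l + (1.0 - alpha) * frame_log_likes[t]   # (5.12)
        selected.append(int(np.argmax(l)))                   # (5.14)
    return selected
```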
Alternatively, a decision logic based on Bayes' theorem may be applied in analogy to the forward algorithm introduced in Sect. 2.5.2. The posterior probability $p(i_t \mid x_{1:t})$ is estimated for each speaker $i$ given the entire history of observations $x_{1:t}$ of the current utterance. The posterior probability $p(i_t \mid x_{1:t})$ decomposes into the likelihood $p(x_t \mid i_t)$ and the predicted posterior probability $p(i_t \mid x_{1:t-1})$:
$$
p(i_t \mid x_{1:t}) = \frac{p(x_t \mid i_t, x_{1:t-1}) \cdot p(i_t \mid x_{1:t-1})}{p(x_t \mid x_{1:t-1})}, \qquad t > 1 \tag{5.15}
$$
$$
\propto p(x_t \mid i_t) \cdot p(i_t \mid x_{1:t-1}) \tag{5.16}
$$
$$
p(i_1 \mid x_1) \propto p(x_1 \mid i_1) \cdot p(i_1). \tag{5.17}
$$
The likelihood $p(x_t \mid i_t, x_{1:t-1})$ represents new information about the speaker identity given an observed feature vector $x_t$ and is regarded independently from the feature vector's time history $x_{1:t-1}$. The initial distribution $p(i_1)$ allows for a boost of a particular speaker identity. For convenience, the parameter set $\Theta$ is omitted here. The predicted posterior distribution may be given by
$$
p(i_t \mid x_{1:t-1}) = \sum_{i_{t-1}=1}^{N_{\mathrm{Sp}}} p(i_t \mid i_{t-1}) \cdot p(i_{t-1} \mid x_{1:t-1}), \qquad t > 1. \tag{5.18}
$$
This approach yields the advantage of normalized posterior probabilities instead of likelihoods. The codebook with maximal posterior probability is selected according to the MAP criterion:
$$
i_t^{\mathrm{fast}} = \operatorname*{arg\,max}_{i} \left\{ p(i_t \mid x_{1:t}) \right\}. \tag{5.19}
$$
The transcription can be determined in a second step by applying the MAP criterion
$$
W_{1:N_W}^{\mathrm{MAP}} = \operatorname*{arg\,max}_{W_{1:N_W}} \left\{ p(W_{1:N_W} \mid x_{1:T}, i_{1:T}^{\mathrm{fast}}) \right\} \tag{5.20}
$$
which can be considered as an approximation of (5.2).
The standard codebook is always evaluated in parallel to all speaker spe-
cific codebooks to ensure that the speech recognizer’s performance does not
decrease if none of the speaker profiles is appropriate, e.g. during the initial-
ization of a new codebook.
Since Markov models are not considered here, a smoothing of the posterior
may be employed to prevent an instantaneous drop or rise, e.g. caused by
background noises during short speech pauses.
This Bayesian framework yields the advantage that prior knowledge about a particular speaker, e.g. from the last utterance, can be easily integrated to stabilize codebook selection at the beginning of an utterance. A smoothed likelihood suffers from unnormalized scores which makes it more complicated to determine an initial likelihood value for a preferred or likely speaker. Here the prior probability $p(i_1)$ can be directly initialized. For example, the discrete decision $i_0$ from the last utterance and the prior speaker change probability $p_{\mathrm{Ch}}$ may be used for initialization, e.g. given by
$$
p(i_1) \propto \delta_K(i_1, i_0) \cdot (1 - p_{\mathrm{Ch}}) + (1 - \delta_K(i_1, i_0)) \cdot p_{\mathrm{Ch}}. \tag{5.21}
$$
In summary, this representation is similar to a CDHMM. The states of the
Markov model used for codebook selection are given by the identities of the
enrolled speakers. The codebooks represent the pronunciation characteristics
of particular speakers. Subsequently, only this statistical model is used for
codebook selection.
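The complete forward-type recursion (5.15)-(5.21) can be sketched as follows. NumPy is assumed, and the uniform split of $p_{\mathrm{Ch}}$ over the competing speakers in the transition matrix is an assumption of this sketch:

```python
import numpy as np

def forward_codebook_selection(like_seq, i_prev=None, p_ch=0.1):
    """Frame-level MAP codebook selection, cf. (5.15)-(5.21).

    like_seq -- array (T, N_sp + 1) of likelihoods p(x_t | i)
    i_prev   -- discrete decision of the last utterance, used to
                initialize the prior p(i_1) as in (5.21)
    p_ch     -- prior speaker change probability (illustrative value)
    """
    n = like_seq.shape[1]
    # transition matrix p(i_t | i_{t-1}): stay with 1 - p_ch, else switch
    A = np.full((n, n), p_ch / (n - 1))
    np.fill_diagonal(A, 1.0 - p_ch)
    prior = np.full(n, 1.0 / n) if i_prev is None else \
        np.where(np.arange(n) == i_prev, 1.0 - p_ch, p_ch)  # (5.21)
    prior /= prior.sum()
    post = like_seq[0] * prior                    # (5.17)
    post /= post.sum()
    selected = [int(np.argmax(post))]
    for t in range(1, like_seq.shape[0]):
        predicted = A.T @ post                    # (5.18)
        post = like_seq[t] * predicted            # (5.16)
        post /= post.sum()                        # normalization of (5.15)
        selected.append(int(np.argmax(post)))     # (5.19)
    return selected
```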
5.2.2 Speaker Identification
In the preceding section codebook selection was implemented by likelihood
computation for each codebook and applying Bayes’ theorem. However, the
techniques discussed in Sect. 5.2.1 should be viewed as a local speaker identi-
fication. Since speaker adaptation is calculated after speech decoding, a more
robust speaker identification can be used on an utterance-level. Error prop-
agation due to maladaptation seems to be more severe than an improper
codebook selection on parts of an utterance.
The experiments discussed later clearly show the capability of speaker spe-
cific codebooks to distinguish different speakers. Furthermore, codebooks are
expected to provide an appropriate statistical model of speech and speaker
characteristics since HMM training includes more detailed knowledge of
speech compared to GMMs. Thus, the key idea is to use the codebooks to
decode speech utterances and to identify speakers in a single step as shown
in Fig. 5.4.
Each speaker specific codebook can be evaluated in the same way as common GMMs for speaker identification. Only the simplification of equal weighting factors $w_k$ is imposed for the same reasons as in Sect. 5.2.1. The likelihood computation can be further simplified by the iid assumption. The following notation is used.
The log-likelihood $L_u^i$ denotes the accumulated log-likelihood of speaker $i$ and the current utterance with index $u$. Each utterance is represented by its feature vector sequence $x_{1:T_u}$. The log-likelihood is normalized by the length $T_u$ of the recorded utterance which results in an averaged log-likelihood per frame
$$
L_u^i = \frac{1}{T_u} \log p(x_{1:T_u} \mid \Theta_i) = \frac{1}{T_u} \sum_{t=1}^{T_u} \log \left( \sum_{k \in \phi_t^0} \mathcal{N}\left( x_t \mid \mu_k^i, \Sigma_k^0 \right) \right) \tag{5.22}
$$
in analogy to Markov and Nakagawa [1996]. The weighting factors $w_k = \frac{1}{N}$ are omitted, for convenience.
However, this implementation of speaker identification suffers from speech
pauses and garbage words which do not contain speaker specific information.
The speech recognition result may be employed to obtain a more precise
speech segmentation. Since speaker identification is enforced after speech
decoding, the scalar likelihood scores $p(x_t \mid \Theta_i)$ can be temporarily buffered. The accumulated log-likelihood
$$
L_u^i = \frac{1}{\#\Phi} \sum_{t \in \Phi} \log \left( \sum_{k \in \phi_t^0} \mathcal{N}\left( x_t \mid \mu_k^i, \Sigma_k^0 \right) \right) \tag{5.23}
$$
is calculated for each speaker as soon as a precise segmentation is available. The set $\Phi$ contains all time instances which are considered as speech. $\#\Phi$
denotes the corresponding number of feature vectors containing purely speech. The modified log-likelihood ensures a relatively robust estimate
$$
i_u^{\mathrm{slow}} = \operatorname*{arg\,max}_{i} \left\{ L_u^i \right\}. \tag{5.24}
$$
Conventional speaker identification based on GMMs without additional speech recognition cannot prevent training, adaptation or test data from containing garbage words or speech pauses.

[Figure 5.4: joint evaluation. The standard codebook determines $\phi_t^0$; all codebooks are evaluated on this subset, yielding the frame-level selection $i^{\mathrm{fast}} = \arg\max_i (p(i \mid x_t))$ (part I, as in Fig. 5.3); the buffered scores are accumulated over the utterance and combined with the speech pause detection to yield $i^{\mathrm{slow}} = \arg\max_i (p(x_{1:T_u} \mid \Theta_i))$ (part II).]

Fig. 5.4 Joint speaker identification and speech recognition. Each feature vector is evaluated by the standard codebook which determines the subset $\phi_t^0$ comprising the $N_b$ relevant Gaussian densities. Each speaker specific codebook is evaluated on the corresponding Gaussian densities given by $\phi_t^0$. $p(x_t \mid \Theta_i)$ and $q_t^i$ are employed for speaker identification and speech recognition, respectively. Part I and II denote the codebook selection for speaker specific speech recognition as depicted in Fig. 5.3 and speaker identification on an utterance-level, respectively.
To obtain a strictly unsupervised speech controlled system, the detection of
new users is essential. It can be easily implemented by the following threshold
decision
L
i
u
−L
0
u
< θ
th
, ∀i, (5.25)
similar to Fortuna et al. [2005]. For example, the best in-set speaker is determined according to (5.24). The corresponding log-likelihood ratio is then tested against the threshold to detect an out-of-set speaker.
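To make the decision logic concrete, the following minimal sketch (Python with numpy; the buffering interface and all names are our own assumptions, not the book's implementation) accumulates the buffered frame scores over the speech frames and applies the threshold test:

```python
import numpy as np

def identify_speaker(log_scores, speech_frames, theta_th):
    """Utterance-level decision, cf. Eqs. (5.23)-(5.25).

    log_scores:    (T x (N_Sp + 1)) buffered log p(x_t | Theta_i);
                   column 0 holds the standard codebook.
    speech_frames: boolean mask of length T (the set Phi) delivered
                   by the refined speech segmentation.
    theta_th:      out-of-set threshold (an assumed tuning constant).
    """
    phi = log_scores[speech_frames]          # keep speech frames only
    L = phi.mean(axis=0)                     # average per frame, Eq. (5.23)
    i_slow = int(np.argmax(L[1:])) + 1       # best enrolled speaker, Eq. (5.24)
    if L[i_slow] - L[0] < theta_th:          # log-likelihood ratio test, Eq. (5.25)
        return None                          # hypothesize an unknown speaker
    return i_slow
```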
5.2.3 System Architecture
Now all modules are available to construct a first complete system which is
able to identify the speaker and to decode an utterance in one step. Speaker
specific codebooks are initialized and continuously adapted as introduced in
the preceding chapter. In Fig. 5.5 the block diagram of the system is depicted.
It is assumed that users have to press a push-to-talk button before start-
ing to enter voice commands. Thus, speech segmentation is neglected since
at least a rough end-pointing is given. Speech pauses are identified by a VAD
based on energy and fundamental frequency. States for silence or speech
pauses are included in the Markov models of the speech recognizer to en-
able a refined speech segmentation. An additional speaker change detection
is omitted.
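A toy detector of this kind might look as follows (Python with numpy); the energy threshold, the autocorrelation-based voicing criterion and the requirement that frames are longer than the largest pitch lag are illustrative assumptions, not the VAD used in the book:

```python
import numpy as np

def simple_vad(frames, sample_rate, energy_floor_db=-40.0,
               f0_range=(80.0, 400.0), voicing_thresh=0.3):
    """Flag frames as speech if they carry enough energy and show
    periodicity in the admissible F0 range (illustrative thresholds)."""
    flags = []
    for frame in frames:
        energy_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
        # normalized autocorrelation as a crude voicing measure
        ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        ac = ac / (ac[0] + 1e-12)
        lo = int(sample_rate / f0_range[1])   # lag of the highest F0
        hi = int(sample_rate / f0_range[0])   # lag of the lowest F0
        voiced = ac[lo:hi].max() > voicing_thresh
        flags.append(energy_db > energy_floor_db and voiced)
    return np.array(flags)
```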
Automated speaker identification and speech recognition are now briefly
summarized:
• Front-end. Standard speech enhancement techniques are applied to the
speech signal to limit the influence of the acoustic environment. Feature
vectors are extracted from the enhanced speech signal to be used for speech
recognition and speaker identification. The computational complexity of
two separate algorithms is avoided. The speech recognizer dominates the
choice of feature vectors because speaker identification is only intended
to support speech recognition as an optional component of the complete
system. The basic architecture of the speech recognizer is preserved. For
each speaker the long-term energy and mean normalization is continu-
ously adapted. Since speaker changes are assumed to be rare, users can
be expected to speak several utterances. Therefore, only the simple solu-
tion of one active feature extraction is considered. When a speaker turn
Fig. 5.5 System architecture for joint speaker identification and speech recogni-
tion. One front-end is employed for speaker specific feature extraction. Speaker
specific codebooks (I) are used to decode the spoken phrase and to estimate the
speaker identity (II) in a single step. Both results are used for speaker adaptation
to enhance future speaker identification and speech recognition. Figure is taken
from [Herbig et al., 2010e].
occurs, the front-end discards the mean and energy modifications origi-
nating from the last utterance. The parameter set which belongs to the
identified speaker is used to process the next utterance.
• Speech recognition. Appropriate speaker specific codebooks are selected
by a local speaker identification to achieve an optimal decoding of the
recorded utterance. The speech recognizer can react quickly to speaker
changes. However, codebook selection is expected to be less confident com-
pared to speaker identification on an utterance-level. Speech recognition
delivers a transcription of the spoken phrases on an utterance-level. The
state alignment of the current utterance can be used in the subsequent
speaker adaptation to optimize the corresponding codebook.
• Speaker identification. The likelihood values p(x_t | Θ_i), i = 0, . . . , N_{Sp}, of each time instance are buffered and the speaker identity is estimated on an utterance-level. The speaker identity is used for codebook adaptation and for speaker specific feature extraction.
• Speaker adaptation. The codebooks of the speech recognizer are initial-
ized and adjusted based on the speaker identity and a transcription of the
spoken phrase.
For each recorded utterance the procedure above is applied, as sketched below. This results in enhanced recognition rates for both speech recognition and speaker identification.
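The per-utterance work flow can be condensed into the following pseudo-interface (Python; `system` and its methods are purely illustrative placeholders, not the actual implementation):

```python
def process_utterance(signal, system):
    """One pass of the work flow summarized above."""
    x = system.front_end.extract(signal)            # speaker specific features
    result = system.recognizer.decode(x)            # codebook selection + decoding
    speaker = system.identify(result.frame_scores,  # utterance-level decision
                              result.speech_frames)
    system.adapt(speaker, x, result.alignment)      # codebook adaptation
    system.front_end.load_profile(speaker)          # normalization for next turn
    return result.transcription, speaker
```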
If the computational load is not a critical issue, a set of parallel speaker
specific feature extractions may be realized to provide an optimal feature
normalization for each codebook. Alternatively, the complete utterance may
be re-processed when a speaker change is detected. However, this intro-
duces an additional latency and overhead. The performance decrease which
is caused by only one active feature normalization will be evaluated in the
experiments.
5.3 Reference Implementation
In the preceding sections a unified approach for speaker identification and
speech recognition was introduced. Alternatively, standard techniques for
speaker identification and speaker adaptation can be employed. In Sect. 2.4.2
a standard technique for speaker identification based on GMMs purely opti-
mized to capture speaker characteristics was described. In Sect. 2.6 standard
methods for speaker adaptation were introduced to initialize and continuously
adjust speaker specific GMMs.
This speaker identification technique is used here without further modi-
fications as a reference for the unified approach discussed. Combined with
the speaker specific speech recognizer of Sect. 5.2.1 a reference implementa-
tion can be easily obtained. It is intended to deliver insight into the speech
recognition and speaker identification rates which can be achieved by such
an approach.
First, the specifications and the implementation of the speaker identifi-
cation are given. Then the system architecture of the alternative system is
described. The experiments of the preceding chapter will be discussed with
respect to this reference.
5.3.1 Speaker Identification
Speaker identification is subsequently realized by common GMMs purely
representing speaker characteristics. Fig. 5.6 shows the corresponding block
diagram and has to be viewed as a detail of Fig. 5.7 specifying speaker iden-
tification and front-end.
A speaker independent UBM is used as template for speaker specific
GMMs. Several UBMs comprising 32, 64, 128 or 256 Gaussian distributions
with diagonal covariance matrices have been examined in the experiments.
The feature vectors comprise 11 mean normalized MFCCs and delta features.
The 0th cepstral coefficient is replaced by a normalized logarithmic energy
estimate.
The UBM was trained by the EM algorithm. About 3.5 h of speech data originating from 41 female and 36 male speakers of the USKCP¹ database was incorporated into the UBM training.

¹ The USKCP is an internal speech database for in-car applications. The USKCP was collected by TEMIC Speech Dialog Systems, Ulm, Germany.
Fig. 5.6 Block diagram of speaker identification exemplified for 3 speakers. Each
speaker is represented by one GMM. The accumulated log-likelihood is calculated
based on the iid assumption. The speaker model with the highest likelihood is
selected on an utterance-level. The speaker identity is used for speaker adapta-
tion. Additionally, speaker specific feature normalization is controlled by speaker
identification.
For each speaker about 100 command and control utterances such as navigation commands, spelling and digit loops
are available. The language is US-English. The training was performed only
once and the resulting UBM has been used without any further modifications
in the experiments.
When a new speaker is enrolled, a new GMM is initialized by MAP adap-
tation according to (2.85) based on the first two utterances. Each speaker
model is continuously adapted as soon as an estimate of the speaker identity
is available.
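For illustration, a standard Reynolds-style MAP update of the mean vectors with relevance factor η is sketched below (Python with numpy/scipy); that this corresponds to Eq. (2.85) in every detail is an assumption:

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt_means(ubm_weights, ubm_means, ubm_covs, X, eta):
    """MAP adaptation of GMM mean vectors with relevance factor eta.

    X : (T x d) feature vectors of the adaptation data.
    Returns the adapted (K x d) mean matrix; weights and covariances
    are left untouched here.
    """
    # posterior probability of each Gaussian for each frame
    post = np.array([w * multivariate_normal.pdf(X, mean=m, cov=c)
                     for w, m, c in zip(ubm_weights, ubm_means, ubm_covs)]).T
    post /= post.sum(axis=1, keepdims=True)
    n_k = post.sum(axis=0)                              # soft counts
    ml_means = (post.T @ X) / (n_k[:, None] + 1e-12)    # ML estimates
    alpha = (n_k / (n_k + eta))[:, None]                # data-dependent weight
    return alpha * ml_means + (1.0 - alpha) * ubm_means
```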
The log-likelihood log p(x_{1:T_u} | Θ_i) of each utterance x_{1:T_u} is iteratively computed for all speakers i = 1, . . . , N_{Sp} in parallel to speech recognition. Θ_i denotes the speaker specific parameter set. The iid assumption is applied to simplify likelihood computation by accumulating the log-likelihood values of each time instance as described by (2.46). The speaker with the highest likelihood score is identified as the current speaker

i_u^{slow} = \arg\max_i \{ \log p(x_{1:T_u} | Θ_i) \}.    (5.26)
To guarantee that an appropriate codebook is always employed for speech decoding, codebook selection is realized as discussed before. However, speaker adaptation is based on the result of this utterance-level speaker identification.
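A minimal sketch of the utterance-level GMM scoring of Eq. (5.26) is given below (Python with numpy/scipy; the data structures are assumed):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_identify(X, speaker_gmms):
    """Accumulated iid log-likelihood per speaker GMM, cf. Eq. (5.26).

    speaker_gmms: list of (weights, means, covs), one entry per speaker.
    """
    scores = []
    for w, means, covs in speaker_gmms:
        frame_ll = logsumexp(
            [np.log(wk) + multivariate_normal.logpdf(X, mean=mk, cov=ck)
             for wk, mk, ck in zip(w, means, covs)], axis=0)
        scores.append(frame_ll.sum())       # iid accumulation over the utterance
    return int(np.argmax(scores)) + 1       # speaker indices 1 ... N_Sp
```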
5.3.2 System Architecture
The reference system is realized by a parallel computation of speaker iden-
tification with GMMs on an utterance-level and the speaker specific speech
Fig. 5.7 System architecture for parallel speaker identification and speaker specific
speech recognition. Codebook selection is implemented as discussed before. Speaker
identification is realized by an additional GMM for each speaker. Figure is taken
from [Herbig et al., 2010b].
recognition previously described. The setup is depicted in Fig. 5.7. The work
flow of the parallel processing can be summarized as follows [Herbig et al.,
2010b]:
• Front-end. The recorded speech signal is preprocessed to reduce back-
ground noises. Feature vectors are extracted as discussed before.
• Speech recognition. Appropriate speaker specific codebooks are selected
for the decoding of the recorded utterance as presented in Sect. 5.2.
• Speaker identification. One GMM is trained for each speaker to repre-
sent speaker characteristics. A standard speaker identification technique is
applied to estimate the speaker’s identity on an utterance-level. The code-
books of the speech recognizer and the corresponding GMMs are adapted
according to this estimate.
• Speaker adaptation. GMM models and the corresponding codebooks
are continuously updated given the guess of the speaker identity. Code-
book adaptation is implemented by the speaker adaptation scheme de-
scribed in Sect. 4.2. GMMs are adjusted by MAP adaptation introduced
in Sect. 2.6.3. Adaptation accuracy is supported here by the moderate
complexity of the applied GMMs.
5.4 Evaluation
In the following both realizations for speaker specific speech recognition com-
bined with speaker identification are discussed. First, the joint approach and
98 5 Unsupervised Speech Controlled System with Long-Term Adaptation
then the reference implementation are considered for closed-set and open-set
scenarios.
5.4.1 Evaluation of Joint Speaker Identification and
Speech Recognition
In the following experiments the joint speaker identification and speech recog-
nition is investigated. First, experiments are discussed which investigate
speaker identification for closed sets. Then the detection of unknown speakers
is examined.
Closed-Set Speaker Identification
First, the speaker identification accuracy of the joint approach is examined
based on a supervised experiment characterized by optimum conditions. Then
a realistic scenario for a limited group of enrolled speakers is discussed. Fi-
nally, the influence of feature normalization on the performance of the system
is considered.
Speaker Identification Accuracy
In a first step the performance of speaker identification is determined for
the approach shown in Fig. 5.5. The correct speaker identity is used here for
speaker adaptation to investigate the identification accuracy depending on
the training level of speaker specific codebooks. Even though error propaga-
tion is neglected, one can get a first overview of the identification accuracy
of the joint approach discussed in the preceding sections.
The test setup is structured as described in Fig. 4.1(a). It comprises 4 sets
each with 50 speakers in a randomized order. For each speaker 200 utterances are used in random order. The likelihood scores of all speakers are recorded. The
codebook of the target speaker is modified by speaker adaptation after each
utterance as described in Chapter 4.
For evaluation, several permutations are used to randomly build speaker
pools of a predetermined size. The resulting speaker identification rate is
plotted in Fig. 5.8. For λ = 4 the best performance of speaker identifica-
tion is achieved whereas accuracy decreases for higher values of λ. However,
when λ tends towards zero, the identification rate drops significantly since
unreliable ML estimates are used, especially during the initialization. The
EV approach obviously loses ground for larger speaker sets. Since each speaker is
represented by only 10 parameters, modeling accuracy is limited and the
probability of false classifications rises in the long run compared to the com-
bination of EV and ML estimates.
In Fig. 5.9 the result is depicted in more detail. It is represented by four
evaluation bands based on I = [1; 50[, II = [50; 100[, III = [100; 150[ and
IV = [150; 200] utterances. The evolution of speaker specific codebooks ver-
sus their adaptation level is shown. Each curve represents a predetermined
speaker pool size. According to Fig. 5.9 it can be stated that the identifi-
cation accuracy rises with improved speaker adaptation as expected. The
identification rate is a function of the tuning parameter λ, the number of
enrolled speakers N_{Sp} and training. Having analyzed the measured identifi-
cation rates, a small parameter λ seems to be advantageous for an accurate
speaker identification. Even in the last evaluation band a significant difference
between λ = 4, λ = 12 and λ = 20 can be observed. The threshold λ has still
an influence on adaptation accuracy even though 150 utterances have been
accumulated. However, speaker identification is generally affected by a higher
number of enrolled speakers since the probability of confusions increases.
In summary, Fig. 5.9 shows a consistent improvement of the identification
rate for all setups as soon as sufficient adaptation data is accessible. The
implementation with λ = 4 clearly outperforms the remaining realizations,
especially for larger speaker sets. However, the enrollment phase which is
characterized by the lowest identification rate is critical for long-term stability
as shown subsequently.
Unsupervised Closed-Set Identification of 5 enrolled Speakers
In the following the closed-set scenario of a self-learning speaker identification
combined with speech recognition and unsupervised speaker adaptation is
investigated [Herbig et al., 2010d]. The subsequent considerations continue
the discussion of the example in Sect. 4.3.3.
[Plot: identification rate versus number of enrolled speakers (2-10).]
Fig. 5.8 Performance of the self-learning speaker identification. Speaker adaptation is supervised so that no maladaptation occurs. Several speaker pools are investigated for different parameters of speaker adaptation - ML (λ ≈ 0), λ = 4, λ = 8, λ = 12, λ = 16, λ = 20 and EV (λ → ∞).
100 5 Unsupervised Speech Controlled System with Long-Term Adaptation
[Plots: identification rate versus adaptation level N_A for (a) λ = 4, (b) λ = 12, (c) λ = 20.]
Fig. 5.9 Performance of the self-learning speaker identification versus adaptation stage. N_A denotes the number of utterances used for speaker adaptation. Several speaker pools are considered - 2, 3, 4, 5, 6, 7, 8, 9 and 10 enrolled speakers listed in top-down order.
Closed-set speaker identification was evaluated on 60 sets. Each set com-
prises 5 enrolled speakers. In Sect. 4.3.3 the speaker identity was given for
speaker adaptation leading to optimally trained speaker profiles of the speech
recognizer.
The goal is now to determine what speech recognition and identification
results can be achieved if this knowledge about the true speaker identity (ID)
is not given. Only the first two utterances of a new speaker are indicated and
then the current speaker has to be identified in a completely unsupervised
manner. The experiment described in Sect. 4.3.3 has been repeated with unsu-
pervised speaker identification. The resulting speech recognition and speaker
identification rates are summarized in Table 5.1. In addition, the percentage
of the utterances which are rejected by the speech recognizer is given.
The results show significant improvements of the WA with respect to
the baseline and short-term adaptation. The two special cases ML (λ ≈ 0)
and EV (λ → ∞) clearly fall behind the combination of both adaptation
Table 5.1 Comparison of the different adaptation techniques for self-learning
speaker identification. Speaker pools with 5 enrolled speakers are considered.
Table is taken from [Herbig et al., 2010b].
Speaker adaptation Rejected [%] WA [%] Speaker ID [%]
Baseline - 85.23 -
Short-term adaptation - 86.13 -
EV-MAP adaptation
ML (λ ≈ 0) 2.10 86.89 81.54
λ = 4 2.09 88.10 94.64
λ = 8 2.11 88.17 93.49
λ = 12 2.10 88.16 92.42
λ = 16 2.09 88.18 92.26
λ = 20 2.06 88.20 91.68
EV (λ →∞) 2.11 87.51 84.71
MAP adaptation
η = 4 2.02 87.47 87.43
Typical errors
min ±0.23 ±0.16
max ±0.25 ±0.28
Table 5.2 Comparison of different adaptation techniques for self-learning
speaker identification. Speaker pools with 5 enrolled speakers are considered.
Speaker identification and speech recognition results are given for several evaluation
bands - I = [1; 50[, II = [50; 100[, III = [100; 250] utterances.
Speaker adaptation        I              II             III
                          WA [%] ID [%]  WA [%] ID [%]  WA [%] ID [%]
Baseline                  84.51  -       86.84  -       85.02  -
Short-term adaptation     85.89  -       87.22  -       85.87  -
EV-MAP adaptation
ML (λ ≈ 0) 85.49 78.79 87.45 79.51 87.35 83.13
λ = 4 87.42 92.61 89.18 93.93 88.04 95.54
λ = 8 87.46 90.83 89.33 92.44 88.11 94.71
λ = 12 87.51 89.94 89.33 91.69 88.07 93.48
λ = 16 87.56 89.77 89.31 91.40 88.09 93.37
λ = 20 87.57 89.38 89.22 90.99 88.15 92.67
EV (λ →∞) 87.38 85.08 88.56 84.75 87.21 84.57
Typical errors
min ±0.53 ±0.43 ±0.49 ±0.38 ±0.30 ±0.19
max ±0.58 ±0.67 ±0.54 ±0.65 ±0.33 ±0.35
[Bar plots:
(a) Results for speech recognition: WA [%] for BL, ML, λ = 4, 8, 12, 16, 20, EV and ST.
(b) Results for speaker identification: ID [%] for ML, λ = 4, 8, 12, 16, 20 and EV.]
Fig. 5.10 Comparison of supervised speaker adaptation with predefined speaker
identity (black) and the joint speaker identification and speech recognition (dark
gray) described in this chapter. The speaker independent baseline (BL) and short-
term adaptation (ST) are depicted for comparison. Figure is taken from [Herbig
et al., 2010d].
techniques. Furthermore, no eminent difference in WA can be observed for 4 ≤ λ ≤ 20 [Herbig et al., 2010d].
It becomes obvious that speaker identification can be optimized indepen-
dently from the speech recognizer and seems to reach an optimum of 94.64 %
for λ = 4. This finding also agrees with the considerations concerning the
identification accuracy sketched above. For higher values the identification
rates drop significantly. Obviously, speaker characteristics can be captured
despite limited adaptation data and the risk of error propagation. A compar-
ison of the evaluation bands shows a consistent behavior for all realizations.
In Fig. 5.10 the speech recognition rates achieved by the joint speaker
identification and speech recognition are graphically compared with the su-
pervised experiment in Sect. 4.3. Additionally, the speaker identification rates
are depicted. It can be stated that this first implementation is already close
to the upper bound given by the supervised experiment and is significantly
better than the baseline. Again, no eminent differences in the WA can be
stated but a significant decrease in the speaker identification rate can be
observed for increasing λ.
Potential for Further Improvements
Finally, the influence of speaker specific feature extraction on speaker identifi-
cation and speech recognition is quantified for the scenario discussed before.
The experiment has been repeated with supervised energy and mean nor-
malization. Before an utterance is processed, the correct parameter set is
loaded and continuously adapted. Speaker identification, speech recognition
and speaker adaptation remain unsupervised.
Table 5.3 Results for joint speaker identification and speech recognition with su-
pervised speaker specific feature extraction. The feature vectors are nor-
malized based on the parameter set of the target speaker. The parameters are
continuously adapted.
Speaker adaptation Rejected [%] WA [%] Speaker ID [%]
Baseline - 85.23 -
Short-term adaptation - 86.13 -
EV-MAP adaptation
λ = 4 2.05 88.47 97.33
λ = 8 2.11 88.55 96.66
λ = 12 2.06 88.60 96.22
λ = 16 2.09 88.56 95.88
λ = 20 2.08 88.61 95.75
Typical errors
min ±0.23 ±0.12
max ±0.25 ±0.15
When Table 5.3 is compared with Table 5.1, it becomes evident that
feature normalization influences speaker identification significantly. Further-
more, WA can be increased when the feature vectors are accurately normal-
ized. Even though this is an optimal scenario concerning feature extraction,
a clear improvement for speaker identification can be expected when feature
extraction is operated in parallel.
Open-Set Speaker Identification
In the preceding section speaker identification was evaluated for a closed-set
scenario. To achieve a speech controlled system which can be operated in
a completely unsupervised manner, unknown speakers have to be automati-
cally detected. This enables the initialization of new codebooks which can be
adapted to new users.
In Fig. 5.11 an open-set scenario is considered [Herbig et al., 2010a]. Again,
speaker specific codebooks are evaluated as discussed before. Unknown speak-
ers are detected by the threshold decision in (5.25). If this threshold is not
exceeded by any speaker specific codebook, an unknown speaker is hypoth-
esized. Speaker adaptation is supervised so that maladaptation with respect
to the speaker identity is neglected.
The performance of the in-set / out-of-set classification is evaluated by the so-called Receiver Operating Characteristic (ROC) described in Sect. A.3. The detection rate of unknown speakers is plotted versus the false alarm rate.
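A minimal sketch of this evaluation (Python with numpy; assuming per-utterance log-likelihood ratios have been collected separately for in-set and out-of-set trials) could look as follows:

```python
import numpy as np

def roc_points(in_set_ratios, out_of_set_ratios, thresholds):
    """Sweep the threshold of Eq. (5.25) and collect ROC points.

    in_set_ratios:     best per-utterance log-likelihood ratios for
                       utterances of enrolled speakers.
    out_of_set_ratios: the same scores for utterances of unknown speakers.
    """
    points = []
    for th in thresholds:
        detection = np.mean(out_of_set_ratios < th)  # unknown flagged as unknown
        false_alarm = np.mean(in_set_ratios < th)    # enrolled flagged as unknown
        points.append((false_alarm, detection))
    return points
```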
[ROC plots: detection rate versus false alarm rate for (a) 2, (b) 5, (c) 10 enrolled speakers.]
Fig. 5.11 Detection of unknown speakers based on log-likelihood ratios of the speaker specific codebooks and standard codebook. Speaker adaptation (λ = 4) is supervised. Maladaptation with respect to the speaker identity is neglected. Several evaluation bands are shown - N_A ≤ 20, 20 < N_A ≤ 50, 50 < N_A ≤ 100 and 100 < N_A ≤ 200. Confidence intervals are given by a gray shading. Figure is taken from [Herbig et al., 2010a].
The ROC curves in Fig. 5.11 show that open-set speaker identification is
difficult when speaker models are trained on only a few utterances. For N_A ≤ 20 both error rates are unacceptably high even if only two enrolled speakers
are considered. At least 50 utterances are necessary for adaptation to achieve
a false alarm and miss rate less than 10 %. For 5 or 10 enrolled speakers the
detection rate of unknown speakers is even worse.
In summary, unknown speakers can be detected by applying a simple
threshold to the log-likelihood ratios of the speaker specific codebooks and the
standard codebook. However, the detection rates achieved in the experiments
show that it is difficult for such a system to detect new users based on only
one utterance as expected. A global threshold seems to be inappropriate since
the training of each speaker model should be taken into consideration [Herbig
et al., 2010a].
In a practical implementation this problem would be even more compli-
cated when mixed codebooks are used or when several codebooks belong to
one speaker. It is hard to imagine that a reasonable implementation of an
unsupervised speech controlled system can be realized with these detection
rates, especially when 10 speakers are enrolled. Thus, a more robust technique
is required for a truly unsupervised out-of-set detection.
5.4.2 Evaluation of the Reference Implementation
In the following the performance of the reference implementation is examined.
Several settings of the reference implementation are compared for the closed-
set and open-set identification task [Herbig et al., 2010a,b].
Closed-Set Speaker Identification
First, the speaker identification accuracy of the reference implementation
is examined based on a supervised experiment characterized by optimum
conditions. Then a realistic closed-set scenario is considered for a limited
group of enrolled speakers.
Speaker Identification Accuracy
The performance of speaker identification is determined for the approach
shown in Fig. 5.7. The correct speaker identity is used here for speaker adapta-
tion to describe the identification accuracy of speaker specific GMMs. Speaker
identification and GMM training are realized as described in Sect. 5.3. If not
indicated otherwise, mean vector adaptation is performed to enhance speaker
specific GMMs.
In Fig. 5.12 the averaged accuracy is compared for dedicated realizations
in analogy to Fig. 5.8. GMMs containing a high number of Gaussian dis-
tributions show the best performance probably because a more individual
adjustment is possible. A comparison with Fig. 5.8 reveals that the code-
books of the speech recognizer cannot be expected to model speaker charac-
teristics as accurately as GMMs purely optimized for speaker identification.
However, in terms of performance all realizations yield good results. It should
be emphasized that this consideration is based on a supervised adaptation.
Maladaptation due to speaker identification errors is neglected. The further
experiments will show the deficiencies of the reference implementation to
accurately track the current speaker.
Unsupervised Speaker Identification of 5 enrolled Speakers
A realistic closed-set application for self-learning speaker identification com-
bined with speech recognition and unsupervised speaker adaptation is investi-
gated. The subsequent considerations continue the discussion of the example
in Sect. 4.3.3 and Sect. 5.4.1. Closed-set speaker identification is evaluated
on 60 sets. Each set comprises 5 enrolled speakers.
[Plot: identification rate versus number of enrolled speakers (2-10) for the reference implementation.]
Fig. 5.12 Performance of the reference implementation. Speaker identification is implemented by GMMs comprising 32, 64, 128 or 256 Gaussian distributions. MAP adaptation (η = 4) is only applied to mean vectors. For comparison, the codebooks of the speech recognizer are adjusted using λ = 4. For all implementations maladaptation is not considered. Confidence intervals are given by a gray shading.
In Table 5.4 the results of this scenario are presented for several imple-
mentations with respect to the number of Gaussian distributions and values
of parameter η. Both the speaker identification and speech recognition rate
reach an optimum for η = 4 and N = 64 or 128. For higher values of η
this optimum is shifted towards a lower number of Gaussian distributions
as expected. Since the learning rate of the adaptation algorithm is reduced,
only a reduced number of distributions can be efficiently estimated at the
Table 5.4 Realization of parallel speaker identification and speech recogni-
tion. Speaker identification is implemented by several GMMs comprising 32, 64, 128
and 256 Gaussian distributions. MAP adaptation of mean vectors is used.
Codebook adaptation uses λ = 12. Table is taken from [Herbig et al., 2010b].
MAP η = 4 η = 8 η = 12 η = 20
N WA [%] ID [%] WA [%] ID [%] WA [%] ID [%] WA [%] ID [%]
32 88.01 88.64 88.06 88.17 87.98 87.29 87.97 87.50
64 88.13 91.09 88.06 89.64 87.98 87.92 87.92 85.30
128 88.04 91.18 87.94 87.68 87.87 84.97 87.82 80.09
256 87.92 87.96 87.97 85.59 87.90 81.20 87.73 76.48
Typical errors
min ±0.23 ±0.21 ±0.23 ±0.22 ±0.23 ±0.24 ±0.23 ±0.24
max ±0.23 ±0.24 ±0.23 ±0.25 ±0.23 ±0.28 ±0.23 ±0.31
Table 5.5 Detailed investigation of the WA and identification rate on several eval-
uation bands. MAP adaptation of mean vectors is used. Codebook adaptation
uses λ = 12. Several evaluation bands are considered - I = [1; 50[, II = [50; 100[,
III = [100; 250] utterances.
MAP adaptation (η = 4)    I              II             III
                          WA [%] ID [%]  WA [%] ID [%]  WA [%] ID [%]
Baseline                  84.51  -       86.84  -       85.02  -
Short-term adaptation     85.89  -       87.22  -       85.87  -
Number of Gaussian distributions N
32 87.41 86.14 89.17 88.65 87.89 89.46
64 87.49 88.76 89.13 91.13 88.09 91.84
128 87.38 88.73 89.03 90.80 88.02 92.12
256 87.31 85.53 88.86 87.39 87.88 88.95
Typical errors
min ±0.53 ±0.52 ±0.50 ±0.45 ±0.30 ±0.25
max ±0.58 ±0.57 ±0.54 ±0.53 ±0.33 ±0.29
Table 5.6 Realization of parallel speaker identification and speech recogni-
tion. Speaker identification is implemented by several GMMs comprising 32, 64, 128
or 256 Gaussian distributions. MAP adaptation of weights and mean vec-
tors is used. Codebook adaptation uses λ = 12. Table is taken from [Herbig et al.,
2010b].
MAP η = 4 η = 8 η = 12 η = 20
N WA [%] ID [%] WA [%] ID [%] WA [%] ID [%] WA [%] ID [%]
32 87.92 87.24 87.97 88.24 87.97 87.61 88.02 87.04
64 88.11 90.59 88.06 89.99 88.03 88.80 87.93 86.64
128 88.11 91.32 88.03 89.42 88.03 88.10 87.91 84.26
256 88.10 91.62 87.97 88.71 88.02 86.01 87.88 82.88
Typical errors
min ±0.23 ±0.20 ±0.23 ±0.22 ±0.23 ±0.23 ±0.23 ±0.24
max ±0.23 ±0.24 ±0.23 ±0.23 ±0.23 ±0.25 ±0.23 ±0.27
beginning. The performance of the speech recognizer is marginally reduced
with higher η [Herbig et al., 2010b].
Table 5.5 contains the corresponding detailed results of speaker identifica-
tion and speech recognition obtained on different evaluation bands for η = 4.
It can be concluded that 32 Gaussian distributions are not sufficient whereas more than 128 Gaussian distributions seem to be oversized. The highest identification rates of 91.09 % and 91.18 % have been achieved with 64 and 128 distributions, respectively. Comparing Table 5.2 and Table 5.5 similar results for speech recognition can be observed.
[Bar plots of WA [%]:
(a) Speech recognition rates of the reference implementation. MAP adaptation (η = 4) of mean vectors and weights (black) and only mean vectors (dark gray) are depicted.
(b) Speech recognition rates of the unified approach. Results are shown for speaker adaptation with predefined speaker identity (black) as well as for joint speaker identification and speech recognition (dark gray). The speaker independent baseline (BL) and short-term adaptation (ST) are shown for comparison.]
Fig. 5.13 Comparison of the reference implementation (left) and the joint
speaker identification and speech recognition (right) with respect to speech
recognition. Figure is taken from [Herbig et al., 2010b].
[Bar plots of ID [%]:
(a) Speaker identification rates of the reference implementation. MAP adaptation (η = 4) of mean vectors and weights (black) and only mean vectors (dark gray) are depicted.
(b) Speaker identification rates of the joint speaker identification and speech recognition.]
Fig. 5.14 Comparison of the reference implementation (left) and the joint
speaker identification and speech recognition (right) with respect to speaker
identification. Figure is taken from [Herbig et al., 2010b].
For the next experiment not only mean vectors but also weights are mod-
ified. The results are summarized in Table 5.6. In the preceding experiment
the speaker identification accuracy could be improved for η = 4 by increasing
the number of Gaussian distributions to N = 128. For N = 256 the identifica-
tion rate dropped significantly. Now a steady improvement and an optimum
[ROC plots: detection rate versus false alarm rate for (a) 2, (b) 5, (c) 10 enrolled speakers.]
Fig. 5.15 Detection of unknown speakers based on log-likelihood ratios of the speaker specific GMMs and UBM. Speaker identification is realized by GMMs with 64 Gaussian distributions. η = 4 is used to adapt the mean vectors. Maladaptation is neglected. Several evaluation bands are investigated - N_A ≤ 20, 20 < N_A ≤ 50, 50 < N_A ≤ 100 and 100 < N_A ≤ 200. Confidence intervals are given by a gray shading.
of 91.62 % can be observed for N = 256. However, the identification rate
approaches a limit. For η = 4 doubling the number of Gaussian distributions
from 32 to 64 results in 26 % relative error rate reduction. The relative error
rate reduction which is achieved by increasing the number of Gaussian distri-
butions from 128 to 256 is about 3.5 %. The optimum for speech recognition
is again about 88.1 % WA [Herbig et al., 2010b].
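These figures can be verified from the identification rates of Table 5.6 for η = 4, with the error rate taken as 100 % minus the identification rate:

\frac{(100 − 87.24) − (100 − 90.59)}{100 − 87.24} = \frac{12.76 − 9.41}{12.76} ≈ 26\,\%, \qquad \frac{(100 − 91.32) − (100 − 91.62)}{100 − 91.32} = \frac{8.68 − 8.38}{8.68} ≈ 3.5\,\%.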
Finally, the results of the unified approach characterized by an integrated
speaker identification and the reference implementation are compared in
Fig. 5.13 and Fig. 5.14. Both mean vector and weight adaptation are depicted
for η = 4 representing the best speech recognition and speaker identification
rates obtained by GMMs [Herbig et al., 2010b].
In summary, similar results for speech recognition are achieved but identi-
fication rates are significantly worse compared to the experiments discussed
before. This observation supports again the finding that the speech recog-
nition accuracy is relatively insensitive with respect to moderate error rates
[ROC plots: detection rate versus false alarm rate for (a) 2, (b) 5, (c) 10 enrolled speakers.]
Fig. 5.16 Detection of unknown speakers based on log-likelihood ratios of the speaker specific GMMs and UBM. Speaker identification is realized by GMMs with 256 Gaussian distributions. η = 4 is used for adapting the mean vectors. Maladaptation is neglected. Several evaluation bands are investigated - N_A ≤ 20, 20 < N_A ≤ 50, 50 < N_A ≤ 100 and 100 < N_A ≤ 200. Confidence intervals are given by a gray shading. Figure is taken from [Herbig et al., 2010a].
of speaker identification. Thus, different strategies can be applied to identify
speakers without affecting the performance of the speech recognizer as long
as a robust codebook selection is employed in speech decoding.
Open-Set Speaker Identification
In the following the open-set scenario depicted in Fig. 5.11 is examined for
GMMs purely optimized to identify speakers [Herbig et al., 2010a]. Unknown
speakers are detected by a threshold decision based on log-likelihood ratios of
speaker specific GMMs and UBM similar to (2.49). The log-likelihood scores
are normalized by the length of the utterance. The corresponding ROC curves
are depicted in Fig. 5.15.
The ROC curves show again the limitations of open-set speaker identifica-
tion to detect unknown speakers, especially when speaker models are trained
[ROC plot: detection rate versus false alarm rate.]
Fig. 5.17 Detection of unknown speakers. Speaker specific codebooks (solid line) are compared to GMMs comprising 32, 64, 128 and 256 Gaussian densities for 100 < N_A ≤ 200. MAP adaptation (η = 4) is employed to adjust the mean vectors. Figure is taken from [Herbig et al., 2010a].
on only a few utterances. Compared to Fig. 5.11 the detection rates are sig-
nificantly worse. It becomes obvious that these error rates are unacceptably
high for a direct implementation in an unsupervised complete system.
In Fig. 5.16 the same experiment is repeated with GMMs comprising
256 Gaussian distributions. In contrast to the former implementation, the
performance is clearly improved. However, the joint approach of speaker iden-
tification and speech recognition still yields significantly better results.
To compare the influence of the number of Gaussian densities on the de-
tection accuracy, all reference implementations and a specific realization with
speaker specific codebooks are shown in Fig. 5.17. Only the case of extensively
trained speaker models (100 < N_A ≤ 200) is examined. When the number
of Gaussian distributions is increased from 32 to 64, a clear improvement
can be observed. However, the detection accuracy approaches a limit when
a higher number of Gaussian distributions is used. It becomes obvious that
the accuracy of all reference implementations examined here is inferior to the
codebook based approach.
5.5 Summary and Discussion
Two approaches have been developed to solve the problem of an unsupervised
system comprising self-learning speaker identification and speaker specific
speech recognition.
First, a unified approach for simultaneous speaker identification and speech
recognition was described. Then an alternative system comprising a standard
technique for speaker identification was introduced.
Both systems have several advantages and drawbacks which should be discussed. Neither implementation modifies the basic architecture of the speech recognizer. Speaker identification and speech recognition use an iden-
tical front-end so that a parallel feature extraction for speech recognition and
speaker identification is avoided. Computation on different time scales was
introduced. A trade-off between a fast but less accurate speaker identifica-
tion for speech recognition and a delayed but more confident speaker identi-
fication for speaker adaptation has been developed: Speaker specific speech
recognition is realized by an on-line codebook selection. On an utterance level
the speaker identity is estimated in parallel to speech recognition. Multiple
recognitions are not required. Identifying the current user enables a speech
recognizer to create and continuously adapt speaker specific codebooks. This
enables a higher recognition accuracy in the long run.
A speaker identification rate of 94.64 % and a WA of 88.20 % were achieved by the unified approach for λ = 4 and λ = 20, respectively. The results for the
baseline and the corresponding upper bound were 85.23 % and 88.90 % WA
according to Table 4.1 [Herbig et al., 2010b].
For the reference system several GMMs are required for speaker identifica-
tion in addition to the HMMs of the speech recognizer. Complexity therefore
increases since both models have to be evaluated and adapted. Under perfect
conditions higher speaker identification accuracies could be achieved with
the reference implementation in the experiments carried out. Under realistic
conditions a speaker identification rate of 91.18 % was achieved for 128 Gaus-
sian distributions and η = 4 when only the mean vectors were adapted. The
best speech recognition result of 88.13 % WA was obtained for 64 Gaussian
distributions. By adapting both the mean vectors and weights, the speaker
identification rate could be increased to 91.62 % for 256 Gaussian distribu-
tions and η = 4. The WA remained at the same level [Herbig et al., 2010b].
However, the detection rates of unknown speakers were significantly worse
compared to the unified approach. Especially for an unsupervised system
this out-of-set detection is essential to guarantee long-term stability and to
enhance speech recognition. Thus, only the unified approach is considered in
the following.
The major drawback of both implementations becomes evident by the
problem of unknown speakers which is not satisfactorily solved. This seems to
be the most challenging task of a completely unsupervised speech recognizer,
especially for a command and control application in an adverse environment.
Therefore, further effort is necessary to extend the system presented in this
chapter.
Another important issue is the problem of weakly, moderately and exten-
sively trained speaker models. Especially during the initialization on the first
few utterances of a new speaker, this approach bears the risk of severe er-
ror propagation. It seems to be difficult for weakly adapted speaker models
to compete with well trained speaker models since adaptation always tracks
speaker characteristics and residual influences of the acoustic environment.
As discussed so far, both algorithms do not appropriately account for the
training status of each speaker model except for codebook and GMM adap-
tation. The evolution of the system is not appropriately handled and the likelihood scores are expected to be biased.
Furthermore, both realizations do not provide any confidence measure
concerning the estimated speaker identity. Utterances with uncertain origin
cannot be rejected. Long-term stability might be affected because of mal-
adaptation.
As discussed so far, the integration of additional information sources concerning speaker changes or the speaker identity is not addressed. For example, a beam-
former, the elapsed time between two utterances or the completion of speech
dialog steps may contribute to a more robust speaker change detection.
In summary, the joint approach depicted in Fig. 5.5 constitutes a com-
plete system characterized by a unified statistical modeling, moderate com-
plexity and a mutual benefit between speech and speaker modeling. It has some deficiencies due to the discrete decision on the speaker identity at an utterance-level and the lack of confidence measures, which make the detection of unknown speakers more complicated.
In the next chapter an extension is introduced which alleviates the draw-
backs described above. A solution suitable for open-set scenarios will be
presented.
6 Evolution of an Adaptive Unsupervised Speech Controlled System
In the preceding chapter a first solution and a reference implementation for
combined speaker identification and speaker specific speech recognition were
presented and evaluated as complete systems. However, some deficiencies
discussed in the preceding section motivate to extend this first solution.
Speaker identification as discussed so far has been treated on an utterance
level based on a likelihood criterion. The discussion in this chapter goes be-
yond that. Long-term speaker identification is investigated to track different
speakers across several utterances. The evolution of the self-adaptive system
is taken into consideration. Together this results in a more confident guess
of the current speaker’s identity. Long-term stability and the detection of
unknown speakers can be significantly improved.
The motivation and goals of the new approach are discussed. Posterior
probabilities are derived which reflect both the likelihood and adaptation
stage of each speaker model. Combined with long-term speaker tracking on
a larger time scale a flexible and robust speaker identification scheme may
be achieved. This technique can be completed by the detection of unknown
speakers to obtain a strictly unsupervised solution. Finally, an overview of
the advanced architecture of the speech controlled system is presented. The
problem of speaker variability is solved on different time scales. The evalua-
tion results are given at the end of this chapter.
6.1 Motivation
In Chapter 5 a unified approach was introduced and evaluated on speech data
recorded in an automotive environment. This technique is based on a unified
modeling of speech and speaker related information. Especially for embedded
systems in adverse environments, such an approach seems to be advantageous
since a compact representation and fast retrieval of speaker characteristics
for enhanced speech recognition are favorable. The latter became obvious due
to the evaluation results and the improvements achieved.
However, unknown speakers appear to be a more challenging problem. As
discussed so far, the integrated speaker identification does not enable a robust
detection of unknown speakers. Obviously, it is very difficult to detect new
users only on a single utterance.
In addition, the problem of speaker models at highly different training lev-
els should be addressed in a self-learning speech controlled system, especially
if it is operated in an unsupervised manner. Error propagation is expected
since weakly trained speaker models are characterized by a higher error rate
with respect to the in-set / out-of-set detection. Confidence measures, e.g.
posterior probabilities, are expected to be more sophisticated than decisions
based on likelihoods.
Finally, it may be advantageous to integrate additional devices, e.g. a
beamformer, to support speaker tracking. Even though this aspect goes be-
yond the scope of this book, the following considerations can be easily ex-
tended as will be shown later.
Therefore, the solution presented in the preceding chapter shall be ex-
tended. The strengths of the unified approach shall be kept but its deficien-
cies shall be removed or at least significantly moderated. The problem of
speaker tracking is now addressed on a long-term time scale. The goal is to
buffer a limited number of utterances in an efficient manner. A path search
algorithm is applied to find the optimal assignment to the enrolled speakers
or an unknown speaker.
In addition, the training status of speaker specific codebooks is consid-
ered. As discussed before, the evaluation of each codebook in (5.23) does
not distinguish between marginally and extensively adapted speaker models.
Long-term speaker adaptation is characterized by a high number of parame-
ters which allow a highly individual adjustment. This degree of individualism
results in an improved match between statistical model and observed feature
vectors. Thus, the likelihood is expected to converge to higher values for
extensively adapted codebooks.
In a first step, a posterior probability is introduced based on a single utter-
ance by taking this likelihood evolution into consideration. Then long-term
speaker tracking is employed to extend this posterior probability to a series of
successive utterances. A robust estimate of the speaker identity can be given
by the MAP criterion. A confidence measure is automatically obtained since
the information about the speaker’s identity is kept as a probability instead
of discrete ML estimates. Unknown speakers can be detected as a by-product
of this technique.
6.2 Posterior Probability Depending on the Training Level
In this section the effect of different training or adaptation levels on likelihood
values is investigated for a closed-set scenario. Newly initialized codebooks are
less individually adapted compared to extensively trained speaker models. On
average a smaller ratio of the speaker specific log-likelihoods L_u^i and the log-likelihood of the standard codebook L_u^0 is expected compared to extensively trained models since all codebooks evolve from the standard codebook [Herbig et al., 2010c].
The goal is to employ posterior probabilities instead of likelihoods which integrate the log-likelihood scores L_u^i and the training status of each speaker model [Herbig et al., 2010e]. Thus, it is proposed to consider likelihood values as random variables with respect to the adaptation status and to learn the corresponding probability density function in a training. The latter has to be performed only once as described subsequently.
6.2.1 Statistical Modeling of the Likelihood Evolution
For a statistical evaluation of the likelihood distribution which is achieved in
an unsupervised speech controlled system the scores of 100,000 utterances
for target and non-target speakers were investigated in a closed-set scenario.
In order to have a realistic profile of internal parameters, e.g. log-likelihood
and training level, concerning speaker changes the speakers were organized
in 20 sets containing 25 enrolled speakers. The composition was randomly
chosen and was not necessarily gender balanced. In total, 197 native speakers of the USKCP¹ development database were employed. For each speaker
approximately 300 − 400 utterances were recorded under different driving
and noise conditions. For each speaker a training set of 200 utterances was
separated. Users were asked to speak short command and control utterances
such as digits and spelling loops or operated applications for hands-free tele-
phony, radio and navigation. The order of utterances and speakers was also
randomized. The speaker change rate from one utterance to the next was set
to about 7.5 %. At least 5 utterances were spoken between two speaker turns.
The speaker specific codebooks were evaluated in parallel according to (5.23).
The codebook of the target speaker was continuously adapted by combining
EV and ML estimates as described by (4.10). Hence, speaker adaptation was
supervised in the sense that the speaker identity was given to prevent any
maladaptation.
All speakers of this subset are subsequently considered to describe the likelihood evolution only depending on the number of utterances N_A used for adaptation. Speaker dependencies of this evolution are not modeled for reasons
of flexibility. A statistical description is targeted which does not depend on
a specific speaker set. Subsequently, log-likelihood ratios with respect to the
standard codebook are considered for target and non-target speakers. The
standard codebook as the origin of all speaker specific codebooks is viewed
as a reasonable reference.
¹ The USKCP is a speech database internally collected by TEMIC Speech Dialog Systems, Ulm, Germany. The USKCP comprises command and control utterances for in-car applications such as navigation commands, spelling and digit loops. The language is US-English.
[Histograms of the log-likelihood ratio L_u^i − L_u^0: (a) non-target speaker, (b) target speaker.]
Fig. 6.1 Histogram of the log-likelihood ratios L_u^i − L_u^0 for non-target speakers and the target speaker. λ = 4 and N_A = 5 are employed.
The central limit theorem predicts a Gaussian distribution of a random
variable if realized by a sum of an infinite number of independent random
variables [Bronstein et al., 2000; Zeidler, 2004]. Subsequently, it is assumed
that this condition is approximately fulfilled since L_u^i is an averaged log-likelihood.
In Fig. 6.1 and Fig. 6.2 both the histogram and the approximation by
univariate Gaussian distributions are shown for some examples. Asymmetric
distributions, e.g. as found by Bennett [2003], are not considered for reasons
of simplicity. In consequence univariate symmetric Gaussian distributions
are used to represent the log-likelihood ratios. The following two cases are
considered:
• Target speaker. The codebook under investigation belongs to the target speaker. This match is described by the parameter set {μ̇, σ̇} comprising mean and standard deviation.
• Non-target speakers. The codebook belongs to a non-target speaker profile. This mismatch situation is described by the parameter set {μ̈, σ̈}.
To estimate the parameter set {μ̇, σ̇} several adaptation stages comprising 5 utterances are defined. All log-likelihoods L_u^i measured within a certain adaptation stage N_A are employed to estimate the mean and standard deviation of the pooled log-likelihood ratio scores. The parameters {μ̈, σ̈} of the non-target speaker models can be calculated in a similar way. The procedure only differs in the fact that speaker model and current speaker do not match [Herbig et al., 2010e].
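A minimal sketch of this off-line training step could look as follows (Python with numpy; the stage width of 5 utterances follows the text, everything else is an assumed interface):

```python
import numpy as np

def fit_ratio_statistics(ratios, n_adapt, stage_width=5):
    """Pool log-likelihood ratios L_u^i - L_u^0 per adaptation stage and
    fit a univariate Gaussian to each stage.

    ratios, n_adapt: ratio scores and the adaptation level N_A of the
    scored codebook. Apply once to target and once to non-target scores
    to obtain {mu_dot, sigma_dot} and {mu_ddot, sigma_ddot}.
    """
    stages = (np.asarray(n_adapt) - 1) // stage_width
    params = {}
    for s in np.unique(stages):
        pool = np.asarray(ratios)[stages == s]
        params[int(s)] = (pool.mean(), pool.std(ddof=1))
    return params
```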
In Figs. 6.3 - 6.5 the parameters {μ̇, μ̈, σ̇, σ̈} are shown for certain adaptation stages N_A. Different values of the parameter λ were employed in speaker adaptation.
In summary, the Figs. 6.3 - 6.5 reveal a remarkable behavior of the speaker
specific codebooks which should be discussed in more detail. The threshold λ
for combining EV and ML estimates in (4.11) is increased from Fig. 6.3 to
[Gaussian approximations of the log-likelihood ratio L_u^i − L_u^0 for dedicated adaptation stages: (a) non-target speaker, (b) target speaker.]
Fig. 6.2 Distribution of log-likelihood ratios for dedicated adaptation stages - N_A = 10, N_A = 20, N_A = 50 and N_A = 100. λ = 4 is applied for codebook adaptation. Figure is taken from [Herbig et al., 2010e].
[Plots versus adaptation level N_A: (a) mean values {μ̇, μ̈}, (b) standard deviations {σ̇, σ̈}.]
Fig. 6.3 Log-likelihood ratios of the target (solid line) and non-target speaker models (dashed line) for different training levels. The convex combination in (4.10) and (4.11) uses only ML estimates (λ → 0).
Fig. 6.5. In Fig. 6.3 this constant is almost zero. Only ML estimates are
applied whereas in Fig. 6.5 pure EV estimates are employed.
At first glance the log-likelihood values behave as expected. The correct
assignment of an utterance to the corresponding codebook yields a clear
improvement with respect to the standard codebook. Log-likelihood ratios
increase with continued speaker adaptation. Even with only a few utterances
an improvement can be achieved on average. The prior knowledge incorpo-
rated into the EV estimates enables fast speaker adaptation based on only a
few adaptation parameters.
However, one would not employ the scenario depicted in Fig. 6.3. During
the first 20 utterances, the performance of the modified codebooks is signif-
icantly worse compared to the standard codebook. The initial mean vectors
in (4.10) are substituted by unreliable ML estimates which converge slowly.
[Plots versus adaptation level N_A: (a) mean values {μ̇, μ̈}, (b) standard deviations {σ̇, σ̈}.]
Fig. 6.4 Log-likelihood ratios of the target (solid line) and non-target speaker models (dashed line) for different training levels. The convex combination in (4.10) and (4.11) uses λ = 4. Figure is taken from [Herbig et al., 2010c,e].
[Plots versus adaptation level N_A: (a) mean values {μ̇, μ̈}, (b) standard deviations {σ̇, σ̈}.]
Fig. 6.5 Log-likelihood ratios of the target (solid line) and non-target speaker models (dashed line) for different training levels. The convex combination in (4.10) and (4.11) only uses EV estimates (λ → ∞).
In Fig. 6.5 the other extreme case is depicted. ML estimates are con-
sequently neglected and only EV adaptation is applied. Fast adaptation is
achieved on the first utterances but the log-likelihood scores settle after some
20 − 30 utterances. This shows the restriction of this implementation of the
EV approach since only 10 parameters can be estimated. This finding agrees
with the reasoning of Botterweck [2001].
On average codebooks of non-target speakers perform worse than the stan-
dard codebook. When their codebooks are continuously adapted, the dis-
crimination capability of speaker specific codebooks becomes obvious. The
mismatch between statistical model and actual speaker characteristics is in-
creasingly dominant. However, the extent of this drift is not comparable to
the likelihood evolution observed for the target speaker.
[Plots of the mean log-likelihood ratio versus adaptation level N_A: (a) target speaker (μ̇), (b) non-target speaker (μ̈).]
Fig. 6.6 Evolution of log-likelihood ratios for different speaker adaptation schemes
- combined EV and ML adaptation with λ = 4 (), MAP adaptation with η = 4 (2),
EV (◦) and ML () adaptation.
Dedicated values of the means and standard deviations shown in Fig. 6.4
are represented in Fig. 6.2 by the corresponding univariate Gaussian distri-
butions. In Fig. 6.6 likelihood evolution is evaluated for several adaptation
schemes.
At a second glance these distributions reveal some anomalies and conflicts. The conclusion that the codebook of the target speaker performs better than the standard codebook and that codebooks of non-target speakers are worse is only valid on average. The investigation reveals a large overlap of both distributions. Especially a weakly adapted codebook of the target speaker often seems to perform only comparably to or even worse than non-target speaker models. Conversely, even relatively well adapted models of non-target speakers might behave comparably to or better than the standard codebook in some cases.
Obviously, it is complicated to construct a self-organizing system as long
as the start or enrollment of new speakers is critical. As soon as about 20 ut-
terances are assigned to the correct speaker model, this situation is less crit-
ical since the overlap is reduced. Especially in the learning phase, the use
of posterior probabilities should help to prevent false decisions and error
propagation. Those errors are usually not detected by speaker identification
techniques when only ML estimates are employed.
Figs. 6.3 - 6.5 contain a further detail concerning the standard deviations of the log-likelihood ratios. Larger standard deviations can be observed for continuously adapted codebooks. The combination of short-term and long-term speaker adaptation discussed in Sect. 4.2 might give a reasonable explanation. Speaker adaptation starts with a small set of adaptation parameters and allows a higher degree of individuality as soon as a sufficient amount of training data is available. This restriction might explain the relatively tight Gaussian distributions at the beginning. When $N_A$ exceeds a threshold of about 10 utterances, the standard deviation increases
significantly. However, the shift of the Gaussian distributions is larger than
the broadening of the curve. A tendency towards more reliable decisions can
be observed.
Finally, this approach should be discussed with respect to the literature. Yin et al. [2008] investigate the influence of speaker adaptation on speaker specific GMMs. They conclude that likelihood scores drift with increasing enrollment duration. They assume Gaussian distributed likelihoods and perform mean subtraction and variance normalization to obtain normalized scores. In contrast to this book, a decision strategy based on likelihoods instead of posterior probabilities is used.
In the following the findings of the experiment described in this section are used to calculate posterior probabilities which reflect the observed likelihood scores based on a single utterance and the training level of all speaker specific codebooks. In a second step this concept is extended to posterior probabilities which take a series of successive utterances into consideration.
6.2.2 Posterior Probability Computation at Run-Time
Prior knowledge about likelihood evolution allows calculating meaningful pos-
terior probabilities which are more reliable than pure likelihoods as will be
shown later.
In the following, posterior probabilities are derived for each speaker. They equally consider the observed likelihood scores given the current utterance $X_u = \{x_1, \dots, x_{T_u}\}$ and the training levels of all speaker models. This includes not only the log-likelihood $L_u^{i_u}$ of the hypothesized target speaker $i_u$ but also the log-likelihoods of the standard codebook $L_u^0$ and of the non-target speaker models $L_u^{i \neq i_u}$. The performance of each speaker model is evaluated in the context of the codebooks of all enrolled speakers. The procedure has to be applicable to a varying number of enrolled speakers without any re-training since a self-learning speech controlled system is targeted [Herbig et al., 2010e].
The proposed posterior probability is decomposed into the contributions of each speaker model and is derived by applying Bayes' theorem. The posterior probability

$$p(i_u \,|\, L_u^0, L_u^1, \dots, L_u^{N_{Sp}}) = \frac{p(L_u^0, L_u^1, \dots, L_u^{N_{Sp}} \,|\, i_u) \cdot p(i_u)}{p(L_u^0, L_u^1, \dots, L_u^{N_{Sp}})} \qquad (6.1)$$

is represented by the conditional probability density of all log-likelihood scores, the prior probability of the assumed speaker and a normalization.
However, it seems to be advantageous to use relative measures instead of
the absolute likelihood scores:
• Absolute likelihood scores of speaker specific codebooks obviously do not
provide a robust measure. For example, adverse environments might affect
modeling accuracy. This effect is unwanted for speaker identification and
can be circumvented or at least alleviated by using relative instead of abso-
lute measures. Adverse conditions are expected to affect all codebooks in a
similar way. The standard codebook as a speaker independent and exten-
sively trained speech model can be considered as a reasonable reference.
This problem is related to the problem of open-set speaker identification
as found by Fortuna et al. [2005].
• Another strong argument for employing relative scores is the likelihood evolution
described in the last section. The likelihood converges to higher values
when the statistical models are continuously adapted. The range of log-
likelihood scores depends on the effective number of speaker adaptation
parameters.
• Furthermore, the log-likelihoods of the statistical model representing the target speaker seem to be only marginally correlated with those of non-target speaker models if relative measures, e.g. $L_u^i - L_u^0$, are considered instead of $L_u^i$. A similar result can be observed when speaker models of two non-target speakers are compared. In both cases the correlation coefficient [Bronstein et al., 2000] given by

$$\rho(a, b) = \frac{E_{a,b}\{(a - E_a\{a\}) \cdot (b - E_b\{b\})\}}{\sqrt{E_a\{(a - E_a\{a\})^2\}} \cdot \sqrt{E_b\{(b - E_b\{b\})^2\}}} \qquad (6.2)$$

$$a = L^i - L^0, \quad i = 1, \dots, N_{Sp} \qquad (6.3)$$

$$b = L^j - L^0, \quad j = 1, \dots, N_{Sp}, \; j \neq i \qquad (6.4)$$

was quite small on average when all speakers of the development set were investigated. In the case of speaker models trained on a few utterances the magnitude of the correlation coefficient was $|\rho| < 0.2$, in general. A further decrease could be observed for moderately and extensively trained speaker models.
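As an illustration, the correlation check in (6.2) - (6.4) can be reproduced with a few lines of NumPy. The sketch below assumes that the per-utterance log-likelihoods of two speaker codebooks and the standard codebook are available as arrays; the variable names and the generated data are purely illustrative.

```python
import numpy as np

def correlation_of_relative_scores(L_i, L_j, L_0):
    """Correlation coefficient (6.2) of the relative measures
    a = L^i - L^0 and b = L^j - L^0, estimated over utterances."""
    a = np.asarray(L_i) - np.asarray(L_0)
    b = np.asarray(L_j) - np.asarray(L_0)
    a_c = a - a.mean()          # a - E{a}
    b_c = b - b.mean()          # b - E{b}
    return (a_c * b_c).mean() / np.sqrt((a_c**2).mean() * (b_c**2).mean())

# Hypothetical data: log-likelihoods of two speaker models and the
# standard codebook on 100 utterances.
rng = np.random.default_rng(0)
L_0 = rng.normal(-50.0, 2.0, 100)
L_i = L_0 + rng.normal(0.5, 0.3, 100)
L_j = L_0 + rng.normal(-0.2, 0.3, 100)
print(correlation_of_relative_scores(L_i, L_j, L_0))
```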
Thus, equation (6.1) is rewritten in terms of relative measures with respect to the score of the standard codebook:

$$p(i_u \,|\, L_u^0, L_u^1, \dots, L_u^{N_{Sp}}) = \frac{p(L_u^1, \dots, L_u^{N_{Sp}} \,|\, L_u^0, i_u) \cdot p(L_u^0 \,|\, i_u) \cdot p(i_u)}{p(L_u^1, \dots, L_u^{N_{Sp}} \,|\, L_u^0) \cdot p(L_u^0)}. \qquad (6.5)$$
Subsequently, it is assumed that the speaker specific log-likelihoods $L^i$ which are normalized by $L^0$ can be treated separately or equivalently can be viewed as statistically independent:

$$p(i_u \,|\, L_u^0, L_u^1, \dots, L_u^{N_{Sp}}) = \frac{p(L_u^1 \,|\, L_u^0, i_u)}{p(L_u^1 \,|\, L_u^0)} \cdots \frac{p(L_u^{N_{Sp}} \,|\, L_u^0, i_u)}{p(L_u^{N_{Sp}} \,|\, L_u^0)} \cdot \frac{p(L_u^0 \,|\, i_u) \cdot p(i_u)}{p(L_u^0)} \qquad (6.6)$$

$$= \frac{p(L_u^1 \,|\, L_u^0, i_u)}{p(L_u^1 \,|\, L_u^0)} \cdots \frac{p(L_u^{N_{Sp}} \,|\, L_u^0, i_u)}{p(L_u^{N_{Sp}} \,|\, L_u^0)} \cdot p(i_u \,|\, L_u^0). \qquad (6.7)$$
Furthermore, the log-likelihoods of the standard codebook $L_u^0$ are assumed to be independent of the speaker identity $i_u$:

$$p(i_u \,|\, L_u^0, L_u^1, \dots, L_u^{N_{Sp}}) = \frac{p(L_u^1 \,|\, L_u^0, i_u)}{p(L_u^1 \,|\, L_u^0)} \cdots \frac{p(L_u^{N_{Sp}} \,|\, L_u^0, i_u)}{p(L_u^{N_{Sp}} \,|\, L_u^0)} \cdot p(i_u). \qquad (6.8)$$
The denominators contain normalization factors which guarantee that the sum over all posterior probabilities $p(i_u \,|\, L_u^0, L_u^1, \dots, L_u^{N_{Sp}})$ is equal to unity. Therefore, the final result can be given by

$$p(i_u \,|\, L_u^0, L_u^1, \dots, L_u^{N_{Sp}}) \propto p(L_u^1 \,|\, L_u^0, i_u) \cdots p(L_u^{N_{Sp}} \,|\, L_u^0, i_u) \cdot p(i_u). \qquad (6.9)$$
To compute the posterior probability $p(i_u \,|\, L_u^0, L_u^1, \dots, L_u^{N_{Sp}})$, it is assumed that the speaker under investigation is the target speaker whereas all remaining speakers are non-target speakers. Thus, the density function $p(L_u^{i_u} \,|\, L_u^0, i_u)$ has to be calculated in the case of the target speaker and $p(L_u^{i \neq i_u} \,|\, L_u^0, i_u)$ in the case of a non-target speaker. Both cases are combined in (6.9). This step is repeated for all speakers. The procedure is not limited by the number of enrolled speakers $N_{Sp}$ because an additional speaker only causes a further factor in (6.9). Thus, each posterior probability takes all log-likelihoods of a particular utterance into consideration.
As discussed so far, it is not obvious how this approach reflects different training levels. In the following, the knowledge about the adaptation status is directly incorporated into the parameters of the conditional density functions $\{p(L_u^1 \,|\, L_u^0, i_u), \dots, p(L_u^{N_{Sp}} \,|\, L_u^0, i_u)\}$.
In general, conditional Gaussian distributions [Bishop, 2007] seem to offer a statistical framework suitable to integrate the log-likelihood of a particular speaker and the reference given by the standard codebook. Alternatively, univariate Gaussian distributions can be applied which are trained and evaluated on the log-likelihood ratio $L_u^i - L_u^0$ as done in the preceding section. For reasons of simplicity the latter approach is employed.
Two sets of parameters are required for each adaptation level as discussed before. The parameters for the target speaker are characterized by the mean $\dot{\mu}$ and standard deviation $\dot{\sigma}$ based on log-likelihood ratios. The parameters for non-target speakers are denoted by $\ddot{\mu}$ and $\ddot{\sigma}$. Now the conditional density functions $p(L_u^i \,|\, L_u^0, i_u)$ can be calculated for the target speaker

$$p(L_u^{i_u} \,|\, L_u^0, i_u) = \mathcal{N}\{L_u^{i_u} - L_u^0 \,|\, \dot{\mu}_{i_u}, \dot{\sigma}_{i_u}\} \qquad (6.10)$$

and non-target speakers

$$p(L_u^{i \neq i_u} \,|\, L_u^0, i_u) = \mathcal{N}\{L_u^{i \neq i_u} - L_u^0 \,|\, \ddot{\mu}_i, \ddot{\sigma}_i\}. \qquad (6.11)$$
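A minimal sketch of the posterior computation in (6.9) - (6.11) is given below, assuming that the log-likelihoods of the current utterance and the level-dependent Gaussian parameters $\{\dot{\mu}, \dot{\sigma}\}$ and $\{\ddot{\mu}, \ddot{\sigma}\}$ have already been determined; the function name, the uniform prior and the array layout are illustrative choices, not part of the original implementation.

```python
import numpy as np
from scipy.stats import norm

def posterior_per_utterance(L, L0, mu_t, sig_t, mu_n, sig_n, prior=None):
    """Posterior p(i_u | L^0, ..., L^{N_Sp}) according to (6.9)-(6.11).

    L           : log-likelihoods L^i of the N_Sp speaker codebooks
    L0          : log-likelihood of the standard codebook
    mu_t, sig_t : target parameters (dot mu, dot sigma) per speaker
    mu_n, sig_n : non-target parameters (double-dot mu, sigma) per speaker
    """
    n_sp = len(L)
    ratios = np.asarray(L) - L0                    # L^i - L^0
    prior = np.full(n_sp, 1.0 / n_sp) if prior is None else prior
    post = np.empty(n_sp)
    for i_u in range(n_sp):                        # hypothesized target
        p = norm.pdf(ratios[i_u], mu_t[i_u], sig_t[i_u])   # factor (6.10)
        for i in range(n_sp):
            if i != i_u:
                p *= norm.pdf(ratios[i], mu_n[i], sig_n[i])  # factors (6.11)
        post[i_u] = p * prior[i_u]
    return post / post.sum()                       # normalize to unity
```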
For testing, the parameter sets $\{\dot{\mu}, \dot{\sigma}\}$ and $\{\ddot{\mu}, \ddot{\sigma}\}$ are calculated for each speaker individually. The parameters depend only on the number of training utterances $N_A$. With respect to Fig. 6.4 this problem can be considered as a regression problem. For example, a Multilayer Perceptron² (MLP) may be trained to interpolate the graph for unseen $N_A$. For each scenario one MLP was trained prior to the experiments which will be discussed later. Each MLP may be realized by one node in the input layer, 4 nodes in the hidden layer and two output nodes. The latter represent the mean and standard deviation of the univariate Gaussian distribution. Another strategy might be to categorize mean values and standard deviations in groups of several adaptation stages and to use a look-up table.
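Such a regression may be sketched, for instance, with scikit-learn's MLPRegressor using the 1-4-2 topology described above; the training data below is purely hypothetical and only serves to illustrate the interpolation over unseen $N_A$.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical training data: measured means and standard deviations of
# the log-likelihood ratios at a few adaptation levels (cf. Fig. 6.4).
N_A   = np.array([5, 10, 20, 30, 50, 100])
mu    = np.array([0.2, 0.4, 0.7, 0.9, 1.1, 1.3])    # illustrative values
sigma = np.array([0.3, 0.4, 0.5, 0.55, 0.6, 0.6])

# One input node (log N_A), one hidden layer with 4 nodes, two outputs.
mlp = MLPRegressor(hidden_layer_sizes=(4,), activation='tanh',
                   max_iter=5000, random_state=0)
mlp.fit(np.log(N_A).reshape(-1, 1), np.column_stack([mu, sigma]))

# Interpolate the parameters for an unseen adaptation level.
mu_hat, sigma_hat = mlp.predict(np.log([[40]]))[0]
```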
In summary, posterior probabilities are employed to reflect the match be-
tween the target speaker model and an observation as well as the expected
log-likelihood due to the adaptation stage of the statistical model. In ad-
dition, the performance of non-target speaker models is evaluated likewise.
The computation of the posterior probabilities can be simplified when all
log-likelihoods are normalized by the log-likelihood of the standard code-
book. The resulting posterior probability can be factorized. Prior knowledge
about likelihood evolution is directly incorporated into the parameters of the
corresponding univariate Gaussian distributions.
In the following sections long-term speaker tracking is described. The goal is to calculate a posterior probability which is not based on a single utterance alone. Instead a series of successive utterances is investigated to determine the probability that speaker $i$ has spoken the utterance $X_u$ given the preceding and successive utterances.
6.3 Closed-Set Speaker Tracking
As discussed so far, only a posterior probability of a single utterance has been considered. However, the goal of this chapter is to provide the posterior probability $p(i_u \,|\, X_{1:N_u})$ for each speaker $i$ which reflects an entire series of utterances $X_{1:N_u}$. Therefore a statistical model has to be found which allows long-term speaker tracking. Referring to Sect. 2.4.1 both discriminative and generative statistical models can be applied to distinguish between different speakers:
• Discriminative models such as an MLP can be trained to separate enrolled
speakers and to detect speaker changes. However, the number of speak-
ers is often unknown and speech data for an enrollment is usually not
available for a self-learning speaker identification system. Furthermore, an
enrollment is undesirable with respect to convenience.
• Generative statistical models are preferred in this context since enrolled speakers are individually represented. A solution is targeted which is independent of the composition of the speaker pool.
² A detailed description of MLPs, their training and evaluation can be found in Bishop [1996].
Fig. 6.7 Speaker tracking for three enrolled speakers (Sp 1, Sp 2, Sp 3) realized by a Markov model of first order. The transitions $a_{ij}$ denote the speaker change probability from speaker $i$ to speaker $j$.
Subsequently, speaker tracking is viewed as a Markov process. Each state represents one enrolled speaker and the transitions model speaker changes from one utterance to the next. In Fig. 6.7 an example of three enrolled speakers is depicted. A flexible structure is used which allows the number of states to be increased with each new speaker until an upper limit is reached. Then one speaker model is dropped and replaced by a new speaker model to limit the computational complexity. Transition probabilities are chosen independently of the enrolled speakers. They are either determined by an expected prior speaker change rate $p_{Ch}$ or can be controlled by additional devices such as a beamformer if a discrete probability for speaker changes can be given. In contrast to Sect. 2.5.2, the Markov model is operated on asynchronous events, namely utterances instead of equally spaced time instances.
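The transition matrix of this Markov model follows directly from $p_{Ch}$. The sketch below assumes, for illustration, that a speaker change is distributed uniformly over the remaining speakers.

```python
import numpy as np

def transition_matrix(n_sp, p_ch):
    """Speaker change probabilities a_ij for the Markov model in Fig. 6.7.
    Self-transitions get 1 - p_ch; a change is spread uniformly over the
    remaining n_sp - 1 speakers (an assumption made for illustration)."""
    A = np.full((n_sp, n_sp), p_ch / (n_sp - 1))
    np.fill_diagonal(A, 1.0 - p_ch)
    return A

A = transition_matrix(3, p_ch=0.1)   # each row sums to unity
```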
In the next step the link between speaker identification described in
Sect. 5.2 and the Markov model for speaker tracking has to be implemented.
Speaker identification makes use of speaker specific codebooks which are in-
terpreted as GMMs with uniform weighting factors.
A new HMM³ can be constructed on an utterance level by combining codebooks and the Markov model for speaker tracking [Herbig et al., 2010e]. During speech recognition codebooks represent speaker specific pronunciations to guarantee optimal speech decoding. Now these codebooks are used to determine the average match with the speaker's characteristics for speaker identification according to (5.23). Codebooks are considered here in the context of a series of utterances and speaker changes. The emission probability density of a particular speaker is given by

$$p(X_u \,|\, i_u) = \exp\{T_u \cdot L_u^{i_u}\}. \qquad (6.12)$$
Each speaker or equivalently each state is described by an individual GMM and therefore the resulting HMM equals a CDHMM as introduced in
³ A similar model, the so-called segment model, is known from the speech recognition literature. Instead of single observations $x_t$, segments $x_{1:T}$ of variable length $T$ are assigned to each state to overcome some limitations of HMMs [Delakis et al., 2008].
Sect. 2.5.2. Decoding techniques for HMMs can be used to assign a series of utterances

$$X_{1:N_u} = \{X_1, \dots, X_u, \dots, X_{N_u}\} \qquad (6.13)$$

to the most probable sequence of speaker identities

$$i_{1:N_u}^{MAP} = \{i_1^{MAP}, \dots, i_u^{MAP}, \dots, i_{N_u}^{MAP}\}. \qquad (6.14)$$
This task can be solved by an iterative algorithm such as the forward algorithm already explained in Sect. 2.5.2. The forward algorithm can be employed to compute the joint probability $p(X_{1:u}, i_u)$ of each speaker $i_u$ given the observed history of utterances $X_{1:u}$. The prior probability of each speaker is modeled here by $p(i_u) = \frac{1}{N_{Sp}}$, where $N_{Sp}$ represents the number of enrolled speakers.
Additionally, the backward algorithm can be employed. It requires the likelihoods $p(X_u \,|\, i_u)$ of $N_u$ utterances to be saved in a buffer and the complete backward path to be re-computed after each new utterance. For all speaker models the likelihood $p(X_{u+1:N_u} \,|\, i_u)$ has to be calculated as described in Sect. 2.5.2.
In Fig. 6.8 an example with 5 speakers and two exemplary paths is shown. The dashed path displays the correct path and the dotted one denotes a non-target speaker. The graphical representation in form of a finite state machine as used in Fig. 6.7 only discloses all possible states and transitions as well as the corresponding probabilities. The time trajectory is concealed. Thus, the trellis representation is subsequently used because it spans all possible paths or equivalently the trajectory of states as a surface. The dashed path is iteratively processed by the forward algorithm. In this example the first three utterances are expected to produce high posterior probabilities since no speaker turn is presumed. After the speaker change there is a temporary decrease of the resulting posterior probability because speaker turns are penalized by the prior probability $p_{Ch}$. The backward algorithm encounters the inverse problem since only the last three utterances are expected to result in high posterior probabilities. This example shows the problem of incomplete posterior probabilities.
Therefore, complete posteriors are used to capture the entire temporal knowledge about the complete series of utterances. This is achieved by the fusion of the forward and backward algorithm which was introduced in Sect. 2.5.2. Here only the final result is given:

$$p(i_u \,|\, X_{1:N_u}) \propto p(X_{1:u}, i_u) \cdot p(X_{u+1:N_u} \,|\, i_u), \quad 0 < u < N_u \qquad (6.15)$$

$$p(i_{N_u} \,|\, X_{1:N_u}) \propto p(X_{1:N_u}, i_{N_u}). \qquad (6.16)$$
The MAP criterion can be applied to determine the most probable speaker identity

$$i_u^{MAP} = \arg\max_{i_u} \{p(i_u \,|\, X_{1:N_u})\}. \qquad (6.17)$$
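A compact sketch of the tracking procedure in (6.13) - (6.17) is given below. It operates on per-utterance emission scores $p(X_u \,|\, i_u)$ (or the posterior-based substitute discussed at the end of this section) and the transition matrix from the previous sketch; the per-utterance scaling is an implementation choice added for numerical stability and does not change the MAP decision.

```python
import numpy as np

def track_speakers(emissions, A):
    """Forward-backward speaker tracking on an utterance level.

    emissions : (N_u, N_Sp) array, emissions[u, i] ~ p(X_u | i_u = i)
    A         : (N_Sp, N_Sp) transition matrix (speaker change model)
    Returns the posteriors p(i_u | X_{1:N_u}) and the MAP identities (6.17).
    """
    n_u, n_sp = emissions.shape
    alpha = np.empty((n_u, n_sp))          # forward:  p(X_{1:u}, i_u)
    beta  = np.empty((n_u, n_sp))          # backward: p(X_{u+1:N_u} | i_u)
    alpha[0] = emissions[0] / n_sp         # uniform prior p(i_u) = 1/N_Sp
    alpha[0] /= alpha[0].sum()             # scale for numerical stability
    for u in range(1, n_u):
        alpha[u] = emissions[u] * (alpha[u - 1] @ A)
        alpha[u] /= alpha[u].sum()
    beta[-1] = 1.0
    for u in range(n_u - 2, -1, -1):
        beta[u] = A @ (emissions[u + 1] * beta[u + 1])
        beta[u] /= beta[u].sum()
    post = alpha * beta                    # fusion (6.15), (6.16)
    post /= post.sum(axis=1, keepdims=True)
    return post, post.argmax(axis=1)       # MAP criterion (6.17)
```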
Fig. 6.8 Example for speaker tracking based on posterior probabilities. 5 speakers and two exemplary paths are shown. States are depicted versus the discrete time axis.
In addition, a direct measure for confidence or complementary uncertainty is obtained from the posterior probabilities. If uncertainty exceeds a given threshold, particular utterances can be rejected by speaker identification. Those utterances are not used for speaker adaptation. In the experiments carried out, adaptation was not performed when $p(i_u^{MAP} \,|\, X_{1:N_u}) < 0.5$.
The Markov model for speaker tracking and the modified posterior probability discussed before can be combined to obtain a more precise speaker modeling where likelihood evolution is considered. $p(L_u^0, L_u^1, \dots, L_u^{N_{Sp}} \,|\, i_u)$ is employed in the forward-backward algorithm instead of the likelihood $p(X_u \,|\, i_u)$. The former can be obtained by applying Bayes' theorem to (6.9). The probability density function $p(L_u^0, L_u^1, \dots, L_u^{N_{Sp}} \,|\, i_u)$ is used as an indirect measure for $p(X_u \,|\, i_u)$ since the likelihoods $\{L_u^0, L_u^1, \dots, L_u^{N_{Sp}}\}$ are derived from the observation $X_u$.
6.4 Open-Set Speaker Tracking
The focus of the preceding sections has been to enhance speech recognition by
an unsupervised speaker tracking which is able to identify a limited number of
enrolled speakers. Only the closed-set case has been investigated. The feature vectors of an utterance are only compared with existing speaker profiles and the profile with the highest match, or conversely the lowest discrepancy, is selected. Thus, the current user is always assigned to one of the enrolled
speaker profiles. In this scenario speaker identification needs additional input
when a new user operates the speech controlled device for the first time.
Long-term speaker tracking suffers from this deficiency in the same way.
Although each speaker model is investigated with respect to its adaptation
level, there is no measure whether the existing speaker models fit at all. The
normalization in (6.16) always guarantees that the sum of all posterior probabilities equals unity. When there is a mismatch between an utterance and the speaker's codebook, this normalization can cause unwanted results and might suggest a spuriously high confidence. The goal is to define a statistical approach
to detect mismatch conditions, especially unknown speakers.
In Sect. 2.4.3 some techniques were discussed to detect unknown speakers. The principal idea was to apply either fixed or dynamic thresholds to the log-likelihoods of each speaker model. Log-likelihood scores can be normalized by a reference, e.g. given by a UBM or a speaker independent codebook, to obtain a higher robustness against false detections.
Figs. 5.11, 6.3 - 6.5 and 6.9 show that finding a global threshold for log-likelihood ratios is difficult. The log-likelihood distributions vary significantly depending on the adaptation stage. Especially in the learning phase of a new codebook, it is complicated to find an optimal threshold. An unacceptably high error rate with respect to the discrimination of in-set and out-of-set speakers is expected.
Unknown speakers cannot be modeled directly because of a lack of training or adaptation data. However, an alternative model for situations when no speaker model represents an utterance appropriately can be obtained as a by-product of the posterior probability explained in Sect. 6.2.2.
In (6.9) each log-likelihood score is validated given the adaptation stage
of the corresponding codebook. The range of log-likelihood values is known
a priori. The main principle is to consider one speaker as the target speaker
whereas remaining speakers have to be viewed as non-target speakers.
In the case of an unknown user no speaker model exists so that all speaker models act as non-target speaker models. In consequence, the posterior probability of an out-of-set speaker is proportional to the product of the univariate Gaussian functions of all non-target speakers [Herbig et al., 2011]:

$$p(i_u = N_{Sp}+1 \,|\, L_u^0, L_u^1, \dots, L_u^{N_{Sp}}) \propto \prod_{i=1}^{N_{Sp}} \mathcal{N}\{L_u^i - L_u^0 \,|\, \ddot{\mu}_i, \ddot{\sigma}_i\} \cdot p(i_u = N_{Sp}+1). \qquad (6.18)$$
The closed-set solution in (6.9) already contains the log-likelihood ratios of all
possible non-target speaker models. Therefore, only the normalization has to
be adjusted to guarantee that the sum over all in-set and out-of-set speakers
equals unity.
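The closed-set posterior sketch from Sect. 6.2.2 can be extended by the out-of-set factor (6.18) as follows; again, the function and parameter names are illustrative and the prior factors are omitted, i.e. equal priors over all hypotheses are assumed.

```python
import numpy as np
from scipy.stats import norm

def open_set_posterior(L, L0, mu_t, sig_t, mu_n, sig_n):
    """Posteriors over N_Sp in-set speakers plus one out-of-set speaker.
    The last entry implements (6.18): all enrolled models are treated as
    non-target models when the user is unknown."""
    n_sp = len(L)
    ratios = np.asarray(L) - L0
    post = np.empty(n_sp + 1)
    for i_u in range(n_sp):                        # in-set hypotheses
        p = norm.pdf(ratios[i_u], mu_t[i_u], sig_t[i_u])
        for i in range(n_sp):
            if i != i_u:
                p *= norm.pdf(ratios[i], mu_n[i], sig_n[i])
        post[i_u] = p
    # out-of-set hypothesis (6.18): product of all non-target densities
    post[n_sp] = np.prod(norm.pdf(ratios, mu_n, sig_n))
    return post / post.sum()                       # adjusted normalization
```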
Fig. 6.9 Overlap of the log-likelihood distributions of the target (solid line) and non-target speaker models (dashed line) for the adaptation levels (a) $N_A = 5$, (b) $N_A = 10$, (c) $N_A = 20$, (d) $N_A = 30$, (e) $N_A = 50$ and (f) $N_A = 100$. $\lambda = 4$ is employed in speaker adaptation. The x-coordinate denotes the log-likelihood ratio $L_u^i - L_u^0$ of speaker specific codebooks and standard codebook.
The concept of long-term speaker tracking, however, exploits the knowl-
edge of a series of utterances and is more robust compared to a binary clas-
sifier based on likelihoods or posterior probabilities originating from a sin-
gle utterance. The Markov model used for speaker tracking can be easily
extended from $N_{Sp}$ to $N_{Sp}+1$ states to integrate the statistical modeling of unknown speakers into the forward-backward algorithm.

Fig. 6.10 Example for open-set speaker tracking. Paths of in-set speakers and an out-of-set speaker are depicted versus the discrete time axis.

In Fig. 6.10 open-set speaker tracking is shown for an example of three paths. Two belong to enrolled speakers and an additional one represents an unknown speaker.
The MAP estimate in (6.17) may be used to detect unknown speakers in
direct comparison with all enrolled speaker models. If the out-of-set speaker
obtains the highest posterior probability, a new speaker specific codebook
may be initialized. However, a relatively small posterior probability might be
sufficient to enforce a new codebook. This can be the case when all posterior
probabilities tend towards a uniform distribution.
It seems to be advantageous to consider speaker identification as a two-stage decision process [Angkititrakul and Hansen, 2007]. In the first step, the MAP criterion is applied as given in (6.17) to select the most probable in-set speaker. In the second step, this speaker identity has to be verified. This leads to a binary decision between in-set and out-of-set speakers. Hypotheses $H_0$ and $H_1$ characterize the event of an in-set and an out-of-set speaker, respectively. The optimal decision is defined in (2.14).
In general, the following two probabilities can be calculated by the forward-backward algorithm. The posterior probability

$$p(H_0 \,|\, X_{1:N_u}) = p(i_u \neq N_{Sp}+1 \,|\, X_{1:N_u}) \qquad (6.19)$$

denotes an enrolled speaker and the probability of an unknown speaker is given by

$$p(H_1 \,|\, X_{1:N_u}) = p(i_u = N_{Sp}+1 \,|\, X_{1:N_u}). \qquad (6.20)$$
Optimal Bayes decisions depend on the choice of the costs for both error scenarios as described in Sect. 2.3. Subsequently, equal costs for undetected out-of-set speakers and undetected in-set speakers are assumed to simplify the following considerations. Bayes' theorem permits the following notation

$$p(X_{1:N_u} \,|\, H_0) \cdot p(H_0) = p(H_0 \,|\, X_{1:N_u}) \cdot p(X_{1:N_u}) \qquad (6.21)$$

$$p(X_{1:N_u} \,|\, H_1) \cdot p(H_1) = p(H_1 \,|\, X_{1:N_u}) \cdot p(X_{1:N_u}), \qquad (6.22)$$
where the posterior probabilities $p(H_0 \,|\, X_{1:N_u})$ and $p(H_1 \,|\, X_{1:N_u})$ are complementary

$$p(H_0 \,|\, X_{1:N_u}) + p(H_1 \,|\, X_{1:N_u}) = 1. \qquad (6.23)$$
According to equation (2.14) the following criterion can be applied to decide whether a new speaker profile has to be initialized:

$$\frac{p(H_0 \,|\, X_{1:N_u})}{p(H_1 \,|\, X_{1:N_u})} \;\overset{H_1}{\leq}\; 1 \qquad (6.24)$$

$$\frac{1 - p(H_1 \,|\, X_{1:N_u})}{p(H_1 \,|\, X_{1:N_u})} \;\overset{H_1}{\leq}\; 1 \qquad (6.25)$$

$$p(H_1 \,|\, X_{1:N_u}) \;\overset{H_1}{\geq}\; \frac{1}{2}. \qquad (6.26)$$
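Building on the open-set posteriors sketched above, the two-stage decision reduces, under the equal-cost assumption, to the simple test (6.26); the helper below is only an illustration of that decision logic.

```python
import numpy as np

def decide(posteriors):
    """Two-stage decision on the open-set posteriors: first the MAP
    in-set speaker (6.17), then verification against H1 via (6.26).
    posteriors: array of length N_Sp + 1, last entry = out-of-set."""
    posteriors = np.asarray(posteriors)
    if posteriors[-1] >= 0.5:          # p(H1 | X_{1:N_u}) exceeds 1/2
        return None                    # initialize a new speaker profile
    return int(np.argmax(posteriors[:-1]))  # most probable enrolled speaker
```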
In particular, the posterior probabilities derived in Sect. 6.2.2 can be used
in the forward-backward algorithm. Discrete decisions on an utterance level
are avoided and a maximum of information is exploited. Furthermore, a series
of utterances is considered to reduce the risk of false decisions. This technique
provides a solution for the problem of open-set speaker tracking, especially
when speaker models are differently trained.
6.5 System Architecture
In combination with long-term speaker tracking, a completely unsupervised system can be implemented in which the detection of unknown speakers is integrated [Herbig et al., 2011]. Since speech recognition requires a guess of the speaker identity on different time scales, the structure of the speaker identification process shown in Fig. 6.11 is employed.
I Speech recognition. Codebook selection is realized by a local speaker
identification on a frame level as discussed in Sect. 5.2.1.
II Preliminary speaker identification. After each utterance the updates
of the speaker specific parameter sets for energy normalization and mean
subtraction are kept if no speaker turn is detected. Otherwise a reset is
performed and the parameter set of the identified speaker is used subsequently. This stage is required for practical reasons since speaker identification and speech recognition accuracy are sensitive to an appropriate feature normalization.
III Long-term speaker tracking. At the latest after a predetermined number of utterances a final speaker identification is enforced. Speaker adaptation is then calculated for the most probable alignment of speaker identities. Speaker tracking is based on posterior probabilities originating from the reinterpretation of the observed log-likelihood scores. Unknown speakers can be detected. An additional speaker change detection, e.g. with the help of a beamformer, may be included without any structural modifications.

Fig. 6.11 System architecture for joint speaker identification and speech recognition comprising three stages. Parts I and II denote the speaker specific speech recognition and preliminary speaker identification, respectively. In Part III posterior probabilities are calculated for each utterance and long-term speaker tracking is performed. Speaker adaptation is employed to enhance speaker identification and speech recognition. Codebooks are initialized in the case of an unknown speaker and the statistical modeling of speaker characteristics is continuously improved.
The speaker adaptation scheme presented in Chapter 4 allows the system
to delay the final speaker identification since only the sufficient statistics of
the standard codebook given by (4.14) and (4.15) are required. The assign-
ment of observed feature vectors to the Gaussian distributions of a codebook
is determined by the standard codebook. Therefore the speaker identity is
not needed to accumulate the adaptation data. As soon as the speaker iden-
tity is determined, the data of this speaker is accumulated and an enhanced
codebook is computed. To balance speaker identification accuracy with the
latency of speaker adaptation and memory consumption, at most 6 successive
utterances are subsequently employed for speaker tracking.
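The bookkeeping for this delayed adaptation might be organized as sketched below. The buffer length of 6 follows the text; the statistics themselves correspond to (4.14) and (4.15) and are not repeated here, and the profile interface (accumulate) is hypothetical.

```python
from collections import deque

class AdaptationBuffer:
    """Buffers the per-utterance sufficient statistics of at most max_len
    utterances until long-term speaker tracking has fixed the alignment."""
    def __init__(self, max_len=6):
        self.buffer = deque(maxlen=max_len)

    def push(self, stats):
        # stats: statistics accumulated against the standard codebook,
        # e.g. occupancy counts and weighted feature sums per Gaussian;
        # a final identification must be enforced before the buffer is full.
        self.buffer.append(stats)

    def commit(self, speaker_ids, profiles):
        # speaker_ids: MAP alignment from the forward-backward algorithm,
        # one identity (or None for rejected utterances) per buffered utterance
        for stats, spk in zip(self.buffer, speaker_ids):
            if spk is not None:
                profiles[spk].accumulate(stats)   # hypothetical profile API
        self.buffer.clear()
```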
6.6 Evaluation
In the following experiments speaker identification has been extended by long-
term speaker tracking. A completely unsupervised speech controlled system
has been evaluated.
First, the closed-set scenario examined in Sect. 5.4.1 is discussed. Speaker
identification and speech recognition are evaluated for speaker groups of
5 or 10 enrolled speakers [Herbig et al., 2010d,e]. In addition, the robust-
ness of both implementations against speaker changes is examined.
Then speaker tracking is tested in a completely unsupervised way to sim-
ulate an open-set test case [Herbig et al., 2011]. The speech recognition rate
is compared to the results of the first solution presented in Chapter 5 and
optimal speaker adaptation investigated in Chapter 4.
6.6.1 Closed-Set Speaker Identification
In a first step, closed-set experiments are considered to investigate the speaker
identification and speech recognition rates for speaker sets with 5 or 10 speak-
ers [Herbig et al., 2010d,e]. Finally, the effect of speaker changes on the iden-
tification accuracy is examined.
Unsupervised Speaker Identification of 5 Enrolled Speakers
In Sect. 5.4.1 closed-set identification was examined for groups of 5 speakers. In total, 60 different sets were evaluated and 75,000 test utterances were examined. The optimum of the WA was determined in Sect. 4.3 by supervised speaker adaptation in the sense that the speaker identity was given. This experiment has been repeated with long-term speaker tracking under realistic conditions to obtain a direct comparison.
The results are summarized in Table 6.1. The identification accuracy is
noticeably improved by long-term speaker tracking compared to the results in
Table 5.1. For λ = 4 a relative error rate reduction of approximately 74 % can
be achieved by long-term speaker tracking. Unfortunately, speech recognition
does not benefit from this approach and remains at the same level as before.
Again, the combination of EV and ML estimates yields the best identifi-
cation result for λ = 4. 98.59 % of the speakers are correctly identified. For
higher values of λ the identification accuracy is degraded. However, for all
implementations except the baseline and short-term adaptation the identifi-
cation rates are higher than for the experiments discussed in Sect. 5.4.1.
Even for λ ≈ 0 an improvement can be observed since the temporary decrease of the likelihood shown in Fig. 6.3 is taken into consideration. For λ → ∞ only a marginal improvement is achieved which can be explained by Fig. 6.5. Since the corresponding likelihood scores settle after 10-20 utterances, no improvement can be expected from the knowledge of the adaptation status in the long run.
Table 6.1 Comparison of different adaptation techniques for self-learning speaker
identification with long-term speaker tracking. Speaker sets comprising 5 en-
rolled speakers are considered. Table is taken from [Herbig et al., 2010e].
Speaker adaptation Rejected [%] WA [%] Speaker ID [%]
Baseline - 85.23 -
Short-term adaptation - 86.13 -
EV-MAP adaptation
ML (λ ≈ 0) 2.09 86.85 86.74
λ = 4 2.12 88.09 98.59
λ = 8 2.11 88.17 97.91
λ = 12 2.11 88.21 97.37
λ = 16 2.13 88.18 97.28
λ = 20 2.12 88.20 95.04
EV (λ →∞) 2.28 87.48 86.58
Typical errors
min ±0.23 ±0.09
max ±0.25 ±0.25
Table 6.2 Comparison of different adaptation techniques for long-term speaker
tracking. Speaker sets with 5 enrolled speakers are investigated. Speaker iden-
tification and speech recognition results are presented for several evaluation bands -
I = [1; 50[, II = [50; 100[, III = [100; 250] utterances.
Speaker I II III
adaptation
WA [%] ID [%] WA [%] ID [%] WA [%] ID [%]
Baseline 84.51 - 86.84 - 85.02 -
Short-term 85.89 - 87.22 - 85.87 -
adaptation
EV-MAP
adaptation
ML (λ ≈ 0) 85.38 81.95 87.49 87.05 87.30 88.22
λ = 4 87.40 97.82 89.17 98.29 88.05 98.94
λ = 8 87.59 97.04 89.29 97.48 88.06 98.35
λ = 12 87.50 96.68 89.39 96.67 88.13 97.83
λ = 16 87.62 96.35 89.30 96.69 88.07 97.79
λ = 20 87.53 93.92 89.21 94.51 88.17 95.60
EV (λ →∞) 87.32 86.84 88.58 86.69 87.19 86.45
Typical errors
min ±0.53 ±0.24 ±0.49 ±0.21 ±0.30 ±0.10
max ±0.58 ±0.63 ±0.54 ±0.54 ±0.33 ±0.32
Comparing the evaluation bands in Table 5.2 and Table 6.2 no signifi-
cant differences with respect to speech recognition can be observed. Speaker
identification, however, benefits from long-term speaker tracking even in the
Fig. 6.12 Comparison of supervised speaker adaptation with predefined speaker identity (black), joint speaker identification and speech recognition (dark gray) and long-term speaker tracking (light gray): (a) results for speech recognition (WA [%]), (b) results for speaker identification (ID [%]), each plotted over λ. Furthermore, the speaker independent baseline (BL) and short-term adaptation (ST) are shown. Figure is taken from [Herbig et al., 2010e].
learning phase given by the evaluation band I. This indicates that error prop-
agation in the first evaluation band negatively affected the results of the
implementations examined in the preceding chapter. Since this example has
been discussed for several implementations, the results for supervised speaker
adaptation, joint speaker identification and speech recognition and the com-
bination with long-term speaker tracking are finally compared in Fig. 6.12.
Both the WA and identification rate are depicted.
Unsupervised Speaker Identification of 10 Enrolled Speakers
As discussed so far, speaker sets of 5 enrolled speakers have been evaluated.
Even though 5 enrolled speakers seem to be sufficient for many use cases, e.g.
in-car applications, the robustness of the speech controlled system should
be investigated when the number of speakers is doubled. A new test set
comprising 30 speaker sets is used for evaluation.
First, the joint speaker identification and speech recognition is tested with-
out long-term speaker tracking to determine the performance decrease for
speaker identification and speech recognition. Then, long-term speaker track-
ing is activated to evaluate whether an increased error rate on single utter-
ances severely affects speaker tracking. The results are given in Table 6.3 and
Table 6.4, respectively.
When Table 5.1 and Table 6.3 are compared despite different test sets, it can be stated that the speaker identification rates drop significantly. However, even with 10 enrolled speakers an identification rate of about 90 % can be achieved. In both cases about 20 % relative error rate reduction can be obtained for speech recognition when λ = 4 is used for adaptation. Thus, the speech recognition accuracy of the unified approach appears to be relatively insensitive to moderate speaker identification errors as expected [Herbig et al., 2010d].
Table 6.3 Comparison of different adaptation techniques for self-learning speaker
identification without long-term speaker tracking. Speaker sets with 10 en-
rolled speakers are investigated. Table is taken from [Herbig et al., 2010d].
Speaker adaptation Rejected [%] WA [%] Speaker ID [%]
Baseline - 84.93 -
Short-term adaptation - 85.74 -
EV-MAP adaptation
ML (λ ≈ 0) 1.74 86.48 72.47
λ = 4 1.65 87.96 89.56
λ = 8 1.64 87.87 87.02
λ = 12 1.70 87.82 85.72
λ = 16 1.64 87.86 84.82
λ = 20 1.68 87.78 83.53
EV (λ →∞) 4.46 87.04 72.68
Typical errors
min ±0.23 ±0.22
max ±0.26 ±0.33
Table 6.4 Comparison of different adaptation techniques for self-learning speaker
identification with long-term speaker tracking. Speaker sets comprising 10 en-
rolled speakers are investigated. Table is taken from [Herbig et al., 2010d].
Speaker adaptation Rejected [%] WA [%] Speaker ID [%]
Baseline - 84.93 -
Short-term adaptation - 85.74 -
EV-MAP adaptation
ML (λ ≈ 0) 1.75 86.67 85.55
λ = 4 1.65 87.89 97.89
λ = 8 1.71 87.91 97.35
λ = 12 1.72 88.05 96.46
λ = 16 1.70 88.01 96.32
λ = 20 1.70 87.84 91.89
EV (λ →∞) 1.81 87.04 77.08
Typical errors
min ±0.23 ±0.10
max ±0.26 ±0.30
However, a comparable level of the speaker identification rates to former experiments is achieved when speaker tracking is active as shown in Table 6.4. Especially for larger speaker sets, a fast and robust information retrieval already in the learning phase is essential. With respect to Table 6.3 and Table 6.4 it becomes evident that an increased number of maladaptations also implies a less confident codebook selection which reduces the WA. Therefore long-term speaker tracking generally yields an advantage for the speaker identification and speech recognition rate.
For the considered use case ($N_{Sp} < 10$) it can be concluded that a speech controlled system has been realized which is relatively insensitive to the number of enrolled speakers or the tuning parameter of speaker adaptation. Remarkably high identification rates have been achieved.
Effects of Speaker Changes on the Speaker Identification Accuracy
The techniques for speech controlled systems examined here do not use an
explicit speaker change detection such as BIC. Instead, speaker changes are
tracked by speaker identification on different time scales since speaker turns
within an utterance are not considered. Nevertheless the question remains
how the speech recognizer with integrated speaker identification reacts to
speaker changes.
Codebooks are identified on a frame level for speaker specific speech recog-
nition. A rapid switch between the codebooks of the enrolled speakers is
possible if a clear mismatch between expected speaker characteristics and
observed data is detected. But feature extraction is also speaker specific and
is controlled on an utterance level. A reliable guess of the current speaker
identity is available as soon as the entire utterance is processed. Since only
moderate speaker change rates are expected, a parallel feature extraction
can be avoided. This technique bears the risk of not detecting speaker changes accurately since the parameter set of the previously detected speaker is used.
Subsequently, the vulnerability of speaker identification with respect to speaker changes is investigated. The prior probability of speaker changes $p_{Ch} = 0.1$ is rather small and is comparable to the error rate of speaker identification on an utterance level as given in Table 5.1. It is assumed that most of the identification errors occur after speaker changes. This question should be answered by certain points on the ROC curves which are shown in Fig. 6.13 and Fig. 6.14. The true positives, that means correctly detected speaker turns, are plotted versus the rate of falsely assumed speaker changes. Additional information about ROC curves can be found in Sect. A.3.
Fig. 6.13 and Fig. 6.14 are based on the results of Table 5.1 and Table 6.1, respectively. The performance of the target system either with or without long-term speaker tracking is depicted. Speaker change detection is given by the identification result of successive utterances. According to Fig. 6.13 and Fig. 6.14 it seems to be obvious that speaker changes are not a critical issue for the self-learning speaker identification. In fact, the evolution of the unsupervised system tends towards an even more robust system. But Fig. 6.13 and Fig. 6.14 also reveal that long-term speaker tracking only reduces the false alarm rate and even marginally increases the rate of missed speaker changes. The latter case can be explained by the fact that speaker changes are penalized by $p_{Ch}$ in the forward-backward algorithm. In addition, the first utterance after a speaker change uses improper parameters for feature extraction.
In addition, several experiments for speaker change detection as described
in Sect. 2.3 were conducted. Furthermore, the Arithmetic-Harmonic Spheric-
ity (AHS) measure as found by Bimbot et al. [1995]; Bimbot and Mathan


Fig. 6.13 Error rates of speaker change detection. The detection rate of speaker changes (correctly detected speaker turns) is plotted versus the false alarm rate (falsely detected speaker turns) for speaker identification on an utterance level, for (a) λ = 4, (b) λ = 8, (c) λ = 12 and (d) λ = 16. The evaluation bands I = [1; 50[, II = [50; 100[, III = [100; 250] and the complete test set (on average) are used. Confidence bands are given by a gray shading.
[1993] was examined to capture speaker turns. Unfortunately, no robust
speaker change detection could be achieved. The reason may be the interfer-
ence of speaker and channel characteristics as well as background noises. It
seems to be complicated to accurately detect speaker changes in adverse envi-
ronments using simple statistical models as already mentioned in Sect. 2.3.2.
Hence, only a prior probability for speaker changes was used for long-
term speaker tracking. Speaker change detection was delegated to speaker
identification.
6.6.2 Open-Set Speaker Identification
The closed-set scenario is now extended to an open-set scenario. As discussed so far, the first two utterances of a new speaker were indicated to belong to an unknown speaker as depicted in Fig. 4.1(a).


Fig. 6.14 Error rates of speaker change detection. The detection rate of speaker changes is plotted versus the false alarm rate for long-term speaker tracking, for (a) λ = 4, (b) λ = 8, (c) λ = 12 and (d) λ = 16. The evaluation bands I = [1; 50[, II = [50; 100[, III = [100; 250] and the complete test set (on average) are used. Confidence bands are given by a gray shading.
To obtain a strictly unsupervised speech controlled system, no information about the first occurrences of a new speaker is given by external means. If a known user is detected, the correct speaker profile has to be adjusted. Otherwise
a new speaker profile has to be initialized on the first few utterances and
continuously adapted on the subsequent utterances. The test setup is shown
in Fig. 4.1(b).
Subsequently, long-term speaker tracking is extended so that unknown
speakers can be detected as described in Sect. 6.4. The evaluation set com-
prises 5 speakers in each speaker set and is identical to the experiment re-
peatedly discussed before. To limit memory consumption and computational
load, only one out-of-set and at most six in-set speakers can be tracked.
If necessary one profile, e.g. the most recently initialized speaker model, is
replaced by a new one.
Table 6.5 contains the results measured on the complete test set. A more
detailed investigation of several evaluation bands is given in Table 6.6. Again,
Table 6.5 Comparison of different adaptation techniques for an open-set sce-
nario. Speaker sets with 5 speakers are examined. Table is taken from [Herbig
et al., 2011].
Speaker adaptation WA [%]
Baseline 85.23
Short-term adaptation 86.13
EV-MAP adaptation
ML (λ ≈ 0) 86.73
λ = 4 87.64
λ = 8 87.74
λ = 12 87.73
λ = 16 87.70
λ = 20 87.77
EV (λ →∞) 87.20
Typical errors
min ±0.23
max ±0.25
Table 6.6 Comparison of different adaptation techniques for an open-set sce-
nario. Speaker sets with 5 speakers are investigated. The speech recognition rate is
examined on evaluation bands comprising I = [1; 50[, II = [50; 100[, III = [100; 250]
utterances.
Speaker I II III
adaptation WA [%] WA [%] WA [%]
Baseline 84.51 86.84 85.02
Short-term 85.89 87.22 85.87
adaptation
EV-MAP
adaptation
ML (λ ≈ 0) 85.36 87.08 87.24
λ = 4 86.61 88.75 87.74
λ = 8 86.84 88.86 87.78
λ = 12 86.99 88.84 87.70
λ = 16 86.85 88.83 87.72
λ = 20 86.88 88.81 87.83
EV (λ →∞) 86.84 88.32 86.98
Typical errors
min ±0.54 ±0.50 ±0.30
max ±0.58 ±0.54 ±0.33
the combination of EV and ML estimates outperforms pure ML or EV es-
timates as well as the baseline and short-term adaptation without speaker
identification. The parameter λ can be selected in a relatively wide range
without negatively affecting speech recognition.
In summary, the achieved WA is smaller compared to closed-set experi-
ments but there is still a clear improvement with respect to the baseline and
short-term adaptation. Due to unsupervised speaker identification the recog-
nition rate was reduced from 88.87 % (Table 4.1) to 88.10 % WA (Table 5.1)
for λ = 4. The additional decrease to 87.64 % WA therefore demonstrates
how important the training of new speaker models based on the first two
utterances and the knowledge about the number of users seem to be for the
experiments described in the preceding sections.
The performance of the closed-set and open-set implementations is compared in Fig. 6.15. On the first evaluation band I a relative increase of 6.4 % can be observed for the error rate of the open-set implementation when λ = 4 is used for codebook adaptation. On the evaluation bands II and III, however, the open-set implementation approaches the closed-set realization. Here relative error rate increases of 4 % and 2.7 % can be measured. Comparing Table 5.2 with Table 6.6 it can be stated that this effect is not limited to λ = 4. This finding gives reason to assume long-term stability of the unsupervised system.


Fig. 6.15 Speech recognition is compared for supervised speaker adaptation with predefined speaker identity (black) and long-term speaker tracking for the closed-set (gray) and open-set (white) scenario, for the evaluation bands (a) I = [1; 50[, (b) II = [50; 100[ and (c) III = [100; 250] utterances. Improvements in speech recognition rate are evaluated with respect to the baseline (BL) and short-term adaptation (ST) since different subsets have been examined. Figure is taken from [Herbig et al., 2011].
6.7 Summary
The unified approach for speaker identification and speech recognition intro-
duced in the preceding chapter can be extended by an additional component
to appropriately track speakers on a series of utterances. Speaker adaptation
can be delayed until a more confident guess of the speaker identity is ob-
tained. The main aspect is to consider speaker profiles with respect to their
training and to provide prior knowledge about their expected performance.
A method has been developed to compute posterior probabilities reflecting
not only the value of the log-likelihood but also prior knowledge about the
evolution of speaker profiles. The performance of non-target speaker profiles
is included to calculate a probability for each enrolled speaker to be responsi-
ble for the current utterance. By using a simple HMM on an utterance level,
the decoding techniques known from speech recognition can be applied to
determine the most probable speaker alignment.
A significantly higher speaker identification rate compared to the first
realization of the target system was achieved for closed sets. An optimum
of 98.59 % correctly identified speakers was obtained for λ = 4 whereas an
identification rate of 94.64 % was observed without long-term tracking.
Long-term speaker tracking is essential to implement a speaker specific
speech recognizer operated in an unsupervised way. Since this technique can
be easily extended to open-set speaker tracking, new speaker profiles can be
initialized without any additional intervention of new users. No training or
explicit authentication of a new speaker is required.
Without long-term speaker tracking a notable number of codebooks is expected to be initialized even though an enrolled user is speaking, as the corresponding ROC curves in Sect. 5.4.2 suggest. When memory is limited, only a small number of codebooks can be considered. Only weakly or moderately trained speaker profiles would be effectively used for speech recognition if unknown speakers were not reliably detected. On the other hand missed speaker
changes would also negatively affect speech and speaker modeling leading to
significantly higher recognition error rates. Applying long-term speaker track-
ing can help to prevent a decrease of the speech recognizer’s performance. A
steady stabilization of the complete system was observed in the long run.
7 Summary and Conclusion
The goal of this book was to develop an unsupervised speech controlled sys-
tem which automatically adapts to several recurring users. They are allowed
to operate the system without the requirement to attend a training. Instead,
each user can directly use the speech controlled system without any con-
straints concerning vocabulary or the need to identify himself. Each utterance
contains additional information about a particular speaker. The system can
be incrementally personalized. Enrolled speakers have to be reliably iden-
tified to allow optimal speech decoding and continuous adjustment of the
corresponding statistical models. New speaker profiles are initialized when
unknown speakers are detected.
An implementation in an embedded system was intended and therefore
computational complexity and memory consumption were essential design
parameters. A compact representation with only one statistical model was
targeted and an extension of a common speech recognizer was preferred.
The discussion started with speech production represented by a simple yet
effective source-filter model. It defines the process of speech generation and
provides a first overview of the complexity of speech and speaker variabil-
ity and their dependencies. The acoustic environment, especially for in-car
applications, is characterized by varying background noises and channel char-
acteristics degrading automatic speech recognition and speaker identification.
Automated feature extraction was motivated by human hearing. It processes a continuous stream of audio data and extracts a feature vector representation for further pattern recognition. A standard technique for noise reduction was included.
Then several basic techniques for speaker change detection, speaker iden-
tification and speech recognition were discussed. They represent essential
components of the intended solution. Step by step more sophisticated tech-
niques and statistical models were introduced to capture speech and speaker
characteristics.
Speaker adaptation moderates mismatches between training and test by
adjusting the statistical models to unseen situations. Modeling accuracy
depends on the number of parameters which can be reliably estimated. Sev-
eral strategies were discussed suitable for fast and long-term adaptation de-
pending on the amount of speaker specific training data.
Feature vector enhancement was introduced as an alternative approach to
deal with speaker characteristics and environmental variations on a feature
level.
Then some dedicated realizations of complete systems for unsupervised
speaker identification or speech recognition were presented as found in the
literature. Several strategies were considered to handle speaker variabilities
and to establish more complex systems which enable an unsupervised speaker
modeling and tracking. The solutions differ in model complexity and the
mutual benefit between speech and speaker characteristics, for example. All
these implementations cover several aspects of this book.
This was the starting point for the fusion of the basic techniques of speaker
identification and speech recognition into one statistical model. A completely
unsupervised system was developed:
First, an appropriate strategy for speaker adaptation was discussed. It is suitable for both short-term and long-term adaptation since the effective number of parameters is dynamically adjusted. Even on limited data a fast and robust retrieval of speaker related information is achieved. The speech recognizer is enabled to initialize and continuously adapt speaker specific models. One important property, especially for embedded devices, is the moderate complexity.
The experiments carried out showed the gain of this approach compared to
strictly speaker independent implementations and speaker adaptation based
on a few utterances without speaker identification. The error rate of the
speech recognizer could be reduced by 25 % compared to the speaker indepen-
dent baseline. With respect to the implementation of short-term adaptation
a relative error rate reduction of 20 % was achieved.
Then a first realization of the target system was presented. Speaker spe-
cific codebooks can be used for speech recognition and speaker identification
leading to a compact representation of speech and speaker characteristics.
Multiple recognition steps are not required to generate a transcription and
to estimate the speaker identity. All components comprising speaker iden-
tification, speech recognition and speaker adaptation were integrated into a
speech controlled system.
In the experiments speech recognition accuracy could be increased by this
technique to 88.20 %. For comparison, the corresponding upper limit was
88.90 % WA when the speaker is known. The speaker independent baseline
only achieved 85.23 % WA. An optimum of 94.64 % identification rate could
be observed.
In addition, a reference system was implemented which integrates a stan-
dard technique for speaker identification. All implementations obtained lower
identification rates but similar speech recognition results in a realistic closed-
set scenario. Furthermore, it was obviously more complicated to reliably
detect unknown speakers compared to the unified approach of the target
system. For both systems unknown speakers appeared to be an unsolved
problem because of the unacceptably high error rates of missed and falsely
detected out-of-set speakers.
Speaker identification was therefore extended by long-term speaker track-
ing. Instead of likelihoods, posterior probabilities can be used to reflect not
only the match between model and observed data but also the training sit-
uation of each speaker specific model. Both weakly and extensively trained
models are treated appropriately due to prior knowledge about the evolution
of the system. Speaker tracking across several utterances was described by
HMMs based on speaker specific codebooks considered on an utterance level.
The forward-backward algorithm allows a more robust speaker identification
since a series of utterances is evaluated. Furthermore, additional devices such
as a beamformer can be easily integrated to support speaker change detection.
No structural modifications of the speech controlled system are required. In
addition, open-set scenarios can be handled. New codebooks can be initial-
ized without imposing new users to identify themselves. No time-consuming
enrollment is necessary.
The experiments for supervised adaptation and the first implementation
of integrated speaker identification and speech recognition were repeated.
Remarkable identification rates up to 98.59 % could be obtained in the closed-
set scenario. Nearly the same speech recognition rate was observed. In an
additional experiment the number of enrolled speakers was doubled. The
identification rate on an utterance level decreased from 94.64 % to 89.56 %.
However, the identification rate was raised again by speaker tracking so that
97.89 % of the enrolled speakers were correctly identified. Furthermore, speech
recognition accuracy could be improved for some realizations.
Finally, the former experiment was considered in an open-set scenario. For
speech recognition approximately 17 % and 12 % relative error rate reduction
compared to the baseline and short-term adaptation were measured. Since
the WA approached the results of the closed-set implementation in the long
run, even higher gains might be expected for more extensive tests. For closed
sets a benefit for speech recognition could only be observed when long-term
speaker tracking is applied to larger speaker sets. However, it is hard to
imagine that unknown speakers can be reliably detected without long-term
speaker tracking.
Potential for further improvements was seen in an enhanced feature extrac-
tion. If an optimal feature normalization can be guaranteed for the current
speaker, both the speaker identification and speech recognition can bene-
fit even on an utterance level. Under favorable conditions an optimum of
97.33 % correctly identified speakers and 88.61 % WA for speech recognition
were obtained.
It can be concluded that a compact representation of speech and speaker
characteristics was achieved which allows simultaneous speech recognition
and speaker identification. Obviously, an additional module for speaker
identification is not required. Only calculations of low complexity have to
be performed in real time since speaker tracking and adaptation are computed
after the utterance is finished. Therefore applications in embedded systems
become feasible. The approach presented here has potential for further ex-
tensions and may be applied in a wide range of scenarios as sketched in the
outlook.
8
Outlook
At the end of this book an outlook on prospective applications and extensions
of the techniques presented here is given. The outlook is intended to
emphasize the potential for many applications and to motivate further
research.
Adaptation
One important aspect of the speech controlled system presented in this book
is to capture speaker characteristics fast and accurately for robust speaker
identification. This task was accomplished by EV adaptation. To obtain op-
timal performance of EV adaptation, training and test conditions have to be
balanced. Especially for applications in varying environments, re-estimating
the eigenvoices may be applied to improve on the eigenspace given by the
PCA [Nguyen et al., 1999].
Furthermore, different adaptation schemes may be investigated with re-
spect to joint speaker identification and speech recognition as possible future
work. For example, MLLR adaptation, which is widely used in speaker adaptation,
e.g. Ferràs et al. [2007, 2008]; Gales and Woodland [1996]; Leggetter
and Woodland [1995b], may be integrated into the system. In particular,
the subspace approach presented by Zhu et al. [2010] may be a promising
candidate.
Usability
The discussion was focused on a speech controlled system triggered by a
push-to-talk button to simplify speech segmentation. A more convenient
human-computer communication may be realized by an unsupervised end-
point detection. Standard VAD techniques based on energy and fundamental
frequency are expected to have limited capabilities in adverse environments,
e.g. in automobiles.
Model-based speech segmentation may be implemented by similar ap-
proaches as speaker identification. For instance, GMMs of moderate com-
plexity can be used to model speech and non-speech events [Herbig et al.,
2008; Lee et al., 2004; Oonishi et al., 2010; Tsai et al., 2003]. A decision logic
similar to the implementation of speaker specific speech recognition may al-
low a simple decision whether speech is present.
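As a rough sketch of such a decision logic, two GMMs trained offline on speech and non-speech material could be compared frame by frame via their log-likelihood ratio. The following Python fragment is purely illustrative; the `log_likelihood` interface, the smoothing length and the threshold are assumptions, not the implementation described in this book:

```python
import numpy as np

def detect_speech(frames, speech_gmm, noise_gmm, threshold=0.0, context=5):
    """Frame-wise speech/non-speech decision via a smoothed log-likelihood ratio.

    frames: (T, d) feature vectors. speech_gmm / noise_gmm: hypothetical model
    objects whose log_likelihood(frames) returns per-frame log p(x_t|model).
    """
    llr = speech_gmm.log_likelihood(frames) - noise_gmm.log_likelihood(frames)
    kernel = np.ones(context) / context               # short moving average
    smoothed = np.convolve(llr, kernel, mode="same")  # suppress isolated outliers
    return smoothed > threshold                       # True where speech is assumed
```

In practice such a sketch would be complemented by a hangover scheme or a minimum segment length to avoid clipping speech onsets and offsets.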
Alternatively, this problem may be solved by an additional specialized
codebook of the speech recognizer to represent different background scenarios
during speech pauses. When a speech pause is detected, speech decoding can
be rejected. More sophisticated solutions comprising Viterbi time alignment
of all speaker models similar to the decoder-guided audio segmentation are
also possible.
Barge-in functionality can be incorporated into speech segmentation to
allow users to interrupt speech prompts. A more natural and efficient con-
trol of automated human-computer interfaces should be targeted. In the lit-
erature [Ittycheriah and Mammone, 1999; Ljolje and Goffin, 2007] similar
techniques exist which might be integrated. Crosstalk detection may be an
additional challenge [Song et al., 2010; Wrigley et al., 2005].
Development of the Unified Modeling of Speech and Speaker-Related
Characteristics
The solution presented in this book is based on a unified modeling of
speech and speaker characteristics. The speech recognizer’s codebooks are
considered as common GMMs with uniform weights. Since the likelihood val-
ues are buffered until the speech recognition result is available, the phoneme
alignment can be used not only to support speaker adaptation but also
speaker identification in a similar way. Since an estimate for the state se-
quence is accessible, a refined likelihood becomes feasible. The restriction of
equal weights can be avoided so that speaker identification can apply state
dependent GMMs for each frame which may result in a better statistical
representation.
Furthermore, some phoneme groups are well suited for speaker identifica-
tion whereas others do not significantly contribute to speaker discrimination
as outlined in Sect. 2.1 and Sect. 3.3. In combination with a phoneme align-
ment emphasis can be placed on vowels and nasals. The influence of fricatives
and plosives on speaker identification can be reduced, for instance. Together
this would result in a closer relationship between speech and speaker related
information.
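As an illustration, such a phoneme dependent weighting could be realized by scaling the per-frame log-likelihoods according to the broad phoneme class delivered by the alignment. The following sketch is purely hypothetical; the class weights and the interface are illustrative assumptions, not values from this book:

```python
import numpy as np

# Illustrative emphasis on vowels and nasals; fricatives and plosives are
# attenuated. These weights are arbitrary example values.
PHONE_WEIGHTS = {"vowel": 1.0, "nasal": 1.0, "fricative": 0.3, "plosive": 0.3}

def weighted_speaker_score(frame_loglikes, frame_classes):
    """Accumulate a speaker score with phoneme dependent weighting.

    frame_loglikes: (T,) array of log p(x_t|speaker model);
    frame_classes: length-T list of broad phoneme classes from the alignment.
    """
    w = np.array([PHONE_WEIGHTS.get(c, 0.5) for c in frame_classes])
    return float(np.sum(w * frame_loglikes) / np.sum(w))
```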
Enhancing Speaker Identification and Speech Recognition
Another method to increase the WA and identification rate is to employ a
short supervised enrollment. Significant improvements are expected when the
spoken text and the speaker identity are known for a couple of utterances.
The approach presented in this book also allows hybrid solutions. Each
user can attend a supervised enrollment of variable duration to initialize a
robust speaker profile. The speech recognizer is always run in an open-set
mode to protect these profiles against unknown speakers. If a new speaker
operates the device for the first time and decides not to attend an enrollment,
a new profile can be temporarily initialized and applied subsequently.
In addition, the concepts of the speaker adaptation scheme and long-term
speaker tracking presented here might be applied to the reference speaker
identification. Even though no essential improvement is expected for the WA,
an increase in the speaker identification rate might be achieved when both
identification techniques are combined.
Mobile Applications
Even though only in-car applications were evaluated in this book, mobile
applications can be considered as well. If a self-learning speaker identification
is used for speech controlled mobile devices, one important aspect should be
the problem of environmental variations.
The front-end of the system presented in this book reduces background
noises and compensates for channel characteristics by noise reduction and
cepstral mean subtraction. However, artifacts are still present and are learned
by speaker adaptation. Especially for mobile applications, e.g. a Personal
Digital Assistant (PDA), this can be undesirable. The focus should be set
here on the variety of background conditions, e.g. in-car, office, public places
and home. All scenarios have to be covered appropriately.
A simple realization would be to use several codebooks for each speaker
and environment. Alternatively, feature vector normalization or enhancement
may be combined with speaker adaptation. Effects from the acoustic back-
ground should be suppressed. In the optimal case only speaker characteristics
are incorporated into the statistical modeling. The latter approach seems to
be preferable since only one speaker specific codebook is needed and different
training levels of domain specific codebooks are circumvented.
Thus, combining feature vector enhancement and adaptation may be ad-
vantageous when speaker characteristics and environmental influences can
be separated. Since eigenvoices achieved good results for speaker adaptation,
compensation on a feature vector level implemented by the eigen-environment
approach could be a promising candidate. The identification of the current
background scenario might help to apply more sophisticated compensation
and speaker adaptation techniques.
Interaction
The use cases considered so far identify persons either by external
means [Herbig et al., 2010c], by an enrollment phase [Herbig et al., 2010d,e]
or, as in this book, without any external information [Herbig et al., 2011].
However, a wide range of human-computer communication may be of interest
for realistic applications:
A dialog engine may give information about the temporal or semantic
coherence of a user interaction. During an ongoing dialog a user change is
not expected. Feedback of the speaker identity allows special habits and the
experience of the user about the speech controlled device to be taken into
consideration. Whereas beginners can be offered a comprehensive introduction
or help concerning the handling of the device, advanced users can be
shown how to use the device more efficiently.
An acoustic preprocessing engine, e.g. in an infotainment or in-car com-
munication system [Schmidt and Haulick, 2006], may use beamforming on
a microphone array which can track the direction of the sound signal. For
example, driver and co-driver may enter voice commands in turn. This soft
information could be integrated to improve the robustness of speaker track-
ing. In addition, devices which are usually registered to a particular user, e.g.
car keys, mobile phones or PDAs, can support speaker identification. Visual
information may also be employed [Maragos et al., 2008].
In summary, a flexible framework has been developed which shows a high
potential for future applications and research.
A
Appendix
A.1 Expectation Maximization Algorithm
The Expectation Maximization (EM) algorithm [Dempster et al., 1977] is a
fundamental approach for mathematical optimization problems comprising
a hidden or latent variable. An optimal statistical representation of a train-
ing data set $x_{1:T}$ is targeted. A statistical model with parameter set $\Theta^{\mathrm{ML}}$ characterized by

$$\Theta^{\mathrm{ML}} = \arg\max_{\Theta}\,\{p(x_{1:T}|\Theta)\} \qquad (A.1)$$
has to be determined. An iterative procedure is required due to the latent
variable. In this section the meaning of the latent variable is explained for
GMM training which was introduced in Sect. 2.4.2. Then the two steps of
the EM algorithm are described.
Latent variables can be intuitively explained when a GMM is considered as
a random generator. GMMs can be explained by a combination of two random
processes as given by (2.35). For convenience, the first process is used to select
the Gaussian density with index k according to the prior probability p(k|Θ). A
second random generator produces a Gaussian distributed random vector $x_t$
characterized by the probability density function $p(x_t|k, \Theta)$. By repetition of
these two steps, a sequence of feature vectors is produced. The assignment
of the feature vector $x_t$ to the corresponding Gaussian density $p(x_t|k, \Theta)$
given by the first random generator is hidden for an observer. Therefore, the
index $k$ denotes the latent variable of GMM modeling.
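This generative view can be illustrated by a short sketch which draws feature vectors from a GMM by exactly these two random processes. The component parameters below are arbitrary example values, not taken from this book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary 2-component GMM in two dimensions (illustrative values only).
weights = np.array([0.4, 0.6])                  # prior probabilities p(k|Theta)
means = np.array([[0.0, 0.0], [3.0, 3.0]])      # component mean vectors
covs = np.array([np.eye(2), 0.5 * np.eye(2)])   # component covariance matrices

def sample_gmm(T):
    """Draw T feature vectors; the component index k is the latent variable."""
    ks = rng.choice(len(weights), size=T, p=weights)  # first random process
    x = np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])
    return x, ks  # an observer of x alone never sees ks

x, ks = sample_gmm(1000)
```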
GMM training or adaptation encounters the inverse problem since only the
measured feature vectors $x_{1:T}$ are available. The information about which Gaussian
density generated an observed feature vector is missing. Thus, training and
adaptation algorithms require knowledge or at least an estimate concerning
the corresponding Gaussian density.
Feature vectors x are subsequently called incomplete data since only the
observations of the second random process are accessible. In contrast, com-
plete data (x, k) comprise feature vector and latent variable.
Therefore, the likelihood of complete data $p(x_t, k|\Theta)$ instead of incomplete
data $p(x_t|\Theta)$ is employed in the EM algorithm. To simplify the following considerations, the iid assumption is used. The problem of parameter optimization is solved in two steps:
• The E-step computes an estimate of the missing index $k$ with the help of the posterior probability $p(k|x_t, \bar{\Theta})$ based on initial parameters or the parameter set of the last iteration $\bar{\Theta}$. This posterior probability represents the responsibility of a Gaussian density for the production of the observation $x_t$ and controls the effect on the training or adaptation of a particular Gaussian density.
• The M-step maximizes the likelihood function by calculating a new set of model parameters $\hat{\Theta}$ based on complete data given by the previous E-step.
E-step and M-step can be summarized for a sequence of training or adaptation data $x_{1:T}$ in a compact mathematical description which is known as the auxiliary function

$$Q^{\mathrm{ML}}(\Theta, \bar{\Theta}) = \mathrm{E}_{k_{1:T}}\{\log\big(p(x_{1:T}, k_{1:T}|\Theta)\big)\,|\,x_{1:T}, \bar{\Theta}\}. \qquad (A.2)$$
It can be calculated by

$$Q^{\mathrm{ML}}(\Theta, \bar{\Theta}) = \sum_{t=1}^{T}\sum_{k=1}^{N} p(k|x_t, \bar{\Theta})\cdot\log\big(p(x_t, k|\Theta)\big) \qquad (A.3)$$

where the new parameter set $\hat{\Theta}$ is characterized by

$$\frac{d}{d\Theta}\,Q^{\mathrm{ML}}(\Theta, \bar{\Theta})\,\Big|_{\Theta=\hat{\Theta}} = 0. \qquad (A.4)$$
An iterative procedure is required since the assignment of the feature
vectors $x_t$ to a particular Gaussian density may be different for an updated
parameter set $\hat{\Theta}$ compared to the preceding parameter set $\bar{\Theta}$. Several iterations of the EM algorithm are performed until an optimum of the likelihood $p(x_{1:T}|\Theta)$ is obtained. However, local and global maxima cannot be
distinguished.
For limited training data or in an adaptation scenario this might be inad-
equate. The risk of over-fitting arises since insignificant properties of the ob-
served data are learned instead of general statistical patterns. The problem of
over-fitting can be decreased when equation (A.3) is extended by prior knowl-
edge p(Θ) about the parameter set Θ. The corresponding auxiliary function
is then given by
$$Q^{\mathrm{MAP}}(\Theta, \bar{\Theta}) = Q^{\mathrm{ML}}(\Theta, \bar{\Theta}) + \log\big(p(\Theta)\big). \qquad (A.5)$$
For further reading on the EM algorithm, GMM and HMM training the
reader is referred to Bishop [2007]; Dempster et al. [1977]; Rabiner and Juang
[1993].
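As an illustration of the two steps, a minimal EM implementation for a GMM with diagonal covariance matrices could look as follows. This is a sketch only, not the codebook training used in this book; initialization and convergence checks are omitted:

```python
import numpy as np

def em_gmm_diag(x, w, mu, var, n_iter=10):
    """A few EM iterations for a diagonal-covariance GMM.

    x: (T, d) observations; w: (N,) weights; mu: (N, d) means; var: (N, d) variances.
    """
    T, d = x.shape
    for _ in range(n_iter):
        # E-step: responsibilities p(k|x_t, Theta_bar) for every frame and component
        log_p = (-0.5 * (((x[:, None, :] - mu) ** 2) / var).sum(-1)
                 - 0.5 * np.log(var).sum(-1)
                 - 0.5 * d * np.log(2 * np.pi)
                 + np.log(w))                       # (T, N) joint log p(x_t, k)
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stabilization
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)     # posterior p(k|x_t)
        # M-step: re-estimate parameters from the soft-assigned (complete) data
        n_k = post.sum(axis=0)                      # soft counts
        w = n_k / T
        mu = (post.T @ x) / n_k[:, None]
        var = (post.T @ (x ** 2)) / n_k[:, None] - mu ** 2
    return w, mu, var
```

The array `post` is exactly the responsibility $p(k|x_t, \bar{\Theta})$ of the E-step; the M-step re-estimates the parameters from these soft assignments.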
A.2 Bayesian Adaptation
In this section codebook adaptation is exemplified by adapting the mean
vectors in the Bayesian framework. The goal is an improved statistical rep-
resentation of speech characteristics and speaker specific pronunciation for
enhanced speech recognition. It is assumed that a set of speaker specific
training data $x_{1:T}$ is accessible.
A two-stage procedure is applied similar to the segmental MAP algorithm
introduced in Sect. 2.6.2. Speech decoding provides an optimal state sequence
and determines the state dependent weights $w_{s_k}$. Then the codebook of the
current speaker $i$ is adapted.
For convenience, the subsequent notation neglects the optimal state se-
quence to obtain a compact representation. In the following derivation, the
auxiliary function of the EM algorithm
$$Q^{\mathrm{MAP}}(\Theta_i, \Theta_0) = Q^{\mathrm{ML}}(\Theta_i, \Theta_0) + \log\big(p(\Theta_i)\big) \qquad (A.6)$$
$$= \sum_{t=1}^{T}\sum_{k=1}^{N} p(k|x_t, \Theta_0)\cdot\log\big(p(x_t, k|\Theta_i)\big) + \log\big(p(\Theta_i)\big) \qquad (A.7)$$
is extended by a term comprising prior knowledge as given in (A.3) and (A.5).
Only one iteration of the EM algorithm is calculated. The initial parameter set

$$\Theta_0 = \big\{w_1^0, \ldots, w_N^0,\; \mu_1^0, \ldots, \mu_N^0,\; \Sigma_1^0, \ldots, \Sigma_N^0\big\} \qquad (A.8)$$
is given by the standard codebook. Since only mean vectors are adapted, the
following notation is used for the speaker specific codebooks:

$$\Theta_i = \big\{w_1^0, \ldots, w_N^0,\; \mu_1^i, \ldots, \mu_N^i,\; \Sigma_1^0, \ldots, \Sigma_N^0\big\}. \qquad (A.9)$$
Subsequently, the speaker index $i$ is omitted. For reasons of simplicity, $\mu_k$ and
$\Sigma_k$ denote the speaker specific mean vectors to be optimized and the covariance matrices of the standard codebook.
For the following equations it is assumed that each Gaussian distribution
can be treated independently from the remaining distributions. Thus, the
prior distribution of the speaker specific mean vector $\mu_k$ can be factorized or
equivalently the logarithm is given by a sum of logarithms. A prior Gaussian
distribution is assumed

$$\log\big(p(\mu_1, \ldots, \mu_N)\big) = \sum_{k=1}^{N} \log\big(\mathcal{N}\{\mu_k|\mu_k^{\mathrm{pr}}, \Sigma_k^{\mathrm{pr}}\}\big) \qquad (A.10)$$
$$= -\frac{1}{2}\sum_{k=1}^{N} (\mu_k - \mu_k^{\mathrm{pr}})^T\cdot(\Sigma_k^{\mathrm{pr}})^{-1}\cdot(\mu_k - \mu_k^{\mathrm{pr}}) - \frac{1}{2}\sum_{k=1}^{N} \log\big(|\Sigma_k^{\mathrm{pr}}|\big) - \frac{d}{2}\sum_{k=1}^{N} \log(2\pi) \qquad (A.11)$$

where the covariance matrix $\Sigma_k^{\mathrm{pr}}$ represents the uncertainty of the adaptation.
$\mu_k^{\mathrm{pr}}$ may be given by the mean vectors of the standard codebook, for example.
$Q^{\mathrm{MAP}}$ is maximized by taking the derivative with respect to the mean
vector $\mu_k$ and by calculating the corresponding roots. Subsequently, equation (A.11) is inserted into (A.7) and the derivative

$$\frac{d}{d\mu_k}\,Q^{\mathrm{MAP}}(\Theta, \Theta_0) = \sum_{t=1}^{T} p(k|x_t, \Theta_0)\cdot\frac{d}{d\mu_k}\log\big(p(x_t, k|\Theta)\big) + \frac{d}{d\mu_k}\log\big(p(\Theta)\big) \qquad (A.12)$$

is calculated for a particular Gaussian density $k$.
The derivative

$$\frac{d}{d\mu_k}\log\big(p(\Theta)\big) = \frac{d}{d\mu_k}\log\big(p(\mu_1, \ldots, \mu_N)\big) \qquad (A.13)$$

removes the constant terms in (A.11) comprising normalization and the determinant of the covariance matrix. Only the derivative of the squared Mahalanobis distance

$$\bar{d}_k^{\mathrm{Mahal}} = (\mu_k - \mu_k^{\mathrm{pr}})^T\cdot(\Sigma_k^{\mathrm{pr}})^{-1}\cdot(\mu_k - \mu_k^{\mathrm{pr}}) \qquad (A.14)$$

is retained in

$$\frac{d}{d\mu_k}\log\big(p(\mu_1, \ldots, \mu_N)\big) = -\frac{1}{2}\cdot\frac{d}{d\mu_k}\,\bar{d}_k^{\mathrm{Mahal}} \qquad (A.15)$$
$$= -(\Sigma_k^{\mathrm{pr}})^{-1}\cdot(\mu_k - \mu_k^{\mathrm{pr}}) \qquad (A.16)$$

according to (2.113).
Since the weights are given by the standard codebook, the derivative

$$\frac{d}{d\mu_k}\,Q^{\mathrm{ML}}(\Theta, \Theta_0) = \sum_{t=1}^{T} p(k|x_t, \Theta_0)\cdot\frac{d}{d\mu_k}\log\big(\mathcal{N}\{x_t|\mu_k, \Sigma_k\}\big) \qquad (A.17)$$
$$= -\frac{1}{2}\sum_{t=1}^{T} p(k|x_t, \Theta_0)\cdot\frac{d}{d\mu_k}\,d_k^{\mathrm{Mahal}} \qquad (A.18)$$
only depends on the squared Mahalanobis distance

$$d_k^{\mathrm{Mahal}} = (x_t - \mu_k)^T\cdot\Sigma_k^{-1}\cdot(x_t - \mu_k) \qquad (A.19)$$

and the posterior probability $p(k|x_t, \Theta_0)$. The derivative of $Q^{\mathrm{ML}}$ is given by
$$\frac{d}{d\mu_k}\,Q^{\mathrm{ML}}(\Theta, \Theta_0) = \sum_{t=1}^{T} p(k|x_t, \Theta_0)\cdot\Sigma_k^{-1}\cdot(x_t - \mu_k). \qquad (A.20)$$
By using (A.20) and (A.16) in (A.12) a compact notation of the optimization problem can be obtained:

$$\frac{d}{d\mu_k}\,Q^{\mathrm{MAP}}(\Theta, \Theta_0) = \sum_{t=1}^{T} p(k|x_t, \Theta_0)\cdot\Sigma_k^{-1}\cdot(x_t - \mu_k) - (\Sigma_k^{\mathrm{pr}})^{-1}\cdot(\mu_k - \mu_k^{\mathrm{pr}}). \qquad (A.21)$$
The mean vector $\mu_k$ and the covariance matrix $\Sigma_k^{-1}$ of the standard codebook
are time invariant and can be placed in front of the sum

$$\frac{d}{d\mu_k}\,Q^{\mathrm{MAP}}(\Theta, \Theta_0) = \Sigma_k^{-1}\cdot\left[\sum_{t=1}^{T} p(k|x_t, \Theta_0)\cdot x_t - n_k\cdot\mu_k\right] - (\Sigma_k^{\mathrm{pr}})^{-1}\cdot(\mu_k - \mu_k^{\mathrm{pr}}) \qquad (A.22)$$

where

$$n_k = \sum_{t=1}^{T} p(k|x_t, \Theta_0). \qquad (A.23)$$
The normalized sum over all observed feature vectors weighted by the posterior probability is defined in (2.43) as the ML estimate

$$\mu_k^{\mathrm{ML}} = \frac{1}{n_k}\sum_{t=1}^{T} p(k|x_t, \Theta_0)\cdot x_t. \qquad (A.24)$$
The optimization problem

$$\frac{d}{d\mu_k}\,Q^{\mathrm{MAP}}(\Theta, \Theta_0)\,\Big|_{\mu_k=\mu_k^{\mathrm{opt}}} = 0 \qquad (A.25)$$

is solved by

$$n_k\cdot\Sigma_k^{-1}\cdot\big(\mu_k^{\mathrm{ML}} - \mu_k^{\mathrm{opt}}\big) = (\Sigma_k^{\mathrm{pr}})^{-1}\cdot\big(\mu_k^{\mathrm{opt}} - \mu_k^{\mathrm{pr}}\big) \qquad (A.26)$$

which is known as Bayesian adaptation [Duda et al., 2001].
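Solving (A.26) for $\mu_k^{\mathrm{opt}}$ yields a precision-weighted interpolation between the ML estimate and the prior mean. A short numerical sketch is given below; it is illustrative only, and the variable names are not taken from the implementation described in this book:

```python
import numpy as np

def map_adapt_mean(n_k, mu_ml, cov_k, mu_pr, cov_pr):
    """Solve (A.26) for the adapted mean of one Gaussian density.

    n_k: soft count of the density (A.23); mu_ml: ML mean estimate (A.24);
    cov_k: covariance of the standard codebook; mu_pr, cov_pr: prior mean
    and prior covariance expressing the uncertainty of the adaptation.
    """
    prec_lik = n_k * np.linalg.inv(cov_k)   # precision contributed by the data
    prec_pr = np.linalg.inv(cov_pr)         # precision of the prior
    # Interpolation between ML estimate and prior mean, weighted by precisions:
    return np.linalg.solve(prec_lik + prec_pr,
                           prec_lik @ mu_ml + prec_pr @ mu_pr)
```

For large soft counts $n_k$ the data term dominates and the adapted mean approaches $\mu_k^{\mathrm{ML}}$; for sparse data it stays close to the prior mean, which is the desired behavior of MAP adaptation.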
A.3 Evaluation Measures
Speech recognition is evaluated in this book by the so-called Word Accu-
racy (WA) which is widely accepted in the speech recognition literature [Boros
et al., 1996]. The recognized word sequence is compared with a reference
string to determine the word accuracy
$$\mathrm{WA} = 100\cdot\left(1 - \frac{W_S + W_I + W_D}{W}\right)\%. \qquad (A.27)$$
$W$ represents the number of words in the reference string and $W_S$, $W_I$ and $W_D$
denote the number of substituted, inserted and deleted words in the recognized string [Boros et al., 1996]. Alternatively, the Word Error Rate (WER)
defined by

$$\mathrm{WER} = 100\cdot\left(\frac{W_S + W_I + W_D}{W}\right)\% \qquad (A.28)$$
can be employed [Bisani and Ney, 2004].
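For illustration, the error counts in (A.27) and (A.28) can be obtained from a standard Levenshtein alignment of the recognized word string against the reference. The following is a minimal sketch, not the evaluation tool used for the experiments in this book:

```python
def word_accuracy(reference, recognized):
    """Compute WA according to (A.27) from two lists of words."""
    W, R = len(reference), len(recognized)
    # cost[i][j]: minimal number of edits aligning reference[:i], recognized[:j]
    cost = [[0] * (R + 1) for _ in range(W + 1)]
    for i in range(W + 1):
        cost[i][0] = i                      # i deletions
    for j in range(R + 1):
        cost[0][j] = j                      # j insertions
    for i in range(1, W + 1):
        for j in range(1, R + 1):
            sub = cost[i - 1][j - 1] + (reference[i - 1] != recognized[j - 1])
            cost[i][j] = min(sub,                  # substitution or match
                             cost[i - 1][j] + 1,   # deletion
                             cost[i][j - 1] + 1)   # insertion
    # W_S + W_I + W_D is the total edit distance cost[W][R]
    return 100.0 * (1.0 - cost[W][R] / W)

print(word_accuracy("please call home".split(), "please call the hone".split()))
# one insertion ("the") and one substitution ("hone"): WA = 100 * (1 - 2/3)
```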
Speaker identification is evaluated by the identification rate $\zeta$ which is
approximated by the ratio

$$\hat{\zeta} = \frac{\sum_{u=1}^{N_u} \delta_K\big(i_u^{\mathrm{MAP}}, i_u\big)}{N_u} \qquad (A.29)$$
of correctly assigned utterances and the total number of utterances. In this
book, only those utterances are investigated which lead to a speech recogni-
tion result.
In addition, a confidence interval is given for identification rates¹ to reflect
the statistical uncertainty. The uncertainty is caused by the limited number of
utterances incorporated into the estimation of the identification rate. Therefore, the interval is determined in which the true identification rate $\zeta$ has to
be expected with a predetermined probability $1-\alpha$ given an experimental
mean estimate $\hat{\zeta}$ based on $N_u$ utterances:

$$p\left(z_{1-\frac{\alpha}{2}} < \frac{(\hat{\zeta}-\zeta)\cdot\sqrt{N_u}}{\hat{\sigma}} \le z_{\frac{\alpha}{2}}\right) = 1-\alpha. \qquad (A.30)$$

$z$ is the cut-off of the normalized random variable and $\hat{\sigma}$ denotes the standard
deviation of the applied estimator. Subsequently, the Gaussian approximation
is used to compute the confidence interval

$$\hat{\zeta} - \frac{\hat{\sigma}\cdot t_{\frac{\alpha}{2}}}{\sqrt{N_u}} < \zeta \le \hat{\zeta} + \frac{\hat{\sigma}\cdot t_{\frac{\alpha}{2}}}{\sqrt{N_u}} \qquad (A.31)$$

of a Binomial distribution as found by Kerekes [2008]. The standard deviation $\hat{\sigma}$ is given by $\hat{\sigma}^2 = \hat{\zeta}\cdot(1-\hat{\zeta})$. $t_{\frac{\alpha}{2}}$ is characterized by the Student
distribution. In this context $\alpha = 0.05$ and $t_{\frac{\alpha}{2}} = 1.96$ are used. In the figures
of this book confidence intervals are given by a gray shading.

¹ Even though the WA is a rate and not a probability [Bisani and Ney, 2004],
equation (A.31) is equally employed in this book to indicate the reliability of the
speech recognition rate.
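A direct numerical sketch of (A.31) with the values used here ($\hat{\sigma}^2 = \hat{\zeta}(1-\hat{\zeta})$, $t_{\alpha/2} = 1.96$) could look as follows; the utterance count in the example call is an arbitrary assumption:

```python
import math

def identification_confidence(zeta_hat, n_utterances, t=1.96):
    """Gaussian-approximation confidence interval (A.31) for a rate zeta_hat."""
    sigma = math.sqrt(zeta_hat * (1.0 - zeta_hat))  # sigma^2 = zeta(1 - zeta)
    delta = sigma * t / math.sqrt(n_utterances)
    return zeta_hat - delta, zeta_hat + delta

# e.g. a 94.64 % identification rate estimated on (hypothetically) 2000 utterances:
low, high = identification_confidence(0.9464, 2000)
```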
Frequently, several realizations of an automated speech recognizer are eval-
uated and compared for certain values of a tuning parameter, e.g. in Table 4.1.
Those tests are usually performed on identical data sets to avoid uncertainties
caused by independent data. In this case paired difference tests are appropri-
ate to determine whether a significant difference between two experiments can
be assumed for a particular data set [Bisani and Ney, 2004; Bortz, 2005]. For
example, the typical error of the results shown in Table 4.1 is about ±0.2 %
for independent datasets and about ±0.03 % for this specific evaluation. In
this book only the typical errors of independent datasets are given for reasons
of simplicity.
The problem of speaker change detection is characterized by a binary decision problem. $H_0$ and $H_1$ denote the hypotheses of no speaker change and
a speaker change, respectively. The optimal classifier is given in (2.14). The
following two error scenarios determine the performance of the binary classifier:

$p(H_1|i_{u-1} = i_u)$ false alarm,
$p(H_0|i_{u-1} \neq i_u)$ missed speaker change.
Alternatively, the accuracy of a binary classifier can be graphically represented for varying thresholds by the Receiver Operating Characteristic (ROC) curve². The true positives (which are complementary to the miss
rate) are plotted versus the false alarm rate independently from the prior
probability of a speaker change.
For details on confidence measures and ROC curves the reader is referred to Kerekes
[2008]; Macskassy and Provost [2004].

² Detection Error Trade-off (DET) curves can be equivalently used to graphically
represent both error types of a binary classifier [Martin et al., 1997].
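For illustration, the points of an ROC curve for a threshold-based speaker change detector can be traced by sweeping the decision threshold over the observed scores. In the sketch below, `scores` and `labels` are hypothetical detector outputs and ground-truth change flags:

```python
import numpy as np

def roc_points(scores, labels):
    """True-positive rate vs. false-alarm rate for all decision thresholds.

    scores: (n,) detector outputs, larger values favoring H_1 (speaker change);
    labels: (n,) booleans, True where a speaker change actually occurred.
    """
    order = np.argsort(-scores)              # sweep threshold from high to low
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                   # correctly detected changes
    fp = np.cumsum(~labels)                  # false alarms
    tpr = tp / max(labels.sum(), 1)          # 1 - miss rate
    far = fp / max((~labels).sum(), 1)
    return far, tpr
```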
References
Ajmera, J., McCowan, I., Bourlard, H.: Robust speaker change detection. IEEE
Signal Processing Letters 11(8), 649–651 (2004)
Angkititrakul, P., Hansen, J.H.L.: Discriminative in-set/out-of-set speaker recog-
nition. IEEE Transactions on Audio, Speech, and Language Processing 15(2),
498–508 (2007)
Barras, C., Gauvain, J.-L.: Feature and score normalization for speaker verification
of cellular data. In: IEEE International Conference on Acoustics, Speech, and
Signal Processing, ICASSP 2003, vol. 2, pp. 49–52 (2003)
Benesty, J., Makino, S., Chen, J. (eds.): Speech Enhancement. Signals and Com-
munication Technology. Springer, Heidelberg (2005)
Bennett, P.N.: Using asymmetric distributions to improve text classifier probability
estimates. In: Proceedings of the 26th Annual International ACM SIGIR Con-
ference on Research and Development in Information Retrieval, SIGIR 2003, pp.
111–118 (2003)
Bimbot, F., Magrin-Chagnolleau, I., Mathan, L.: Second-order statistical measures
for text-independent speaker identification. Speech Communication 17(1-2), 177–
192 (1995)
Bimbot, F., Mathan, L.: Text-free speaker recognition using an arithmetic-harmonic
sphericity measure. In: EUROSPEECH 1993, pp. 169–172 (1993)
Bisani, M., Ney, H.: Bootstrap estimates for confidence intervals in ASR perfor-
mance evaluation. In: IEEE International Conference on Acoustics, Speech, and
Signal Processing, ICASSP 2004, vol. 1, pp. 409–412 (2004)
Bishop, C.M.: Neural Networks for Pattern Recognition, 1st edn. Oxford University
Press, Oxford (1996)
Bishop, C.M.: Pattern Recognition and Machine Learning, 1st edn. Springer, New
York (2007)
Boros, M., Eckert, W., Gallwitz, F., Görz, G., Hanrieder, G., Niemann, H.: Towards understanding spontaneous speech: word accuracy vs. concept accuracy.
In: International Conference on Spoken Language Processing, ICSLP 1996,
vol. 2, pp. 1009–1012 (1996)
Bortz, J.: Statistik: Für Human- und Sozialwissenschaftler, 6th edn. Springer-Lehrbuch. Springer, Heidelberg (2005) (in German)
Botterweck, H.: Anisotropic MAP defined by eigenvoices for large vocabulary con-
tinuous speech recognition. In: IEEE International Conference on Acoustics,
Speech, and Signal Processing, ICASSP 2001, vol. 1, pp. 353–356 (2001)
Britanak, V., Yip, P., Rao, K.R.: Discrete Cosine and Sine Transforms: General
Properties, Fast Algorithms and Integer Approximations. Academic Press, Lon-
don (2006)
Bronstein, I.N., Semendjajew, K.A., Musiol, G., Muehlig, H.: Taschenbuch der
Mathematik. Verlag Harri Deutsch, Frankfurt am Main (2000) (in German)
Buera, L., Lleida, E., Miguel, A., Ortega, A., Saz, Ó.: Cepstral vector normalization
based on stereo data for robust speech recognition. IEEE Transactions on Audio,
Speech and Language Processing 15(3), 1098–1113 (2007)
Buera, L., Lleida, E., Rosas, J.D., Villalba, J., Miguel, A., Ortega, A., Saz,
Ó.: Speaker verification and identification using phoneme dependent multi-
environment models based linear normalization in adverse and dynamic acoustic
environments. In: Summer School for Advanced Studies on Biometrics for Secure
Authentication: Multimodality and System Integration (2005)
Campbell, J.P.: Speaker recognition - a tutorial. Proceedings of the IEEE 85(9),
1437–1462 (1997)
Cappé, O.: Elimination of the musical noise phenomenon with the Ephraim and
Malah noise suppressor. IEEE Transactions on Speech and Audio Process-
ing 2(2), 345–349 (1994)
Che, C., Lin, Q., Yuk, D.-S.: An HMM approach to text-prompted speaker ver-
ification. In: IEEE International Conference on Acoustics, Speech, and Signal
Processing, ICASSP 1996, vol. 2, pp. 673–676 (1996)
Chen, S.S., Gopalakrishnan, P.S.: Speaker, environment and channel change de-
tection and clustering via the Bayesian information criterion. In: Proceedings of
the DARPA Broadcast News Transcription and Understanding Workshop, pp.
127–132 (1998)
Cheng, S.-S., Wang, H.-M.: A sequential metric-based audio segmentation method
via the Bayesian information criterion. In: EUROSPEECH 2003, pp. 945–948
(2003)
Class, F., Haiber, U., Kaltenmeier, A.: Automatic detection of change in
speaker in speaker adaptive speech recognition systems. US Patent Application
2003/0187645 A1 (2003)
Class, F., Kaltenmeier, A., Regel-Brietzmann, P.: Optimization of an HMM - based
continuous speech recognizer. In: EUROSPEECH 1993, pp. 803–806 (1993)
Class, F., Kaltenmeier, A., Regel-Brietzmann, P.: Speech recognition method with
adaptation of the speech characteristics. European Patent EP0586996 (1994)
Cohen, I.: Noise spectrum estimation in adverse environments: Improved minima
controlled recursive averaging. IEEE Transactions on Speech and Audio Process-
ing 11(5), 466–475 (2003)
Cohen, I., Berdugo, B.: Noise estimation by minima controlled recursive averaging
for robust speech enhancement. IEEE Signal Processing Letters 9(1), 12–15 (2002)
Davis, S.B., Mermelstein, P.: Comparison of parametric representations for mono-
syllabic word recognition in continuously spoken sentences. IEEE Transactions
on Acoustics, Speech, and Signal Processing 28(4), 357–366 (1980)
Delakis, M., Gravier, G., Gros, P.: Stochastic models for multimodal video analysis.
In: Maragos, P., Potamianos, A., Gros, P. (eds.) Multimodal Processing and
Interaction: Audio, Video, Text, 1st edn. Multimedia Systems and Applications,
vol. 33, pp. 91–109. Springer, New York (2008)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society Series
B 39(1), 1–38 (1977)
Dobler, S., Rühl, H.-W.: Speaker adaptation for telephone based speech dialogue
systems. In: EUROSPEECH 1995, pp. 1139–1143 (1995)
Droppo, J., Acero, A.: Maximum mutual information SPLICE transform for seen
and unseen conditions. In: INTERSPEECH 2005, pp. 989–992 (2005)
Droppo, J., Deng, L., Acero, A.: Evaluation of the SPLICE algorithm on the Au-
rora2 database. In: EUROSPEECH 2001, pp. 217–220 (2001)
Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley Inter-
science, New York (2001)
Eatock, J.P., Mason, J.S.: A quantitative assessment of the relative speaker discrim-
inating properties of phonemes. In: IEEE International Conference on Acoustics,
Speech, and Signal Processing, ICASSP 1994, vol. 1, pp. 133–136 (1994)
Ephraim, Y., Malah, D.: Speech enhancement using a minimum mean-square er-
ror short-time spectral amplitude estimator. IEEE Transactions on Acoustics,
Speech, and Signal Processing ASSP-32(6), 1109–1121 (1984)
Espi, M., Miyabe, S., Nishimoto, T., Ono, N., Sagayama, S.: Analysis on speech
characteristics for robust voice activity detection. In: IEEE Workshop on Spoken
Language Technology, SLT 2010, pp. 139–144 (2010)
Faltlhauser, R., Ruske, G.: Improving speaker recognition using phonetically struc-
tured Gaussian mixture models. In: EUROSPEECH 2001, pp. 751–754 (2001)
Fant, G.: Acoustic Theory of Speech Production. Mouton de Gruyter, The Hague
(1960)
Felippa, C.A.: Introduction to finite element methods (2004)
Ferràs, M., Leung, C.C., Barras, C., Gauvain, J.-L.: Constrained MLLR for speaker
recognition. In: IEEE International Conference on Acoustics, Speech, and Signal
Processing, ICASSP 2007, vol. 4, pp. 53–56 (2007)
Ferràs, M., Leung, C.C., Barras, C., Gauvain, J.-L.: MLLR techniques for speaker
recognition. In: The Speaker and Language Recognition Workshop, Odyssey
2008, pp. 21–24 (2008)
Fink, G.A.: Mustererkennung mit Markov-Modellen: Theorie-Praxis-Anwendungs-
gebiete. Leitfäden der Informatik. B. G. Teubner, Stuttgart (2003) (in German)
Forney, G.D.: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973)
Fortuna, J., Sivakumaran, P., Ariyaeeinia, A., Malegaonkar, A.: Open-set speaker
identification using adapted Gaussian mixture models. In: INTERSPEECH 2005,
pp. 1997–2000 (2005)
Furui, S.: Selected topics from 40 years of research in speech and speaker recogni-
tion. In: INTERSPEECH 2009, pp. 1–8 (2009)
Gales, M.J.F., Woodland, P.C.: Mean and variance adaptation within the MLLR
framework. Computer Speech and Language 10, 249–264 (1996)
Garau, G., Renals, S., Hain, T.: Applying vocal tract length normalization to meet-
ing recordings. In: INTERSPEECH 2005, pp. 265–268 (2005)
Gauvain, J.-L., Lamel, L.F., Prouts, B.: Experiments with speaker verification over
the telephone. In: EUROSPEECH 1995, pp. 651–654 (1995)
Gauvain, J.-L., Lee, C.-H.: Maximum a posteriori estimation for multivariate Gaus-
sian mixture observations of Markov chains. IEEE Transactions on Speech and
Audio Processing 2(2), 291–298 (1994)
Geiger, J., Wallhoff, F., Rigoll, G.: GMM-UBM based open-set online speaker di-
arization. In: INTERSPEECH 2010, pp. 2330–2333 (2010)
Genoud, D., Ellis, D., Morgan, N.: Combined speech and speaker recognition with
speaker-adapted connectionist models. In: IEEE Workshop on Automatic Speech
Recognition and Understanding, ASRU 1999 (1999)
Gish, H., Schmidt, M.: Text-independent speaker identification. IEEE Signal Pro-
cessing Magazine 11(4), 18–32 (1994)
Godin, K.W., Hansen, J.H.L.: Session variability contrasts in the MARP corpus.
In: INTERSPEECH 2010, pp. 298–301 (2010)
Goldenberg, R., Cohen, A., Shallom, I.: The Lombard effect’s influence on auto-
matic speaker verification systems and methods for its compensation. In: Inter-
national Conference on Information Technology: Research and Education, ITRE
2006, pp. 233–237 (2006)
Gollan, C., Bacchiani, M.: Confidence scores for acoustic model adaptation. In:
IEEE International Conference on Acoustics, Speech, and Signal Processing,
ICASSP 2008, pp. 4289–4292 (2008)
Gutman, D., Bistritz, Y.: Speaker verification using phoneme-adapted Gaussian
mixture models. In: The XI European Signal Processing Conference, EUSIPCO
2002, vol. 3, pp. 85–88 (2002)
Häb-Umbach, R.: Investigations on inter-speaker variability in the feature space.
In: IEEE International Conference on Acoustics, Speech, and Signal Processing,
ICASSP 1999, vol. 1, pp. 397–400 (1999)
Hain, T., Johnson, S.E., Tuerk, A., Woodland, P.C., Young, S.J.: Segment gener-
ation and clustering in the HTK broadcast news transcription system. In: Pro-
ceedings of the Broadcast News Transcription and Understanding Workshop, pp.
133–137 (1998)
Hänsler, E.: Statistische Signale: Grundlagen und Anwendungen, 3rd edn. Springer,
Heidelberg (2001) (in German)
Hänsler, E., Schmidt, G.: Acoustic Echo and Noise Control: A Practical Approach.
Wiley-IEEE Press, Hoboken (2004)
Hänsler, E., Schmidt, G. (eds.): Speech and Audio Processing in Adverse Environ-
ments. Signals and Communication Technology. Springer, Heidelberg (2008)
Harrag, A., Mohamadi, T., Serignat, J.F.: LDA combination of pitch and MFCC
features in speaker recognition. In: IEEE Indicon Conference, pp. 237–240 (2005)
Hautamäki, V., Kinnunen, T., Nosratighods, M., Lee, K.A., Ma, B., Li, H.: Ap-
proaching human listener accuracy with modern speaker verification. In: IN-
TERSPEECH 2010, pp. 1473–1476 (2010)
Herbig, T., Gaupp, O., Gerl, F.: Modellbasierte Sprachsegmentierung. In: 34.
Jahrestagung für Akustik, DAGA 2008, pp. 259–260 (2008) (in German)
Herbig, T., Gerl, F.: Joint speaker identification and speech recognition for speech
controlled applications in an automotive environment. In: International Confer-
ence on Acoustics, NAG-DAGA 2009, pp. 74–77 (2009)
Herbig, T., Gerl, F., Minker, W.: Detection of unknown speakers in an unsupervised
speech controlled system. In: Lee, G.G., Mariani, J., Minker, W., Nakamura, S.
(eds.) IWSDS 2010. LNCS, vol. 6392, pp. 25–35. Springer, Heidelberg (2010a)
Herbig, T., Gerl, F., Minker, W.: Evaluation of two approaches for speaker specific
speech recognition. In: Lee, G.G., Mariani, J., Minker, W., Nakamura, S. (eds.)
IWSDS 2010. LNCS, vol. 6392, pp. 36–47. Springer, Heidelberg (2010b)
Herbig, T., Gerl, F., Minker, W.: Fast adaptation of speech and speaker characteris-
tics for enhanced speech recognition in adverse intelligent environments. In: The
6th International Conference on Intelligent Environments, IE 2010, pp. 100–105
(2010c)
Herbig, T., Gerl, F., Minker, W.: Simultaneous speech recognition and speaker
identification. In: IEEE Workshop on Spoken Language Technology, SLT 2010,
pp. 206–210 (2010d)
Herbig, T., Gerl, F., Minker, W.: Speaker tracking in an unsupervised speech con-
trolled system. In: INTERSPEECH 2010, pp. 2666–2669 (2010e)
Herbig, T., Gerl, F., Minker, W.: Evolution of an adaptive unsupervised speech
controlled system. In: IEEE Workshop on Evolving and Adaptive Intelligent
Systems, EAIS 2011, pp. 163–169 (2011)
Huang, C.-H., Chien, J.-T., Wang, H.-M.: A new eigenvoice approach to speaker
adaptation. In: The 4th International Symposium on Chinese Spoken Language
Processing, ISCSLP 2004, pp. 109–112 (2004)
Huang, X.D., Jack, M.A.: Unified techniques for vector quantization and hidden
Markov modeling using semi-continuous models. In: IEEE International Con-
ference on Acoustics, Speech, and Signal Processing, ICASSP 1989, vol. 1, pp.
639–642 (1989)
Iskra, D., Grosskopf, B., Marasek, K., van den Heuvel, H., Diehl, F., Kiessling, A.:
SPEECON - speech databases for consumer devices: Database specification and
validation. In: Proceedings of the Third International Conference on Language
Resources and Evaluation, LREC 2002, pp. 329–333 (2002)
Ittycheriah, A., Mammone, R.J.: Detecting user speech in barge-in over prompts us-
ing speaker identification methods. In: EUROSPEECH 1999, pp. 327–330 (1999)
Johnson, S.E.: Who spoke when? - Automatic segmentation and clustering for de-
termining speaker turns. In: EUROSPEECH 1999, vol. 5, pp. 2211–2214 (1999)
Jolliffe, I.T.: Principal Component Analysis, 2nd edn. Springer Series in Statistics.
Springer, Heidelberg (2002)
Jon, E., Kim, D.K., Kim, N.S.: EMAP-based speaker adaptation with robust corre-
lation estimation. In: IEEE International Conference on Acoustics, Speech, and
Signal Processing, ICASSP 2001, vol. 1, pp. 321–324 (2001)
Junqua, J.-C.: The influence of acoustics on speech production: A noise-induced
stress phenomenon known as the Lombard reflex. Speech Communication 20(1-
2), 13–22 (1996)
Junqua, J.-C.: Robust Speech Recognition in Embedded Systems and PC Applica-
tions. Kluwer Academic Publishers, Dordrecht (2000)
Kammeyer, K.D., Kroschel, K.: Digitale Signalverarbeitung, 4th edn. B. G. Teub-
ner, Stuttgart (1998) (in German)
Kerekes, J.: Receiver operating characteristic curve confidence intervals and regions.
IEEE Geoscience and Remote Sensing Letters 5(2), 251–255 (2008)
Kim, D.Y., Umesh, S., Gales, M.J.F., Hain, T., Woodland, P.C.: Using VTLN for
broadcast news transcription. In: INTERSPEECH 2004, pp. 1953–1956 (2004)
Kimball, O., Schmidt, M., Gish, H., Waterman, J.: Speaker verification with limited
enrollment data. In: EUROSPEECH 1997, pp. 967–970 (1997)
Kinnunen, T.: Designing a speaker-discriminative adaptive filter bank for speaker
recognition. In: International Conference on Spoken Language Processing, ICSLP
2002, pp. 2325–2328 (2002)
Kinnunen, T.: Spectral Features for Automatic Text-Independent Speaker Recog-
nition. Ph.lic. thesis, Department of Computer Science, University of Joensuu,
Finland (2003)
Krim, H., Viberg, M.: Two decades of array signal processing research: the para-
metric approach. IEEE Signal Processing Magazine 13(4), 67–94 (1996)
Kuhn, R., Junqua, J.-C., Nguyen, P., Niedzielski, N.: Rapid speaker adaptation
in eigenvoice space. IEEE Transactions on Speech and Audio Processing 8(6),
695–707 (2000)
Kuhn, R., Nguyen, P., Junqua, J.-C., Boman, R., Niedzielski, N., Fincke, S., Field,
K., Contolini, M.: Fast speaker adaptation using a priori knowledge. In: IEEE
International Conference on Acoustics, Speech, and Signal Processing, ICASSP
1999, vol. 2, pp. 749–752 (1999)
Kuhn, R., Nguyen, P., Junqua, J.-C., Goldwasser, L., Niedzielski, N., Fincke, S.,
Field, K., Contolini, M.: Eigenvoices for speaker adaptation. In: International
Conference on Spoken Language Processing, ICSLP 1998, vol. 5, pp. 1771–1774
(1998)
Kuhn, R., Perronnin, F., Junqua, J.-C.: Time is money: Why very rapid adaptation
matters. In: Adaptation 2001, pp. 33–36 (2001)
Kwon, S., Narayanan, S.: Unsupervised speaker indexing using generic models.
IEEE Transactions on Speech and Audio Processing 13(5), 1004–1013 (2005)
Kwon, S., Narayanan, S.S.: Speaker change detection using a new weighted distance
measure. In: International Conference on Spoken Language Processing, ICSLP
2002, pp. 2537–2540 (2002)
Lee, A., Nakamura, K., Nisimura, R., Saruwatari, H., Shikano, K.: Noise robust real
world spoken dialogue system using GMM based rejection of unintended inputs.
In: INTERSPEECH 2004, pp. 173–176 (2004)
Leggetter, C.J., Woodland, P.C.: Flexible speaker adaptation using maximum like-
lihood linear regression. In: Proceedings of the ARPA Spoken Language Tech-
nology Workshop, pp. 110–115 (1995a)
Leggetter, C.J., Woodland, P.C.: Maximum likelihood linear regression for speaker
adaptation of continuous density hidden Markov models. Computer Speech and
Language 9, 171–185 (1995b)
Lehn, J., Wegmann, H.: Einführung in die Statistik, 3rd edn. B. G. Teubner,
Stuttgart (2000) (in German)
Linde, Y., Buzo, A., Gray, R.M.: An algorithm for vector quantizer design. IEEE
Transactions on Communications 28(1), 84–95 (1980)
Liu, L., He, J., Palm, G.: A comparison of human and machine in speaker recogni-
tion. In: EUROSPEECH 1997, pp. 2327–2330 (1997)
Ljolje, A., Goffin, V.: Discriminative training of multi-state barge-in models. In:
IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU
2007, pp. 353–358 (2007)
Logan, B.: Mel frequency cepstral coefficients for music modeling. In: 1st Interna-
tional Symposium on Music Information Retrieval, ISMIR 2000 (2000)
Lu, L., Zhang, H.J.: Speaker change detection and tracking in real-time news broad-
casting analysis. In: Proceedings of the Tenth ACM International Conference on
Multimedia, MULTIMEDIA 2002, pp. 602–610 (2002)
Macskassy, S.A., Provost, F.J.: Confidence bands for ROC curves: Methods and
an empirical study. In: 1st Workshop on ROC Analysis in Artificial Intelligence,
ROCAI 2004, pp. 61–70 (2004)
Malegaonkar, A.S., Ariyaeeinia, A.M., Sivakumaran, P.: Efficient speaker change
detection using adapted Gaussian mixture models. IEEE Transactions on Audio,
Speech, and Language Processing 15(6), 1859–1869 (2007)
Mammone, R.J., Zhang, X., Ramachandran, R.P.: Robust speaker recognition.
IEEE Signal Processing Magazine 13(5), 58–71 (1996)
Maragos, P., Potamianos, A., Gros, P. (eds.): Multimodal Processing and Interac-
tion: Audio, Video, Text, 1st edn. Multimedia Systems and Applications, vol. 33.
Springer, New York (2008)
Markov, K., Nakagawa, S.: Frame level likelihood normalization for text-
independent speaker identification using Gaussian mixture models. In: Inter-
national Conference on Spoken Language Processing, ICSLP 1996, vol. 3, pp.
1764–1767 (1996)
Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M.: The DET
curve in assessment of detection task performance. In: EUROSPEECH 1997, pp.
1895–1898 (1997)
Matrouf, D., Bellot, O., Nocera, P., Linares, G., Bonastre, J.-F.: A posteriori and a
priori transformations for speaker adaptation in large vocabulary speech recog-
nition systems. In: EUROSPEECH 2001, pp. 1245–1248 (2001)
Matsui, T., Furui, S.: Concatenated phoneme models for text-variable speaker
recognition. In: IEEE International Conference on Acoustics, Speech, and Signal
Processing, ICASSP 1993, vol. 2, pp. 391–394 (1993)
Meinedo, H., Neto, J.: Audio segmentation, classification and clustering in a broad-
cast news task. In: IEEE International Conference on Acoustics, Speech, and
Signal Processing, ICASSP 2003, vol. 2, pp. 5–8 (2003)
Meyberg, K., Vachenauer, P.: Höhere Mathematik 1: Differential- und Integral-
rechnung. Vektor- und Matrizenrechnung, 6th edn. Springer-Lehrbuch. Springer,
Heidelberg (2003) (in German)
Meyer, B., Wesker, T., Brand, T., Mertins, A., Kollmeier, B.: A human-machine
comparison in speech recognition based on a logatome corpus. In: Speech Recog-
nition and Intrinsic Variation, SRIV 2006, pp. 95–100 (2006)
Moreno, P.J.: Speech Recognition in Noisy Environments. PhD thesis, Department
of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh,
Pennsylvania (1996)
Mori, K., Nakagawa, S.: Speaker change detection and speaker clustering using
VQ distortion for broadcast news speech recognition. In: IEEE International
Conference on Acoustics, Speech, and Signal Processing, ICASSP 2001, vol. 1,
pp. 413–416 (2001)
Morris, A.C., Wu, D., Koreman, J.: MLP trained to separate problem speakers pro-
vides improved features for speaker identification. In: 39th Annual International
Carnahan Conference on Security Technology, CCST 2005, pp. 325–328 (2005)
Nakagawa, S., Zhang, W., Takahashi, M.: Text-independent / text-prompted
speaker recognition by combining speaker-specific GMM with speaker adapted
syllable-based HMM. IEICE Transactions on Information and Systems E89-D(3),
1058–1065 (2006)
Nguyen, P., Wellekens, C., Junqua, J.-C.: Maximum likelihood eigenspace and
MLLR for speech recognition in noisy environments. In: EUROSPEECH 1999,
pp. 2519–2522 (1999)
Nishida, M., Kawahara, T.: Speaker indexing and adaptation using speaker clus-
tering based on statistical model selection. In: IEEE International Conference
on Acoustics, Speech, and Signal Processing, ICASSP 2004, vol. 1, pp. 353–356
(2004)
Nishida, M., Kawahara, T.: Speaker model selection based on the Bayesian infor-
mation criterion applied to unsupervised speaker indexing. IEEE Transactions
on Speech and Audio Processing 13(4), 583–592 (2005)
Olsen, J.Ø.: Speaker verification based on phonetic decision making. In: EU-
ROSPEECH 1997, pp. 1375–1378 (1997)
Oonishi, T., Iwano, K., Furui, S.: VAD-measure-embedded decoder with online
model adaptation. In: INTERSPEECH 2010, pp. 3122–3125 (2010)
Oppenheim, A.V., Schafer, R.W.: Digital Signal Processing. Prentice-Hall, Engle-
wood Cliffs (1975)
O’Shaughnessy, D.: Speech Communications: Human and Machine, 2nd edn. IEEE
Press, New York (2000)
Park, A., Hazen, T.J.: ASR dependent techniques for speaker identification. In:
International Conference on Spoken Language Processing, ICSLP 2002, pp. 1337–
1340 (2002)
Pelecanos, J., Slomka, S., Sridharan, S.: Enhancing automatic speaker identification
using phoneme clustering and frame based parameter and frame size selection.
In: Proceedings of the Fifth International Symposium on Signal Processing and
Its Applications, ISSPA 1999, vol. 2, pp. 633–636 (1999)
Petersen, K.B., Pedersen, M.S.: The matrix cookbook (2008)
Quatieri, T.F.: Discrete-Time Speech Signal Processing: Principles and Practice.
Prentice-Hall, Upper Saddle River (2002)
Rabiner, L., Juang, B.-H.: Fundamentals of Speech Recognition. Prentice-Hall, En-
glewood Cliffs (1993)
Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
Ramírez, J., Górriz, J.M., Segura, J.C.: Voice activity detection. Fundamentals
and speech recognition system robustness. In: Grimm, M., Kroschel, K. (eds.)
Robust Speech Recognition and Understanding, pp. 1–22. I-Tech Education and
Publishing, Vienna (2007)
Reynolds, D., Campbell, J., Campbell, B., Dunn, B., Gleason, T., Jones, D.,
Quatieri, T., Quillen, C., Sturim, D., Torres-Carrasquillo, P.: Beyond cepstra:
Exploiting high-level information in speaker recognition. In: Proceedings of the
Workshop on Multimodal User Authentication, pp. 223–229 (2003)
Reynolds, D.A.: Large population speaker identification using clean and telephone
speech. IEEE Signal Processing Letters 2(3), 46–48 (1995a)
Reynolds, D.A.: Speaker identification and verification using Gaussian mixture
speaker models. Speech Communication 17(1-2), 91–108 (1995b)
Reynolds, D.A., Carlson, B.A.: Text-dependent speaker verification using decoupled
and integrated speaker and speech recognizers. In: EUROSPEECH 1995, pp.
647–650 (1995)
Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted
Gaussian mixture models. Digital Signal Processing 10(1-3), 19–41 (2000)
Reynolds, D.A., Rose, R.C.: Robust text-independent speaker identification using
Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Pro-
cessing 3(1), 72–83 (1995)
Rieck, S., Schukat-Talamazzini, E.G., Niemann, H.: Speaker adaptation using semi-
continuous hidden Markov models. In: 11th IAPR International Conference on
Pattern Recognition, vol. 3, pp. 541–544 (1992)
Rodríguez-Liñares, L., García-Mateo, C.: On the use of acoustic segmentation in
speaker identification. In: EUROSPEECH 1997, pp. 2315–2318 (1997)
Rozzi, W.A., Stern, R.M.: Speaker adaptation in continuous speech recognition
via estimation of correlated mean vectors. In: IEEE International Conference
on Acoustics, Speech, and Signal Processing, ICASSP 1991, vol. 2, pp. 865–868
(1991)
Schmalenstroeer, J., Häb-Umbach, R.: Joint speaker segmentation, localization and
identification for streaming audio. In: INTERSPEECH 2007, pp. 570–573 (2007)
Schmidt, G., Haulick, T.: Signal processing for in-car communication systems. Signal Processing 86(6), 1307–1326 (2006)
Schukat-Talamazzini, E.G.: Automatische Spracherkennung. Vieweg, Braunschweig
(1995) (in German)
Segura, J.C., Benítez, C., de la Torre, Á., Rubio, A.J., Ramírez, J.: Cepstral domain
segmental nonlinear feature transformations for robust speech recognition. IEEE
Signal Processing Letters 11(5), 517–520 (2004)
Setiawan, P., Höge, H., Fingscheidt, T.: Entropy-based feature analysis for speech
recognition. In: INTERSPEECH 2009, pp. 2959–2962 (2009)
Siegler, M.A., Jain, U., Raj, B., Stern, R.M.: Automatic segmentation, classification
and clustering of broadcast news audio. In: Proceedings of the DARPA Speech
Recognition Workshop, pp. 97–99 (1997)
Song, H.J., Kim, H.S.: Eigen-environment based noise compensation method for
robust speech recognition. In: INTERSPEECH 2005, pp. 981–984 (2005)
Song, J.-H., Lee, K.-H., Park, Y.-S., Kang, S.-I., Chang, J.-H.: On using Gaus-
sian mixture model for double-talk detection in acoustic echo suppression. In:
INTERSPEECH 2010, pp. 2778–2781 (2010)
Stern, R.M., Lasry, M.J.: Dynamic speaker adaptation for feature-based isolated
word recognition. IEEE Transactions on Acoustics, Speech and Signal Process-
ing 35(6), 751–763 (1987)
Thelen, E.: Long term on-line speaker adaptation for large vocabulary dictation.
In: International Conference on Spoken Language Processing, ICSLP 1996, pp.
2139–2142 (1996)
Thompson, J., Mason, J.S.: Cepstral statistics within phonetic subgroups. In: In-
ternational Conference on Signal Processing, ICSP 1993, pp. 737–740 (1993)
Thyes, O., Kuhn, R., Nguyen, P., Junqua, J.-C.: Speaker identification and verifi-
cation using eigenvoices. In: International Conference on Spoken Language Pro-
cessing, ICSLP 2000, vol. 2, pp. 242–245 (2000)
Tritschler, A., Gopinath, R.: Improved speaker segmentation and segments cluster-
ing using the Bayesian information criterion. In: EUROSPEECH 1999, vol. 2,
pp. 679–682 (1999)
Tsai, W.-H., Wang, H.-M., Rodgers, D.: Automatic singer identification of popu-
lar music recordings via estimation and modeling of solo vocal signal. In: EU-
ROSPEECH 2003, pp. 2993–2996 (2003)
Vary, P., Heute, U., Hess, W.: Digitale Sprachsignalverarbeitung. B.G. Teubner,
Stuttgart (1998) (in German)
Veen, B.D.V., Buckley, K.M.: Beamforming: A versatile approach to spatial filter-
ing. IEEE ASSP Magazine 5(2), 4–24 (1988)
Wendemuth, A.: Grundlagen der stochastischen Sprachverarbeitung. Oldenbourg
(2004) (in German)
Wilcox, L., Kimber, D., Chen, F.: Audio indexing using speaker identification. In:
Proceedings of the SPIE Conference on Automatic Systems for the Inspection
and Identification of Humans, vol. 2277, pp. 149–157 (1994)
Wrigley, S.N., Brown, G.J., Wan, V., Renals, S.: Speech and crosstalk detection in
multi-channel audio. IEEE Transactions on Speech and Audio Processing 13(1),
84–91 (2005)
Wu, T., Lu, L., Chen, K., Zhang, H.-J.: UBM-based real-time speaker segmentation
for broadcasting news. In: IEEE International Conference on Acoustics, Speech,
and Signal Processing, ICASSP 2003, vol. 2, pp. 193–196 (2003)
Yella, S.H., Varma, V., Prahallad, K.: Significance of anchor speaker segments for
constructing extractive audio summaries of broadcast news. In: IEEE Workshop
on Spoken Language Technology, SLT 2010, pp. 13–18 (2010)
Yin, S.-C., Rose, R., Kenny, P.: Adaptive score normalization for progressive model
adaptation in text independent speaker verification. In: IEEE International Con-
ference on Acoustics, Speech and Signal Processing, ICASSP 2008, pp. 4857–4860
(2008)
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X.A., Moore, G.,
Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P.: The HTK Book
Version 3.4. Cambridge University Press, Cambridge (2006)
Zavaliagkos, G., Schwartz, R., McDonough, J., Makhoul, J.: Adaptation algorithms
for large scale HMM recognizers. In: EUROSPEECH 1995, pp. 1131–1135 (1995)
Zeidler, E.: Oxford Users’ Guide to Mathematics. Oxford University Press, Oxford
(2004)
Zhang, Z.-P., Furui, S., Ohtsuki, K.: On-line incremental speaker adaptation with
automatic speaker change detection. In: IEEE International Conference of Acous-
tics, Speech, and Signal Processing, ICASSP 2000, vol. 2, pp. 961–964 (2000)
Zhou, B., Hansen, J.H.L.: Unsupervised audio stream segmentation and clustering
via the Bayesian information criterion. In: International Conference on Spoken
Language Processing, ICSLP 2000, vol. 3, pp. 714–717 (2000)
Zhu, D., Ma, B., Lee, K.-A., Leung, C.-C., Li, H.: MAP estimation of subspace
transform for speaker recognition. In: INTERSPEECH 2010, pp. 1465–1468
(2010)
Zhu, X., Barras, C., Meignier, S., Gauvain, J.-L.: Combining speaker identification
and BIC for speaker diarization. In: INTERSPEECH 2005, pp. 2441–2444 (2005)
Zochová, P., Radová, V.: Modified DISTBIC algorithm for speaker change detec-
tion. In: INTERSPEECH 2005, pp. 3073–3076 (2005)
Index
audio segmentation, 59
agglomerative clustering, 60
decoder-guided, 62
divisive segmentation, 60
metric-based, 60
model-based, 61
barge-in, 3, 150
Bayes’ theorem, 22
Bayesian adaptation, 155
Bayesian information criterion, 18, 138
BIC, see Bayesian information criterion
binary hypothesis test, 17
CDHMM, see hidden Markov model
cepstral mean normalization, 14, 93,
102, 132
cepstral mean subtraction, see cepstral
mean normalization
codebook, 32, 35
confidence
interval, 79, 158
measure, 40, 128
covariance matrix, 18
diagonal, 20, 25, 28
crosstalk, 150
DCT, see discrete cosine transform
delta features, 15
delta-delta features, 15
discrete cosine transform, 12, 26
eigen-environment, 56
eigenvoice, 45
expectation maximization algorithm,
27, 153
auxiliary function, 27, 40, 41, 154
feature extraction, 10
feature vector normalization and
enhancement, 53, 151
eigen-environment, 56, 151
stereo-based piecewise linear com-
pensation for environments,
54
formant, 6
forward-backward algorithm, 34, 127,
131
backward algorithm, 34, 127
forward algorithm, 34, 127
fundamental frequency, 6
Gaussian distribution, 18
Gaussian mixture model, 25
hidden Markov model, 31
continuous, 33, 90, 126
semi-continuous, 33, 35, 87
iid, see random variable
in-set speaker, 70, 129
Kronecker delta, 21
LDA, see linear discriminant analysis
likelihood, 18, 22
ratio, 17
linear discriminant analysis, 15
172 Index
log-likelihood, 19
evolution, 117
ratio, 29, 93, 104, 110, 117
Lombard effect, 7
MAP, see maximum a posteriori
Markov model, 31
maximum a posteriori, 22
maximum likelihood, 28
mean vector, 18
mel frequency cepstral coefficients, 10
MFCC, see mel frequency cepstral coefficients
ML, see maximum likelihood
multilayer perceptron, 24, 125
noise reduction, 9
non-target speaker, 24, 117, 129
out-of-set speaker, 29, 70, 129
PCA, see principal component analysis
phone, 7
phoneme, 7, 31, 65, 150
pitch, 6
principal component analysis, 12, 48
probability
conditional, 22
joint, 22
posterior, 22
prior, 17, 22
random variable
ergodic, 19
independently and identically distributed, 18
stationary, 12
receiver operator characteristics, 103, 159
SCHMM, see hidden Markov model
simultaneous speaker identification and speech recognition, 84
soft quantization, 36, 88
source-filter model, 5
speaker adaptation, 37
eigenvoices, 45, 149
extended maximum a posteriori, 51
maximum a posteriori algorithm, 41
maximum likelihood linear regression, 41, 44, 149
segmental maximum a posteriori algorithm, 39
speaker change detection, 16
speaker identification, 21
closed-set, 23, 28, 88, 91, 125
open-set, 23, 29, 93, 128
text-dependent, 24
text-independent, 24
text-prompted, 24
speaker identification rate, 158
speaker tracking, see speaker identification
speaker verification, 23
speech production, 5
unvoiced, 6
voiced, 6
speech recognition, 30
decoder, 36
garbage words, 36
grammar, 31
language model, 31, 34, 36
lexicon, 34, 36
speech segmentation, 10, 149, 150
SPEECON, 76
SPLICE, see feature vector normalization and enhancement
standard codebook, 35
statistical model
discriminative, 23, 125
generative, 24, 125
non-parametric, 23
parametric, 23
sufficient statistics, 74, 133
supervector, 46
target speaker, 24, 117
universal background model, 29, 41, 95
Viterbi algorithm, 35
vocal tract, 6
voice activity detection, see speech segmentation
word accuracy, 79, 158
word error rate, 158

An automatic personalization through a self-learning speech controlled system is targeted. To achieve a high degree of user-friendliness. The history of speech recognition and speaker identification is characterized by steady progress. recognition accuracy can be negatively affected by changing environments.g. 1993]. For example. telecommunication. machines and computers can be operated more conveniently with the help of automated speech recognition and understanding [Furui. a natural and easy to use human-computer interface is targeted for technical applications. Whereas in the 1950s the first realizations were mainly built on heuristic approaches. F. Despite significant advances during the last few decades. Gerl. 1–4. 2009]. pp. An integrated implementation of speech recognition and speaker identification has been developed which adapts individually to several users. e. more sophisticated statistical techniques have been established and continuously developed since the 1980s [Furui. manufacturing. speaker variability and natural language input [Junqua. Herbig.1 Introduction Automatic speech recognition has attracted various research activities since the 1950s. The goal of this book is to contribute to a natural human-computer communication. medical reports or infotainment systems in automobiles [Rabiner and Juang. c Springer-Verlag Berlin Heidelberg 2011 springerlink. 2009]. 2000]. Minker: Self-Learning Speaker Identification. Significant progresses for speech recognition are exemplified for in-car applications. T. Speech recognition has attracted attention for a variety of applications such as office systems. Speech recognition can increase labor efficiency in call-centers or for dictation tasks of special occupational groups with extensive documentation duty. For incar applications both the usability and security can be increased for a wide variety of users. SCT.com . and W. The driver can be supported to safely participate in road traffic and to operate technical devices. navigation systems or hands-free sets. Today a high level of recognition accuracy has been achieved which allows applications of increasing complexity to be controlled by speech. 2009]. Since speech is the most important means of interpersonal communication. there still exist some deficiencies that limit the wide-spread application of speech recognition in technical applications [Furui.

[2000]. e. e. In addition to an enhanced speech recognition accuracy by incremental speaker adaptation. in general navigation. e. However. An integrated implementation comprising speech recognition. 1996]. the user is known in this case and the environment in general does not change. 5 recurring speakers. Nevertheless. the usability can be improved by tracking personal preferences and habits. For embedded systems computational efficiency and memory consumption are important design parameters. city or street names.g. Automatic personalization through a self-learning speech controlled system is targeted in this book. or transient events such as passing cars or babble noise. often one can expect that a device is only used by a small number of users. the trained speech pattern does not fit the voice of each speaker perfectly [Zavaliagkos et al. A more natural human-computer communication will be achieved by identifying the current user in an unsupervised way. Kuhn et al. a higher number of parameters allows more individual representation of the speaker’s characteristics leading to improved speech recognition results. the statistical models can be perfectly trained to a particular user [Thelen. mainly differ in the number of parameters which have to be estimated. needs to be accurately recognized. it seems to be natural to employ speaker adaptation separately for several speakers and to integrate speaker identification.1 Motivation Infotainment systems with speech recognition. telephone or music control. Especially during the initialization of a new speaker profile. detection of new users and speaker adaptation has been developed. A simple implementation would be to impose the user to identify himself whenever the system is initialized or the speaker changes. fast speaker adaptation that can converge within a few utterances is essential to provide robust statistical models for efficient speech recognition and reliable speaker identification. The speech signal may be degraded by varying engine. such approaches are only successful when a sufficient amount of utterances proportional to the number of adaptation parameters can be reliably attributed to the correct speaker.g. Obviously. speaker identification. In general. a very large vocabulary. Speech recognition is enhanced by individually adapting a speech recognizer to 5-10 recurring users in an unsupervised manner. Several tasks have to be accomplished by such a system: The various speaker adaptation schemes which have been developed. No additional intervention of the user should be required. To achieve high recognition rates in large vocabulary dictation systems. Therefore. Stern and Lasry [1987].2 1 Introduction 1. When prior knowledge about . 1995]. typically are not personalized to a single user.. Speaker independent speech recognizers are trained on a large set of speakers.g. However. by Gauvain and Lee [1994]. In a system without speaker tracking all information acquired about a particular speaker is lost with each speaker turn. wind and tire noises. However.

Beyond the scope of this book. without pressing a push-to-talk button. Confusions of the speech recognizer can be reduced and in doubt the user can be offered a list of reasonable alternatives leading to an increased usability. the knowledge about the speaker is not only useful for speech recognition. especially for real-time computation. Speakers have to be identified reliably despite limited adaptation data to guarantee long-term stability and to support the speech recognizer to build up well trained speaker profiles. An enrollment to train new profiles should be avoided for the sake of a more natural operation of speech controlled devices. A unified statistical modeling of speech and speaker related characteristics will be presented as an extension of common speech recognizer technologies. Optimal recognition accuracy may be achieved for each user by an efficient profile selection during speech recognition. A more natural and efficient control of automated human-computer interfaces may be achieved.1.1 Motivation 3 speaker variability can be used only a few parameters are required to efficiently adapt the statistical models of a speech recognizer. Unknown speakers have to be detected quickly so that new profiles can be initialized in an unsupervised way without any intervention of the user. Knowledge about the speaker identity enables the system to prefer or preselect speaker specific parameter sets such as an address book for hands-free telephony or a list of frequently selected destinations for navigation. two operator modes for beginners and more skilled users are possible. Speech recognition has to be extended so that several individually trained speaker profiles can be applied in parallel. the system shall be designed to efficiently retrieve and represent speech and speaker characteristics. Multiple recognitions of speaker identity and spoken text have to be avoided.g. Barge-in allows the user to interrupt speech prompts. Recognition accuracy and computational complexity have to be addressed. Thus. Since applications for embedded systems are examined. Advanced users can be pointed to more advanced features. Beginners can be offered a comprehensive introduction or assistance concerning the handling of the device. Ljolje and Goffin [2007]. e. The human-computer interface may be personalized in different ways: Feedback of the speaker identity to the speech dialog engine allows special habits and knowledge of the user about the speech controlled device to be taken into consideration. For instance. Barge-in detection is another application where techniques for speaker identification and speech recognition can be of help as shown by Ittycheriah and Mammone [1999]. However. originating from a Text-To-Speech (TTS) system. the capability to track individual characteristics can be strongly limited. a balanced strategy of fast and individual adaptation has to be developed by combining the respective advantages of both approaches. Speaker identification is essential to track recurring users across speaker turns to enable optimal long-term adaptation. . In the long run speaker characteristics have to be optimally captured.

a summary and conclusion are given which repeat the main aspects and results. The target system of this book is sketched and the main aspects are emphasized. It is used subsequently to initialize and continuously adapt the speaker profiles of the target system and forms the identification basis of an integrated approach for speaker identification and speech recognition. Speech recognition accuracy is continuously improved. are introduced and discussed in detail. Speaker identification enables the system to continuously adapt the speaker profiles of its main users. the fundamentals relevant for the scope of this book are discussed. Several future applications and extensions are discussed in an outlook. The evolution of the adaptive system can be taken into consideration so that a strictly unsupervised system may be achieved. Speaker tracking across several utterances significantly increases speaker identification accuracy and ensures long-term stability of the system. speech recognition and speaker adaptation is given. speaker change detection. Then a brief survey of more complex approaches and systems concerning the intersection of speaker change detection.4 1 Introduction 1. Experiments have been conducted to show the benefit for speech recognition and speaker identification. The core components of a speech controlled system. Speech and speaker characteristics can be involved in several ways to identify the current user of a speech controlled system and to understand the content of the spoken phrase. This speaker adaptation method is able to capture speaker characteristics after a few utterances and transits smoothly to a highly specific speaker modeling. Speech production is explained by a simple model.2 Overview First. Step by step more sophisticated strategies and statistical models are explained to handle speech and speaker characteristics. The first component of the target system is a speaker adaptation scheme which combines short-term adaptation as well as individual adjustments of the speaker profiles in the long run. The basic system can be extended by long-term speaker tracking. Finally. Remarkable improvements for speech recognition accuracy may be achieved under favorable conditions. New users may be detected in an unsupervised way. speech recognition and speaker adaptation. Another important component of the target system is the unified realization of speaker identification and speech recognition which allows a compact statistical representation for both tasks. It provides a first overview of speech and speaker variability. speaker identification. . namely signal processing and feature extraction. speaker identification.

and W. Gerl. automatic speech processing and the components of a complete system for self-learning speaker identification and speech recognition are introduced in this chapter. SCT. two competing approaches are presented to handle unseen situations such as speaker variability or acoustic environments. each module introduces an additional extension for a better representation of speech signals motivated by speech production theory. c Springer-Verlag Berlin Heidelberg 2011 springerlink. Finally. Based on these features the modules speaker change detection.1 Speech Production Fant’s model [Fant. 5–57. T. A continuous improvement of the overall performance is targeted. Starting from a basic statistical model. several realizations of complete systems known from literature are presented in the next chapter. Minker: Self-Learning Speaker Identification. 1960] defines speech production as a source-filter model1 . pp. F.com . They are essential for a speech controlled system operated in an unsupervised manner. speaker identification and speech recognition are described. Then feature extraction is considered as the inverse process.2 Fundamentals The fundamentals of speech production. Herbig. A technique is introduced which extracts the relevant speech features from the recorded speech signal that can be used for further automated processing. 2. Speaker adaptation allows modifying the statistical models for speech and speaker characteristics whereas feature vector normalization or enhancement compensates mismatches on a feature level. The air stream originating from the lungs flows through the vocal cords and 1 The subsequent description of speech production follows O’Shaughnessy [2000] if not indicated otherwise. Based on the fundamentals. In the first section discussion starts with speech production to gain insight into speech and the problem of speaker variability. A simple yet effective model is provided to simplify the understanding of the algorithms handling speaker variability and speech characteristics.

It depends on speaker and gender specific properties of the vocal cords. Female speakers have an average fundamental frequency of 223 Hz. The characteristics of glottis. anatomical differences in the vocal tract and acquired . can be integrated by n(τ ) and a further room impulse response hCh . the speech transmission from the speaker’s mouth to the microphone. The speech signal sMic recorded by the microphone is given by sMic (τ ) = u(τ ) ∗ hVT (τ ) ∗ hCh (τ ) +n(τ ). The speech signal s(τ ) is given by the convolution s(τ ) = u(τ ) ∗ hG (τ ) ∗ hVT (τ ) ∗ hLR (τ ) u(τ ) ∗ h(τ ) = ∞ τ =−∞ ˜ (2. τ ˜ τ Fig. Children use even higher frequencies. It can be roughly approximated by a series of acoustic tubes producing characteristic resonances known as formant frequencies [Campbell. length. h(τ ) (2. e.1) (2. The length of the vocal tract is given by the distance between the glottis and the lips.g.2) u(˜) · h(τ − τ ) d˜. 1997]. The modeling of speech production is based on the three main components comprising excitation source. As speakers differ anatomically in shape and length of their vocal tract. 1993]. Rabiner and Juang. 1995]. hVT and hLR . Unvoiced speech is caused by turbulences in the vocal tract whereas voiced speech is due to a quasi-periodic excitation [Schukat-Talamazzini. tension and mass [Campbell. 2. formant frequencies are speaker dependent [Campbell. 2004]. 1997]. The Vocal Tract (VT) generally comprises all articulators above the vocal cords that are involved in the speech production process [Campbell. Male speakers have an average length of 17 cm whereas the vocal tract of female speakers is 13 cm long on average.1 depicts the associated block diagram of the source-filter model. The fundamental frequencies of male speakers fall in the range between 80 Hz and 160 Hz and are on average about 132 Hz. 1997]. In this context u(τ ) denotes the excitation signal at time instant τ . e. Additive noise and channel characteristics. A simplified model can be given by s(τ ) = u(τ ) ∗ hVT (τ ) (2.4) Speaker variability can be viewed as a result of gender and speaker dependent excitation. The periodicity of the excitation is called fundamental frequency F0 or pitch. vocal tract and lip radiation are modeled by filters with the impulse responses hG . 1997]. 1995].6 2 Fundamentals generates the excitation [Campbell.g. The opening of the glottis determines the type of excitation and whether voiced or unvoiced speech is produced. 1997.3) if the glottis and the lip radiation are neglected [Wendemuth. vocal tract and lip radiation [SchukatTalamazzini.

word usage and conversational style [Reynolds et al.2. Schukat-Talamazzini [1995]. The physical sound generated by the articulation of a phoneme is called phone [O’Shaughnessy. Noisy environments can have an additional influence on the manner of articulation which is known as Lombard effect [Rabiner and Juang.1 Source-filter model for voiced and unvoiced speech production as found by Rabiner and Juang [1993]. nasal consonants and plosives: • Vowels are produced by a quasi-periodic excitation at the vocal cords and are voiced. vocal tract and the lip radiation. 2000]. . 2. e. fricatives. Speaker change detection and speaker identification have to provide appropriate statistical models and techniques which optimally capture the individual speaker characteristics. 2003]. significant changes in their voice patterns can occur degrading automated speech recognition and speaker identification [Goldenberg et al. 1996]. 2006]. The speech production comprises either a quasi-periodic or noise-like excitation and three filters which represent the characteristics of the glottis. Therefore further aspects of speech are important.g.. fundamental frequency and vowel duration as well as a shift of the first two formants and energy distribution [Junqua.. Since speakers try to improve communication intelligibility. 1993]. Speech can be decomposed into phonemes2 which can be discriminated by their place of articulation as exemplified for vowels. They are characterized by line spectra located at multiples of the fundamental frequency and their intensity exceeds other phonemes.1 Speech Production 7 speaking habits [Campbell. Vocal Tract Lip Radiation The focus of speech recognition is to understand the content of the spoken phrase. 1997]. Low-level acoustic information is related to the vocal apparatus whereas higher-level information is attributed to learned habits and style. The effects are highly speakerdependent and cause an increase in volume. Many languages can be described by 20 − 40 phonemes. 2 Phonemes denote the smallest linguistic units of a language. prosody. scaling quasiperiodic Glottis voiced / unvoiced noise scaling Fig.

The feature vectors are normalized in the post processing to compensate for channel characteristics.2.2 Front-End The front-end is the first step in the speech recognition or speaker identification processing chain. The recorded microphone signal is preprocessed to reduce background noises. 3. Signal Processing Feature Extraction Post Processing Fig. 1993]. The following post-processing comprises feature vector normalization to compensate for channel characteristics. . The signal processing applies a sampling and preprocessing to the recorded microphone signals. The goal is to extract the relevant spectral properties of speech which can be used for speaker change detection. A vector representation is extracted from the speech signal. In this book it consists of a signal processing part. 2. little speaker discrimination is expected. The feature extraction transforms discrete time signals into a vector representation which is more feasible for pattern recognition algorithms. In consequence. • Plosives (stops) are produced by an explosive release of an occlusion of the vocal tract followed by a turbulent noisy air-stream. speaker identification and speech recognition. They are characterized by energy located at higher frequencies.3 exploits the knowledge about the speaker variability of phonemes to enhance the identification accuracy.8 2 Fundamentals • Nasal consonants are characterized by a total constriction of the vocal tract and a glottal excitation [Rabiner and Juang. The three parts of the front-end are described in more detail. no information about the vocal tract is present if the place of articulation is given by the teeth or lips as in /f/ or /p/. 2. In the next section an automated method for feature extraction is presented. 2. • Fricatives are produced by constrictions in the vocal tract that cause a turbulent noisy airflow. For example. The constriction produces a loss of energy so that fricatives are less intensive. The nasal cavity is excited and attenuates the sound signals so that nasals are less intensive than vowels. In the case of speech recognition a discriminative mapping is used in addition.2 Block diagram of a front-end for speech recognition and speaker identification. feature extraction and post-processing as displayed in Fig. The phoneme based speaker identification discussed in Sect.

5) and performs a quantization. Noise reduction targets to minimize environmental influences which is essential for a reliable recognition accuracy. e Ephraim and Malah [1984] in more detail. 3. As noise reduction algorithms do not play an important role in this book. In this book a sampling rate of fs = 11. In combination with an estimate of the disturbed speech spectrum. H¨nsler and Schmidt [2004. (2. speech signals are superimposed with background noises affecting the discrimination of different speakers or speech decoding. The Wiener filter.3 Block diagram of the signal processing in the front-end. Fig. fs l = 0. [1998]. . spectral subtraction and Ephraim-Malah are well-known weighting techniques which are described by Capp´ [1994]. A noise estimation algorithm calculates the noise spectrum as found by Cohen [2003]. Cohen and Berdugo [2002]. the power of the undisturbed or clean speech signal is estimated by a spectral weighting. [2005]. ADC sMic (τ ) s(l) Noise Reduction VAD Fig. Especially in automotive applications. The recorded speech signal is sampled at discrete time instances and noise reduction is performed. 2. The Analog to Digital Converter (ADC) samples the incoming time signals at discrete time instances sMic = sMic ( l l ). Speech pauses are excluded for subsequent speech processing.025 kHz and a 16 bit quantization are used.4 displays the noise reduction setup as described above.3.. Noise reduction can be achieved by a spectral decomposition and a spectral weighting as described by Vary et al. The principle block diagram is displayed in Fig. In this context l denotes the discrete time index. The time signal is split into its spectral components by an analysis filter bank. .2. After spectral weighting the temporal speech signals are synthesized by a synthesis filter bank. . The enhanced speech signal is employed in the subsequent feature extraction. 2008] for further detailed a information. The phase of the speech signals remains unchanged since phase distortions seem to be less critical for human hearing [Vary et al. 1. 2. only the widely-used Wiener filter is applied. 2. 1998].2 Front-End 9 Signal Processing The signal processing receives an input signal from the microphone and delivers a digitized and enhanced speech signal to feature extraction. Interested readers are referred to Benesty et al. 2. .

2. . Liu et al. speech segmentation can shift the detected boundaries of the beginning and end of an utterance to guarantee that the entire phrase is enclosed for speech recognition. O’Shaugha nessy [2000] and Meyer et al. [2010]. [1997]. 2002]. For the comparison of human and machine in speaker identification and speech recognition it is referred to Hautam¨ki et al. For further reading. Conventional speech segmentation algorithms rely on speech properties such as the zero-crossing rate. The enhanced speech signal is synthesized by the synthesis filter bank. The vocal tract structure may be considered as the 3 4 E{} denotes the expectation value. The computational load and the probability of false classifications can be reduced for speaker identification and speech recognition.4 Block diagram of the noise reduction. The power of the clean speech signal is estimated by a spectral weighting of the disturbed speech spectrum.10 2 Fundamentals Analysis filter bank Synthesis filter bank Power Weight factor Fig. the l harmonic structure of voiced speech segments may be used. they seem to capture enough spectral information for speaker identification. the interested reader is referred to literature such as Espi et al. Adverse environments complicate the end-pointing because background noises mask regions of low speech activity. The Mel Frequency Cepstral Coefficients (MFCC) are frequently used in speech recognition and speaker identification [O’Shaughnessy. 2005]. [2006]. The reader may wonder why the MFCC features are also used for speaker identification. MFCCs are physiologically motivated by human hearing4 . Voice Activity Detection (VAD) or speech segmentation excludes speech pauses so that the subsequent processing is done on portions of the microphone signal that are assumed to comprise only speech utterances. 2000. [2010]. ırez Feature Extraction Feature extraction transforms the enhanced speech signals into a vector representation which reflects discriminatory properties of speech. Furthermore. especially for vowels. Finally. Ram´ et al. The signal is split into its spectral components by an analysis filter bank. Quatieri. rising or falling energy3 E{(sMic )2 } [Kwon and Narayanan. Even though MFCCs are expected to cause a loss of speaker information. Noise and speech power are estimated to calculate a spectral weighting. [2007].

1997] as explained later in Sect.8) The FFT represents an efficient implementation of the DFT [Kammeyer and Kroschel. 1998].. . 2. This feature is reflected by the speech spectrum [Reynolds and Rose. The windowing ˜ sw = sMicshift +l · hl . features directly computed by a filter bank such as the MFCCs are obviously better suited for noisy applications [Reynolds and Rose. 2003] are not considered since a command and control scenario is investigated. the Short Time Fourier Transform (STFT) divides the sequence of speech samples in frames of predetermined length.6. applies a window function and splits each frame into its spectral components. Subsequently.2. . According to Quatieri [2002] the mel-cepstrum can be regarded as one of the most effective features for speech-related pattern recognition. Vary et al. 1993]: S(fb . In this book ˜ the window function h is realized by the Hann window 1 ˜ hl = 2 1 + cos( 2πl ) . . 2 2 (2.4. Oppenheim and Schafer [1975]. NFFT NFFT ≥ NFrame . Reynolds [1995a. the description in this book is restricted to the extraction of the MFCC features. 1998]. .l 1 2 NFrame = l=− 1 NFrame 2 5 sw · exp(−i ˜t. ˜t. (2. 2004. 2002. NFrame 1 1 l = − NFrame . The incoming microa phone signals are split into NFFT narrow band signals S(fb . Hence.. . The frame shift Nshift describes the time which elapsed between two frames.l 2π fb l). 1995]. (2. for example. An exemplary MFCC block diagram is shown in Fig. 1995]. The filter characteristics are determined by the window’s transfer function shifted in the frequency domain [H¨nsler and Schmidt. High-level features [Quatieri. However.. t) = F F T sw ˜t. 1998]. Rabiner and Juang.6) where the length of one frame is given by NFrame.b] obtains excellent speaker identification rates even for large speaker populations. t) [H¨nsler and a Schmidt. Reynolds et al.l t·N t > 1. t denotes the discrete frame index. Common window functions are the Hamming and Hann window [Vary et al.. NFrame . 2. The frame length is 20 ms and frame shift is given as half of the frame length.2 Front-End 11 dominant physiological feature to distinguish speakers. Kinnunen [2003] provides a thorough investigation of several spectral features for speaker identification and concludes that the best candidates range at the same level. 2004. MFCCs enable a compact representation and are suitable for statistical modeling by Gaussian mixtures [Campbell. 1998]. First. Details on DFT and FFT are given by Kammeyer and Kroschel [1998].7) combined with the Discrete Fourier Transform (DFT) or Fast Fourier Transform (FFT)5 realizes a filter bank [Vary et al.

1975] is applied. The DCT approximates the Karhuen-Loeve Transform (KLT) . The human auditory system cannot resolve frequencies linearly [Logan. 2000]. The phase information is discarded. Stationarity claims that the statistical properties of the random variable are independent from the investigated time instances [H¨nsler. 2000]. Applying the STFT to speech signals presumes that speech can be regarded as stationary for short time periods. In the linear frequency domain this corresponds to filters with increasing bandwidth for higher frequencies. The relation between linear frequencies f and warped mel-frequencies fmel is derived from psychoacoustic experiments and can be mathematically described by f fmel = 2595 · log10 (1 + ) (2. all vectors are viewed as column vectors. speech characteristics and scaling factors are separated into additive components in the case of clean speech. Irrespective of voiced or unvoiced speech. 2000]. Triangular filters can be employed as found by Davis and Mermelstein [1980].5 an example is shown. 2. The result can be grouped into the vectors xMEL representt 2 ing all energy values of frame index t. For example. The spectral resolution decreases with increasing frequency. The number of frequency bins is reduced here by this filter bank from 1 NFFT + 1 to 19.12 2 Fundamentals In this context the index fb denotes the frequency bin and i is the imaginary unit. Up to 1 kHz the linear and warped frequencies are approximately equal whereas higher frequencies are logarithmically compressed. 1996]. 2001]. For convenience. t)|2 are calculated. Wendemuth. the speca tral amplitudes of speech can be regarded as quasi-stationary for tens of milliseconds [O’Shaughnessy. 2004]. t t The dynamics of the mel energies are compressed and products of spectra are deconvolved [O’Shaughnessy. Further details can be found by Rabiner and Juang [1993]. 2000. t)| and the local energy |S(fb . The Discrete Cosine Transform (DCT) is a well-known algorithm from image processing because of its low computational complexity and decorrelation property. In the next step of the feature extraction the magnitude |S(fb . For the feature extraction this assumption is approximately true if the frame length is shorter than 30 ms [SchukatTalamazzini. 1995].. The mel filter bank comprises a non-linear frequency warping and a dimension reduction. The critical-band rate or Bark scale reflects the relationship between the actual and frequency resolution which is approximated by the mel filter bank [O’Shaughnessy. In the warped frequency domain an average energy is computed by an equally spaced filter bank [Mammone et al.9) 700 Hz as found by O’Shaughnessy [2000]. In Fig. The FFT order is selected as NFFT = 256 and zero-padding [Oppenheim and Schafer. The logarithm is applied element by element to xMEL and returns xLOG .

The KLT also known as Principal Component Analysis (PCA) projects input vectors onto directions efficient for representation [Duda et al.10) as found by O’Shaughnessy [2000]. for example. The resulting vector is denoted by xMFCC . 2000]. The first coefficient (l = 0) is typically not used because it represents an average energy of xLOG and therefore depends on the scaling t of the input signal [O’Shaughnessy. A dimension reduction is used for all experiments in this book t and the remaining 11 MFCC features are employed. 2. 2000]. An equally spaced filter bank is used in the warped frequency domain to compute an averaged energy.. d contains the dimensionality of the specified vector.t 2 dLOG . . It is replaced in this book by a logarithmic energy value which is derived from xMEL.. The second coefficient t of the DCT can be interpreted as the ratio of the energy located at low and high frequencies [O’Shaughnessy. The coefficients xDCT of the resulting vector xDCT are given by the formula t l. 2006] which are described later in Sect. .10) is applied to the logarithmic mel-energies. l = 1.2.2 Front-End 13 2000 1500 Mel scale 1000 500 0 0 1000 2000 3000 Frequency [Hz] 4000 5000 Fig. (2. 2001]. The DCT can be combined with a dimension reduction by omitting the higher dimensions. The DCT in (2.7 . Triangular filters can be applied. Fig.5. .t dLOG xDCT = l. in the case of Markov random signals of first order [Britanak et al. In the linear frequency domain the bandwidth of the corresponding filters increases for higher frequencies. 2. dDCT . 2.2.5 Example for a mel filter bank. .t l1 =1 1 π xLOG · cos l(l1 − ) l. Post Processing In post processing a normalization of the developed feature vectors xMFCC t is performed and a Linear Discriminant Analysis (LDA) is applied.

2000. 2007]. 1996]. the long-term average of feature vector μMFCC is approximately based on the unwanted channel characteristics. t Mean Subtraction xMFCC t xCMS t LDA xLDA t Fig. It improves the robustness of speaker identification and speech recognition [Reynolds et al.. A simple iterative mean computation is given by . For common speech recognizers it has been established as a standard normalization method [Buera et al.4) remains relatively unaffected. The discrete cosine transform is applied to the logarithm of the mel energies. the inter-speaker variability is reduced by mean normalization [H¨b-Umbach. phonemes. t) MEL xMEL t LOG xLOG t DCT xDCT t |S(k. 2. e.. In the case of a speech recognizer an LDA is employed to increase the discrimination of clusters in feature space. The MFCC feature vectors xMFCC are usually computed by discarding t the first coefficient of xDCT . The cepstral mean vector iteratively estimated is subtracted from the MFCC vectors. 2. Young et al. When the room impulse response varies only slowly in time relative to the speech signal. In addition. a 1999].14 2 Fundamentals STFT sw ˜t. The subtraction of this mean vector from the feature vectors compensates for unwanted channel characteristics such as the microphone transfer function and leaves the envelope of the speech signal relatively unaffected. 2006]. A dimension reduction can be applied. The logarithm of feature extraction splits the speech signal and the room impulse response into additive components if the channel transfer function is spectrally flat within the mel filters and if clean speech is considered.g..7 Block diagram of the post processing in the front-end.. To avoid latencies caused by the re-processing of entire utterances the mean can be continuously adapted with each recorded utterance to track the long-term average. The STFT is applied to the windowed speech signal and the spectral envelope is further processed by the mel filter bank.l Spectral Envelope S(k. Noise reduction as a pre-processing diminishes the influence of background noises on the speech signal whereas the room impulse response hCh in (2. t)|2 Fig. Mean subtraction is an elementary approach to compensate for channel characteristics [Mammone et al.6 Block diagram of the MFCC feature extraction. displays the corresponding block diagram comprising the Cepstral Mean Subtraction (CMS) and LDA.

A peak tracker detecting the maxima of the energy estimate is used for normalization.2 Front-End 15 μMFCC = t μMFCC 1 nMFCC − 1 MFCC 1 μt−1 + xMFCC . The same algorithm can be applied to the delta features to compute delta-delta features. Furthermore. These features are also referred as delta features xΔ t 2 and delta-delta features xΔ and are known to significantly enhance speech t recognition accuracy [Young et al. It is a linear transform realized by a matrix multiplication. Class et al. the mean vector μMFCC is estimated according to (2. Input vectors are projected onto those directions which are efficient for discrimination [Duda et al. e. Segura et al.2.g. The dynamic behavior of speech can be partially captured by the first and second order time derivatives of the feature vectors. Delta features are computed by the weighted sum of feature vector differences within a time window of 2 · NΔ frames given by NΔ l · xMFCC − xMFCC t+l t−l xΔ = l=1 (2. t MFCC μ1 and nMFCC are initialized in an off-line training. The mean of xMFCC is realized by a recursion of first t order. 2001]. [1993.13) t NΔ 2 · j=l l2 as found by Young et al..11). nMFCC is limited to guarantee a constant learning rate in the long run. [2004]. 1994] describe an advanced method to iteratively adapt the mean subtraction. 2006]. a dimension reduction can be achieved. Instead. Delta or delta-delta features are not computed in the speech recognizer used for the experiments of this book. [2006]. nMFCC nMFCC t = xMFCC 1 t>1 (2. The feature vectors used for speech recognition may comprise static and dynamic features stacked in a supervector. The dynamics of delta and delta-delta coefficients have been incorporated into the LDA using a bootstrap training [Class et al. A decay factor is chosen so that a faster adaptation during the enrollment of new speakers is achieved. phonemes. In this book. Furthermore. A slower constant learning rate is guaranteed afterwards. the preceding and subsequent 4 vectors are stacked together with the actual feature vector in a long supervector.. 2006]. .11) (2. Further normalization techniques can be found by Barras and Gauvain [2003].12) where the total number of feature vectors is denoted by nMFCC .. MFCCs capture only static speech features since the variations of one frame are considered [Young et al.. The result is a compact representation of each cluster with an improved spatial discrimination with respect to other clusters. 1993]. Mean normalization is not applied to the first coefficient (l = 0) which is replaced by a logarithmic energy estimate as explained above. An LDA is applied in order to obtain a better discrimination of clusters in feature space.

the motivation of speaker change detection is introduced and the corresponding problem is mathematically described. At time instance tCh a speaker change is hypothesized and the algorithm has to decide whether part I and part II . However. First.1 Motivation The task of speaker change detection is to investigate a stream of audio data for possible speaker turns. only MFCCs are considered to limit the computational complexity of the target system including self-learning speaker identification and speech recognition. t An unsupervised system for speaker identification usually encounters difficulties to define reasonable clusters. In Fig. [2004] demonstrate that MFCCs can also be used to detect speaker changes. 2. For example. 3. a representation of speech data suitable for speaker change detection has to be found and a strategy has to be developed to confirm or reject hypothetical speaker changes. The speech decoding works on the resulting vectors xLDA as explained later in Sect.3 Speaker Change Detection The feature extraction described in the preceding section provides a compact representation of speech that allows further automated processing. a data stream is iteratively investigated frame by frame or in data blocks for speaker turns. Thus.2 extracts the relevant features from the time signal and provides a continuous stream of MFCC feature vectors for speech recognition and speaker identification.8 an example is shown. 2.3. Ajmera et al. The most obvious strategy is to cluster the feature vectors of each speaker separately.1 several strategies for audio signal segmentation are outlined. 2. Since the pool of speakers is generally unknown a priori.16 2 Fundamentals Further details can be found by Duda et al. In Sect. Thus. speaker identification usually does not use an LDA. 2. The front-end introduced in Sect. Then a detailed description of one representative which is well-known from literature is presented. other features could be used as well in the subsequent considerations. 2. The goal is to obtain precise boundaries of the recorded utterances so that always only one speaker is enclosed. In the following. One application is speaker change detection which performs a segmentation of an incoming speech signal into homogeneous parts in an unsupervised way. a buffered sequence of a data stream is divided into segments of fixed length and each boundary has to be examined for speaker turns. Many algorithms work on the result of a speaker change detection as the first part of an unsupervised speaker tracking. A sequence of T frames is given in a buffer. Then speaker identification and speech recognition can be applied to utterances containing only one speaker.5. [2001].

• Hypothesis H0 assumes no speaker change so that both sequences of feature vectors x1:tCh and xtCh +1:T originate from one speaker. falsely accepted H0 and correctly detected H1 .14) is called the likelihood ratio. The left side of (2. In Sect. 2001]: a p(x1:T |H0 ) H0 (c10 − c11 ) · p(H1 ) ≥ . originate from one or two speakers. (2. Subsequently. the shorthand notation x1:tCh for the sequence of feature vectors {x1 .1 several strategies are listed to select hypothetical speaker changes. • Hypothesis H1 presumes a speaker turn so that two speakers are responsible for x1:tCh and xtCh +1:T . c00 . For equal prior probabilities the criterion chooses the hypothesis with the highest likelihood.3 Speaker Change Detection 17 H0 I H1 t=1 t = tCh t=T II Fig. . 3. This procedure can be repeated for all possible time instances 1 < tCh < T . . . model-based or decoder-guided techniques. c10 and c11 denote the costs for falsely accepted H1 . The probabilities p(H0 ) and p(H1 ) stand for the prior probabilities of the specified hypothesis. . truly detected H0 . Here it is assumed that a hypothetical speaker turn at an arbitrary time instance tCh has to be examined and that a binary hypothesis test has to be done. the threshold only depends on the prior probabilities. The basic procedure checks a given time interval by iteratively applying metric-based. H0 is accepted by the Bayesian criterion if the following inequality is valid [H¨nsler.8 Problem of speaker change detection. In the case of two competing hypotheses H0 and H1 .14) p(x1:T |H1 ) (c01 − c00 ) · p(H0 ) Here the parameters c01 .2. 2. If the costs for correct decisions equal zero and those for false decisions are equal. A continuous data stream has to be examined for all candidates where a speaker turn can occur. xtCh } is used. . The right side contains the prior knowledge and acts as a threshold. The optimal decision of binary hypothesis tests can be realized by the Bayesian framework. respectively.

Fig. 2.8 denotes the first part before the assumed speaker turn as I and the second part as II.

2.3.2 Bayesian Information Criterion

The so-called Bayesian Information Criterion (BIC) belongs to the most commonly applied techniques for speaker change detection [Ajmera et al., 2004]. There exist several BIC variants and therefore only the key issues are discussed in this section. The existing algorithms for finding speaker turns can be classified into metric-based, model-based and decoder-guided approaches as found by Chen and Gopalakrishnan [1998]. In Sect. 2.3.1 several strategies are listed to select hypothetical speaker changes. In this context only one prominent representative of the model-based algorithms is introduced and discussed. For further details, especially on metric-based or decoder-guided algorithms, the interested reader is referred to the literature.

A statistical model is required to compute the likelihoods p(x_{1:T}|H_0) and p(x_{1:T}|H_1). The binary hypothesis test requires the statistical models for part I and II as well as for the conjunction of both parts I + II. The statistical models for part I and II cover the case of two origins, e.g. two speakers, and thus reflect the assumption of hypothesis H_1. The model built on the conjunction presumes only one origin as is the case in H_0.

Parameter Estimation

Each part is modeled by one multivariate Gaussian distribution which can be robustly estimated even on few data. Θ is the parameter set of a multivariate Gaussian distribution comprising the mean vector and covariance. The probability density function is given by

    p(x_t|Θ) = N{x_t|μ, Σ} = 1 / ((2π)^{0.5d} |Σ|^{0.5}) · exp(−½ (x_t − μ)^T Σ^{−1} (x_t − μ)).   (2.15)

The superscript indices T and −1 denote the transpose and inverse of a matrix. μ, Σ, |Σ| and d denote the mean vector, covariance matrix, the determinant of the covariance matrix and the dimension of the feature vectors, respectively.

Subsequently, all features are usually assumed to be independent and identically distributed (iid) random variables. Even though successive frames of the speech signal are not statistically independent as shown in Sect. 2.2, this assumption simplifies parameter estimation and likelihood computation but is still efficient enough to capture speaker changes. Since the statistical dependencies of the speech trajectory are neglected, the basic statistical models can be easily estimated for each part of the data stream. The mean vector and covariance matrix are unknown a priori and can be determined by the Maximum Likelihood (ML) estimates [Bishop, 2007; Lehn and Wegmann, 2000].

The likelihood computation of feature vector sequences becomes feasible due to the iid assumption. For convenience, only log-likelihoods are examined. The log-likelihood can be realized by the sum of log-likelihoods of each time step

    log p(x_{1:T}|Θ) = Σ_{t=1}^{T} log p(x_t|Θ).   (2.16)

An ergodic random process is assumed so that the expectation value can be replaced by the time average [Hänsler, 2001]. The parameters for the statistical model of the hypothesis H_0 are given by

    μ_{I+II} = E_x{x_t} ≈ (1/T) Σ_{t=1}^{T} x_t   (2.17)
    Σ_{I+II} = E_x{(x_t − μ_{I+II})(x_t − μ_{I+II})^T} ≈ (1/T) Σ_{t=1}^{T} (x_t − μ_{I+II})(x_t − μ_{I+II})^T.   (2.18)

In the case of a speaker turn, the parameters are estimated separately on part I and II as given by

    μ_I = E_x{x_t} ≈ (1/t_Ch) Σ_{t=1}^{t_Ch} x_t   (2.19)
    Σ_I = E_x{(x_t − μ_I)(x_t − μ_I)^T} ≈ (1/t_Ch) Σ_{t=1}^{t_Ch} (x_t − μ_I)(x_t − μ_I)^T.   (2.20)

The parameters μ_II and Σ_II are estimated on part II in analogy to μ_I and Σ_I:

    μ_II ≈ (1/(T − t_Ch)) Σ_{t=t_Ch+1}^{T} x_t   (2.21)
    Σ_II ≈ (1/(T − t_Ch)) Σ_{t=t_Ch+1}^{T} (x_t − μ_II)(x_t − μ_II)^T.   (2.22)

Now the likelihood values p(x_{1:T}|H_0) and p(x_{1:T}|H_1) can be given for both hypotheses, whereas the parameter set Θ is omitted:

    log p(x_{1:T}|H_0) = Σ_{t=1}^{T} log N{x_t|μ_{I+II}, Σ_{I+II}}   (2.23)
    log p(x_{1:T}|H_1) = Σ_{t=1}^{t_Ch} log N{x_t|μ_I, Σ_I} + Σ_{t=t_Ch+1}^{T} log N{x_t|μ_II, Σ_II}.   (2.24)
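The ML estimates (2.17)-(2.22) and the two log-likelihoods (2.23)-(2.24) can be sketched in a few lines of NumPy. This is only an illustration under the stated iid and Gaussian assumptions; it presumes that each part contains enough frames for a stable covariance estimate, and all function names are chosen freely.

```python
import numpy as np

def gaussian_loglik(x, mu, cov):
    """Sum of log N{x_t | mu, cov} over all rows of x, cf. (2.15)-(2.16)."""
    d = x.shape[1]
    diff = x - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ti,ij,tj->t', diff, inv, diff)  # per-frame quadratic form
    return float(np.sum(-0.5 * (d * np.log(2 * np.pi) + logdet + quad)))

def hypothesis_logliks(x, t_ch):
    """Log-likelihoods of H0 and H1 for a change at frame t_ch, cf. (2.23)-(2.24)."""
    part1, part2 = x[:t_ch], x[t_ch:]
    # bias=True divides by the number of frames, matching (2.18), (2.20), (2.22)
    ll_h0 = gaussian_loglik(x, x.mean(0), np.cov(x.T, bias=True))
    ll_h1 = (gaussian_loglik(part1, part1.mean(0), np.cov(part1.T, bias=True))
             + gaussian_loglik(part2, part2.mean(0), np.cov(part2.T, bias=True)))
    return ll_h0, ll_h1
```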

Hypothesis Test

The hypothesis test can now be realized by using these likelihoods and a threshold θ_th:

    log p(x_{1:T}|H_1) − log p(x_{1:T}|H_0) ≷ θ_th.   (2.25)

θ_th replaces the costs and prior probabilities in (2.14) and has to be determined experimentally. This approach is known as the Log Likelihood Ratio (LLR) algorithm [Ajmera et al., 2004]. The hypothesis H_1 assumes a speaker turn and therefore applies two Gaussians in contrast to H_0. The models of H_1 are expected to lead to a better representation of the observed speech data since the number of model parameters is doubled. The BIC criterion extends this technique by introducing a penalty term and does not need a threshold θ_th. The penalty term P compensates this inequality and guarantees an unbiased test [Ajmera et al., 2004]. A speaker turn is assumed if the following inequality is valid

    log p(x_{1:T}|H_1) − log p(x_{1:T}|H_0) − P ≥ 0,   (2.27)

where the penalty

    P = ½ · (d + ½ d(d+1)) · log T   (2.26)

comprises the number of parameters which are used to estimate the mean and covariance. P may be multiplied by a tuning parameter λ which should theoretically be λ = 1 as found by Ajmera et al. [2004].

The training and test data of the Gaussian distributions are identical and permit a more efficient likelihood computation. The BIC criterion then only requires to compute the covariances for both parts and their conjunction as found by Zhu et al. [2005]. Thus, the inequality

    T · log|Σ_{I+II}| − t_Ch · log|Σ_I| − (T − t_Ch) · log|Σ_II| − λ · P ≤ 0   (2.28)

is also known as the variance BIC [Nishida and Kawahara, 2005]. If the factor λ is equal to zero, equation (2.28) describes the LLR algorithm [Ajmera et al., 2004].
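A compact sketch of the resulting BIC test is given below, reusing hypothesis_logliks from the sketch above. The tuning parameter lam corresponds to λ and defaults to the theoretical value 1; this is an illustration, not the exact implementation used in this book.

```python
def bic_speaker_change(x, t_ch, lam=1.0):
    """BIC test, cf. (2.26)-(2.27): returns True if a speaker turn is assumed."""
    T, d = x.shape
    ll_h0, ll_h1 = hypothesis_logliks(x, t_ch)
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(T)  # cf. (2.26)
    return (ll_h1 - ll_h0) - lam * penalty >= 0.0
```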

Even though Gaussian distributions only consist of one mean vector and one covariance matrix, short utterances, e.g. ≤ 2 sec, seem not to provide enough information for a reliable estimation as the results of Zhou and Hansen [2000] suggest. The requirement to estimate robust parameters on limited data does not allow using more complex statistical models. The complexity of the statistical models can be further reduced by using diagonal covariances. The elements of diagonal covariances are equal to zero except for the main diagonal. Σ^{diag}(l_r, l_c) denotes the matrix element of row l_r and column l_c:

    Σ^{diag}(l_r, l_c) = δ_K(l_r, l_c) · σ_{l_r,l_c},   δ_K(l_r, l_c) = { 1 if l_r = l_c; 0 if l_r ≠ l_c }.   (2.29)

σ and δ_K denote the scalar variance and the Kronecker delta. This assumption is a trade-off between the reduced number of parameters and statistical modeling accuracy.

Zhou and Hansen [2000] suggest to use the Hotelling measure instead of Gaussian distributions for very short utterances. Their algorithm estimates only the mean vectors μ_I, μ_II and μ_{I+II} separately. The covariance Σ_{I+II} replaces Σ_I and Σ_II and simplifies the covariance estimation. Alternatively, more sophisticated statistical models incorporating prior knowledge about speech and speaker characteristics can be used as found by Malegaonkar et al. [2007].

In this context it should be emphasized that the use case of this book is a command and control application. The average duration can be less than 2 sec which limits the use of complex algorithms for speaker change detection. The variation of the utterance duration is expected to be high so that the comparison of such utterances seems to be difficult [Nishida and Kawahara, 2005]. Since short utterances do not cover all phonemes sufficiently, text-dependency can be a problem. Furthermore, adverse environments such as in-car applications contain time-variant background noises. Due to the low complexity of the statistical models previously introduced, it seems to be difficult to distinguish properly between changes of the acoustic environment and speaker changes in an unsupervised manner. One Gaussian distribution might not be sufficient to capture the variations [Nishida and Kawahara, 2005], yet more complex models are undesirable with respect to the limited training data.

2.4 Speaker Identification

The discussion of speaker changes is now extended to speaker identification, and the question who is speaking is addressed in this section. Speaker identification is an essential part of the target system since speakers should be tracked across several speaker changes. Speech recognition can then be enhanced for a particular speaker by individual long-term adaptation as explained in the following sections. Furthermore, a technique is necessary to detect whether a speaker is known to the system. Only in combination with a robust detection of new users can a completely unsupervised system be achieved which is able to enroll speakers without any additional training.

In this section a short introduction into speaker identification is given by a selection of the relevant literature. The problem of speaker identification is formulated and a brief survey of the main applications and strategies is provided in Sect. 2.4.1. The main representative of the statistical models used for speaker identification is explained in Sect. 2.4.2.

The training of the statistical model and speaker identification at run-time are described in more detail. Finally, some techniques are discussed to detect unknown speakers.

2.4.1 Motivation and Overview

Speaker identification has found its way into a manifold of speech controlled applications such as automatic transcription of meetings and conferences, information retrieval systems, security applications, access control and law enforcement [Gish and Schmidt, 1994; Reynolds and Rose, 1995]. The main task is to identify speakers by their voice characteristics. For example, MFCC features or mean normalized MFCCs computed by the front-end may be employed for speaker identification [Reynolds, 1995a]. Delta features may be used in addition [Reynolds et al., 2000].

Probabilistic approaches determine statistical models for each speaker so that this problem can be solved in a statistical framework. The current utterance is characterized by a sequence of feature vectors x_{1:T}. Then speaker identification can be realized by the Maximum A Posteriori (MAP) criterion [Gish and Schmidt, 1994]. Speaker identification has to find the speaker who maximizes the posterior probability of being the origin of the recorded utterance:

    i_MAP = argmax_i {p(i|x_{1:T})}.   (2.30)

Variable i and i_MAP denote the speaker index and the corresponding MAP estimate. The MAP criterion can be expressed depending on the likelihood p(x_{1:T}|i) and the prior probability p(i) instead of the posterior probability p(i|x_{1:T}) by applying Bayes' theorem. Bayes' theorem [Bronstein et al., 2000] relates the joint probability p(a, b), the conditional probabilities p(a|b) and p(b|a) as well as the prior probabilities p(a) and p(b) for two arbitrary random variables a and b. It can be expressed by a multiplication of prior and conditional probabilities:

    p(a, b) = p(a|b) · p(b) = p(b|a) · p(a).   (2.31)

The combination of (2.30) and (2.31) likewise describes the identification problem depending on the likelihood. The density function of the spoken phrase p(x_{1:T}) is speaker independent and can be discarded. The resulting criterion is based on a multiplication of the likelihood, which describes the match between the speaker model and the observed data, and the prior probability for a particular speaker:

    i_MAP = argmax_i {p(x_{1:T}|i) · p(i)}.   (2.32)
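As a small illustration of (2.30)-(2.32), the MAP decision reduces to an argmax over per-speaker scores in the log-domain. The helper below is hypothetical and assumes the log-likelihoods have already been computed by the speaker models.

```python
import numpy as np

def map_speaker(logliks, log_priors=None):
    """Return the index i maximizing log p(x_{1:T}|i) + log p(i), cf. (2.32)."""
    scores = np.asarray(logliks, dtype=float)
    if log_priors is not None:
        scores = scores + np.asarray(log_priors, dtype=float)
    # with uniform (or omitted) priors this reduces to the ML criterion
    return int(np.argmax(scores))
```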

Depending on the application different approaches concerning the statistical models and training algorithms exist. Speaker identification can be separated from speaker verification [Campbell, 1997]: Speaker identification has to recognize the voice of a current user without any prior knowledge about his identity whereas speaker verification accepts or rejects a claimed identity. Speaker verification is typically applied to authentication problems [Campbell, 1997]. Speech plays an important role in biometric systems such as financial transactions or access control in security relevant areas [Campbell, 1997]. Speech can serve as a biometric key to secure the access to computer networks and to ensure information security [Reynolds and Carlson, 1995]. For example, the user enters an identity claim and is asked to speak a prompted text or a password [Campbell, 1997]. Speaker verification compares the observed voice characteristics with stored voice patterns of the claimed speaker identity and speech recognition checks the spoken password. Reynolds et al. [2000] provide an introduction into speaker verification which widely applies to speaker identification as well. Speaker identification, in turn, enables the personalization and optimization of speech controlled systems by accounting for user specific habits. Usability can be increased by speaker-adapted user-machine interfaces.

Speaker identification can be further subdivided into closed-set and open-set systems [Gish and Schmidt, 1994]. In closed-set scenarios the current user has to be assigned to one speaker out of a pool of enrolled speakers. In open-set scenarios identification has not only to determine the speaker's identity but has also to decide whether the current user is an enrolled speaker or an unknown speaker.

A further discrimination is given by the general type of statistical model such as non-parametric and parametric models [Gish and Schmidt, 1994]. Gish and Schmidt [1994] provide a general description of different speaker identification techniques and refer to parametric and non-parametric identification. Non-parametric techniques usually admit any kind of distribution function to represent the measured distribution. For example, clustering algorithms such as k-means or Linde-Buzo-Gray (LBG) [Linde et al., 1980] are applied in Vector Quantization (VQ) to capture speaker specific centers of gravity in the feature space. Parametric approaches presume a special kind of statistical model a priori and only have to estimate the corresponding parameters.

Several strategies exist for the training of statistical models. The decision in (2.30) can be directly realized by discriminative speaker models [Bishop, 2007]. The goal is to learn how to separate enrolled speakers without the need to train individual statistical models for each speaker. At run-time speaker identification algorithms, e.g. the Vector Quantization (VQ), rely on distance measures to find the speaker model with the highest match as found by Gish and Schmidt [1994]. The criterion for speaker selection can be given by the MAP criterion in (2.30). However, the training has to be re-computed when a new speaker is enrolled.

In contrast to discriminative models, generative speaker models individually represent the characteristics of the target speaker. They are independent from non-target speakers. The speaker whose model yields the best match with the observed feature vectors is considered as the target speaker. The pool of enrolled speakers can be extended iteratively without re-training of all speaker models.

The Multilayer Perceptron (MLP) as a special case of the Neural Networks (NN) framework is a well-known discriminative model; Bishop [1996, 2007] provides an extensive investigation of the NN framework and MLPs. MLPs have been used for speaker identification as found by Genoud et al. [1999] and Morris et al. [2005]. Reynolds and Rose [1995] employ Radial Basis Functions (RBF) as a reference implementation for their proposed speaker identification framework based on generative models. Reynolds and Rose [1995] investigate the statistical modeling of speaker characteristics with Gaussian mixture models, which is one of the standard approaches for speaker identification based on generative speaker modeling.

A further distinction concerns the spoken text: For verification the current user might be asked to read a prompted text so that the content of the utterance is known. This constraint is known as text-dependent speaker identification or verification [Gish and Schmidt, 1994]. Reynolds and Carlson [1995] describe two approaches for text-dependent realizations to verify a speaker's identity. Text-prompted systems [Che et al., 1996; Furui, 2009], in particular, can ask the user to speak an utterance that is prompted on a display. This minimizes the risk that an impostor plays back an utterance via loudspeaker. Another example may be a limited vocabulary, e.g. digits, in special applications. A larger vocabulary has to be recognized but the transcription is still known. This identification task requires different statistical approaches. In contrast, text-independent identification has to handle arbitrary speech input [Gish and Schmidt, 1994; Kimball et al., 1997; Kwon and Narayanan, 2005]. In the scope of this book users are not restricted to special applications or a given vocabulary so that speaker identification has to be operated in a text-independent mode.

The final design of a speaker identification implementation depends on the application. The scope of this book requires a complete system characterized by a high flexibility to integrate new speakers and to handle individually trained speaker models. Furthermore, applications in an embedded system impose constraints concerning memory consumption and computational load. These requirements suggest a speaker identification approach based on generative speaker models. Thus, only the most dominant generative statistical model is introduced in Sect. 2.4.2. For a more detailed survey on standard techniques for speaker identification or verification, it is referred to Campbell [1997], Gish and Schmidt [1994].

2.4.2 Gaussian Mixture Models

Gaussian Mixture Models (GMMs) have emerged as the dominating generative statistical model in the state-of-the-art of speaker identification [Reynolds et al., 2000]. For example, MFCC features or mean normalized MFCCs may be employed for speaker identification [Reynolds, 1995a]. Delta features may be used in addition [Reynolds et al., 2000]. x_t denotes the feature vector.

GMMs comprise a set of N multivariate Gaussian density functions subsequently denoted by the index k. GMMs are constructed on standard multivariate Gaussian densities as given in (2.15) but introduce the component index k as a latent variable with the discrete probability p(k|i). The weights

    w_k^i = p(k|i)   (2.33)

characterize the prior contribution of the corresponding component to the density function of a GMM and fulfill the condition

    Σ_{k=1}^{N} w_k^i = 1.   (2.34)

Each Gaussian density represents a conditional density function p(x_t|k, i). According to Bayes' theorem, the joint probability density function p(x_t, k|i) is given by the multiplication of both. The sum over all densities results in the multi-modal probability density of GMMs. The resulting probability density function of a particular speaker model i is a convex combination of all density functions:

    p(x_t|Θ_i) = Σ_{k=1}^{N} p(k|Θ_i) · p(x_t|k, Θ_i)   (2.35)
               = Σ_{k=1}^{N} w_k^i · N{x_t|μ_k^i, Σ_k^i}.   (2.36)

Each component density is completely determined by its mean vector μ_k and covariance matrix Σ_k. The parameter set

    Θ_i = {w_1^i, ..., w_N^i, μ_1^i, ..., μ_N^i, Σ_1^i, ..., Σ_N^i}   (2.37)

contains the weighting factors, mean vectors and covariance matrices for a particular speaker model i.

GMMs are attractive statistical models because they can represent various probability density functions if a sufficient number of parameters can be estimated [Reynolds et al., 2000]. Diagonal covariance matrices in contrast to full matrices are advantageous because of their computational efficiency [Reynolds et al., 2000].
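A sketch of the GMM log-density (2.36) for diagonal covariances follows; the log-sum-exp step is a standard numerical safeguard, and the array layout is an assumption of this example rather than a prescribed interface.

```python
import numpy as np

def gmm_log_density(x_t, weights, means, variances):
    """log p(x_t|Theta_i) for a diagonal-covariance GMM, cf. (2.36).
    x_t: (d,); weights: (N,); means, variances: (N, d)."""
    diff2 = (x_t - means) ** 2 / variances
    log_norm = -0.5 * (means.shape[1] * np.log(2 * np.pi)
                       + np.sum(np.log(variances), axis=1))
    log_comp = np.log(weights) + log_norm - 0.5 * np.sum(diff2, axis=1)
    # log-sum-exp over the N component densities for numerical stability
    m = np.max(log_comp)
    return float(m + np.log(np.sum(np.exp(log_comp - m))))
```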

Multivariate Gaussian distributions with diagonal covariance matrices comprise uncorrelated univariate distributions and even imply statistical independence [Hänsler, 2001]. Because of the DCT a low correlation of the MFCC features can be assumed [Quatieri, 2002]. The loss of information can be compensated by a larger set of multivariate Gaussian distributions so that correlated features can be accurately modeled by GMMs [Reynolds et al., 2000]. GMMs based on diagonal covariances have been found to outperform realizations with full covariance matrices [Quatieri, 2002; Reynolds et al., 2000].

Fig. 2.9 exemplifies the likelihood function for a GMM comprising 4 Gaussian distributions with diagonal covariance matrices. For convenience, two dimensional mean and feature vectors are chosen. x_1 and x_2 denote the elements of the feature vector.

Fig. 2.9 Likelihood function of a GMM comprising 4 Gaussian densities (likelihood plotted over the feature space spanned by x_1 and x_2).

The training of GMMs and speaker identification at run-time are described in the following.

Training

Subsequently, it is assumed that speaker specific data can be employed to train a GMM for each speaker. For convenience, the speaker index is omitted. The goal of the training is to determine a parameter set Θ that optimally represents the voice characteristics of a particular speaker. If no prior knowledge about the parameter set is available, the optimization problem is to find a parameter set Θ_ML which maximizes the likelihood

    d/dΘ p(x_{1:T}|Θ) |_{Θ = Θ_ML} = 0   (2.38)

for a given training data set x_{1:T}. Due to the latent variable k, the derivative of p(x_{1:T}|Θ) cannot be solved analytically for its parameters as described in Sect. A.1. The standard approximation is known as the Expectation Maximization (EM) algorithm [Dempster et al., 1977] which is characterized by an iterative procedure. Each iteration provides a new parameter set Θ starting from the parameter set of the preceding iteration Θ̄ or an initial set. Each iteration consists of the following two steps:

• In the E-step all feature vectors are assigned to each Gaussian distribution by computing the posterior probability p(k|x_t, Θ̄).
• In the M-step new ML estimates of the parameters are calculated based on the assignment of the E-step.

The likelihood increases with each iteration until a maximum is reached. An improved statistical modeling is therefore achieved by each iteration, and the likelihood acts as stopping condition. However, the result does not need to be the global optimum since the EM algorithm cannot distinguish between local and global maxima [Bishop, 2007]. Therefore different initialization strategies are described in the literature, e.g. by Gauvain and Lee [1994]. Reynolds and Rose [1995] show that randomized initializations yield results comparable to more sophisticated strategies.

The EM algorithm is equivalently described by maximizing the auxiliary function

    Q_ML(Θ, Θ̄) = E_{k_{1:T}} { log p(x_{1:T}, k_{1:T}|Θ) | x_{1:T}, Θ̄ },   k_{1:T} = {k_1, ..., k_T}   (2.39)

with respect to Θ as found by Bishop [2007]. Under the iid assumption, maximizing

    Q_ML(Θ, Θ̄) = Σ_{t=1}^{T} Σ_{k=1}^{N} p(k|x_t, Θ̄) · log p(x_t, k|Θ)   (2.40)

leads to a locally optimal solution. In the E-step each feature vector is assigned to all Gaussian densities by the posterior probability

    p(k|x_t, Θ̄) = ( w̄_k · N{x_t|μ̄_k, Σ̄_k} ) / ( Σ_{l=1}^{N} w̄_l · N{x_t|μ̄_l, Σ̄_l} )   (2.41)

according to the importance for the training of a particular Gaussian density as found by Reynolds and Rose [1995]. The number of the softly assigned feature vectors is denoted by

    n_k = Σ_{t=1}^{T} p(k|x_t, Θ̄).   (2.42)

In the M-step the mean vectors, covariance matrices and weights are recalculated

    μ_k^ML = (1/n_k) Σ_{t=1}^{T} p(k|x_t, Θ̄) · x_t   (2.43)
    Σ_k^ML = (1/n_k) Σ_{t=1}^{T} p(k|x_t, Θ̄) · x_t x_t^T − μ_k^ML (μ_k^ML)^T   (2.44)
    w_k^ML = n_k / T   (2.45)

based on the result of the E-step as found by Reynolds and Rose [1995]. If the covariance matrices are realized by diagonal matrices, only the main diagonal in (2.44) is retained. A lower bound should be assigned to avoid singularities in the likelihood function leading to infinite likelihood values [Bishop, 2007]. A more detailed description of GMMs and training algorithms can be found by Bishop [2007].

Evaluation

Speaker identification has to decide which user is speaking. Again, each utterance is characterized by a sequence of feature vectors x_{1:T} and statistical dependencies between successive time instances are neglected by the iid assumption. For each speaker i the log-likelihood

    log p(x_{1:T}|i) = Σ_{t=1}^{T} log ( Σ_{k=1}^{N} w_k^i · N{x_t|μ_k^i, Σ_k^i} )   (2.46)

is calculated and the MAP criterion given by (2.32) is applied. The parameter set Θ is omitted. The speaker with the highest posterior probability p(i|x_{1:T}) is identified. If the prior probability for the enrolled speakers is unknown or uniformly distributed, the ML criterion

    i_ML = argmax_i {p(x_{1:T}|i)}   (2.47)

selects the speaker model with the highest match [Reynolds and Rose, 1995]. Under optimal conditions excellent speaker identification rates can be obtained as demonstrated by Reynolds [1995a]. Knowing the speaker identity enables speaker specific speech recognition as described in Sect. 2.5 or offers the possibility to further adapt the corresponding GMMs as explained in Sect. 2.6.3.
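Before moving on, one complete training iteration, combining the E-step (2.41)-(2.42) with the M-step (2.43)-(2.45), can be sketched as follows for diagonal covariances. The variance floor var_floor realizes the lower bound mentioned above; all names are illustrative and this is not claimed to be the implementation used for the experiments of this book.

```python
import numpy as np

def em_iteration(x, weights, means, variances, var_floor=1e-3):
    """One EM iteration for a diagonal-covariance GMM, cf. (2.41)-(2.45).
    x: (T, d); weights: (N,); means, variances: (N, d)."""
    # E-step: posterior p(k|x_t) for every frame and component, cf. (2.41)
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum((x[:, None, :] - means[None, :, :]) ** 2
                               / variances[None, :, :], axis=2))
    log_comp -= log_comp.max(axis=1, keepdims=True)  # stabilize before exp
    post = np.exp(log_comp)
    post /= post.sum(axis=1, keepdims=True)          # (T, N) posteriors
    n_k = post.sum(axis=0)                           # soft counts, cf. (2.42)
    # M-step: new ML estimates, cf. (2.43)-(2.45)
    new_means = (post.T @ x) / n_k[:, None]
    new_vars = (post.T @ (x ** 2)) / n_k[:, None] - new_means ** 2
    new_vars = np.maximum(new_vars, var_floor)       # lower bound vs. singularities
    new_weights = n_k / x.shape[0]
    return new_weights, new_means, new_vars
```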

2.4.3 Detection of Unknown Speakers

The detection of unknown speakers is a critical issue for open-set speaker identification. A generative statistical model for a closed set of enrolled speakers was explained in Sect. 2.4.2. Each speaker participates in a training procedure in which speaker specific utterances are used to train speaker models. Detecting unknown or out-of-set speakers is difficult since unknown speakers cannot be modeled explicitly. Thus, another strategy is necessary to extend the GMM based speaker identification for out-of-set speakers.

The likelihood describes the match between a given statistical model and the observed data. If the speaker's identity does not correspond to a particular speaker model, a low likelihood value is expected. A simple way is to introduce a threshold θ_th for the absolute log-likelihood values given by (2.46) as found by Fortuna et al. [2005]. If all log-likelihoods of the enrolled speakers fall below a predetermined threshold

    log p(x_{1:T}|Θ_i) ≤ θ_th,   ∀i,   (2.48)

an unknown speaker has to be assumed. However, high fluctuations of the absolute likelihood are probable and may affect the threshold decision. For example, phrases spoken in an adverse environment may cause a mismatch between the speaker models and the audio signal due to background noises; in-car applications in particular suffer from adverse environmental conditions such as engine noises. In those cases the log-likelihood ratio appears to be more robust than absolute likelihoods.

Advanced techniques may use normalization techniques comprising a Universal Background Model (UBM) as found by Angkititrakul and Hansen [2007], Fortuna et al. [2005] or cohort models [Markov and Nakagawa, 1996]. The UBM is a speaker independent model which is trained on a large group of speakers. Instead of a threshold for absolute log-likelihoods, the log-likelihood ratios of the speaker models and the UBM can be examined for out-of-set detection [Fortuna et al., 2005; Markov and Nakagawa, 1996]. If the following inequality

    log p(x_{1:T}|Θ_i) − log p(x_{1:T}|Θ_UBM) ≤ θ_th,   ∀i,   (2.49)

is valid for all speaker models, an unknown speaker is likely. This advanced approach has the advantage of lowering the influence of events that affect all statistical models in a similar way. Thus, the influence of fluctuations, e.g. caused by unseen data or the training conditions or by text-dependent fluctuations in a spoken phrase, can be reduced [Fortuna et al., 2005].
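A minimal sketch of the out-of-set test (2.49) follows; the threshold theta is application dependent and has to be tuned experimentally, as discussed above.

```python
def detect_unknown_speaker(speaker_logliks, ubm_loglik, theta=0.0):
    """Out-of-set test, cf. (2.49): if every enrolled model beats the
    UBM by at most theta, an unknown speaker is assumed."""
    return all(ll - ubm_loglik <= theta for ll in speaker_logliks)
```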

Speaker identification for known and unknown speakers can be implemented as a two-stage approach. In the first stage the most probable enrolled speaker is determined, e.g. by the ML criterion [Zhang et al., 2000]. Then the speaker identity is accepted or rejected by speaker verification based on the techniques described above [Angkititrakul and Hansen, 2007].

2.5 Speech Recognition

In the preceding sections several strategies were discussed to track different speakers. The algorithms presented try to recognize either a speaker turn or the speaker identity. Now the question has to be answered how to decode or recognize a spoken utterance since the target system shall be operated by speech commands. First, speech recognition is introduced from a general point of view and the problem of speech decoding is described mathematically. Then a statistical model is presented that has been established as a widely-used approach in Automated Speech Recognition (ASR) [Bishop, 2007]. The basic components and the architecture of the speech recognizer are considered in more detail. Finally, the architecture of the speech recognizer used for the experiments of this book is explained. It is used in the following chapters.

2.5.1 Motivation

For a speech controlled system it is essential to communicate with the user. On the one hand information has to be presented visually or acoustically to the user. On the other hand the user should be able to control the device via spoken commands. Hands-free telephony or the operation of a navigation system are in-car applications where speech recognition enables the user, for example, to speak a sequence of digits, select an entry from an address book or to enter city names.

Speech recognition can be formulated as a MAP estimation problem. A word sequence W_{1:N_W} = {W_1, W_2, ..., W_{N_W}} has to be assigned to an utterance or equivalently a transcription of the spoken phrase has to be generated. N_W denotes the variable number of words within a single utterance. The most probable word sequence

    W_{1:N_W}^MAP = argmax_{W_{1:N_W}} {p(x_{1:T}|W_{1:N_W}) · p(W_{1:N_W})}   (2.50)

has to be determined out of all possible sequences [Schukat-Talamazzini, 1995; Setiawan et al., 2009]. The speech characteristics are represented by a sequence of feature vectors x_{1:T} which is extracted in the front-end of the speech recognizer. For convenience, the LDA features x_t^{LDA} introduced in Sect. 2.2 are used for speech recognition; the index LDA is omitted. In this context the influence of the speaker on speech recognition is temporarily neglected. Speaker variability will be explicitly addressed in the next section. The likelihood p(x_{1:T}|W_{1:N_W}) determines the match between the acoustic speech signal and the statistical model of a particular word sequence.

The probability p(W_{1:N_W}) represents the speech recognizer's prior knowledge about given word sequences. This prior probability reflects how often these word sequences occur in a given language and whether this sequence is reasonable with respect to the grammar of a particular language model.

2.5.2 Hidden Markov Models

As discussed so far, speech characteristics and speaker variability were represented by particular time instances. Statistical dependencies of successive time instances are usually not modeled because of the iid assumption. Only the static features are tracked by GMMs; statistical dependencies across multiple time instances are ignored. The dynamic of speech is therefore neglected. During speech production, however, the vocal tract and the articulatory organs cannot change arbitrarily fast [O'Shaughnessy, 2000]. This becomes evident from Sect. 2.1 since vowels, nasal consonants, fricatives and plosives are produced by a specific configuration of the articulatory organs and excitation. Speech signals can be decomposed into phonemes as the smallest possible entity [Schukat-Talamazzini, 1995]. Phonemes can be defined by a characterizing sequence of states and transitions [O'Shaughnessy, 2000]. For example, 3 states can be used to represent the context to the prior phoneme, the steady-state of the actual phoneme and the context to the successor [O'Shaughnessy, 2000]. This motivates to integrate the dynamic behavior of speech into the statistical model. GMMs can be extended by an underlying hidden statistical model to represent states and transitions. The result is the so-called Hidden Markov Model (HMM). In a first step this underlying model is described and in a second step both models are combined.

Markov Model

The Markov model is the statistical description of a system which can assume different states. It is characterized by a temporal limitation of the statistical dependencies [Hänsler, 2001]. Markov models are called Markov models of first order if each transition only depends on the preceding and current state [Hänsler, 2001]. A Markov model is defined by a set of states s, transition probabilities and an initialization. The joint probability p(s_t, s_{t−1}) is split by Bayes' theorem into a transition probability p(s_t|s_{t−1}) and the probability of the preceding state p(s_{t−1}). The joint probability p(s_t, s_{t−1}) is reduced to the probability p(s_t) by the sum over all previous states s_{t−1}:

    p(s_t) = Σ_{s_{t−1}=1}^{M} p(s_t|s_{t−1}) · p(s_{t−1}),   t > 1,   (2.51)

where M denotes the number of states and the transition probabilities are defined as a_{j1 j2} = p(s_t = j2|s_{t−1} = j1). The transition probabilities a_{j1 j2} denote the transitions from state j1 to state j2, and the initial probability of a particular state is given by p(s_1). Speech production of single phonemes can be modeled, e.g. by a feed-forward chain as given in Fig. 2.10. The states are arranged in a chain and transitions can only occur either back to the original state or to one of the next states in a forward direction. State 2 denotes the steady state of a phoneme whereas the states 1 and 3 represent the preceding and subsequent phoneme.

Fig. 2.10 Example for a Markov chain comprising 3 states organized as a feed-forward chain, with self-transitions a_11, a_22, a_33, forward transitions a_12, a_23 and the skip transition a_13.

Fusion of Markov Model and GMM

An HMM fuses a Markov model and one or more GMMs. The Markov model represents the hidden unobserved random process and its states introduce the latent variable s_t. However, variations of the pronunciation are not captured by the Markov model alone. The emission probability represents the distribution of the observations belonging to the state s_t. It is modeled by a GMM which is subsequently called codebook. Bayes' theorem enables the combination of both random processes in analogy to Sect. 2.4.2. The final probability density function is composed by the superposition of all states of the Markov model and the Gaussian density functions of the GMM:

    p(x_t|Θ) = Σ_{s_t=1}^{M} Σ_{k=1}^{N} p(x_t, k, s_t|Θ)   (2.52)
             = Σ_{s_t=1}^{M} Σ_{k=1}^{N} p(x_t|k, s_t, Θ) · p(k|s_t, Θ) · p(s_t|Θ).   (2.53)

N and k denote the number of Gaussian densities and the corresponding index. The conditional discrete probability p(k|s_t) corresponds to the weights of the GMM and the likelihood function p(x_t|k, s_t) is realized by a Gaussian density comprising mean vector and covariance matrix. The state probability p(s_t) originates from the Markov model given by (2.51). Θ contains all parameters of the HMM including the initial state probabilities, state transitions and the GMM parameters as defined in (2.37). For convenience, the following notation omits the parameter set.

One realization is the Semi-Continuous HMM (SCHMM) [Huang and Jack, 1989; Rabiner and Juang, 1993; Schukat-Talamazzini, 1995]. All states share the mean vectors and covariances of one GMM and only differ in their weighting factors:

    p_SCHMM(x_t) = Σ_{s_t=1}^{M} Σ_{k=1}^{N} w_k^{s_t} · N{x_t|μ_k, Σ_k} · p(s_t).   (2.54)

The superscript index s_t indicates the state dependence of the weight factor. SCHMMs take benefit from the ability of GMMs to approximate arbitrary probability density functions even though only one GMM parameter set is at their disposal. SCHMMs achieve high recognition accuracy. At the same time they are efficient in terms of memory and computational complexity [Huang and Jack, 1989; Schukat-Talamazzini, 1995]. The latter is essential especially for embedded systems. Furthermore, speaker adaptation is straightforward due to the possibility to define codebooks which can be easily modified [Rieck et al., 1992; Schukat-Talamazzini, 1995]. SCHMMs split the parameter set into mean vectors of low complexity and more complex covariances. Speaker adaptation is usually realized by a shift of mean vectors whereas covariances are left unchanged as described in the next section. This separation enables high speech recognition accuracies and makes adaptation efficient.

A further realization is given by the Continuous Density HMM (CDHMM) which supplies a complete GMM parameter set for each state:

    p_CDHMM(x_t) = Σ_{s_t=1}^{M} Σ_{k=1}^{N} w_k^{s_t} · N{x_t|μ_k^{s_t}, Σ_k^{s_t}} · p(s_t).   (2.55)

CDHMMs enable an accurate statistical model for each state at the expense of a higher number of Gaussian distributions. When using full covariance matrices the number of parameters grows considerably; the covariance matrices may therefore be realized by diagonal matrices to limit the number of parameters.

Training and Evaluation

Speech can be modeled by HMMs. For instance, a phoneme can be represented by an HMM comprising 3 states. They reflect the steady state and the transitions from the preceding and to the subsequent phoneme within a spoken phrase.

The corresponding GMMs represent the variability of the pronunciation. The concatenation of models enables speech modeling on a word level. The corpus of all possible words is given by a lexicon. It determines the utterances to be recognized. The prior probabilities of word sequences are given by the language model [Schukat-Talamazzini, 1995].

HMMs can be trained by the Baum-Welch algorithm. In contrast to GMM training, state and transition probabilities have to be included. A detailed description can be found by Rabiner [1989], Rabiner and Juang [1993].

The evaluation of HMMs has to determine the optimal sequence of models. The temporal sequence of states and transitions has to be determined so as to provide the highest match between model and the observed data. Subsequently, the decoding problem is exemplified for only one HMM. The following notation is used: The density function p(x_{1:t}, s_t = j|Θ) is denoted by α_t(j) for all states j = 1, ..., M. The conditional probability density function p(x_t|s_t = j) is given by b_j(x_t) and transitions of the Markov model are represented by a_{lj} = p(s_t = j|s_{t−1} = l). The initialization is given by π_j = p(s_1 = j).

The current state can be estimated by the forward algorithm [O'Shaughnessy, 2000; Rabiner and Juang, 1993; Schukat-Talamazzini, 1995] based on the feature vectors x_{1:t} observed so far and the initial state probability p(s_1). The forward algorithm can be iteratively calculated. The recursion is given by

    α_t(j) = b_j(x_t) · Σ_{l=1}^{M} α_{t−1}(l) · a_{lj},   t > 1,   (2.56)

and the initialization

    α_1(j) = π_j · b_j(x_1).   (2.57)

According to Bayes' theorem, the posterior of state s_t given the history of observations x_{1:t} can be obtained by a normalization of the forward algorithm:

    p(s_t|x_{1:t}, Θ) = p(x_{1:t}, s_t|Θ) / p(x_{1:t}|Θ) ∝ p(x_{1:t}, s_t|Θ).   (2.58)

In addition, the backward algorithm [O'Shaughnessy, 2000; Rabiner and Juang, 1993; Schukat-Talamazzini, 1995] can be applied when the complete utterance of length T is buffered. An algorithm similar to the forward algorithm is applied in reversed temporal order to calculate the likelihood β_t(j) = p(x_{t+1:T}|s_t = j, Θ) for the successive observations x_{t+1:T} given the current state s_t = j. The recursion is given by

    β_t(j) = Σ_{l=1}^{M} a_{jl} · b_l(x_{t+1}) · β_{t+1}(l),   0 < t < T,   (2.59)

and the initialization

    β_T(j) = 1.   (2.60)
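The forward recursion (2.56)-(2.57) translates directly into code; the backward pass (2.59)-(2.60) is analogous in reversed temporal order. The sketch below omits the per-frame rescaling that practical implementations use against numerical underflow, and its array layout is an assumption of this example.

```python
import numpy as np

def forward(b, a, pi):
    """Forward recursion, cf. (2.56)-(2.57).
    b: (T, M) emission likelihoods b_j(x_t); a: (M, M) transitions a_lj;
    pi: (M,) initial state probabilities. Returns alpha: (T, M)."""
    T, M = b.shape
    alpha = np.zeros((T, M))
    alpha[0] = pi * b[0]                      # alpha_1(j) = pi_j * b_j(x_1)
    for t in range(1, T):
        alpha[t] = b[t] * (alpha[t - 1] @ a)  # sum over predecessor states l
    return alpha
```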

The posterior probability p(s_t = j|x_{1:T}) can be calculated by combining the forward and backward algorithm

    p(s_t = j|x_{1:T}) = ( α_t(j) · β_t(j) ) / ( Σ_{l=1}^{M} α_t(l) · β_t(l) ),   0 < t ≤ T,   (2.61)

as described by O'Shaughnessy [2000]. The most probable state

    s_t^MAP = argmax_{s_t} {p(s_t|x_{1:T})}   (2.62)

can then be determined according to the MAP criterion.

To determine the best path in an HMM, which means the most likely sequence of states, the Viterbi algorithm is usually employed [Forney, 1973; O'Shaughnessy, 2000; Rabiner and Juang, 1993; Schukat-Talamazzini, 1995]. In contrast to the forward algorithm, the density function p(x_{1:t}, s_{1:t}|Θ) with s_t = j has to be optimized:

    ϑ_t(j) = max_{s_{1:t−1}} { p(x_{1:t}, s_{1:t}|Θ) | s_t = j }.   (2.63)

The Viterbi algorithm is recursively calculated by

    ϑ_t(j) = max_l { ϑ_{t−1}(l) · a_{lj} · b_j(x_t) },   t > 1,   (2.64)

and the initialization

    ϑ_1(j) = π_j · b_j(x_1).   (2.65)

The Viterbi algorithm is faster than the forward-backward algorithm since only the best sequence of states is considered instead of the sum over all paths [O'Shaughnessy, 2000]. Backtracking is required to determine the best sequence of states as described by Schukat-Talamazzini [1995]. For speech recognition as described by (2.50) this algorithm has to be extended so that transitions between HMMs can be tracked. Path search algorithms such as the Viterbi algorithm can be used to determine the recognition result given by the optimal word sequence W_{1:N_W}^opt = {W_1^opt, ..., W_{N_W}^opt}. Further details about the evaluation of HMMs can be found by Rabiner and Juang [1993], Schukat-Talamazzini [1995].
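A sketch of the Viterbi recursion (2.63)-(2.65) with backtracking follows, computed in the log-domain for numerical robustness; shapes and names follow the forward() sketch above and are assumptions of this example.

```python
import numpy as np

def viterbi(b, a, pi):
    """Viterbi recursion with backtracking, cf. (2.63)-(2.65)."""
    T, M = b.shape
    log_b, log_a = np.log(b), np.log(a)
    delta = np.zeros((T, M))
    psi = np.zeros((T, M), dtype=int)         # best predecessor per state
    delta[0] = np.log(pi) + log_b[0]
    for t in range(1, T):
        trans = delta[t - 1][:, None] + log_a  # (M, M): predecessor l -> state j
        psi[t] = np.argmax(trans, axis=0)
        delta[t] = trans[psi[t], np.arange(M)] + log_b[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):             # backtracking
        path[t] = psi[t + 1][path[t + 1]]
    return path
```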

2.5.3 Implementation of an Automated Speech Recognizer

In this book, a speech recognizer based on SCHMMs is used. Fig. 2.11 shows the speech recognizer setup applied in the following chapters. It can be subdivided into three parts:

• Front-end. A feature vector representation of the speech signal is extracted in the front-end of the speech recognizer. In the front-end a noise reduction is performed, the MFCC features are extracted from the audio signal, a cepstral mean subtraction is calculated and an LDA is applied. For each time instance the vector representation x_t of the relevant speech characteristics is computed.

• Codebook. The feature vectors are evaluated by a speaker independent codebook subsequently called standard codebook. It consists of about 1000 multivariate Gaussian densities. Each feature vector is compared with all Gaussian densities of the codebook. The likelihood computation

    p(x_t|k) = N{x_t|μ_k, Σ_k}   (2.66)

of each Gaussian density with index k does not require a state alignment since only the weighting factors are state dependent. The soft quantization

    q_t = ( q_1(x_t), ..., q_k(x_t), ..., q_N(x_t) )^T   (2.67)
    q_k(x_t) = N{x_t|μ_k, Σ_k} / Σ_{l=1}^{N} N{x_t|μ_l, Σ_l}   (2.68)

contains the matching result represented by the normalized likelihoods of all Gaussian densities [Fink, 2003].

• Speech Decoder. The weights are evaluated in the speech decoder. The recognition task in (2.50) is solved by speech decoding based on q_{1:T}. The acoustic model is realized by Markov chains. In combination with the lexicon and language model, the transcription of the spoken utterance is obtained as depicted in Fig. 2.12. In addition, special models are used to capture garbage words and short speech pauses not removed by the speech segmentation. Garbage words might contain speech which does not contribute to the speech recognition result, e.g. when a speaker hesitates within an utterance, mispronounces single words, coughs or clears his throat. Utterances can be rejected when the confidence of the recognition result, given by the posterior probability p(W_{1:N_W}|x_{1:T}), does not exceed a pre-determined threshold.

Fig. 2.11 Block diagram of the SCHMM-based speech recognizer used for the experiments of this book (front-end → x_t → codebook → q_t → decoder). Figure is taken from [Herbig et al., 2010c].
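The soft quantization (2.67)-(2.68) amounts to normalizing the codebook likelihoods per frame. A small log-domain sketch is given below; the input layout is an assumption of this example and not a documented interface of the recognizer.

```python
import numpy as np

def soft_quantization(log_densities):
    """Normalized codebook scores q_t, cf. (2.67)-(2.68).
    log_densities: (N,) values log N{x_t|mu_k, Sigma_k} of all codebook Gaussians."""
    m = np.max(log_densities)
    q = np.exp(log_densities - m)   # subtract the maximum for numerical stability
    return q / np.sum(q)
```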

Fig. 2.12 Components of the speech decoding comprising the acoustic model, lexicon and language model.

2.6 Speaker Adaptation

In Sect. 2.2 speech production was discussed as a source-filter model. Speech characteristics were also briefly described which are important for speech decoding. Speaker variability has its origins in a combination of gender and speaker specific excitation source, physical anatomy of the vocal tract and acquired speaking habits, e.g. speaking rate [Campbell, 1997]. The latter can affect speaker identification and speech recognition. In the preceding sections several techniques were introduced to deal with speaker and speech variability. In practical applications the problem of unseen conditions or time-variant speaker and speech characteristics occurs. Since a re-training is usually not feasible at run-time, speaker adaptation provides an alternative method to adjust or initialize statistical models. For each speaker an individually adapted statistical model can be given.

In this section several adaptation strategies are discussed. First, the application of speaker adaptation is motivated and possible sources of speech variations are given. Several applications for speaker adaptation in speaker identification and speech recognition are described. Then two of the main representatives for long-term speaker adaptation are introduced. They enable the estimation of a high number of adaptation parameters and can be equally applied for speaker identification and speech recognition. The differences between the adaptation of GMMs and HMMs are emphasized, and a simplified notation is introduced to describe both adaptation scenarios. In contrast to these representatives, an adaptation algorithm is described that excels in fast convergence. This algorithm is suitable for adaptation on limited training data since it only requires a few parameters. At the same time the minimal number of adaptation parameters restricts the algorithm's accuracy for long-term adaptation [Botterweck, 2001; Stern and Lasry, 1987].

Finally, an adaptation strategy is introduced which integrates short-term and long-term adaptation capabilities. The optimal transition between short-term and long-term adaptation is achieved at the expense of a higher computational load [Jon et al., 2001].

2.6.1 Motivation

In the preceding sections several statistical models and training algorithms were presented to handle speech and speaker related information. The practical realization of the statistical models described in the preceding sections often suffers from mismatches between training and test conditions when the training has to be done on incomplete data that do not represent the actual test situation. In general, pattern recognition on speech encounters several sources which are responsible for a mismatch. Among those, there are the channel impulse response as well as background noises. Equation (2.4) reveals that the speaker's vocal tract and environmental influences may cause variations. Even though the noise reduction of the front-end reduces background noises and smooths the power spectrum, the residual noise and channel distortions can affect the speaker identification and speech recognition accuracy. Enhancement or normalization techniques, e.g. cepstral mean normalization, can be applied to reduce unwanted variations such as channel characteristics.

Speaker adaptation offers the opportunity to moderate those mismatches and to obtain improved statistical representations. In the case of speaker specific variations speaker adaptation integrates them into the statistical modeling to initialize or enhance a speaker specific model. Speaker adaptation thus provides an alternative way to create adjusted statistical models. It starts from a preceding or initial parameter set Θ̄ and provides an optimized parameter set Θ for each speaker based on training data. The speaker index is omitted. Optimization can be achieved by the ML or MAP criterion which increases either the likelihood or the posterior probability for the recorded utterance x_{1:T}:

    Θ_ML = argmax_Θ {p(x_{1:T}|Θ)}   (2.69)
    Θ_MAP = argmax_Θ {p(x_{1:T}|Θ) · p(Θ)}   (2.70)

as found by Gauvain and Lee [1994]. The same problem has to be solved as for the training algorithm in Sect. 2.4.2. Incomplete data have to be handled because of latent variables. This can be implemented by the EM algorithm as described in Sect. A.1.

2.6.2 Applications for Speaker Adaptation in Speaker Identification and Speech Recognition

GMMs were introduced in Sect. 2.4.2 as common statistical models to capture speaker characteristics. They are trained by the EM algorithm to obtain an optimal statistical representation. However, in the literature there exist alternative methods to obtain speaker specific GMMs. According to Nishida and Kawahara [2005] adapting a Universal Background Model (UBM) to a particular speaker can be viewed as one of the most frequently applied techniques to build up speaker specific GMMs. Speaker adaptation for speaker identification purely aims at a better representation of the speaker's static voice characteristics with respect to the likelihood or posterior probability.

In Sect. 2.5 speech recognition was introduced from a speaker independent point of view. Many speech recognizers are extensively trained on a large group of speakers and thus do not represent the true speech characteristics of particular speakers [Zavaliagkos et al., 1995]. In the literature a further important application is therefore updating the codebook of a speech recognizer [Zavaliagkos et al., 1995]. In realistic applications speaker adaptation can enhance the robustness of the ASR. The recognition rate can be significantly increased as demonstrated by Gauvain and Lee [1994], Kuhn et al. [2000]. Codebook adaptation is intended to optimize the recognition of spoken phrases and therefore includes the temporal speech dynamics as a further aspect.

The preceding sections showed that GMMs are a special case of HMMs. Therefore similar techniques can be applied to adapt GMMs and HMMs. Subsequently, only speech recognizers based on SCHMMs are investigated in this book.

The capability of adaptation algorithms depends on the number of available parameters. With a higher number of parameters a more individual and accurate representation of the voice pattern of a particular speaker can be achieved. A high number of parameters, however, requires a large amount of adaptation data. Otherwise over-fitting can occur since the statistical model starts to represent insignificant properties and loses the ability to generalize [Duda et al., 2001]. Thus, either different approaches are needed for special applications depending on the amount of data or an integrated strategy has to be found.

In contrast to the realization of the EM algorithm in Sect. 2.4.2, just one iteration or only a few ones are performed. The first iterations usually contribute most to the final solution so that a reduction of the computational complexity is achieved by this restriction. Furthermore, the risk of over-fitting can be reduced. Equations (A.3) and (A.5) form the base of the following algorithms.

Codebooks can be optimized by taking advantage of the speech recognizer's knowledge about the spoken utterance. During speech decoding an optimal state sequence ŝ_{1:T} is determined by the Viterbi algorithm. Since speaker adaptation follows the speech decoding in the realization of this book, a two-stage approach becomes feasible. One problem of such a two-stage approach is the fact that transcription errors can negatively affect the speaker adaptation accuracy, especially when a large number of free parameters has to be estimated. Thus confidence measures may be employed on a frame or word level to increase the robustness against speech recognition errors [Gollan and Bacchiani, 2008].

The state sequence of an observed utterance is given by ŝ_{1:T} = {ŝ_1, ŝ_2, ..., ŝ_T}. Hence, the optimization problem can be solved in analogy to the segmental MAP algorithm

    Θ^{seg MAP} = argmax_Θ max_{s_{1:T}} {p(x_{1:T}, s_{1:T}|Θ) · p(Θ)}   (2.71)

or equivalently by the following recursion

    ŝ_{1:T} = argmax_{s_{1:T}} p(x_{1:T}, s_{1:T}|Θ̄)   (2.72)
    Θ = argmax_Θ {p(x_{1:T}, ŝ_{1:T}|Θ) · p(Θ)}   (2.73)

as found by Gauvain and Lee [1994]. Several iterations may be performed to calculate Θ^{seg MAP}. The following approximation is used in this book: The speech recognition result is integrated in the codebook optimization by the state dependent weights of the shared GMM since only SCHMMs are considered. The auxiliary functions of the EM algorithm in (A.3) and (A.5) have to be re-written as

    Q^{seg MAP}(Θ, Θ̄) = Q^{seg ML}(Θ, Θ̄) + log p(Θ)   (2.74)
    Q^{seg ML}(Θ, Θ̄) = Σ_{t=1}^{T} Σ_{k=1}^{N} p(k|ŝ_t, x_t, Θ̄) · log p(x_t, k|ŝ_t, Θ)   (2.75)
                     = Σ_{t=1}^{T} Σ_{k=1}^{N} p(k|ŝ_t, x_t, Θ̄) · log p(x_t|k, ŝ_t, Θ) + Σ_{t=1}^{T} Σ_{k=1}^{N} p(k|ŝ_t, x_t, Θ̄) · log p(k|ŝ_t, Θ)   (2.76)

to integrate the knowledge of the speech transcription in the codebook adaptation. Θ̄ and Θ denote the parameter set of the preceding iteration and the adapted parameter set, respectively.

When speaker adaptation is restricted to adapt only the mean vectors μ_l, l = 1, ..., N, the optimization of the auxiliary function can be simplified for SCHMMs:

    d/dμ_l Q^{seg ML}(Θ, Θ̄) = Σ_{t=1}^{T} Σ_{k=1}^{N} p(k|ŝ_t, x_t, Θ̄) · d/dμ_l log p(x_t|k, Θ).   (2.77)

The likelihood p(x_t|k, Θ) of a particular Gaussian distribution does not depend on the current state s_t and the prior probability p(k|s_t, Θ) = w_k^{s_t} is independent from the mean vectors μ_l.

The difference between the adaptation of GMMs and codebooks of SCHMMs now becomes obvious: The posterior probability p(k|x_t, Θ̄) determines the assignment of the observed feature vector x_t to the Gaussian distributions of a GMM. The posterior probability p(k|ŝ_t, x_t, Θ̄), integrating the transcription of the spoken phrase, is used for codebook adaptation. It provides more information about the speech signal compared to p(k|x_t, Θ̄). In addition, the E-step does not only check whether a feature vector is assigned to a particular Gaussian distribution. The contribution to the recognition result thus gains importance for speaker adaptation.

Speaker adaptation normally modifies only the codebooks of a speech recognizer. Adapting the transitions of the Markov chain is usually discarded because the emission densities are considered to have a higher effect on the speech recognition rate compared to the transition probabilities [Dobler and Rühl, 1995]. The transition probabilities reflect the durational information [O'Shaughnessy, 2000]. For example, they contain the speech rate and average duration between the start and end state of a feed-forward chain as shown in Fig. 2.10. It is referred to Dobler and Rühl [1995], Gauvain and Lee [1994] for adapting the Markov transitions. The reader is referred to Gauvain and Lee [1994], Matrouf et al. [2001] for adaptation scenarios modifying also the covariance matrices of an HMM.

Because of the two-staged approach a simplified notation can be used for both applications in speaker identification and speech recognition: The state variable is omitted in the case of a speech recognizer's codebook to obtain a compact representation. Thus, codebooks and GMMs can be equally considered when adaptation is described in this book. Subsequently, the discussion is focused on GMM and codebook adaptation.

2.6.3 Maximum A Posteriori

Maximum A Posteriori and Maximum Likelihood Linear Regression can be viewed as standard adaptation algorithms [Kuhn et al., 2000]. They both belong to the most frequently used techniques to establish speaker specific GMMs by adapting a UBM on speaker specific data [Nishida and Kawahara, 2005]. Codebook adaptation of a speech recognizer is a further important application as already discussed in Sect. 2.6.2.

MAP adaptation, also known as Bayesian learning [Gauvain and Lee, 1994], is a standard approach. It integrates prior knowledge p(Θ) and the ML estimates Θ_ML based on the observed data [Stern and Lasry, 1987]. It is directly derived from the extended auxiliary function

    Q^MAP(Θ, Θ̄) = Q^ML(Θ, Θ̄) + log p(Θ)   (2.78)

where Θ̄ denotes an initial or speaker specific parameter set. Equation (2.78) has to be optimized with respect to the modified parameter set Θ. Gauvain and Lee [1994] provide a detailed derivation and the mathematical background of the Bayesian learning algorithm so that only the results are given here.

MAP adaptation starts from a speaker independent UBM as the initial parameter set Θ_UBM or alternatively from a trained speaker specific GMM. The UBM parameters μ_k^UBM, Σ_k^UBM and w_k^UBM subsequently contain the prior knowledge about the mean vectors, covariance matrices and the weights. The feature vectors are viewed in the context of the iid assumption and statistical dependencies among the Gaussian densities are neglected, so that p(Θ) can be factorized [Zavaliagkos et al., 1995]. This allows individual adaptation equations for each Gaussian distribution; the algorithm is characterized by individual adaptation of each Gaussian density [Gauvain and Lee, 1994; Reynolds et al., 2000].

The following two steps are performed: First, a new parameter set of ML estimates Θ_ML is calculated by applying one E-step and one M-step of the EM algorithm. Thus, the equations of the ML estimation in Sect. 2.4.2 are modified so that the UBM determines the feature vector assignment and acts as the initial model:

    p(k|x_t, Θ_UBM) = ( w_k^UBM · N{x_t|μ_k^UBM, Σ_k^UBM} ) / ( Σ_{l=1}^{N} w_l^UBM · N{x_t|μ_l^UBM, Σ_l^UBM} )   (2.79)
    n_k = Σ_{t=1}^{T} p(k|x_t, Θ_UBM)   (2.80)
    μ_k^ML = (1/n_k) Σ_{t=1}^{T} p(k|x_t, Θ_UBM) · x_t   (2.81)
    Σ_k^ML = (1/n_k) Σ_{t=1}^{T} p(k|x_t, Θ_UBM) · (x_t − μ_k^ML)(x_t − μ_k^ML)^T   (2.82)
    w_k^ML = n_k / T.   (2.83)

The ML estimates are denoted by the superscript ML. N and n_k denote the total number of Gaussian densities and the number of feature vectors softly assigned to a particular Gaussian density, respectively. When an iterative procedure is applied, Θ_UBM has to be replaced by Θ̄ which represents the parameter set of the preceding iteration.

The UBM determines the feature vector assignment and acts as the initial model. After the ML estimation two sets of parameters exist: The ML estimates Θ^ML represent the observed data whereas the initial parameters Θ^UBM comprise the prior distribution. For adaptation a smooth transition between Θ^UBM and Θ^ML has to be found depending on these observations. This problem is solved in the second step by a convex combination of the initial parameters and the ML estimates resulting in the optimized parameter set Θ^MAP. The associated equations

$\alpha_k = \frac{n_k}{n_k + \eta}$   (2.84)

$\mu_k^{MAP} = (1 - \alpha_k) \cdot \mu_k^{UBM} + \alpha_k \cdot \mu_k^{ML}$   (2.85)

$\Sigma_k^{MAP} = (1 - \alpha_k) \cdot \Sigma_k^{UBM} + \alpha_k \cdot \Sigma_k^{ML}$   (2.86)

$w_k^{MAP} \propto (1 - \alpha_k) \cdot w_k^{UBM} + \alpha_k \cdot w_k^{ML}, \quad \sum_{l=1}^{N} w_l^{MAP} = 1$   (2.87)

can be derived from (2.78) and can be found by Reynolds et al. [2000]. When an iterative procedure is applied, Θ^UBM has to be replaced by Θ̄ which represents the parameter set of the preceding iteration. The number of softly assigned feature vectors n_k controls for each Gaussian density the weighting of this convex combination. Small values of n_k cause the MAP estimate to resemble the prior parameters. The influence of the new estimates Θ^ML increases with higher n_k. The weight computation w_k^MAP introduces a constant factor to assure that the resulting weights are valid probabilities and sum up to unity. The constant η contains prior knowledge about the densities of the GMM [Gauvain and Lee, 1994]. Reynolds et al. [2000] show that the choice of this constant η is not sensitive for the performance of speaker verification and can be selected in a relatively wide range. In practical applications η can therefore be set irrespectively of this prior knowledge. Adapting the mean vectors seems to have the major effect on the speaker verification results [Reynolds et al., 2000]. Thus, only mean vectors and if applicable weights can be adapted as a compromise between enhanced identification rate and computational load.
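The convex combination itself is only a few lines. The sketch below (again hypothetical naming, assuming the statistics of the previous example) implements (2.84)-(2.87); the relevance constant eta is a free design parameter, in line with the observation that its exact value is not critical:

```python
import numpy as np

def map_update(n, w_ml, mu_ml, sigma_ml, w0, mu0, sigma0, eta=16.0):
    """MAP combination of UBM parameters and ML estimates, cf. (2.84)-(2.87)."""
    alpha = n / (n + eta)                                      # (2.84), per density
    mu_map = (1 - alpha)[:, None] * mu0 + alpha[:, None] * mu_ml          # (2.85)
    sigma_map = ((1 - alpha)[:, None, None] * sigma0
                 + alpha[:, None, None] * sigma_ml)                       # (2.86)
    w_map = (1 - alpha) * w0 + alpha * w_ml
    w_map /= w_map.sum()                                       # normalization, (2.87)
    return w_map, mu_map, sigma_map
```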

Fig. 2.13(a) and Fig. 2.13(b) illustrate the procedure of the MAP adaptation in analogy to Reynolds et al. [2000]: Fig. 2.13(a) displays the original GMM as ellipses. The centers represent mean vectors and the main axes of the ellipses correspond to variances. The crosses indicate the observed feature vectors. Fig. 2.13(b) shows the adapted Gaussian densities as filled ellipses and the original density functions as unfilled ellipses. The Gaussian densities at the top are individually adapted with respect to the assigned feature vectors. The resulting ellipses cover the feature vectors and parts of the original ellipses indicating the convex combination in (2.85). One density function remains unchanged since no feature vectors are assigned.

Fig. 2.13 Example for the MAP adaptation of a GMM comprising 3 Gaussian densities. The observed feature vectors are depicted as crosses. (a) Filled ellipses display the original Gaussian densities of the GMM. (b) Adapted Gaussian densities are depicted as filled ellipses whereas the initial Gaussian densities of the GMM are denoted by unfilled ellipses (dotted line).

Finally, the advantages and drawbacks of the MAP adaptation should be discussed: Individual adaptation is achieved on extensive adaptation data which enables high adaptation accuracy. If the constant η is selected appropriately, MAP adaptation is robust against over-fitting caused by limited speaker specific data. For n_k → 0 no adaptation occurs or only a few parameters are marginally adapted. However, inefficient adaptation of untrained Gaussian densities has to be expected resulting in slow convergence [Zavaliagkos et al., 1995]. Thus, MAP adaptation is important for long-term adaptation and suffers from inefficient adaptation on limited data [Kuhn et al., 2000].

2.6.4 Maximum Likelihood Linear Regression

Approaches for speaker adaptation such as MAP adaptation, where each Gaussian distribution is adapted individually, require a sufficient amount of training data to be efficient. In general the number of adaptation parameters has to be balanced with the amount of speaker specific data. One way is to estimate a set of model transformations for classes of Gaussian distributions to capture speaker characteristics. Several publications such as Gales and Woodland [1996], Leggetter and Woodland [1995a] describe Maximum Likelihood Linear Regression (MLLR) which is widely used in speaker adaptation. The key idea of MLLR is to represent model

adaptation by a linear transformation. For example, mean vector adaptation may be realized for each regression class r by a multiplication of the original mean vector μ_kr with a matrix A_r and an offset vector b_r. The optimized mean vector μ_kr^MLLR is given by

$\mu_{k_r}^{MLLR} = A_r^{opt} \cdot \mu_{k_r} + b_r^{opt}$   (2.88)

which can be rewritten as

$\mu_{k_r}^{MLLR} = W_r^{opt} \cdot \zeta_{k_r}$   (2.89)

$W_r^{opt} = (b_r^{opt} \;\; A_r^{opt})$   (2.90)

$\zeta_{k_r} = (1 \;\; \mu_{k_r}^T)^T.$   (2.91)

The goal of MLLR is to determine an optimal transformation matrix W_r^opt in such a way that the new parameter set maximizes the likelihood of the training data. This problem may be solved by maximizing the auxiliary function Q_ML(Θ, Θ̄) of the EM algorithm. In this context Θ̄ denotes the initial parameter set, e.g. the standard codebook Θ^0, or the parameter set of the previous iteration when an iterative procedure is applied. The optimal transformation matrix W_r^opt is given by the solution of the following set of equations:

$\sum_{t=1}^{T} \sum_{r=1}^{R} p(k_r|x_t, \bar{\Theta}) \cdot \Sigma_{k_r}^{-1} \cdot x_t \cdot \zeta_{k_r}^T = \sum_{t=1}^{T} \sum_{r=1}^{R} p(k_r|x_t, \bar{\Theta}) \cdot \Sigma_{k_r}^{-1} \cdot W_r^{opt} \cdot \zeta_{k_r} \cdot \zeta_{k_r}^T$   (2.92)

as found by Gales and Woodland [1996]. p(k_r|x_t, Θ̄) denotes the posterior probability where each feature vector x_t is assigned to a particular Gaussian density k_r within a given regression class r. The covariance matrices may be adapted in a similar way as described by Gales and Woodland [1996].

In this book, MLLR is not considered. MAP adaptation is viewed as an appropriate candidate for long-term adaptation since a highly specific speaker modeling can be achieved on extensive data. Furthermore, the combination of short-term and long-term adaptation in Sect. 4.2 can be intuitively explained with MAP adaptation.
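Although MLLR is not pursued in this book, the estimation of (2.88)-(2.92) can be illustrated compactly. The following sketch assumes a single regression class and diagonal covariance matrices, in which case the rows of W decouple; the accumulated statistics gamma and mean_obs are assumed to come from a preceding E-step:

```python
import numpy as np

def mllr_mean_transform(gamma, mean_obs, means, var_diag):
    """Estimate W = (b, A) from (2.92) for one regression class,
    assuming diagonal covariances so that the rows of W decouple.

    gamma: (N,) soft counts sum_t p(k|x_t); mean_obs: (N, D) posterior-
    weighted data means; means: (N, D) original mu_k; var_diag: (N, D).
    """
    N, D = means.shape
    zeta = np.hstack([np.ones((N, 1)), means])    # extended vectors, cf. (2.91)
    W = np.empty((D, D + 1))
    for i in range(D):                            # one row of W per dimension
        scale = gamma / var_diag[:, i]            # (N,)
        G = (zeta * scale[:, None]).T @ zeta      # sum_k scale_k zeta_k zeta_k^T
        z = ((scale * mean_obs[:, i])[:, None] * zeta).sum(axis=0)
        W[i] = np.linalg.solve(G, z)
    return W

# adapted mean of density k: W @ np.concatenate(([1.0], mu_k)), cf. (2.89)
```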

2.6.5 Eigenvoices

The MAP adaptation has its strength in long-term speaker adaptation whereas the Eigenvoice (EV) approach is advantageous in the case of limited training data as shown by Kuhn et al. [2000]. This technique benefits from prior knowledge about the statistical dependencies between all Gaussian densities. In an off-line training step codebook adaptation is performed for a large pool of speakers. The main directions of variations in the feature space are extracted. They are represented by eigenvectors which are subsequently called eigenvoices. At run-time codebook adaptation can be efficiently implemented by a linear combination of all eigenvoices. Only the weighting factors have to be determined to modify the model parameters in an optimal manner. Thus, this algorithm significantly reduces the number of adaptation parameters. There are about 25,000 parameters to be adapted when the MAP adaptation is applied to the codebook of the speech recognizer introduced in Sect. 2.5. The EV technique in the realization of this book only needs 10 parameters to efficiently adapt codebooks. GMMs for speaker identification can be realized by significantly fewer parameters as will be shown later. When prior knowledge about speaker variability can be employed, even very few training data, e.g. short command and control utterances, allow estimating some scalar parameters so that the codebook of a speech recognizer can be efficiently adapted by some 10 adaptation parameters [Kuhn et al., 2000]. A more detailed investigation of the EV approach can be found by Kuhn et al. [2000], Thyes et al. [2000].

Training

In an off-line training step N_Sp speaker specific codebooks are trained. Each speaker with index i provides a large amount of training utterances so that an efficient long-term speaker adaptation is possible. For convenience, the MAP algorithm introduced in Sect. 2.6.3 is considered in the following. The origin of all speaker specific models is given here by the standard codebook with parameter set Θ^0. The result is a speaker specific set of mean vectors

$\mu_k^0 \xrightarrow{MAP} \mu_k^1, \ldots, \mu_k^{N_{Sp}}, \quad k = 1, \ldots, N.$   (2.93)

In total, N Gaussian densities are employed. For convenience, this presentation only considers the mean vector adaptation of codebooks but may be extended to weights or covariance matrices. Subsequently, all mean vectors are arranged in supervectors

$\check{\mu}^i = \big( (\mu_1^i)^T, \ldots, (\mu_k^i)^T, \ldots, (\mu_N^i)^T \big)^T$   (2.94)

which are grouped by the component index k.

To obtain a compact notation, the result of (2.93) is represented by supervectors

$\check{\mu}^0 \xrightarrow{MAP} \check{\mu}^1, \check{\mu}^2, \ldots, \check{\mu}^{N_{Sp}}$   (2.95)

indicated by the superscript in μ̌ and the missing index k. An example of the supervector notation is displayed in Fig. 2.14.

Fig. 2.14 Example for the supervector notation. All mean vectors of a codebook are stacked into a long vector.

In the following, a pool of diverse speakers is considered. Now the training has to obtain the most important adaptation directions which can be observed in this speaker pool. For convenience, the mean over all speaker specific supervectors is assumed to be identical to the supervector of the speaker independent mean vectors given by the standard codebook

$E_i\{\check{\mu}^i\} = \check{\mu}^0.$   (2.96)

An additional offset vector is therefore omitted here. The covariance matrix

$\check{\Sigma}^{EV} = E_i\{ (\check{\mu}^i - \check{\mu}^0) \cdot (\check{\mu}^i - \check{\mu}^0)^T \}$   (2.97)

is calculated and the eigenvectors of Σ̌^EV are extracted using PCA⁷ as described by Jolliffe [2002]. The corresponding eigenvectors ě_l^EV fulfill the condition

$\check{\Sigma}^{EV} \cdot \check{e}_l^{EV} = \lambda_l^{EV} \cdot \check{e}_l^{EV}, \quad 1 \le l \le N_{Sp}$   (2.98)

since the covariance matrix is symmetric, Σ̌^EV = (Σ̌^EV)^T, by its definition [Bronstein et al., 2000; Meyberg and Vachenauer, 2003]. The eigenvectors can be chosen as orthonormal:

$(\check{e}_{l_1}^{EV})^T \cdot \check{e}_{l_2}^{EV} = \delta_K(l_1, l_2), \quad 1 \le l_1, l_2 \le N_{Sp}.$   (2.99)

The eigenvalues are denoted by λ_l^EV. They characterize the variance of the observed data projected onto these eigenvectors if the eigenvectors are normalized [Jolliffe, 2002]. High values indicate the important directions in the sense of an efficient speaker adaptation. The eigenvectors ě_l^EV are sorted along decreasing eigenvalues and the following considerations focus only on the first L eigenvectors. The number of eigenvoices is limited by the number of speakers N_Sp. This book makes use of only L = 10 eigenvoices similar to Kuhn et al. [2000] because the corresponding eigenvalues diminish rapidly. The first eigenvector usually represents gender information and enables gender detection as the results of Kuhn et al. [2000] suggest.

⁷ Realizations of the PCA which are based on the correlation matrix also exist [Jolliffe, 2002].

Up to this point the principal directions of speaker adaptation are represented as supervectors. At run-time the EV speaker adaptation and especially the combination of short-term and long-term adaptation in Sect. 4.2 become more feasible when each Gaussian density is individually represented. For convenience, the representation of the eigenvoices for each Gaussian density is used. Each supervector ě_l^EV separates into its elements

$\check{e}_l^{EV} = \big( (e_{1,l}^{EV})^T, \ldots, (e_{k,l}^{EV})^T, \ldots, (e_{N,l}^{EV})^T \big)^T, \quad l = 1, \ldots, L$   (2.100)

where the index k indicates the assignment to a particular Gaussian density.
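The off-line training just described can be summarized as a PCA over centered supervectors. The following sketch (hypothetical helper names) uses the SVD, whose right-singular vectors are the orthonormal eigenvectors of (2.97) sorted by decreasing eigenvalue:

```python
import numpy as np

def train_eigenvoices(speaker_means, mean0, L=10):
    """Extract L eigenvoices from speaker specific mean vectors (off-line).

    speaker_means: (N_sp, N, D) MAP-adapted means per speaker,
    mean0: (N, D) standard codebook means, L: number of eigenvoices kept.
    Returns eigenvoices of shape (L, N, D), one block e_{k,l} per density k.
    """
    n_sp, N, D = speaker_means.shape
    # supervectors: one long vector per speaker, cf. (2.94), centered by (2.96)
    M = speaker_means.reshape(n_sp, N * D) - mean0.reshape(1, N * D)
    # PCA via SVD: rows of Vt are orthonormal eigenvectors of (2.97),
    # already sorted by decreasing singular value, i.e. eigenvalue
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    eigvals = s**2 / n_sp
    return Vt[:L].reshape(L, N, D), eigvals[:L]
```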

Adaptation

At run-time the optimal combination of the eigenvoices has to be determined to obtain a set of adapted parameters. The adapted mean vector μ_k^EV results from a linear combination of the original speaker independent mean vector μ_k^0 and the eigenvoices e_{k,l}^EV:

$\mu_k^{EV} = \mu_k^0 + \sum_{l=1}^{L} \alpha_l \cdot e_{k,l}^{EV}, \quad k = 1, \ldots, N.$   (2.101)

For adaptation the optimal scalar weighting factors α_l have to be determined. In order to obtain a comprehensive notation the matrix

$M_k = (e_{k,1}^{EV}, \ldots, e_{k,L}^{EV})$   (2.102)

containing all eigenvoices is introduced. The weight vector α = (α_1, ..., α_L)^T is employed to represent (2.101) by a matrix product and an offset vector:

$\mu_k^{EV} = \mu_k^0 + M_k \cdot \alpha.$   (2.103)

The optimal combination of the eigenvoices is obtained by inserting the linear combination (2.103) into the auxiliary function of the EM algorithm. Standard approaches solve this problem by optimizing

$Q_{ML}(\Theta, \Theta^0) = \sum_{t=1}^{T} \sum_{k=1}^{N} p(k|x_t, \Theta^0) \cdot \left[ \log p(x_t|k, \Theta) + \log p(k|\Theta) \right]$   (2.104)

as demonstrated by Kuhn et al. [1998]. In this context, Θ contains the weighting factors w_k^0, the mean vectors μ_k^EV depending on α and the covariance matrices Σ_k^0. w_k^0 and Σ_k^0 are given by the standard codebook. p(x_t|k, Θ) and p(k|Θ) were already introduced in Sect. 2.4. They denote the likelihood function of a single Gaussian density and the weight of a particular density function as given by the equations (2.35) and (2.36), so that

$\log p(x_t|k, \Theta) = \log \mathcal{N}\{x_t | \mu_k^{EV}(\alpha), \Sigma_k^0\}.$   (2.105)

When speaker adaptation is iteratively performed, Θ^0 has to be replaced by the parameter set Θ̄ of the preceding iteration. Prior knowledge can be included by optimizing Q_MAP(Θ, Θ^0) = Q_ML(Θ, Θ^0) + log(p(Θ)) with respect to the weighting factors α. The derivations known from literature differ in the choice of the prior knowledge p(Θ) concerning the parameter set as found by Huang et al. [2004], for example. In this book a uniform distribution is chosen so that the MAP criterion reduces to the ML estimation problem. The computation is simplified since the weights are independent of the mean vectors:

$\frac{d}{d\alpha} \log p(k|\Theta) = 0$   (2.106)

$\frac{d}{d\alpha} \log p(x_t|k, \Theta) = \frac{d}{d\alpha} \log \mathcal{N}\{x_t | \mu_k^{EV}(\alpha), \Sigma_k^0\}.$   (2.107)

Now the problem of the parameter computation can be given in a compact notation:

$\frac{d}{d\alpha} Q_{ML}(\Theta, \Theta^0) = -\frac{1}{2} \sum_{t=1}^{T} \sum_{k=1}^{N} p(k|x_t, \Theta^0) \cdot \frac{d}{d\alpha} d_k^{Mahal}(\alpha).$   (2.108)

The squared Mahalanobis distance

$d_k^{Mahal}(\alpha) = \left( x_t - \mu_k^{EV}(\alpha) \right)^T \cdot (\Sigma_k^0)^{-1} \cdot \left( x_t - \mu_k^{EV}(\alpha) \right)$   (2.109)

is given by the exponent of the Gaussian density excluding the scaling factor −1/2 as found by Campbell [1997]. d_k^Mahal is a quadratic distance measure and therefore the derivative leads to linear expressions in terms of α. Since Σ_k^0 is symmetric, the derivation rule

$\frac{d}{d\alpha} d_k^{Mahal}(\alpha) = \frac{d}{d\alpha} \left( x_t - \mu_k^0 - M_k \alpha \right)^T (\Sigma_k^0)^{-1} \left( x_t - \mu_k^0 - M_k \alpha \right)$   (2.110)

$= \frac{d}{d\alpha} \alpha^T M_k^T (\Sigma_k^0)^{-1} M_k \alpha - \frac{d}{d\alpha} \alpha^T M_k^T (\Sigma_k^0)^{-1} (x_t - \mu_k^0) - \frac{d}{d\alpha} (x_t - \mu_k^0)^T (\Sigma_k^0)^{-1} M_k \alpha$   (2.111)

$= 2 \cdot M_k^T (\Sigma_k^0)^{-1} M_k \alpha - 2 \cdot M_k^T (\Sigma_k^0)^{-1} (x_t - \mu_k^0)$   (2.112)

$= -2 \cdot M_k^T (\Sigma_k^0)^{-1} \left( x_t - \mu_k^0 - M_k \alpha \right)$   (2.113)

is applicable as found by Felippa [2004], Petersen and Pedersen [2008]. The optimal parameter set α^ML has to fulfill the condition

$\frac{d}{d\alpha} Q_{ML} \Big|_{\alpha = \alpha^{ML}} = 0$   (2.114)

which leads to the following set of equations:

$0 = \sum_{t=1}^{T} \sum_{k=1}^{N} p(k|x_t, \Theta^0) \cdot M_k^T (\Sigma_k^0)^{-1} M_k \alpha^{ML} - \sum_{t=1}^{T} \sum_{k=1}^{N} p(k|x_t, \Theta^0) \cdot M_k^T (\Sigma_k^0)^{-1} (x_t - \mu_k^0).$   (2.115)

The matrix M_k and the covariance matrix Σ_k^0 are independent from the observations x_t. According to equation (2.43), the sum over all observed feature vectors weighted by the posterior probability is equal to the ML estimate μ_k^ML multiplied by the soft number of assigned feature vectors n_k. This allows the following modification:

$0 = \sum_{k=1}^{N} n_k \cdot M_k^T (\Sigma_k^0)^{-1} M_k \alpha^{ML} - \sum_{k=1}^{N} n_k \cdot M_k^T (\Sigma_k^0)^{-1} \left( \mu_k^{ML} - \mu_k^0 \right).$   (2.116)

The solution of this set of equations delivers the optimal weights α^ML. The adapted mean vectors μ_k^EV result from the weighted sum of all eigenvoices in (2.101) given the optimal weights α^ML. In contrast to the MAP adaptation in Sect. 2.6.3 this approach is characterized by fast convergence even on limited training data due to prior knowledge about speaker variability. Only a few utterances are required. However, this also limits the adaptation accuracy for long-term adaptation [Botterweck, 2001]. Thus, EV adaptation is considered as a promising candidate for short-term adaptation, especially for the use case of this book.
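Equation (2.116) is a small linear system in the L weighting factors. A minimal sketch, assuming diagonal covariance matrices and the statistics n_k and μ_k^ML from the E-step, could look as follows:

```python
import numpy as np

def eigenvoice_weights(n, mu_ml, mean0, eigenvoices, var_diag):
    """Solve (2.116) for the eigenvoice weights alpha (diagonal covariances).

    n: (N,) soft counts; mu_ml: (N, D) ML mean estimates; mean0: (N, D)
    standard codebook means; eigenvoices: (L, N, D); var_diag: (N, D).
    """
    L = eigenvoices.shape[0]
    A = np.zeros((L, L))
    b = np.zeros(L)
    for k in range(n.shape[0]):
        Mk = eigenvoices[:, k, :].T         # (D, L), columns are e_{k,l}
        MtSinv = Mk.T / var_diag[k]         # rows: e_{k,l}^T Sigma_k^-1
        A += n[k] * MtSinv @ Mk             # sum_k n_k M^T Sigma^-1 M
        b += n[k] * MtSinv @ (mu_ml[k] - mean0[k])
    alpha = np.linalg.solve(A, b)
    # adapted means by the linear combination (2.101)
    mu_ev = mean0 + np.tensordot(alpha, eigenvoices, axes=1)
    return alpha, mu_ev
```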

Now the question emerges how to handle the transition between limited and extensive training data. One way is to use an experimentally determined threshold: As soon as enough utterances are accumulated with the help of a robust speaker identification, the system can switch from short-term to long-term adaptation. However, hard decisions implicate the need for a reasonable choice of the associated thresholds and bear the risk to be not optimal.

2.6.6 Extended Maximum A Posteriori

The preceding sections presented some techniques for long-term and short-term adaptation. The Extended Maximum A Posteriori (EMAP) speaker adaptation presented by Stern and Lasry [1987] follows the strategy to integrate both aspects and offers an optimal transition. EMAP differs from the MAP adaptation in Sect. 2.6.3 since the parameters are regarded as random variables with statistical dependencies across all Gaussian densities [Stern and Lasry, 1987]. EMAP jointly adapts all Gaussian densities whereas MAP adaptation individually modifies each distribution. The prior knowledge how the Gaussian densities of a codebook depend on the adaptation of the remaining density functions is employed to efficiently adjust even those which are not or not sufficiently trained [Jon et al., 2001]. This behavior is similar to the EV technique and is important when adaptation relies only on limited data such as short utterances. In the extreme case of large data sets EMAP behaves like the MAP algorithm as shown later in (2.124). First, some vectors and matrices are introduced so that EMAP adaptation can be described by the supervector notation introduced in the preceding section.

All mean vectors μ_k are stacked into a supervector

$\check{\mu} = \big( \mu_1^T, \ldots, \mu_k^T, \ldots, \mu_N^T \big)^T$   (2.117)

whereas the covariance matrix

$\check{\Sigma}^0 = \operatorname{diag}\big( \Sigma_1^0, \ldots, \Sigma_N^0 \big)$   (2.118)

represents a block matrix whose main diagonal consists of the covariance matrices Σ_k^0 of the standard codebook. The covariance matrix

$\check{\Sigma}^{EMAP} = E_i\{ (\check{\mu}^i - \check{\mu}^0) \cdot (\check{\mu}^i - \check{\mu}^0)^T \}$   (2.119)

denotes a full covariance matrix representing prior knowledge about the dependencies between all Gaussian densities in analogy to (2.97). The estimation of this matrix is done in an off-line training. At run-time all ML estimates given by (2.81) are stacked into a long concatenated vector μ̌^ML. The numbers of softly assigned feature vectors

$n_k = \sum_{t=1}^{T} p(k|x_t, \Theta^0)$   (2.120)

build the main diagonal of the matrix

$\check{C} = \operatorname{diag}(n_1, \ldots, n_N)$   (2.121)

as found by Jon et al. [2001]. Stern and Lasry [1987] describe the EMAP approach in detail. Therefore the derivation is omitted in this book and only the result is given. The adapted mean supervector μ̌^EMAP can be calculated by

$\check{\mu}^{EMAP} = \check{\mu}^0 + \check{\Sigma}^{EMAP} \cdot (\check{\Sigma}^0 + \check{C} \cdot \check{\Sigma}^{EMAP})^{-1} \cdot \check{C} \cdot (\check{\mu}^{ML} - \check{\mu}^0).$   (2.122)

Finally, the two extreme cases n_k → 0, ∀k and n_k → ∞, ∀k should be discussed:

• When the number of the softly assigned feature vectors increases, n_k → ∞ ∀k, the term (Σ̌^0 + Č · Σ̌^EMAP)^{-1} is dominated by Č · Σ̌^EMAP so that the following approximation may be employed:

$\check{\mu}^{EMAP} \approx \check{\mu}^0 + \check{\Sigma}^{EMAP} \cdot (\check{C} \cdot \check{\Sigma}^{EMAP})^{-1} \cdot \check{C} \cdot (\check{\mu}^{ML} - \check{\mu}^0)$   (2.123)

$\mu_k^{EMAP} \approx \mu_k^{ML}, \quad \forall k.$   (2.124)

Therefore, the MAP and EMAP estimates converge if a large data set can be used for adaptation.

• In the case of very limited data, n_k → 0 ∀k, the influence of Č · Σ̌^EMAP decreases which allows the approximation

$\check{\mu}^{EMAP} \approx \check{\mu}^0 + \check{\Sigma}^{EMAP} \cdot (\check{\Sigma}^0)^{-1} \cdot \check{C} \cdot (\check{\mu}^{ML} - \check{\mu}^0).$   (2.125)

The matrix Σ̌^EMAP · (Σ̌^0)^{-1} is independent from the adaptation data and represents prior knowledge. Even limited adaptation data can be handled appropriately.

One disadvantage of the EMAP technique is the high computational complexity. Especially, the matrix inversion in (2.122) is extremely demanding because the dimension of Σ̌^0 + Č · Σ̌^EMAP is the product of the number of Gaussian densities N and the dimension of the feature vectors. Unfortunately, the inversion contains the number of the softly assigned feature vectors n_k so that this inversion has to be done at run-time [Rozzi and Stern, 1991]. Jon et al. [2001] approximate the covariance matrix Σ̌^EMAP by some 25 eigenvectors. This approximation helps to significantly reduce the complexity of the matrix inversion. The intention is therefore similar to the EV approach discussed in Sect. 2.6.5.
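For small models, (2.122) can be implemented directly. The sketch below (hypothetical naming) is meant only to make the structure of the update visible; for realistic codebook sizes the inversion is exactly the computational burden discussed above:

```python
import numpy as np

def emap_update(mean0, mu_ml, n, sigma0_block, sigma_emap):
    """EMAP mean update (2.122) by direct matrix inversion (small models only).

    mean0, mu_ml: (N, D); n: (N,) soft counts;
    sigma0_block: (N*D, N*D) block-diagonal Sigma^0 of (2.118);
    sigma_emap: (N*D, N*D) full inter-density covariance of (2.119).
    """
    N, D = mean0.shape
    mu0 = mean0.reshape(N * D)
    ml = mu_ml.reshape(N * D)
    # soft counts on the diagonal, repeated per feature dimension, cf. (2.121)
    C = np.diag(np.repeat(n, D))
    lhs = sigma0_block + C @ sigma_emap
    mu_emap = mu0 + sigma_emap @ np.linalg.solve(lhs, C @ (ml - mu0))
    return mu_emap.reshape(N, D)
```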

2.7 Feature Vector Normalization and Enhancement

In this section the normalization and enhancement of feature vectors is introduced. Alternative approaches are described to deal with speaker variabilities. First, a motivation for feature normalization and enhancement as well as a brief classification of the existing techniques are given to demonstrate the change of the procedural method to handle speaker variabilities. Then a selection of approaches known from the literature is presented.

2.7.1 Motivation

The main aspect of speaker adaptation is to alleviate the mismatch between training and test situations. The source-filter theory in Sect. 2.1 gives reasons for a possible mismatch by speaker specific characteristics. Speaker adaptation looks for a better statistical representation of speech signals by integrating these discrepancies into the statistical model. In contrast to speaker adaptation, unwanted variations, e.g. caused by environmental effects or speakers [Song and Kim, 2005], are removed from the speech signal and are not integrated in the statistical modeling. For example, channel impulse responses and background noises are removed from the speech signal. Even though speaker adaptation is considered to produce normally the best results, it is more computationally demanding than speech normalization in the feature space [Buera et al., 2007].

Normalization and enhancement follow the idea to remove a mismatch between the test and training conditions, e.g. due to environmental disturbances or speaker dependencies. These fluctuations should not be visible for speech recognition. Therefore, techniques for feature vector enhancement are required which are able to handle different mismatch conditions [Song and Kim, 2005]. Normalization and enhancement of feature vectors can be advantageous when users operate speech controlled devices under various environmental conditions and when speaker adaptation can adjust only one statistical model for each speaker.

Inter-speaker variability due to physiological differences of the vocal tract can be moderated by a speaker-specific frequency warping prior to cepstral feature extraction, e.g. by modifying the center frequencies of the mel filter bank [Garau et al., 2005]. Vocal Tract Length Normalization (VTLN) may be applied in addition. Noise reduction and cepstral mean normalization join this procedure because they limit the effects of background noises or room impulse responses on the feature vector space. A variety of additional methods have been developed which perform a correction directly in the feature space. Feature normalization and enhancement can be divided into three categories [Buera et al., 2007]:

• High-pass filtering comprises cepstral mean normalization and relative spectral amplitude processing [Buera et al., 2007].
• Model-based techniques presume a structural model to describe the environmental degradation and apply the inverse operation for the compensation at run-time [Buera et al., 2007; Moreno, 1996].
• Empirical compensation applies statistical models which represent the relationship between the clean and disturbed speech feature vectors [Buera et al., 2007]. They enable the correction or mapping of the observed disturbed feature vectors onto their corresponding clean or undisturbed representatives.

Two algorithms are presented in Sect. 2.7.2 and Sect. 2.7.3 which apply similar statistical models compared to the methods for speaker identification and speaker adaptation described in the preceding sections.

2.7.2 Stereo-based Piecewise Linear Compensation for Environments

The Stereo-based Piecewise Linear Compensation for Environments (SPLICE) technique is a well-known approach for feature vector based compensation [Buera et al., 2005; Droppo et al., 2001; Kim et al., 2004]. It represents a non-linear feature correction as a front-end of a common speech recognizer [Droppo and Acero, 2001; Häb-Umbach, 1999].

A statistical model is trained to learn the mapping of undisturbed feature vectors to disturbed observations. At run-time the inverse mapping is performed. The disturbed feature vectors y_t have to be replaced by optimal estimates given by the Minimum Mean Squared Error (MMSE) criterion. MMSE estimates minimize the Euclidean distance between the correct clean feature vectors and the resulting estimates on average. The optimal estimate x_t^MMSE is given by the expectation of the corresponding conditional probability density function p(x_t|y_t) [Droppo and Acero, 2005; Hänsler, 2001; Moreno, 1996] as shown below:

$x_t^{MMSE} = E_{x|y}\{x_t\}$   (2.126)

$x_t^{MMSE} = \int_{x_t} x_t \cdot p(x_t|y_t) \, dx_t.$   (2.127)

A latent variable may be introduced to represent multi-modal density functions. A more precise modeling can be achieved. The sum over all density functions

$x_t^{MMSE} = \int_{x_t} x_t \cdot \sum_{k=1}^{N} p(x_t, k|y_t) \, dx_t$   (2.128)

reduces the joint density function p(x_t, k|y_t) to the conditional density function p(x_t|y_t). A further simplification is achieved by using Bayes' theorem:

$x_t^{MMSE} = \int_{x_t} x_t \cdot \sum_{k=1}^{N} p(x_t|k, y_t) \cdot p(k|y_t) \, dx_t.$   (2.129)

Integral and sum can be interchanged

$x_t^{MMSE} = \sum_{k=1}^{N} p(k|y_t) \cdot \int_{x_t} x_t \cdot p(x_t|k, y_t) \, dx_t$   (2.130)

$= \sum_{k=1}^{N} p(k|y_t) \cdot E_{x|y,k}\{x_t\}$   (2.131)

due to the linearity of the expectation value. Under the assumption that the undisturbed feature vector x_t can be represented by the observation y_t and an offset vector [Moreno, 1996], equation (2.131) can be realized by

$x_t^{MMSE} = y_t + \sum_{k=1}^{N} p(k|y_t) \cdot r_k.$   (2.132)

The offset vectors r_k have to be trained on a stereo data set that comprises both the clean feature vectors x and the corresponding disturbed feature vectors y of the identical utterance.


The posterior probability p(k|yt ) determines the influence of each offset vector. An additional GMM is introduced
$p(y_t) = \sum_{k=1}^{N} w_k \cdot \mathcal{N}\{y_t | \mu_k, \Sigma_k\}$   (2.133)

which represents the distribution of the observed feature vectors y_t as found by Buera et al. [2007], for example. The posterior probability p(k|y_t) can be calculated by combining (2.133) and (2.41). The SPLICE technique described here represents only a basic system since a manifold of variations and extensions exists. Buera et al. [2005] should be mentioned because they investigate similar techniques in the context of speaker verification and speaker identification.
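For a single frame the SPLICE correction reduces to evaluating the GMM posterior of (2.133) and applying (2.132). A minimal sketch, assuming diagonal covariance matrices and stereo-trained offset vectors:

```python
import numpy as np

def splice_correct(y, weights, means, covs_diag, offsets):
    """SPLICE correction (2.132): x_hat = y + sum_k p(k|y) r_k.

    y: (D,) disturbed feature vector; weights/means/covs_diag describe the
    GMM of (2.133) with diagonal covariances; offsets: (N, D) vectors r_k.
    """
    diff = y - means                                           # (N, D)
    log_lik = (np.log(weights)
               - 0.5 * np.sum(np.log(2.0 * np.pi * covs_diag), axis=1)
               - 0.5 * np.sum(diff**2 / covs_diag, axis=1))
    post = np.exp(log_lik - log_lik.max())
    post /= post.sum()                                         # p(k|y_t)
    return y + post @ offsets
```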

2.7.3 Eigen-Environment

The SPLICE algorithm is built on a mixture of conditional distributions p(x_t|y_t, k) each contributing to the overall correction. The goal is to deliver an optimal estimate x_t^MMSE of the clean feature vector. Song and Kim [2005] describe an eigen-environment approach that is a combination of the SPLICE method and an eigenvector technique similar to the EV approach. The main issue is the computation or approximation of the offset vector r_k by a linear combination of eigenvectors. The basic directions of the offset vectors r_k and the average offset vector e_{k,0}^Env have to be extracted in a training similar to the EV adaptation. Only the first L eigenvectors e_{k,l}^Env associated with the highest eigenvalues are employed. At run-time a set of scalar weighting factors α_l has to be determined to obtain an optimal approximation of the offset vector
$r_k = e_{k,0}^{Env} + \sum_{l=1}^{L} \alpha_l \cdot e_{k,l}^{Env}.$   (2.134)

Song and Kim [2005] estimate the parameters αl with the help of the EM algorithm by maximizing the auxiliary function QML in a similar way as the EV adaptation in Sect. 2.6.5. The estimate of the undisturbed feature vector can then be calculated by (2.132). As discussed before, an auxiliary GMM is required to represent the distribution of the disturbed feature vectors. It enables the application of (2.41) to compute the posterior probability p(k|yt ) which controls the influence of each offset vector.
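The reconstruction of the offset vectors in (2.134) is a one-line operation once the environment eigenvectors are trained. The sketch below (hypothetical shapes) composes directly with the SPLICE correction from the previous example:

```python
import numpy as np

def eigen_env_offsets(e0, eigvecs, alpha):
    """Offset reconstruction (2.134): r_k = e_{k,0} + sum_l alpha_l e_{k,l}.

    e0: (N, D) average offsets; eigvecs: (L, N, D) environment eigenvectors;
    alpha: (L,) weights estimated at run-time via the EM algorithm.
    """
    return e0 + np.tensordot(alpha, eigvecs, axes=1)

# usage with the earlier SPLICE sketch:
#   x_hat = splice_correct(y, w, mu, var, eigen_env_offsets(e0, E, alpha))
```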


2.8 Summary

In this chapter the basic components of a complete system comprising speaker change detection, speaker identification, speech recognition and speaker adaptation have been described. The discussion started with the human speech production and introduced the source-filter model. The microphone signal is explained by excitation, vocal tract, channel characteristics and additional noise. Speaker specific excitation and vocal tract as well as acquired speaker habits can contribute to speaker variability [Campbell, 1997; O'Shaughnessy, 2000]. The front-end compensates for noise degradations and channel distortions and provides a compact representation of the spectral properties suitable for speech recognition and speaker identification.

Several statistical models were introduced to handle speech and speaker variability for speaker change detection, speaker identification and speech recognition: BIC as a well-known method for speaker change detection uses multivariate Gaussian distributions for the statistical modeling of hypothetical speaker turns. Speaker identification extends this model by applying a mixture of Gaussian distributions to appropriately capture speaker characteristics such as the vocal tract. Speech recognition is usually not intended to model particular speaker characteristics but has to provide a transcription of speech utterances. Dynamic speech properties have to be taken into consideration. For this purpose HMMs combine GMMs with an underlying Markov process to improve the modeling of speech.

Finally, speaker adaptation provides several strategies to modify GMMs and HMMs and aims to reduce mismatch situations between training and test. The main difference between the techniques described in this chapter can be seen in the degrees of freedom which have to be balanced with the amount of available training data. Short-term and long-term adaptation have been explained in detail. Normalization or enhancement provides a different approach to deal with speaker variability: Instead of learning mismatch conditions, these are removed from the input signal. Even though speaker adaptation generally yields better results [Buera et al., 2007], applications are possible which combine both strategies. In fact, the combination of cepstral mean subtraction and speaker adaptation already realizes a simple implementation. More complex systems, e.g. speaker tracking or speaker specific speech recognition, can be constructed based on these components as described in the following chapters.

3 Combining Self-Learning Speaker Identification and Speech Recognition

Chapter 2 introduced the fundamentals about speaker change detection, speaker identification, speech recognition and speaker adaptation. The basic strategies relevant for the problem of this book were explained. In this chapter a survey of the literature concerning the intersection of speaker change detection, speaker identification, speaker adaptation and speech recognition is given. The approaches are classified into three groups. They mainly differ in the complexity of the speaker specific models and the methods of involving speech and speaker characteristics. The advantages and drawbacks of the approaches are discussed and the target system of this book is sketched. A first conclusion with respect to the cited literature is drawn in the last section.

3.1 Audio Signal Segmentation

A frequently investigated scenario can be seen in the broad class of audio segmentation. A continuous stream of audio data has to be divided into homogeneous parts [Chen and Gopalakrishnan, 1998; Meinedo and Neto, 2003]. This procedure is known as segmentation, end-pointing or divisive segmentation. The criteria for homogeneity may be the acoustic environment, channel properties, speaker identity or speaker gender [Chen and Gopalakrishnan, 1998; Lu and Zhang, 2002]. The distinction of speech and non-speech parts can be an additional task [Meinedo and Neto, 2003]. The problem of speaker change detection is to find the beginning and end of an utterance for each speaker. In other words the question to be solved is which person has spoken when [Johnson, 1999; Hain et al., 1998]. Speaker change detection has found its way into a broad range of applications. It can be used for broadcast news segmentation to track the anchor speakers [Johnson, 1999]. Further applications are automatic transcription systems for conferences or conversations [Lu and Zhang, 2002]. Automatic transcription systems aim at journalizing conversations and at

assigning the recognized utterances to the associated persons. This problem can be extended to speaker tracking. In this case an automatic speaker indexing has to be performed to retrieve all occurrences of one speaker in a data stream [Meinedo and Neto, 2003]. The ability to track the speaker enables more advanced speaker adaptation techniques and reduces the error rate of unsupervised transcription systems [Kwon and Narayanan, 2002]. Video content analysis and audio/video retrieval can be enhanced by unsupervised speaker change detection [Lu and Zhang, 2002].

Speaker change detection has to handle data streams typically without prior information about the beginning and end of a speaker's utterance. A high number of speaker changes has to be expected a priori for applications such as meeting indexing. Small segments, however, may complicate robust end-pointing, for example, in the case of small speech pauses [Kwon and Narayanan, 2002]. Real-time conditions can be a further challenge [Lu and Zhang, 2002]. Hence, both the computational load and time delays limit the application of complex statistical models and extensive training algorithms for real-time applications [Lu and Zhang, 2002].

Divisive segmentation splits the data stream into segments that contain only one speaker. An example is given in Fig. 3.1(a). This can be realized by fixed-length segmentation or by a variable window length [Kwon and Narayanan, 2002]. Fixed-length segmentation partitions the data stream into small segments of predefined length by applying overlapping or non-overlapping window functions. The given data segment is tested for all theoretical speaker turns as illustrated in Fig. 3.1(a). Each boundary is checked for a speaker change. If no change is found both blocks are merged and the next boundary is tested. Alternatively, the window length can be dynamically enlarged depending on the result of the speaker change detection [Ajmera et al., 2004]. For example, Cheng and Wang [2003] use a first segmentation with a window length of 12 sec and a refined window of 2 − 3 sec.

Agglomerative clustering can be regarded as the continuation of divisive clustering. An example is given in Fig. 3.1(b). The clustering can be solved by using a two step procedure: First, the data stream is divided into homogeneous blocks. Then, those blocks which are assumed to correspond to the same speaker are combined. The same or similar algorithms can be applied to the result of the speaker change detection to test the hypothesis that non-adjacent segments originate from one speaker. This strategy allows building larger clusters of utterances comprising only one speaker. This enables the system to estimate more parameters. Thus, more robust and complex speaker specific models can be trained [Mori and Nakagawa, 2001]. Implementations are described by Chen and Gopalakrishnan [1998], Hain et al. [1998], Johnson [1999], Meinedo and Neto [2003], Tritschler and Gopinath [1999], Yella et al. [2010], for example.

Fig. 3.1 Comparison of divisive segmentation (a) and agglomerative clustering (b). Divisive segmentation splits the continuous data stream into data blocks of predefined length which are iteratively tested for speaker turns. Based on this result the same or other algorithms are applied for agglomerative clustering to merge non-adjacent segments if they are assumed to originate from the same speaker.

Segmentation and clustering can be structured into three groups [Chen and Gopalakrishnan, 1998; Kwon and Narayanan, 2002]:

• Metric-based techniques compute the distances of adjacent intervals of the considered audio signal [Kwon and Narayanan, 2002]. For example, the Euclidean distance or an information theoretic distance measure can be applied as found by Kwon and Narayanan [2002]. Maxima of the distance measure indicate possible speaker changes and a threshold detection leads to the acceptance or rejection of the assumed speaker turns. Metric-based techniques are characterized by low computational cost but require experimentally determined thresholds [Cheng and Wang, 2003].

• Model-based algorithms such as BIC estimate statistical models for each segment and perform a hypothesis test. These models are locally trained on the segment under investigation. A low model complexity is required because of limited data. Furthermore, the segments usually span only short time intervals and thus may not be suitable for robust distance measures [Cheng and Wang, 2003]. A detailed description of the BIC was given in Sect. 2.3; further detailed descriptions and several implementations are provided by Ajmera et al. [2004], Chen and Gopalakrishnan [1998], Siegler et al. [1997], Tritschler and Gopinath [1999], Zhou and Hansen [2000], Zochová and Radová [2005].

• Decoder-guided techniques extend this model-based procedure by using pre-trained statistical models. For example, the audio stream can be processed by a speech recognizer. The transcription delivers the candidates for speaker turns, e.g. speech pauses, which can be employed in the subsequent speaker segmentation [Zochová and Radová, 2005]. When speaker models are trained off-line, more sophisticated models of higher complexity and more extensive training algorithms can be applied. This approach circumvents the restriction of locally optimized statistical modeling at the cost of an enrollment. At run-time the Viterbi algorithm can be applied to determine the most likely sequence of speaker occurrences in a data stream as found by Hain et al. [1998], Wilcox et al. [1994].

The result of a robust audio signal segmentation can be employed to realize speaker specific speech recognition. Zhang et al. [2000] investigate a two-stage procedure of speaker change detection and speech recognition: The first stage checks an audio signal for speaker turns and the second stage decodes the speech signal with a refined speaker adapted HMM. In the first implementation the variance BIC is applied for divisive and agglomerative clustering. If a new speaker is indicated, a new HMM is initialized by adapting the speaker independent HMM to the current utterance. As long as the current speaker does not change the speech recognizer is able to continuously adapt the speech model to the speaker's characteristics. The updated speaker specific HMM is then used for re-recognition of the utterance and delivers the final transcription result. In a second experiment the HMM recognition system is extended by speaker specific GMMs and a further UBM to decrease the computational effort. The UBM is realized by a GMM comprising 64 Gaussian distributions. During the first stage one speaker independent HMM and up to two speaker specific HMMs are evaluated in parallel. The most likely GMM is selected according to the ML criterion and the associated HMM is used for speech decoding. Speaker change detection is implemented comparably to the detection of unknown speakers in Sect. 2.4. The ML criterion is used to decide whether a new speaker has to be added to the recognition system and whether a speaker change has to be assumed.

Nishida and Kawahara [2005] present a multi-stage algorithm to cluster utterances based on speaker change detection or statistical speaker model selection. The proposed framework operates either on discrete or continuous speaker models depending on the amount of training data. Robustness is gained by a simple discrete model during the initialization phase. The system switches in an unsupervised manner to GMM modeling when a robust parameter estimation becomes feasible. The transition is realized by the Gaussian mixture size selection that is set up on the BIC framework. Identification and verification methods enable the system to merge the speaker clusters.
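To make the model-based strategy concrete, the following sketch implements a BIC-style hypothesis test with single full-covariance Gaussians per segment; the penalty weight lam and the minimum segment length are experimentally determined constants, matching the threshold discussion above:

```python
import numpy as np

def delta_bic(X, b, lam=1.0):
    """Model-based speaker change test at boundary b of window X ((T, D)).

    Positive values indicate that two Gaussians explain the window better
    than one, i.e. a hypothetical speaker turn at frame b."""
    def logdet_cov(Z):
        _, ld = np.linalg.slogdet(np.cov(Z, rowvar=False, bias=True))
        return ld
    T, d = X.shape
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(T)
    return (0.5 * T * logdet_cov(X)
            - 0.5 * b * logdet_cov(X[:b])
            - 0.5 * (T - b) * logdet_cov(X[b:])
            - penalty)

def detect_change(X, min_seg=50):
    """Scan all boundaries of a window; return the best accepted turn or None."""
    if len(X) < 2 * min_seg:
        return None
    scores = [(delta_bic(X, b), b) for b in range(min_seg, len(X) - min_seg)]
    best, b = max(scores)
    return b if best > 0 else None
```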

3.2 Multi-Stage Speaker Identification and Speech Recognition

Many algorithms in audio signal segmentation such as BIC use simple statistical models to detect speaker changes even on few speech data. Some of these algorithms were described in the preceding section. Speaker identification usually relies on more complex models such as GMMs which require a sufficient amount of training and test data. If speaker identification is combined with speech recognition, the knowledge about the speaker identity also enables speaker specific speech recognition. Some publications employ the combination of both strategies as will be shown in this section.

Figure 3.2(a) displays a possible realization of such an approach. The feature extraction transforms the audio signal into a vector representation comprising discriminatory features of the speech signal. VAD separates speech and speech pauses so that only the speech signals have to be processed. Speaker change detection clusters the feature vectors of an utterance and assures that only one speaker is present. Speaker identification determines the speaker identity and selects the corresponding speaker model. Speaker identification controls speech recognition so that the associated speaker specific speech models can be used to enhance speech recognition. The subsequent speaker adaptation applies common adaptation techniques to better capture speaker characteristics.

Wu et al. [2003] investigate speaker segmentation for real-time applications such as broadcast news processing. A two-stage approach is presented comprising pre-segmentation and refinement. In the first stage audio segmentation is performed frame by frame. MFCCs are extracted to be used for several classification steps. Pre-segmentation is realized by a UBM which categorizes speech into three categories: reliable speaker-related, doubtful speaker-related and unreliable speaker-related. Based on reliable frames the dissimilarity between two segments is computed and potential speaker changes are identified. In the case of an assumed speaker turn this decision is verified by adapted GMMs. If no speaker change is found, incremental speaker adaptation is applied to obtain a more precise speaker model. Otherwise, the existing model is substituted by a new model initialized from the UBM.

Geiger et al. [2010] present a GMM based system for open-set on-line speaker diarization. Three GMMs are trained off-line: male, female and garbage. A threshold criterion and a rule-based framework are applied to determine the start and end points of speech and garbage words based on the energy of the audio signal. When one of the gender models yields the highest likelihood score, a new speaker is assumed and a new model is initialized. MAP adaptation is used to initialize additional GMMs and to continuously adapt the speaker models.
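The open-set decision logic shared by such systems can be reduced to a few lines. The sketch below is a simplification under the assumption that utterance-level average log-likelihoods for the UBM and all enrolled speaker models are available; the margin is a hypothetical tuning constant:

```python
import numpy as np

def identify_open_set(ll_speakers, ll_ubm, new_speaker_margin=0.0):
    """Open-set decision logic used by multi-stage systems (sketch).

    ll_speakers: average log-likelihood of the utterance under each enrolled
    speaker GMM; ll_ubm: the same score under the UBM. Returns the speaker
    index, or None to signal that a new model should be cloned from the UBM.
    """
    if len(ll_speakers) == 0:
        return None
    best = int(np.argmax(ll_speakers))
    if ll_speakers[best] - ll_ubm > new_speaker_margin:
        return best          # known speaker: adapt this model, e.g. by MAP
    return None              # out-of-set: initialize a new profile
```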

Kwon and Narayanan [2005] address the issue of unsupervised speaker indexing by developing a complete system combining speaker change detection, speaker identification and speaker adaptation. The problem of model initialization is solved by a set of reference speakers. The acoustic similarity between particular speakers is exploited by a Monte Carlo algorithm to construct an initial speaker model that can be further adapted to the new speaker. This approach is called sample speaker model and is compared to the UBM and Universal Gender Model (UGM) initialization method for speaker adaptation.

Nishida and Kawahara [2004] present an approach that combines automatic speech recognition with a preceding speaker indexing to obtain a higher recognition accuracy. Their algorithm presumes no prior knowledge about speaker changes or the number of speakers. Each speaker represents a state of an HMM and the associated GMM contains the speaker characteristics. The transitions of the Markov chain can be controlled by a speaker change or tracking algorithm. The most likely sequence of speaker identities is determined by the Viterbi algorithm to achieve a more robust speaker tracking. As soon as a different speaker is assumed, the data between the last two speaker turns is clustered and a pool of speaker models is incrementally built up and refined. As soon as a confident guess concerning the speaker's identity is available speaker specific speech recognition becomes feasible as found by Mori and Nakagawa [2001].

Schmalenstroeer and Häb-Umbach [2007] describe an approach for speaker tracking. Speech segmentation, speaker identification and spatial localization by a beamformer¹ are considered simultaneously. VAD filters the speech signal and rejects all non-speech parts such as background noise. Only the speech signal is further processed by the speaker change detection which applies model-based segmentation algorithms as introduced in Sect. 3.1 to detect possible speaker turns. This approach provides a method of a complete speaker tracking system. Further multi-stage procedures combining segmentation and clustering are described by Hain et al. [1998], for example.

These speaker indexing algorithms consist of a step-by-step processing: VAD makes discrete decisions in speech and non-speech segments and the speaker change detection controls the clustering and adaptation algorithm. Therefore no confidence measure about previous decisions is available in the subsequent processing. This loss of information can be circumvented when a system is based on soft decisions and when the final discrete decisions, e.g. the speaker identity, are delayed until speaker adaptation. All cues for speaker changes, speaker identity and localization can be kept as soft quantities. This prevents a loss of information caused by threshold detections.

¹ Details on beamforming can be found by Krim and Viberg [1996], Veen and Buckley [1988].
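The speaker-as-HMM-state idea can be sketched directly: given per-utterance log-likelihoods for all speaker models and transition scores that favor staying with the current speaker, the Viterbi recursion yields the most likely speaker sequence. A minimal sketch with hypothetical inputs:

```python
import numpy as np

def track_speakers(log_lik, log_trans):
    """Most likely speaker sequence over consecutive utterances (Viterbi).

    log_lik: (T, S) per-utterance log-likelihoods of S speaker models;
    log_trans: (S, S) log transition scores, e.g. favoring self-transitions.
    """
    T, S = log_lik.shape
    delta = log_lik[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + log_trans      # (S, S): previous x current
        back[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) + log_lik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):               # backtracking
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```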

3.3 Phoneme Based Speaker Identification

The goal of the algorithms in the preceding sections was to control speech recognition by estimating the speaker identity. The opposite problem is also known from literature: Speech recognition can enhance speaker identification since some groups of phonemes differ in their discriminative information. Figure 3.2(b) displays a simplified block diagram of phoneme based speaker identification. It is pointed out that speaker identification and speech recognition are interchanged compared to Fig. 3.2(a).

The configuration of the articulatory organs widely varies during speech production as briefly described in Sect. 2.1. Thus, speaker specific voice characteristics are not accurately modeled by averaged features. Acoustic discrimination can improve the performance of speaker identification [Rodríguez-Liñares and García-Mateo, 1997]. Eatock and Mason [1994], Harrag et al. [2005], Pelecanos et al. [1999], Thompson and Mason [1993] agree that several phoneme groups exhibit a higher degree of speaker discriminative information. These publications conclude that vowels and nasals contain the highest speaker specific variability whereas fricatives and plosives have the lowest impact on speaker identification. Hence, a phoneme recognition before speaker identification can enhance the speaker identification rate.

This observation motivated Kinnunen [2002] to construct a speaker dependent filter bank. The key issue is to emphasize specific frequency bands which are characteristic for a considered group of phonemes to account for speaker characteristics. The goal is an improved representation of the speaker discriminative phonemes to increase the speaker identification rate.

Gutman and Bistritz [2002] apply phoneme-based GMMs to increase the identification accuracy of GMMs. A stronger correlation of phonemes and Gaussian mixtures is targeted. For each speaker one GMM is extensively trained by standard training algorithms without using a phoneme classification. The phonetic transcription of the TIMIT database and the Viterbi algorithm are used for the classification. Phoneme segmentation is then used to cluster the speaker's training data. Finally, a set of phoneme dependent GMMs is trained for all speakers by adapting the phoneme independent GMM to each phoneme cluster. A speaker independent background model is equally constructed. This scenario is similar to a text-prompted speaker verification. For testing, the likelihoods of all speaker specific GMMs are computed given the phoneme segmentation. They are normalized by the corresponding background model. The ratio of both likelihoods is evaluated to accept or reject the identity claim of the current speaker. In their tests, a lower error rate was achieved for speaker verification based on phoneme-adapted GMMs compared to a standard GMM approach.
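A phoneme based verification score of this kind, speaker model versus background model accumulated along the phoneme alignment, can be sketched as follows; the dictionaries of per-class scoring functions are assumptions of this illustration, not the cited systems' actual interfaces:

```python
def phoneme_based_score(frames, alignment, speaker_gmms, background_gmms):
    """Phoneme based verification score in the spirit of Gutman and Bistritz.

    frames: iterable of feature vectors; alignment: phoneme class labels from
    the speech recognizer, one per frame; speaker_gmms / background_gmms:
    dicts mapping a phoneme class to a log-likelihood function.
    Returns the average log-likelihood ratio over the utterance.
    """
    score = 0.0
    count = 0
    for x, ph in zip(frames, alignment):
        score += speaker_gmms[ph](x) - background_gmms[ph](x)
        count += 1
    # accept the identity claim if the normalized ratio exceeds a threshold
    return score / max(count, 1)
```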

Fig. 3.2 Comparison of two examples for multi-stage speaker identification and phoneme based speaker identification. In both cases the result of the speaker change detection triggers the recognition. Either the most probable speaker is identified to enable an enhanced speech recognition or the spoken phrase is divided into phoneme groups by a speech recognizer to enhance speaker identification. In addition, speaker adaptation can be continuously applied to the speech and speaker models. (a) Exemplary block diagram of a multi-stage speaker identification including speech recognition and speaker adaptation. (b) Exemplary block diagram of a phoneme based speaker identification with integrated speaker adaptation.

Gauvain et al. [1995] present a statistical modeling approach for text-dependent and text-independent speaker verification. Phone-based HMMs are applied to obtain a better statistical representation of speaker characteristics. Each phone is modeled by a speaker-specific left-to-right HMM comprising three states. 75 utterances are used for training which is realized by MAP adaptation of an initial model comprising 35 speaker-independent context-free phone models and 32 Gaussian distributions. For testing all speaker models are run in parallel. For each speaker the phone-based likelihood is

computed by the Viterbi algorithm and the ML criterion is used to hypothesize the speaker identity. For text-dependent speaker verification the limitation is given by the concatenated phone sequence. For the text-independent verification task local phonotactic constraints are imposed to limit the search space. Both high-quality and telephone speech are examined.

Genoud et al. [1999] present a combination of speech recognition and speaker identification with the help of a neural network. Voice signatures are investigated for electronic documents. The discriminative MLP receives MFCC features at its input layer and is trained for both tasks. It comprises two sets of output. Besides the speaker independent phoneme output, the posterior probability of each phoneme is examined for several target speakers. In their experiments 12 registered speakers are investigated in a closed-set scenario. The generalization towards open-set systems including unsupervised adaptation at the risk of error propagation is left for future research.

Kimball et al. [1997] describe a two-stage approach for text-dependent speaker verification. The initial stage consists of an HMM for phone alignment whereas an RBF is applied in the second stage for speaker identification given the phonetic information. The evaluation of the HMMs associated with the given text is combined with cohort speaker normalization to solve the verification task.

Olsen [1997] also uses a two-stage approach for speaker identification. A common speech recognizer categorizes the speech input into several phonetic classes to train more discriminative statistical models for speaker verification. In addition, the speaker adaptive scoring allows the system to learn the speaker specific phoneme pronunciation when sufficient speech material can be collected for a particular speaker.

Park and Hazen [2002] also suggest to use a two-stage approach by combining common GMM based speaker identification and phoneme based speaker identification. The proposed identification framework hypothesizes the temporal phoneme alignment by a speaker independent speech recognizer given a spoken phrase. In parallel, GMMs are applied to pre-select a list of the best matching speaker models which have to be processed by the second stage. The latter relies on the phonetically structured statistical models introduced by Faltlhauser and Ruske [2001] or phonetic class GMMs. The main issue is to use granular GMMs which resolve the phonetic content more precisely compared to globally trained GMMs. Subsequently, the phonetically refined speaker models re-score the subset of selected speakers and lead to the final speaker identification. The second approach applies supervised speaker adaptation to speaker specific HMMs. In summary, the speech recognition result controls the selection of the speaker models and exploits more information about the speech utterance than globally trained GMMs.

Rodríguez-Liñares and García-Mateo [1997] combine speech classification, speech segmentation and speaker identification by using an HMM system.

Speech signals are divided into voiced and unvoiced speech as well as the associated transitions. In addition, speech pauses representing background noises are modeled. These categories are modeled by four HMMs. The probabilities of all phonetic classes are accumulated and combined either by equal weighting or by applying selective factors to investigate only the effect of single classes. At run time the phonetic classification is implemented by the Viterbi algorithm. In summary, speaker specific information is mainly contained in voiced parts of an utterance: the speaker identification rate on purely voiced parts is significantly higher than for unvoiced speech.

Matsui and Furui [1993] consider speech recognition and speaker verification in a unified framework. The task is to verify both the speaker's identity and the uttered phrase. It has to be decided either to accept or reject the claimed identity. Matsui and Furui represent speakers by speaker specific phoneme models which are obtained by adaptation during an enrollment. Their statistical model incorporates the knowledge about speech characteristics by adapting a syllable-based speaker independent speech model to particular speakers. Cohort speakers serve as alternative models and are used for score normalization. The result is a combined statistical model for text-dependent speech recognition and speaker verification. In addition, the speech segmentation is obtained as a by-product.

Reynolds and Carlson [1995] compare two approaches for text-dependent speaker verification. The task is to verify the identity claim of the actual user by his voice characteristics and the spoken utterance that is prompted by the system. The user is asked to speak 4 randomized combination locks prompted by the system. Each speaker undergoes an enrollment phase of approximately 6 min and uses a limited vocabulary. The first technique investigates a parallel computation of text-independent speaker verification and speaker independent speech recognition. A decision logic combines the single scores of the decoupled identifiers. The second method consists of parallel speaker specific speech recognizers which perform both tasks.

Nakagawa et al. [2006] propose a combined technique comprising two statistical models. The first model is a common GMM known from the speaker identification literature and the second one is an HMM well known from speech recognition. Both models have shown their strengths in representing speaker and speech information. The second model is motivated by the fact that temporal transitions between different phonemes are not modeled by GMMs. They can be captured by approaches coming from the speech recognition research. At run-time both systems run in parallel and a convex combination of the independent scores integrates both the text-independent speaker information and speech related knowledge. The speaker with the highest score is selected.

The identification of the actually spoken phonemes and the knowledge that speaker identification is directly influenced by certain phoneme groups allow constructing more precise speaker models at the expense of a more complex training phase. Time latency is a further critical issue for the use case of this book. In general.4 First Conclusion 69 3. The algorithms belonging to audio signal segmentation usually offer the advantage of statistical models of low complexity which only require a small number of parameters to be estimated. The critical issues are the self-organization of the complete system. speech and speaker shall be recognized simultaneously to allow real-time processing on embedded devices. The intended functionality of the desired solution is sketched in Fig.3. The final result is often obtained by the ML criterion and does not provide confidence measures. This makes short-term speaker adaptation feasible but also limits the capability to represent the individual properties of a speaker which could be modeled using long-term speaker adaptation. Phoneme based speaker identification introduces a new aspect into speaker identification.3. However. Once again the separate modeling of speech and speakers and the application of multi-stage procedures may be of disadvantage.4 First Conclusion All approaches touched on in the preceding sections have advantages and drawbacks with respect to the use case of this book. Furthermore. However. it seems not possible to construct highly complex statistical models for speech and speaker characteristics in an unsupervised manner without sophisticated initial models [Kwon and Narayanan. Discrete decisions and the lack of confidence measures may be additional drawbacks. some implementations apply a strict multi-step procedure with separate modules. This information flow is usually unidirectional. The goal of this book is to present a solution which integrates speech and speaker information into one statistical model. prior knowledge is provided in form of an initial model which can be adapted to particular speakers. For example. Each module receives only the discrete results from the preceding stages. detecting unknown speakers and avoiding a supervised training phase. Multi-stage speaker identification overcomes the limitation of simple statistical models. 2005]. a training phase for each new speaker is usually obligatory. many of these algorithms bear the risk to be vulnerable against time-variant adverse environments which is the case for in-car applications. Furthermore. Some techniques tackle the problem of speaker change detection and speaker identification simultaneously and thus motivate a unified framework. 3. The system . most of these algorithms do not easily support to build-up a more complex system. Furthermore. A further drawback is the multiple computation of speech and speaker information even though similar statistical models and similar or even identical input are applied. Another strong point is the opportunity to omit the training to increase the usability of speech controlled devices.

speaker identification and speaker adaptation. The results of speaker identification and speech recognition trigger speaker adaptation either to adjust the codebooks of known speakers (in-set) or to initialize new codebooks (dotted line) in the case of a new speaker (out-of-set). has to handle speaker models at highly different training levels and has to provide a unified adaptation technique which permits both short-term and long-term adaptation. Interactions with a speech dialog should be possible. This number may be increased depending on the training level. The entire information is preserved by avoiding discrete decisions.70 3 Combining Self-Learning Speaker Identification and Speech Recognition Dialog Speaker symbolic level Speech In-set Codebooks Speaker adaptation Out-of-set sub-symbolic level Feature vectors Fig.3 Draft of the target system’s functionality. 3. . Confidence measures in form of posteriori probabilities instead of likelihoods have to be established. Significantly higher speech recognition rates are expected compared to speaker independent speech recognition. The system does not start from scratch but is supported by a robust prior statistical model and adapts to particular speakers. only a few adaptation parameters can be reliably estimated. Several profiles or codebooks are used for speech recognition. In the beginning. In the sub-symbolic level feature vectors are extracted to be used in the symbolic levels. A speaker independent codebook (solid line) acts as template for further speaker specific codebooks (dashed line). The speech controlled system shall comprise speech recognition. Speaker identification has to be realized on different time scales.

The key issue is to limit the effective number of adaptation parameters in the case of limited data and to allow a more individual adjustment later on.5. and W. Minker: Self-Learning Speaker Identification.4 Combined Speaker Adaptation The first component of the target system is a flexible yet robust speaker adaptation to provide speaker specific statistical models. 71–82. pp. c Springer-Verlag Berlin Heidelberg 2011 springerlink. F. The benefit of robust speaker adaptation for speech recognition will be demonstrated by experiments under the condition that the speaker identity is given. this adaptation technique will serve as the base for the target system comprising speaker identification.1 Motivation Initializing and adapting the codebooks of the automatic speech recognizer introduced in Sect. New speaker profiles can be initialized and existing profiles can be continuously adapted in an unsupervised way. 2. A first experiment has been conducted to investigate the improvement of a speech recognizer’s performance when speakers are optimally tracked so that the corresponding speaker models can be continuously adapted. a balanced adaptation strategy relying on short-term adaptation at the beginning and individual adaptation in the long-run as well as a smooth transition has to be implemented.3 is highly demanding since about 25. First. Herbig.com . In the subsequent chapters. T. 000 parameters are involved in adapting mean vectors. In this chapter a balanced adaptation strategy is motivated. SCT. Gerl. The algorithm is suitable for short-term adaptation based on a few utterances and long-term adaptation when sufficient training data is available. 4. speech recognition and speaker adaptation. it is assumed that speaker identification and speech recognition deliver a reliable guess of the speaker identity and the transcription of the processed speech utterance. Thus. Error propagation due to maladaptations is neglected in this chapter and will be addressed later.

. The key issue of the suggested adaptation scheme is to start from the prior distribution of μEV and to derive the resulting data-driven adaptation strategy. . . .1) (4. Prior knowledge about speaker adaptation effects.78). The initial parameter set 0 0 Θ0 = w1 .2.72 4 Combined Speaker Adaptation 4. μ0 . Since the transition from EV adaptation to MAP is crucial for the system presented in this work. . Kuhn et al.6 an overview of several well-known adaptation techniques which capture the voice characteristics of a particular speaker was given: Short-term adaptation such as EV adaptation imposes a tied adaptation scheme for all Gaussian densities of a codebook. This may explain their results for MLED=>MAP which performed worse on isolated letters compared to pure EVs in the unsupervised case. Speech decoding provides an optimal state s alignment and determines the state dependent weights wk of a particular codebook. Then the codebook is adapted based on the training data x1:T of the target speaker i. In Sect. Θ0 ) = QML(Θi . In the following derivation the auxiliary function of the EM algorithm is extended by a term comprising prior knowledge: QMAP (Θi . Furthermore. However. 2. They do not report any tuning of the interpolation factor for the MAP estimate. However. .3) . the problem of efficient data acquisition for continuous adaptation in an unsupervised system will be addressed. Σ0 1 N 1 N (4. Only a few parameters are required. Only one iteration of the EM algorithm is calculated. MAP adaptation is expected to be inefficient during the start or enrollment phase of a new speaker model as explained in Sect.40) and (2.6. .6 an optimal combination of short-term and long-term adaptation was discussed. In the following a simple interpolation of short-term adaptation realized by eigenvoices and long-term adaptation based on the MAP technique is investigated. Σ0 . the EMAP algorithm requires a costly matrix inversion and is considered to be too expensive for the purpose of this book. Robustness will be gained by the initial modeling of a speech recognizer based on SCHMMs. gender. 2. . . 2.3. [2000] present results mostly for supervised training for MAP using an EV estimate as prior. Θ0 ) · log (p(xt .2) = t=1 k=1 p(k|xt . . . wN .6. it will be derived in closer detail. Details will be provided on how to handle new speaker specific data and how to delay speaker adaptation until a robust guess of the speaker identity is available. k|Θi )) + log (p(Θi )) as given in (2.6. Θ0 ) + log (p(Θi )) T N (4. channel impulse responses and background noises contributes to the robustness of EV adaptation.2 Algorithm In Sect. . Eigenvoices have shown a very robust behavior in the experiments which will be discussed later. μ0 . The state variable s is omitted. Long-term adaptation implemented by the MAP adaptation permits individual modifications in the case of a large data set. 2. k A two-stage procedure is applied which is comparable to the segmental MAP algorithm in Sect.

In the learning phase which is characterized by limited training data the prior knowledge incorporated into the eigenvoices still allows efficient codebook initialization and continuous adaptation. . .7) can be rewritten by Σk · (ΣEV )−1 + nk · I · μopt = Σk · (ΣEV )−1 · μEV + nk · μML k k k k k (4. Σ0 . the speaker index i is omitted. the EV approach already contains prior knowledge as . However. μN )) = k=1 log N μk |μEV . The matrix Σk · (ΣEV )−1 makes use of k prior knowledge about statistical dependencies between the elements of each mean vector. The optimization problem is solved k by (A. .2. . . k k k k k k (4. A.6) log (2π) d log |ΣEV | − k 2 k=1 where the covariance matrix ΣEV represents the uncertainty of the EV k adaptation. . A prior Gaussian distribution is assumed N log (p(μ1 . only ML estimates are applied to adjust codebooks. 1 N 1 N (4. ΣEV k k 1 2 N (4.4) Subsequently.7) When nk approaches infinity. Since only mean vectors are adapted. . μk and Σk denote the speaker specific mean vectors to be optimized and the covariance matrices of the standard codebook.26): nk · Σ−1 · μML − μopt = (ΣEV )−1 · μopt − μEV . μEV is employed as prior information k about the optimized mean vector μopt .2 Algorithm 73 is given by the standard codebook. In the next step both sides of (4. For reasons of simplicity. μi . The derivation can k be found in Sect.5) =− 1 − 2 μk − μEV k k=1 T · (ΣEV )−1 · μk − μEV k k N N k=1 (4. wN . .8) where I denotes the identity matrix. the prior distribution of the mean vector μk can be factorized or equivalently the logarithm is given by a sum of logarithms. For the following equations it is assumed that each Gaussian distribution can be treated independently from the remaining distributions. . QMAP is maximized by taking the derivative with respect to the mean vector μk and by calculating the corresponding roots μopt . . the following notation is used for the speaker specific codebooks: 0 0 Θi = w1 . . . . . Thus. μi .7) are multiplied by Σk .4. In this context. . . Equation (4. Σ0 .

3. Furthermore. the EV adaptation was only calculated in the experiments carried out when sufficient speech data ≥ 0.116). the term MAP adaptation refers to the implementation discussed in Sect. The corresponding weights are calculated according to (2. Θ0 ) 1 nk T (4.10) (4. Since the ML estimates μML are weighted sums of all k observed feature vectors assigned to a particular Gaussian distribution. When sufficient speaker specific data is available.5 sec was available. If the approximation Σk · (ΣEV )−1 = λ · I. For sufficiently large nk there is no difference since μMAP k and μML converge.74 4 Combined Speaker Adaptation described in Sect. This may k reduce the performance of speaker identification and speech recognition.116). EV-MAP adaptation denotes the adaptation scheme explained in this chapter.6.6. Thus.. k λ = const (4. 2. nearly untrained Gaussian densities do not provide reliable ML estimates.9) is used.116). Only the sufficient statistics of the standard codebook T nk = t=1 p(k|xt .116) are replaced here by MAP estimates as given in (2. However. 2.5. the solution of the optimization problem becomes quite easy to interpret [Herbig et al. 2010c]: μopt = (1 − αk ) · μEV + αk · μML k k k nk αk = . If nk is small for a particular Gaussian distribution. It may be assumed that any matrix multiplication can be discarded. a 1 In the following.11) Since this adaptation scheme interpolates EV and ML estimates within the MAP framework. Θ0 ) · xt t=1 (4.13) are required. the ML estimates in (2. k The latter is decomposed into a linear combination of the eigenvoices as defined in (2. . Those mean vectors have only a limited influence on the k k result of the set of linear equations in (2. nk + λ (4.6.12) μML = k p(k|xt .101). this computation may lead to diverging values of μML and may force extreme weighting factors in (2. In fact. the k MAP estimate guarantees that approximately the original mean vector is used μMAP ≈ μ0 . it is called EV-MAP adaptation in the following. 2. The ML estimates have to be continuously updated so that μEV and μopt k k can be computed. the convex combination approaches ML estimates such as the MAP adaptation1 described in Sect. Few data causes this adaptation scheme to resemble the EV estimate μEV .85).3.

2009]. Finally.2 Algorithm 75 recursive computation is possible. The original equation is re-written by 2 A study of intra-session and inter-session variability can be found by Godin and Hansen [2010]. A permanent storage may be undesirable. Botterweck [2001] describes a similar combination of MAP and EV estimates.17) · μML − μEV . The prior knowledge is represented ˇ by ΣEMAP . The updated ML estimates μML are k weighted sums of the preceding estimates μML and an innovation term ¯k T nk = nk + ¯ t=1 p(k|xt .122). This allows efficient long-term speaker tracking which will be discussed later. Speaker specific adaptation data can be collected before a guess of the speaker identity is available. EMAP requires the computation and especially the matrix inversion in the supervector space. since speaker k characteristics significantly change over time2 [Furui.15) where nk denotes the number of the assigned feature vectors up to the last ¯ utterance. Combining EV and ML estimates takes benefit from statistical bindings in the EV estimation process.8) is equivalent to Σk · (ΣEV )−1 + nk · I · μopt k k = Σk · (ΣEV )−1 + nk · I · μEV + nk · μML − μEV .4.16) to obtain the following · μML − μEV k k (4. A concrete realization should apply an exponential weighting window for μML and nk . Θ0 ) · xt nk nk t=1 (4.122) identifies a similar structure of both adaptation ˇ terms. Despite these structural k similarities. there is a basic difference between EMAP and the adaptation scheme proposed here. . k k (4. The matrix ΣEV corresponds to ΣEMAP . k k k k Both sides are multiplied by Σk · (ΣEV )−1 + nk · I k term μopt = μEV + nk · Σk · (ΣEV )−1 + nk · I k k k which is equivalent to μopt = μEV + nk · ΣEV · Σk + nk · ΣEV k k k k −1 −1 −1 (4. However.14) μML = k nk ¯ 1 · μML + ¯k · p(k|xt . Θ0 ) T (4.18) A comparison with (2.8) may be discussed with respect to the EMAP solution in (2. the solution of (4. Equation (4. In this book no window is applied because time-variance is not a critical issue for the investigated speech database. this approach works in the feature space which is characterized by a lower dimensionality.

The Lombard effect is considered.76 4 Combined Speaker Adaptation ˇ ˇ μ = μ0 + P · μMAP − μ0 + ˇ ˇ ˇ L αl · ˇEV el l=1 (4.3. The sound files were down-sampled from 16 kHz to 11. the coefficients of the EVs αl have to be determined by optimizing the auxiliary function of the EM algorithm. [2002]. The recordings were done under realistic conditions and environments such as home. the database is introduced and the composition of the selected subset is explained.6. The matrix P guarantees that the standard MAP estimate only takes effect in the directions orthogonal to all eigenvoices.3 Evaluation Several experiments have been conducted with different speaker adaptation methods to investigate the benefit which can be achieved by a robust speaker identification. office. This results in at least 250 utterances per speaker which are ordered in the sequence of the recording session. Due to the matrix multiplication it is more expensive than the solution derived above.19) to agree with the notation of Sect. Only AKG microphone recordings were used for the experiments. This subset is subsequently used for all evaluations. Both read and spontaneous speech are covered. 4. It is employed to simulate realistic tasks for self-learning speaker identification and speech recognition. The results are presented and discussed in detail. a subset of the US-SPEECON database is used for evaluation. 3 The general description of the SPEECON database follows Iskra et al. Digit and spelling loops were kept irrespective of their length. Colloquial utterances with more than 4 words were removed to obtain a realistic command and control application. Adaptation is performed in the supervector space. The database comprises about 20 languages and is intended to support the development of voice-driven interfaces of consumer applications. The evaluation setup is described in more detail. First. 4. In this book. Using the MAP framework. .1 Database SPEECON3 is an extensive speech database which was collected by a consortium comprising several industrial partners. public places and in-car. Each language is represented by 600 speakers comprising 550 adult speakers. Finally. the results for closed-set experiments are presented. 2. This subset comprises 73 speakers (50 male and 23 female speakers) recorded in an automotive environment. μ0 represents all mean vectors of the stanˇ ˇ dard codebook in supervector notation. Utterances containing mispronunciations were also discarded.025 kHz.

1. The speaker specific codebooks are continuously adapted. No additional information concerning speaker changes and speaker identities will be given in Chapter 5 and 6. . another evaluation set comprising 30 different speaker sets is used. 000 command and control utterances).4. This results in 75. Baseline The speech recognizer without any speaker adaptation or speaker profiles acts as the baseline for all subsequent experiments. For closed sets the system has to identify a speaker from a list of enrolled speakers. Test Set The evaluation was performed on 60 sets. For each set all utterances of each speaker are used only once. After this enrollment at least 5 utterances are spoken before the speaker changes. In an open-set scenario the system has to decide when a new speaker model has to be added to the existing speaker profiles.3. the first two utterances of each new speaker are indicated during enrollment.2 Evaluation Setup In the following the configuration of the test set. For testing the corresponding grammar was selected prior to each utterance. Thus. The corresponding speaker specific codebook has to compete against existing speaker models. Furthermore. baseline and speaker adaptation is explained. 250 utterances originating from 5 speakers. Finally. every set provides 1. The speech recognizer applies grammars for digit and spelling loops as well as for numbers. It consists of 10 utterances for each speaker as shown in Fig. The speech recognizer was trained as described by Class et al. 4 When speaker sets with 10 enrolled speakers are examined in Chapter 6.3 Evaluation 77 4. 4. From one utterance to the next the probability of a speaker change is approximately 10 %. [1993]. 000 utterances used for evaluation. the evaluation measures for speech recognition and speaker identification are introduced. The composition of female and male speakers in a group is not balanced. error propagation and long-term speaker adaptation. The configuration of the test investigates the robustness of the start phase. At the beginning of each speaker set an unsupervised enrollment takes place. The order of the speakers in each set are chosen randomly. Thus. Each set comprises 5 enrolled speakers4 to ensure independence of the speaker set composition which is chosen randomly. a grammar was generated which contains all remaining utterances (≈ 2. For the remaining 8 utterances the system has to identify the speaker in an unsupervised manner. In Chapter 5 and 6 closed-set and open-set scenarios will be discussed.

The mean vectors of each codebook were stacked into long vectors. The eigenvoices were calculated using PCA. . It guarantees that speaker changes are 5 The USKCP is a speech database internally collected by TEMIC Speech Dialog Systems. Sp 5 Test Sp 2 . First. the speaker identity is only given in the supervised parts (diagonal crossed boxes) of Fig. Sp 1 (b) Open-set.14) and (4. The USKCP comprises command and control utterances recorded in an automotive environment. Germany. 4. .78 4 Combined Speaker Adaptation Sp 1 Enrollment Sp 2 Sp 3 Sp 4 Sp 5 Sp 3 Enrollment Sp 1 Sp 2 Sp 3 Sp 4 Sp 5 Sp 3 Sp 5 Test Sp 2 . Fig. 4. Sp 1 (a) Closed-set. . The setup consists of two blocks. Then speakers are randomly selected. For this purpose codebooks were trained by MAP adaptation for a large set of speakers of the USKCP5 development database. a start phase is used in which new speaker specific codebooks are initialized.1(a).1 Block diagram of the setup which is used for all experiments of this book. short-term adaptation is implemented by a modified EV approach. . However. Speaker Adaptation In an off-line training the main directions of speaker variability in the feature space were extracted. At least 5 utterances are spoken by each speaker before a speaker turn takes place. A smoothing of the weighting factors α is applied by introducing an exponential weighting window in (4. . For the remaining utterances speaker identification is delegated to the speech recognizer or a common speaker identification system. The language is US-English. Ulm. To simulate continuous speaker adaptation without speaker identification.15).

In Sect. an upper bound for unsupervised speaker adaptation is investigated [Herbig et al. However.3 the evaluation measures are described in detail. Only the intervals of the best and the worst result are given. For λ ≈ 0 the adaptation relies only on ML estimates. The results show a significant improvement of WA when the EVMAP speaker adaptation is compared to baseline and short-term adaptation. The typical error is given by a confidence interval assuming independent data sets. WI and WD denote the number of substituted. In Table 4. In contrast. When a test is repeated on an identical test set. 2010c]. speech recognition is still affected by frequent speaker changes. It determines the optimal WA of the speech recognizer which can be achieved by a perfect speaker identification combined with several adaptation strategies.. feature extraction is externally controlled so that energy normalization and mean subtraction are updated or reset before the next utterance is processed.3 Results Before a realistic closed-set scenario is examined in detail. 4.84) and (2. The latter case is expected to limit the improvement of word accuracy. a paired difference test may be more appropriate to evaluate the statistical significance. 4.4. The tuning parameter λ in (4. the performance of MAP adaptation is tested for dedicated values of the tuning parameter η used in (2. 1996]. A. λ → ∞ is the reverse extreme case since EV estimates are used irrespective of the amount of speaker specific data. Speaker identification is evaluated by the identification rate..11) is evaluated for specific values.2 the speech recognition rates are compared for the complete evaluation test set under the condition that the speaker identity is known.3 Evaluation 79 captured within approximately 5 utterances if no speaker identification is employed.3.85). 4. In addition. Nevertheless.2 are compared to the baseline system. 88. In addition.94 % WA .1 and Fig. Only about 6 % relative error rate reduction is achieved by short-term adaptation compared to the baseline. Short-term adaptation and the speaker adaptation scheme introduced in Sect.20) W represents the number of words in the reference string and WS . Evaluation Measures Speech recognition is evaluated by the Word Accuracy (WA) WA = 100 · 1 − WS + WI + WD W %. (4. inserted and deleted words in the recognized string [Boros et al. unsupervised speaker adaptation has to handle errors of speech recognition. EV adaptation combined with perfect speaker identification yields about 20 % relative error rate reduction.

87 88.62 88. In Table 4.13 88. Obviously. This adaptation is based only on a few utterances.44 88. Since the evaluation is done on different speech utterances the recognition rates and improvements are only comparable within each evaluation band.90 88.25 and 25 % relative error rate reduction with respect to the speaker independent baseline are achieved using λ = 12. but suffer from poor performance during the . Table is taken from [Herbig et al.57 ±0. On the first utterances EV adaptation is superior to ML estimates which will become obvious in the following chapters. Short-term adaptation without speaker identification is realized by EV adaptation with an exponential weighting window. In the long-run MAP or ML estimates allow more individual codebook adaptation. Speaker adaptation Baseline Short-term adaptation EV-MAP adaptation ML (λ ≈ 0) λ=4 λ=8 λ = 12 λ = 16 λ = 20 EV (λ → ∞) MAP adaptation η=4 η=8 η = 12 Typical errors min max WA [%] 85. The results for pure EV or MAP adaptation are worse as expected..22 ±0.2 the evaluation is split into several evaluation bands representing different adaptation stages. They outperform the EV approach in the long run. The implementations which interpolate EV and ML estimates yield consistent results for all evaluation bands.23 86. EV-MAP speaker adaptation is tested for several dedicated values of the parameter λ. The baseline is the speaker independent speech recognizer without any speaker adaptation.19 88. λ can be selected in a relatively wide range without negatively affecting the performance of the speech recognizer..62 88.86 88.86 88.94 88. moderately and extensively trained codebooks. The effect of speaker adaptation is evaluated for weakly. 2010c].80 4 Combined Speaker Adaptation Table 4. On the evaluation bands I and III the main difference between ML estimates and EV adaptation can be seen. MAP adaptation yields better results than the ML or EV approach [Herbig et al.1 Comparison of different adaptation techniques under the condition of perfect speaker identification. 2010c]. Speaker pools with 5 speakers are investigated.

33 Speaker adaptation learning phase.80 λ=8 88. In the following chapters it will be referred to this experiment as an upper bound for tests under realistic conditions where the speaker has to be identified in an unsupervised way.89 87.51 ±0.2 Results for speaker adaptation with predefined speaker identity.75 Typical errors min ±0. Furthermore.98 89.2 Comparison of different adaptation techniques under the condition of perfect speaker identification.87 EV-MAP adaptation ML (λ ≈ 0) 87.17 88.77 89.43 89.91 88.36 88.22 85.35 89. Table 4.00 MAP adaptation η=4 87. I II III WA [%] WA [%] WA [%] Baseline 84. Speech recognition results are presented for several evaluation bands .95 88.99 89. 50[.62 λ = 12 88.29 max ±0.73 η = 12 87.69 89.97 88. The evaluation is split into several evaluation bands according to the adaptation stage in terms of recognized utterances. It should be emphasized that this experiment was conducted under optimal conditions. III = [100. 250] utterances.79 λ = 16 88.54 ±0.67 η=8 87.54 88.03 88.67 88. Therefore the EV approach performs worse than ML estimates on the complete test set.56 89.84 85.3 Evaluation 90 89 WA [%] 88 87 86 85 81 BL ML 4 8 12 16 20 EV ST λ Fig.4. 100[.84 EV (λ → ∞) 87.48 ±0.83 88. II = [50. The ML approach will suffer from poor .02 Short-term adaptation 85.37 89.42 88.51 89.58 ±0. the speaker independent baseline (BL) and short-term adaptation (ST) are depicted. 4.51 86.I = [1.74 λ = 20 88.57 89.50 λ=4 88.

ML estimates do not have to be re-calculated as being already available from the EV adaptation. this assumption is expected to limit the application of some adaptation techniques to use cases where the speaker has to be identified in an unsupervised manner. 2.82 4 Combined Speaker Adaptation performance when the number of utterances is smaller than ≈ 40 as will be shown in Chapter 6. When the system has perfect knowledge of the identity of the current speaker. several techniques for speaker identification are discussed in the next chapters to approach this upper bound. The interpolation of EV and ML estimates only requires to calculate the weighting factors αk and the sum of the weighted estimates for each Gaussian density according to (4. 2. The conditions of this experiment may be realistic for some systems: A system may ask each speaker to identify himself when he starts a speaker turn or when the system is reset. Thus. Speaker specific codebooks can be initialized and continuously adapted based on the standard codebook. However. The algorithm provides a data-driven smooth transition between globally estimated mean vectors and locally optimized mean vectors permitting individual adjustment. 4.5 have performed worse than the combined approach since a fast convergence during the initialization and an individual training on extensive adaptation data are required as well as an optimal smooth transition. In this case the recognition accuracy is expected to be limited in the long run. Both the MAP and EV adaptation described in Sect.3 and Sect.10).6.6. a relative error rate reduction of 25 % can be achieved via the adaptation techniques described here.4 Summary A flexible adaptation scheme was presented and discussed. . Computational complexity is negligible compared to the EMAP algorithm.

83–113. SCT. In the preceding chapter a combination of short-term and long-term adaptation was examined. Gerl. The basic architecture of the target system including the modules speaker identification. Speaker variabilities are handled by an unsupervised speech controlled system comprising speech recognition and speaker identification. Then a unified speaker identification and speech recognition method is presented to determine the speaker’s identity and to decode the speech utterance simultaneously. both systems are discussed with respect to the problem of this book and an extension of the speaker identification method is motivated. A unified modeling is described which handles both aspects of speaker and speech variability. The fundamentals and existing solutions as known from literature were discussed in detail. pp. F. This technique does not force the user to identify himself. The experiments carried out showed a significant improvement for speech recognition if speaker identity is known or can be at least reliably estimated.1 Motivation In the preceding chapters the effects of speaker variability on speech recognition were described and several techniques which solve the problem of speaker T. Both implementations are evaluated for an in-car application. A reference implementation is then described. speech recognition and speaker adaptation is briefly discussed. and W.5 Unsupervised Speech Controlled System with Long-Term Adaptation Chapters 2 and 3 provided a general overview of problems related to the scope of this book. It is used as a reference implementation for the subsequent experiments. 5. It comprises a standard speaker identification technique in parallel to the speaker specific speech recognition as well as speaker adaptation of both statistical models. c Springer-Verlag Berlin Heidelberg 2011 springerlink. Herbig. In this chapter a flexible and efficient technique for speaker specific speech recognition is presented.com . Minker: Self-Learning Speaker Identification. Finally. The discussion starts with the problem to be solved and motivates the new approach. Speaker profiles can be initialized and continuously adapted under the constraint that the speaker has to be known a priori.

One feature extraction and only one statistical model are employed for speaker identification and speech recognition. The main goal is now to construct a complete system which includes speaker identification. BIC is based on single Gaussian distributions whereas the speech recognizer introduced in Sect. Models used in speaker change detection.i W 1:NW .3 is realized by a mixture of Gaussian distributions with an underlying Markov model.5. iMAP = arg max {p(W 1:NW .1 Unified speaker identification and speaker specific speech recognition. Each of the approaches described covers a part of the use case in this book. The parameter set Θi of the speaker specific codebook is omitted. An efficient implementation with respect to computational complexity and memory consumption is targeted.2) = arg max {p(W 1:NW |i. x1:T ) · p(i|x1:T )} as found by Nakagawa et al. Simultaneous speaker identification and speech recognition can be described as a MAP estimation problem W MAP . The associated speaker profile enables speech decoding. speaker identification and speech recognition are of increasing complexity. The estimated speaker identity and the transcription of the spoken phrase are used to initialize and continuously adapt speaker specific models. . i|x1:T )} 1:NW W 1:NW . [2006].84 5 Unsupervised Speech Controlled System with Long-Term Adaptation tracking and unsupervised adaptation were outlined. An approach is subsequently presented which is characterized by a mutual benefit between speaker identification and speech recognition. Computational load and latency are reduced by an on-line estimation of the speaker identity. 2.i (5. The MAP criterion is applied to the joint posterior probability of the spoken phrase W 1:NW and speaker identity i given the observed feature vectors x1:T .1) (5. speech recognition and speaker adaptation. Ideally. Speaker identification and speech recognition of successive utterances are enhanced. For each feature vector the most probable speaker is determined. it shall be operated in a completely unsupervised manner so that the user can interact with the system without any additional intervention. Speaker Front-End xt Speech Recognition Adaptation Transcription Fig. 5.

5.2 Joint Speaker Identification and Speech Recognition

85

The goal is to show in this book that the problem of speaker variability can be solved by a compact statistical modeling which allows unsupervised simultaneous speaker identification and speech recognition in a self-evolving system [Herbig and Gerl, 2009; Herbig et al., 2010d]. The subsequent sections describe a unified statistical model for both tasks as an extension of the speaker independent speech recognizer which was introduced in Sect. 2.5.3. It is shown how a robust self-learning speaker identification with enhanced speech recognition can be achieved for a closed-set scenario. The system structure is depicted in Fig. 5.1.

5.2

Joint Speaker Identification and Speech Recognition

In the preceding chapter an appropriate speaker adaptation technique was introduced to initialize and adapt codebooks of a speech recognizer. However, the derivation of the adaptation scheme was done under the condition that speaker identity and the spoken text are known. Now speaker specific speech recognition and speaker identification have to be realized. The implementation of an automated speech recognizer presented in Sect. 2.5.3 is based on a speaker independent codebook. As shown in Fig. 2.11 the feature vectors extracted by the front-end are compared with a speaker independent codebook. For each time instance the likelihood p(xt |k, Θ0 ) is computed for each Gaussian distribution with index k. The resulting vector qt given by (2.67) and (2.68) comprises all likelihoods and is used for the subsequent speech decoding. When the speaker’s identity is given, speaker adaptation can provide individually adapted codebooks which increase the speech recognition rate for each speaker. Therefore, the speech recognizer has to be able to handle several speaker profiles in an efficient way. In Fig. 5.2 a possible realization of an advanced speaker specific speech recognizer is shown. The speech recognizer is extended by NSp speaker specific codebooks which are operated in parallel to the speaker independent codebook. This approach allows accurate speech modeling for each speaker. However, the parallel computation significantly increases the computational load since the speech decoder has to process NSp + 1 data streams in parallel. The Viterbi algorithm is more computationally complex than a codebook

Front-end

xt

Speaker specific codebooks

qi t

Speech decoder

Fig. 5.2 Block diagram for a speaker specific ASR system. NSp speaker specific codebooks and speech decoders are operated in parallel to generate a transcription for each speaker profile.

86

5 Unsupervised Speech Controlled System with Long-Term Adaptation

selection strategy on a frame-level, especially for a large set of speaker specific codebooks operated in parallel. Thus, it seems to be advantageous to continuously forward only the result qi of the codebook belonging to the current speaker i. A solution witht out parallel speech decoding for each codebook is targeted. Such an approach, however, requires the knowledge of the current speaker identity for an optimal codebook selection. The computational complexity can be significantly decreased by a twostage approach. The solution of (5.2) can be approximated by the following two steps. First, the most probable speaker is determined by iMAP = arg max {p(i|x1:T )}
i

(5.3)

which can be implemented by standard speaker identification methods. In the second step, the complete utterance can be reprocessed. The corresponding codebook can then be employed in the speech decoder to generate a transcription. The MAP estimation problem W MAP = arg max 1:NW
W 1:NW

p(W 1:NW |x1:T , iMAP )

(5.4)

is then solved given the speaker identity iMAP . For example, the most likely word string can be determined by the Viterbi algorithm which was described in Sect. 2.5.2. The disadvantages of such an approach are the need to buffer the entire utterance and reprocessing causing latency. Thus, speaker identification is implemented here on two time scales. A fast but probably less confident identification selects the optimal codebook on a frame-level and enables speaker specific speech recognition. In addition, speaker identification on an utterance-level determines the current speaker and provides an improved guess of the speaker identity for speaker adaptation. The latter is performed after speech decoding. The goal is to identify the current user and to decode the spoken phrase in only one iteration [Herbig et al., 2010d].

5.2.1

Speaker Specific Speech Recognition

An appropriate speech model has to be selected on a frame-level for speech decoding. Since high latencies of the speech recognition result and multiple recognitions have to be avoided, the speech recognizer has to be able to switch rapidly between speaker profiles at the risk of not always applying the correct speech model. On the other hand, if the speech recognizer selects an improper speaker specific codebook because two speakers yield similar pronunciation patterns, the effect on speech recognition should be acceptable and no serious increase in the error rate is expected. False speaker identifications lead to mixed codebooks when the wrong identity is employed in speaker adaptation. It seems to

5.2 Joint Speaker Identification and Speech Recognition

87

Standard Codebook φ0 t

p(xt |Θ0 ) qi=0 t

qi=1 t

φ0 t Front-End xt Speaker 2 φ0 t . . . φ0 t Speaker NSp p(xt |ΘNSp ) qt
i=NSp

p(xt|Θ2 ) qi=2 t

ifast = arg maxit {p(it|x1:t)} t

Speaker 1

p(xt |Θ1 )

qifast t

p(xt|Θi ) qi t

φ0 t

Fig. 5.3 Implementation of the codebook selection for speaker specific speech recognition. Appropriate codebooks are selected and the corresponding soft quani tization qtfast is forwarded to speech decoding. Figure is taken from [Herbig et al., 2010d].

be evident that this kind of error can have a more severe impact on the longterm stability of the speech controlled system than an incorrect codebook selection for a single utterance. Subsequently, a strategy for codebook selection is preferred. It is oriented at the match between codebooks and the observed feature vectors xt . Since a codebook represents the speaker’s pronunciation characteristics, this decision is expected to be correlated with the speaker identification result. Class et al. [2003] describe a technique for automatic speaker change detection which is based on the evaluation of several speaker specific codebooks. This approach is characterized by low computational complexity. It is extended in this book by speaker specific speech recognition and simultaneous speaker tracking. Fig. 5.3 displays the setup of parallel codebooks comprising one speaker independent and several speaker specific codebooks. The standard codebook with index i = 0 is used to compute the match q0 between this codebook t and the feature vector xt . To avoid latencies caused by speech decoding, only codebooks are investigated and the underlying Markov models are not used. Therefore, the statistical models are reduced in this context to GMMs. Even s in the case of an SCHMM the weighting factors wkt are state dependent. The state alignment is obtained from the acoustic models during speech decoding which requires the result of the soft quantization qi . The codebooks are t further reduced to GMMs with uniform weighting factors

Thus. Speaker specific codebooks emerge from the standard codebook and speaker adaptation always employs the feature vector assignment of the standard codebook as described in Sect. Subsequently. 2010d].9) by evaluating this subset φ0 .1 t. 2010d]. . the computational load for the evaluation of NSp + 1 codebooks on a framelevel can be undesirable for embedded systems. The complexity of likelihood computation is t reduced because only N + NSp · Nb instead of N + NSp · N Gaussian densities have to be considered. . kt.5) for reasons of simplicity [Herbig et al. Θi ). Only the corresponding soft quantization qifast t is further processed by the speech decoder.1 . For Nb = 10 no eminent differences between the resulting likelihood p(xt |Θi ) = and p(xt |Θi ) ≈ 1 N 1 N N N xt |μi . Σ0 k k k∈φ0 t (5. each codebook generates a vector qi and a t scalar likelihood p(xt |Θi ) which determines the match between codebook and observation. .8) represents the indices of the associated Nb Gaussian densities relevant for the likelihood computation of the standard codebook. Θ0 ) for the standard codebook. p(xt |k = φ0 b . . Θi ) t t. . A decision logic can follow several strategies to select an appropriate speaker specific codebook. Subsequently.Nb t T (5. . Even though the parallel speech decoding of NSp + 1 codebooks is avoided.88 5 Unsupervised Speech Controlled System with Long-Term Adaptation wk = 1 N (5. . . the subset φ0 can be expected to be a reasonable selection t for all codebooks.2. The ML criterion can be applied so that in each time instance the codebook is selected with the highest likelihood value . Each codebook generates a separate vector qi ∝ p(xt |k = φ0 .6) N xt |μi . 4. Σ0 k k k=1 (5.N T (5.. Thus. the vector 0 0 φ0 = kt. Only those Gaussian densities are considered which generate the Nb highest likelihoods p(xt |k. Finally.7) have been observed in the experiments which will be discussed later [Herbig et al. a pruning technique is applied to the likelihood computation of speaker specific codebooks. three methods are described: One possibility may be a decision on the actual likelihood value p(xt |Θi )..

Due to the knowledge about the temporal context. the iid assumption is used for reasons of simplicity.2.5. The posterior probability p(it |x1:t ) decomposes into the likelihood p(xt |it ) and the predicted posterior probability p(it |x1:t−1 ): p(it |x1:t ) = p(xt |it . e. However.12) (5. x1:t−1 ) · p(it |x1:t−1 ) . Codebook selection may then be realized by i ifast = arg max lt (5. the identification accuracy is expected to be limited by the exponential weighting compared to (2. p(xt |x1:t−1 ) ∝ p(xt |it ) · p(it |x1:t−1 ) t>1 (5.46). an unrealistic number of speaker turns characterized by many switches between several codebooks can be suppressed. especially when no re-processing is allowed. ifast ) (5.14) t i and speech recognition can be described again by (5. t i (5.15) (5.16) (5.13) = p(x1 |Θi ) i where lt denotes the local log-likelihood of speaker i.g.2 Joint Speaker Identification and Speech Recognition 89 ifast = arg max {p(xt |Θi )} . A local speaker identification may be implemented by i i lt = lt−1 · α + log (p(xt |Θi )) · (1 − α). In the experiments carried out an unrealistic high number of speaker turns could be observed even within short utterances. from male to female.11). A more advanced strategy for a suitable decision criterion might be to apply an exponential weighting window. 2. The speech recognizer has to be able to react quickly to speaker changes.2. Again. 0 < α < 1 (5. . 5.11) 1:NW 1:T W 1:NW where ifast describes the codebook selection given by the corresponding q1:T .10) The optimization problem for speech recognition can then be approximated by W MAP = arg max p(W 1:NW |x1:T .17) p(i1 |x1 ) ∝ p(x1 |i1 ) · p(i1 ). a decision logic based on Bayes’ theorem may be applied in analogy to the forward algorithm introduced in Sect.5.2. i l1 1 < t ≤ T. an unrealistic number of speaker turns on a single utterance has to be suppressed. Alternatively. The posterior probability p(it |x1:t ) is estimated for each speaker i given the entire history of observations x1:t of the current utterance. The sum of the log-likelihood values enables standard approaches for speaker identification as described in Sect. However. This implementation seems to be inappropriate since the speech decoder is applied to a series of qt originating from different codebooks. 1:T This criterion has the drawback that the codebook selection may change frame by frame.

21) In summary.90 5 Unsupervised Speech Controlled System with Long-Term Adaptation The likelihood p(xt |it . The codebook with maximal posterior probability is selected according to the MAP criterion: ifast = arg max {p(it |x1:t )} . The standard codebook is always evaluated in parallel to all speaker specific codebooks to ensure that the speech recognizer’s performance does not decrease if none of the speaker profiles is appropriate. t i (5. The initial distribution p(i1 ) allows for a boost of a particular speaker identity. the parameter set Θ is omitted here. given by p(i1 ) ∝ δK (i1 .g. Here the prior probability p(i1 ) can be directly initialized. (5. A smoothed likelihood suffers from not normalized scores which makes it more complicated to determine an initial likelihood value for a preferred or likely speaker. (5. during the initialization of a new codebook. The codebooks represent the pronunciation characteristics of particular speakers. For convenience. ifast ) (5.20) 1:NW 1:T W 1:NW which can be considered as an approximation of (5. only this statistical model is used for codebook selection. e. the discrete decision i0 from the last utterance and the prior speaker change probability pCh may be used for initialization.18) This approach yields the advantage of normalized posterior probabilities instead of likelihoods.g. can be easily integrated to stabilize codebook selection at the beginning of an utterance. Subsequently. The states of the Markov model used for codebook selection are given by the identities of the enrolled speakers. This Bayesian framework yields the advantage that prior knowledge about a particular speaker. i0 ) · (1 − pCh ) + (1 − δK (i1 . e. a smoothing of the posterior may be employed to prevent an instantaneous drop or rise. this representation is similar to a CDHMM. For example. from the last utterance. Since Markov models are not considered here. i0 )) · pCh . t > 1.2). e. .19) The transcription can be determined in a second step by applying the MAP criterion W MAP = arg max p(W 1:NW |x1:T . The predicted posterior distribution may be given by NSp p(it |x1:t−1 ) = it−1 =1 p(it |it−1 ) · p(it−1 |x1:t−1 ).g. caused by background noises during short speech pauses. x1:t−1 ) represents new information about the speaker identity given an observed feature vector xt and is regarded independently from the feature vector’s time history x1:t−1 . e.g.

Each speaker specific codebook can be evaluated in the same way as common GMMs for speaker identification. The following notation is used. the key idea is to use the codebooks to decode speech utterances and to identify speakers in a single step as shown in Fig. Σ0 ⎠ (5. #Φ . Thus.2. The log-likelihood Li denotes the accumulated log-likelihood of speaker i u and the current utterance with index u.2 Joint Speaker Identification and Speech Recognition 91 5.1. a more robust speaker identification can be used on an utterance-level. 5.2. Error propagation due to maladaptation seems to be more severe than an improper codebook selection on parts of an utterance. codebooks are expected to provide an appropriate statistical model of speech and speaker characteristics since HMM training includes more detailed knowledge of speech compared to GMMs. Only the simplification of equal weighting factors wk is imposed for the same reasons as in Sect. Since speaker adaptation is calculated after speech decoding. The speech recognition result may be employed to obtain a more precise speech segmentation. The likelihood computation can be further simplified by the iid assumption.2 Speaker Identification In the preceding section codebook selection was implemented by likelihood computation for each codebook and applying Bayes’ theorem. The weighting factors wk = N are omitted. 5. Furthermore. for convenience. this implementation of speaker identification suffers from speech pauses and garbage words which do not contain speaker specific information. Σ0 ⎠ (5. The set Φ contains all time instances which are considered as speech.23) u k k #Φ 0 t∈Φ k∈φt is calculated for each speaker as soon as a precise segmentation is available. Each utterance is represented by its feature vector sequence x1:Tu . The accumulated log-likelihood ⎛ ⎞ 1 Li = log ⎝ N xt |μi . However.22) u k k Tu Tu t=1 0 k∈φt 1 in analogy to Markov and Nakagawa [1996].5. The experiments discussed later clearly show the capability of speaker specific codebooks to distinguish different speakers. Since speaker identification is enforced after speech decoding. the techniques discussed in Sect.4. the scalar likelihood scores p(xt |Θi ) can be temporarily buffered.2. The log-likelihood is normalized by the length Tu of the recorded utterance which results in an averaged log-likelihood per frame ⎛ ⎞ T 1 1 u Li = log (p(x1:Tu |Θi )) = log ⎝ N xt |μi . 5. However.1 should be viewed as a local speaker identification.

respectively. p(xt |Θ0 ) and qi are employed t t for speaker identification and speech recognition. Each speaker specific codebook is evaluated on the corresponding Gaussian densities given by φ0 . 5.3 and speaker identification on an utterance-level.4 Joint speaker identification and speech recognition. . . 5. Part I and II denote the codebook selection for speaker specific speech recognition as depicted in Fig. Each feature vector is evaluated by the standard codebook which determines the subset φ0 comprising t the Nb relevant Gaussian densities. respectively. φ0 t Speaker NSp p(xt |Θ0 ) qi=0 t p(xt |Θ1 ) ifast = arg maxi(p(i|xt )) Σ Σ Σ qi=1 t p(xt |Θ2) qi=2 t qifast t p(xt |Θi ) qi t p(xt|ΘNSp ) qt i=NSp Σ Σ φ0 t Speech pause detection II islow = arg maxi(p(x1:Tu |Θi)) islow Fig.92 5 Unsupervised Speech Controlled System with Long-Term Adaptation I Standard Codebook φ0 t Speaker 1 φ0 t FrontEnd xt Speaker 2 φ0 t . .

For example.24) Conventional speaker identification based on GMMs without additional speech recognition cannot prevent that neither training. Since speaker changes are assumed to be rare. users can be expected to speak several utterances. The computational complexity of two separate algorithms is avoided. u u i (5.25) u u similar to Fortuna et al. (5. ∀i.5 the block diagram of the system is depicted. It is assumed that users have to press a push-to-talk button before starting to enter voice commands. To obtain a strictly unsupervised speech controlled system. The modified log-likelihood ensures a relatively robust estimate islow = arg max Li . The basic architecture of the speech recognizer is preserved. the best in-set speaker is determined according to (5. speech segmentation is neglected since at least a rough end-pointing is given. An additional speaker change detection is omitted. [2005]. 5. Speech pauses are identified by a VAD based on energy and fundamental frequency. adaptation nor test situations contain garbages or speech pauses. only the simple solution of one active feature extraction is considered.2. States for silence or speech pauses are included in the Markov models of the speech recognizer to enable a refined speech segmentation. Feature vectors are extracted from the enhanced speech signal to be used for speech recognition and speaker identification. Therefore. For each speaker the long-term energy and mean normalization is continuously adapted. 5. Automated speaker identification and speech recognition are now briefly summarized: • Front-end.3 System Architecture Now all modules are available to construct a first complete system which is able to identify the speaker and to decode an utterance in one step. Standard speech enhancement techniques are applied to the speech signal to limit the influence of the acoustic environment. It can be easily implemented by the following threshold decision Li − L0 < θth . The speech recognizer dominates the choice of feature vectors because speaker identification is only intended to support speech recognition as an optional component of the complete system. When a speaker turn . Speaker specific codebooks are initialized and continuously adapted as introduced in the preceding chapter. In Fig.2 Joint Speaker Identification and Speech Recognition 93 denotes the corresponding number of feature vectors containing purely speech.24). The corresponding log-likelihood ratio is then tested for an out-of-set speaker. the detection of new users is essential.5. Thus.

94 5 Unsupervised Speech Controlled System with Long-Term Adaptation Speaker Front-End Speech ML Adaptation Transcription I II Fig. The speech recognizer can react quickly to speaker changes. The likelihood values p(xt |Θi ). One front-end is employed for speaker specific feature extraction. The speaker identity is used for codebook adaptation and for speaker specific feature extraction. • Speech recognition. NSp of each time instance are buffered and the speaker identity is estimated on an utterance-level. This results in an enhanced recognition rate for speech recognition and speaker identification as well. . The codebooks of the speech recognizer are initialized and adjusted based on the speaker identity and a transcription of the spoken phrase. Speech recognition delivers a transcription of the spoken phrases on an utterance-level. . For each recorded utterance the procedure above is applied. codebook selection is expected to be less confident compared to speaker identification on an utterance-level. the front-end discards the mean and energy modifications originating from the last utterance. a set of parallel speaker specific feature extractions may be realized to provide an optimal feature . i = 0. The state alignment of the current utterance can be used in the subsequent speaker adaptation to optimize the corresponding codebook. Both results are used for speaker adaptation to enhance future speaker identification and speech recognition. If the computational load is not a critical issue. 2010e].5 System architecture for joint speaker identification and speech recognition. . The parameter set which belongs to the identified speaker is used to process the next utterance. 5. • Speaker identification. • Speaker adaptation. Figure is taken from [Herbig et al. occurs. . Appropriate speaker specific codebooks are selected by a local speaker identification to achieve an optimal decoding of the recorded utterance. Speaker specific codebooks (I) are used to decode the spoken phrase and to estimate the speaker identity (II) in a single step. However..

The performance decrease which is caused by only one active feature normalization will be evaluated in the experiments.7 specifying speaker identification and front-end. 2. 64.3. First.4. Germany. Fig. standard techniques for speaker identification and speaker adaptation can be employed. About 3. 5. The USKCP was collected by TEMIC Speech Dialog Systems. The UBM was trained by the EM algorithm. Then the system architecture of the alternative system is described. this introduces an additional latency and overhead. The experiments of the preceding chapter will be discussed with respect to this reference. The feature vectors comprise 11 mean normalized MFCCs and delta features. It is intended to deliver insight into the speech recognition and speaker identification rates which can be achieved by such an approach.3 Reference Implementation In the preceding sections a unified approach for speaker identification and speech recognition was introduced.5. the specifications and the implementation of the speaker identification are given. 5. However.5 h speech data originating from 41 female and 36 male speakers of the USKCP1 database was incorporated into the UBM training. Ulm. In Sect.1 a reference implementation can be easily obtained. This speaker identification technique is used here without further modifications as a reference for the unified approach discussed. 128 or 256 Gaussian distributions with diagonal covariance matrices have been examined in the experiments.6 standard methods for speaker adaptation were introduced to initialize and continuously adjust speaker specific GMMs.1 Speaker Identification Speaker identification is subsequently realized by common GMMs purely representing speaker characteristics. 5. The 0 th cepstral coefficient is replaced by a normalized logarithmic energy estimate. Several UBMs comprising 32. For each speaker about 100 command 1 The USKCP is an internal speech database for in-car applications. 2.2 a standard technique for speaker identification based on GMMs purely optimized to capture speaker characteristics was described. Combined with the speaker specific speech recognizer of Sect. the complete utterance may be re-processed when a speaker change is detected. In Sect. Alternatively.6 shows the corresponding block diagram and has to be viewed as a detail of Fig. 5.2.3 Reference Implementation 95 normalization for each codebook. 5. Alternatively. . A speaker independent UBM is used as template for speaker specific GMMs.

. Each speaker model is continuously adapted as soon as an estimate of the speaker identity is available.46). NSp in parallel to speech recognition. speaker adaptation is based on the result of this speaker identification. 5.3.96 5 Unsupervised Speech Controlled System with Long-Term Adaptation Speaker 1 arg maxi p(x1:T |Θi ) Front-End xt Speaker 2 islow t Speaker 3 Fig. . The speaker model with the highest likelihood is selected on an utterance-level.26) To guarantee that always an appropriate codebook is employed for speech decoding. codebook selection is realized as discussed before. spelling and digit loops are available. The iid assumption is applied to simplify likelihood computation by accumulating the log-likelihood values of each time instance as described by (2. 5. Each speaker is represented by one GMM. u i (5.6 Block diagram of speaker identification exemplified for 3 speakers. The speaker identity is used for speaker adaptation. a new GMM is initialized by MAP adaptation according to (2. . . speaker specific feature normalization is controlled by speaker identification.2 System Architecture The reference system is realized by a parallel computation of speaker identification with GMMs on an utterance-level and the speaker specific speech . The accumulated log-likelihood is calculated based on the iid assumption. However. Additionally. When a new speaker is enrolled. and control utterances such as navigation commands. The log-likelihood log p (x1:Tu |Θi ) of each utterance x1:Tu is iteratively computed for all speakers i = 1. The language is US-English.85) based on the first two utterances. The speaker with the highest likelihood score is identified as the current speaker islow = arg max {log(p(x1:Tu |Θi ))} . Θi denotes the speaker specific parameter set. The training was performed only once and the resulting UBM has been used without any further modifications in the experiments.

. 4.3. 5.2. Adaptation accuracy is supported here by the moderate complexity of the applied GMMs. The work flow of the parallel processing can be summarized as follows [Herbig et al. Speaker identification is realized by an additional GMM for each speaker. Appropriate speaker specific codebooks are selected for the decoding of the recorded utterance as presented in Sect. GMMs are adjusted by MAP adaptation introduced in Sect. • Speaker identification. Codebook selection is implemented as discussed before. First.4 Evaluation In the following both realizations for speaker specific speech recognition combined with speaker identification are discussed. The setup is depicted in Fig. • Speaker adaptation. recognition previously described. 5. The codebooks of the speech recognizer and the corresponding GMMs are adapted according to this estimate. 5.7. GMM models and the corresponding codebooks are continuously updated given the guess of the speaker identity. the joint approach and . Feature vectors are extracted as discussed before. 5. The recorded speech signal is preprocessed to reduce background noises.2. Codebook adaptation is implemented by the speaker adaptation scheme described in Sect. A standard speaker identification technique is applied to estimate the speaker’s identity on an utterance-level.7 System architecture for parallel speaker identification and speaker specific speech recognition..4 Evaluation 97 Speaker Identification Front-End xt Speech Recognition GMM Adaptation Codebook Adaptation Transcription Fig. 2010b]. • Speech recognition.6. One GMM is trained for each speaker to represent speaker characteristics. 2.5. 2010b]: • Front-end. Figure is taken from [Herbig et al.

8.5. Then a realistic scenario for a limited group of enrolled speakers is discussed. The correct speaker identity is used here for speaker adaptation to investigate the identification accuracy depending on the training level of speaker specific codebooks.98 5 Unsupervised Speech Controlled System with Long-Term Adaptation then the reference implementation are considered for closed-set and open-set scenarios. 5. the influence of feature normalization on the performance of the system is considered. the identification rate drops significantly since unreliable ML estimates are used. The EV approach obviously loses for larger speaker sets. For evaluation. modeling accuracy is limited and the probability of false classifications rises in the long run compared to the combination of EV and ML estimates. However. 5. 4.4. the speaker identification accuracy of the joint approach is examined based on a supervised experiment characterized by optimum conditions. The test setup is structured as described in Fig. For each speaker 200 utterances are used randomly. III = [100. experiments are discussed which investigate speaker identification for closed sets. First. Speaker Identification Accuracy In a first step the performance of speaker identification is determined for the approach shown in Fig. For λ = 4 the best performance of speaker identification is achieved whereas accuracy decreases for higher values of λ. It is represented by four evaluation bands based on I = [1. 150[ and . The likelihood scores of all speakers are recorded. Finally.1(a). 50[. 5. Since each speaker is represented by only 10 parameters. II = [50. when λ tends towards zero. Even though error propagation is neglected. 100[. one can get a first overview of the identification accuracy of the joint approach discussed in the preceding sections. It comprises 4 sets each with 50 speakers in a randomized order. 5. The resulting speaker identification rate is plotted in Fig. several permutations are used to randomly build speaker pools of a predetermined size. especially during the initialization.1 Evaluation of Joint Speaker Identification and Speech Recognition In the following experiments the joint speaker identification and speech recognition is investigated. In Fig. Then the detection of unknown speakers is examined. The codebook of the target speaker is modified by speaker adaptation after each utterance as described in Chapter 4.9 the result is depicted in more detail. Closed-Set Speaker Identification First.

9 shows a consistent improvement of the identification rate for all setups as soon as sufficient adaptation data is accessible. According to Fig 5. λ = 8 ( ).3. λ = 16 (·). 2010d]. 200] utterances. The threshold λ has still an influence on adaptation accuracy even though 150 utterances have been accumulated. Unsupervised Closed-Set Identification of 5 enrolled Speakers In the following the closed-set scenario of a self-learning speaker identification combined with speech recognition and unsupervised speaker adaptation is investigated [Herbig et al. In summary. 5. Even in the last evaluation band a significant difference between λ = 4. speaker identification is generally affected by a higher number of enrolled speakers since the probability of confusions increases.3. the number of enrolled speakers NSp and training.9 0. λ = 4 ( ). .85 2 3 4 7 6 5 Number of enrolled speakers 8 9 10 Fig. 5. the enrollment phase which is characterized by the lowest identification rate is critical for long-term stability as shown subsequently.ML (λ ≈ 0) (+). The identification rate is a function of the tuning parameter λ. 4.95 0. The subsequent considerations continue the discussion of the example in Sect. Several speaker pools are investigated for different parameters of speaker adaptation .9 it can be stated that the identification accuracy rises with improved speaker adaptation as expected.. Speaker adaptation is supervised so that no maladaptation occurs. 1 Identification rate 0. The evolution of speaker specific codebooks versus their adaptation level is shown. However. a small parameter λ seems to be advantageous for an accurate speaker identification. However. Having analyzed the measured identification rates. λ = 12 ( ).5. especially for larger speaker sets. The implementation with λ = 4 clearly outperforms the remaining realizations. λ = 20 (◦) and EV (λ → ∞) (2). Each curve represents a predetermined speaker pool size. Fig. λ = 12 and λ = 20 can be observed.4 Evaluation 99 IV = [150.8 Performance of the self-learning speaker identification.

The two special cases ML (λ ≈ 0) and EV (λ → ∞) clearly fall behind the combination of both adaptation .3. Fig. 5. In addition. 8 ( ).96 0.94 0. 3 (2).3.98 Identification rate 100 150 Adaptation level [NA ] 200 0. The goal is now to determine what speech recognition and identification results can be achieved if this knowledge about the true speaker identity (ID) is not given. Each set comprises 5 enrolled speakers. the percentage of the utterances which are rejected by the speech recognizer is given. 6 (×). 200 (c) λ = 20. 4 (◦).3 has been repeated with unsupervised speaker identification.100 1 0.9 0. The resulting speech recognition and speaker identification rates are summarized in Table 5.9 0. 4.92 0. In Sect.98 Identification rate 0.9 0.2 ( ).3 the speaker identity was given for speaker adaptation leading to optimally trained speaker profiles of the speech recognizer. Only the first two utterances of a new speaker are indicated and then the current speaker has to be identified in a completely unsupervised manner. 5 (+).96 0.88 50 100 150 Adaptation level [NA ] (b) λ = 12. 4.9 Performance of the self-learning speaker identification versus adaptation stage.96 0. Several speaker pools are considered . The results show significant improvements of the WA with respect to the baseline and short-term adaptation.94 0. Closed-set speaker identification was evaluated on 60 sets.92 0.98 Identification rate 0.88 50 5 Unsupervised Speech Controlled System with Long-Term Adaptation 1 0. The experiment described in Sect. 7 (·). 1 0.88 50 100 150 Adaptation level [NA ] 200 (a) λ = 4.94 0.1. NA denotes the number of utterances used for speaker adaptation. 9 ( ) and 10 ( ) enrolled speakers listed in top-down order.92 0.

09 93.87 - 85.71 88.17 88.28 Table 5.1 Comparison of the different adaptation techniques for self-learning speaker identification.71 87.10 88.23 ±0.38 85.49 ±0.83 87.33 ±0.64 93.37 88.54 94.30 ±0.69 89.33 92.26 91.43 ±0.02 85.89 88.44 89. Speaker pools with 5 enrolled speakers are considered.57 ±0.18 88.20 87.51 89.51 86. 50[.46 90.23 86.99 88.04 95. 250] utterances.40 89. 100[.48 88.I = [1.06 2.02 85.58 ±0.09 2.51 87.07 93.22 90.11 2.2 Comparison of different adaptation techniques for self-learning speaker identification. III = [100.5.54 88.57 89.47 ±0.89 87.56 89.11 2.10 2.22 85. Speaker adaptation Baseline Short-term adaptation EV-MAP adaptation ML (λ ≈ 0) λ=4 λ=8 λ = 12 λ = 16 λ = 20 EV (λ → ∞) MAP adaptation η=4 Typical errors min max Rejected [%] WA [%] Speaker ID [%] 2.77 87.08 ±0.35 83.25 81. Speaker pools with 5 enrolled speakers are considered.16 88.13 86.49 92.31 91.38 87.15 92.33 91.42 92.75 ±0.68 84.51 89.38 ±0.49 78.56 84.19 ±0.67 87.35 ..09 2.79 87. Speaker adaptation Baseline Short-term adaptation EV-MAP adaptation ML (λ ≈ 0) λ=4 λ=8 λ = 12 λ = 16 λ = 20 EV (λ → ∞) Typical errors min max I II III WA [%] ID [%] WA [%] ID [%] WA [%] ID [%] 84.18 93.67 87. 2010b]. II = [50.53 ±0.45 79.4 Evaluation 101 Table 5.93 89.65 87.11 94.94 87.21 84.42 92.16 ±0.43 ±0.84 85. Table is taken from [Herbig et al.10 2. Speaker identification and speech recognition results are given for several evaluation bands .54 ±0.61 87.13 88.

Potential for Further Improvements Finally. Before an utterance is processed. Additionally. no eminent difference in WA can be observed for 4 ≤ λ ≤ 20 [Herbig et al. Furthermore.3. the influence of speaker specific feature extraction on speaker identification and speech recognition is quantified for the scenario discussed before. .. speaker characteristics can be captured despite limited adaptation data and the risk of error propagation. It becomes obvious that speaker identification can be optimized independently from the speech recognizer and seems to reach an optimum of 94. Again. In Fig. This finding also agrees with the considerations concerning the identification accuracy sketched above. A comparison of the evaluation bands shows a consistent behavior for all realizations. 2010d]. Obviously.10 the speech recognition rates achieved by the joint speaker identification and speech recognition are graphically compared with the supervised experiment in Sect. the speaker identification rates are depicted. no eminent differences in the WA can be stated but a significant decrease in the speaker identification rate can be observed for increasing λ. Speaker identification. speech recognition and speaker adaptation remain unsupervised. 2010d]. It can be stated that this first implementation is already close to the upper bound given by the supervised experiment and is significantly better than the baseline. techniques.. The experiment has been repeated with supervised energy and mean normalization. For higher values the identification rates drop significantly. Fig. 4.10 Comparison of supervised speaker adaptation with predefined speaker identity (black) and the joint speaker identification and speech recognition (dark gray) described in this chapter.64 % for λ = 4. Figure is taken from [Herbig et al.102 90 89 WA [%] 88 87 86 85 5 Unsupervised Speech Controlled System with Long-Term Adaptation 100 96 92 88 84 80 BL ML 4 8 12 16 20 EV ST λ ID [%] ML 4 8 12 λ 16 20 EV (a) Results for speech recognition. the correct parameter set is loaded and continuously adapted. 5. (b) Results for speaker identification. 5. The speaker independent baseline (BL) and shortterm adaptation (ST) are depicted for comparison.

Again.5. The detection rate of unknown speakers is plotted versus false alarm rate.25 97.05 2.23 86. a clear improvement for speaker identification can be expected when feature extraction is operated in parallel. A. WA can be increased when the feature vectors are accurately normalized.56 88. an unknown speaker is hypothesized. The parameters are continuously adapted.. unknown speakers have to be automatically detected. Unknown speakers are detected by the threshold decision in (5.22 95. it becomes evident that feature normalization influences speaker identification significantly. Even though this is an optimal scenario concerning feature extraction.08 85.55 88.06 2.09 2. To achieve a speech controlled system which can be operated in a completely unsupervised manner.23 ±0.12 ±0. Open-Set Speaker Identification In the preceding section speaker identification was evaluated for a closed-set scenario.13 88. Speaker adaptation is supervised so that maladaptation with respect to the speaker identity is neglected. If this threshold is not exceeded by any speaker specific codebook.88 95.75 ±0. 2010a]. speaker specific codebooks are evaluated as discussed before.3 Results for joint speaker identification and speech recognition with supervised speaker specific feature extraction.15 When Table 5.47 88.60 88.3 is compared with Table 5.4 Evaluation 103 Table 5.33 96. This enables the initialization of new codebooks which can be adapted to new users.3. The feature vectors are normalized based on the parameter set of the target speaker. In Fig.1.66 96.25). .11 an open-set scenario is considered [Herbig et al. 5. Speaker adaptation Baseline Short-term adaptation EV-MAP adaptation λ=4 λ=8 λ = 12 λ = 16 λ = 20 Typical errors min max Rejected [%] WA [%] Speaker ID [%] 2. Furthermore.11 2.61 ±0. The performance of the in-set / out-of-set classification is evaluated by the so-called Receiver Operator Characteristics (ROC) described in Sect.

2010a]. 0 0. Several evaluation bands are shown . For NA ≤ 20 both error rates are unacceptably high even if only two enrolled speakers are considered.8 0.8 0.6 0.4 False alarm rate 0.NA ≤ 20 (◦). unknown speakers can be detected by applying a simple threshold to the log-likelihood ratios of the speaker specific codebooks and the standard codebook.6 0.5 0.2 0.3 0.6 0. Figure is taken from [Herbig et al. 20 < NA ≤ 50 ( ).1 0.5 0.4 False alarm rate 0.6 (a) 2 enrolled speakers.1 0. 2010a]. In summary.11 Detection of unknown speakers based on log-likelihood ratios of the speaker specific codebooks and standard codebook. Maladaptation with respect to the speaker identity is neglected. The ROC curves in Fig. 1 0.7 0.5 0.2 0..7 0.4 5 Unsupervised Speech Controlled System with Long-Term Adaptation 1 0.6 (c) 10 enrolled speakers.4 0 0.3 0. 5. In a practical implementation this problem would be even more complicated when mixed codebooks are used or when several codebooks belong to . 5.9 Detection rate 0 0.4 False alarm rate 0. However.104 1 0.1 0. Confidence intervals are given by a gray shading.11 show that open-set speaker identification is difficult when speaker models are trained on only a few utterances.8 0.5 0.4 (b) 5 enrolled speakers. For 5 or 10 enrolled speakers the detection rate of unknown speakers is even worse.9 Detection rate 0. Fig. Speaker adaptation (λ = 4) is supervised.3 0. At least 50 utterances are necessary for adaptation to achieve a false alarm and miss rate less than 10 %. 50 < NA ≤ 100 (2) and 100 < NA ≤ 200 (+)..7 0.9 Detection rate 0.5 0.2 0.5 0.6 0. the detection rates achieved in the experiments show that it is difficult for such a system to detect new users based on only one utterance as expected. A global threshold seems to be inappropriate since the training of each speaker model should be taken into consideration [Herbig et al.

3. a more robust technique is required for a truly unsupervised out-of-set detection. The subsequent considerations continue the discussion of the example in Sect. Maladaptation due to speaker identification errors is neglected. GMMs containing a high number of Gaussian distributions show the best performance probably because a more individual adjustment is possible. 5. If not indicated otherwise. 4.4 Evaluation 105 one speaker. mean vector adaptation is performed to enhance speaker specific GMMs. Speaker identification and GMM training are realized as described in Sect.8. especially when 10 speakers are enrolled.4. 5. However. Several settings of the reference implementation are compared for the closedset and open-set identification task [Herbig et al.1. 5.8 reveals that the codebooks of the speech recognizer cannot be expected to model speaker characteristics as accurately as GMMs purely optimized for speaker identification. The further experiments will show the deficiencies of the reference implementation to accurately track the current speaker.4. A comparison with Fig. Unsupervised Speaker Identification of 5 enrolled Speakers A realistic closed-set application for self-learning speaker identification combined with speech recognition and unsupervised speaker adaptation is investigated. It should be emphasized that this consideration is based on a supervised adaptation.12 the averaged accuracy is compared for dedicated realizations in analogy to Fig. Then a realistic closed-set scenario is considered for a limited group of enrolled speakers. Closed-Set Speaker Identification First. 5. In Fig. in terms of performance all realizations yield good results. . Speaker Identification Accuracy The performance of speaker identification is determined for the approach shown in Fig. Closed-set speaker identification is evaluated on 60 sets.5.b]. Thus. The correct speaker identity is used here for speaker adaptation to describe the identification accuracy of speaker specific GMMs. 5.3 and Sect. Each set comprises 5 enrolled speakers.3. the speaker identification accuracy of the reference implementation is examined based on a supervised experiment characterized by optimum conditions. 5.. It is hard to imagine that a reasonable implementation of an unsupervised speech controlled system can be realized with these detection rates. 5.2 Evaluation of the Reference Implementation In the following the performance of the reference implementation is examined.7. 2010a.

06 89. For comparison.09 88.87 84.96 ±0.97 87.24 88.92 87.23 ±0.82 80.94 87. 64 (·).90 81.106 5 Unsupervised Speech Controlled System with Long-Term Adaptation 1 0.97 85.28 87.50 87.23 ±0.59 ±0.98 0.23 ±0.68 87. In Table 5.24 ±0.98 87.18 87.17 88.48 ±0. only a reduced number of distributions can be efficiently estimated at the Table 5.01 88.23 ±0.12 Performance of the reference implementation. MAP adaptation of mean vectors is used.96 0. For all implementations maladaptation is not considered.23 ±0. 128 and 256 Gaussian distributions.23 ±0. MAP N 32 64 128 256 Typical errors min max η=4 η=8 η = 12 η = 20 WA [%] ID [%] WA [%] ID [%] WA [%] ID [%] WA [%] ID [%] 88.98 87. MAP adaptation (η = 4) is only applied to mean vectors. 2010b]. the codebooks (×) of the speech recognizer are adjusted using λ = 4.25 87.97 0. 64.23 ±0.92 87. 5. For higher values of η this optimum is shifted towards a lower number of Gaussian distributions as expected.97 87.22 ±0.06 88.92 85. Confidence intervals are given by a gray shading..29 87.31 .09 87.4 the results of this scenario are presented for several implementations with respect to the number of Gaussian distributions and values of parameter η.23 ±0.64 87.20 ±0. Codebook adaptation uses λ = 12. Both the speaker identification and speech recognition rate reach an optimum for η = 4 and N = 64 or 128. 128 (2) or 256 ( ) Gaussian distributions.94 2 3 4 5 6 7 Number of enrolled speakers 8 9 10 Fig. Speaker identification is implemented by GMMs comprising 32 (◦). Since the learning rate of the adaptation algorithm is reduced.73 76.99 Identification rate 0.21 ±0.13 91.4 Realization of parallel speaker identification and speech recognition. Speaker identification is implemented by several GMMs comprising 32.30 87.95 0.64 88.24 ±0.04 91. Table is taken from [Herbig et al.

2010b].39 87.01 ±0.03 89.25 max ±0..59 88. 250] utterances.11 90. Codebook adaptation uses λ = 12.5 contains the corresponding detailed results of speaker identification and speech recognition obtained on different evaluation bands for η = 4.03 88.23 ±0.23 ±0.17 88.61 88.20 ±0.58 ±0.24 ±0.5 similar results for speech recognition can be observed.24 88.53 ±0.23 ±0.04 87. 2010b].93 86. The highest identification rates of 91.41 86. Comparing Table 5.02 92. 64.23 ±0.5.91 84.13 91.54 ±0.88 82.84 128 87.97 87.6 Realization of parallel speaker identification and speech recognition.23 ±0.22 85.24 87.02 Short-term 85.99 88.09 91.30 ±0..14 89.06 89.09 % have been achieved with 64 and 128 distributions.51 86.88 ±0.4 Evaluation 107 Table 5.87 adaptation Number of Gaussian distributions N 32 87. MAP adaptation I II III η=4 WA [%] ID [%] WA [%] ID [%] WA [%] ID [%] Baseline 84.53 88.62 ±0.46 64 87. Speaker identification is implemented by several GMMs comprising 32.5 Detailed investigation of the WA and identification rate on several evaluation bands.95 Typical errors min ±0.80 88.31 85.64 87.10 91.73 89.97 88. It can be concluded that 32 Gaussian distributions are not sufficient whereas more than 128 Gaussian distributions seems to be oversized.57 ±0.38 88. 128 or 256 Gaussian distributions. 50[.89 89. II = [50.25 88. Table 5.I = [1.23 ±0.52 ±0. MAP N 32 64 128 256 Typical errors min max η=4 η=8 η = 12 η = 20 WA [%] ID [%] WA [%] ID [%] WA [%] ID [%] WA [%] ID [%] 87.49 88. III = [100.50 ±0.92 87.02 86. MAP adaptation of weights and mean vectors is used.13 88.22 ±0.42 87.23 ±0.23 ±0.29 Table 5.23 87. The performance of the speech recognizer is marginally reduced with higher η [Herbig et al. . Table is taken from [Herbig et al.12 256 87.33 ±0.45 ±0.71 ±0.02 87.10 88. MAP adaptation of mean vectors is used.2 and Table 5.27 beginning.03 88.03 90.97 88.26 87. 100[. Several evaluation bands are considered .88 88.23 ±0.89 87.84 85.11 91.53 ±0.24 88.86 87.32 88.76 89. Codebook adaptation uses λ = 12.80 88.65 87.18 % and 91.

MAP adaptation (η = 4) of mean vectors and weights (black) and only mean vectors (dark gray) are depicted..108 90 89 WA [%] 5 Unsupervised Speech Controlled System with Long-Term Adaptation 90 89 WA [%] 88 87 86 85 88 87 86 85 32 64 N 128 256 BL ML 4 8 12 16 20 EV ST λ (a) Speech recognition rates of the reference implementation. 5. 2010b]. (b) Speech recognition rates of the unified approach. Now a steady improvement and an optimum . 100 96 92 88 84 80 100 96 92 88 84 80 ID [%] 32 64 N 128 256 ID [%] ML 4 8 12 λ 16 20 EV (a) Speaker identification rates of the reference implementation. The results are summarized in Table 5. Results are shown for speaker adaptation with predefined speaker identity (black) as well as for joint speaker identification and speech recognition (dark gray). Fig. Fig.14 Comparison of the reference implementation (left) and the joint speaker identification and speech recognition (right) with respect to speaker identification.6. 5. 2010b]. MAP adaptation (η = 4) of mean vectors and weights (black) and only mean vectors (dark gray) are depicted. (b) Speaker identification rates of the joint speaker identification and speech recognition. Figure is taken from [Herbig et al. For the next experiment not only mean vectors but also weights are modified. For N = 256 the identification rate dropped significantly.13 Comparison of the reference implementation (left) and the joint speaker identification and speech recognition (right) with respect to speech recognition. In the preceding experiment the speaker identification accuracy could be improved for η = 4 by increasing the number of Gaussian distributions to N = 128. The speaker independent baseline (BL) and shortterm adaptation (ST) are shown for comparison. Figure is taken from [Herbig et al..

3 0. However.5 %. 0 0. 2010b]. Speaker identification is realized by GMMs with 64 Gaussian distributions.15 Detection of unknown speakers based on log-likelihood ratios of the speaker specific GMMs and UBM.7 0.9 Detection rate 0.6 (c) 10 enrolled speakers.4 False alarm rate 0.6 0.4 False alarm rate 0. This observation supports again the finding that the speech recognition accuracy is relatively insensitive with respect to moderate error rates .4 False alarm rate 0. Both mean vector and weight adaptation are depicted for η = 4 representing the best speech recognition and speaker identification rates obtained by GMMs [Herbig et al.6 0.4 109 0 0.5 0.2 0. η = 4 is used to adapt the mean vectors.6 (a) 2 enrolled speakers. The relative error rate reduction which is achieved by increasing the number of Gaussian distributions from 128 to 256 is about 3. For η = 4 doubling the number of Gaussian distributions from 32 to 64 results in 26 % relative error rate reduction.3 0.13 and Fig.9 Detection rate 0.4 Evaluation 1 0. 50 < NA ≤ 100 (2) and 100 < NA ≤ 200 (+)..9 0.5 0.1 % WA [Herbig et al. Finally.1 0. 5.4 (b) 5 enrolled speakers..5 0. the results of the unified approach characterized by an integrated speaker identification and the reference implementation are compared in Fig. the identification rate approaches a limit.6 0. Maladaptation is neglected. Fig.1 0.5 0.3 0. 2010b].2 0.7 0.2 0.5 0.4 Detection rate 1 0.6 0 0. 5.62 % can be observed for N = 256. 1 0.14.8 0.8 0. 20 < NA ≤ 50 ( ).8 0.5.5 0. Confidence intervals are given by a gray shading. In summary. The optimum for speech recognition is again about 88.7 0. of 91. similar results for speech recognition are achieved but identification rates are significantly worse compared to the experiments discussed before. Several evaluation bands are investigated . 5.1 0.NA ≤ 20 (◦).

6 0. Figure is taken from [Herbig et al.4 False alarm rate 0.9 0.5 0.1 0.7 0. Open-Set Speaker Identification In the following the open-set scenario depicted in Fig.4 Detection rate 0 0.4 False alarm rate 0.6 (c) 10 enrolled speakers.5 0. The ROC curves show again the limitations of open-set speaker identification to detect unknown speakers.8 0. 2010a].110 5 Unsupervised Speech Controlled System with Long-Term Adaptation 1 0. Thus.8 0. Speaker identification is realized by GMMs with 256 Gaussian distributions. Several evaluation bands are investigated . The corresponding ROC curves are depicted in Fig.NA ≤ 20 (◦). especially when speaker models are trained .11 is examined for GMMs purely optimized to identify speakers [Herbig et al.4 (b) 5 enrolled speakers. The log-likelihood scores are normalized by the length of the utterance.3 0.6 0. 5.49).9 Detection rate 0. Maladaptation is neglected.1 0.3 0.15.4 0 0. 2010a]. η = 4 is used for adapting the mean vectors. 0 0.5 0.2 0.16 Detection of unknown speakers based on log-likelihood ratios of the speaker specific GMMs and UBM.2 0.7 0. Confidence intervals are given by a gray shading..7 0.2 0. 20 < NA ≤ 50 ( ). 50 < NA ≤ 100 (2) and 100 < NA ≤ 200 (+). 5.6 (a) 2 enrolled speakers.1 0. of speaker identification. Unknown speakers are detected by a threshold decision based on log-likelihood ratios of speaker specific GMMs and UBM similar to (2.9 Detection rate 0. Fig. 5.5 0.5 0.8 0..6 0.5 0. 1 0.4 False alarm rate 0.3 0. different strategies can be applied to identify speakers without affecting the performance of the speech recognizer as long as a robust codebook selection is employed in speech decoding.6 1 0.

It becomes obvious that these error rates are unacceptably high for a direct implementation in an unsupervised complete system. Both implementations do not modify the basic architecture of the speech recognizer. It becomes obvious that the accuracy of all reference implementations examined here is inferior to the codebook based approach. Then an alternative system comprising a standard technique for speaker identification was introduced. the detection accuracy approaches a limit when a higher number of Gaussian distributions is used. 64 ( ). Compared to Fig.3 0. Speaker specific codebooks (solid line) are compared to GMMs comprising 32 (◦). 5.16 the same experiment is repeated with GMMs comprising 256 Gaussian distributions. Only the case of extensively trained speaker models (100 < NA < 200) is examined.5.17 Detection of unknown speakers. In contrast to the former implementation.11 the detection rates are significantly worse. 128 (2) and 256 (+) Gaussian densities for 100 < NA ≤ 200.5 Summary and Discussion Two approaches have been developed to solve the problem of an unsupervised system comprising self-learning speaker identification and speaker specific speech recognition. Figure is taken from [Herbig et al. However. on only a few utterances. Speaker identification and speech recognition use an identical front-end so that a parallel feature extraction for speech recognition and .9 Detection rate 0. MAP adaptation (η = 4) is employed to adjust the mean vectors. the joint approach of speaker identification and speech recognition still yields significantly better results. In Fig.8 0.4 Fig. First. 5. 5. 5.. To compare the influence of the number of Gaussian densities on the detection accuracy. the performance is clearly improved. a unified approach for simultaneous speaker identification and speech recognition was described. However. 5.5 Summary and Discussion 111 1 0. 2010a]. Both systems have several advantages and drawbacks which should be discussed. a clear improvement can be observed.2 False alarm rate 0.1 0.17. When the number of Gaussian distributions is increased from 32 to 64.7 0. all reference implementations and a specific realization with speaker specific codebooks are shown in Fig.6 0 0.

112

5 Unsupervised Speech Controlled System with Long-Term Adaptation

speaker identification is avoided. Computation on different time scales was introduced. A trade-off between a fast but less accurate speaker identification for speech recognition and a delayed but more confident speaker identification for speaker adaptation has been developed: Speaker specific speech recognition is realized by an on-line codebook selection. On an utterance level the speaker identity is estimated in parallel to speech recognition. Multiple recognitions are not required. Identifying the current user enables a speech recognizer to create and continuously adapt speaker specific codebooks. This enables a higher recognition accuracy in the long run. 94.64 % speaker identification rate and 88.20 % WA were achieved by the unified approach for λ = 4 and λ = 20, respectively. The results for the baseline and the corresponding upper bound were 85.23 % and 88.90 % WA according to Table 4.1 [Herbig et al., 2010b]. For the reference system several GMMs are required for speaker identification in addition to the HMMs of the speech recognizer. Complexity therefore increases since both models have to be evaluated and adapted. Under perfect conditions higher speaker identification accuracies could be achieved with the reference implementation in the experiments carried out. Under realistic conditions a speaker identification rate of 91.18 % was achieved for 128 Gaussian distributions and η = 4 when only the mean vectors were adapted. The best speech recognition result of 88.13 % WA was obtained for 64 Gaussian distributions. By adapting both the mean vectors and weights, the speaker identification rate could be increased to 91.62 % for 256 Gaussian distributions and η = 4. The WA remained at the same level [Herbig et al., 2010b]. However, the detection rates of unknown speakers were significantly worse compared to the unified approach. Especially for an unsupervised system this out-of-set detection is essential to guarantee long-term stability and to enhance speech recognition. Thus, only the unified approach is considered in the following. The major drawback of both implementations becomes evident by the problem of unknown speakers which is not satisfactorily solved. This seems to be the most challenging task of a completely unsupervised speech recognizer, especially for a command and control application in an adverse environment. Therefore, further effort is necessary to extend the system presented in this chapter. Another important issue is the problem of weakly, moderately and extensively trained speaker models. Especially during the initialization on the first few utterances of a new speaker, this approach bears the risk of severe error propagation. It seems to be difficult for weakly adapted speaker models to compete with well trained speaker models since adaptation always tracks speaker characteristics and residual influences of the acoustic environment. As discussed so far, both algorithms do not appropriately account for the training status of each speaker model except for codebook and GMM adaptation. The evolution of the system is not appropriately handled. Likelihood is expected to be biased.

5.5 Summary and Discussion

113

Furthermore, both realizations do not provide any confidence measure concerning the estimated speaker identity. Utterances with uncertain origin cannot be rejected. Long-term stability might be affected because of maladaptation. As discussed so far, the integration of additional information sources concerning speaker changes or identity are not addressed. For example, a beamformer, the elapsed time between two utterances or the completion of speech dialog steps may contribute to a more robust speaker change detection. In summary, the joint approach depicted in Fig. 5.5 constitutes a complete system characterized by a unified statistical modeling, moderate complexity and a mutual benefit between speech and speaker modeling. It has some deficiencies due to the discrete decision of the speaker identity on an utterance-level, lack of confidence measures and therefore makes the detection of unknown speakers more complicated. In the next chapter an extension is introduced which alleviates the drawbacks described above. A solution suitable for open-set scenarios will be presented.

an overview of the advanced architecture of the speech controlled system is presented. The evolution of the self-adaptive system is taken into consideration. Herbig. Long-term speaker identification is investigated to track different speakers across several utterances. This technique is based on a unified modeling of speech and speaker related information. and W. Especially for embedded systems in adverse environments. The latter became obvious due to the evaluation results and the improvements achieved. Together this results in a more confident guess of the current speaker’s identity. The discussion in this chapter goes beyond that. However. The motivation and goals of the new approach are discussed. 6.com .1 Motivation In Chapter 5 a unified approach was introduced and evaluated on speech data recorded in an automotive environment.6 Evolution of an Adaptive Unsupervised Speech Controlled System In the preceding chapter a first solution and a reference implementation for combined speaker identification and speaker specific speech recognition were presented and evaluated as complete systems. 115–143. Minker: Self-Learning Speaker Identification. c Springer-Verlag Berlin Heidelberg 2011 springerlink. This technique can be completed by the detection of unknown speakers to obtain a strictly unsupervised solution. Long-term stability and the detection of unknown speakers can be significantly improved. Gerl. The problem of speaker variability is solved on different time scales. pp. F. Finally. SCT. Combined with long-term speaker tracking on a larger time scale a flexible and robust speaker identification scheme may be achieved. T. some deficiencies discussed in the preceding section motivate to extend this first solution. Posterior probabilities are derived which reflect both the likelihood and adaptation stage of each speaker model. The evaluation results are given at the end of this chapter. such an approach seems to be advantageous since a compact representation and fast retrieval of speaker characteristics for enhanced speech recognition are favorable. Speaker identification as discussed so far has been treated on an utterance level based on a likelihood criterion.

Thus. a beamformer. Newly initialized codebooks are less individually adapted compared to extensively trained speaker models. the evaluation of each codebook in (5. In addition. As discussed so far. unknown speakers appear to be a more challenging problem. the integrated speaker identification does not enable a robust detection of unknown speakers. Therefore. As discussed before. to support speaker tracking. the training status of speaker specific codebooks is considered. the problem of speaker models at highly different training levels should be addressed in a self-learning speech controlled system. a posterior probability is introduced based on a single utterance by taking this likelihood evolution into consideration. In addition. are expected to be more sophisticated than decisions based on likelihoods. The problem of speaker tracking is now addressed on a long-term time scale. it may be advantageous to integrate additional devices. especially if it is operated in an unsupervised manner. Error propagation is expected since weakly trained speaker models are characterized by a higher error rate with respect to the in-set / out-of-set detection. The goal is to buffer a limited number of utterances in an efficient manner. This degree of individualism results in an improved match between statistical model and observed feature vectors.23) does not distinguish between marginally and extensively adapted speaker models. A robust estimate of the speaker identity can be given by the MAP criterion. A confidence measure is automatically obtained since the information about the speaker’s identity is kept as a probability instead of discrete ML estimates. Obviously. The strengths of the unified approach shall be kept but its deficiencies shall be removed or at least significantly moderated. On .g. e.116 6 Evolution of an Adaptive Unsupervised Speech Controlled System However. Unknown speakers can be detected as a by-product of this technique. e. 6. posterior probabilities.2 Posterior Probability Depending on the Training Level In this section the effect of different training or adaptation levels on likelihood values is investigated for a closed-set scenario. Long-term speaker adaptation is characterized by a high number of parameters which allow a highly individual adjustment. Then long-term speaker tracking is employed to extend this posterior probability to a series of successive utterances. Even though this aspect goes beyond the scope of this book. In a first step. Confidence measures.g. the following considerations can be easily extended as will be shown later. it is very difficult to detect new users only on a single utterance. Finally. A path search algorithm is applied to find the optimal assignment to the enrolled speakers or an unknown speaker. the likelihood is expected to converge to higher values for extensively adapted codebooks. the solution presented in the preceding chapter shall be extended.

radio and navigation.5 %. The order of utterances and speakers was also randomized. In total. speaker adaptation was supervised in the sense that the speaker identity was given to prevent any maladaptation. Subsequently.. For each speaker a training set of 200 utterances was separated. The codebook of the target speaker was continuously adapted by combining EV and ML estimates as described by (4. The goal is to employ posterior probabilities instead of likelihoods which integrate the log-likelihood scores Li and the training status of each speaker u model [Herbig et al. 6. The latter has to be performed only once as described subsequently. The language is US-English. Ulm. 000 utterances for target and non-target speakers were investigated in a closed-set scenario. 2010c]. concerning speaker changes the speakers were organized in 20 sets containing 25 enrolled speakers. Users were asked to speak short command and control utterances such as digits and spelling loops or operated applications for hands-free telephony. Thus. it is proposed to consider likelihood values as random variables with respect to the adaptation status and to learn the corresponding probability density function in a training. A statistical description is targeted which does not depend on a specific speaker set. 2010e]. e. Hence. The USKCP comprises command and control utterances for in-car applications such as navigation commands. log-likelihood and training level.2 Posterior Probability Depending on the Training Level 117 average a smaller ratio of the speaker specific log-likelihoods Li and the logu likelihood of the standard codebook L0 is expected compared to extensively u trained models since all codebooks evolve from the standard codebook [Herbig et al. The standard codebook as the origin of all speaker specific codebooks is viewed as a reasonable reference. In order to have a realistic profile of internal parameters.g. The speaker change rate from one utterance to the next was set to about 7. spelling and digit loops.. At least 5 utterances were spoken between two speaker turns. All speakers of this subset are subsequently considered to describe the likelihood evolution only depending on the number of utterances used for adaptation NA .6.1 Statistical Modeling of the Likelihood Evolution For a statistical evaluation of the likelihood distribution which is achieved in an unsupervised speech controlled system the scores of 100.2. The composition was randomly chosen and needed not to be gender balanced. For each speaker approximately 300 − 400 utterances were recorded under different driving and noise conditions. 197 native speakers of the USKCP1 development database were employed. 1 The USKCP is a speech database internally collected by TEMIC Speech Dialog Systems. .10).23). log-likelihood ratios with respect to the standard codebook are considered for target and non-target speakers. The speaker specific codebooks were evaluated in parallel according to (5. Speaker dependencies of this evolution are not modeled for reasons of flexibility. Germany.

The central limit theorem predicts a Gaussian distribution of a random variable if realized by a sum of an infinite number of independent random variables [Bronstein et al.11) is increased from Fig.5 reveal a remarkable behavior of the speaker specific codebooks which should be discussed in more detail. In Fig. Asymmetric distributions. σ }. μ ¨ To estimate the parameter set {μ. In Figs. The procedure only differs in the fact that speaker model and current speaker do not match [Herbig et al. • Non-target speakers. σ. 6. as found by Bennett [2003]. 6. σ } of the μ ¨ non-target speaker models can be calculated in a similar way. 2004]. All log-likelihoods Li measured within a certain u adaptation stage NA are employed to estimate the mean and standard deviation of the pooled log-likelihood ratio scores. In consequence univariate symmetric Gaussian distributions are used to represent the log-likelihood ratios. 6. e. (b) Target speaker. 6. In summary. Fig. The threshold λ for combining EV and ML estimates in (4.3 . The parameters {¨.6. λ = 4 and NA = 5 are employed.. are not considered for reasons of simplicity. 6. Subsequently.6. The codebook belongs to a non-target speaker profile. Different values of the parameter λ were employed in speaker adaptation. it is assumed that this condition is approximately fulfilled since Li is an averaged logu likelihood.118 6 Evolution of an Adaptive Unsupervised Speech Controlled System 90 80 70 60 50 40 30 20 10 0 -4 -3 -2 -1 1 0 Li − L0 u u 2 3 4 5 200 180 160 140 120 100 80 60 40 20 0 -3 -2 -1 0 1 2 Li − L0 u u 3 4 5 (a) Non-target speaker. μ.3 . The following two cases are considered: • Target speaker. The codebook under investigation belongs to the target speaker. the Figs. This mismatch situation is described by the parameter set {¨. 2010e]. σ} several adaptation stages comprising ˙ ˙ 5 utterances are defined.1 and Fig. 6. σ } are shown for certain adapta˙ ¨ ˙ ¨ tion stages NA . σ} comprising ˙ ˙ mean and standard deviation. 2000. Zeidler.5 the parameters {μ.g.1 Histogram of the log-likelihood ratios Li − L0 for non-target speakers u u and the target speaker. This match is described by the parameter set {μ.2 both the histogram and the approximation by univariate Gaussian distributions are shown for some examples.3 to ..

3 0. At first glance the log-likelihood values behave as expected.3 Log-likelihood ratios of the target (solid line) and non-target speaker models (dashed line) for different training levels.5 1 0 μ.6 0. 2010e]. (b) Target speaker.5 pure EV estimates are employed. Log-likelihood ratios increase with continued speaker adaptation. Fig.5 0. σ }. the performance of the modified codebooks is significantly worse compared to the standard codebook.5 0. Even with only a few utterances an improvement can be achieved on average.8 0.8 0.5. 2 1 1. 6. μ ˙ ¨ -1 0. λ = 4 is applied for codebook adaptation. μ}. The correct assignment of an utterance to the corresponding codebook yields a clear improvement with respect to the standard codebook..4 0.7 0.3.1 0 -3 -2 -1 0 Li − L0 u u 1 2 3 0.2 Distribution of log-likelihood ratios for dedicated adaptation stages NA = 10 (2). . 6. σ ˙ ¨ 101 Adaptation level [NA ] 102 101 Adaptation level [NA ] 102 (a) Mean values {μ.6 0. ˙ ¨ (b) Standard deviations {σ.5 -2 -3 0 σ. The convex combination in (4.7 0. ˙ ¨ Fig. In Fig. Figure is taken from [Herbig et al.10) are substituted by unreliable ML estimates which converge slowly.6.10) and (4. NA = 50 (◦) and NA = 100 ( ). 6. NA = 20 ( ). During the first 20 utterances. 6.2 Posterior Probability Depending on the Training Level 0. However.11) uses only ML estimates (λ → 0).3 0. one would not employ the scenario depicted in Fig. Only ML estimates are applied whereas in Fig. 6.3 this constant is almost zero. The prior knowledge incorporated into the EV estimates enables fast speaker adaptation based on only a few adaptation parameters. Fig. 6.1 0 -3 -2 -1 0 Li − L0 u u 1 2 3 119 (a) Non-target speaker.2 0. The initial mean vectors in (4.2 0.4 0.

The convex combination in (4.5 0 -0.5 6 Evolution of an Adaptive Unsupervised Speech Controlled System 1 0.5 the other extreme case is depicted.5 0 -0.8 1 μ. σ }. σ }.2 -1 101 Adaptation level [NA ] 102 0 μ. the discrimination capability of speaker specific codebooks becomes obvious. ˙ ¨ (b) Standard deviations {σ.4 0. ˙ ¨ Fig.5 Log-likelihood ratios of the target (solid line) and non-target speaker models (dashed line) for different training levels. The mismatch between statistical model and actual speaker characteristics is increasingly dominant. 2 1.6 (a) Mean values {μ. ˙ ¨ (b) Standard deviations {σ. ML estimates are consequently neglected and only EV adaptation is applied. μ ˙ ¨ 0. This shows the restriction of this implementation of the EV approach since only 10 parameters can be estimated. 6.4 0.2 -1 -1.4 Log-likelihood ratios of the target (solid line) and non-target speaker models (dashed line) for different training levels.5 -1. μ}. ˙ ¨ Fig. the extent of this drift is not comparable to the likelihood evolution observed for the target speaker.10) and (4. The convex combination in (4. In Fig.11) only uses EV estimates (λ → ∞). σ ˙ ¨ 0. σ ˙ ¨ 0.8 1 0.5 0.. When their codebooks are continuously adapted. However.e].5 1 0.5 101 Adaptation level [NA ] 102 (a) Mean values {μ.10) and (4. . On average codebooks of non-target speakers perform worse than the standard codebook. μ}.6 σ. Figure is taken from [Herbig et al.120 2 1. Fast adaptation is achieved on the first utterances but the log-likelihood scores settle after some 20 − 30 utterances. This finding agrees with the reasoning of Botterweck [2001].5 101 Adaptation level [NA ] 102 0 101 Adaptation level [NA ] 102 0. 6. μ ˙ ¨ σ. 6.11) uses λ = 4. 2010c.

6. 4. MAP adaptation with η = 4 (2). EV (◦) and ML ( ) adaptation. At a second glance these distributions reveal some anomalies and conflicts.5 -1 μ ˙ μ ¨ 0 -0.6 Evolution of log-likelihood ratios for different speaker adaptation schemes . In Fig.2 by the corresponding univariate Gaussian distributions. Larger standard deviations can be observed for continuously adapted codebooks. Those errors are usually not detected by speaker identification techniques when only ML estimates are employed.3 . Fig. The investigation reveals a large overlap of both distributions. The combination of short-term and long-term speaker adaptation discussed in Sect. As soon as about 20 utterances are assigned to the correct speaker model. this situation is less critical since the overlap is reduced.5 -2 -2. Especially a weakly adapted codebook of the target speaker seems to perform often either comparably well or even worse than non-target speaker models.2 might give a reasonable explanation.6 likelihood evolution is evaluated for several adaptation schemes.5 -3 121 101 Adaptation level [NA ] 102 101 Adaptation level [NA ] 102 (a) Target speaker. it is complicated to construct a self-organizing system as long as the start or enrollment of new speakers is critical.5 0 -0. 6. The conclusion that the codebook of the target speaker performs better than the standard codebook and that codebooks of non-target speakers are worse is only valid on average.5 1 0.6.4 are represented in Fig.combined EV and ML adaptation with λ = 4 ( ). Especially in the learning phase. even relatively well adapted models of non-target speakers might behave comparably or better than the standard codebook in some cases. (b) Non-target speaker. The Figs. This restriction might explain the relatively tight Gaussian distributions at the beginning.2 Posterior Probability Depending on the Training Level 2 1. Speaker adaptation starts with a small set of adaptation parameters and allows a higher degree of individualism as soon as a sufficient amount of training data is available. Dedicated values of the means and standard deviations shown in Fig. the use of posterior probabilities should help to prevent false decisions and error propagation.5 contain a further detail concerning the standard deviations of the log-likelihood ratios. However.5 -1 -1. When NA exceeds a threshold of about 10 utterances.6. 6. the standard deviation increases . 6. Obviously. 6.

. For example. . . L1 . Yin et al. . The performance of each speaker model is evaluated in the u context of the codebooks of all enrolled speakers. This effect is unwanted for speaker identification and . In the following. this approach should be discussed with respect to the literature. . . posterior probabilities are derived for each speaker. . Lu Sp |iu ) · p(iu ) u u p(L0 . LNSp ) = u u u p(L0 . However. the shift of the Gaussian distributions is larger than the broadening of the curve. L1 . They conclude that likelihood scores drift with an increasing length of enrollment duration. . The procedure has to be applicable to a varying number of enrolled speakers without any re-training since a self-learning speech controlled system is targeted [Herbig et al. In the following the finding of the experiment described in this section is used to calculate posterior probabilities which reflect the observed likelihood scores based on a single utterance and the training level of all speaker specific codebooks. a decision strategy based on likelihoods instead of posterior probabilities is used.1) is represented by the conditional probability density of all log-likelihood scores. The proposed posterior probability is decomposed into the contributions of each speaker model and is derived by applying Bayes’ theorem. They assume Gaussian distributed likelihoods and perform mean subtraction and variance normalization to receive normalized scores. However. L1 . In contrast to this book. 2010e]. xTu } and the training levels of all speaker models. In a second step this concept is extended to posterior probabilities which take a series of successive utterances into consideration. They equally consider the observed likelihood scores given the current utterance Xu = {x1 .122 6 Evolution of an Adaptive Unsupervised Speech Controlled System significantly. This includes not only the log-likelihood Liu of the hypothesized target speaker iu u but also log-likelihoods of the standard codebook L0 and non-target speaker u models Li=iu . . The posterior probability p(iu |L0 . [2008] investigate the influence of speaker adaptation on speaker specific GMMs. . it seems to be advantageous to use relative measures instead of the absolute likelihood scores: • Absolute likelihood scores of speaker specific codebooks obviously do not provide a robust measure. . . . . prior probability of the assumed speaker and a normalization. A tendency towards more reliable decisions can be observed. . . Lu Sp ) u u N N (6. .2.2 Posterior Probability Computation at Run-Time Prior knowledge about likelihood evolution allows calculating meaningful posterior probabilities which are more reliable than pure likelihoods as will be shown later. adverse environments might affect modeling accuracy. 6. Finally.

• Another strong point for employing relative scores is the likelihood evolution described in the last section. The range of log-likelihood scores depends on the effective number of speaker adaptation parameters. The likelihood converges to higher values when the statistical models are continuously adapted.

• Furthermore, the log-likelihoods of the statistical model representing the target speaker seem to be only marginally correlated with those of non-target speaker models if relative measures, e.g. L_u^i − L_u^0, are considered instead of L_u^i. A similar result can be observed when the speaker models of two non-target speakers are compared. In both cases the correlation coefficient [Bronstein et al., 2000] given by

    ρ(a, b) = E_{a,b}{(a − E_a{a}) · (b − E_b{b})} / sqrt( E_a{(a − E_a{a})²} · E_b{(b − E_b{b})²} )    (6.2)
    a = L_u^i − L_u^0,  i = 1, ..., N_Sp    (6.3)
    b = L_u^j − L_u^0,  j = 1, ..., N_Sp,  j ≠ i    (6.4)

was quite small on average when all speakers of the development set were investigated. In the case of speaker models trained on a few utterances the magnitude of the correlation coefficient was |ρ| < 0.2. A further decrease could be observed for moderately and extensively trained speaker models.

Thus, equation (6.1) is rewritten in terms of relative measures with respect to the score of the standard codebook:

    p(i_u | L_u^0, L_u^1, ..., L_u^{N_Sp}) = [ p(L_u^1, ..., L_u^{N_Sp} | L_u^0, i_u) · p(L_u^0 | i_u) · p(i_u) ] / [ p(L_u^1, ..., L_u^{N_Sp} | L_u^0) · p(L_u^0) ].    (6.5)

Subsequently, it is assumed that the speaker specific log-likelihoods L_u^i which are normalized by L_u^0 can be treated separately or, equivalently, can be viewed as statistically independent:

    p(i_u | L_u^0, L_u^1, ..., L_u^{N_Sp}) = [ p(L_u^1 | L_u^0, i_u) / p(L_u^1 | L_u^0) ] · ... · [ p(L_u^{N_Sp} | L_u^0, i_u) / p(L_u^{N_Sp} | L_u^0) ] · [ p(L_u^0 | i_u) · p(i_u) / p(L_u^0) ]    (6.6)

    = [ p(L_u^1 | L_u^0, i_u) / p(L_u^1 | L_u^0) ] · ... · [ p(L_u^{N_Sp} | L_u^0, i_u) / p(L_u^{N_Sp} | L_u^0) ] · p(i_u | L_u^0).    (6.7)

Furthermore, the log-likelihoods of the standard codebook L_u^0 are assumed to be independent of the speaker identity i_u:

    p(i_u | L_u^0, L_u^1, ..., L_u^{N_Sp}) = [ p(L_u^1 | L_u^0, i_u) / p(L_u^1 | L_u^0) ] · ... · [ p(L_u^{N_Sp} | L_u^0, i_u) / p(L_u^{N_Sp} | L_u^0) ] · p(i_u).    (6.8)

The denominators contain normalization factors which guarantee that the sum over all posterior probabilities p(i_u | L_u^0, L_u^1, ..., L_u^{N_Sp}) is equal to unity. Thus, the final result can be given by

    p(i_u | L_u^0, L_u^1, ..., L_u^{N_Sp}) ∝ p(L_u^1 | L_u^0, i_u) · ... · p(L_u^{N_Sp} | L_u^0, i_u) · p(i_u).    (6.9)

To compute the posterior probability p(i_u | L_u^0, L_u^1, ..., L_u^{N_Sp}), the density function p(L_u^{i_u} | L_u^0, i_u) has to be calculated in the case of the target speaker and p(L_u^{i≠i_u} | L_u^0, i_u) in the case of a non-target speaker. In the following it is assumed that the speaker under investigation is the target speaker whereas all remaining speakers are non-target speakers. This step is repeated for all speakers. Thus, each posterior probability takes all log-likelihoods of a particular utterance into consideration. This procedure is not limited by the number of enrolled speakers N_Sp because an additional speaker only causes a further factor in (6.9).

In general, conditional Gaussian distributions [Bishop, 2007] seem to offer a statistical framework suitable to integrate the log-likelihood of a particular speaker and the reference given by the standard codebook. However, it is not obvious how this approach reflects different training levels. Alternatively, univariate Gaussian distributions can be applied which are trained and evaluated on the log-likelihood ratio L_u^i − L_u^0 as done in the preceding section. For reasons of simplicity the latter approach is employed. The knowledge about the adaptation status is thereby directly incorporated into the parameters of the conditional density functions in (6.9). The conditional density functions p(L_u^i | L_u^0, i_u) can now be calculated for the target speaker

    p(L_u^{i_u} | L_u^0, i_u) = N{ L_u^{i_u} − L_u^0 | μ̇_{i_u}, σ̇_{i_u} }    (6.10)

and for the non-target speakers

    p(L_u^{i≠i_u} | L_u^0, i_u) = N{ L_u^{i≠i_u} − L_u^0 | μ̈_i, σ̈_i }.    (6.11)

Both cases are combined in (6.9). The parameters for the target speaker are characterized by the mean μ̇ and standard deviation σ̇ based on log-likelihood ratios. The parameters for non-target speakers are denoted by μ̈ and σ̈. Two sets of parameters are required for each adaptation level as discussed before. For testing, the parameter sets {μ̇, σ̇} and {μ̈, σ̈} are calculated for each speaker individually. The parameters depend only on the number of training utterances N_A. With respect to Fig. 6.4 this problem can be considered as a regression problem. For example, a Multilayer Perceptron² (MLP) may be trained to interpolate the graph for unseen N_A. Each MLP may be realized by one node in the input layer, 4 nodes in the hidden layer and two output nodes. The latter represent the mean and standard deviation of the univariate Gaussian distribution. For each scenario one MLP was trained prior to the experiments which will be discussed later. Another strategy might be to categorize mean values and standard deviations in groups of several adaptation stages and to use a look-up table. In addition, the performance of non-target speaker models is evaluated likewise.

² A detailed description of MLPs, their training and evaluation can be found by Bishop [1996].
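The following minimal sketch illustrates how such a posterior computation might be implemented. It is not the original implementation of the book: all function names, the interpolation grid and the parameter values are hypothetical placeholders, a simple interpolation over log N_A stands in for the trained MLPs or look-up tables, and NumPy/SciPy are assumed to be available.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical support points for the target (mu_dot, sig_dot) and non-target
# (mu_ddot, sig_ddot) parameters over the adaptation level N_A, e.g. measured
# on a development set as in Fig. 6.4. All numbers are placeholders.
NA_GRID = np.array([1.0, 5.0, 10.0, 20.0, 50.0, 100.0])
MU_DOT = np.array([0.00, 0.15, 0.30, 0.45, 0.60, 0.70])
SIG_DOT = np.array([0.10, 0.15, 0.25, 0.30, 0.35, 0.40])
MU_DDOT = np.array([-0.10, -0.20, -0.30, -0.40, -0.50, -0.60])
SIG_DDOT = np.array([0.10, 0.15, 0.25, 0.30, 0.35, 0.40])

def level_params(n_a):
    """Interpolate the Gaussian parameters for an unseen adaptation level N_A.
    Linear interpolation over log10(N_A) plays the role of the small MLP or
    of the look-up table described in the text."""
    x = np.log10(max(n_a, 1.0))
    g = np.log10(NA_GRID)
    return (np.interp(x, g, MU_DOT), np.interp(x, g, SIG_DOT),
            np.interp(x, g, MU_DDOT), np.interp(x, g, SIG_DDOT))

def posterior(log_liks, log_lik_ref, n_adapt, prior=None):
    """p(i_u | L^0_u, ..., L^{N_Sp}_u) according to (6.9)-(6.11).

    log_liks    -- (N_Sp,) speaker specific log-likelihoods L^i_u
    log_lik_ref -- log-likelihood L^0_u of the standard codebook
    n_adapt     -- (N_Sp,) adaptation level N_A of each codebook
    """
    ratios = np.asarray(log_liks, dtype=float) - log_lik_ref   # L^i_u - L^0_u
    n_sp = ratios.size
    prior = np.full(n_sp, 1.0 / n_sp) if prior is None else np.asarray(prior)

    params = [level_params(n) for n in n_adapt]
    lp_t = np.array([norm.logpdf(r, p[0], p[1]) for r, p in zip(ratios, params)])
    lp_nt = np.array([norm.logpdf(r, p[2], p[3]) for r, p in zip(ratios, params)])

    # Hypothesize each speaker in turn as the target (6.10) while all
    # remaining speakers act as non-target speakers (6.11).
    log_post = lp_t - lp_nt + lp_nt.sum() + np.log(prior)
    log_post -= log_post.max()               # numerical stabilization
    post = np.exp(log_post)
    return post / post.sum()                 # normalization as in (6.8)
```

Working with log-likelihood ratios and normalizing only at the end mirrors the relative measures motivated above; the placeholder curves would of course have to be replaced by values estimated on a development set.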

In summary, posterior probabilities are employed to reflect the match between the target speaker model and an observation as well as the expected log-likelihood due to the adaptation stage of the statistical model. The computation of the posterior probabilities can be simplified when all log-likelihoods are normalized by the log-likelihood of the standard codebook. Prior knowledge about the likelihood evolution is directly incorporated into the parameters of the corresponding univariate Gaussian distributions.

6.3 Closed-Set Speaker Tracking

As discussed so far, only a posterior probability of a single utterance has been considered. However, the goal of this chapter is to provide the posterior probability p(i_u | X_{1:N_u}) for each speaker i which reflects an entire series of utterances X_{1:N_u}. The goal is thus to calculate a posterior probability which is not restricted to a single utterance: instead, a series of successive utterances is investigated to determine the probability that speaker i has spoken the utterance X_u given the preceding and successive utterances. Therefore a statistical model has to be found which allows long-term speaker tracking. In the following sections long-term speaker tracking is described.

Both discriminative and generative statistical models can be applied to distinguish between different speakers:

• Discriminative models such as an MLP can be trained to separate enrolled speakers and to detect speaker changes. However, the number of speakers is often unknown and speech data for an enrollment is usually not available for a self-learning speaker identification system. Furthermore, an enrollment is undesirable with respect to convenience. A solution is targeted which is independent from the composition of the speaker pool.

• Generative statistical models are preferred in this context since enrolled speakers are individually represented.

Subsequently, speaker tracking is viewed as a Markov process. In Fig. 6.7 an example with three enrolled speakers is depicted. Each state represents one enrolled speaker and the transitions model speaker changes from one utterance to the next. The transitions a_ij denote the speaker change probability from speaker i to speaker j. A flexible structure is used which allows the number of states to be increased with each new speaker until an upper limit is reached. Then one speaker model is dropped and replaced by a new speaker model to limit the computational complexity. Transition probabilities are chosen independently from the enrolled speakers. They are either determined by an expected prior speaker change rate p_Ch or can be controlled by additional devices such as a beamformer if a discrete probability for speaker changes can be given.

[Fig. 6.7 Speaker tracking for three enrolled speakers (Sp 1, Sp 2, Sp 3) realized by a Markov model of first order with transition probabilities a_11, ..., a_33.]

A new HMM³ can be constructed on an utterance level by combining the codebooks and the Markov model for speaker tracking [Herbig et al., 2010e]. In contrast to conventional HMMs, this Markov model is operated on asynchronous events, namely utterances, instead of equally spaced time instances. Instead of single observations x_t, segments x_{1:T} of variable length T are assigned to each state to overcome some limitations of HMMs [Delakis et al., 2008].

In the next step the link between speaker identification and the Markov model for speaker tracking has to be implemented. Speaker identification makes use of speaker specific codebooks which are interpreted as GMMs with uniform weighting factors. During speech recognition the codebooks represent speaker specific pronunciations to guarantee optimal speech decoding. Here the codebooks are considered in the context of a series of utterances and speaker changes. These codebooks are used to determine the average match with the speaker's characteristics for speaker identification according to (5.23). The emission probability density of a particular speaker is given by

    p(X_u | i_u) = exp( T_u · L_u^{i_u} ).    (6.12)

Each speaker, or equivalently each state, is described by an individual GMM and therefore the resulting HMM equals a CDHMM as introduced earlier.

³ A similar model, the so-called segment model, is known from the speech recognition literature.
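As an illustration, the utterance-level Markov model can be set up with a handful of lines. This is a hedged sketch, not the book's implementation: the function names are invented, the change rate p_ch is a free parameter, and the emission is kept in the log domain to avoid the numerical underflow that a literal evaluation of the exponential in (6.12) would cause.

```python
import numpy as np

def transition_matrix(n_sp, p_ch):
    """Speaker change probabilities a_ij of the Markov model in Fig. 6.7.
    The prior change rate p_ch is split uniformly among all other speakers;
    the transitions are chosen independently of the enrolled speakers."""
    a = np.full((n_sp, n_sp), p_ch / (n_sp - 1))
    np.fill_diagonal(a, 1.0 - p_ch)
    return a

def log_emission(mean_log_liks, n_frames):
    """log p(X_u | i_u) of one utterance according to (6.12): the average
    frame log-likelihood L^i_u of each codebook scaled by the utterance
    length T_u."""
    return n_frames * np.asarray(mean_log_liks, dtype=float)

# Example: three enrolled speakers and an expected change rate of 10 %.
A = transition_matrix(3, p_ch=0.1)
log_b = log_emission([-42.3, -40.1, -43.7], n_frames=250)  # placeholder scores
```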

The graphical representation in form of a finite state machine as used in Fig. 6.7 only discloses all possible states and transitions as well as the corresponding probabilities; the time trajectory is concealed. Therefore the trellis representation is subsequently used because it spans all possible paths or, equivalently, the trajectory of states as a surface. In Fig. 6.8 an example with 5 speakers and two exemplary paths is shown. The dashed path displays the correct path and the dotted one denotes a non-target speaker.

Decoding techniques for HMMs can be used to assign a series of utterances

    X_{1:N_u} = {X_1, ..., X_{N_u}}    (6.13)

to the most probable sequence of speaker identities

    i_{1:N_u}^{MAP} = (i_1^{MAP}, ..., i_{N_u}^{MAP}).    (6.14)

This task can be solved by an iterative algorithm such as the forward algorithm already explained in Sect. 2.5.2. The dashed path in Fig. 6.8 is iteratively processed by the forward algorithm. The forward algorithm can be employed to compute the joint probability p(X_{1:u}, i_u) of each speaker i_u given the observed history of utterances X_{1:u}. The prior probability of each speaker is modeled here by p(i_u) = 1/N_Sp, where N_Sp represents the number of enrolled speakers. In this example the first three utterances are expected to produce high posterior probabilities since no speaker turn is presumed. After the speaker change there is a temporary decrease of the resulting posterior probability because speaker turns are penalized by the prior probability p_Ch. The backward algorithm encounters the inverse problem since only the last three utterances are expected to result in high posterior probabilities. This example shows the problem of incomplete posterior probabilities.

Thus, complete posteriors are used to capture the entire temporal knowledge about the complete series of utterances. This is achieved by the fusion of the forward and backward algorithm which was introduced in Sect. 2.5.2. For all speaker models the likelihood p(X_{u+1:N_u} | i_u) has to be calculated as described there. This requires saving the likelihoods p(X_u | i_u) of N_u utterances in a buffer and re-computing the complete backward path after each new utterance. Here only the final result is given:

    p(i_u | X_{1:N_u}) ∝ p(X_{1:u}, i_u) · p(X_{u+1:N_u} | i_u),  0 < u < N_u    (6.15)
    p(i_{N_u} | X_{1:N_u}) ∝ p(X_{1:N_u}, i_{N_u}).    (6.16)

The MAP criterion can be applied to determine the most probable speaker identity

    i_u^{MAP} = arg max_{i_u} { p(i_u | X_{1:N_u}) }.    (6.17)
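To make the decoding step concrete, the following sketch implements the utterance-level forward-backward computation in the log domain. It is a hedged illustration rather than the book's code: the function name and argument layout are invented, SciPy's logsumexp is assumed to be available, and the emission scores may come from (6.12) or from the posterior-based density discussed in the text.

```python
import numpy as np
from scipy.special import logsumexp

def speaker_posteriors(log_emis, log_a, log_prior):
    """Forward-backward decoding over a buffer of utterances, cf. (6.15)-(6.17).

    log_emis  -- (N_u, N_Sp) utterance-level log-emission scores
    log_a     -- (N_Sp, N_Sp) log transition matrix (speaker changes)
    log_prior -- (N_Sp,) log prior p(i_1), e.g. log(1/N_Sp)
    Returns the posteriors p(i_u | X_{1:N_u}) and the MAP speaker sequence.
    """
    n_u, n_sp = log_emis.shape
    log_fwd = np.empty((n_u, n_sp))
    log_bwd = np.zeros((n_u, n_sp))          # log 1 for the last utterance

    log_fwd[0] = log_prior + log_emis[0]     # forward pass: p(X_{1:u}, i_u)
    for u in range(1, n_u):
        log_fwd[u] = log_emis[u] + logsumexp(log_fwd[u - 1][:, None] + log_a, axis=0)

    # Backward pass: p(X_{u+1:N_u} | i_u); re-computed after each new utterance.
    for u in range(n_u - 2, -1, -1):
        log_bwd[u] = logsumexp(log_a + log_emis[u + 1] + log_bwd[u + 1], axis=1)

    log_post = log_fwd + log_bwd             # fusion according to (6.15), (6.16)
    log_post -= logsumexp(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)
    return post, post.argmax(axis=1)         # MAP criterion (6.17)
```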

[Fig. 6.8 Example for speaker tracking based on posterior probabilities. States are depicted versus the discrete time axis. 5 speakers and two exemplary paths are shown.]

Additionally, the Markov model for speaker tracking and the modified posterior probability discussed before can be combined to obtain a more precise speaker modeling where the likelihood evolution is considered. The probability density function p(L_u^0, L_u^1, ..., L_u^{N_Sp} | i_u) is used as an indirect measure for p(X_u | i_u) since the likelihoods L_u^0, ..., L_u^{N_Sp} are derived from the observation X_u. It is employed in the forward-backward algorithm instead of the likelihood p(X_u | i_u) and can be obtained by applying Bayes' theorem to (6.9).

In addition, a direct measure for confidence, or complementary uncertainty, is achieved by posterior probabilities. If the uncertainty exceeds a given threshold, particular utterances can be rejected by speaker identification. Those utterances are not used for speaker adaptation. In the experiments carried out, adaptation was not performed when p(i_u^{MAP} | X_{1:N_u}) < 0.5.
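Reusing the hypothetical speaker_posteriors function and transition matrix A from the sketches above, such a confidence-based rejection might look as follows; the 0.5 threshold follows the experiment described in the text, everything else is an invented fragment:

```python
import numpy as np

# log_emis: (N_u, N_Sp) buffered utterance-level log-emission scores (see above)
post, map_ids = speaker_posteriors(log_emis, np.log(A),
                                   np.log(np.full(A.shape[0], 1.0 / A.shape[0])))
for u, i_map in enumerate(map_ids):
    if post[u, i_map] < 0.5:   # uncertainty too high: reject this utterance
        continue               # -> no speaker adaptation with utterance u
    # otherwise utterance u may be accumulated for adapting speaker i_map
```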

6.4 Open-Set Speaker Tracking

The focus of the preceding sections has been to enhance speech recognition by an unsupervised speaker tracking which is able to identify a limited number of enrolled speakers. Only the closed-set case has been investigated: the current user is always assigned to one of the enrolled speaker profiles. The feature vectors of an utterance are only compared with the existing speaker profiles and the one with the highest match or, negatively spoken, with the lowest discrepancy is selected. However, there is no measure whether the existing speaker models fit at all. In this scenario speaker identification needs additional input when a new user operates the speech controlled device for the first time.

Earlier in this book some techniques were discussed to detect unknown speakers. The principal idea was to apply either fixed or dynamic thresholds to the log-likelihoods of each speaker model. Log-likelihood scores can be normalized by a reference, e.g. given by a UBM or a speaker independent codebook, to obtain a higher robustness against false detections. The range of log-likelihood values is then known a priori. Especially in the learning phase of a new codebook, however, this normalization can cause unwanted results and might pretend a high confidence. The log-likelihood distributions significantly vary depending on the adaptation stage. Fig. 6.5 and Fig. 6.9 show that a global threshold for log-likelihood ratios is difficult and that it is complicated to find an optimal threshold. An unacceptably high error rate with respect to the discrimination of in-set and out-of-set speakers is expected. Long-term speaker tracking suffers from this deficiency in the same way.

The goal is therefore to define a statistical approach to detect mismatch conditions, especially unknown speakers. Unknown speakers cannot be modeled directly because of the lack of training or adaptation data. However, an alternative model for situations when no speaker model represents an utterance appropriately can be obtained as a by-product of the posterior probability explained in Sect. 6.2. The main principle is to consider one speaker as the target speaker whereas the remaining speakers have to be viewed as non-target speakers. In (6.9) each log-likelihood score is validated given the adaptation stage of the corresponding codebook, and (6.9) already contains the log-likelihood ratios of all possible non-target speaker models. When there is a mismatch between an utterance and a speaker's codebook, this speaker model behaves like a non-target model. In the case of an unknown user no speaker model exists so that all speaker models act as non-target speaker models. Thus, the posterior probability of an out-of-set speaker is proportional to the product of the univariate Gaussian functions of all non-target speakers [Herbig et al., 2011]:

    p(i_u = N_Sp + 1 | L_u^0, L_u^1, ..., L_u^{N_Sp}) ∝ ∏_{i=1}^{N_Sp} N{ L_u^i − L_u^0 | μ̈_i, σ̈_i } · p(i_u = N_Sp + 1).    (6.18)

The closed-set normalization always guarantees that the sum of all posterior probabilities equals unity. In consequence, only the normalization has to be adjusted to guarantee that the sum over all in-set and out-of-set speakers equals unity.
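A sketch of the extended computation, reusing the hypothetical level_params helper from the earlier sketch; p_unknown is a free prior for the out-of-set hypothesis and, like all names here, an assumption rather than the book's implementation:

```python
import numpy as np
from scipy.stats import norm

def open_set_posterior(log_liks, log_lik_ref, n_adapt, p_unknown=0.5):
    """Posterior over N_Sp in-set speakers plus one out-of-set hypothesis,
    cf. (6.18): for an unknown speaker all enrolled models act as
    non-target models."""
    ratios = np.asarray(log_liks, dtype=float) - log_lik_ref
    params = [level_params(n) for n in n_adapt]
    lp_t = np.array([norm.logpdf(r, p[0], p[1]) for r, p in zip(ratios, params)])
    lp_nt = np.array([norm.logpdf(r, p[2], p[3]) for r, p in zip(ratios, params)])

    n_sp = ratios.size
    log_scores = np.empty(n_sp + 1)
    # In-set hypotheses: target density for speaker i, non-target for the rest.
    log_scores[:n_sp] = lp_t - lp_nt + lp_nt.sum() + np.log((1.0 - p_unknown) / n_sp)
    # Out-of-set hypothesis (6.18): every model contributes its non-target density.
    log_scores[n_sp] = lp_nt.sum() + np.log(p_unknown)
    log_scores -= log_scores.max()
    post = np.exp(log_scores)
    return post / post.sum()   # in-set and out-of-set posteriors sum to unity
```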

[Fig. 6.9 Overlap of the log-likelihood distributions of the target (solid line) and non-target speaker models (dashed line) for the adaptation levels (a) N_A = 5, (b) N_A = 10, (c) N_A = 20, (d) N_A = 30, (e) N_A = 50 and (f) N_A = 100. The x-coordinate denotes the log-likelihood ratio L_u^i − L_u^0 of the speaker specific codebooks and the standard codebook. λ = 4 is employed in speaker adaptation.]

The concept of long-term speaker tracking, however, exploits the knowledge of a series of utterances and is more robust compared to a binary classifier based on likelihoods or posterior probabilities originating from a single utterance. The Markov model used for speaker tracking can be easily extended from N_Sp to N_Sp + 1 states to integrate the statistical modeling of unknown speakers into the forward-backward algorithm.

[Fig. 6.10 Example for open-set speaker tracking. Two paths belong to enrolled (in-set) speakers and an additional one represents an unknown (out-of-set) speaker.]

The MAP estimate in (6.17) may be used to detect unknown speakers in direct comparison with all enrolled speaker models. In Fig. 6.10 open-set speaker tracking is shown for an example of three paths. Two belong to enrolled speakers and an additional one represents an unknown speaker. If the out-of-set speaker obtains the highest posterior probability, a new speaker specific codebook may be initialized. However, a relatively small posterior probability might already be sufficient to enforce a new codebook. This can be the case when all posterior probabilities tend towards a uniform distribution.

It seems to be advantageous to consider speaker identification as a two-stage decision process [Angkititrakul and Hansen, 2007]. In the first step, the MAP criterion is applied as given in (6.17) to select the most probable in-set speaker. In the second step, this speaker identity has to be verified. This leads to a binary decision between in-set and out-of-set speakers. Hypotheses H_0 and H_1 characterize the event of an in-set and an out-of-set speaker, respectively. In general, the following two probabilities can be calculated by the forward-backward algorithm. The posterior probability

    p(H_0 | X_{1:N_u}) = p(i_u ≤ N_Sp | X_{1:N_u})    (6.19)

denotes an enrolled speaker and the probability of an unknown speaker is given by

    p(H_1 | X_{1:N_u}) = p(i_u = N_Sp + 1 | X_{1:N_u}).    (6.20)

Optimal Bayes decisions depend on the choice of the costs for both error scenarios. In particular, equal costs for not detected out-of-set speakers and not detected in-set speakers are assumed to simplify the following considerations. Bayes' theorem permits the following notation:

    p(X_{1:N_u} | H_0) · p(H_0) = p(H_0 | X_{1:N_u}) · p(X_{1:N_u})    (6.21)
    p(X_{1:N_u} | H_1) · p(H_1) = p(H_1 | X_{1:N_u}) · p(X_{1:N_u}),    (6.22)

where the posterior probabilities p(H_0 | X_{1:N_u}) and p(H_1 | X_{1:N_u}) are complementary:

    p(H_0 | X_{1:N_u}) + p(H_1 | X_{1:N_u}) = 1.    (6.23)

Accordingly, the following criterion can be applied to decide whether a new speaker profile has to be initialized:

    p(H_0 | X_{1:N_u}) / p(H_1 | X_{1:N_u})  ≤_{H_1}  1    (6.24)
    (1 − p(H_1 | X_{1:N_u})) / p(H_1 | X_{1:N_u})  ≤_{H_1}  1    (6.25)
    p(H_1 | X_{1:N_u})  ≥_{H_1}  1/2.    (6.26)

This technique provides a solution for the problem of open-set speaker tracking.
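A minimal sketch of this two-stage decision, assuming the posteriors of the extended model with N_Sp + 1 states are already available; the function name and argument layout are hypothetical:

```python
import numpy as np

def decide_new_profile(post_u, n_sp):
    """Two-stage open-set decision for one utterance, cf. (6.24)-(6.26).

    post_u -- (N_Sp + 1,) posterior over all states for utterance u; the
              last entry is the out-of-set hypothesis H1.
    Returns (initialize_new_profile, most_probable_in_set_speaker)."""
    p_h1 = post_u[n_sp]                    # p(H1 | X_{1:N_u})
    i_map = int(np.argmax(post_u[:n_sp]))  # step 1: MAP in-set speaker (6.17)
    # Step 2: with equal error costs, decide for H1 iff p(H1 | X) >= 1/2 (6.26).
    return p_h1 >= 0.5, i_map
```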

6.5 System Architecture

In combination with long-term speaker tracking a completely unsupervised system can be implemented where the detection of unknown speakers is integrated [Herbig et al., 2011]. Discrete decisions on an utterance level are avoided and a maximum of information is exploited. Furthermore, a series of utterances is considered to reduce the risk of false decisions. Subsequently, the posterior probabilities derived in Sect. 6.2 can be used in the forward-backward algorithm. Since speech recognition requires a guess of the speaker identity on different time scales, the structure of the speaker identification process shown in Fig. 6.11 is employed:

[Fig. 6.11 System architecture for joint speaker identification and speech recognition comprising three stages: front-end, speech recognition with ML speaker adaptation, and the forward-backward algorithm for long-term speaker tracking, yielding the transcription and the speaker identity. Parts I, II and III are described in the text.]

I   Speech recognition. Codebook selection is realized by a local speaker identification on a frame level as discussed in Chapter 5. The assignment of the observed feature vectors to the Gaussian distributions of a codebook is determined by the standard codebook.

II  Preliminary speaker identification. This stage is required for practical reasons since speaker identification and speech recognition accuracy are sensitive to an appropriate feature normalization. After each utterance the updates of the speaker specific parameter sets for energy normalization and mean subtraction are kept if no speaker turn is detected. Otherwise a reset is performed and the parameter set of the identified speaker is used subsequently.

III Long-term speaker tracking. Posterior probabilities are calculated for each utterance and long-term speaker tracking is performed. Speaker tracking is based on posterior probabilities originating from the re-interpretation of the observed log-likelihood scores. Unknown speakers can be detected. Codebooks are initialized in the case of an unknown speaker and the statistical modeling of the speaker characteristics is continuously improved. An additional speaker change detection, e.g. with the help of a beamformer, may be included without any structural modifications.

The speaker adaptation scheme presented in Chapter 4 allows the system to delay the final speaker identification since only the sufficient statistics of the standard codebook given by (4.14) and (4.15) are required. Therefore the speaker identity is not needed to accumulate the adaptation data. As soon as the speaker identity is determined, the data of this speaker is accumulated and an enhanced codebook is computed. Speaker adaptation is then calculated for the most probable alignment of speaker identities. Speaker adaptation is employed to enhance both speaker identification and speech recognition. To balance speaker identification accuracy with the latency of speaker adaptation and memory consumption, at most 6 successive utterances are subsequently employed for speaker tracking. At the latest after a predetermined number of utterances a final speaker identification is enforced.

6.6 Evaluation

In the following experiments speaker identification has been extended by long-term speaker tracking. A completely unsupervised speech controlled system has been evaluated. Speaker identification and speech recognition are evaluated for speaker groups of 5 or 10 enrolled speakers [Herbig et al., 2010d,e]. First, the closed-set scenario examined in the preceding chapter is taken into consideration: closed-set experiments investigate the speaker identification and speech recognition rates for speaker sets with 5 or 10 speakers. Then speaker tracking is tested in a completely unsupervised way to simulate an open-set test case [Herbig et al., 2011]. Finally, the robustness of both implementations against speaker changes is examined.

6.6.1 Closed-Set Speaker Identification

In a first step, the closed-set scenario is discussed. In addition, the effect of speaker changes on the identification accuracy is examined.

Unsupervised Speaker Identification of 5 enrolled Speakers

Closed-set identification has already been examined for groups of 5 speakers in the preceding chapter. This experiment has been repeated with long-term speaker tracking under realistic conditions to obtain a direct comparison. In total, 60 different sets were evaluated and 75,000 test utterances were examined. The speech recognition rate is compared to the results of the first solution presented in Chapter 5 and the optimal speaker adaptation investigated in Chapter 4. The optimum of the WA was determined in Chapter 4 by supervised speaker adaptation in the sense that the speaker identity was given. The results are summarized in Table 6.1.

The identification accuracy is noticeably improved by long-term speaker tracking: for all implementations except the baseline and short-term adaptation the identification rates are higher than for the experiments discussed in the preceding chapter. Again, the combination of EV and ML estimates yields the best identification result for λ = 4: 98.59 % of the speakers are correctly identified. For λ = 4 a relative error rate reduction of approximately 74 % can be achieved by long-term speaker tracking. Even for λ ≈ 0 an improvement can be stated since the temporary decrease of the likelihood discussed earlier is taken into consideration. For higher values of λ the identification accuracy is degraded. For λ → ∞ only a marginal improvement is achieved, which can be explained by Fig. 6.6: since the corresponding likelihood scores settle after 10 − 20 utterances, no improvement can be expected from the knowledge of the adaptation status in the long run. Unfortunately, speech recognition does not benefit from this approach and remains at the same level as before.

[Table 6.1 Comparison of different adaptation techniques for self-learning speaker identification with long-term speaker tracking: rejected utterances [%], WA [%] and speaker identification rate [%] for the baseline, short-term adaptation and EV-MAP adaptation with ML (λ ≈ 0), λ = 4, 8, 12, 16, 20 and EV (λ → ∞), including the typical min/max errors. Speaker sets comprising 5 enrolled speakers are considered. Table is taken from [Herbig et al., 2010e].]

[Table 6.2 Comparison of different adaptation techniques for long-term speaker tracking. Speaker identification (ID [%]) and speech recognition (WA [%]) results are presented for the evaluation bands I = [1, 50[, II = [50, 100[ and III = [100, 250] utterances. Speaker sets with 5 enrolled speakers are investigated.]

Comparing the evaluation bands in Table 5.2 and Table 6.2, no significant differences with respect to speech recognition can be observed. Speaker identification, however, benefits from long-term speaker tracking even in the learning phase given by the evaluation band I.

This indicates that error propagation in the first evaluation band negatively affected the results of the implementations examined in the preceding chapter.

Since this example has been discussed for several implementations, the results for supervised speaker adaptation, joint speaker identification and speech recognition, and the combination with long-term speaker tracking are finally compared in Fig. 6.12. Both the WA and the identification rate are depicted. Furthermore, the speaker independent baseline (BL) and short-term adaptation (ST) are shown. In both cases about 20 % relative error rate reduction can be obtained for speech recognition when λ = 4 is used for adaptation. Thus, the speech recognition accuracy of the unified approach appears to be relatively insensitive to moderate speaker identification errors, as expected [Herbig et al., 2010d].

[Fig. 6.12 Comparison of supervised speaker adaptation with predefined speaker identity (black), joint speaker identification and speech recognition (dark gray) and long-term speaker tracking (light gray) versus λ; the baseline (BL) and short-term adaptation (ST) are given for reference. (a) Results for speech recognition (WA [%]). (b) Results for speaker identification (ID [%]). Figure is taken from [Herbig et al., 2010e].]

Unsupervised Speaker Identification of 10 enrolled Speakers

As discussed so far, speaker sets of 5 enrolled speakers have been evaluated. Even though 5 enrolled speakers seem to be sufficient for many use cases, e.g. in-car applications, the robustness of the speech controlled system should be investigated when the number of speakers is doubled. A new test set comprising 30 speaker sets is used for evaluation. First, the joint speaker identification and speech recognition is tested without long-term speaker tracking to determine the performance decrease for speaker identification and speech recognition. Then long-term speaker tracking is activated to evaluate whether an increased error rate on single utterances severely affects speaker tracking. The results are given in Table 6.3 and Table 6.4.

When Table 5.1 and Table 6.3 are compared, despite the different test sets it can be stated that the speaker identification rates drop significantly. However, even with 10 enrolled speakers about 90 % identification rate can be achieved.

[Table 6.3 Comparison of different adaptation techniques for self-learning speaker identification without long-term speaker tracking: rejected utterances [%], WA [%] and speaker identification rate [%] for the baseline, short-term adaptation and EV-MAP adaptation (ML, λ = 4, 8, 12, 16, 20, EV), including the typical min/max errors. Speaker sets comprising 10 enrolled speakers are investigated. Table is taken from [Herbig et al., 2010d].]

[Table 6.4 Comparison of different adaptation techniques for self-learning speaker identification with long-term speaker tracking. Speaker sets with 10 enrolled speakers are investigated. Table is taken from [Herbig et al., 2010d].]

However, a comparable level of the speaker identification rates to the former experiments is achieved when speaker tracking is active, as shown in Table 6.4. With respect to Table 6.3 and Table 6.4 it becomes evident that an increased number of maladaptations also entails a less confident codebook selection which reduces the WA. Therefore long-term speaker tracking generally yields an advantage for the speaker identification and speech recognition rate. Especially for larger speaker sets, a fast and robust information retrieval already in the learning phase is essential.

For the considered use case (N_Sp < 10) it can be concluded that a speech controlled system has been realized which is relatively insensitive to the number of enrolled speakers or the tuning parameter of speaker adaptation. Furthermore, the evolution of the unsupervised system tends towards an even more robust system. Remarkably high identification rates have been achieved.

Effects of Speaker Changes on the Speaker Identification Accuracy

The techniques for speech controlled systems examined here do not use an explicit speaker change detection such as BIC. Instead, speaker changes are tracked by speaker identification on different time scales since speaker turns within an utterance are not considered. Codebooks are identified on a frame level for speaker specific speech recognition. A rapid switch between the codebooks of the enrolled speakers is possible if a clear mismatch between the expected speaker characteristics and the observed data is detected. But feature extraction is also speaker specific and is controlled on an utterance level. A reliable guess of the current speaker identity is available as soon as the entire utterance is processed. Thus, a parallel feature extraction can be avoided. This technique bears the risk not to detect speaker changes accurately since the parameter set of the speaker previously detected is used. Hence, the first utterance after a speaker change uses improper parameters for feature extraction. It is assumed that most of the identification errors occur after speaker changes.

Subsequently, the vulnerability of speaker identification with respect to speaker changes is investigated. Speaker change detection is given by the identification result of successive utterances. Nevertheless the question remains how the speech recognizer with integrated speaker identification reacts to speaker changes. This question should be answered by certain points on the ROC curve which are shown in Fig. 6.13 and Fig. 6.14. The true positives, that means correctly detected speaker turns, are plotted versus the rate of falsely assumed speaker changes. Additional information about ROC curves can be found in the appendix. Fig. 6.13 and Fig. 6.14 are based on the results of Table 5.2 and Table 6.2, respectively. The performance of the target system either with or without long-term speaker tracking is depicted. The prior probability of speaker changes pCh = 0.1 is rather small and is comparable to the error rate of speaker identification on an utterance level as given in Table 5.1. According to Fig. 6.13 and Fig. 6.14 it seems to be obvious that speaker changes are not a critical issue for the self-learning speaker identification. In fact, Fig. 6.13 and Fig. 6.14 also reveal that long-term speaker tracking only reduces the false alarm rate and even marginally increases the rate of missed speaker changes. The latter case can be explained by the fact that speaker changes are penalized by pCh in the forward-backward algorithm, and only moderate speaker change rates are expected.

In addition, several experiments for an explicit speaker change detection were conducted. For example, the Arithmetic-Harmonic Sphericity (AHS) measure as found by Bimbot et al. [1995], Bimbot and Mathan

[1993] was examined to capture speaker turns. Unfortunately, no robust speaker change detection could be achieved. The reason may be the interference of speaker and channel characteristics as well as background noises. It seems to be complicated to accurately detect speaker changes in adverse environments using simple statistical models, as already mentioned earlier. Speaker change detection was therefore delegated to speaker identification. Hence, only a prior probability for speaker changes was used for long-term speaker tracking.

[Fig. 6.13 Error rates of speaker change detection. The detection rate of speaker changes is plotted versus the false alarm rate for speaker identification on an utterance level: (a) λ = 4, (b) λ = 8, (c) λ = 12, (d) λ = 16. The evaluation bands I = [1, 50[, II = [50, 100[ (+), III = [100, 250] and the complete test set (◦) are used. Confidence bands are given by a gray shading.]

6.6.2 Open-Set Speaker Identification

The closed-set scenario is now extended to an open-set scenario. As discussed so far, the first two utterances of a new speaker were indicated to belong to an unknown speaker as depicted in Fig. 4.1(a).

[Fig. 6.14 Error rates of speaker change detection. The detection rate of speaker changes is plotted versus the false alarm rate for long-term speaker tracking: (a) λ = 4, (b) λ = 8, (c) λ = 12, (d) λ = 16. The evaluation bands I = [1, 50[, II = [50, 100[ (+), III = [100, 250] and the complete test set (◦) are used. Confidence bands are given by a gray shading.]

To obtain a strictly unsupervised speech controlled system, no information about the first occurrences of a new speaker is given by external means. The test setup is shown in Fig. 4.1(b). Subsequently, long-term speaker tracking is extended so that unknown speakers can be detected as described in Sect. 6.4. If a known user is detected, the correct speaker profile has to be adjusted. Otherwise a new speaker profile has to be initialized on the first few utterances and continuously adapted on the subsequent utterances. To limit memory consumption and computational load, only one out-of-set and at most six in-set speakers can be tracked. If necessary one profile, e.g. the most recently initialized speaker model, is replaced by a new one.

Again, the evaluation set comprises 5 speakers in each speaker set and is identical to the experiment repeatedly discussed before. Table 6.5 contains the results measured on the complete test set. A more detailed investigation of several evaluation bands is given in Table 6.6.

[Table 6.5 Comparison of different adaptation techniques for an open-set scenario: WA [%] for the baseline, short-term adaptation and EV-MAP adaptation with ML (λ ≈ 0), λ = 4, 8, 12, 16, 20 and EV (λ → ∞), including the typical min/max errors, measured on the complete test set. Speaker sets with 5 speakers are investigated. Table is taken from [Herbig et al., 2011].]

[Table 6.6 Comparison of different adaptation techniques for an open-set scenario. The speech recognition rate (WA [%]) is examined on evaluation bands comprising I = [1, 50[, II = [50, 100[ and III = [100, 250] utterances. Speaker sets with 5 speakers are examined.]

The combination of EV and ML estimates outperforms pure ML or EV estimates as well as the baseline and short-term adaptation without speaker identification. The parameter λ can be selected in a relatively wide range without negatively affecting speech recognition.

In summary, the achieved WA is smaller compared to the closed-set experiments but there is still a clear improvement with respect to the baseline and short-term adaptation. On the first evaluation band I a relative increase of 6.4 % can be observed for the error rate of the open-set implementation when λ = 4 is used for codebook adaptation. On the evaluation bands II and III the open-set implementation approaches the closed-set realization; here relative error rates of 4 % and 2.7 % can be measured. Comparing Table 5.2 with Table 6.6 it can be stated that this effect is not limited to λ = 4.

The performance of the closed-set and open-set implementations is compared in Fig. 6.15. Improvements in speech recognition rate are evaluated with respect to the baseline (BL) and short-term adaptation (ST) since different subsets have been examined. Due to unsupervised speaker identification the recognition rate was reduced from 88.87 % WA (Table 4.1) to 88.10 % WA (Table 5.1) for λ = 4. The additional decrease to 87.64 % WA therefore demonstrates how important the training of new speaker models based on the first two utterances and the knowledge about the number of users seem to be for the experiments described in the preceding sections. This finding gives reason to assume long-term stability of the unsupervised system.

[Fig. 6.15 Speech recognition compared for supervised speaker adaptation with predefined speaker identity (black) and long-term speaker tracking for the closed-set (gray) and open-set (white) scenario: (a) evaluation band I = [1, 50[ utterances, (b) evaluation band II = [50, 100[ utterances, (c) evaluation band III = [100, 250] utterances. Figure is taken from [Herbig et al., 2011].]

6.7 Summary

The unified approach for speaker identification and speech recognition introduced in the preceding chapter can be extended by an additional component to appropriately track speakers on a series of utterances. A method has been developed to compute posterior probabilities reflecting not only the value of the log-likelihood but also prior knowledge about the evolution of speaker profiles. The main aspect is to consider speaker profiles with respect to their training and to provide prior knowledge about their expected performance. The performance of non-target speaker profiles is included to calculate a probability for each enrolled speaker to be responsible for the current utterance. By using a simple HMM on an utterance level, the decoding techniques known from speech recognition can be applied to determine the most probable speaker alignment. Speaker adaptation can be delayed until a more confident guess of the speaker identity is obtained.

A significantly higher speaker identification rate compared to the first realization of the target system was achieved for closed sets. An optimum of 98.59 % correctly identified speakers was obtained for λ = 4 whereas an identification rate of 94.64 % was observed without long-term tracking.

Since this technique can be easily extended to open-set speaker tracking, new speaker profiles can be initialized without any additional intervention of new users. No training or explicit authentication of a new speaker is required. Without long-term speaker tracking a notable number of codebooks is expected to be initialized even though an enrolled user is speaking, as the corresponding ROC curves suggest. Only weakly or moderately trained speaker profiles would be effectively used for speech recognition if unknown speakers were not reliably detected. On the other hand, missed speaker changes would also negatively affect speech and speaker modeling, leading to significantly higher recognition error rates. Applying long-term speaker tracking can help to prevent a decrease of the speech recognizer's performance. When memory is limited, only a small number of codebooks can be considered. Long-term speaker tracking is therefore essential to implement a speaker specific speech recognizer operated in an unsupervised way. A steady stabilization of the complete system was observed in the long run.


7 Summary and Conclusion

The goal of this book was to develop an unsupervised speech controlled system which automatically adapts to several recurring users. The system can be incrementally personalized: each user can directly use the speech controlled system without any constraints concerning vocabulary or the need to identify himself. Users are allowed to operate the system without the requirement to attend a training. New speaker profiles are initialized when unknown speakers are detected. Each utterance contains additional information about a particular speaker. Enrolled speakers have to be reliably identified to allow optimal speech decoding and a continuous adjustment of the corresponding statistical models. The acoustic environment, especially for in-car applications, is characterized by varying background noises and channel characteristics degrading automatic speech recognition and speaker identification. An implementation in an embedded system was intended and therefore computational complexity and memory consumption were essential design parameters. A compact representation with only one statistical model was targeted and an extension of a common speech recognizer was preferred.

Step by step more sophisticated techniques and statistical models were introduced to capture speech and speaker characteristics. The discussion started with speech production represented by a simple yet effective source-filter model. It defines the process of speech generation and provides a first overview of the complexity of speech and speaker variability and their dependencies. Automated feature extraction was motivated by human hearing; it processes a continuous stream of audio data and extracts a feature vector representation for further pattern recognition. A standard technique for noise reduction was included. Then several basic techniques for speaker change detection, speaker identification and speech recognition were discussed. They represent essential components of the intended solution. Speaker adaptation moderates mismatches between training and test by adjusting the statistical models to unseen situations. Modeling accuracy

depends on the number of parameters which can be reliably estimated. Several strategies were considered to handle speaker variabilities and to establish more complex systems which enable an unsupervised speaker modeling and tracking. Feature vector enhancement was introduced as an alternative approach to deal with speaker characteristics and environmental variations on a feature level. Then some dedicated realizations of complete systems for unsupervised speaker identification or speech recognition were presented as found in the literature. All these implementations cover several aspects of this book. The solutions differ in model complexity and the mutual benefit between speech and speaker characteristics.

Then a first realization of the target system was presented. First, an appropriate strategy for speaker adaptation was discussed. Several strategies were examined suitable for fast and long-term adaptation depending on the amount of speaker specific training data. The presented scheme is suitable for both short-term and long-term adaptation since the effective number of parameters is dynamically adjusted. Even on limited data a fast and robust retrieval of speaker related information is achieved. This was the starting point for the fusion of the basic techniques of speaker identification and speech recognition into one statistical model. Speaker specific codebooks can be used for speech recognition and speaker identification, leading to a compact representation of speech and speaker characteristics. One important property, especially for embedded devices, is the moderate complexity: multiple recognition steps are not required to generate a transcription and to estimate the speaker identity. The speech recognizer is enabled to initialize and continuously adapt speaker specific models.

All components comprising speaker identification, speech recognition and speaker adaptation were integrated into a speech controlled system. A completely unsupervised system was developed. The experiments carried out showed the gain of this approach compared to strictly speaker independent implementations and to speaker adaptation based on a few utterances without speaker identification. An optimum of 94.64 % identification rate could be observed. In the experiments speech recognition accuracy could be increased by this technique to 88.20 % WA. The speaker independent baseline only achieved 85.23 % WA, so the error rate of the speech recognizer could be reduced by 25 % compared to the speaker independent baseline. With respect to the implementation of short-term adaptation a relative error rate reduction of 20 % was achieved. For comparison, the corresponding upper limit was 88.90 % WA when the speaker is known.

In addition, a reference system was implemented which integrates a standard technique for speaker identification. All implementations obtained lower identification rates but similar speech recognition results in a realistic closed-set scenario. However, it was obviously more complicated to reliably

detect unknown speakers compared to the unified approach of the target system. For both systems unknown speakers appeared to be an unsolved problem because of the unacceptably high error rates of missed and falsely detected out-of-set speakers.

Speaker identification was therefore extended by long-term speaker tracking. Speaker tracking across several utterances was described by HMMs based on speaker specific codebooks considered on an utterance level. The forward-backward algorithm allows a more robust speaker identification since a series of utterances is evaluated. Instead of likelihoods, posterior probabilities can be used to reflect not only the match between model and observed data but also the training situation of each speaker specific model. Both weakly and extensively trained models are treated appropriately due to prior knowledge about the evolution of the system. The experiments for supervised adaptation and the first implementation of integrated speaker identification and speech recognition were repeated. Remarkable identification rates up to 98.59 % could be obtained in the closed-set scenario. Nearly the same speech recognition rate was observed. If an optimal feature normalization can be guaranteed for the current speaker, both the speaker identification and speech recognition can benefit even on an utterance level.

In an additional experiment the number of enrolled speakers was doubled. The identification rate on an utterance level decreased from 94.64 % to 89.56 %. However, the identification rate was raised again by speaker tracking so that 97.89 % of the enrolled speakers were correctly identified. For closed sets a benefit for speech recognition could only be observed when long-term speaker tracking is applied to larger speaker sets. For speech recognition approximately 17 % and 12 % relative error rate reduction compared to the baseline and short-term adaptation were measured.

Finally, the former experiment was considered in an open-set scenario. It is hard to imagine that unknown speakers can be reliably detected without long-term speaker tracking. New codebooks can be initialized without imposing new users to identify themselves. No time-consuming enrollment is necessary. Under favorable conditions an optimum of 97.33 % correctly identified speakers and 88.61 % WA for speech recognition were obtained. Since the WA approached the results of the closed-set implementation in the long run, even higher gains might be expected for more extensive tests. In addition, speech recognition accuracy could be improved for some realizations. Potential for further improvements was seen in an enhanced feature extraction.

It can be concluded that a compact representation of speech and speaker characteristics was achieved which allows simultaneous speech recognition and speaker identification. No structural modifications of the speech controlled system are required, and additional devices such as a beamformer can be easily integrated to support speaker change detection. Furthermore, an additional module for speaker

identification is not required. Only calculations of low complexity have to be performed in real-time since speaker tracking and adaptation are computed after the utterance is finished. Therefore applications in embedded systems become feasible. The approach presented here has potential for further extensions and may be applied in a wide range of scenarios as sketched in the outlook.

8 Outlook

At the end of this book an outlook for prospective applications and extensions of the techniques presented here is discussed. The outlook is intended to emphasize the potential for many applications and shall motivate further research.

Adaptation

One important aspect of the speech controlled system presented in this book is to capture speaker characteristics fast and accurately for robust speaker identification. This task was accomplished by EV adaptation. To obtain optimal performance of EV adaptation, training and test conditions have to be balanced. Especially for applications in varying environments, e.g. in automobiles, re-estimating the eigenvoices may be applied to improve on the eigenspace given by the PCA [Nguyen et al., 1999]. For example, the subspace approach presented by Zhu et al. [2010] may be a promising candidate. Furthermore, different adaptation schemes may be investigated with respect to joint speaker identification and speech recognition as possible future work. For example, MLLR adaptation, which is widely used in speaker adaptation [Gales and Woodland, 1996; Leggetter and Woodland, 1995b; Ferràs et al., 2007, 2008], may be integrated into the system.

Usability

The discussion was focused on a speech controlled system triggered by a push-to-talk button to simplify speech segmentation. A more convenient human-computer communication may be realized by an unsupervised endpoint detection. Standard VAD techniques based on energy and fundamental frequency are expected to have limited capabilities in adverse environments.

Model-based speech segmentation may be implemented by similar approaches as speaker identification. GMMs of moderate complexity can be used to model speech and non-speech events [Herbig et al., 2008]. A decision logic similar to the implementation of speaker specific speech recognition may allow a simple decision whether speech is present. When a speech pause is detected, speech decoding can be rejected. Alternatively, this problem may be solved by an additional specialized codebook of the speech recognizer to represent different background scenarios during speech pauses. More sophisticated solutions comprising a Viterbi time alignment of all speaker models, similar to decoder-guided audio segmentation, are also possible. In the literature similar techniques exist which might be integrated [Ittycheriah and Mammone, 1999; Ljolje and Goffin, 2003; Lee et al., 2004; Oonishi et al., 2008; Tsai et al., 2010]. Barge-in functionality can be incorporated into speech segmentation to allow users to interrupt speech prompts. Crosstalk detection may be an additional challenge [Song et al., 2010; Wrigley et al., 2005]. A more natural and efficient control of automated human-computer interfaces should be targeted.

Development of the Unified Modeling of Speech and Speaker-Related Characteristics

The solution presented in this book is developed on a unified modeling of speech and speaker characteristics. The speech recognizer's codebooks are considered as common GMMs with uniform weights. Since an estimate for the state sequence is accessible, the restriction of equal weights can be avoided so that speaker identification can apply state dependent GMMs for each frame, which may result in a better statistical representation. Since the likelihood values are buffered until the speech recognition result is available, a refined likelihood becomes feasible. Furthermore, the phoneme alignment can be used not only to support speaker adaptation but also speaker identification in a similar way. For instance, some phoneme groups are well suited for speaker identification whereas others do not significantly contribute to speaker discrimination, as outlined earlier. In combination with a phoneme alignment, emphasis can be placed on vowels and nasals, and the influence of fricatives and plosives on speaker identification can be reduced. Together this would result in a closer relationship between speech and speaker related information.

Enhancing Speaker Identification and Speech Recognition

Another method to increase the WA and the identification rate is to employ a short supervised enrollment. Significant improvements are expected when the spoken text and the speaker identity are known for a couple of utterances.

The approach presented in this book also allows hybrid solutions. Each user can attend a supervised enrollment of variable duration to initialize a robust speaker profile. The speech recognizer is always run in an open-set mode to protect these profiles against unknown speakers. If a new speaker operates the device for the first time and decides not to attend an enrollment, a new profile can be temporarily initialized and applied subsequently. Even though no essential improvement is expected for the WA, an increase in the speaker identification rate might be achieved when both identification techniques are combined.

Mobile Applications

Even though only in-car applications were evaluated in this book, mobile applications, e.g. on a Personal Digital Assistant (PDA), can be considered as well. If a self-learning speaker identification is used for speech controlled mobile devices, one important aspect should be the problem of environmental variations. The focus should be set here on the variety of background conditions, e.g. in-car, office, public places and home. All scenarios have to be covered appropriately.

The front-end of the system presented in this book reduces background noises and compensates for channel characteristics by noise reduction and cepstral mean subtraction. However, artifacts are still present and are learned by speaker adaptation. Especially for mobile applications, this can be undesirable. In the optimal case only speaker characteristics are incorporated into the statistical modeling whereas effects from the acoustic background should be suppressed.

The identification of the current background scenario might help to apply more sophisticated compensation and speaker adaptation techniques. A simple realization would be to use several codebooks for each speaker and environment, as sketched below. Alternatively, feature vector normalization or enhancement may be combined with speaker adaptation. Since eigenvoices achieved good results for speaker adaptation, compensation on a feature vector level implemented by the eigen-environment approach could be a promising candidate, and the concepts of the speaker adaptation scheme and long-term speaker tracking presented here might be applied to the reference speaker identification. The latter approach seems to be preferable since only one speaker specific codebook is needed and different training levels of domain specific codebooks are circumvented. In addition, combining feature vector enhancement and adaptation may be advantageous when speaker characteristics and environmental influences can be separated.
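A minimal sketch of the codebook selection mentioned above, assuming diagonal-covariance GMM codebooks with purely illustrative parameters: the log-likelihood of the observed feature vectors is accumulated for every (speaker, environment) pair and the best match is picked.

    import numpy as np

    def diag_gmm_loglik(frames, weights, means, variances):
        # Total log-likelihood of all frames under a diagonal-covariance GMM.
        per_comp = (np.log(weights)[:, None]
                    - 0.5 * (np.log(2 * np.pi * variances).sum(axis=1)[:, None]
                             + (((frames[None, :, :] - means[:, None, :]) ** 2)
                                / variances[:, None, :]).sum(axis=2)))
        m = per_comp.max(axis=0)
        return float((m + np.log(np.exp(per_comp - m).sum(axis=0))).sum())

    def select_profile(frames, codebooks):
        # codebooks maps (speaker, environment) -> (weights, means, variances).
        scores = {key: diag_gmm_loglik(frames, *gmm)
                  for key, gmm in codebooks.items()}
        return max(scores, key=scores.get), scores

    rng = np.random.default_rng(3)
    frames = rng.standard_normal((50, 2)) + np.array([1.0, -0.5])
    codebooks = {
        ("speaker_A", "car"): (np.array([1.0]), np.array([[1.0, -0.5]]),
                               np.ones((1, 2))),
        ("speaker_A", "office"): (np.array([1.0]), np.array([[3.0, 2.0]]),
                                  np.ones((1, 2))),
        ("speaker_B", "car"): (np.array([1.0]), np.array([[-2.0, 0.0]]),
                               np.ones((1, 2))),
    }
    print(select_profile(frames, codebooks)[0])   # ('speaker_A', 'car')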

Interaction

The use cases so far have involved identifying persons only by external means, e.g. by a supervised enrollment [Herbig et al., 2010c] or by an enrolling phase [Herbig et al., 2010d,e], or such as here where no external information was used [Herbig et al., 2011]. However, a wide range of human-computer communication may be of interest for realistic applications: A dialog engine may give information about the temporal or semantic coherence of a user interaction. During an ongoing dialog a user change is not expected. This soft information could be integrated to improve the robustness of speaker tracking.

An acoustic preprocessing engine, e.g. in an infotainment or in-car communication system [Schmidt and Haulick, 2006], may use beamforming on a microphone array which can track the direction of the sound signal. For example, driver and co-driver may enter voice commands in turn. Visual information may also be employed [Maragos et al., 2008]. In addition, devices which are usually registered to a particular user, e.g. car keys, mobile phones or PDAs, can support speaker identification.

Feedback of the speaker identity allows special habits and the experience of the user about the speech controlled device to be taken into consideration. Whereas beginners can be offered a comprehensive introduction or help concerning the handling of the device, advanced users can be shown how to use the device more efficiently.

In summary, a flexible framework has been developed which shows a high potential for future applications and research.
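As a closing illustration of the acoustic preprocessing mentioned above, a delay-and-sum beamformer may be sketched as follows. The array geometry, sampling rate and steering angles are hypothetical, the per-channel delays are applied as FFT phase shifts, and this is a rough sketch rather than a production front-end.

    import numpy as np

    def delay_and_sum(mics, fs, mic_positions, angle_deg, c=343.0):
        # Steer a linear array toward angle_deg by delaying and averaging
        # the channels. mics has shape (n_channels, n_samples); the delay
        # of each channel follows from its position along the array axis.
        n_ch, n = mics.shape
        delays = mic_positions * np.sin(np.deg2rad(angle_deg)) / c
        freqs = np.fft.rfftfreq(n, d=1.0 / fs)
        out = np.zeros(n)
        for ch in range(n_ch):
            spec = np.fft.rfft(mics[ch])
            spec *= np.exp(2j * np.pi * freqs * delays[ch])  # shift channel
            out += np.fft.irfft(spec, n)
        return out / n_ch

    fs = 16000
    mics = np.random.default_rng(2).standard_normal((4, fs))  # placeholder
    pos = np.arange(4) * 0.05                                 # 5 cm spacing
    driver = delay_and_sum(mics, fs, pos, angle_deg=-30.0)
    codriver = delay_and_sum(mics, fs, pos, angle_deg=30.0)

Comparing the output energy for the two steering directions would give a rough cue from which seat the current command originates.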

A Appendix

A.1 Expectation Maximization Algorithm

The Expectation Maximization (EM) algorithm [Dempster et al., 1977] is a fundamental approach for mathematical optimization problems comprising a hidden or latent variable. In this section the meaning of the latent variable is explained for GMM training, which was introduced in Sect. 2.4. Then the two steps of the EM algorithm are described.

An optimal statistical representation of a training data set x_{1:T} is targeted. A statistical model with parameter set \Theta_{ML} characterized by

    \Theta_{ML} = \arg\max_{\Theta} \{ p(x_{1:T} \mid \Theta) \}   (A.1)

has to be determined. An iterative procedure is required due to the latent variable.

Latent variables can be intuitively explained when a GMM is considered as a random generator. For convenience, GMMs can be explained by a combination of two random processes as given by (2.35). The first process is used to select the Gaussian density with index k according to the prior probability p(k \mid \Theta). A second random generator produces a Gaussian distributed random vector x_t characterized by the probability density function p(x_t \mid k, \Theta). By repetition of these two steps, a sequence of feature vectors is produced. Thus, the index k denotes the latent variable of GMM modeling.

GMM training or adaptation encounters the inverse problem since only the measured feature vectors x_{1:T} are available. The assignment of the feature vector x_t to the corresponding Gaussian density p(x_t \mid k, \Theta) given by the first random generator is hidden for an observer. The information which Gaussian density generated an observed feature vector is missing. Feature vectors x are subsequently called incomplete data since only the observations of the second random process are accessible. In contrast, complete data (x, k) comprise feature vector and latent variable. Therefore, training and adaptation algorithms require knowledge or at least an estimate concerning the corresponding Gaussian density.
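The two random processes can be mirrored in a few lines of Python; the parameters below are arbitrary example values and the sketch is purely didactic. The first generator draws the hidden index k, the second draws x_t from the selected Gaussian, and an observer of the training data sees only x_t.

    import numpy as np

    rng = np.random.default_rng(1)

    # A small GMM with N = 3 diagonal-covariance components (parameters Theta).
    weights = np.array([0.5, 0.3, 0.2])
    means = np.array([[-2.0, 0.0], [1.0, 1.0], [4.0, -1.0]])
    stddevs = np.array([[0.5, 0.7], [0.8, 0.4], [0.6, 0.6]])

    def draw(T):
        # First process: select the component index k with probability
        # p(k | Theta).
        ks = rng.choice(len(weights), size=T, p=weights)
        # Second process: draw x_t from the selected Gaussian density.
        xs = means[ks] + stddevs[ks] * rng.standard_normal((T, 2))
        return xs, ks

    x, k = draw(5)
    print(x)   # training only ever sees x (incomplete data); k stays hidden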

An iterative procedure is required since the assignment of the feature vectors x_t to a particular Gaussian density may be different for an updated parameter set \hat{\Theta} compared to the preceding parameter set \bar{\Theta}. Several iterations of the EM algorithm are performed until an optimum of the likelihood p(x_{1:T} \mid \Theta) is obtained. However, local and global maxima cannot be distinguished.

Therefore, the likelihood of complete data p(x_t, k \mid \Theta) instead of incomplete data p(x_t \mid \Theta) is employed in the EM algorithm. To simplify the following considerations, the iid assumption is used. The problem of parameter optimization is solved in two steps:

• The E-step computes an estimate of the missing index k with the help of the posterior probability p(k \mid x_t, \bar{\Theta}) based on initial parameters or the parameter set of the last iteration \bar{\Theta}. This posterior probability represents the responsibility of a Gaussian density for the production of the observation x_t and controls the effect on the training or adaptation of a particular Gaussian density.

• The M-step maximizes the likelihood function by calculating a new set of model parameters \hat{\Theta} based on complete data given by the previous E-step.

E-step and M-step can be summarized for a sequence of training or adaptation data x_{1:T} in a compact mathematical description which is known as the auxiliary function

    Q_{ML}(\Theta, \bar{\Theta}) = E_{k_{1:T}} \{ \log(p(x_{1:T}, k_{1:T} \mid \Theta)) \mid x_{1:T}, \bar{\Theta} \}.   (A.2)

It can be calculated by

    Q_{ML}(\Theta, \bar{\Theta}) = \sum_{t=1}^{T} \sum_{k=1}^{N} p(k \mid x_t, \bar{\Theta}) \cdot \log(p(x_t, k \mid \Theta))   (A.3)

where the new parameter set \hat{\Theta} is characterized by

    \frac{d}{d\Theta} Q_{ML}(\Theta, \bar{\Theta}) \Big|_{\Theta = \hat{\Theta}} = 0.   (A.4)

For limited training data or in an adaptation scenario this might be inadequate. The risk of over-fitting arises since insignificant properties of the observed data are learned instead of general statistical patterns. The problem of over-fitting can be decreased when equation (A.3) is extended by prior knowledge p(\Theta) about the parameter set \Theta. The corresponding auxiliary function is then given by

    Q_{MAP}(\Theta, \bar{\Theta}) = Q_{ML}(\Theta, \bar{\Theta}) + \log(p(\Theta)).   (A.5)

For further reading on the EM algorithm, GMM and HMM training the reader is referred to Bishop [2007], Dempster et al. [1977] and Rabiner and Juang [1993].
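A compact sketch of one EM iteration for a diagonal-covariance GMM may look as follows. This is illustrative NumPy code, not the implementation underlying this book, and the MAP extension of (A.5) is omitted here.

    import numpy as np

    def em_step(x, weights, means, variances):
        # One EM iteration for a diagonal-covariance GMM; x has shape (T, D).
        # E-step: log p(x_t, k) for every component k (rows), frame t (cols).
        log_joint = (np.log(weights)[:, None]
                     - 0.5 * (np.log(2 * np.pi * variances).sum(axis=1)[:, None]
                              + (((x[None] - means[:, None]) ** 2)
                                 / variances[:, None]).sum(axis=2)))
        log_post = log_joint - np.logaddexp.reduce(log_joint, axis=0)
        post = np.exp(log_post)          # responsibilities p(k|x_t, Theta_bar)
        n_k = post.sum(axis=1)           # soft counts per component
        # M-step: closed-form maximization of the auxiliary function Q_ML.
        new_weights = n_k / len(x)
        new_means = (post @ x) / n_k[:, None]
        new_vars = (post @ (x ** 2)) / n_k[:, None] - new_means ** 2
        return new_weights, new_means, np.maximum(new_vars, 1e-6)

Iterating em_step until the likelihood stops improving realizes the procedure described above; a MAP variant would add the derivative of log p(\Theta) in the M-step.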

A.2 Bayesian Adaptation

In this section codebook adaptation is exemplified by adapting the mean vectors in the Bayesian framework. The goal is an improved statistical representation of speech characteristics and speaker specific pronunciation for enhanced speech recognition. It is assumed that a set of speaker specific training data x_{1:T} is accessible.

A two-stage procedure is applied similar to the segmental MAP algorithm introduced in Sect. 2.6. Speech decoding provides an optimal state sequence s and determines the state dependent weights w_k. Then the codebook of the current speaker i is adapted. Only one iteration of the EM algorithm is calculated. For convenience, the subsequent notation neglects the optimal state sequence to obtain a compact representation.

In the following derivation, the auxiliary function of the EM algorithm

    Q_{MAP}(\Theta_i, \Theta_0) = Q_{ML}(\Theta_i, \Theta_0) + \log(p(\Theta_i))   (A.6)
    = \sum_{t=1}^{T} \sum_{k=1}^{N} p(k \mid x_t, \Theta_0) \cdot \log(p(x_t, k \mid \Theta_i)) + \log(p(\Theta_i))   (A.7)

corresponds to (A.3) extended by a term comprising prior knowledge as given in (A.5). The initial parameter set

    \Theta_0 = \{ w_1^0, \ldots, w_N^0, \mu_1^0, \ldots, \mu_N^0, \Sigma_1^0, \ldots, \Sigma_N^0 \}   (A.8)

is given by the standard codebook. The following notation is used for the speaker specific codebooks:

    \Theta_i = \{ w_1^0, \ldots, w_N^0, \mu_1^i, \ldots, \mu_N^i, \Sigma_1^0, \ldots, \Sigma_N^0 \}.   (A.9)

Since only mean vectors are adapted, the weights and covariance matrices are given by the standard codebook. For reasons of simplicity, the speaker index i is omitted subsequently; \mu_k and \Sigma_k denote the speaker specific mean vectors to be optimized and the covariance matrices of the standard codebook.

For the following equations it is assumed that each Gaussian distribution can be treated independently from the remaining distributions. Thus, the prior distribution of the speaker specific mean vectors can be factorized or, equivalently, the logarithm is given by a sum of logarithms. A prior Gaussian distribution is assumed:

    \log(p(\mu_1, \ldots, \mu_N)) = \sum_{k=1}^{N} \log(\mathcal{N}\{\mu_k \mid \mu_k^{pr}, \Sigma_k^{pr}\})   (A.10)

Expanding the logarithm of the Gaussian prior yields

    \log(p(\mu_1, \ldots, \mu_N)) = -\frac{1}{2} \sum_{k=1}^{N} (\mu_k - \mu_k^{pr})^T \cdot (\Sigma_k^{pr})^{-1} \cdot (\mu_k - \mu_k^{pr}) - \frac{d}{2} \sum_{k=1}^{N} \log(2\pi) - \frac{1}{2} \sum_{k=1}^{N} \log(|\Sigma_k^{pr}|)   (A.11)

where d denotes the dimension of the feature vectors and the covariance matrix \Sigma_k^{pr} represents the uncertainty of the adaptation. For example, \mu_k^{pr} may be given by the mean vectors of the standard codebook.

Q_{MAP} is maximized by taking the derivative with respect to the mean vector \mu_k and by calculating the corresponding roots. Subsequently, equation (A.11) is inserted into (A.7) and the derivative

    \frac{d}{d\mu_k} Q_{MAP}(\Theta, \Theta_0) = \sum_{t=1}^{T} p(k \mid x_t, \Theta_0) \cdot \frac{d}{d\mu_k} \log(p(x_t, k \mid \Theta)) + \frac{d}{d\mu_k} \log(p(\Theta))   (A.12)

is calculated for a particular Gaussian density k. Since the weights are given by the standard codebook, the derivative

    \frac{d}{d\mu_k} \log(p(\Theta)) = \frac{d}{d\mu_k} \log(p(\mu_1, \ldots, \mu_N))   (A.13)

removes the constant terms in (A.11) comprising normalization and the determinant of the covariance matrix. Only the derivative of the squared Mahalanobis distance

    \bar{d}_k^{Mahal} = (\mu_k - \mu_k^{pr})^T \cdot (\Sigma_k^{pr})^{-1} \cdot (\mu_k - \mu_k^{pr})   (A.14)

is retained in

    \frac{d}{d\mu_k} \log(p(\mu_1, \ldots, \mu_N)) = -\frac{1}{2} \cdot \frac{d}{d\mu_k} \bar{d}_k^{Mahal}   (A.15)
    = -(\Sigma_k^{pr})^{-1} \cdot (\mu_k - \mu_k^{pr}).   (A.16)

According to (2.113), the derivative

    \frac{d}{d\mu_k} Q_{ML}(\Theta, \Theta_0) = \sum_{t=1}^{T} p(k \mid x_t, \Theta_0) \cdot \frac{d}{d\mu_k} \log(\mathcal{N}\{x_t \mid \mu_k, \Sigma_k\})   (A.17)
    = -\frac{1}{2} \sum_{t=1}^{T} p(k \mid x_t, \Theta_0) \cdot \frac{d}{d\mu_k} d_k^{Mahal}   (A.18)

remains to be evaluated.

A. (A.. k (A.2 Bayesian Adaptation 157 only depends on the squared Mahalanobis distance dMahal = (xt − μk ) · Σ−1 · (xt − μk ) k k T (A. Θ0 ). t=1 (A.20) and (A. Θ0 ) · Σ−1 · (xt − μk ) .20) By using (A. Θ0 ) = dμk − T p(k|xt . Θ0 ) · xt − nk · μk t=1 (A. Θ0 ) = dμk T t=1 p(k|xt .12) a compact notation of the optimization problem can be obtained: d QMAP (Θ.22) (Σpr )−1 k · (μk − μpr ) k nk = t=1 p(k|xt . Θ0 ) · Σ−1 · (xt − μk ) k · (μk − μpr ) . 2001].21) The mean vector μk and the covariance matrix Σ−1 of the standard codebook k are time invariant and can be placed in front of the sum d QMAP (Θ.25) .19) and the posterior probability p(k|xt . Θ0 ).23) The normalized sum over all observed feature vectors weighted by the posterior probability is defined in (2.16) in (A. Θ0 ) = Σ−1 · k dμk − where T T p(k|xt . Θ0 ) dμk is solved by nk · Σ−1 · μML − μopt = (Σpr )−1 · μopt − μpr k k k k k k which is known as Bayesian adaptation [Duda et al. k t=1 (Σpr )−1 k (A.26) =0 μk =μopt k 1 nk T p(k|xt . Θ0 ) · xt .43) as the ML estimate μML = k The optimization problem d QMAP (Θ. The derivative of QML is given by d QML (Θ. (A.24) (A.

A.3 Evaluation Measures

Speech recognition is evaluated in this book by the so-called Word Accuracy (WA) which is widely accepted in the speech recognition literature [Boros et al., 1996]. The recognized word sequence is compared with a reference string to determine the word accuracy

    WA = 100 \cdot \Big( 1 - \frac{W_S + W_I + W_D}{W} \Big) \, \%.   (A.27)

W represents the number of words in the reference string and W_S, W_I and W_D denote the number of substituted, inserted and deleted words in the recognized string [Boros et al., 1996]. Alternatively, the Word Error Rate (WER) defined by

    WER = 100 \cdot \frac{W_S + W_I + W_D}{W} \, \%   (A.28)

can be employed [Bisani and Ney, 2004].

Speaker identification is evaluated by the identification rate \zeta which is approximated by the ratio

    \hat{\zeta} = \frac{1}{N_u} \sum_{u=1}^{N_u} \delta_K(i_u^{MAP}, i_u)   (A.29)

of correctly assigned utterances and the total number of utterances, where \delta_K denotes the Kronecker delta. Subsequently, only those utterances are investigated which lead to a speech recognition result.

In this book, a confidence interval is given for identification rates¹ to reflect the statistical uncertainty. The uncertainty is caused by the limited number of utterances incorporated into the estimation of the identification rate. Therefore, the interval is determined in which the true identification rate \zeta has to be expected with a predetermined probability 1 - \alpha given an experimental mean estimate \hat{\zeta} based on N_u utterances:

    p\Big( z_{1-\frac{\alpha}{2}} < \frac{(\hat{\zeta} - \zeta) \cdot \sqrt{N_u}}{\hat{\sigma}} \le z_{\frac{\alpha}{2}} \Big) = 1 - \alpha.   (A.30)

z is the cut-off of the normalized random variable and \hat{\sigma} denotes the standard deviation of the applied estimator. Subsequently, the Gaussian approximation is used to compute the confidence interval

    \hat{\zeta} - \frac{\hat{\sigma} \cdot t_{\frac{\alpha}{2}}}{\sqrt{N_u}} < \zeta \le \hat{\zeta} + \frac{\hat{\sigma} \cdot t_{\frac{\alpha}{2}}}{\sqrt{N_u}}   (A.31)

of a Binomial distribution as found by Kerekes [2008].

¹ Even though the WA is a rate and not a probability [Bisani and Ney, 2004], equation (A.31) is equally employed in this book to indicate the reliability of the speech recognition rate.
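For illustration, the word accuracy (A.27) and the Gaussian confidence interval (A.31) can be computed as follows. This is a plain Python sketch; the word error count is obtained from a standard Levenshtein alignment, and the Binomial standard deviation \hat{\sigma}^2 = \hat{\zeta}(1 - \hat{\zeta}) discussed below is assumed.

    import math

    def word_errors(ref, hyp):
        # Minimal number of substitutions, insertions and deletions that
        # turn the reference word list into the hypothesis (Levenshtein).
        d = list(range(len(hyp) + 1))
        for i in range(1, len(ref) + 1):
            prev_diag, d[0] = d[0], i
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                prev_diag, d[j] = d[j], min(d[j] + 1,           # deletion
                                            d[j - 1] + 1,       # insertion
                                            prev_diag + cost)   # substitution
        return d[-1]

    def word_accuracy(ref, hyp):
        return 100.0 * (1.0 - word_errors(ref, hyp) / len(ref))

    def confidence_interval(zeta_hat, n_u, t_alpha=1.96):
        # Gaussian approximation (A.31) with sigma^2 = zeta * (1 - zeta).
        half = t_alpha * math.sqrt(zeta_hat * (1.0 - zeta_hat) / n_u)
        return zeta_hat - half, zeta_hat + half

    ref = "turn on the radio".split()
    hyp = "turn off radio".split()
    print(word_accuracy(ref, hyp))          # 1 substitution + 1 deletion: 50.0
    print(confidence_interval(0.95, 200))   # interval for identification rate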

The standard deviation \hat{\sigma} is given by \hat{\sigma}^2 = \hat{\zeta} \cdot (1 - \hat{\zeta}) and t_{\frac{\alpha}{2}} is characterized by the Student distribution. In this context \alpha = 0.05 and t_{\frac{\alpha}{2}} = 1.96 are used. For example, the typical error of the results shown in Table 4.1 is about ±0.2 % for independent datasets and about ±0.03 % for this specific evaluation. In the figures of this book confidence intervals are given by a gray shading.

Frequently, several realizations of an automated speech recognizer are evaluated and compared for certain values of a tuning parameter. Those tests are usually performed on identical data sets to avoid uncertainties caused by independent data. In this case paired difference tests are appropriate to determine whether a significant difference between two experiments can be assumed for a particular data set [Bisani and Ney, 2004; Bortz, 2005]. In this book only the typical errors of independent datasets are given for reasons of simplicity.

The problem of speaker change detection is characterized by a binary decision problem. The following two error scenarios determine the performance of the binary classifier:

    p(H_1 \mid i_{u-1} = i_u)      false alarm,
    p(H_0 \mid i_{u-1} \neq i_u)   missed speaker change.

H_0 and H_1 denote the hypotheses of no speaker change and a speaker change, respectively. The optimal classifier is given in (2.14). Alternatively, the accuracy of a binary classifier can be graphically represented for varying thresholds by the Receiver Operating Characteristics (ROC) curve². The true positives (which are complementary to the miss rate) are plotted versus the false alarm rate independently from the prior probability of a speaker change. For details on confidence measures and ROC curves it is referred to Kerekes [2008] and Macskassy and Provost [2004].

² Detection Error Trade-off (DET) curves can be equivalently used to graphically represent both error types of a binary classifier [Martin et al., 1997].
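How such an ROC curve may be computed from speaker change scores is sketched below; the scores and labels are synthetic and the code is merely illustrative. A DET curve would plot the miss rate, i.e. one minus the detection rate, instead.

    import numpy as np

    def roc_points(scores, labels):
        # False alarm and detection rates for all thresholds on a change
        # score. scores: higher values indicate 'speaker change' (H1);
        # labels: 1 for an actual speaker change, 0 otherwise.
        order = np.argsort(-scores)          # sweep threshold high to low
        labels = np.asarray(labels)[order]
        tp = np.cumsum(labels == 1)          # correctly detected changes
        fp = np.cumsum(labels == 0)          # false alarms
        detection = tp / max((labels == 1).sum(), 1)
        false_alarm = fp / max((labels == 0).sum(), 1)
        return false_alarm, detection

    rng = np.random.default_rng(4)
    scores = np.concatenate([rng.normal(1.0, 1.0, 30),    # true changes
                             rng.normal(-1.0, 1.0, 70)])  # no change
    labels = np.array([1] * 30 + [0] * 70)
    fa, det = roc_points(scores, labels)
    print(fa[:5], det[:5])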


and Signal Processing.: Pattern Recognition and Machine Learning. 649–651 (2004) Angkititrakul. J. Speech. 2. and Language Processing 15(2). ICASSP 2003. In: IEEE International Conference on Acoustics. pp.: Second-order statistical measures for text-independent speaker identification. Speech. I. 409–412 (2004) Bishop.: Towards understanding spontaneous speech: word accuracy vs. Springer..: Feature and score normalization for speaker verification of cellular data.. Springer. Signals and Communication Technology.. vol. Mathan.. 177– 192 (1995) Bimbot. Springer. Gorz. Mathan.. Ney. and Signal Processing.. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp.. Oxford University Press. ICASSP 2001.. SIGIR 2003. F. IEEE Signal Processing Letters 11(8)..: Statistik: F¨r Human.. H. H. Makino. Heidelberg (2005) Bennett. Gauvain. W. Magrin-Chagnolleau.: Neural Networks for Pattern Recognition. Springeru Lehrbuch. concept accuracy.M.: Using asymmetric distributions to improve text classifier probability estimates. 169–172 (1993) Bisani.: Anisotropic MAP defined by eigenvoices for large vocabulary continuous speech recognition. IEEE Transactions on Audio. Bourlard.. C. 1st edn. 6th edn. F. New York (2007) Boros. Speech.. M. pp. I. 111–118 (2003) Bimbot. C. 1009–1012 (1996) Bortz. J. Chen.: Robust speaker change detection. 1. McCowan. L. In: IEEE International Conference on Acoustics.H. pp.: Discriminative in-set/out-of-set speaker recognition. H. Speech Communication 17(1-2). Niemann.. Hanrieder. S. pp.L. Eckert. Oxford (1996) Bishop. H. 1. vol. (eds.): Speech Enhancement.-L.M. vol.: Bootstrap estimates for confidence intervals in ASR performance evaluation.: Text-free speaker recognition using an arithmetic-harmonic sphericity measure. concept accuracy vs. G. In: EUROSPEECH 1993. ICSLP 1996. J. L. pp..References Ajmera. 2. Speech.und Sozialwissenschaftler.N. vol. 49–52 (2003) Benesty. P. Gallwitz. P.. M. J. ICASSP 2004. 353–356 (2001) . 1st edn. Hansen. C. In: International Conference on Spoken Language Processing. Heidelberg (2005) (in German) Botterweck. G. In: IEEE International Conference on Acoustics. and Signal Processing. 498–508 (2007) Barras. J. J. F.

. Saz. Fast Algorithms and Integer Approximations. Academic Press.) Multimodal Processing and Interaction: Audio. Yuk. P.: Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging. Multimedia Systems and Applications. Text. V. Musiol.. Lleida. Speech and Language Processing 15(3). Verlag Harri Deutsch. Gopalakrishnan.R. 1st edn. European Patent EP0586996 (1994) Cohen. In: Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop. G. A. Gros.: Taschenbuch der Mathematik. New York (2008) . 12–15 (2002) Davis. Q.. S..: Noise estimation by minima controlled recursive. 127–132 (1998) Cheng. Gros. IEEE Transactions on Speech and Audio Processing 11(5). Wang. B.. 357–366 (1980) Delakis. F. 345–349 (1994) Che.. pp. I. F. Rosas. pp. Villalba. A. I.N. P. C. Yip. J.: Cepstral vector normalization based on stereo data for robust speech recognition. J. vol. Saz. J. ICASSP 1996.S. IEEE Transactions on Speech and Audio Processing 2(2). 91–109. Gravier. G. Proceedings of the IEEE 85(9).: Speaker. L.based continuous speech recognizer. In: IEEE International Conference on Acoustics...-M. Speech. P. Springer. S. A. H.. O. Lleida..-S.A. Miguel. IEEE Transactions on Acoustics. Semendjajew. IEEE Transactions on Audio.: An HMM approach to text-prompted speaker verification. Rao.a tutorial. Speech. Ortega.-S. K. 803–806 (1993) Class.: Stochastic models for multimodal video analysis. P.. P. In: EUROSPEECH 1993. Frankfurt am Main (2000) (in German) ´ Buera. D. Muehlig. A. O. E.... A.B. Regel-Brietzmann. 945–948 (2003) Class. vol. U. P. F. Kaltenmeier. London (2006) Bronstein... A.: Speaker verification and identification using phoneme dependent multienvironment models based linear normalization in adverse and dynamic acoustic environments. Kaltenmeier.: Optimization of an HMM . IEEE Signal Processing Letters 9(1). K. O...P. In: EUROSPEECH 2003.. Miguel. 1437–1462 (1997) Capp´.: A sequential metric-based audio segmentation method via the Bayesian information criterion. 33. P. US Patent Application 2003/0187645 A1 (2003) Class.. A...: Speech recognition method with adaptation of the speech characteristics. Lin. Kaltenmeier.: Automatic detection of change in speaker in speaker adaptive speech recognition systems.: Discrete Cosine and Sine Transforms: General Properties.. A. E.. pp. Haiber. Potamianos..162 References Britanak. Mermelstein. S. In: Summer School for Advanced Studies on Biometrics for Secure Authentication: Multimodality and System Integration (2005) Campbell. pp. 2. 673–676 (1996) Chen. (eds. In: Maragos. environment and channel change detection and clustering via the Bayesian information criterion.. pp.. P.: Speaker recognition . and Signal Processing. H..S.D. Video..: Elimination of the musical noise phenomenon with the Ephraim and e Malah noise suppressor. Regel-Brietzmann. Berdugo. L. Ortega. M.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences... 1098–1113 (2007) Buera. and Signal Processing 28(4). 466–475 (2003) Cohen. I.

Mason. Teubner.A. S. 1139–1143 (1995) Droppo. P. R¨hl. 139–144 (2010) Faltlhauser. pp. pp. Malah. ICASSP 1994.: Mean and variance adaptation within the MLLR framework.. Acero. and Signal Processing.. Speech. T. 21–24 (2008) Fink. Ariyaeeinia. pp.O. pp.: Maximum mutual information SPLICE transform for seen and unseen conditions. In: EUROSPEECH 1995.: Open-set speaker identification using adapted Gaussian mixture models. A. D. Laird.: Introduction to finite element methods (2004) Ferr`s.F.References 163 Dempster.. 651–654 (1995) Gauvain. Journal of the Royal Statistical Society Series B 39(1).: Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. In: EUROSPEECH 2001.M. Y. J. pp.: Experiments with speaker verification over the telephone.. Ruske. C.. In: IEEE Workshop on Spoken Language Technology.: Constrained MLLR for speaker a recognition. 1. pp. C. vol. Odyssey 2008. H. A.. Gauvain. A. M.: Evaluation of the SPLICE algorithm on the Aurora2 database. M.: Speaker adaptation for telephone based speech dialogue u systems.J. SLT 2010. vol.. B. G.. 249–264 (1996) Garau. Miyabe.: Selected topics from 40 years of research in speech and speaker recognition. 291–298 (1994) . Stork.. pp.-L. M. 989–992 (2005) Droppo. C. C. In: IEEE International Conference on Acoustics.. New York (2001) Eatock. Speech.F. In: EUROSPEECH 1995.. Acero. S. Gauvain. Leung.: Improving speaker recognition using phonetically structured Gaussian mixture models. 265–268 (2005) Gauvain. Proceedings of the IEEE 61(3). and Signal Processing ASSP-32(6). 1997–2000 (2005) Furui. Renals. ICASSP 2007. Lee. A.. Sagayama. 133–136 (1994) Ephraim. In: EUROSPEECH 2001. 1109–1121 (1984) Espi. In: INTERSPEECH 2005... G. Computer Speech and Language 10. J. 2nd edn.: Pattern Classification.-L.P.. Sivakumaran. J. L. G.D... S. N. Speech. In: The Speaker and Language Recognition Workshop.C.. J.. M.-L. pp. Prouts. Lamel. Malegaonkar.: A quantitative assessment of the relative speaker discriminating properties of phonemes. S. In: INTERSPEECH 2005. T.G.: Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. G. Hart. 4.A.C.-H.S. Rubin. J. Nishimoto. pp.B.: Applying vocal tract length normalization to meeting recordings.E. Leitf¨den der Informatik. 1–38 (1977) Dobler..-L.. Leung.: Mustererkennung mit Markov-Modellen: Theorie-Praxis-Anwendungsgebiete. A. 1–8 (2009) Gales.. J. G. J. Deng. pp. P. pp. J. Woodland. Wiley Interscience.. R.. B.: Maximum likelihood from incomplete data via the EM algorithm.P.. 751–754 (2001) Fant. In: INTERSPEECH 2005. S. and Signal Processing. C.. R.-W. IEEE Transactions on Acoustics. D.. J. Stuttgart (2003) (in German) a Forney. C. L. Hain. 53–56 (2007) Ferr`s.: Analysis on speech characteristics for robust voice activity detection.: MLLR techniques for speaker a recognition. In: INTERSPEECH 2009. The Hague (1960) Felippa.. IEEE Transactions on Speech and Audio Processing 2(2). P. D. Ono.C. G. N. In: IEEE International Conference on Acoustics. 268–278 (1973) Fortuna. pp. Barras.: Acoustic Theory of Speech Production. Barras. Mouton de Gruyter..: The Viterbi algorithm. 217–220 (2001) Duda..

pp. Mariani.) IWSDS 2010.. pp.. S. 25–35. In: Lee.W. In: Proceedings of the Broadcast News Transcription and Understanding Workshop.: Evaluation of two approaches for speaker specific speech recognition. T...C.. ASRU 1999 (1999) Gish. Woodland. Lee.. P.: Detection of unknown speakers in an unsupervised speech controlled system. Nosratighods. M. Shallom. In: The XI European Signal Processing Conference. F. a In: IEEE International Conference on Acoustics. vol. vol. H. pp. C.: GMM-UBM based open-set online speaker diarization. IEEE Signal Processing Magazine 11(4). Hansen..: Joint speaker identification and speech recognition for speech controlled applications in an automotive environment. Speech.: Investigations on inter-speaker variability in the feature space.. Y. Heidelberg (2008) Harrag. In: 34.. N. T. pp. Bistritz..: LDA combination of pitch and MFCC features in speaker recognition. Nakamura. pp. and Signal Processing. Springer. T. W. G. Bacchiani.: Modellbasierte Sprachsegmentierung. Tuerk. 237–240 (2005) Hautam¨ki. EUSIPCO 2002. ITRE 2006. Ma. ICASSP 1999.: Combined speech and speaker recognition with speaker-adapted connectionist models. pp.. Gerl. In: INTERSPEECH 2010. G.. pp.. pp. O.. Nakamura. I. R. DAGA 2008.G. Wallhoff. B.: Statistische Signale: Grundlagen und Anwendungen. Mohamadi. J. In: INTERSPEECH 2010.: Session variability contrasts in the MARP corpus. Rigoll. In: Lee. A. K.A.: Segment generation and clustering in the HTK broadcast news transcription system. T. Kinnunen. F.: Confidence scores for acoustic model adaptation. Springer.) IWSDS 2010. 3rd edn. Gerl. pp.: Speaker verification using phoneme-adapted Gaussian mixture models. V. F. pp. J. 74–77 (2009) Herbig. Young. J. pp. E. 1473–1476 (2010) Herbig. Heidelberg (2010b) .. Schmidt. Minker.. Cohen. 3. G.F. W. LNCS.G. F. Heidelberg (2010a) Herbig. G. Morgan. J. Speech. Gerl.. 6392. M. K. In: INTERSPEECH 2010. Johnson. In: IEEE Workshop on Automatic Speech Recognition and Understanding. (eds.. H. a Heidelberg (2001) (in German) H¨nsler.. W. J.L. G. a Wiley-IEEE Press. In: International Conference on Information Technology: Research and Education. LNCS. 397–400 (1999) Hain.. Li.H. Jahrestagung f¨r Akustik. Minker. vol.. F. T. In: IEEE Indicon Conference. E.J. D. (eds. pp. M.. In: IEEE International Conference on Acoustics. 133–137 (1998) H¨nsler.. T. Ellis. (eds. Springer.. 4289–4292 (2008) Gutman.. W. NAG-DAGA 2009. A. S. Schmidt. Hoboken (2004) H¨nsler. 18–32 (1994) Godin.. 1. vol...): Speech and Audio Processing in Adverse Environa ments. S. Signals and Communication Technology. 85–88 (2002) H¨b-Umbach. and Signal Processing..: The Lombard effect’s influence on automatic speaker verification systems and methods for its compensation. T. D.. E...: Apa proaching human listener accuracy with modern speaker verification. A. Gerl. D. In: International Conference on Acoustics. Minker. 2330–2333 (2010) Genoud. R. Minker. 36–47.E.. pp. Serignat.: Acoustic Echo and Noise Control: A Practical Approach. Schmidt. 233–237 (2006) Gollan.. ICASSP 2008. 6392. Gaupp. 298–301 (2010) Goldenberg.: Text-independent speaker identification. S..164 References Geiger. 259–260 (2008) (in German) u Herbig. Springer. Mariani..

: Robust Speech Recognition in Embedded Systems and PC Applications. D. 1953–1956 (2004) Kimball. Marasek.speech databases for consumer devices: Database specification and validation. Speech.-H. F. Gerl. W. F. W.: Using VTLN for broadcast news transcription. Springer. K. H.-T. Minker. 2666–2669 (2010e) Herbig. Hain. pp... F. 327–330 (1999) Johnson.T. Gales. Kiessling. Diehl. E. In: EUROSPEECH 1999. B. Minker. SLT 2010.: The influence of acoustics on speech production: A noise-induced stress phenomenon known as the Lombard reflex. 13–22 (1996) Junqua. pp.References 165 Herbig. O. F. K. pp. Kim..C. pp. S. In: IEEE International Conference on Acoustics.. B. pp. Waterman..Automatic segmentation and clustering for determining speaker turns. Dordrecht (2000) Kammeyer. D. pp. F. In: IEEE Workshop on Evolving and Adaptive Intelligent Systems. 2325–2328 (2002) .: Digitale Signalverarbeitung. Heidelberg (2002) Jon. M. 967–970 (1997) Kinnunen. Teubner. ICASSP 2001.-C. J.: Unified techniques for vector quantization and hidden Markov modeling using semi-continuous models. Springer Series in Statistics. N.: Speaker verification with limited enrollment data. T. G.D. Gerl. T. ICSLP 2002.. W. pp.. 251–255 (2008) Kim. Minker. J. 5. 1.J.: Who spoke when? .: EMAP-based speaker adaptation with robust correlation estimation. Gerl. pp. H. van den Heuvel.. T. pp. T. P. Mammone. Grosskopf.. Chien. S.F.-C.J. IEEE Geoscience and Remote Sensing Letters 5(2). Kluwer Academic Publishers. 321–324 (2001) Junqua. Wang. 639–642 (1989) Iskra.. ICASSP 1989.: Evolution of an adaptive unsupervised speech controlled system. pp. Kroschel. 4th edn. In: International Conference on Spoken Language Processing.K.. pp.Y.: Fast adaptation of speech and speaker characteristics for enhanced speech recognition in adverse intelligent environments. vol. In: The 4th International Symposium on Chinese Spoken Language Processing.. LREC 2002.: Receiver operating characteristic curve confidence intervals and regions. 2211–2214 (1999) Jolliffe. IE 2010.. Jack.. Minker. R.: Speaker tracking in an unsupervised speech controlled system. Kim.: A new eigenvoice approach to speaker adaptation.... and Signal Processing. 2nd edn. Speech. J. 163–169 (2011) Huang. Gish. Schmidt. T. A. 329–333 (2002) Ittycheriah. pp. vol. In: IEEE International Conference on Acoustics. In: Proceedings of the Third International Conference on Language Resources and Evaluation..S.. A. 206–210 (2010d) Herbig. vol.: Detecting user speech in barge-in over prompts using speaker identification methods. Woodland. H. In: IEEE Workshop on Spoken Language Technology.: SPEECON . In: EUROSPEECH 1999. Stuttgart (1998) (in German) Kerekes. I. D.. M... and Signal Processing. 1. In: The 6th International Conference on Intelligent Environments. ISCSLP 2004. C. M. EAIS 2011.E.: Principal Component Analysis.A.. J.: Designing a speaker-discriminative adaptive filter bank for speaker recognition. W. Umesh.. Gerl.D. T. pp. 109–112 (2004) Huang.-M.... K. Speech Communication 20(12). In: INTERSPEECH 2004. J.: Simultaneous speech recognition and speaker identification. X. In: EUROSPEECH 1997. 100–105 (2010c) Herbig. In: INTERSPEECH 2010.

pp. pp.A..J.. S. In: IEEE International Conference on Acoustics. ICSLP 1998... 173–176 (2004) Leggetter.. Nguyen. M. K.. ROCAI 2004. Field. ISMIR 2000 (2000) Lu.lic.: Speaker change detection and tracking in real-time news broadcasting analysis. ICASSP 1999. N. L. 110–115 (1995a) Leggetter.166 References Kinnunen. In: 1st International Symposium on Music Information Retrieval. pp. 602–610 (2002) Macskassy. Teubner. R. Computer Speech and Language 9. In: International Conference on Spoken Language Processing. In: 1st Workshop on ROC Analysis in Artificial Intelligence. P.. 61–70 (2004) . Palm.: Spectral Features for Automatic Text-Independent Speaker Recognition. K.C. He. R.. Perronnin.-C. 2. K. Department of Computer Science.. Speech. 33–36 (2001) Kwon. Woodland.: Discriminative training of multi-state barge-in models. Y.: A comparison of human and machine in speaker recognition. IEEE Transactions on Speech and Audio Processing 8(6). Fincke. pp. In: EUROSPEECH 1997. S.. S. 3rd edn. pp. 749–752 (1999) Kuhn. Narayanan. Junqua. R. Saruwatari...: Mel frequency cepstral coefficients for music modeling. Goffin.. 695–707 (2000) Kuhn.: Rapid speaker adaptation in eigenvoice space.: Einf¨hrung in die Statistik. pp. P. J. pp. 67–94 (1996) Kuhn. R.. J.. Junqua. Niedzielski. Nguyen. IEEE Transactions on Communications 28(1)..: Confidence bands for ROC curves: Methods and an empirical study.-C. R.S. Contolini.-C. N.. Zhang.: Time is money: Why very rapid adaptation matters.. T.. 2327–2330 (1997) Ljolje.J. H. Nguyen. In: INTERSPEECH 2004... Finland (2003) Krim. B.: Eigenvoices for speaker adaptation.. H. In: Proceedings of the ARPA Spoken Language Technology Workshop. G. 2537–2540 (2002) Lee.. 1771–1774 (1998) Kuhn. A. C. F. P. R. Ph. In: Adaptation 2001. Provost. Goldwasser.J.M. In: International Conference on Spoken Language Processing. S. M. S. P. u Stuttgart (2000) (in German) Linde. V. L. Junqua. In: IEEE Workshop on Automatic Speech Recognition and Understanding. University of Joensuu.. H.: Speaker change detection using a new weighted distance measure.. S.. Viberg.J. L. pp. S. Shikano. Contolini. 5. Nakamura.-C. Niedzielski. thesis..: Unsupervised speaker indexing using generic models. ASRU 2007.. IEEE Signal Processing Magazine 13(4).. Woodland. P. In: Proceedings of the Tenth ACM International Conference on Multimedia.. 1004–1013 (2005) Kwon. J. vol. 84–95 (1980) Liu. C.: An algorithm for vector quantizer design. and Signal Processing. ICSLP 2002.. Junqua. Boman.. N.: Flexible speaker adaptation using maximum likelihood linear regression. Fincke. A.. Gray.: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models.: Fast speaker adaptation using a priori knowledge. Field. J. A. pp. Wegmann. Buzo. Narayanan. J... B. R. H. K. 171–185 (1995b) Lehn.: Two decades of array signal processing research: the parametric approach. Niedzielski. pp. vol. J. M.C. Nisimura.. G. MULTIMEDIA 2002.: Noise robust real world spoken dialogue system using GMM based rejection of unintended inputs. F. 353–358 (2007) Logan. IEEE Transactions on Speech and Audio Processing 13(5).

. A. T. S. Brand. 3. In: IEEE International Conference on Acoustics. Ariyaeeinia. K. pp. G.. Ordowski. P.J. CCST 2005. A. and Signal Processing. Kamm. ICASSP 2001. Doddington. R. Neto.und Matrizenrechnung.und Integralo rechnung.. Takahashi. 325–328 (2005) Nakagawa..: Maximum likelihood eigenspace and MLLR for speech recognition in noisy environments. In: 39th Annual International Carnahan Conference on Security Technology. Furui.: H¨here Mathematik 1: Differential. M.. R. 1. S. classification and clustering in a broadcast news task. G.. Speech.. 391–394 (1993) Meinedo. vol. (eds. pp. vol. Springer. Video. pp.: Efficient speaker change detection using adapted Gaussian mixture models. In: EUROSPEECH 1999. D. W. X.. Koreman. K. IEEE Signal Processing Magazine 13(5). Speech. M. vol. 1058–1065 (2006) Nguyen. pp.: Text-independent / text-prompted speaker recognition by combining speaker-specific GMM with speaker adapted syllable-based HMM. vol. In: International Conference on Spoken Language Processing. A. T. J. Speech. and Language Processing 15(6). 413–416 (2001) Morris. C. Junqua. J. P. K. Sivakumaran. D. Nocera.: Speech Recognition in Noisy Environments.. Department of Electrical and Computer Engineering.: Audio segmentation. 2. PhD thesis. O. Linares. Speech.S.C.J.. 2.. P.): Multimodal Processing and Interaction: Audio. Vachenauer... P..: A posteriori and a priori transformations for speaker adaptation in large vocabulary speech recognition systems. J. T. 95–100 (2006) Moreno. SRIV 2006. IEEE Transactions on Audio. Zhang. Wu..-F. Ramachandran. Zhang.: MLP trained to separate problem speakers provides improved features for speaker identification.. pp. 1895–1898 (1997) Matrouf. In: IEEE International Conference on Acoustics. A. In: IEEE International Conference on Acoustics. Multimedia Systems and Applications. A. and Signal Processing.. Bellot. 1245–1248 (2001) Matsui. T. pp. Heidelberg (2003) (in German) Meyer.. In: EUROSPEECH 1997.: Frame level likelihood normalization for textindependent speaker identification using Gaussian mixture models. P. IEICE Transactions on Information and Systems E89-D(3). S. H. pp. Kollmeier.. Gros.: Concatenated phoneme models for text-variable speaker recognition. P. ICASSP 2003. ICSLP 1996. Carnegie Mellon University. Wellekens. S.. 1764–1767 (1996) Martin. Bonastre. 1st edn. A. New York (2008) Markov.. pp. ICASSP 1993.: Robust speaker recognition. Pennsylvania (1996) Mori.. P. B. Pittsburgh. Wesker. In: Speech Recognition and Intrinsic Variation. Mertins. 58–71 (1996) Maragos.. Przybocki.: Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition.P.. Springer. Potamianos.. Springer-Lehrbuch. M. 6th edn.. Text.References 167 Malegaonkar. J. Nakagawa. Vektor.. 1859–1869 (2007) Mammone.: A human-machine comparison in speech recognition based on a logatome corpus. Nakagawa. 2519–2522 (1999) . In: EUROSPEECH 2001. and Signal Processing. B.-C. vol. pp..: The DET curve in assessment of detection task performance. 5–8 (2003) Meyberg. 33..M.

K.S. J. pp. T. 257–286 (1989) Ram´ ırez.F. T. B.: Speaker indexing and adaptation using speaker clustering based on statistical model selection. B. Quillen.: The matrix cookbook (2008) Quatieri. R. D. Kroschel. Quatieri... Prentice-Hall. D.: Robust text-independent speaker identification using Gaussian mixture speaker models.: Discrete-Time Speech Signal Processing: Principles and Practice. Campbell. ISSPA 1999. Sridharan. R.: Speech Communications: Human and Machine. A. In: Proceedings of the Fifth International Symposium on Signal Processing and Its Applications. 72–83 (1995) . pp. IEEE Transactions on Speech and Audio Processing 3(1). In: EUROSPEECH 1997.: Fundamentals of Speech Recognition.W.: Large population speaker identification using clean and telephone speech. Kawahara.: VAD-measure-embedded decoder with online model adaptation.A. Kawahara.. 353–356 (2004) Nishida. 1..F. M. 3122–3125 (2010) Oppenheim. vol. J. M. Englewood Cliffs (1993) Rabiner.. 2nd edn. R.. Gleason. IEEE Press. Englewood Cliffs (1975) O’Shaughnessy. Jones. D.: ASR dependent techniques for speaker identification. In: International Conference on Spoken Language Processing.: Beyond cepstra: Exploiting high-level information in speaker recognition. 583–592 (2005) Olsen. G´rriz. pp.J. Juang. Prentice-Hall. I-Tech Education and Publishing.-H. D. J.: Enhancing automatic speaker identification using phoneme clustering and frame based parameter and frame size selection. D. S. B.B..168 References Nishida. vol. 223–229 (2003) Reynolds. Vienna (2007) Reynolds.) Robust Speech Recognition and Understanding. S. Hazen. Schafer.. S.. In: Proceedings of the Workshop on Multimodal User Authentication. 1337– 1340 (2002) Pelecanos.A.. Dunn. Prentice-Hall. T. K. Campbell. T. 1–22.. T. 647–650 (1995) Reynolds. IEEE Transactions on Speech and Audio Processing 13(4). pp. In: INTERSPEECH 2010. In: EUROSPEECH 1995. Carlson. and Signal Processing. Iwano.: Speaker model selection based on the Bayesian information criterion applied to unsupervised speaker indexing..Ø.. Quatieri. L. T.. D. New York (2000) Park.. pp. Pedersen.. T.: Voice activity detection. D. Upper Saddle River (2002) Rabiner. Digital Signal Processing 10(1-3). M. J. Dunn.C.: Speaker identification and verification using Gaussian mixture speaker models.V. M.B.. Proceedings of the IEEE 77(2).. T..M. Speech Communication 17(1-2).C. J.. In: Grimm. pp. Segura. ICSLP 2002.: Digital Signal Processing.: Speaker verification using adapted Gaussian mixture models. Sturim. Fundamentals o and speech recognition system robustness.. Speech. Torres-Carrasquillo. 1375–1378 (1997) Oonishi.R. 2. P. 19–41 (2000) Reynolds. IEEE Signal Processing Letters 2(3). Rose. pp.: Speaker verification based on phonetic decision making. B. 91–108 (1995b) Reynolds. (eds.A.. A.. D. ICASSP 2004.A. In: IEEE International Conference on Acoustics. 633–636 (1999) Petersen. J. Slomka... C.: A tutorial on hidden Markov models and selected applications in speech recognition. 46–48 (1995a) Reynolds.: Text-dependent speaker verification using decoupled and integrated speaker and speech recognizers.. D.A. pp.A. K. Furui. L.

A..M.D.-H.: Entropy-based feature analysis for speech o recognition. J.G.. 2315–2318 (1997) Rozzi.. 2993–2996 (2003) Vary. W. 570–573 (2007) Schmidt. 981–984 (2005) Song. Stuttgart (1998) (in German) Veen. H. W. pp. Teubner.: Signal processing for in-car communication systems. J.-C. Rubio.. 2778–2781 (2010) Stern. Stern. 1307–1326 (2006) Schukat-Talamazzini.-H. In: International Conference on Signal Processing.-S. pp.: Automatic segmentation. Kim. 679–682 (1999) Tsai. J. pp. C. P. L. H¨ge. In: INTERSPEECH 2007. 3. P.M..: Automatic singer identification of popular music recordings via estimation and modeling of solo vocal signal. K.: Joint speaker segmentation. vol. pp.. R.-M.. IEEE Signal Processing Letters 11(5). B. R. 865–868 (1991) Schmalenstroeer. localization and a identification for streaming audio. M.C. H.. M. Signal processing 86(6).: Speaker adaptation using semicontinuous hidden Markov models.: On using Gaussian mixture model for double-talk detection in acoustic echo suppression... pp. pp. O. 2. Stern.-H. K.: Improved speaker segmentation and segments clustering using the Bayesian information criterion.: Digitale Sprachsignalverarbeitung. Lee. 97–99 (1997) Song. P. J. R. Park. B. Y. In: International Conference on Spoken Language Processing. 2. H. G. Rodgers....: Dynamic speaker adaptation for feature-based isolated word recognition.. In: IEEE International Conference on Acoustics..: Beamforming: A versatile approach to spatial filtering. B.: Speaker adaptation in continuous speech recognition via estimation of correlated mean vectors.-I. E. vol. Haulick.: Cepstral domain segmental nonlinear feature transformations for robust speech recognition. E.G. E.-H. In: INTERSPEECH 2005. R. vol.: On the use of acoustic segmentation in speaker identification.J.: Eigen-environment based noise compensation method for robust speech recognition. Speech and Signal Processing 35(6). pp.: Cepstral statistics within phonetic subgroups.M. Hess. pp.J.A. Nguyen. 2959–2962 (2009) Siegler. 517–520 (2004) Setiawan.A.. In: EUROSPEECH 1997. pp. pp. U. R. C. Vieweg. ICSLP 1996.. J. Fingscheidt. In: INTERSPEECH 2010. W. ICSLP 2000. Garc´ n ıa-Mateo.. classification and clustering of broadcast news audio. Junqua. Braunschweig (1995) (in German) ´ Segura. Kang. Ram´ ırez.S. ICASSP 1991.. In: INTERSPEECH 2009.G.S. ICSP 1993. 541–544 (1992) Rodr´ ıguez-Li˜ares. IEEE Transactions on Acoustics.. Schukat-Talamazzini. Lasry. Buckley. T. pp. In: EUROSPEECH 2003..... Ben´ ıtez. R.. J. In: Proceedings of the DARPA Speech Recognition Workshop.. H.References 169 Rieck. Niemann. 4–24 (1988) .V. In: 11th IAPR International Conference on Pattern Recognition. 2139–2142 (1996) Thompson.: Automatische Spracherkennung. 737–740 (1993) Thyes. de la Torre. Gopinath. J.. S. J.. Raj.: Long term on-line speaker adaptation for large vocabulary dictation. Heute. vol. In: International Conference on Spoken Language Processing. and Signal Processing.. 242–245 (2000) Tritschler. A. A.M... Wang. H¨b-Umbach.J. In: EUROSPEECH 1999. D. H. IEEE ASSP Magazine 5(2). pp. Chang. 751–763 (1987) Thelen. U. pp. Speech. S. T.: Speaker identification and verification using eigenvoices. 2. Jain. Mason. Kuhn.

. pp. Zhang.H. Meignier. and Signal Processing. K. Cambridge (2006) Zavaliagkos. X.. Cambridge University Press. Speech. Speech. G..-L. V. G. and Signal Processing. 714–717 (2000) Zhu. V.: Adaptation algorithms for large scale HMM recognizers.. L.. S. In: International Conference on Spoken Language Processing.-C.. SLT 2010. V. pp. H. Gales. pp.. Speech and Signal Processing. In: IEEE International Conference on Acoustics. P. Radov´. Kershaw. Moore. Varma. Chen. J. Lee.-P.J.. In: Proceedings of the SPIE Conference on Automatic Systems for the Inspection and Identification of Humans.: Significance of anchor speaker segments for constructing extractive audio summaries of broadcast news.170 References Wendemuth. Renals. D. Ollason.. 2277..: Combining speaker identification and BIC for speaker diarization.. Prahallad.. 149–157 (1994) Wrigley... P. H. 1465–1468 (2010) Zhu.H. ICASSP 2008..: UBM-based real-time speaker segmentation for broadcasting news.... Leung. 3073–3076 (2005) . Barras. In: IEEE International Conference on Acoustics. ICASSP 2003.. 2. Oldenbourg (2004) (in German) Wilcox. L.-J. X. E. pp. Kenny. K.L. K. P. Evermann. Valtchev. McDonough. vol. In: IEEE International Conference of Acoustics. 961–964 (2000) Zhou. IEEE Transactions on Speech and Audio Processing 13(1).. F. In: EUROSPEECH 1995. R.. vol. S. Liu. Povey. S.: On-line incremental speaker adaptation with automatic speaker change detection. pp. Ohtsuki. J. In: INTERSPEECH 2005.: Grundlagen der stochastischen Sprachverarbeitung..: Adaptive score normalization for progressive model adaptation in text independent speaker verification... G. B. Hain.: Modified DISTBIC algorithm for speaker change deteca a tion. J. Li. Oxford (2004) Zhang. Gauvain. 13–18 (2010) Yin. Rose. Woodland. In: INTERSPEECH 2005.. R. pp. A.: Audio indexing using speaker identification. Z.. S. 4857–4860 (2008) Young... J. pp. D.-C.A. Makhoul. G. D. pp. 2441–2444 (2005) Zochov´. D.: Oxford Users’ Guide to Mathematics..4. Hansen. Schwartz. C. Lu. V. pp. C. 84–91 (2005) Wu. D. Furui.-A. In: IEEE Workshop on Spoken Language Technology.. 193–196. S.. S.. Chen. Yella. 3. M.: Speech and crosstalk detection in multi-channel audio.. T. ICSLP 2000.. T.N. Odell.: The HTK Book Version 3. Oxford University Press.: Unsupervised audio stream segmentation and clustering via the Bayesian information criterion. Kimber.: MAP estimation of subspace transform for speaker recognition.. In: INTERSPEECH 2010. Ma.. vol. S. B. K. 2.. pp. vol. J. Wan. 1131–1135 (1995) Zeidler. ICASSP 2000. Brown.

138 BIC. 32. 62 divisive segmentation. 56. 150 DCT. 41. 35. 150 Bayes’ theorem. 34. 60 model-based. 25 hidden Markov model. 90. see discrete cosine transform delta features. see random variable in-set speaker. 26 eigen-environment. 3. 27. 129 Kronecker delta. 22 Bayesian Adaptation. 158 measure. 131 backward algorithm. 35 confidence interval. 151 eigen-environment. 102. 127 fundamental frequency. see linear discriminant analysis likelihood. 59 agglomerative clustering. 153 auxiliary function. 33. 151 stereo-based piecewise linear compensation for environments. see hidden Markov model cepstral mean normalization. 25. 10 feature vector normalization and enhancement. 28 crosstalk. 15 delta-delta features. 54 formant. 18. 79. 22 ratio. 155 Bayesian information criterion. 34. 60 metric-based. 6 Gaussian distribution. 21 LDA. 31 continuous. 14. 15 discrete cosine transform. 53. 128 covariance matrix.Index audio segmentation. 127. 154 feature extraction. 45 expectation maximization algorithm. 40. 17 CDHMM. 127 forward algorithm. 34. 33. 126 semi-continuous. 6 forward-backward algorithm. 27. 61 barge-in. 132 cepstral mean subtraction. 40. 20. 18. 87 iid. 70. 18 Gaussian mixture model. 12. 18 diagonal. 93. see Bayesian information criterion binary hypothesis test. 56 eigenvoice. 17 linear discriminant analysis. 60 decoder-guided. 15 . see cepstral mean normalization codebook.

31. 29. 117. 31 language model. 36 grammar. see speaker identification speaker verification. 24 text-independent. 30 decoder. 22 prior. 41 Index maximum likelihood linear regression. 21 closed-set. 44. 104. 91. 6 voiced. 117 Lombard effect. 36 garbage words. 19 independently and identically distributed. 5 unvoiced. 28 mean vector. 84 soft quantization. 129 PCA. 158 word error rate. see maximum likelihood multilayer perceptron. 34. 125 open-set. 149 segmental maximum a posteriori algorithm. 24. 10. 88. 12. 36 speech segmentation. 6 voice activity detection. 7. 29. 5 speaker adaptation. 35 statistical model discriminative. 23. see hidden Markov model simultaneous speaker identification and speech recognition. 7 MAP. 65. 6 speech recognition. 41. 51 maximum a posteriori algorithm. 150 pitch. 24. 19 evolution. 9 non-target speaker. 6 principal component analysis. 18 stationary. 39 speaker change detection. 23 parametric. 149 extended maximum a posteriori. 70. 12 receiver operator characteristics. 150 SPEECON. 29. 117 universal background model. 128 text-dependent. 24. 110. 41. 125 noise reduction. 24 speaker identification rate. 149. see principal component analysis phone. 158 speaker tracking. 93. 22 joint. 48 probability conditional. 23 sufficient statistics. 23 speech production. 22 posterior. 36 lexicon. 24. 46 target speaker. 35 vocal tract. 159 SCHMM. 95 Viterbi algorithm.172 log-likelihood. see maximum a posteriori Markov model. 133 supervector. see speech segmentation word accuracy. 16 speaker identification. 45. 76 SPLICE. 79. 18 mel frequency cepstral coefficients. 74. 22 random variable ergodic. 117 ratio. 28. 158 . 7 phoneme. 125 non-parametric. 31. 125 generative. 36. 29. 103. 23. 23. see mel frequency cepstral coefficients ML. 31 maximum a posteriori. see feature vector normalization and enhancement standard codebook. 10 MFCC. 88 source-filter model. 93. 129 out-of-set speaker. 24 text-prompted. 37 eigenvoices. 17. 22 maximum likelihood. 34.