Robust Speaker Recognition Over Varying Channels-R

See discussions, stats, and author profiles for this publication at: https://www.researchgate.
net/publication/228774387
Robust speaker recognition over varying channels–report from JHU workshop

2008
Article
CITATIONS READS
10 362
19 authors, including:
Niko Brummer Douglas A. Reynolds

AGNITIO Massachusetts Institute of Technology
53 PUBLICATIONS 2,545 CITATIONS 119 PUBLICATIONS 16,787 CITATIONS
SEE PROFILE SEE PROFILE
Patrick Kenny Jason Pelecanos

Centre de recherche informatique de Montréal IBM
143 PUBLICATIONS 10,251 CITATIONS 55 PUBLICATIONS 1,132 CITATIONS
SEE PROFILE SEE PROFILE
Some of the authors of this publication are also working on these related projects:
Speech Recognition View project
All content following this page was uploaded by Nicolas Scheffer on 01 June 2014.
The user has requested enhancement of the downloaded file.

Robust Speaker Recognition Over
Varying Channels
Report from JHU workshop 2008
Lukáš Burget 1 , Niko Brümmer 2 , Douglas Reynolds 3 , Patrick Kenny 4 , Jason Pelecanos 6 ,
Robbie Vogt 7 , Fabio Castaldo 5 , Najim Dehak 4 , Reda Dehak 12 , Ondřej Glembek 1 ,
Zahi N. Karam 3 , John Noecker Jr. 9 , Elly (Hye Young) Na 10 , Ciprian Constantin Costin 11 ,
Valiantsina Hubeika 1 , Sachin Kajarekar 8 , Nicolas Scheffer 8 ,
and Jan “Honza” Černocký (editor) 1
(1) Brno University of Technology, Czech Republic,

(2) Agnitio, Spain,
(3) MIT Lincoln Labs, USA,
(4) Centre de Recherche en Informatique de Montreal, Canada
(5) Polytechnic University of Turin, Italy,
(6) IBM, USA,
(7) Queensland University of Technology, Australia,
(8) SRI INternational, USA,
(9) Duquesne University, USA,
(10) George Mason University, USA,
(11) Alexandru Ioan Cuza University, Rumania,
(12) EPITA, France.
Summary of planned work
Nowadays, speaker recognition is relatively mature with the basic scheme, where speaker model is
trained using target speaker speech and speech from large number of non-target speakers. However,
the speech from non-target speakers is typically used only for finding general speech distribution (e.g.
UBM). It is not used to find the ”directions” important for discriminating between speakers. This
scheme is reliable when the training and test data come from the same channel. All current speaker
recognition systems are however prone to errors when the channel changes (for example from IP
telephone to mobile). In speaker recognition, the ”channel” variability can include also to linguistic
content of the message, emotions, etc. - all these factors should not be considered by a speaker
recognition system. Several techniques, such as feature mapping, eigen-channel adaptation and NAP
(nuisance attribute projection) have been devised in the past years to overcome the channel variability.
These techniques make use of the large amount of data from many speakers to find and ignore directions
with high with-in speaker variability. However, these techniques still do not utilize the data to directly
search for directions important for discriminating between speakers.
In an attempt to overcome the above mentioned problem, the research will be concentrate on
utilizing the large amount of training data currently available to research community to derive the
information, that can help discriminate among speakers and discard the information that can not. We
propose direct identification of directions in model parameter space that are the most important for
discrimination between speakers. According to our experience from speech and language recognition,
the use of discriminative training should significantly improve the performance of acoustic SID system.
We also expect that discriminative training will make the explicit modeling of channel variability
needless.
The research will be based on an excellent baseline - the STBU system for NIST 2006 SRE
evaluations (NIST rules prohibit us to disclose the exact position of the system in the evaluations).
The data to be used during the workshop will include NIST SRE data (telephone) but we will not
overhear the requests from the security/defense community and evaluate the investigated techniques
also on other data sources (meetings, web-radio, etc) as well as on cross-channel conditions.
The expected outcomes of the proposed research are:
1. significant increasing of the accuracy of current SID systems
2. decreasing the dependency on communication channel, content of the message and other factors
negatively affecting SID performance.
3. speaker identification and verification from very short speech segments.
2
Team members
Team Leader
Lukas Burget burget@fit.vutbr.cz Brno University of Technology
Senior Researchers
Niko Brummer niko.brummer@gmail.com Agnitio
Patrick Kenny pkenny@crim.ca Centre de Recherche en Informatique
de Montreal
Jason Pelecanos jwpeleca@us.ibm.com IBM
Douglas Reynolds dar@sst.ll.mit.edu MIT Lincoln Labs
Robbie Vogt r.vogt@qut.edu.au Queensland University of Technology
Graduate Students
Fabio Castaldo fabio.castaldo@polito.it Polytechnic University of Turin
Najim Dehak Najim.Dehak@crim.ca Ecole de Technologie Superieure
Reda Dehak reda@dehak.org EPITA
Ondrej Glembek glembek@fit.vutbr.cz Brno University of Technology
Zahi Karam zahi@mit.edu Massachusettes Institute of Technology
Undergraduate Students
John Noecker Jr. jnoecker@gmail.com Duquesne University
Elly (Hye Young) Na hna@gmu.edu George Mason University
Ciprian Constantin Costin cip123a@gmail.com The Alexandru Ioan Cuza University
Valiantsina Hubeika xhubei00@stud.fit.vutbr.cz Brno University of Technology
Affiliates
Sachin Kajarekar sachin@speech.sri.com SRI International
Nicolas Scheffer scheffer.nicolas@gmail.com SRI International
Acknowledgements
This research was conducted under the auspices of the 2008 Johns Hopkins University Summer Work-
shop, and partially supported by NSF Grant No IIS-0705708 and by a gift from Google, Inc.
BUT researchers were partly supported by European project AMIDA (IST-033812), by Grant Agency
of Czech Republic under project No. 102/05/0278 and by Czech Ministry of Education under project
No. MSM0021630528. Lukáš Burget was supported by Grant Agency of Czech Republic under
project No. GP102/06/383. The hardware used in this work was partially provided by CESNET
under projects Nos. 162/2005 and 201/2006.
Thanks to Tomáš Kašpárek (BUT) who provided the JHU team an excellent computer support and
allowed for efficient use of the BUT computing cluster during the wokrshop.
3
Contents
1 Introduction 7
1.1 Role of NIST evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Sub-groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Diarization using JFA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Factor Analysis Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 SVM–JFA and fast scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.4 Discriminative System Optimization . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Overview of JFA 10
2.1 Supervector model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Generative ML training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 JFA operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Gender dependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Factor analysis based approaches to speaker diarization 13

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Diarization Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 Agglomerative Clustering System (Baseline) . . . . . . . . . . . . . . . . . . . . 14
3.2.2 Variational Bayes System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.3 Streaming Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.4 Hybrid System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.2 Measures of Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Factor analysis conditionning 20

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2 A Phonetic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Factor Analysis Combination Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.2 Systems and protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.3 Combination Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4 Within Session Variability Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4.1 Joint Factor Analysis with Short Utterances . . . . . . . . . . . . . . . . . . . . 29
4.4.2 Extending JFA to Model Within-Session Variability . . . . . . . . . . . . . . . 31
4.4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4
4.4.5 Summary and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 Multigrained Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5.2 Multi-Grained Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5 Support vector machines and joint factor analysis for speaker verification 43
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Joint Factor analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 SVM-JFA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.1 GMM Supervector space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.2 Speaker factors space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3.3 Speaker and Common factors space . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.1 Test set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.2 Acoustic features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.3 Factor analysis training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.4 SVM impostors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.5 Within Class Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5.1 SVM-JFA: GMM supervector space . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5.2 SVM-JFA: speaker factors space . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5.3 SVM-JFA: speaker and common factors space . . . . . . . . . . . . . . . . . . . 48
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6 Handling variability with support vector machines 50

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.3 Handling Nuisance Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.4 Should All Nuisance be Treated Equally . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.5 Using Inter-speaker Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.6 Incorporating All Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.7 Probabilistic Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.8 Artificially Extending the Target Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.9 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7 Comparison of Scoring Methods used in Speaker Recognition with Joint Factor

Analysis 58
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2 Theoretical background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.2.1 Frame by Frame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.2.2 Integrating over Channel Distribution . . . . . . . . . . . . . . . . . . . . . . . 59
7.2.3 Channel Point Estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.2.4 UBM Channel Point Estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.2.5 Linear Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.3.1 Test Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5
7.3.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.3.3 JFA Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.3.4 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.3.5 Hardware and Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.4.1 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
8 Discriminative optimization of speaker recognition systems 65

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
8.2 Motivation for discriminative training . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
8.2.1 Envisioned advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
8.3 Challenges of discriminative training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.4 Solutions for discriminative training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
8.4.1 Discriminative Training Objective Function . . . . . . . . . . . . . . . . . . . . 68
8.4.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.4.3 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
8.4.4 Computation of derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
8.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
8.5.1 Small scale experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
8.5.2 Large Scale Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.5.3 Experiment with ML trained eigenchannels . . . . . . . . . . . . . . . . . . . . 75
8.5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
9 Summary and conclusions 77
6
Chapter 1
Introduction
The largest challenge to practical use of speaker detection systems is channel/session variability, where
“variability” refers to changes in channel effects between training and successive detection attempts.
Channel/session variability encompasses several factors:
• microphones – Carbon-button, electret, hands-free, array, etc.
• acoustic environment – Office, car, airport, etc.
• transmission channel – Landline, cellular, VoIP, etc.
• differences in speaker voice – Aging, mood, spoken language, etc.

Recent experiments of several sites on NIST 2008 data have shown that the performance of a speaker
verification system can be improved from ∼3% EER when using different microphone in training and
test to ∼1% EER for the same microphone in training and test.
The main vehicle to fight the unwanted variability in this workshop was Joint Factor Analysis
(JFA), which dominated NIST 2008 SRE evaluations.
1.1 Role of NIST evaluations

The role of NIST SRE evaluations1 is greater than just providing the data and metrics for the workshop
– all the team members participated in recent 2008 NIST evaluations, so JHU workshop can be seen
as a great opportunity to:
• do common post-evaluation analysis of our systems
• combine and improve techniques developed by individual sites

Thanks to NIST evaluations we have:
• identified some of the current problems that we worked on
• well defined setup and evaluation framework
• baseline systems that were trying to extend and improve during the workshop
1.2 Sub-groups
The work in the workshop was split into 4 work-groups:
1
Annual NIST evaluations of speaker verification technology (since 1995) using a common paradigm for comparing
technologies, see http://nist.gov/speech/tests/sre/
7
1.2.1 Diarization using JFA
Problem Statement
• Diarization is an important upstream process for real-world multi-speaker speech
• At one level diarization depends on accurate speaker discrimination for change detection and
clustering
• JFA and Bayesian methods have the promise of providing improvementsto speaker diarization
Goals
• Apply diarization systems to summed telephone speech and interview microphone speech
– Baseline segmentation-agglomerative clustering

– Streaming system using speaker factors features
– New variational-bayes approach using eigenvoices
• Measure performance in terms of DER and effect on speaker detection
For more details, see Chapter 3.
1.2.2 Factor Analysis Conditioning

Problem Statement
• A single FA model is sub-optimal across different conditions
• Eg.: different durations, phonetic content and recording scenario
Goals
Explore two approaches:
• Build FA models specific to each condition and robustly combine multiple models
• Extend the FA model to explicitly model the condition as another source of variability
1.2.3 SVM–JFA and fast scoring

Problem Statement
• The Support Vector Machine is a discriminative recognizer which has proved to be useful for
SRE
• Parameters of generative GMM speaker models are used as features for linear SVM ( ..sequence
kernels)
• We know Joint Factor Analysis provides higher quality GMMs, but using these as is in SVMs
has not been so successful.
8
Goals
• Analysis of the problem
• Redefinition of SVM kernels based on JFA
• Application of JFA vectors to recently proposed and closely related bilinear scoring techniques
which do not use SVMs
For more details, see Chapters 5, 6 and 7.
1.2.4 Discriminative System Optimization

Problem Statement
• Discriminative training has proved very useful in speech and language recognition, but has not
been investigated in depth for speaker recognition
• In both speech and language recognition, the classes (phones, languages) are modeled with
generative models, which can be trained with copious quantities of data
• But in speaker recognition, our speaker GMMs have at best a few minutes of training typically
of only one recording of the speaker
Goals
• Reformulate the speaker recognition problem as binary discrimination between pairs of record-
ings which can be (i) of the same speaker, or (ii) of two different speakers
• We now have lots of training data for these two classes and we can afford to train complex
discriminative recognizers
9
Chapter 2
Overview of JFA
Joint factor analysis (JFA) is a two-level generative model of how different speakers produce speech and
how their (remotely) observed speech may differ on different occasions (or sessions). The hidden deep
level is the joint factor analysis part that models the generation of speaker-and-session-dependent
GMMs. The output level is the GMM generated by the hidden level, which in turn generates the
sequence of feature vectors of a given session.
The GMM part needs no further introduction. As is customary in speaker recognition, all of the
GMMs differ only in the mean vectors of the components [Reynolds et al., 2000]. The component
weights and the variances are the same for all speakers and sessions. The session-dependent GMM
component means are modeled as:
Mki = mk + Uk xi + Vk ys(i) + Dk z ks(i) (2.1)
Here the indices are: k for the GMM component; i for the session; and s(i) for the speaker in session
i. The system hyperparameters are:
mk , speaker-and-session-independent mean vector;
Uk , rectangular channel-factor loading matrix ;
Vk , rectangular speaker-factor loading matrix ;
Dk , diagonal speaker-residual scaling matrix ;
The hidden speaker and session variables are:
xi , session-dependent vector of channel-factors;
ys , speaker-dependent vector of speaker-factors;
z ks , speaker-and-component-dependent vector of speaker-residuals.
Standard normal distributions are used as a prior for all of these hidden variables.
2.1 Supervector model

We can summarize our JFA model by stacking component-dependent hyperparameters into larger
matrices:
     
V1 U1 D1 0 · · ·
V =  V2  , U =  U2  , D =  0 D2 · · ·  (2.2)
     
.. .. .. .. . .
. . . . .
We refer to V as the eigenvoice matrix ; to U as the eigenchannel matrix ; and D as the residual scaling
matrix. By also stacking component-dependent vectors into larger vectors, which we shall refer to as
10
supervectors:
     
M1i m1 z 1s
Mi =  M2i  m =  m2  zs =  z 2s  , (2.3)
     
.. .. ..
. . .
the JFA model can be expressed succinctly in supervector form as
Mi = m + Uxi + Vys(i) + Dzs(i) (2.4)
2.2 Generative ML training

In this section we give a rough summary of how the hyperparameters of a JFA system may be trained.
The steps are as follows:
1. Train the universal background model (UBM) on a large selection of development data, possibly
on all of it. The UBM is a GMM trained by maximum likelihood (ML), via appropriate initial-
ization followed by multiple iterations of the EM algorithm. The UBM essentially provides the
following functionality:
• Its component means are a good choice to use for the speaker-and-session-independent
supervector m; and its variances and weights are a good choice to use for all speaker-and-
session-dependent GMM variances and weights.
• It parametrizes a computationally efficient approximation to all GMM log-likelihoods,
used during training and operation of the JFA system. Specifically, all GMM log-
likelihoods are approximated by the EM-algorithm auxiliary function [Minka, 1998], of-
ten denoted ‘Q-funtion’ in the literature. Informally, given some GMM, we approximate
log p(data|GMM) ≈ Q(UBM, GMM, data). All further processing makes use of this ap-
proximation.
2. Train the eigenvoice matrix V with an EM algorithm designed to optimize a maximum likelihood
criterion over a database of as many speakers as possible. Pool multiple sessions per speaker, to
attenuate intersession variation.
3. Given V as obtained above and with D temporarily set to zero, train the eigenchannel matrix
U with a similar EM algorithm, over a database that has multiple sessions per speaker. This
data should be rich in channel variation. The Mixer Databases are very good for this purpose.
4. Finally (and optionally), train D, with a similar EM-algorithm, on some held-out data.
2.3 JFA operation

When used operationally, the steps performed by a JFA system to score a given trial, composed of
one train and one test segment, can be described as follows:
1. Use the JFA model (2.4) and the train segment to make a MAP1 point-estimate of the target
speaker model. That is, the hidden variables x, y, and (optionally) z are jointly estimated.
Then x is discarded and M is denoted the target speaker model. Note that this model now has
a unspecified parameter x, because its value in a test segment will be different from its value in
the train segment. This uncertainty is modeled by the standard normal prior over x.
1
MAP denotes maximum a-posteriori. The likelihood used here is the Q-function approximation and the prior is the
standard normal distributions over the hidden variables.
11
2. Compute an approximation to the log-likelihood of the target speaker model, given
the test segment data, log p(test segment|M). Good approximations to use here in-
clude [Glembek et al., 2009]:
• The Q-function approximation, where the unknown nuisance variable x is integrated out,
see [Kenny et al., 2007b], equation 19.
• A linear simplification to the Q-function, where a MAP point-estimate of x is used. For
computational efficiency x is estimated relative to the UBM, i.e. with y = 0 and z = 0.
3. Compute the same approximation to the UBM log-likelihood, i.e. with y = 0 and z = 0. The raw
score (or raw log-likelihood-ratio) is now the difference between the target model log-likelihood
and the UBM log-likelihood.
4. Normalize the raw score by applying the following in order: (i) divide by the number of test
frames, (ii) z-norm, (iii) t-norm.
2.4 Gender dependency

JFA systems benefit from gender-dependent components:
• Some, like the CRIM system at SRE’08, are trained from the UBM onwards on gender-dependent
data. This gives independent male and female systems, which can be used respectively for all-
male or all-female trials.
• Others, like the BUT system at SRE’08, are trained on mixed data, but then use gender-
dependent ZT-norm cohorts.
12
Chapter 3
Factor analysis based approaches to

speaker diarization
This chapter reports on work examining new approaches to speaker diarization. Four different sys-
tems were developed and experiments were conducted using summed-channel telephone data from the
2008 NIST SRE. The systems are a baseline agglomerative clustering system, a new Variational Bayes
system using eigenvoice speaker models, a streaming system using a mix of low dimensional speaker
factors and classic segmentation and clustering, and a new hybrid system combining the baseline sys-
tem with a new cosine-distance speaker factor clustering. Results are presented using the Diarization
Error Rate as well as by the EER when using diarization outputs for a speaker detection task. The
best configurations of the diarization system produced DERs of 3.5-4.6% and we demonstrate a weak
correlation of EER and DER,
3.1 Introduction
Audio diarization is the process of annotating an input audio channel with information that attributes
(possibly overlapping) temporal regions of signal energy to their specific sources. These sources can
include particular speakers, music, background noise sources and other signal source/channel char-
acteristics. Diarization systems are typically used as a pre-processing stage for other downstream
applications, such as providing speaker and non-speech annotations to text transcripts or for adapta-
tion of speech recognition systems. In this work we are interested in improving diarization to aid in
speaker recognition tasks where the training and/or the test data consists of speech from more than
one speaker. In particular we focus on two speaker telephone conversations and multi-microphone
recorded interviews as used in the latest NIST Speaker Recognition Evaluation (SRE)1 .
This chapter reports on work carried out at the 2008 JHU Summer Workshop examining new
approaches to speaker diarization. Four different systems were developed and experiments were con-
ducted using data from the 2008 NIST SRE. Results are presented using a direct measure of diarization
error (Diarization Error Rate) as well as the effect of using diarization outputs for a speaker detection
task (Equal Error Rate). Finally we conclude showing the relation of DER to EER and summarize
the effective components common to all systems.
3.2 Diarization Systems

Four systems were developed for the 2008 JHU Summer Workshop. The systems range from a baseline
agglomerative clustering system, to a new system based on variation Bayes theory, to a streaming audio
1
See http://www.nist.gov/speech/tests/sre/2008/ for more details.
13
clustering system and a new hybrid system using elements of the baseline system and newly developed
speaker factor distances.
3.2.1 Agglomerative Clustering System (Baseline)

The baseline system represents the framework of most widely used diarization sys-
tems [Tranter and Reynolds, 2006]. It consists of three main stages.
In the first stage speaker change points are detected using a Bayesian Information Criterion (BIC)
based distance between abutting windows of feature vectors. The features for the baseline system
consist of 13 cepstral coefficient (including c0) with no channel normalization. This technique searches
for change points within a window using a penalized likelihood ratio test of whether the data in the
window is better modeled by a single distribution (no change point) or two different distributions
(change point). If a change is found, the window is reset to the change point and the search restarted.
If no change point is found, the window is increased and the search is redone. Full covariance Gaussians
are used as distribution models.
The purpose of the second stage is to associate or cluster segments from the same speaker together.
The clustering ideally produces one cluster for each speaker in the audio with all segments from a
given speaker in a single cluster. Hierarchical, agglomerative clustering with a BIC based stopping
criterion is used consisting of the following steps:
0. Initialize leaf clusters of tree with speech segments.

1. Compute pair-wise distances between each cluster.
2. Merge closest clusters.
3. Update distances of remaining clusters to new cluster.
4. Iterate steps 1-3 until stopping criterion is met.
The clusters are represented by a single full covariance Gaussian. Since we have prior knowledge
of two speakers present in the audio, we stop when we reach two clusters.
The last stage is iterative re-segmentation with GMM Viterbi decoding to refine change points and
clustering decisions. Additionally, a form of Baum-Welch re-training of speaker GMMs using segment
posterior-weighted statistics can be used before a final Viterbi segmentation. This step was inspired
by the Variation Bayes approach and is also referred to as ”soft-clustering.”
3.2.2 Variational Bayes System

This is one of the new systems developed during the workshop and is based on the Variational Bayes
method of speaker diarization described by Valente [Valente, 2005]. Work on this system was moti-
vated by the desire to build on the success of factor analysis methods in speaker recognition and to
capitalize on some of the advantages a Bayesian approach may bring to the diarization problem (e.g.,
EM-like convergence guarantees, avoiding premature hard decisions, automatic regularization).
To build on the factor analysis work, we begin by using an eigenvoice model to represent the
speakers. The assumption in eigenvoice modeling is that supervectors2 have the form
s = m + V y.
Here s is a randomly chosen speaker dependent supervector; m is a speaker independent supervector

(i.e., UBM); V is a rectangular matrix of low rank whose columns are referred to as eigenvoices;
the vector y has a standard normal distribution; and the entries of y are the speaker factors. From
the point of view of Bayesian statistics, this is a highly informative prior distribution as it imposes
2
The term supervector is used to refer to the concatenation of the mean vectors in a Gaussian mixture model.
14
severe constraints on speaker supervectors. Although supervectors typically have tens of thousands of
dimensions, this representation constrains all supervectors to lie in an affine subspace of the supervector
space whose dimension is typically at most a few hundred. The subspace in question is the affine
subspace containing m which is spanned by the columns of V .
In the Variational Bayes diarization algorithm, we start with audio file in which we assume there
are just two speakers and a partition of the file into short segments, each containing the speech of just
one of the speakers. This partitioning need not be very accurate. A uniform partition into one second
intervals can be used to begin with; this assumption can be relaxed in a second pass.
We define two types of posterior distribution which we refer to as speaker posteriors and segment
posteriors. For each of the two speakers, the speaker posterior is a Gaussian distribution on the vector
of speaker factors which models the location of the speaker in the speaker factor space. The mean
of this distribution can be thought of as a point estimate of the speaker factors and the covariance
matrix as a measure of the uncertainty in the point estimate. For each segment, there are two segment
posteriors q1 and q2 ; q1 is the posterior probability of the event that the speaker in the segment is
speaker 1 and similarly for speaker 2.
The Variational Bayes algorithm consists in estimating these two types of posterior distribution
alternately as explained in detail in [Kenny, 2008]. At convergence, it is normally the case that q1 and
q2 takes values of 0 or 1 for each segment but q1 and q2 are initialized randomly so that the Variational
Bayes algorithm can be thought of as performing a type of soft speaker clustering, as distinct from
the hard decision making in the agglomerative clustering phase of the baseline system.
The Variational Bayes algorithm can be summarized as follows:
Begin:
• Partition the file into 1 second segments and extract Baum Welch statistics from each
segment
• Initialize the segment posteriors randomly
• No initialization is needed for the speaker posteriors
On each iteration of Variational Bayes:
• For each speaker s:

– Synthesize Baum Welch statistics for the speaker by weighting the Baum Welch statis-
tics for each segment by the corresponding segment posterior qs
– Use the synthetic Baum Welch statistics to update the speaker posterior
• For each segment:
– Update the segment posteriors for each speaker
End:
• Baum Welch estimation of speaker GMM’s together with iterative Viterbi re-segmentation
(as in the baseline system)
In the Variational Bayes system, 39 dimensional feature vectors derived from HLDA transforms of
13 cepstra (including c0) plus single, double and triple deltas are used. The cepstra were processed with
short-term (300 frame) Gaussianization. For the re-segmentation, 13 un-normalized cepstra (c0-c12)
were used. The eigenvoice analysis used a 512 mixture GMMs and 200 speaker factors.
15
3.2.3 Streaming Systems
In this section we describe another way to combine speaker diarization and join factor analysis. Speaker
diarization using factor analysis was first introduced in [Castaldo et al., 2008] using a stream-based
approach. This technique performs an on-line diarization where a conversation is seen as a stream of
fixed duration time slices. The system operates in a causal fashion by producing segmentation and
clustering for a given slice without requiring the following slices. Speakers detected in the current slice
are compared with previously detected speakers to determine if a new speaker has been detected or
previous models should be updated.
Given an audio slice, a stream of cepstral coefficients and their first derivatives are extracted.
With a small sliding window (about one second) a new stream of speaker factors (as described in the
previous section) is computed and used to perform the slice segmentation. The dimension of speaker
factor space is quite small (about twenty) with respect to the number used for speaker recognition
(about three hundred) due to the short estimation window.
In this new space, a clustering of the speaker factors stream is done obtaining a single multivariate
Gaussian for each speaker. A BIC criterion is used to determine how many speaker there are in the
slice. A Hidden Markov Model (HMM) using the Gaussian for each state associated to a speaker is
built and through the Viterbi algorithm a slice segmentation is obtained.
In addition to the segmentation, a Gaussian Mixture Model (GMM) in the acoustic space is created
for each speaker found in the audio slice. These models are used in the last step, slice clustering, where
we determine if a speaker in the current audio slice was present in previous slices, or is a new one.
Using an approximation to the Kullback-Leibler divergence, we find the closest speaker model built in
previous slices to each speaker model in the current slide. If the divergence is below a threshold the
previous model is adapted using the model created in the current slice, otherwise the current model
is added to the set of speaker models found in the audio.
The final segmentation and speakers found from the on-line processing can further be refined using
Viterbi re-segmentation over the entire file.
3.2.4 Hybrid System

The last system was motivated by other work at the 2008 JHU Workshop [Dehak et al., 2009] that
showed good speaker recognition performance could be obtained using a simple cosine-distance between
speaker factor vectors. The idea is to use the baseline agglomerative clustering system to build a tree
up to a certain level where the nodes contain sufficient data, then to extract speaker factors for these
nodes and continue the clustering process using the cosine distance.
The critical factor in determining when to stop the initial clustering is the amount of speech in
each node since we wished to work with 200 speaker factors. Two approaches were used for stopping
the initial clustering: level cutting and tree searching.
Level cutting (upper pannel of Fig 3.1) consists of merely running the clustering until a preset
number of nodes exist - typically 5-15. The tree searching (lower pannel of Fig 3.1) consists of
building the entire tree, then searching the tree from the top-down to select the set of nodes that had
at least a preset amount of speech.
As with the other systems, a final Viterbi re-segmentation is applied to refine the diarization.
3.3 Experiment Design

3.3.1 Data
Summed channel telephone data from the 2008 SRE was used for diarization experiments with the
above systems. This data was selected since we could derive reference diarization, needed for measuring
DER, by using time marks from ASR transcripts produced on each channel separately. In addition
16
Figure 3.1: Level cutting and Tree clustering
the data corresponded to one of the speaker detection tasks in the 2008 SRE, so we could measure the
effect of diarization on EER. The test set consists of 2215 files of approximately five minutes duration
each (≈ 200 hours total). To avoid confounding effects of mismatched speech/non-speech detection
on the error measures, all diarization systems used a common set of reference speech activity marks
for processing.
3.3.2 Measures of Performance

As mentioned earlier, we used two measures of performance for the diarization systems. The diarization
error rate (DER) is the more direct measure which aligns a reference diarization output with a system
diarization output and computes a time weighted combination of miss, false alarm and speaker error
3 . Since all systems use reference speech activity marks, miss and false alarm, which only are affected
by speech activity detection, are not used. Speaker error measures the percent of time a system
incorrectly associates speech from different speakers as being from a single speaker. In these results
we report the average and standard deviation DER computed over the test set to show both the
average as well as the variation in performance for a given system.
To measure the effect of diarization on a speaker detection task, we used the diarization output
in the recognition phase of one of the summed-channel telephone tasks from the 2008 SRE. In the
3conv-summed task, the speaker models are trained with three single channel conversations and tested
with a summed channel conversation. The diarization output is used to split the test conversation
into two speech files (presumably each from a single speaker) which are scored separately and the
maximum score of the two is the final detection score. A state-of-the-art Joint Factor Analysis (JFA)
3
DER scoring code available at www.nist.gov/speech/tests/rt/2006-spring/code/md-eval-v21.pl
17
speaker detection system developed by Loquendo [Vair et al., 2007] is used for all diarization systems.
Results are reported in terms of the equal error rate (EER).
3.4 Results
In Table 3.1 we present DER results for some key configurations of the diarization systems. Overall we
see that the final Viterbi re-segmentation significantly helps all diarization systems. For the baseline
system, it was further seen that the soft-clustering, inspired by the Variational Bayes system, reduces
the DER by almost 50%. The Variational Bayes system achieves similarly low DER when a second
pass is added that relaxes the first pass assumption of fixed one second segmentation. The streaming
system had the best performance out of the box, with some further gains with the non-causal Viterbi
re-segmentation. Disappointingly, the hybrid system did not achieve performance better then the
original baseline. This may be due to the first stage baseline clustering biasing the clusters too much
or the inability to reliably extract 200 speaker factors from the small amounts of speech in the selected
clusters.
Table 3.1: Mean and standard deviation of diarization error rates (DER) on the NIST 2008 summed
channel telephone data for various configurations of diarization systems.
mean DER (%) σ (%)
Baseline + Viterbi 6.8 12.3
Baseline + soft-cluster + Viterbi 3.5 8.0
Var. Bayes 9.1 11.9
Var. Bayes + Viterbi 4.5 8.5
Var. Bayes + Viterbi + 2-pass 3.8 7.6
Stream 5.8 11.1
Stream + Viterbi 4.6 8.8
Hybrid + Viterbi (level cut) 14.6 17.1
Hybrid + Viterbi (tree search) 6.8 13.6
Lastly, in Figure 3.2 we show EER for the 3conv-summed task for different configurations of
the above diarization systems. The end point DER values of 0% and 35% represent using reference
diarization and no diarization, respectively. We see that there is some correlation of EER to DER, but
this is relatively weak. It appears that systems with a DER < 10% produce EERs within about 1%
of the “perfect” diarization. To sweep out more point with higher DER, we ran the baseline system
with no Viterbi re-segmentation (DER=20%). While the EER did increase to 10.5% it was still better
than the no-diarization result of EER=14.1%.
3.5 Conclusions
In this chapter we have reported on a study of several diarization system developed during the 2008
JHU Summer Workshop. While each of the systems had a different approach to speaker diarization,
we found that ideas and techniques proved out in one system were also able to be successfully ap-
plied to other systems. The Viterbi re-segmentation used in the baseline system, was a very useful
stage for the other systems. Also the idea of soft-clustering from the Variation Bayes approach was
incorporated into the agglomerative clustering baseline system to reduce the DER by almost 50%.
The best configurations of the diarization systems produced DERs of 3.5-4.6% on summed-channel
conversational telephone speech. We further examined the impact of using different diarization system
with varying DERs on a speaker recognition task. While there was some weak correlation of EER
18
Figure 3.2: EER vs DER for several diarization systems.
to DER, it was not as direct as one would like in order to optimize diarization systems using DER
independent of the recognition systems using the their output. In future work we plan on applying
these diarization systems to the interview recordings in the 2008 SRE. This new domain will present
several new challenges, including variable acoustics due to microphone type and placement as well as
different speech styles and dynamics between face to face interviewer and interviewee.
19
Chapter 4
Factor analysis conditionning
4.1 Introduction
Factor Analysis (FA) modelling [Kenny, 2006] is a popular and an effective mechanism for capturing
variabilities in speaker recognition. However it is recognized that a single FA model is sub-optimal
across different conditions. For example, modelling utterances of different durations, phonetic content
and recording configurations.
In this chapter we begin to address these conditions by exploring two approaches; that is by,
(1) building FA models specific to each condition and robustly combining multiple models and (2)
extending the FA model to explicitly model the condition as another source of variability. These
approaches guide the study in four areas:
• A Phonetic Analysis
• Factor Analysis Combination Strategies
• Within Session Variability Modelling
• Multigrained Factor Analysis
The work stemming from these themes exploits the use of phonetic information in both enrollment
and verification. Figure 4.1 presents the issue of phonetic variability across sessions. In the traditional
factor analysis system, the phonetic variability component is largely ignored and is modelled indirectly
as part of a larger within session variability process – whether or not the phonetic instances were
observed in all utterances.
Section 4.2 provides an introductory study on the performance of phonetic events with the FA
type system. Section 4.3 discusses the use of different FA configurations (such as stacking and con-
catenation) and their effect on performance. The following section (Section 4.4) then investigates the
issue of factor analysis for varied utterance durations, and finally, Section 4.5 examines one of the
granularity assumptions of the implemented FA model.
4.2 A Phonetic Analysis

This section performs an examination of the relative performance of phonetically labelled events and
their improvement attributed to cross fusion of these categories in a factor analysis setup. This work
validates the need for a conditioned analysis of the underlying processes being modelled.
To demonstrate the importance of conditioning a system for the audio context the results of an
artificial experiment are presented. The results in Table 4.1 demonstrate the importance of balanced
phonetic content in both enrollment and verification. The results are presented for the NIST 2006
20
Train Data Test Data
phoneme I
phoneme I
‘w’
‘w’
phoneme II phoneme II
‘ah1’ phoneme III ‘ow’ phoneme III
‘n’ ‘d’
feature space feature space
phoneme I phoneme II phoneme III
Figure 4.1: A drawing indicating the breakdown of speech into phonetic categories in enrollment and
test.
Table 4.1: Performance of systems when trained and tested on broad phonetic categories.
Vowel (Test) Consonant (Test)
Enroll EER (%) Min DCF EER (%) Min DCF
Vowel 4.50 0.0208 12.47 0.0537
Consonant 10.72 0.0521 7.03 0.0336
Speaker Recognition Evaluation [National Institute of Standards and Technology, 2008] using a stan-
dard factor analysis system trained on the broad phonetic groups as classified by the BUT Hungarian
Phonetic Recognizer. This result, albeit an extreme example, demonstrates the challenge of mis-
matched phonetic content. For example, if only consonants are used to enroll and verify a speaker,
the EER is approximately 7% while if only vowels are used in verification, then the EER increases
to more than 12%. Phonetic mismatch is pronounced for short duration utterances and utterances
recorded with a different speech style.
Not only are there performance differences attributed to speech content across enrollment and
verification but there is also performance differences for different phones as shown in Table 4.21 .
A follow up plot (Figure 4.2) is provided using the data from Table 4.2 to present the performance
of broad phonetic categories versus their relative duration in the utterance. Interestingly, the vowels
tend to be the best performing, but they also comprise more of the speech in an utterance.
A final experiment examines the performance of fusing the systems from two different phonetic
events (optimally combined by linear fusion). The question this experiment attempts to address is
whether the linear score fusion of two vastly different phonetic categories is more beneficial than the
fusion of two similar phonetic classes. Figure 4.3 plots the performance of the score fusion of two
phone classes versus the total duration of the combined phonetic classes. Intuition would suggest that
phonetic diversity should help, but it was not observed to a significant degree in this experiment.
1
Note that the output from the Hungarian recognizer does not correspond to English phones and may be considered
more as an audio tokeniser instead.
21
Table 4.2: Performance of systems when trained and tested on broad phonetic categories.
DET 1 DET 3
Phoneme Type % of speech EER (%) DCF EER (%) DCF
E vowel 18.93 12.16 0.0567 8.62 0.0419
O vowel 10.71 14.57 0.0645 12.30 0.0558
i vowel 6.85 16.73 0.0749 15.49 0.0696
A: vowel 5.89 23.31 0.0876 21.79 0.0852
n nonvowel 5.44 19.08 0.0779 17.23 0.073
e: vowel 4.73 25.31 0.0917 22.92 0.0866
k stop 4.49 25.56 0.0926 22.26 0.0868
z sibilant 4.25 29.73 0.098 28.22 0.0971
o vowel 3.01 25.53 0.0924 25.24 0.0926
t stop 2.76 27.04 0.0956 24.92 0.0936
s sibilant 2.74 30.73 0.0965 27.63 0.0908
f sibilant 2.41 34.43 0.0998 31.42 0.0984
j nonvowel 2.38 25.00 0.0918 22.41 0.0862
v sibilant 2.35 33.66 0.1 30.78 0.0992
m nonvowel 2.29 21.18 0.0835 18.63 0.0782
S sibilant 2.21 31.97 0.0959 31.74 0.0981
l nonvowel 1.99 30.05 0.0974 29.91 0.0955
40
vowel nonvowel sibilant stop
30
Performance (EER) %
20
10
0
0 5 10 15 20
% of Speech
Figure 4.2: A plot of the phonetic performance of individual phones identified according to broad
phonetic category.
4.3 Factor Analysis Combination Strategies

4.3.1 Introduction
Modeling variability in the model space is a major focus of the speaker recognition community. This
work has shown to be particularly useful for channel compensation of speaker models. One of the most
developed frameworks tackling this problem is Joint Factor Analysis (JFA), introduced by Patrick
Kenny in [Kenny et al., 2005a]. This framework aims at factoring out two components for an utter-
ance: the speaker and the nuisance component (usually called channel or session variability). The
latter is commonly removed for training a speaker model.
This work2 aims to take advantage of developments in JFA in the context of a phonetically con-
ditioned system. Previous work with phonetic systems has shown the ability to extract additional
performance through phonetic conditioning [Kajarekar, 2008, Castaldo et al., 2007], although this ad-
vantage was not observed for a full factor analysis model.
The particular focus of this work is to investigate strategies for combining each of the phone-
conditioned JFA systems. Our hypothesis is that score level combination is suboptimal and does
not fully realize the potential advantages of a conditioned JFA system. Options for model-level
2
This section of work ”Factor Analysis Combination Strategies” is copied from our accepted ICASSP 2009 submission.
22
vowel with others
vowel with vowel
Figure 4.3: A plot of the performance of fusing two phonetic events from within or across broad phone
categories.
combination are presented and compared.

We term the model combination strategies as supervector concatenation and subspace stacking,
both illustrated in Figure 4.4. The motivation behind the supervector concatenation approach is to
simultaneously present all the phone-conditioned statistics to the JFA model so that correlations and
relationships between the phonetic conditions, as well as the differences, can be observed and modeled.
This approach results in an increase in the dimension of the speaker model mean by a factor of the
number of phonetic classes with no increase in the latent variable dimension.
Alternatively, the subspace stacking approach combines subspace transforms from each phonetic
context resulting in an increased dimension of the speaker, channel or both latent variables. It is
hypothesized that this approach provides the flexibility for the observed data to select the most
relevant subspace dimensions and has previously proven useful in the auxiliary microphone conditions
of recent NIST SREs [Kenny et al., 2008c].
While the focus of this work is on phone-conditioned JFA systems, the implications may reach
beyond this scope. We expect that investigating several possibilities using phonetic-events will lead to
a better understanding of the JFA model and a methodology that can be applied to increase robustness
to other kinds of conditions such as language, gender and microphone types.
Figure 4.4: Stacked vs. concatenated eigenvectors for 2 phonetic classes. The former enrich the model
by projecting statistics on both classes, thus increasing the rank. The latter produces a more robust
latent variable by tying the classes together, thus increasing the model size.
4.3.2 Systems and protocol

We describe the JFA framework, as well as the system and the phonetic decoder used for the experi-
ments, before presenting the experimental protocol.
23
Joint Factor Analysis
Let us define the notations that will be used throughout this discussion. The JFA framework uses the
distribution of an underlying GMM, the universal background model (UBM) of mean m0 and diagonal
covariance Σ0 . Let the number of Gaussians of this model be N and the feature dimension in each
Gaussian be F . A supervector is a vector of the concatenation of the means of a GMM: its dimension
is N F . The speaker component of the JFA model is a factor analysis model on the speaker GMM
supervector. It is composed of a set of eigenvoices and a diagonal model. Precisely, the supervector
ms of a speaker s is governed by,
ms = m0 + V y + Dz (4.1)
where V is a tall matrix of dimension N F × RS , and is related to the eigenvoices (or speaker loadings),
which span a subspace of low-rank RS . D is the diagonal matrix of the factor analysis model of
dimension N F × N F . Two latent variables y and z entirely describe the speaker and are subjected
to the prior N (0, 1). The nuisance (or channel/session) supervector distribution also lies in a low-
dimensional subspace of rank RC . The supervector for an utterance h with speaker s is
mh = ms + U x (4.2)
The matrix U , known as the eigenchannels (or channel loadings), has a dimension of N F × RC .
The loadings U , V , D are estimated from a sufficiently large dataset while the latent variables x, y,
z are estimated for each utterance.
Baseline System Description

The speaker recognition system from Brno University of Technology (BUT) is used for the experiments.
The baseline system employs a 512-Gaussian UBM. The features are warped Mel-frequency cepstral
coefficients (MFCCs) composed of 19 cepstrum features and one energy feature. First and second
order derivatives are appended for a total dimension of 60. The rank of the speaker space is 120
while the channel space rank is 60. A lower number of Gaussian as well as lower subspace ranks were
selected to accommodate for the multiple phone classes.
To train the matrices, several iterations of the expectation maximization (EM) algorithm of the
factor analysis framework are used. An alternative minimum divergence estimation (MDE) is used
at the second iteration to scale the latent variables to a N (0, 1) distribution. To train a speaker
model, the posteriors of x, y, z are computed using a single iteration (via the Gauss-Seidel method as
in [Vogt et al., 2005]).
The verification score for each trial was a scalar product between the speaker model mean offset
and the channel compensated first order Baum-Welch statistics centered around the UBM. This scalar
product was found to be simple yet very effective [Brümmer, 2008] and was subsequently adopted by
the JHU fast scoring group (Chapter 7).
The speaker verification system is gender-independent with a gender-dependent score normalization
(ZT norm).
Phonetic Decoder
The phonetic decoder used for these experiments is an open-loop Hungarian phone decoder from
BUT, Brno [Matejka et al., 2006]. The Hungarian language possesses a large phone set and enables
the modeling of more nuances than an English set. This has been particularly useful in language
identification tasks. For this work, we chose to cluster the phonemes into broader phonetic events.
We used two different clusterings obtained in a supervised way by expertise:
• 2-class set: vowels (V), consonants (C)

• 4-class set: vowels (V), sibilants (Si), stops (St), non-vowels (NV).
24
To build a phonetically conditioned system, for example a vowel system, we first extract the feature
vectors from an utterance corresponding to the occurrences of vowels in the phone transcription to
obtain phone-conditioned Baum-Welch statistics for the utterance. These statistics are used in exactly
the same fashion as described above to build a full JFA model with phone-conditioned speaker and
channel subspace matrices. The speaker and channel loadings will be subscripted by the notation
adopted for each event in Table 4.3 (for instance, VV will be the speaker loading for the vowel set).
Experimental Protocol
All experiments were performed based on the all trials condition from the NIST-SRE-2006 dataset.
The data set consists of 3616 target trials and 47452 non-target trials. Results are given in terms of
equal error rate (EER) and minimum detection cost function (mDCF) given by NIST.
The factor analysis model uses the following data sets for training:
• The UBM is trained on Switchboard and Mixer data. For simplicity we fixed the UBM for all
phonetic events.
• The eigenvoices and eigenchannels are trained in a gender-independent fashion on the NIST SRE
04 data set, consisting of 304 speakers and 4353 sessions. The diagonal model is trained on 359
utterances coming from 57 speakers from SRE 04 and 05.
• The score normalization data (Z- and Tnorm) was drawn from SRE 04 and 05 with around 300
utterances for each gender.
4.3.3 Combination Strategies

In this section, we evaluate the performance of the score-level combination strategy for the phonetic-
system. We will then investigate techniques in the model space that will robustly estimate the speaker
by taking into account all phonetic classes.
Baseline and Score-level Fusion Results

Score-level combination is a frequently used technique for gaining robustness on different conditions.
For a phonetic GMM system, the usual strategy is to have as many systems as the number of phonetic
events. The combination of information is done at the score level by fusing the scores. In this
experiment, an optimistic system combination is used, as the logistic regression is trained and tested
on the same data. The FoCal toolkit [Brümmer and du Preez, 2006] is used for this process.
Table 4.3 presents the results for the baseline system, as well as for each broad phonetic event of
our set. There is a clear advantage of the system using vowels alone, but it also represents 60% of the
entire data used. The score-level fusion on the 2-class is better than for the 4-class set. However, while
using the same amount of data, the 2-class fusion performance is worse than the baseline system. In
the following paragraphs, we show how to improve the subsystem combination.
Concatenation
The first model space approach investigated consists of concatenating parameters of the speaker from
different phone sets. The following experiments investigate at which level this concatenation should
occur. Let us consider the 2-class phone-set {V, C} for this approach. The resulting model supervector
length will thus increase to 2N F . The main advantage of this method is that a single system is used
for the entire phone set.
• Eigenvector concatenation
25
Table 4.3: Results for the baseline system, as well as for each phonetic group are included. The results
of fusions across phonetic groupings are also shown. Results show that score-level combinations for
the two phonetic sets are similar, but fail to outperform the baseline. [SRE 06, all trials, DCF×10,
EER(%)].
System % Data EER (%) mDCF

Vowels (V) 60 6.17 0.296
Consonants (C) 40 7.91 0.391
Consonant subsets...
Non-Vowels (NV) 15 10.7 0.502
Sibilants (Si) 15 14.14 0.647
Stops (St) 10 15.27 0.685
V+ C 100 5.20 0.262
V + NV + Si + S. 100 5.42 0.272
Baseline 100 5.12 0.241
We first concatenate the eigenvectors from different phonetic events during training and testing of the
speaker models. Under this model, the system will estimate a single set of latent variables x, y, z per
utterance, each of them being independent of the class.
ms = m0 m0 + VV VC y + UV UC x + DG 00DG z (4.3)
Here, the ranks of the subspaces are the same as in the baseline system and the DG matrix is a copy
of the D matrix from the baseline system.
The results in Table 4.4 (first three rows) show a significant degradation of the model concatena-
tion style combination. It seems that if the subspaces are trained separately, the projection on the
resulting concatenated subspace does not reflect the classes appropriately. This leads to the need to
retrain subspaces explicitly to be tied together. It is important to note that the concatenation of
the channel eigenvectors decreases the performance much more compared to the speaker eigenvectors.
This supports the hypothesis that eigenvoices should be the main focus when using a phonetic GMM
system.
• Baum-Welch statistics concatenation
For this experiment, the speaker and channel subspaces are retrained using the concatenated first-
and zero-order statistics from each phonetic event. The results in Table 4.4 show that this approach
performs close to the score-level combination, but fails to outperform it. However, the subspaces are
effectively tied so that a robust estimate of the latent variable can be produced. Consequently, a gain
is observed compared to the systems taken separately.
• Tied factor analysis
Tied factor analysis has been used successfully in other fields such as face recogni-
tion [Prince and Elder, 2006]. For this approach, the model is the same as in Equation 4.3, but
the eigenvectors for each phonetic event are trained so that the latent variables are tied between the
phonetic events. This approach should be successful for a phonetic system, as the amount of data
for each event can vary, especially for very short conditions. We applied the following algorithm until
convergence:
• Estimate the latent variables for the concatenated Baum-Welch statistics (like in 4.3.3).
26
• Estimate the matrices separately, on their respective statistics, by maximizing the likelihood of
the data with respect to the latent variables of the previous step.
Table 4.4 shows that retraining the subspaces by concatenating the statistics from each phone set
or by using tied factor analysis leads to similar performance. It seems the EM algorithm used for the
factor analysis model tends to tie the different phonetic events naturally.
Table 4.4: Eigenvector concatenation on the 2-class set. The speaker and the channel subspace used
are shown along with the concatenation type. Results show that the subspaces have to be retrained
to obtain decent performance, using the standard EM or a Tied Factor Analysis approach. [SRE 06,
all trials, DCF×10, EER(%)]
System Speaker Channel EER (%) mDCF

Baseline VG UG 5.12 0.241
Eig. Concat. VV , VC UV , UC 13.4 0.573
Eig. Concat. VG UV , UC 11.3 0.531
Eig. Concat. VV , VC UG 7.02 0.378
BW Concat. VV , VC UV , UC 5.45 0.266
Tied FA {VV , VC }T ied UV , UC 5.32 0.268
Stacking
Another approach in the model space consists in stacking the eigenvectors of the subspaces together. In
this approach, the dimension of the model remains constant while the rank of the subspaces increases.
This leads to running one system per event before combining them at the score-level.
• Eigenvector stacking
The advantage of this method is its robustness to different stacking configurations. Indeed, the
latent variable estimation is enriched with the information of other events while keeping a good
estimate for the current event. Let us consider two matrices from the 2-class phone set VV and
VC , and their respective latent variables yv , yc . This approach captures cross-correlation between
phonetic events when estimating the latent components. Stacking the eigenvectors for different events
is equivalent to performing a sum in the supervector space. For the 2-class set, the system is expressed
as:
mh = m0 + VV VC yV yC + UV UC xV xC + DG z (4.4)
The DG matrix is the one from the baseline system. The ranks of the resulting stacked matrices are
240 and 120, for the speaker and the channel respectively.
• Stacking in the speaker space and channel space
Stacking the channel eigenvectors was already demonstrated to be successful for a different set of
microphones [Kenny et al., 2008c]. Stacking the speaker eigenvectors should be suitable for a phonetic
GMM system for two reasons. Firstly, speaker modeling should profit from correlations between
phonetic events. Secondly, using subspaces from all phonetic events when evaluating a single phonetic
event should increase robustness to errors of the phonetic decoder.
Similarly to the concatenation experiments, results in Table 4.5 tend to show that the relevant
information is contained in the speaker space as stacking in the channel space degrades the results.
This means that a global channel matrix can be estimated and successfully applied to all events.
Therefore, we only present this configuration for the 4-class set. Stacking the speaker eigenvectors is
a strategy that outperforms the score-level combination and gives the results similar to the baseline
non-phonetic system. There is no observed improvement by using the 4-class set over the 2-class one.
27
Table 4.5: System combination using stacked eigenvectors for the speaker space, channel space or
both. The matrices selected in each configuration are specified. Results tend to show that the relevant
information is contained in the speaker space, as stacking the speaker loadings gives better results
than the score-level fusion. [SRE 06, all trials, DCF×10, EER(%)]
System Speaker Channel EER mDCF

Baseline VG UG 5.12 0.241
Unstacked VV , VC UV , UC 5.20 0.262
Stacked VG UV , UC 5.34 0.260
Stacked VV , VC UG 5.09 0.247
Stacked VV , VC UV , UC 5.28 0.251
Stacked VV , VSt , VSi , VN V UG 5.03 0.250
• Stacked eigenvoices for the baseline system
In section 4.3.3, we showed that stacking the matrices for each phonetic event was a successful
approach for a phonetic-based system. One disadvantage of this method, compared to the method of
Section 4.3.3, is the need to run one system for each event.
The phonetic subspaces can, however, be used to generate large factor loading matrices. In the
protocol, around 300 speakers are used to train the eigenvoice matrix. This is also the maximum
number of eigenvoices that can be estimated. For the 4-class phone set, the system has a rank of 480
for the speaker space. This number of eigenvectors cannot be estimated from our data set. However, it
is interesting to use this large eigenvoice matrix for the baseline non-phonetic system (channel matrices
are not used here following the results in Table 4.5). Under this scenario, the standard (non-phonetic)
statistics will be presented to the system while the stacked matrices coming from different phonetic
events are used as eigenvoices. The channel matrix used is the one from the baseline system.
Table 4.6: Performance of the stacked eigenvoices generated from different phonetic events on a non-
phonetic system. Stacked eigenvoices from the 4-class set outperform the baseline. [SRE 06, all trials,
DCF×10, EER(%)]
System Speaker EER (%) mDCF

Baseline VG 5.12 0.241
Stacked VV , VC 5.14 0.243
Stacked VV , VN V , VSt , VSi 4.76 0.234
Results in Table 4.6 show that stacking eigenvoices derived from different phonetic events can be
useful for improving performance over the standard baseline system. It may also be that using more
classes may better the performance of the stacked system. Indeed, using the stacked eigenvoices from
the 4-class set outperforms the baseline non-phonetic system and the 2-class system.
4.3.4 Conclusion
This work3 aims to take advantage of the recent developments in Joint Factor Analysis in the context of
a phonetically conditioned GMM speaker verification system. We focused on strategies for combining
the phone-conditioned systems. Our first approach was to perform JFA per class and combine the
systems at the score-level. Our hypothesis is that this approach does not use the data efficiently as the
3
The work, by authors at SRI International, was funded through a development contract with Sandia National
Laboratories (#DE-AC04-94AL85000). The views herein are those of the authors and do not necessarily represent the
views of the funding agencies.
28
performance is worse than the baseline. We later employed strategies in the model space that more
robustly estimate the latent variables by taking into account all phonetic events. In section 4.3.3, we
showed that the concatenation of eigenvectors could lead to decent performance provided that the
subspaces are explicitly retrained on the concatenated statistics. In section 4.3.3, we showed that
both factor concatenation and score-level fusion could be outperformed by stacking eigenvectors from
different phonetic events. For the phonetic system, stacking the eigenvoices leads to the greatest
improvement. We also proposed to use this large set of eigenvoices on the baseline system and showed
that it could result in a slight improvement over the traditional baseline system.
While the focus of this work is on phone-conditioned JFA systems, the implications may lead to a
better understanding of the JFA model and a methodology that can be applied to increase robustness
to other kinds of conditions such as language, gender and microphones. Future work will focus on
understanding the differences and overlaps between the global and per-class estimates, in the channel
and the speaker space, and methods to extract more information for a more robust estimate of speaker
models.
4.4 Within Session Variability Modelling

Recent observations have shown that the current Joint Factor Analysis (JFA) model does not provide
the expected improvements in performance for short utterance lengths that it does for the core NIST
SRE condition using full conversation sides. It is hypothesized in this work that this poor perfor-
mance is the result of deficiencies in the current JFA model particularly with respect to modelling the
unwanted variability present within the session.
Based on these observations, an extended JFA model is introduced in this work to specifically
address the characteristics of verification with short utterances by incorporating explicit modelling of
within-session variability, such as the phonetic information encoded in an utterance.
The following section investigates the effect of verification with short utterances on the standard
JFA approach through the results of recent studies, highlighting the role of session variability and its
dependency on utterance length. Section 4.4.2 then proposes the extended factor analysis model that
incorporates within-session variability modelling to combat the deficiencies of the standard model.
Implementation details and experiments on NIST SRE 2006 data are then presented with a brief
discussion in Sections 4.4.3 and 4.4.4, respectively. Finally a summary and possible future directions
are presented in Section 4.4.5.
4.4.1 Joint Factor Analysis with Short Utterances

Previous work has highlighted some deficiencies with current Joint Factor Analysis models for
shorter utterance lengths. As demonstrated in [Vogt et al., 2008b], as utterance lengths for train-
ing and testing reduce the effectiveness of JFA is also reduced. Table 4.7, with results reproduced
from [Vogt et al., 2008b], shows that, while JFA provides a quite significant performance improvement
for full conversation sides, this improvement certainly diminishes when utterance lengths of 20 seconds
or less are used for training and testing. Furthermore, the inclusion of channel factors in the JFA
model at these short utterance lengths had a significant negative impact on performance, while the
inclusion of speaker factors was still generally beneficial.
Further investigation of JFA with short utterances was pursued in [Vogt et al., 2008a]. In this
investigation, it was found that training the session variability subspace matrix U with utterances of
matched length to the evaluation conditions provides significant improvements, as shown in Table 4.8.
Matching the session variability training data resulted in performance gains for the full JFA model,
incorporating session factors, even with utterance lengths as short as 10 and 20 seconds (Table 4.9).
It is also clear from Table 4.8 that specifically the channel subspace must be trained with matched
29
System 1 conv 60 sec 20 sec 10 sec
Baseline .0442 .0456 .0608 .0752
Speaker only .0422 .0434 .0571 .0727
Session only .0305 .0373 .0702 .0857
Speaker & Session .0295 .0350 .0671 .0880
Table 4.7: DCF on the female subset of the 2005 NIST SRE common evaluation condition for systems
with and without channel compensation. From [Vogt et al., 2008b].
Subspace Training
System V U EER Min. DCF
Full-length 1 conv 1 conv 13.47% 0.0544
Matched 20 sec 20 sec 12.04% 0.0498
Matched Session 1 conv 20 sec 11.70% 0.0493
Table 4.8: EER and minimum DCF on a modified 20 second train/test condition for the female subset
of the 2005 NIST SRE. Results are presented for systems using subspaces trained on different length
segments. From [Vogt et al., 2008a].
conditions rather than the speaker subspace; additionally matching the training for the speaker sub-
space to the evaluation conditions results in degraded performance compared to matching the channel
subspace training alone.
From these results it was concluded that the inter-session variability captured in the subspace of
U is actually dependent on the length of the utterances used to train the subspace. More specifically,
shorter utterances show an increase in overall session variability as shown by the measured trace of
the session subspaces for differing lengths in Table 4.10.
Deficiencies of the JFA Model

The observed behavior of the joint factor analysis with short utterances does not fit well with the
assumptions made by the model.
It has been assumed to this point that the session factors/session subspace capture environmental
effects such as channel, handset and background noise which we take to be constant for the length of a
session. This was the initial intent of including the session variables [Kenny and Dumouchel, 2004]. In
fact, the terms “channel” and “session” have often been considered effectively synonymous, although
this is not technically accurate. The characteristics of these environmental effects should be consistent
regardless of utterance length even if estimating session factors with shorter utterances should lead to
less accurate results.
The improved performance attained with matched session subspaces demonstrates that the
matched subspaces are substantially different at different utterance lengths. These results indicate
that the characteristics of the differences between sessions are also different as the utterance length
change.
System 80 sec 40 sec 20 sec 10 sec

Baseline 0.0442 0.0501 0.0617 0.0753
FA Full-length 0.0238 0.0346 0.0544 0.0797
FA Matched Session 0.0234 0.0337 0.0493 0.0708
Table 4.9: Minimum DCF on the female subset of the 2005 NIST SRE common evaluation for reduced
utterance length conditions. From [Vogt et al., 2008a].
30
Utt. Length 1 conv 80 sec 40 sec 20 sec 10 sec
tr(U U ∗ ) 105.7 116.9 148.8 213.0 329.8
Table 4.10: Trace of the session subspace covariance with U trained with different length utterances.
From [Vogt et al., 2008a].
This is problematic firstly because we are thus required to train specialized session subspaces for
the range of utterance lengths we are interested in using to extract optimal performance from the JFA
model, but, more importantly, it implies that our assumptions about the nature of the inter-session
variability are flawed.
As noted above, the improved performance of the matched session subspace system indicates that
the characteristics are different, but what exactly is this difference? Looking at the increasing session
variability with shorter utterances, it seems that the consistent, stationary environmental factors may
well still be present as utterances become shorter, but an additional source of variability is becoming
more apparent with reducing utterance length.
One hypothesis for this extra captured variability is the variability introduced by the speech con-
tent, that is the phonetic information encoded in the speech. For the text-independent speaker recog-
nition task, phonetic variation is unwanted variability in general (although it is possible to produce
better, more accurate speaker models that are conditioned on phonetic context; this is the approach
taken for the conditioned Factor Analysis work).
The effect of phonetic content of an utterance on a speaker model will be more pronounced as
training utterances become shorter. Over the typical NIST conversation lengths, there is likely to be
a reasonable coverage of the phonetic space and the effects of phonetic variability will largely average
out. For utterances of only a few seconds in length, however, there will be very poor coverage of the
phonetic space, and differences in the particular observed phones will cause large differences in the
produced speaker model estimate.
4.4.2 Extending JFA to Model Within-Session Variability

This work extends the current JFA model. The goals of extending the model are two-fold:
Firstly, to produce better performance from the JFA model in the specific case of short utterances,
through using a model that better fits the observed reality.
And secondly, to construct a JFA model that can be used for a wide range of utterance lengths
without adjustment, that is, to avoid the issues of retraining the session subspace for different length
training and testing utterances. This is particularly relevant if an evaluation or application has mixed
utterance lengths or the utterance length is not known a priori. We want to effectively enhance the
JFA model by making it independent of utterance length.
With these goals in mind, the approach taken in this work is to extend the JFA model by separating
the sources of session variability—a collective term for all information not useful for identifying a
speaker—into distinct sources of inter -session variability and within-session variability.
In this extended model, inter-session variability is modelled as an offset U I x to the GMM mean
supervector, as in the standard JFA model, where U I is equivalent to U , except that U I x is intended
to strictly represent only constant environmental effects such as handset and channel. A goal therefore
is to train U I in such a way as to capture only stationary environmental effects that are independent
of utterance length.
Additionally, within-session variability is modelled over a shorter time span than the inter-session
variability to capture and remove transient effects within an utterance. Following the hypothesis
above, these transient effects are expected to be dominated by phonetic variability, although this
restriction is not enforced. The within-session variability is modelled by splitting a utterance into a
series of short segments and estimating an additional GMM mean offset U W wn for each short segment
31
n.
Including within-session variability, the complete model for a short segment n of an utterance is
sn = m + V y + dz + U I x + U W wn .
While y, z and x are all held constant for the entire utterance, there will be an independent wn for
each short segment n.
Choice of Short Segmentation

An important consideration for this extended JFA model including within-session variability modelling
is the choice of method for segmenting an utterance into short segments. Ideally, the segments should
be short enough to capture the relevant sources of the within-session variability but must also be long
enough to adequately estimate wn for each segment. Computational load is also a consideration, as
increasing the number of segments also increases the number of within-session factors, wn , which must
be estimated for an utterance.
Following the hypothesis that phonetic variability is the dominant source of within-session vari-
ability, this work explored modelling within-session variability for short segments that are aligned with
open-loop phone recogniser (OLPR) transcripts. Using this alignment, there is a one-to-one mapping
from each phone instance in the OLPR output transcript to a short segment.
The OLPR transcripts derived from BUT Hungarian phone recognition sys-
tem [Schwarz et al., 2004]. This phone recogniser has previously been shown to be effective for
a number of applications including speech activity detection and language recognition.
A brief analysis of the OLPR transcripts for Mixer data used in recent SRE’s revealed that this
phone recogniser produced phone labels at approximately 10 phone events per second of active speech.
This rate roughly translates into 1,000 phone events for a full NIST conversation side or 100 events
for a 10-second segment.
It is important to note that there is no conditioning or separate modelling based on the actual
phone labels produced by the recogniser. The phone labels are simply used to chop an utterance into
short segments based on the start and end time of each phone event. The actual phone labels are
completely disregarded.
There are other potential methods of segmenting an utterance which may deserve pursuing. One
potential option is to simply segment the active speech of an utterance at regular intervals, giving
a sequence of segments of the same length. A reasonable range of segment lengths might be 0.1–1
seconds, which corresponds to approximately 10–100 speech frames. A regular segmentation scheme
such as this has the advantages of not requiring any external dependencies such as an OLPR as well
as a consistent length of segment which may be helpful in estimating wn for each segment.
Another possibility is aligning segments with syllables instead of phones, allowing the segments to
be centred around high-energy syllable nuclei. This may provide better quality estimates of wn due
to cepstral representations of high-energy voiced speech generally being less effected by environmental
effects. A syllable-aligned segmentation would obviously also require syllable transcripts or some other
method of recognising syllable events such as [Dehak et al., 2007].
4.4.3 Implementation
Several systems were developed for comparison in this work:
1. A baseline JFA system, using the standard JFA model.
2. A standard JFA system with U specialised for differing utterance lengths.
3. A system implementing the extended JFA model incorporating the within-session variability
modelling.
32
Details of these systems are presented in the following sections.
Baseline JFA System

The baseline system for this evaluation implemented the standard JFA model introduced by Kenny,
et al. [Kenny and Dumouchel, 2004] for speaker modelling. This implementation was based on
the “small” BUT system comprising 512-component, gender-independent GMM models with 39-
dimensional MFCC-based features. Details of the features and UBM training data are given
in [Burget et al., 2007].
For simplicity and efficiency, this system implemented dot product scoring for the verification
trials as described for the SUN/SDV submission for SRE’08 [Strasheim and Brümmer, 2008]. Channel
compensation was also applied to the statistics for both training and testing utterances in the manner
described in [Strasheim and Brümmer, 2008]. ZT-norm was also applied using BUT’s “standard”
lists.
The JFA parameters for the baseline system were trained on a small subset of the BUT
lists for FA training. Specifically, U and V were trained on the SRE04 utterances in the list
fa_train_eigenchannels.scp while d was trained on fa_train_d.scp as usual. With the reduced
number of utterances and unique speakers for training the subspaces, the subspace dimensions were
limited to 100 speaker factors and 50 session factors.
While this baseline system does not provide world-beating performance due to the limited FA
training data and smaller models, it is expected to be representative of a larger state-of-the-art sys-
tem. The choice to use a configuration with reduced FA training data was made in order to provide
representative performance and a maximum throughput of experiments given the limited time-frame.
Matched U to Utterance Length

This system is identical to the baseline system in most aspects. The only difference is in the data used
to train the session subspace transform U . Additional matrices U were trained using the same utter-
ances as the baseline system, except the utterances were truncated in length to match the anticipated
utterance lengths to be used in the experiments. Matrices for 20-second and 10-second conditions
were produced for this system. V and d were unchanged.
Extended JFA System Incorporating Within-Session Modelling

Training the subspaces is expected to be at least as important for the extended JFA model as it is
with the standard JFA model. Many potential options exist for the order in which the subspaces are
estimated and whether joint or separate optimization is better, as well as the question of how to split
the utterances into short segments. Due to time constraints and the computational cost involved in
this estimation process, few of these options were examined.
The approach taken in these experiments was the simplest and involved the least changes from
the baseline system. Firstly, the parameters of the standard JFA model (U I , V and d) were trained
exactly as in the baseline system. These were therefore identical to the baseline system. Following this,
the additional within-session subspace, U W , was trained on a subset of approximately 100 utterances
(2 from each speaker in the fa_train_d.scp list). This training process is analogous to the training
of U I , except that U W is trained to capture the dominant directions of differences between the short
segments of the training utterances.
The transcripts from the Hungarian OLPR provided the segment alignment for the utterances
used in estimating U W . It is therefore expected that the within-session variability captured through
this procedure will be dominated by phonetic information,4 however, it is also reasonable to expect
4
As an interesting aside, it may be possible to retrieve the phone label for each segment based on the corresponding
estimate of w. This has not been investigated to date, but may provide insight into the validity of the assumption that
33
Approx Magnitude of JFA Subspaces
2
10
Speaker
Inter−Session
Phonetic
1
10
Eigenvalues
0
10
−1
10
−2
10
0 1 2
10 10 10
Figure 4.5: Leading eigenvalues of the speaker, inter-session and within-session variability. The within-
session subspace was trained on segments aligned to open-loop phone recogniser transcripts.
variation in actual realisations of phones will also be present.

The leading eigenvalues of the speaker, inter-session and within-session subspaces resulting from
this training procedure are plotted in Fig. 4.5. The within-session variability is evidently very high
and substantially greater than both the speaker and inter-session variability. Based on an approximate
average length for the short segments of 0.1 seconds, the effective within-session variability over the
utterance for a variety of utterance lengths is depicted in Fig. 4.6, again showing the speaker and
inter-session variability for comparison. As the utterance length increases to a full conversation side of
approximately 100 seconds, the effect of within-session variability to the utterance as a whole will be
effectively negligible, as expected, due to the averaging effect and sufficient coverage of the phonetic
space.
During both speaker model training and testing, the effect of within-session variability was removed
in an analogous fashion to the inter-session variability of the baseline system: For each short segment
n of an utterance, the within-session factors wn were estimated and the sufficient statistics compen-
sated as in [Strasheim and Brümmer, 2008]. The compensated statistics for all of the segments in an
utterance were then summed to give utterance statistics with within-session effects removed. These
within-session-compensated statistics were then used in the same way as the usual utterance statistics
in the baseline system.
4.4.4 Experiments
A system implementing the extended JFA model with within-session variability was evaluated and
compared against the standard and matched-U JFA systems on the NIST SRE 2006 core, 1conv4w-
1conv4w condition. To investigate the performance of the systems with reduced utterance lengths, this
same 1conv4w-1conv4w condition was again utilised however both the training and testing utterances
were truncated to produce shorter utterances. 20-second and 10-second conditions for both training
the within-session variability is dominated by phonetic information.
34
Approx Magnitude of JFA Subspaces
Speaker
Inter−Session
Phonetic−−1 sec
Phonetic−−10 sec
1
10 Phonetic−−Conv side
0
10
Eigenvalues
−1
10
−2
10
0 1 2
10 10 10
Figure 4.6: Approximate effective within-session variability for a range of utterance lengths compared
to speaker and inter-session variability.
JFA Model Dims 1 conv 20 sec 10 sec

U +V +d 50 3.10% 12.79% 20.21%
U +V +d 60 3.03% 13.01% 20.31%
U Matched + V + d 50 3.10% 12.20% 19.71%
UI + UW + V + d 50I + 10W 2.97% 11.98% 19.67%
Table 4.11: Comparison of EER performance for the standard JFA model, matched-length session JFA
model and the extended JFA model incorporating within-session variability modelling on the SRE 06
common evaluation condition.
and testing were added in this way. From the results in [Vogt et al., 2008b] and [Vogt et al., 2008a],
this 10 to 20 seconds range appears to be the range at which the effectiveness of the standard JFA
model is diminishing.
Tables 4.11 and 4.12 present EER and minimum DCF results, respectively, comparing the variants
of the JFA model for the full conversation side and 20- and 10-second truncated training and testing
conditions. All results are English-language trials only. The first and second rows in each table use
the standard JFA model with 50 and 60 session factors, respectively. The third row show the results
with U matched to the length of utterance used for training and testing. The last row of each table
includes within-session variability modelling with 50 inter-session factors and 10 within-session factors.
As reported in [Vogt et al., 2008a], matching U to the evaluation conditions provides an advantage
over the standard JFA model. The matched system provided better performance in all short conditions
over the baseline although the improvement for the 10-sec condition are quite modest.
Incorporating within-session variability modelling largely produced similar results to the matched-
U approach, improving on the standard JFA system for all shortened utterances. Additionally, at the
EER operating point this approach gave the best performance at each utterance length, although only
by a small margin. Results were less clear-cut when measured by minimum DCF.
From these results it can be seen that the introduction of within-session factors at least achieved
35
JFA Model Dims 1 conv 20 sec 10 sec
U +V +d 50 .0159 .0561 .0819
U +V +d 60 .0156 .0562 .0820
U M atched + V + d 50 .0159 .0531 .0814
UI + UW + V + d 50I + 10W .0170 .0541 .0807
Table 4.12: Comparison of minimum DCF performance for the standard JFA model, matched-length
session JFA model and the extended JFA model incorporating within-session variability modelling on
the SRE 06 common evaluation condition.
JFA Model Dims 20 sec 10 sec

U +V +d 50 6.12% 9.59%
U Matched + V + d 50 6.39% 10.13%
U Stacked + V + d 100 5.91% 9.54%
UI + UW + V + d 50I + 10W 5.85% 9.59%
Table 4.13: Comparison of EER performance for the standard JFA model, matched-length session JFA
model, a stacked session model and the extended JFA model incorporating within-session variability
modelling on the SRE 06 common evaluation condition with whole conversation side training and
truncated utterances for testing.
one of the stated goals of producing a system that could be effective over a wide range of utterance
lengths. While the matched system used a distinct U matrix for each utterance length tested, the
parameters of the within-session modelling system were consistent across all trials. Thus, the within-
session modelling approach provides a practical advantage over the standard JFA model through its
flexibility.
The second goal of improving performance through more accurately modelling the unwanted vari-
ability has not been convincingly achieved with these results. Several factors may contribute to this
outcome. Firstly, the choice of segmentation may not be optimal, but more importantly, the approach
to estimating the subspaces of the extended model used for these experiments was not at all tailored
to the extended model. It is expected that the extended model should at the very minimum require
adjustment to the values of d as less information will be explained as “residual” variability with the
inclusion of within-session factors. The effects of including within-session modelling on the speaker
and inter-session subspaces must also be investigated. Future investigation of segmentation choice
and proper integration of within-session modelling in the subspace estimation process may lead to
significant improvements in performance of this extended model.
Following on from the above results where utterances of the same length were used for both training
and testing. An added complication is introduced when the training and testing utterance lengths
differ. In this case, the optimal matrix U is different for training and testing. Tables 4.13 and 4.14
present results evaluated with a whole conversation for training and 20 or 10 second testing utterances.
Again in these tables the first row is the baseline approach using standard JFA model. The results
in the second represent a system with U matched to the utterance length for both training and testing.
In this case, due to the full conversation for training and truncated utterances for testing, U differs
from training to testing. Interestingly, while the matched-U approach worked quite well with the
same utterance lengths for both training and testing, it causes a degradation in performance in all
measures compared to the baseline system. Mismatch between the U for training and testing is the
most likely cause of this performance degradation.
To overcome the issue of differing U between training and testing while matching the session
subspace to the utterance length, a stacking approach was investigated. Under this approach, a larger
session subspace was constructed by concatenating the two session matrices matched to both the
36
JFA Model Dims 20 sec 10 sec
U +V +d 50 .0293 .0433
U Matched + V + d 50 .0305 .0441
U Stacked + V + d 100 .0275 .0421
UI + UW + V + d 50I + 10W .0290 .0414
Table 4.14: Comparison of minimum DCF performance for the standard JFA model, matched-length
session JFA model, a stacked session model and the extended JFA model incorporating within-session
variability modelling on the SRE 06 common evaluation condition with whole conversation side training
and truncated utterances for testing.
training and testing conditions. That is, for a 1conv training, 10 second test condition, the U used for
training and testing consists of concatenated matrices matched to the 1 conv and 10 second utterance
lengths. This approach has been successfully employed previously for mixed telephone and distant
microphone conditions in recent SRE.
The third row of Tables 4.13 and 4.14 demonstrate that this stacking approach provides an im-
provement in all cases over the baseline system, regaining the advantage of the matched approach
observed previously, although, again these gains are modest.
Finally, the last row in Tables 4.13 and 4.14 present the performance of incorporating within-session
modelling. As with the stacking approach, the extended model provides improved performance over
the baseline system in all cases, except for the 10-sec EER where the two are equivalent. The extended
approach is also competitive with the stacked approach as they each provide the best performance
depending on the condition and performance measure.
The results for these experiments again highlight the ability for the extended JFA model to provide
competitive performance across a wide range of operating conditions without having to adjust model
parameters. This flexibility is a major advantage of this approach, especially for situations in which
it is not possible to know the training and testing utterance lengths prior to evaluation or, as in this
case, the utterance lengths are not consistent for training and testing.
4.4.5 Summary and Future Directions

This work motivated and presented an extension to the joint factor analysis model to include modelling
of unwanted within-session variability. This extension was particularly motivated by observations of
relatively poor and ineffective performance of the standard JFA model for short utterance lengths.
The inclusion of within-session variability modelling was particularly intended to compensate for the
effects of poor and uneven phonetic coverage for short utterances by modelling and removing the
effects of phonetic variation over short segments of each utterance.
The goals of the extended model were; (a) to produce better performance from the JFA model in
the specific case of short utterances by using a model that better fits observed behaviour, and (b) to
produce a flexible JFA model that would be equally effective over a wide range of utterance lengths
without adjusting model parameters such as retraining session subspaces.
Experimental results demonstrate the flexibility of the extended JFA model by providing com-
petitive results over a wide range of utterance lengths and operating conditions without need for
adjusting any of the model parameters. While modest performance improvements were also observed
in a number of conditions over current state-of-the-art, further work is necessary to demonstrate that
significant performance improvements are achievable through this extended model.
Future work on this approach is expected to focus on two areas, the optimal method of segmenting
utterances and better integration of within-session variability in estimating the parameters of the JFA
model. Possible candidates for segmentation method include aligning with syllables or syllable-like
units to provide slightly longer segments and potentially better estimates of within-session factors (wn )
37
and fixed length segments in the range of 0.1–1 seconds of active speech to provide more consistency in
the estimates of wn and less dependency on speech recognition tools. Methods of incorporating within-
session variability in estimating the speaker and inter-session subspaces will also be examined. Previous
work has shown that slight variations in the subspace estimation procedure can make significant
performance differences for the standard JFA model; it is likely that this effect is exacerbated for the
extended model.
4.5 Multigrained Factor Analysis

4.5.1 Introduction
Recent efforts in speaker recognition research have focussed on reducing the effects of session vari-
ability. The more recent papers in this research area attempt to compensate a high-dimensional
utterance representation. These methods include Factor Analysis (in both the model [Kenny, 2006,
Vogt et al., 2005] and feature [Vair et al., 2006] domains), Within-Class Covariance Normalisation
(WCCN) [Hatch et al., 2006] and Nuisance Attribute Projection (NAP) [Solomonoff et al., 2005].
These techniques are noted to be very effective within a Gaussian Mixture Model (GMM) frame-
work. Each of these methods make particular assumptions; such as assuming that signal distortions
do not span across multiple mixture components. We attempt to address this constraint by examining
the granularity of the distorted feature space.
We investigate this problem from the perspective of two similar modelling approaches; they are,
Factor analysis [Kenny, 2006] and NAP [Campbell et al., 2006a]. Factor analysis decomposes the
statistics of an utterance into components relating to the speaker and the channel/audio-environment.
The NAP approach models and eliminates directions of variability in the model space that are con-
sidered harmful to classification performance. The assumption for both the GMM-kernel based NAP
approach [Campbell et al., 2006a] and the factor analysis variant is that the session variability is con-
fined to distortions within the span of its corresponding mixture component. More specifically, for
a feature based interpretation, the most that a feature vector will be compensated by is an offset
spanned by the operating space of the corresponding Gaussian mixture component. In addition, fea-
ture distortions that exist at a much smaller scale than the span of the mixture component will not
be sufficiently described.
In this work, we propose the use of a multi-grained approach whereby the compensated statistics
generated by a low-complexity GMM-NAP structure are used in a higher-complexity compensation
system. Section 4.5.2 presents the proposed multi-grain approach that compensates for distortions of
differing structural detail. Section 4.5.3 follows with a presentation of the results and then leads into
the conclusions.
4.5.2 Multi-Grained Approach

We propose a multi-grained approach as a simple means to mitigate assumptions of granularity. To
motivate the approach given later in this section, we include an example indicating a need for the
approach. Figure 4.7 shows two Gaussian mixture models with a different number of mixture com-
ponents. The model to the left is a GMM with a lower complexity than the GMM on the right. In
the figure, the data is represented by square dots while the distortions (hypothetically applied) are
represented by arrows. When feature compensation is performed, the amount by which a single fea-
ture vector may be compensated is generally governed by the reach of the most significant Gaussian
mixture component. A Gaussian’s reach is used here to describe the space of features vectors that will
significantly affect or be influenced by that mixture component. Note that this description is specific
to the formulation of the model.
Through a thought experiment, it is apparent that a feature vector that will be compensated by
38
Figure 4.7: A hypothetical scenario of two different complexity models used to account for feature
distortions.
the GMM on the left could potentially be modified to a greater extent than by the GMM on the right.
In addition, the GMM on the left may be able to compensate for significantly large channel effects,
while the GMM on the right would tend to compensate for more fine-grained distortions.
A multi-grained model may be able to compensate for session effects that cause large regional
variability and yet also handle small localized distortions. The work regarding the multi-grained
analysis is inspired by Chaudhari et al [Chaudhari et al., 2000].
We apply a multi-grained framework to both the NAP and the factor analysis approaches to
demonstrate their utility. Please refer to [Campbell et al., 2006a] for the NAP specific details of the
method and to [Kenny, 2006] for the factor analysis specifics. In brief, Figure 4.8 shows the general
process. A low complexity NAP-GMM setup is used to transform the raw features to produce Level-1
(L1) features. These features are further transformed by a more complex Level-2 (L2) NAP-GMM
structure. Note that the features generated by the L2-model may be passed into an SVM or other
alternate classifier.
Although this diagram indicates the case for feature based compensation of the NAP statistics
using models of differing granularity, it was also applied such that the sufficient statistics could be
changed using a factor analysis type model (see Figure 4.9).
Multi-grained NAP feature compensation
With the previous method, NAP was applied in two stages, first for enhancing the features using a
low complexity model, and second, a higher complexity model to perform an additional NAP trans-
formation on the sufficient statistics that were then scored.
For the NAP case, a feature vector, x, may be compensated to give, x̂, according to the following
equation ([Vair et al., 2006]).
XN
x̂ = x − P r(i|x)S i (4.5)
i=1
where
wi g(x|µi , Σi )
P r(i|x) = PN (4.6)
j=1 wj g(x|µj , Σj )
The conditional probability of mixture component i given observation x is given by P r(i|x). The
function g(x|µi , Σi ) is the multi-variate probability density function of feature vector x for a given
mean (µi ) and diagonal covariance (Σi ) of mixture component i.
39
The parameter vector, S i , represents the session nuisance contribution for mixture component i
and may be calculated as follows:
1 1
S i = √ Σi2 Vi V T φ (4.7)
wi
Note that Vi is the sub-matrix of V referring to the nuisance contribution of mixture component
i. Specifically, Vi refers to row [(i − 1) × d + 1] through to row [i × d] of matrix V . In addition, the
L2 NAP-GMM is directly scored rather than being used for generating additional features.
Figure 4.8: The procedure for performing a two-step Nuisance Attribute Projection.
Multi-grained FA statistics compensation

With the statistics compensation approach, factor analysis is applied to all broad-phone classes to
determine the session subspace contribution. Following this, another factor analysis model is created
for each broad-phone class whereby the previous session subspace contribution is already removed5 .
Compensate Compensate
Raw Find FA Find FA
For Session For Session
Features Statistics Statistics Use of FA
Statistics Statistics
Compensated Statistics
L1 STP
VOW
L2 In Kernel or Model
ALL Global Phone-Class
SPEECH Model NON Specific
SIB
Model
Figure 4.9: The procedure for performing a two-step factor analysis compensation, firstly using the
statistics over the entire utterance followed by the phone-group specific compensation.
4.5.3 Results
The speaker recognition system is based on a GMM based kernel structure [Reynolds et al., 2000,
Campbell et al., 2006a]. All output scores have ZT-Norm [Reynolds et al., 2000,
Auckenthaler et al., 2000] (enrollment and test utterance score normalization) applied.
We first evaluate the multigrained NAP feature compensation approach. This consists of a 256
mixture component NAP system that is used to transform the cepstral-based features for use by a 1024
mixture component secondary NAP system. The secondary NAP system also incorporates the scoring
(dot-product) component. The results of the multigrained NAP feature compensation approach are
presented in Table 4.15 for the NIST 2008 SRE. Conditions 7 and 8 represent the telephony audio
5
The equations are omitted at this time.
40
Table 4.15: The NIST 2008 results with and without the multi-grained analysis.
Condition Condition
Task
7 8
Base System with
0.179 0.182
NAP
Base System with
Multigrained NAP 0.175 0.166
Broad Phone
0.212 0.209
System with NAP
Broad Phone
System with 0.206 0.190
Multigrained NAP
Table 4.16: The NIST 2006 results with and without the multi-grained analysis compared for broad
phonetic groupings.
Phone DET 3 - Base DET 3 - ZTNorm
Type Min DCF EER Min DCF EER
Baseline NonVowel 0.0888 24.04% 0.0413 9.05%
Sibilant 0.0988 30.28% 0.0584 13.05%
Stop 0.0993 33.33% 0.0631 13.81%
Vowel 0.0604 11.26% 0.0201 3.97%
Hierarchical NonVowel 0.0852 23.24% 0.042 9.53%
Sibilant 0.0994 28.93% 0.0585 14.20%
Stop 0.0991 33.27% 0.0655 14.63%
Vowel 0.0482 10.29% 0.0206 3.91%
Baseline Consonant 0.0839 20.48% 0.0323 6.28%
Vowel 0.0604 11.26% 0.0201 3.97%
Hierarchical Consonant 0.0777 18.26% 0.0312 6.45%
Vowel 0.0482 10.29% 0.0206 3.91%
evaluation for all English trials and all native English trials respectively. This table includes results for
two types of systems; that is a standard GMM system and a broad phone system. The standard GMM
system is of the same configuration as mentioned earlier in the paragraph. The broad phone system
consists of using the compensated features generated by the 256 mixture component system and then
having broad-phone models (which are also NAP compensated) trained on these new features. The
scores from the broad phone models are combined in late fusion. The results show a small improvement
in the performance of the multigrained NAP systems over the standard NAP baselines.
In another experiment (see Table 4.16), using the NIST 2006 SRE, the multigrained FA system
was evaluated. This consisted of compensating the sufficient statistics using a general GMM and then
further compensating the statistics using a broad-phone specific system. The multigrained approach
demonstrated consistent improvements for the ‘DET 3-Base’ result. Once ZT-Norm was applied (‘DET
3-ZTNorm’), the observed benefits were lost. Note also that the fusion of multiple phone systems did
not demonstrate an improvement. Effectively compensating the sufficient statistics of the FA model
in a multigrained manner seems to be a challenging task at this time.
4.5.4 Conclusions
This section presented a multi-grained approach to address certain limitations in current compensation
models. Results indicate some gains with the potential for the method being applied to other session
modelling approaches.
4.6 Summary
This chapter presented some of the efforts performed at the JHU workshop on the topic of Fac-
tor Analysis Conditioning. The work covered four main areas; a phonetic analysis, Factor Analysis
41
Combination Strategies, Within Session Variability Modelling and Multigrained Factor Analysis.
Results demonstrate that a conditioned FA model can provide improved performance and score
level combination may not always be the best method. Including Within-Session factors in an FA
model can reduce the sensitivity to utterance duration and phonetic content variability. Stacking
factors across conditions or data subsets can provide additional robustness. Hierarchical modelling for
NAP/Factor Analysis also shows promise. These approaches also have applicability to other condition
types such as different languages and microphones types.
42
Chapter 5
Support vector machines and joint

factor analysis for speaker verification
This chapter presents several techniques to combine between Support vector machines (SVM) and
Joint Factor Analysis (JFA) model for speaker verification. In this combination, the SVMs are ap-
plied in different sources of information produced by the JFA. These informations are the Gaussian
Mixture Model supervectors and speakers and Common factors. We found that the use of JFA factors
gave the best results especially when within class covariance normalization method is applied in the
speaker factors space in order to compensate for the channel effect. The new combination results are
comparable to other classical JFA scoring techniques.
5.1 Introduction
During the last three years, the Joint Factor Analysis (JFA) [Kenny et al., 2008d] approach became the
state of the art in the speaker verification field. This modeling was proposed in order to deal with the
speaker and channel variability in the Gaussian Mixture Models (GMM) [Douglas A. Reynolds, 2000]
framework.
At the same time the application of the Support Vector Machine (SVM) in GMM supervector
space [Campbell et al., 2006a] got interesting results, especially when the nuisance attribute projection
(NAP) was applied to deal with the channel affect. In this approach, the kernel used is based on a
linear approximation of the Kullback Leibler (KL) distance between two GMMs. The speaker GMM
means supervectors were obtained by adapting the Universal Background Model (UBM) supervector
to speaker frames using Maximum A Posterior (MAP) adaptation [Douglas A. Reynolds, 2000].
In this paper, we propose to combine the SVM with JFA. We tried two types of combination; the
first one uses the GMM supervector obtained with JFA as input to the SVM using the classical linear
KL kernel between two supervectors. The second, rather than using the GMM supervectors as features
for the SVM, directly uses the information given by the speaker and common factors components (see
section 5.2) defined by the JFA model.
The outline of the paper is as follows. Section 5.2 describes the factor analysis model. In section
5.3, we present the JFA-SVM approach and we describe all the kernels used to implement it. The
comparison between different results is presented in section 5.5. Section 5.6 concludes the paper.
5.2 Joint Factor analysis

Joint factor analysis is a model used to treat the problem of speaker and session variability in GMM’s.
In this model, each speaker is represented by the means, covariance, and weights of a mixture of
C multivariate diagonal-covariance Gaussian densities defined in some continuous feature space of
43
dimension F . The GMM for a target speaker is obtained by adapting the UBM means parameters
(UBM). In joint factor analysis [Kenny et al., 2008d, Kenny et al., 2007b, Kenny et al., 2007a], the
basic assumption is that a speaker and channel dependent supervector M can be decomposed into a
sum of two supervectors: a speaker supervector s and a channel supervector c
M =s+c (5.1)
where s and c are normally distributed.

In [Kenny et al., 2008d], Kenny et al. described how the speaker dependent supervector and
channel dependent supervector can be represented in low dimensional spaces. The first term in the
right hand side of (5.1) is modeled by assuming that if s is the speaker supervector for a randomly
chosen speaker then
s = m + dz + V y (5.2)
Where m is the speaker and channel independent supervector (UBM), d is a diagonal matrix, V is a
rectangular matrix of low rank and y and z are independent random vectors having standard normal
distributions. In other words, s is assumed to be normally distributed with mean m and covariance
matrix V V t + d2 . The components of y and z are respectively the speaker and common factors.
The channel-dependent supervector c, which represents the channel effect in an utterance, is as-
sumed to be distributed according to
c = ux (5.3)
Where u is a rectangular matrix of low rank, x is distributed with standard normal distribution.
This is equivalent to saying that c is normally distributed with zero mean and covariance uut . The
components of x are the channel factors in factor analysis modeling.
5.3 SVM-JFA
The SVM is a classifier used to find a separator between two classes. The main idea of this classifier
is to project the input vectors into a high dimension space called feature space in order to find linear
separation. This projection is carried out using a mapping function. In practice, SVMs use kernel
functions to perform the scalar product computation in the feature space. These functions allow us
to compute directly the scalar product in the feature space without defining the mapping function.
In this section, we will present several ways to carry out the combination between the
SVM and JFA. The first approach is similar to the classical SVM-GMM [Campbell et al., 2006a,
Campbell et al., 2006b] when the speaker GMM supervectors are used as input to SVM. The second
set of method that we tested, consist of designing new kernel using the speaker factors or speaker and
common factors depending on the configuration of the JFA model.
5.3.1 GMM Supervector space

In order to apply SVM with JFA using speaker supervector as input, we used the classical linear
Kullback- Leibler kernel. This kernel applied in GMM supervector space is based on Kullback-Leibler
divergence between two GMMs [Campbell et al., 2006a]. This distance corresponds to Euclidean dis-
tance between scaled GMM supervectors s and s0 .
C
X t
De2 0
wi si − s0i Σ−1 si − s0i

s, s = i (5.4)
i=1
where wi and Σi are the ith UBM mixture weights and diagonal covariance matrix, si corresponds to
the mean of Gaussian i of the speaker GMM. The derived linear kernel is defined as the corresponding
44
inner product of the preceding distance
C t
0
X √ −1 √ −1
wi Σi 2 s0i

Klin s, s = wi Σi 2 si (5.5)
i=1
This kernel was proposed by Campbell et al. [Campbell et al., 2006a].
5.3.2 Speaker factors space

In this part of the paper, we discuss the use of speaker factors as parameters input to SVM. The
speaker factors coefficients correspond to coordinate of the speaker in the speaker space defined by the
eigenvoices matrix. The advantage of using speaker factors is that these vectors are of low dimensions
(typical dimension = 300), making the decision process faster. We tested these vectors with three
classical kernels which are linear, Gaussian and cosine kernels. These kernels are respectively given
by the following equations:
k(y1 , y2 ) = hy1 , y2 i (5.6)

1 2
k(y1 , y2 ) = exp − 2 ky1 − y2 k (5.7)
2.σ
hy1 , y2 i
k(y1 , y2 ) = (5.8)
ky1 k ky2 k
The motivation of using the linear kernel is that the speaker factors vectors are normally distributed
with zero mean and identity variance matrix. In order to obtain the speaker factors for this system,
we used the JFA configuration which has the speaker and channel factors only. There are no common
factors (z) (see equation 5.2).
Within Class Covariance Normalization

In this new approach, we proposed to make another channel compensation step in the speaker
factors space. The first step is carried out by estimating the channel factors in GMM super-
vectors space. To achieve this compensation, two choices are possible, the first one is NAP
[Campbell et al., 2006a] algorithm and the second is the Within Class Covariance Normalization al-
gorithm (WCCN) [Hatch et al., 2006]. We decided to apply WCCN algorithm rather than NAP
because NAP algorithm realizes channel compensation by removing the nuisance directions; however
the speaker factors are vectors of low dimension so removing additional directions could be harmful.
The WCCN algorithm uses the Within Class Covariance (WCC) matrix to normalize the kernel
functions in order to compensate for the channel factor without removing any directions in the space.
WCC matrix is obtained by the following formula:
S ns
1X 1 X
W = (yis − ys ) (yis − ys )t (5.9)
S ns
s=1 i=1
where ys = n1s ni=1

P s s
yi is mean of speaker factors vectors of each speaker, S is the number of speaker
and ns is number of utterances for each speaker s.
The WCCN algorithm was applied to the linear and cosine kernels. The new versions of the two
previous kernels are given by the following equations:
k(y1 , y2 ) = y1t W −1 y2 (5.10)

y1t W −1 y2
k(y1 , y2 ) = (5.11)
y1 W y1 y2t W −1 y2
t −1
45
5.3.3 Speaker and Common factors space
In the case when we have the speaker and common factors, we proposed and compared two techniques
to combine these two sources of information. The first approach is to apply SVM in each space (speaker
factors space and common factors space). Thereafter we make linear scores fusion of these two SVMs
scores. The fusion weights are obtained by using a logistic regression [Brümmer et al., 2007]. The
second approach is to define a new kernel which is the linear combination of two kernels. The first
kernel is applied in the speaker factors space the second kernel is applied in the common factors space.
The kernels combination weights are fixed in order to maximize the margin between target speaker and
impostors utterances. This technique was already applied in speaker verification [Dehak et al., 2008].
5.4 Experimental setup

5.4.1 Test set
The results of our experiments are reported in the core condition of the NIST 2006 speaker recognition
evaluation (SRE) dataset1 . In the case of score fusion system, we trained the score fusion weights on
NIST 2006 SRE dataset and we tested the systems on the telephone data of the core condition of the
NIST 2008 SRE.
5.4.2 Acoustic features

In our experiments, we used cepstral features, extracted using a 25ms Hamming window. 19 mel
frequency cepstral coefficients together with log energy are calculated every 10ms. This 20-dimensional
feature vector was subjected to feature warping [Pelecanos and Sridharan, 2001] using a 3s sliding
window. Delta and double delta coefficients were then calculated using a 5 frames window giving a
60-dimensional feature vectors. These feature vectors were modeled using GMM and factor analysis
was used to treat the problem of speaker and session variability.
5.4.3 Factor analysis training

We used gender independent Universal Background Models which contains 2048 Gaussians. This
UBM was trained using LDC releases of Switchboard II, Phases 2 and 3; switchboard Cellular, Parts
1 and 2 and NIST 2004-2005 SRE. The (gender independent) factor analysis models were trained on
the same quantities of data as the UBM.
The decision scores obtained with the factor analysis were normalized using zt-norm normalization.
We used 148 male and 221 female t-norm models and we used 159 male and 201 female z-norm
utterances.
We used two factor analysis configurations. The first JFA is composed by 300 speaker factors
and 100 channel factors only and in the second configuration is the full configuration; we added the
diagonal matrix (d) in order to have speaker and common factors.
5.4.4 SVM impostors

We used 1875 gender independent impostors to train the SVM model. These impostors are taken
from LDC releases of Switchboard II, Phases 2 and 3; switchboard Cellular, Parts 1 and 2 and NIST
2004-2005 SRE.
1
http://www.nist.gov/speech/tests/spk/index.htm
46
Table 5.1: Comparison results between SVM-JFA in GMM supervectors space and JFA frame by
frame scoring. The results are given on EER in the core condition of the NIST 2006 SRE.
System English All trials

JFA: s = m + V y 1.95% 3.01%
JFA: s = m + V y + dz 1.80% 2.96%
SVM-JFA: s = m + V y 4.24% 4.98%
SVM-JFA: s = m + V y + dz 4.23% 4.92%
Table 5.2: Comparison results between SVM-JFA in speaker factor space and GMM supervectors
space. The Results are given on EER in the core condition of the NIST 2006 SRE.
English All trials

No-norm T-norm No-norm T-norm
KL-kernel: - 4.24% - 4.98%
GMM super-
vectors
Linear kernel 3.47% 2.93% 4.64% 4.04%
Gaussian kernel 3.03% 2.98% 4.59% 4.46%
Cosine kernel 3.08% 2.92% 4.18% 4.15%
5.4.5 Within Class Covariance

The gender independent within class covariance matrix is trained in the same dataset as the JFA
training.
5.5 Results
5.5.1 SVM-JFA: GMM supervector space
We start with the results obtained by the combination SVM-JFA when the GMM supervectors are
used as input to the SVM. We used GMM supervector obtained using both JFA configurations (with
or without Common factors). The results are given in Table 5.1. These results are compared to the
frame by frame JFA scoring techniques.
The results show that the performances of the application of the SVM in the GMM supervector
space are significantly worse than that obtained by the conventional frame by frame JFA scoring.
These results can be explained by the fact that the linear KL kernel is not appropriate for GMM
supervectors obtained by the JFA model because the assumption of independence of GMM Gaussians
in the case of MAP adaptation is not true for adaptation based on eigenvoices. The results show also
that the addition of common factors didn’t improve the results in the case of SVM-JFA compared to
the JFA scoring.
5.5.2 SVM-JFA: speaker factors space

We present in this section the results obtained with the linear, Gaussian and cosine kernels applied in
speaker factors space. We compare these new results with the last one using the SVM- JFA applied
in GMM supervectors. The Table 5.2 gives these results.
47
Table 5.3: Comparison results between SVM-JFA in speaker factor space (with and without WCCN)
with two JFA scoring techniques. The results are given on EER in the core condition of the NIST
2006 SRE, English trials.
Without WCCN With WCCN

t-norm zt-norm t-norm zt-norm
Linear kernel 2.93% - 2.44% -
Cosine kernel 2.92% - 2.43% -
JFA frame by 2.81% 1.95% - -
frame scoring
JFA integrate 4.12% 2.70% - -
over channel
factors
There are three remarks in Table 5.2. The first one is that the application of the SVM in speaker
factors space gave better results than applied SVM in GMM supervectors space. The second is that
there is well linear separation between the speakers if we compare the results between cosine and
Gaussian kernel. The last remark is that t-norm didn’t give a large improvement in the case of the
cosine and Gaussian kernels, however it helps in the case of the linear kernel.
Within Class Covariance Normalization
We will now discuss the performance achieved with or without the WCCN technique in the case of
linear and cosine kernels. Table 5.3 compares the results obtained with and without WCCN to the
results of two JFA scoring techniques. The first method consists on integrating over channel factors
proposed in [Kenny et al., 2008d]. The second one is to make frame by frame JFA scoring.
The results given in Table 5.3 show that with the WCCN, we achieved 17 % relative improvements
in both kernels. We can see also that the performances obtained with WCCN are very comparable to
the JFA scoring. We got better results than integrating over channel factors and closer to JFA frame
by frame scoring. An advantage of this new SVM-JFA scoring is that it is faster than the two other
techniques.
5.5.3 SVM-JFA: speaker and common factors space
We present a comparison between results obtained with score fusion and kernel combination applied
in the speaker and common factors. In both fusion techniques, we applied cosine kernel in speaker and
common factors space. We used also WCCN in order to normalize the speaker factors cosine kernel.
The results are given in Table 5.4.
By looking at these results, we can conclude, that both fusion methods gave equivalent results.
However, the use of the kernel combination is more appropriate because we dont need development
data to set the kernel weights. The results reported in the Table 5.4 by score fusion on NIST 2006
SRE are not realistic because we trained and tested the score fusion weights on the same dataset.
We note also that the common factors components give complementary information to speaker
factors components and the combination between them improve the performances. If we compare the
results obtained by the kernel combination method and the other scoring methods, we find the same
conclusion as using only SVM in the speaker factors space (see section 5.5.2).
48
Table 5.4: Comparison results between score fusion and kernels combination for SVM-JFA system.
NIST 2006 SRE NIST 2008 SRE

English All trials English All trials
Cosine kernel on y 2.34% 3.59% 3.86% 6.55%
Cosine kernel on z 6.26% 8.68% 10.34% 13.45%
Linear score fusion 2.11% 3.62% 3.23% 6.86%
Kernel combination 2.08% 3.62% 3.20% 6.60%
JFA frame by 1.80% 2.96% - -
frame Scoring
JFA integrate 2.65% 3.82% - -
over channel factors
5.6 Conclusion
We tested in this paper several combinations between discriminative model which is Support vector
machine and generative model which is Joint Factor analysis for speaker verification. We found that
using linear or cosine kernel defined in speaker and Common factors which are the components of
the JFA gave better results than using linear Kullback Leibler kernel applied in GMM supervectors
obtained also with JFA model. We proved that using within class covariance normalization in speaker
space in order to compensate for the channel effect gave the best performances.
The results obtained with SVM-JFA using the speaker factors were comparable to the results
obtained with classical JFA scoring. However the advantage of using the SVM in speaker factors
space (usally dimension 300) makes the scoring faster than others classical techniques.
49
Chapter 6
Handling variability with support

vector machines
In speaker verification we encounter two types of variation: inter-speaker, and intra-speaker. The
former is desired and a good recognizer should exploit it, while the latter is a nuisance and a good
recognizer should suppress it. In this chapter, we will propose variability compensated SVM (VCSVM),
a new framework for handling both of these types of variation in the SVM speaker recognition setup.
6.1 Introduction
Speaker verification using SVMs has proven to be a powerful method, specifically using the GSV Kernel
[Campbell et al., 2006b] with nuisance attribute projection (NAP) [Solomonoff et al., 2005]. Also, the
recent popularity and success of factor analysis [Kenny et al., 2008c] has led to Najim’s promising
attempts to use speaker factors directly as SVM features. Both using NAP projection and speaker
factors with SVMs are methods of handling variability in speaker verification: NAP by removing
undesirable nuisance variability, while using the speaker factors does so by forcing the discrimination
to be performed based on inter-speaker variability. These successes have led us to propose VCSVM,
a new method to handle both inter and intra-speaker variation, that attempts to do so directly in the
SVM optimization. This is done by adding a penalty to the minimization that biases the normal to the
hyperplane to be orthogonal to the nuisance subspace or alternatively orthogonal to the complement of
the subspace containing the intra-speaker variation. This bias will attempt to ensure that inter-speaker
variability is used in the recognition while intra-speaker variability is ignored.
6.2 Motivation
Evidence of the importance of handling variability can be found in the discrepancy in verification
performance between one, three and eight conversation enrollment tasks for the same SVM system;
with performance improving as the number of enrollment utterances increases. One explanation for
this is that when only one target conversation is available to enroll a speaker then the orientation of
the separating hyperplane is set by the impostor utterances. As more target enrollment utterances
are provided the orientation of the separating hyperplane can change drastically, as sketched in Figure
6.1.
The additional information that the extra enrollment utterances provide is intra-speaker variability,
due to channel effects and other nuisance variables. If we could estimate the principal components of
intra-speaker variability for a given speaker then we could force the SVM to not use that variability
in choosing a separating hyperplane as is shown in Figure 6.2 where the main nuisance direction was
removed. However since it is not generally possible to estimate intra-speaker variability for a specific
50
1c target w
3c targets
+ + + 1c
+ + +
+ +
3c
− − 8c
− − −
− −
− − − − −
Figure 6.1: Different separating hyperplanes obtained with 1, 3, and 8 conversation enrollment.
speaker we could substitute a global estimate obtained from a large number of speakers, this is exactly
what is done in NAP.
Capturing Intra−speaker Standard SVM
variability seperating hyperplane
w
+
U
− − SVM seperating hyperplane

− − − with intra−speaker
− − variability compensation
− − − − −
Figure 6.2: Effect of removing the nuisance direction from the SVM optimization.
6.3 Handling Nuisance Variability

NAP handles nuisance variability by estimating a small subspace where the nuisance lives and removing
it completely from the SVM features, i.e. not allowing any information from the nuisance subspace to
affect the SVM decision. We approach this from another angle that, instead of removing the subspace
completely, biases the normal to the separating hyperplane to be orthogonal to the nuisance subspace.
Assume that the nuisance subspace is spanned by a set of N orthonormal eigenvectors {u1 , u2 , . . . , uN },
and let U be the matrix whose columns are the u0 s. Let the vector normal to the separating hyperplane
be w and ideally if the nuisance was restricted to the subspace
U then
one would require the orthogonal
T 2
projection of w in the nuisance subspace to be zero, i.e. UU w 2 = 0. This requirement can be

introduced directly into the primal formulation of the SVM optimization:
2
minimize J(w, ) = ||w||22 /2 + ξ UUT w2 /2 + C m
P
i=1 i (6.1)
subject to yi (wT xi + b) ≥ 1 − i i = 0, . . . , m
i ≥ 0 i = 0, . . . , m,
51
where the x0i s are the utterance specific SVM features (supervectors) and yi0 s are the corresponding
labels. Note that the only difference between (6.1) and the standard SVM formulation is the addition
2
of the ξ UUT w2 term, where ξ is a tunable (on some held out set) parameter that regulates the
amount of bias desired. If ξ = ∞ then this formulation becomes similar to NAP compensation, and if
ξ = 0 then we obtain the standard SVM formulation; Figure 6.3 sketches the separating hyperplane
obtained for different values of ξ .
Capturing Intra−speaker Standard SVM
variability seperating hyperplane
w
ξ=0
+
U 0< ξ <
8
ξ=
8
− − SVM seperating hyperplane
− − − with intra−speaker
− − variability compensation
− − − − −
Figure 6.3: How the separating hyperplane varies with different values of ξ.
We can rewrite the additional term in (6.1) as follows:
UUT w2 = (UUT w)T (UUT w) = wT UUT UUT w = wT UUT w,

2
(6.2)
where the final equality follows from the eigenvectors being orthonormal (UT U = I). Note that
2
UUT is a positive semi-definite matrix since ∀x xT UUT x = UUT x2 which is the 2-norm of the
projection of x in the U subspace and hence xT UUT x ≥ 0 ∀x.
We now follow the recipe presented in [Ferrer et al., 2007] to convert this reformulation into a standard
SVM with the bias absorbed into the kernel. We begin by rewriting J(w, ) in (6.1) as:
m
X
J(w, ) = wT (I + ξUUT )w/2 + C i , (6.3)
i=1
and since (I + ξUUT ) is a positive definite symmetric matrix we obtain the following
m
X
T T
J(w, ) = w B Bw + C i , (6.4)
i=1
where B can be chosen to be real and symmetric and is invertible. A change of variable w̃ = Bw and
x̃ = B−T x allows us to rewrite the optimization in (6.1) as
minimize J(w, ) = ||w̃||22 /2 + C m

P
i=1 i (6.5)
subject to yi (w̃T x̃i + b) ≥ 1 − i i = 0, . . . , m
i ≥ 0 i = 0, . . . , m,
which is then the standard SVM formulation with the following kernel:
K(xi , xj ) = xTi B−1 B−T xj = xTi (I + ξUUT )−1 xj . (6.6)
Examining the kernel presented in (6.6) we realize that (I+ξUUT ) can be very large, e.g. 19456x19456
if Gaussian supervectors are used with a 512 mixture UBM and 38 dimensional features. The size of
52
this matrix is of concern since the kernel requires its inverse. To circumvent this concern we use the
Matrix Inversion Lemma [Brookes, 2006] which states the following, for appropriately sized matrices
A, U, V, and I:
(A−1 + UVH )−1 = A − (AU(I + VH AU)−1 )VH A. (6.7)

√ √
Setting A = I, U = ξU and V = ξU in the above equation and recalling that UT U = I we get,
(I + ξUUT )−1 = I − ξU(I + ξUT U)−1 ξUT

p p
= I − ξU[(1 + ξ)I]−1 UT
ξ
= I− UUT . (6.8)
1+ξ
The kernel can therefore be rewritten as:

ξ
K(xi , xj ) = xTi (I − UUT )xj . (6.9)
1+ξ
Examining (6.9) we notice that when ξ = 0 we recover the standard linear kernel, and more importantly
when ξ = ∞ we recover exactly the kernel suggested in [Solomonoff et al., 2005] for performing NAP
channel compensation.
An advantage of this VCSVM formulation over NAP is that it does not make a hard decision to
completely remove dimensions from the SVM features but instead leaves that decision to the SVM
optimization, this is of interest especially after Najim’s experiment using only the channel-factors
in an SVM and obtaining a surprising good EER. This method effectively allows for using a larger
estimate of the nuisance subspace and then letting the SVM optimization decide how much energy to
discard from that subspace. VCSVM also provides a way to perform nuisance compensation in SVM
spaces of smaller dimensions, e.g. those obtained when speaker factors are used as SVM features;
in such situations using NAP may not be favorable as it could permanently be removing important
information.
6.4 Should All Nuisance be Treated Equally

As the choice of nuisance subspace gets larger one might desire to handle directions within that sub-
space unequally, for example we might want to avoid using larger nuisance directions in discrimination
more than we would smaller ones. This can be done based on the eigenvalues corresponding to the
different nuisance directions. Therefore, the eigenvectors spanning the U matrix are still orthogonal
but not orthonormal:
UT U = Λ, (6.10)
where Λ is a diagonal matrix whose elements are the eigenvalues corresponding to the columns of U.
We can now follow a formulation similar to that of the previous section, the difference will show up
in kernel as the matrix inversion lemma (6.7) is applied:
K(xi , xj ) = xTi (I + ξUUT )−1 xj (6.11)

= xTi (I − ξU(I + ξUT U)−1 ξUT )xj
p p
= xTi (I − ξU[I + ξΛ]−1 UT )xj . (6.12)
An extreme example of this where the whole SVM space is considered to contain nuisance informa-
tion (i.e. UUT is full rank) results in a formulation very similar to that of WCCN normalization
53
[Hatch et al., 2006]. WCCN proposes using inverse of the intra-speaker covariance matrix (i.e. full
rank UUT ) as a kernel:
K(xi , xj ) = xTi (UUT )−1 xj . (6.13)
However, in practice UUT is ill-conditioned due to the noisy estimate and directions of very small
nuisance variability, therefore smoothing is added to the intra-speaker covariance matrix to make
inversion possible, and therefore the WCCN suggested kernel becomes:
K(xi , xj ) = xTi ((1 − α)I + αUUT )−1 xj 0 ≤ α < 1. (6.14)
Comparing (6.14) with (6.11) we see that they are similar. We should however mention that when
UUT spans the full SVM space the ξ (in our implementation) and α (in the WCCN implementation)
no longer set the amount of bias desired, instead they insure that the kernel does not over-amplify
directions with small amounts of nuisance variability.
6.5 Using Inter-speaker Variability

Najim’s successful use of speaker factors to perform SVM speaker verification highlights that the in-
formation of the supervectors present in the subspace contains most of the important inter-speaker
information needed for SVM verification. However, this does not contain all the information, since
Najim’s experiments that use the “z” vector (which contains the remaining inter-speaker variability
information) performed better than just using the speaker factors. Based on this we propose a VCSVM
method similar to the one presented in the previous section to bias the SVM towards mostly using
the data present in the inter-speaker variability space.
Assume that the inter-speaker subspace is spanned by a set of M orthonormal eigenvectors (eigen-
voices) {v1 , v2 , . . . , vM }, and let V be the matrix whose columns are the v0 s. Let the vector nor-
mal to the separating hyperplane be w and ideally if V captured all inter-speaker variability then
we
would Twant2w to live in the V subspace and therefore be orthogonal to its complement, i.e.
(I − VV )w = 0. Similar to the previous section this requirement can be introduced directly into
2
the primal formulation of the SVM optimization:
2
minimize J(w, ) = ||w||22 /2 + γ (I − VVT )w2 /2 + C m
P
i=1 i (6.15)
subject to yi (wT x i + b) ≥ 1 − i i = 0, . . . , m
i ≥ 0 i = 0, . . . , m,
where γ is a tunable (on some held out set) parameter that enforces the amount of bias desired. If
γ = ∞ then this formulation becomes similar to just using the speaker factors, and if γ = 0 then we
obtain the standard SVM formulation. Note that since I − VVT is a projection into the complement
of V then we can replace it by QQT , where Q which is a matrix whose columns are the orthonormal
eigenvectors that span the complement. With this substitution we obtain a formulation that is almost
equivalent to that in (6.1), hence following the recipe in the previous section we see again can push
the bias into the kernel of a standard SVM formulation. The kernel in this case is:
γ
K(xi , xj ) = xTi (I − QQT )xj . (6.16)
1+γ
substituting back Q = I − VVT we can rewrite (6.16) as:

γ
K(xi , xj ) = xTi (I − (I − VVT ))xj . (6.17)
1+γ
54
Note that we do not have to explicitly compute the orthonormal basis Q, which can be rather large.
When γ = ∞ the kernel becomes an inner-product between the orthogonal projection of xi into the
inter-speaker subspace and recalling that VVT = VVT VVT we can rewrite the kernel as:
K(xi , xj ) = xTi VVT xj = xTi VVT VVT xj = (VVT xi )T VVT xj . (6.18)
But, VVT xi = Vyi where yi are the speaker factors of xi , therefore:
K(xi , xj ) = yiT VT Vyj = yiT yj , (6.19)
since V is an orthonormal subspace. This kernel suggests that when one chooses to perform classi-
fication using only the inter-speaker subspace the resultant kernel is just an inner-product between
the minivectors (speaker factors). Note however, that when estimating √ V one should use UBM-
normalized supervectors, similarly xi should be UBM-normalized, i.e. xi = λΣ−1/2 (mi mU BM where
mi is the supervector of relevance MAP adapted means, and mU BM is the supervector of UBM means.
6.6 Incorporating All Variability

A natural followup is to combine the previous sections into a single VCSVM formulation that attempts
to handle all of the variation. In this section we will choose to treat all nuisance directions equally,
however this can easily be extended to the setup in Section 6.4. This has to be done carefully since
there is an overlap between the U subspace and the complement to the V subspace, specifically U
lies in the complement of V. With this in mind the resultant VCSVM formulation is as follows:
2 2
minimize ||w||22 /2 + ξ UUT w2 + γ (I − VVT − UUT )w2 /2 + C m
P
i=1 i
subject to yi (wT xi + b) ≥ 1 − i i = 0, . . . , m (6.20)
i ≥ 0 i = 0, . . . , m,
where when ξ = γ we obtain the inter-speaker result of Section 6.5 and if γ = 0 we obtain the
intra-speaker result of Section 6.3. Recasting it as a standard SVM formulation yields the following
kernel:
ξ γ
K(xi , xj ) = xTi (I − UUT − (I − VVT − UUT ))xj . (6.21)
1+ξ 1+γ
6.7 Probabilistic Interpretation

In [Ferrer et al., 2007], the author makes a connection between the suggested kernel and the proba-
bilistic interpretation of SVMs proposed in [Sollich, 1999]. In this section we will choose to treat all
nuisance directions equally, however this can easily be extended to the setup in Section 6.4. The SVM
problem can be thought of as one of maximization the likelihood of w given the training data ({xi , yi }
pairs) by writing it as follows
m
X
maximize l(w|{xi , yi }) = −wT w/2 − C h(yi (wT xi + b)), (6.22)
i=1
where h() is the hinge loss. In this formulation the SVM can be though of as just computing the MAP
estimate of w given the training data: where the wT w term is essentially a Gaussian (N (0, I)) prior
and the second term is the log-likelihood of the training data given w. This Gaussian prior on w in
the standard SVM does not bias the angle of w in any direction since the components of w in the
55
prior are independent. When we introduce the bias to handle the variability this only affects the first
term in (6.22) and therefore changes the prior on w in the MAP estimation interpretation (we will
focus on nuisance variability):
m
X
T T
maximize l(w|{xi , yi }) = −w (I + ξUU )w/2 − C h(yi (wT xi + b)). (6.23)
i=1
The prior on the MAP estimate of w is still a Gaussian N (0, (I + ξUUT )−1 ) but with its principal
components orthogonal to the nuisance subspace, with the variance along the principle components
set by ξ. Hence, the prior is biasing w to be orthogonal to the nuisance subspace. A similar connection
can be made for the full setup proposed in Section 6.6.
6.8 Artificially Extending the Target Data

The probabilistic interpretation seems to suggest a method for incorporating a bias in the orientation
of the hyperplane by artificially extending the target data. In this section we will choose to treat
all nuisance directions equally, however this can easily be extended to the setup in Section 6.4. This
can be done by creating artificial target points sampled from a Gaussian centered at the true target
point and a covariance whose principle components are those of the nuisance subspace (or those of the
complement to the inter-speaker subspace). Therefore in general we can add target points sampled
from the Gaussian N (xt , (I + ξUUT + γ(I − VVT − UUT ))), where xt is the true target point.
6.9 Experimental Results

We have chosen to demonstrate VCSVM by handling nuisance variability when performing recognition
using only the speaker factors. We use the speaker verification system which uses the speaker factors
as features of the SVM. Each utterance is represented using a vector of 300 speaker factors from the
joint factor analysis system used in the workshop. The speaker factor vectors are normalized to have
unit L2 -norm and used as features in a SVM speaker verification system. Figure 6.4 shows how the
equal error rate (EER) changes as a function of ξ on our development set, the English trials of the
NIST SRE-Eval 06 core task, for 50 and 100 dimensional nuisance subspaces when equal and non-
equal weighting of the nuisance dimensions are used. The figure shows that non-equal weighting of
3.6
VCSVM EQUAL CORANK 50
3.4 VCSVM EQUAL CORANK 100
VCSVM ξ=∞ (NAP) CORANK 50
3.2 VCSVM ξ=∞ (NAP) CORANK 100
VCSVM NON−EQUAL CORANK 50
EER
3 VCSVM NON−EQUAL CORANK 100
2.8
2.6
0 5 10 15 20 25
ξ
Figure 6.4: Results on English trials of the NIST SRE-Eval 06 core task with speaker factor SVM
system: EER vs ξ for equal and non-equal weighting of nuisance subspace, and various subspace sizes.
the nuisance directions yields more favorable results than equal weighting. It also shows that VCSVM
allows for nuisance compensation in such a small space, while NAP performs poorly since it completely
removes the estimated nuisance dimensions which are a large percentage of the total dimensionality.
56
Based on the development results we choose ξ = 3 and a corank of 50 to present the results on all
trials of the Eval 08 core task in Figure 6.5.
40
NO COMPENSATION
VCSVM ξ=3 NON−EQUAL CORANK 50
Miss probability (in %) 20
10
2
1 2 5 10 20 40
False Alarm probability (in %)
Figure 6.5: Detection error plots on all trials of NIST Eval 08 core task with speaker factor SVM
system.
6.10 Conclusion
This paper presents variability compensated SVM (VCSVM), a method for handling both good and
bad variability directly in the SVM optimization. This is accomplished by introducing into the mini-
mization a regularized penalty, which biases the classifier to avoid nuisance directions and use direc-
tions of inter-speaker variability. With regard to nuisance compensation, an advantage of our proposed
method is that it does not make a hard decision on removing nuisance directions, rather it decides
according to performance on a held out set. Another benefit is that it allows for unequal weighting of
the estimated nuisance directions, e.g. according to their associated eigenvalues. This flexibility allows
for improved performance over NAP, increased robustness with regards to the size of the estimated
nuisance subspace, and successful nuisance compensation in small SVM spaces. Future work will focus
on using this method for handling inter-speaker variability and all variability simultaneously.
57
Chapter 7
Comparison of Scoring Methods used

in Speaker Recognition with Joint
Factor Analysis
The aim of this chapter is to compare different log-likelihood scoring methods, that different sites used
in the latest state-of-the-art Joint Factor Analysis (JFA) Speaker Recognition systems. The algorithms
use various assumptions and have been derived from various approximations of the objective functions
of JFA. We compare the techniques in terms of speed and performance. We show, that approximations
of the true log-likelihood ratio (LLR) may lead to significant speedup without any loss in performance.
7.1 Introduction
Joint Factor Analysis (JFA) has become the state-of-the-art technique in the problem of speaker
recognition1 . It has been proposed to model the speaker and session variabilities in the parameter
space of the Gaussian Mixture Model (GMM) [Douglas A. Reynolds, 2000]. The variabilities are
determined by subspaces in the parameter space, commonly called the hyper-parameters.
Many sites used JFA in the latest NIST evaluations, however they report their results using different
scoring methods ([Kenny et al., 2007b], [Vair et al., 2007], [Brümmer et al., 2007]). The aim of this
paper is to compare these techniques in terms of speed and performance.
7.2 Theoretical background

The theory on JFA was given in Chapter 1.
The likelihood of test utterance X is then computed by integrating over the posterior distribution
of y and z, and the prior distribution of x [Kenny and Dumouchel, 2004]. In [Kenny et al., 2005b],
it was later shown, that using mere MAP point estimates of y and z is sufficient. Still, integration
over the prior distribution of x was performed. We will further show, that using the MAP point
estimate of x gives comparable results. Scoring is understood as computing the log-likelihood ratio
(LLR) between the target speaker model s and the UBM, for the test utterance X .
There are many ways in which JFA can be trained and which different sites have experimented
with. Not only the training algorithms differ, but also the results were reported using different scoring
strategies.
1
In the meaning of speaker verification
58
7.2.1 Frame by Frame
Frame-by-Frame is based on a full GMM log-likelihood evaluation. The log-likelihood of utterance X
and model s is computed as an average frame log-likelihood 2 . It is practically infeasible to integrate
out the channel, therefore MAP point estimate of x is used. The formula is as follows
T
X C
X
log P (X |s) = log wc N (ot ; µc , Σc ) , (7.1)
t=1 c=1
where ot is the feature vector at frame t, T is the length (in frames) for utterance X , C is number of
Gaussians in the GMM, and wc , Σc , and µc the c th Gaussian weight, mean, and covariance matrix,
respectively.
7.2.2 Integrating over Channel Distribution

This approach is based on evaluating an objective function as given by Equation (13)
in [Kenny et al., 2007b]:
Z
P (X |s) = P (X |s, x)N (x; 0, I)dx (7.2)
As was said in the previous paragraph, it would be difficult to evaluate this formula in the frame-by-
frame strategy. However, (7.1) can be approximated by using fixed alignment of frames to Gaussians,
i.e., assume that each frame is generated by a single (best scoring) Gaussian. In this case, the likelihood
can be evaluated in terms of the sufficient statistics. If the statistics are collected in the Baum-Welch
way, the approximation is equal to the GMM EM auxiliary function, which is a lower bound to (7.2).
The closed form (logarithmic) solution is then given as:
C
X 1
log P̃ (X |s) = Nc log
c=1
(2π)F/2 |Σc |1/2
1 1
− tr(Σ−1 Ss ) − log|L|
2 2
1 −1/2 ∗ −1
+ kL U Σ Fs k 2 (7.3)
2
where for the first term, C is the number of Gaussians, Nc is the data count for Gaussian c, F is the
feature vector size, Σc is covariance matrix for Gaussian c. These numbers will be equal both for UBM
and the target model, thus the whole term will cancel out in the computation of the log-likelihood
ratio.
For the second term of (7.3), Σ is the block-diagonal matrix of separate covariance matrices for
each Gaussian, Ss is the second order moment of X around speaker s given as
Ss = S − 2diag(Fs∗ ) + diag(Nss∗ ), (7.4)
where S is the CF × CF block-diagonal matrix whose diagonal blocks are uncentered second order
cumulants Sc . This term is independent of speaker, thus will cancel out in the LLR computation
(note that this was the only place where second order statistics appeared, therefore are not needed for
scoring). F is a CF × 1 vector, obtained by concatenating the first order statistics. N is a CF × CF
diagonal matrix, whose diagonal blocks are Nc IF , i.e., the occupation counts for each Gaussian (IF is
F × F identity matrix).
2
All scores are normalized by frame length of the tested utterance, therefore the log-likelihood is average.
59
The L in the third term of (7.3) is given as
L = I + U∗ Σ−1 NU, (7.5)
where I is a CF × CF identity matrix, U is the eigenchannel matrix, and the rest is as in the
second term. The whole term, however, does not depend on speaker and will cancel out in the LLR
computation.
In the fourth term of (7.3), let L1/2 be a lower triangular matrix, such that
L = L1/2 L1/2∗ (7.6)
i.e., L−1/2is the inverse of the Cholesky decomposition of L.
As was said, terms one and three in (7.3), and second order statistics S in (7.4) will cancel out.
Then the formula for the score is given as
Qint (X |s) = tr(Σ−1 diag(Fs∗ ))
1
+ tr(Σ−1 diag(Nss∗ ))
2
1
+ kL−1/2 U∗ Σ−1 Fs k2 (7.7)
2
7.2.3 Channel Point Estimate

This function is similar to the previous case, except for the fact, that the channel factor x is known.
This way, there is no need for integrating over the whole distribution of x, and only its point estimate
is taken for LLR computation. The formula is directly adopted from [Kenny, 2005] (Theorem 1),
C
X 1
log P̃ (X |s, x) = Nc log
c=1
(2π)F/2 |Σc |1/2
1
− tr(Σ−1 S)
2
1
+M∗ Σ−1 F + M∗ NΣ−1 M, (7.8)
2
where M is given Eq. 2.3. In this formula, the first and second terms cancel out in LLR computation,
leading to scoring function
Qx (X |s, x) = M∗ Σ−1 F
1
+ M∗ NΣ−1 M, (7.9)
2
hence
LLRx (X |s) = Qx (X |s, xs ) − Qx (X |UBM, xUBM ), (7.10)
where xUBM is a channel factor estimated using UBM, and xs is a channel factor estimated using
speaker s.
7.2.4 UBM Channel Point Estimate

In [Vair et al., 2007], the authors assumed, that the shift of the model caused by the channel is identical
both to the target model and the UBM3 . Therefore, the x factor for utterance X is estimated using
the UBM and then used for scoring. Formally written:
LLRLPT (X |s) = Qx (X |s, xUBM )
−Qx (X |UBM, xUBM ) (7.11)
3
The authors identified themselves under abbreviation LPT, therefore we will refer to this approach as to LPT
assumption
60
Note, that when computing the LLR, the Ux in the linear term of (7.8) will cancel out, leaving the
compensation to the quadratic term of (7.8).
7.2.5 Linear Scoring

Let us keep the LPT assumption and let mc be the channel compensated UBM:
mc = m + c. (7.12)
Furthermore, let us assume, that we move the origin of supervector space to mc .
M̄ = M − mc (7.13)
F̄ = F − Nmc . (7.14)
Eq. (7.9) can now be rewritten to
Qxmod (X |M̄, x) = M̄∗ Σ−1 F̄

1
+ M̄∗ NΣ−1 M̄. (7.15)
2
When approximating (7.15) by the first order Taylor series (as a function of M̄), only the linear term
is kept, leading to
Qlin (X |M̄, x) = M̄∗ Σ−1 F̄ (7.16)
Realizing, that the channel compensated UBM is now a vector of zeros, and substituting (7.16)
to (7.11), the formula for computing the LLR simplifies to
LLRlin (X |s, x) = (Vy + Dz)∗ Σ−1 (F − Nm − Nc). (7.17)
LLR r
ea
lin
linear score
LLRfbf
full score
0
quadratic score
qua
drat
ic
UBM target GMM mean space

model
Figure 7.1: An illustration of the scoring behavior for frame-by-frame, LPT, and linear scoring.
Given the fact, that the P̃ -function is a lower bound approximation of the real frame-by-frame
likelihood function, there are cases, when the LPT original function fails. Fig. 7.1 shows that the
linear function can sometimes be a better approximation of the full LLR.
61
7.3 Experimental setup
7.3.1 Test Set
The results of our experiments are reported on the Det1 and Det3 conditions of the NIST 2006 speaker
recognition evaluation (SRE) dataset4 .
The real-time factor was measured on a special test set, where 49 speakers were tested against
50 utterances. The speaker models were taken from the t-norm cohort, while the test utterances
were chosen from the original z-norm cohort, each having approximately 4 minutes, totally giving 105
minutes.
7.3.2 Feature Extraction

In our experiments, we used cepstral features, extracted using a 25 ms Hamming window. 19 mel
frequency cepstral coefficients together with log energy are calculated every 10 ms. This 20-dimensional
feature vector was subjected to feature warping [J. Pelecanos, 2006] using a 3 s sliding window. Delta
and double delta coefficients were then calculated using a 5 frames window giving a 60-dimensional
feature vectors. These feature vectors were modeled using GMM and factor analysis was used to treat
the problem of speaker and session variability.
Segmentation was based on the BUT Hungarian phoneme recognizer [Schwarz et al., 2006] and
relative average energy thresholding. Also short segments were pruned out, after which the speech
segments were merged together.
7.3.3 JFA Training

We used gender independent Universal Background Models, which contain 2048 Gaussians. This UBM
was trained using LDC releases of Switchboard II, Phases 2 and 3; switchboard Cellular, Parts 1 and
2 and NIST 2004-2005 SRE. The (gender independent) factor analysis models were trained on the
same quantities of data as the UBM.
Our JFA is composed by 300 speaker factors, 100 channel factors, and diagonal matrix D. While
U was trained on the NIST data olny, D and V were trained on two disjoint sets comprising NIST
and Switchboard data.
7.3.4 Normalization
All scores, as presented in the previous sections, were normalized by the number of frames in the test
utterance. In case of normalizing the scores (zt-norm), we worked in the gender dependent fashion.
We used 220 female, and 148 male speakers for t-norm, and 200 female, 159 male speakers for z-norm.
These segments were a subset of the JFA training data set.
7.3.5 Hardware and Software

The frame-by-frame scoring was implemented in C++ code, which calls ATLAS functions for math
operations. Matlab was used for the rest of the computations. Even though C++ produces more
optimized code, the most CPU demanding computations are performed via the tuned math libraries
that both Matlab and C++ use. This fact is important for measuring the real-time factor. The
machine on which the real-time factor (RTF) was measured was a Dual-Core AMD Opteron 2220
with cache size 1024 KB. For the rest of the experiments, computing cluster was used.
4
http://www.nist.gov/speech/tests/spk/index.htm
62
7.4 Results
Table 7.1 shows the results without any score normalization. The reason for the loss of performance in
the case of LPT scoring could possibly be due to bad approximation of the likelihood function around
UBM, ,i.e., the inability to adapt the model to the test utterance (in the U space only). Fig. 7.1 shows
this case.
Table 7.1: Comparison of different scoring techniques in terms of EER and DCF. No score normal-
ization was performed here.
Det1 Det3
EER DCF EER DCF
Frame-by-Frame 4.70 2.24 3.62 1.76
Integration 5.36 2.46 4.17 1.95
Point estimate 5.25 2.46 4.17 1.96
Point estimate LPT 16.70 6.84 15.05 6.52
Linear 5.53 2.97 3.94 2.35
Table 7.2 shows the results after application of zt-norming. While the frame-by-frame scoring
outperformed all the fast scorings in the un-normalized case, normalization is essential for the other
methods.
Table 7.2: Comparison of different scoring techniques in terms of EER and DCF. zt-norm was used
as score normalization.
Det1 Det3
EER DCF EER DCF
Frame-by-Frame 2.96 1.50 1.80 0.91
Integration 2.90 1.48 1.78 0.91
Point estimate 2.90 1.47 1.83 0.89
Point estimate LPT 3.98 2.01 2.70 1.36
Linear 2.99 1.48 1.73 0.95
7.4.1 Speed
The aim of this experiment was to show the approximate real time factor of each of the systems. The
time measured included reading necessary data connected with the test utterance (features, statistics),
estimating the channel shifts, and computing the likelihood ratio. Any other time, such as reading of
hyper-parameters, models, etc. was not comprised in the result. Each measuring was repeated 5 times
and averaged. Table 7.3 shows the real time of each algorithm. Surprisingly, the integration LLR is
Table 7.3: Real time factor for different systems
Time [s] RTF

Frame-by-Frame 1010 1.60e−1
Integration 50 7.93e−3
Point estimate 160 2.54e−2
Point estimate LPT 36 5.71e−3
Linear 13 2.07e−3
63
faster then the point estimate. This is due to implementation, where the channel compensation term
in the integration formula is computed once per an utterance, while in the point estimate case, each
model needs to be compensated for each trial utterance.
7.5 Conclusions
We have showed a comparison of different scoring techniques that different sites have recently used
in their evaluations. While, in most cases, the performance does not change dramatically, the speed
of evaluation is the major difference. The fastest scoring method is the Linear scoring. It can be
implemented by a simple dot product, allowing for fast scoring of huge problems (e.g., z-, t- norming).
64
Chapter 8
Discriminative optimization of speaker

recognition systems
8.1 Introduction
In this chapter we present the work we did to explore some discriminative training techniques of a
speaker recognition system, as an alternative to the generative training that constitutes the current
state-of-the-art.
For the state-of-the-art — generatively trained joint factor analysis (JFA) — we refer to the
introductory chapter 2. In this chapter, we introduce discriminative training, followed by some general
discussion. Finally, we present the specific experiments that we did.
8.2 Motivation for discriminative training

In machine-learning in general, and in speech recognition in particular, discriminative training has
proved advantageous for many supervised learning problems. However, in speaker recognition, the use
of discriminative training has been very limited. In particular the following forms of discriminative
training have been used in speaker recognition:
Discriminative training of speaker models, as illustrated in figure 8.1, has been around for more
than a decade, and indeed SVM speaker modeling has been a constant feature at the NIST SRE
evaluations since 2003. However, such discriminative training addresses only half of the speaker
recognition problem. The full problem is to (i) distinguish one speaker from others, but also to
(ii) recognize the same speaker under different circumstances. Traditional SVM speaker modeling
addresses only (i), by discriminatively training every speaker model to discriminate one example
of the speaker from a pool of background speakers. It does not however address problem (ii)
in a discriminative way. Instead, state-of-the-art SVM implementations have had to resort to
solutions such as nuisance attribute projection (NAP), which are external to the discriminative
training—see e.g. [Brümmer et al., 2007] and references therein.
Discriminative fusion of speaker recognition systems, using neural networks, SVM or logistic
regression [Brümmer et al., 2007] has been very successful in several of the latest NIST SRE
evaluations. This technique does use large amounts of training data in a single discriminative
optimization, but the number of parameters that are trained is very small, typically around 10
parameters of fewer. Fusion can improve upon existing systems, but the existing systems already
have to be good and they have to be complementary. Fusion by itself cannot create good systems
and it cannot make different systems complementary.
65
The focus of this work is therefore to explore techniques which are more powerful than both of the
above-mentioned methodologies. Specifically, the focus of this work is to start with some variant of the
JFA system (figure 8.2) and to discriminatively optimize some of the hyperparameters of the system
as shown in figure8.3. Parameters that could be optimized include:
• The parameters of feature vector transformations, analogous to techniques such as HLDA and
RASTA.
• The parameters of the voice activity detector.
• The parameters of the UBM.
• The JFA parameters, (V, U, D).
• Parameters of a score normalizer, which could perhaps replace ZT-norm.
In this work, we concentrate only on optimization of the JFA parameters.
enrollment speech test speech
feature extraction feature extraction
discriminative estimate model

optimization
match
score
Figure 8.1: Traditional use of discriminative training in speaker recognition.

M m Vy D z Ux
Generative Modeling
via Joint Factor Analysis
(ML optimization)
system hyper-
parameters
estimate model
match
score
M m Vy D z Ux
Figure 8.2: State-of-the-art generatively trained JFA system.
8.2.1 Envisioned advantages

Discriminative training is known to be able to compensate for unrealistic generative modeling assump-
tions. It is clear that there are problems with the current generative JFA model. These problems
66
System
discriminative
optimization
system
hyper-parameters
estimate model
match
score
M m Vy D z Ux
Figure 8.3: Proposed discriminative training paradigm for speaker recognition.
are indicated by the ad-hoc normalizations (see step 4 in section 2.3) that have to be used to make
generative JFA work, although these normalizations are not derived from the generative models. A
favourable outcome to discriminative training of the JFA hyperparameters could therefore be expected
to give better accuracy than ML estimates of those hyperparameters.
Discriminative training has been shown to be able to optimize smaller, simpler, faster systems to
rival the accuracy of larger generatively trained systems. In this work, we concentrated on this aspect,
with a few encouraging results.
8.3 Challenges of discriminative training

Discriminative training of complex systems is a difficult problem. In large LVCSR systems it took
years for discriminative methods to catch up with generative ones. The challenges include not only
complexity and scale of implementation, but also overtraining.
The complexity includes the challenges of designing and implementing all of the following:
• The form of the system to be optimized and which parameters to optimize.
• The discriminative objective function to use.
• The choice and implementation of a suitable numerical optimization algorithm. Such algo-
rithms must be chosen to match both the nature and the scale of the optimization prob-
lem [Nocedal and Wright, 2006].
• The computation of partial derivatives of the objective function, with respect to all of the param-
eters. For complex systems, this can be a very difficult problem, especially if the optimization
algorithm needs second order derivatives. Moreover, computation of derivatives can also be very
demanding of CPU and memory.
The scale of this optimization problem presents a challenge, because:
• There are a very large number of parameters to be optimized. The number of hyperparameters in
a JFA system numbers in the tens of millions. Note that partial derivatives have to be computed
for every parameter under optimization.
• The training data over which the optimization is performed is large, involving some 1500 hours
of speech, or some 250 million training examples.
Finally, discriminative optimization of a large parameter set is vulnerable to overtraining. It can be a
challenging problem to find suitable means to combat this problem.
67
8.4 Solutions for discriminative training
In this section, we list possible solutions for several of the challenges mentioned above. Some solutions
are listed here just for reference; some were tried and abandoned for various reasons; and some were
pursued to the stage of obtaining experimental results.
8.4.1 Discriminative Training Objective Function

The first thing we need to answer is what classes we want to discriminate. Here we mention two
possibilities: Our experiments were done with two-class discrimination, but for completeness we also
briefly discuss the multi-class case.
Two-class discriminative training

The basic challenge in speaker recognition is binary discrimination between target trials and non-
target trials. The data in each trial is assumed to consist of two speech segments1 : a train and a test
segment. A target trial is where the train and test segments are from the same speaker. A non-target
trial is where the train and test segments are from different speakers.
We chose to use an objective function with which the authors have had previous success, both
in fusion of multiple speaker recognition systems [Brümmer et al., 2007] and in training of language
recognition systems [Matejka et al., 2008]. This objective function goes by many names, such as
Maximum Mutual Information (MMI), Minimum Cross-Entropy, Logistic Regression, etc. The ver-
sion we use is specifically adapted for speaker recognition in order to compensate for the fact that
the application priors we are optimizing for may be very different from the proportions of the
classes in the training data. Our version of the MMI criterion therefore normalizes these propor-
tions [Brümmer and du Preez, 2006].
To apply the objective function to a speaker recognizer, we assume that the recognizer outputs
a score, λt , for every trial t and that this score is calibrated such that it may act as the following
log-likelihood-ratio:
p(λt |target)
λt = log (8.1)
p(λt |non-target)
Note that by acting as log-likelihood-ratio, we do not imply that λt is actually calculated as the ratio
between two likelihoods. Rather, we mean that when it is employed to compute the class posterior for
a trial, that this posterior is of good quality, as judged by the discriminative optimization criterion.
Given some target prior π, the posterior is by Bayes’ rule:
π
P (target|λt , π) = (1 + exp(−λt − logit π))−1 , where logit π = log (8.2)
1−π
Given the log-likelihood-ratio scores for each trial in a supervised training database, the objective
function, which we call Cllr , may be expressed as:
!
1 1 X −λt 1 X λt
Cllr = log(1 + e ) + log(1 + e ) (8.3)
2 log 2 |T | |N |
t∈T t∈N
where T denotes the subset of all target trials in the training data and N the subset of all non-target
trials and where | · | denotes the number of trials in each subset. We used Cllr in all of our experiments,
to be described below in section 8.5. These experiments differed in the way that the scores λt were
computed from the input data.
1
Of course variants exist, e.g. where there are multiple train segments. Here we concentrate on the simplest case of
a single train segment.
68
Multi-class discriminative training
For completeness we mention a somewhat different way to approach discriminative training of a speaker
recognition system. To apply this, we would need to set up the recognizer as a multi-class speaker
identifier which outputs a likelihood/posterior for every one of a closed set of speakers, rather than
as a two-class speaker verifier/detector. However, the plan here is somewhat more complex than
traditional discriminative training of generative models:
In traditional multi-class discriminative training of generative models, the class-dependent

models are adapted. But in this case, we will not see those same speakers again—we
want to use the system for new speakers. So we will train only speaker-independent
hyperparameters.
The plan works as follows:
1. Choose K speakers, so that there are two or more speech segments for each speaker.
2. Hold out one segment per speaker as the training segment. Denote the training segment for
speaker k as Tk . Denote the set of test segments for this speaker as Sk .
3. The model for speaker k is Mk and is a function of the training segment and of θ, the set of
hyperparameters to be optimized, so that Mk = train(Tk , θ).
4. The log-likelihood score for speaker k, given test segment s` is λk` = score(Mk , s` , θ).
5. The objective function is the empirical cross-entropy averaged over all test segments:
 
K K
1 X 1 X X
O=− λk` − log exp(λj` ) (8.4)
K kSk k
k=1 `∈Sk j=1
6. Optimize (minimize) the objective function w.r.t. θ.
8.4.2 Regularization
As mentioned above, discriminative optimization of a large parameter set is vulnerable to overtraining.
Overtraining can be controlled in different ways. Examples are [MacKay, 2003, Bishop, 2007]:
Early stopping, where the iterative optimization process is stopped as soon as the performance on
a held-out evaluation database stops improving.
Regularization penalties, where the objective function is modified by the addition of a suitable
weighted penalty function. (This is standard practice with SVM training.) The weight of the
penalty has to be tuned by using a held-out evaluation set. For non-standard optimizations such
as this work, the design of the penalty function can be challenging.
Bayesian methods, (see above references), where priors are assigned for the parameters under op-
timization. There are several challenges associated with such Bayesian methods.
In this work, we used the simplest strategy of early stopping.
69
8.4.3 Optimization algorithms
There are very many ways to implement algorithms to numerically optimize a given objective function.
We studied some general-purpose optimization literature, most notably the textbook by Nocedal and
Wright [Nocedal and Wright, 2006], but also some other sources, which will be mentioned below.
In addition, we worked with the extended Baum-Welch algorithm, which is specifically designed for
discriminative MMI training of generative models in speech recognition.
The software that we used for our optimization experiments was all done in MATLAB. We tried
some experiments with the MATLAB optimization toolbox, but found problems in scaling these
algorithms to our problem. Instead, we hand-coded our own optimization algorithms in MATLAB.
General-purpose optimization algorithms

Optimization algorithms may be classified according to their usage of derivatives:
• The most primitive methods use only objective function evaluations and do not rely on the
availability of explicitly calculated derivatives of the objective. Examples of this class are:
The Nelder-Mead simplex method. We did not further consider this method, because it
becomes intractable for large-scale optimization.
SPSA, or simultaneous perturbation stochastic approximation. In this methodology, numerical
derivative approximations are done in suitably randomized directions in search space. This
method is tractable for large optimizations, because only one derivative per iteration is
approximated. It is reported to have convergence rates similar to algorithms that do use
explicit gradient information. We did not try this method, but it may be an interesting
avenue for further investigation. See www.jhuapl.edu/SPSA for further reading.
• More sophisticated methods use explicit first-order derivatives in the form of the gradient of
the objective function. For a large number of variables, the gradient cannot be numerically
approximated with finite difference methods, because this would involve prohibitive numbers
of objective function evaluations, namely one or two evaluations per variable. To use these
methods, one therefore needs explicit (and efficient) analytical implementations of the gradient.
The subject of how to implement gradients will be discussed below in section 8.4.4. In this class
of first-order methods, we investigated the following:
Non-linear conjugate gradient methods form the canonical class of general purpose un-
constrained first-order optimization algorithms.
RPROP uses only the sign of the the gradient, and a set of heuristics to adaptively ad-
just the search step size for every variable. For some objective functions, RPROP
has comparable (or superior) performance to conjugate gradient, but it is considerably
easier to implement, because it does not use sophisticated line-search techniques. See
http://en.wikipedia.org/wiki/Rprop. We used RPROP as our optimization algorithm
in the experiment reported in section 8.5.1 below.
Stochastic gradient descent optimization algorithms can be applied to supervised machine
learning objective functions which are sums over very many terms, where each term repre-
sents one training example. This is indeed the case here. The difference between stochastic
gradient descent and more traditional methods is that stochastic gradient adjusts the pa-
rameters being optimized after every example, rather than once after all examples have been
processed. If the examples are repetitive this can give substantial savings in computation.
We tried some preliminary experiments with this methodology, but found no advantage
for our problem. One of the issues that we could not resolve to make this method work,
70
was that it breaks the vector pipelining that allows MATLAB code to run efficiently. For
further reading see http://leon.bottou.org/research/stochastic.
• The most sophisticated methods use not only gradients, but also some second-order derivative
information, in the form of Hessian-vector products. The second-order information can lead to
fast convergence in convex regions of the objective function, which are typically found close to
minima. We mentioned above that gradients cannot be approximated numerically for large-scale
optimization. However, given analytical solutions for the gradient, the Hessian-vector products
can be accurately and efficiently approximated. A promising method in this class, which we
believe merits further investigation is the Trust-Region-Newton-Conjugate-Gradient method.
See [Nocedal and Wright, 2006] and [Lin et al., 2008].
8.4.4 Computation of derivatives

Computation of derivatives for optimization problems is in general a challenging problem. Firstly,
the derivatives of even moderately complex objective functions can become analytically very complex.
To do the derivations by hand and to correctly implement that in software is a non-trivial problem.
Secondly, derivatives for large-scale optimization imply large linear algebra operations, which can
challenge CPU and memory resources. We discuss some alternatives:
Finite difference methods, as mentioned above, are not tractable for large-scale gradient compu-
tation. However, they do form a very valuable tool for verifying the correctness of gradients
computed by other means and we used these methods for this purpose.
Automatic/algorithmic differentiation is a class of automatic methods, where code that imple-

ments the objective function can be automatically transformed to also compute derivatives.
Algorithmic differentiation has a forward mode and a reverse mode. The simpler forward mode
is not suitable for large gradients. The reverse mode is considerably complicated by having to
store a backtrack path and associated data structures, which can become very large. We did not
use such tools, for reasons of availability and scalability.
Semi-automatic: We did some experiments with semi-automatic ways to achieve more or less the
same effect as reverse-mode automatic differentiation (also similar to back-propagation in neural-
network training), by using differentiation chain-rules for compositions of functions. With this
methodology, the programmer hand-codes the derivatives for simple building blocks and connects
the blocks such that the chain rules are implemented. This is a promising methodology, but we
had problems with the scaling of memory and CPU resources. Computing gradients of nested
multivariate functions involves large matrix multiplications.
Hand-coded derivatives: In the end, we found it best to globally hand-optimize these matrix com-
putations to obtain tractable solutions to computing the required gradients. In doing this we
found two tools invaluable. The one, as mentioned above was testing via finite differencing. The
other was matrix calculus to find derivatives of expressions involving vectors and matrices, see
http://research.microsoft.com/en-us/um/people/minka/papers/matrix.
8.5 Experiments
In this section, we present details and results for two sets of experiments that were pursued to successful
conclusion.
71
8.5.1 Small scale experiment
In this experiment, we drastically simplified the scale of the problem, by simplifying the JFA system,
prior to discriminative training. This simplification was found by other members of the team in their
quest to find JFA-based transformations to a feature space amenable to SVM training (see 5). The
transform works as follows:
Transform
The transform is a mapping of every input segment to a low-dimensional vector. It is effected by
performing model training, as explained in step 1 of section 2.3, on every input segment. This
differs from the normal JFA recipe where only train segments are treated thus. Then, only the 300-
dimensional y-vector is retained.
In the work reported here, this transform was done by the female variant of the CRIM JFA
system[Kenny et al., 2008b].
Raw Score
Let y1 represent the train segment and y2 represent the test segment. Let W be a 300-by-300 positive
semi-definite matrix, so that y1 Wy2 is a dot-product. Then the score for the trial is the normalized
dot product:
y0 Wy2
s(y1 , y2 ) = p 0 1 (8.5)
y1 Wy1 y20 Wy2
We did not use ZT-norming, because it showed no improvement over the raw score.
Generative baseline
The parameter W of this score can be informally ‘trained’ in a generative way2 by estimating the
within-class covariance matrix of the y-vectors obtained from a development database with multiple
sessions (segments) per speaker. We use the W obtained thus as our baseline for comparison and also
as initialization for the discriminative optimization.
Calibration
As mentioned above, to evaluate the discriminative objective function (8.3), we need the score to
act as log-likelihood-ratio. We force the score to behave thus with an affine calibration transform,
the parameters of which are trained discriminatively along with the other parameters. Denoting the
log-likelihood-ratio of a trial t by λt , the affine transform of a raw score st is:
λt = αst + β (8.6)
where the scale parameter α and the offset parameter β are real scalars. The whole discriminative
training problem is therefore to jointly optimize (α, β, W).
Parametrization
The 300-by-300 parameter matrix W is constrained to be positive semi-definite. In order to avoid
the extra complications of constrained optimization, we reparametrize in terms of an unconstrained
2
A formal generative model, with multivariate Gaussians for between- and within-class variation, yields a somewhat
more complex score function. We tried using such score functions, but the simple normalized dot-product score performed
best.
72
300-by-300 matrix3 X, so that W = X0 X.
The full parameter set for optimization is now (α, β, X). It should be noted that although this
parametrization avoids constraints, it forms a non-convex optimization problem, with multiple equiv-
alent optima, because for a given W, there are many solutions for W = X0 X.
Optimization
As mentioned above optimization involves two interrelated challenges: (i) choosing a numerical op-
timization algorithm and (ii) computing the derivatives required by that algorithm—all subject to
limited resources of time, CPU and memory. Our first successful solution used the above-mentioned,
first-order RPROP method, with hand-optimized gradient calculation. Undoubtedly there are better
solutions, some of which we will investigate in future.
Results
We performed our experiments by discriminative training on all the female JFA development data,
including Switchboard and SRE’04 and SRE’05. By using all pairs of speech sgments, this gave a
training database of almost 200 million detection trials, involving more than 1000 speakers. We tested
on the English, female subset of the 1conv4w-1conv4w task of NIST SRE’06. The above-mentioned
generative baseline gave EER = 2.61%. Discriminative retraining, with early stopping after a few
iterations gave an 11% relative improvement of EER = 2.33%.
Comment
Although we got an improvement, it is clear that this method is very vulnerable to overtraining of the
90000 parameters. If the optimization was iterated further, the training EER dropped to a fraction
of a percent, but this did not translate to good performance on independent test data. In future
we would like to use more sophisticated regularization, coupled with more sophisticated optimization
algorithms, in order to be able to find well-defined optima of the objective function rather than the
somewhat ill-defined early stopping.
8.5.2 Large Scale Experiments

In this set of experiment, we attempted to discriminatively train the whole eigenvoice matrix V.
Again, the task is to minimize the discriminative objective function Cllr given by equation (8.3),
where λt is again the calibrated log-likelihood-ratio score as defined in (8.6). This time, however,
the raw score st is the log-likelihood-ratio obtained from the full JFA model. Since the simple linear
scoring described in section 8.2 provided superior performance, it was the natural choice for these
experiments. Therefore, the log-likelihood-ratio defined by (7.17) is used as the raw score st . Another
advantage of using the linear scoring are the simple derivatives of the objective function w.r.t. V that
are easy to derive and that can be very efficiently calculated. In all experiments described in this
section, speaker factors for all training and enrollment utterances were precomputed using ML trained
JFA system and stayed fixed.
The full parameter set for optimization is (α, β, V). The optimization is an iterative process,
where, in each iteration, we first fully optimize parameters α and β using linear logistic regression4 ,
which is followed by one iteration of modified gradient descent update of eigenvoices V. Differentiating
(8.3) w.r.t. a model parameter θ, we obtain
3
One could also more generally let X be k-by-300, for some k 6= 300. If we choose k < 300, then the rank of W
would be constrained to be k, but we found no experimental benefit in applying this constraint. Conversely, we could
overparametrize by letting k > 300, but we did not investigate this experimentally.
4
FoCal toolkit (http://niko.brummer.googlepages.com/focal) by Niko Brummer was used for this purpose
73
!
∂Cllr 1 1 X 1 X ∂λt
= (1 − P (target|λt , 0.5)) − P (target|λt , 0.5) , (8.7)
∂θ 2 log 2 |T | |N | ∂θ
t∈T t∈N
where P (target|λt , 0.5) is given by (8.2) and by combining (8.6) and (7.17) and differentiating
w.r.t. V, we obtain
∂λt
= αΣ−1 F̄y∗ (8.8)
∂V
To optimize our objective function, we need to define set of training trials. In these experiments,
each possible pair of two segments from our training set formed a valid trial, where one segment is
considered to be the enrollment and the other the test segment. This allows us to define J × J matrix
P, where J is number of the segments in the training set and each element of the matrix corresponds
to one trial where row index defines test segment and column index defines the enrollment segment.
Let the elements of the matrix P corresponding to trial t be 1−P (target|λ
|T |
t ,0.5)
if the trial is a target
trial and − P (target|λ
|N |
t ,0.5)
if the trial is a non-target trial. Combining (8.7), (8.8), making use of matrix
P and taking the gradient of the objective function w.r.t. V, we obtain
1
∇Cllr (V) = αΣ−1 F̃PY∗ (8.9)
2 log 2
where columns of matrix F̃ are vectors of first order sufficient statistics, F̄, extracted form all segments
in training set (representing test segments) and where columns of matrix Y are vectors of speaker
factors extracted form all segments in training set (representing enrollment segments).
Now, the gradient could be used to optimize the objective function using standard gradient de-
scent optimization. However, the widely adopted technique for MMI training GMMs of HMMs is
Extended Baum-Welch [Schluter et al., 2001] re-estimation, which has been shown to provide much
faster convergence compared to gradient descent. In our case, the Extended Baum-Welch can not be
adopted in straightforward way thanks to our more complicated model and simplified linear scoring.
In [Schluter et al., 2001] section 2.2.2., relation between Extended Baum-Welch and gradient descent
update of parameters was pointed out showing that Extended Baum-Welch update of GMM mean
vectors can be seen as gradient descent update with a specific learning rate used to update each
parameter. Inspired by this relation, we propose to use similar learning rate specific to each row of
matrix V. Specifically, we multiply the gradient ∇Cllr (V) by diagonal matrix
L = η diag(ÑP1)Σ, (8.10)
where columns of matrix Ñ are vectors of zero order sufficient statistics, N, extracted form all segments
in training set, η is parameter independent learning rate, 1 is column vector of ones and diag is an
operator that converts vector into diagonal matrix. Finally, the matrix V is iteratively updated using
the following formula:
Vnew = Vold + L∇Cllr (V) (8.11)
Experiment with no explicit channel compensation

First system, which we used for experimenting with discriminative training of eigenvoices V was pure
eigenvoice system (i.e. no U and no D matrices were considered in this case). A system based on
relatively small UBM with only 512 components and 39 dimensional features was selected for this
experiment. System was first trained using maximum likelihood training, then eigenvoice matrix V of
size 19968 × 300 was retrained discriminatively using the procedure described in the previous section.
We use ML trained V as the staring point for the discriminative training. The results are presented in
74
EER[%] No norm ZT-norm
Generative V 15.44 11.42
Generative V and U 6.99 4.07
Discriminative V 7.19 5.06
Discriminative V with channel compensated y 6.80 4.81
Table 8.1: Results of the 1st large scale experiment, on SRE 2006 all trials (det1).
EER[%] No norm ZT-norm

Generative V and U 6.99 4.07
Discriminative V, Generative U 6.00 3.87
Table 8.2: Results of the 2nd large scale experiment, on SRE 2006 all trials (det1).
Table 8.1. Comparing the first and the third line in the table, we can see that discriminative training
provides substantial improvement in the performance. The improvement hold also when applying
zt-normalization, though the normalization was not considered during the discriminative training.
As mentioned in the previous section, speaker factors y are always computed using the original
ML trained model. In this case, it is the pure eigenvoice system with speaker factors y estimated
without considering channel variability. The last line in the table shows results obtained with the
same discriminatively trained pure eigenvoice system used for testing, where, however, factors y were
obtained from ML trained system modeling the channel variability. The improved result suggest that
good estimation of y not affected by channel variability may be important. Possibly, in the future, this
could be also achieved by means of discriminative training, without explicitly modeling the channel
variability.
The second line in the table shows performance of ML trained system making use of eigenchannels
for both estimating speaker factors y and testing. We can see that discriminative training provides
comparable improvements to the intersession variability modeling. However, improvement over the
ML trained system is observed only in the first column of the third row, which corresponds to result
without zt-normalization. When zt-norm is used, performance of the generative system is superior
to the discriminatively trained one. Note, that zt-norm was not considered during discriminative
training. Not having zt-norm incorporated in the discriminative training may force the training to
concentrate on problem that can be easily solved by the normalization, which can lead to suboptimal
result.
8.5.3 Experiment with ML trained eigenchannels

In the following experiment, we have used system, were the intersession variability is modeled using
eigenchannel matrix U (200 eigenchannels). Otherwise, the system is the same as in the previous set
of experiments. Again, the ML trained parameters are used as the starting point for the discriminative
training of matrix V. Although, we make use of U in this experiment, this matrix is not retrained
discriminatively. As can be seen in Table 8.2, retraining the matrix V improves the results. Much
higher improvement is, however, obtained without zt-norm. The probable reason is the same as
explained in the previous paragraph.
8.5.4 Conclusion
Discriminative training for speaker identification a large and difficult problem, but it has the potential
of worthwhile gains with the possibility of more accurate, but faster and smaller systems. We have
managed to show some proof of concept, but so far without significantly improving on the state-of-
75
the-art. Remaining problems are practical and theoretical, including complexity of optimization and
principled methods for combating over-training.
Many possible extensions of our large scale experiments are possible. Beside training eigenvoices
V, hyperparameters U and D can be also trained discriminatively. In all of our current experiments,
we worked with sufficient statistics collected by UBM model. This means that assignment of frames to
Gaussians is fixed given by UBM, which was, however, trained using maximum likelihood criterion. It
is quite possible that such allocation of Gaussians is suboptimal for the task of discriminating between
speaker. It would be worthwhile to experiment with discriminative training that has the freedom
to change such frame assignment. We have also pointed out the problem with zt-norm not being
incorporated in the discriminative training. This could be achieved by making λt in (8.3) to be the
zt-normalized score. However, this make integration of our objective function much more complicated.
76
Chapter 9
Summary and conclusions
In this workshop, several approaches to robust speaker recognition, sharing the same theoretical
background — Joint Factor Analysis (JFA) — were investigated.
In diarization (Chapter 3), we have examined application of JFA and Bayesian methods to
diarization. Our approach produced 3-4on challenging interview speech
In Factor Analysis Conditioning (Chapter 4), we have explored ways to use JFA to account for
non-session variability (phone) and showed robustness using within-session, stacking and hierarchical
modeling.
We have also advanced SVM-JFA approaches by developing techniques to use JFA elements
in SVM classifiers (Chapters 5 and 6). The results are comparable to full JFA system but with fast
scoring (Chapter 7) and no score normalization. We have concluded that SVM approaches provide
better performance using all JFA factors.
Finally, discriminative system optimization was investigated (Chapter 8). It focused on means
to discriminatively optimize the whole speaker recognition system and successfully demonstrated proof-
of-concept experiments.
To conclude, we have found JHU 2008 an extremely productive and enjoyable workshop, and our
aim is to have collaboration in problem areas going forward. Cross-site, joint efforts will certainly
provide big gains in future speaker recognition evaluations and experiments.
77
Bibliography
[Auckenthaler et al., 2000] Auckenthaler, R., Carey, M., and Lloyd-Thomas, H. (2000). Score normal-
ization for text-independent speaker verification systems. Digital Signal Processing, 10(1/2/3):42–
54.
[Bishop, 2007] Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer.
[Brookes, 2006] Brookes, M. (2006). The matrix reference manual.

http://www.ee.ic.ac.uk/hp/staff/www/matrix/intro.html.
[Brümmer, 2008] Brümmer, N. (2008). SUN SDV system description for the NIST SRE 2008 evalua-
tion.
[Brümmer et al., 2007] Brümmer, N., Burget, L., Černocký, J., Glembek, O., Grézl, F., Karafiát, M.,
van, D. L., Matějka, P., Schwarz, P., and Strasheim, A. (2007). Fusion of heterogeneous speaker
recognition systems in the stbu submission for the NIST speaker recognition evaluation 2006. IEEE
Transactions on Audio, Speech, and Language Processing, 15(7):2072–2084.
[Brümmer and du Preez, 2006] Brümmer, N. and du Preez, J. (2006). Application-independent eval-
uation of speaker detection. Computer Speech & Language, 20(2-3):230–275.
[Burget et al., 2007] Burget, L., Matejka, P., Glembek, O., Schwarz, P., and Cernocky, J. (2007).
Analysis of feature extraction and channel compensation in GMM speaker recognition system.
IEEE Transactions on Audio, Speech, and Language Processing, 15(7):1979–1986.
[Campbell et al., 2006a] Campbell, W., Sturim, D., Reynolds, D., and Solomonoff, A. (2006a). Svm
Based Speaker Verification using a GMM SuperVector Kernel and NAP Variability Compensation.
In IEEE-ICASSP, Toulouse.
[Campbell et al., 2006b] Campbell, W. M., Sturim, D. E., and Reynolds, D. (2006b). Support Vec-
tor Machines Using GMM Supervectors for Speaker Verification. IEEE Signal Processing Letters,
13(5):308–311.
[Castaldo et al., 2007] Castaldo, F., Colibro, D., Dalmasso, E., Laface, P., and Vair, C. (2007). Com-
pensation of nuisance factors for speaker and language recognition. IEEE Transactions on Audio,
Speech and Language Processing, 15(7):1969–1978.
[Castaldo et al., 2008] Castaldo, F., Colibro, D., Dalmasso, E., Laface, P., and Vair, C. (2008).
Stream-based speaker segmentation using speaker factors and eigenvoices. In Proc. ICASSP, Las
Vegas, Nevada.
[Chaudhari et al., 2000] Chaudhari, U., Navratil, J., and Maes, S. (2000). Transformation enhanced
multi-grained modeling for text independent speaker recognition. ICSLP, 2:298–301.
78
[Dehak et al., 2009] Dehak, N., Kenny, P., Dehak, R., Glembek, O., Dumouchel, P., Burget, L.,
Hubeika, V., and Castaldo, F. (2009). Support vector machines and joint factor analysis for speaker
verification. In Proc. ICASSP, Taipei, Taiwan.
[Dehak et al., 2007] Dehak, N., Kenny, P., and Dumouchel, P. (2007). Modeling prosodic features with
joint factor analysis for speaker verification. IEEE Transactions on Audio, Speech and Language
Processing, 15:2095–2103.
[Dehak et al., 2008] Dehak, R., Dehak, N., Kenny, P., and Dumouchel, P. (2008). Kernel Combination
for SVM Speaker Verification. In Odyssey Speaker and Language Recognition Workshop 2008,
Stellenbosch, South Africa.
[Douglas A. Reynolds, 2000] Douglas A. Reynolds, Thomas F. Quatieri, R. B. D. (2000). Speaker
verification using adapted gaussian mixture models. Digital Signal Processing, pages 19–41.
[Ferrer et al., 2007] Ferrer, L., Sonmez, K., and Shriberg, E. (2007). A smoothing kernel for spatially
related features and its application to speaker verification. In Proceedings of Interspeech.
[Glembek et al., 2009] Glembek, O., Burget, L., Dehak, N., Br ummer, N., and Kenny, P. (2009).
Comparison of scoring methods used in speaker recognition with joint factor analysis. In Proc.
ICASSP, Taipei.
[Hatch et al., 2006] Hatch, A. O., Kajarekar, S., and Stolcke, A. (2006). Within-class covariance
normalization for svm-based speaker recognition. In Proceedings of Interspeech.
[J. Pelecanos, 2006] J. Pelecanos, S. S. (2006). Feature warping for robust speaker verification. In
Proceedings of Odyssey 2006: The Speaker and Language Recognition Workshop, pages 213–218.
[Kajarekar, 2008] Kajarekar, S. (2008). Phone-based cepstral polynomial SVM system for speaker
recognition. Proceedings of Interspeech 2008.
[Kenny, 2005] Kenny, P. (2005). Joint factor analysis of speaker and session variability : Theory and
algorithms - technical report CRIM-06/08-13. Montreal, CRIM, 2005.
[Kenny, 2006] Kenny, P. (2006). Joint factor analysis of speaker and session variability: Theory and
algorithms (draft version). IEEE Speech, Acoustics and Language Processing.
[Kenny, 2008] Kenny, P. (2008). Bayesian analysis of speaker diarization with eigenvoice priors.
[Kenny et al., 2005a] Kenny, P., Boulianne, G., and Dumouchel, P. (2005a). Eigenvoice modeling with
sparse training data. Speech and Audio Processing, IEEE Transactions on, 13(3):345–354.
[Kenny et al., 2005b] Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2005b). Factor
analysis simplified. In Proc. of the International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), pages 637– 640, Toulouse, France.
[Kenny et al., 2007a] Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2007a). Speaker and
Session Variability in GMM-Based Speaker Verification. IEEE Trans. Audio Speech and Language
Processing.
[Kenny et al., 2007b] Kenny, P., Boulianne, G., Oullet, P., and Dumouchel, P. (2007b). Joint factor
analysis versus eigenchannes in speaker recognition. IEEE Transactions on Audio, Speech, and
Language Processing, 15(7):2072–2084.
[Kenny et al., 2008a] Kenny, P., Dehak, N., Dehak, R., Gupta, V., and Dumouchel, P. (2008a). The
role of speaker factors in the nist extended data task. In Odyssey: The Speaker and Language
Recognition Workshop.
79
[Kenny et al., 2008b] Kenny, P., Dehak, N., Ouellet, P., Gupta, V., and Dumouchel, P. (2008b).
Development of the primary CRIM system for the nist 2008 speaker recognition evaluation. In
Proc. Iinterspeech, Brisbane.
[Kenny and Dumouchel, 2004] Kenny, P. and Dumouchel, P. (2004). Experiments in speaker verifi-
cation using factor analysis likelihood ratios. In Odyssey: The Speaker and Language Recognition
Workshop, pages 219–226.
[Kenny et al., 2008c] Kenny, P., Ouellet, P., Dehak, N., Gupta, V., and Dumouchel, P. (2008c). A
Study of Inter-Speaker Variability in Speaker Verification. IEEE Trans. Audio, Speech and Language
Processing, 16(5):980–988.
[Kenny et al., 2008d] Kenny, P., Ouellet, P., Dehak, N., Gupta, V., and Dumouchel, P. (2008d). A
Study of Inter-Speaker Variability in Speaker Verification. IEEE Transactions on Audio, Speech
and Language Processing.
[Lin et al., 2008] Lin, C.-J., Weng, R. C., and Keerthi, S. S. (2008). Trust region newton method for
logistic regression. J. Mach. Learn. Res., 9:627–650.
[MacKay, 2003] MacKay, D. (2003). Information theory, inference and learning algorithms. Cam-
bridge University Press, New York, NY.
[Matejka et al., 2008] Matejka, P., Burget, L., Glembek, O., Schwarz, P., Hubeika, V., Fapso, M.,
Mikolov, T., Plchot, O., and Cernocky, J. (2008). BUT language recognition system for nist 2007
evaluations. In Proc. Interspeech.
[Matejka et al., 2006] Matejka, P., Burget, L., Schwarz, P., and Cernocky, J. (2006). Brno University
of Technology System for NIST 2005 Language Recognition Evaluation. Speaker and Language
Recognition Workshop, 2006. IEEE Odyssey 2006, pages 1–7.
[Minka, 1998] Minka, T. (1998). Expectation-maximization as lower bound maximization. Technical
report, Microsoft.
[National Institute of Standards and Technology, 2008] National Institute of Standards and Technol-
ogy (2008). NIST speech group website. http://www.nist.gov/speech.
[Nocedal and Wright, 2006] Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer.
[Pelecanos and Sridharan, 2001] Pelecanos, J. and Sridharan, S. (2001). Feature Warping for Robust
Speaker Verification. In Speaker Odyssey, pages 213–218, Crete, Greece.
[Prince and Elder, 2006] Prince, S. and Elder, J. (2006). Tied factor analysis for face recognition
across large pose changes. Proceedings of the British Machine Vision Conference, 3:889–898.
[Reynolds et al., 2000] Reynolds, D., Quatieri, T., and Dunn, R. (2000). Speaker verification using
adapted Gaussian mixture models. Digital Signal Processing, 10(1/2/3):19–41.
[Schluter et al., 2001] Schluter, R., Macherey, W., Muller, B., and Ney, H. (2001). Comparison of
discriminative training criteria and optimization methods for speech recognition. Speech Commu-
nication, 34:287–310.
[Schwarz et al., 2004] Schwarz, P., Matějka, P., and Černocký, J. (2004). Towards lower error rates
in phoneme recognition. In International Conference on Text, Speech and Dialogue, pages 465–472.
[Schwarz et al., 2006] Schwarz, P., Matějka, P., and Černocký, J. (2006). Hierarchical structures of
neural networks for phoneme recognition. In Proc. of the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), pages 325–328, Toulouse, France.
80
[Sollich, 1999] Sollich, P. (1999). Probabilistic interpretation and bayesian methods for support vector
machines. In Proceedings of ICANN.
[Solomonoff et al., 2005] Solomonoff, A., Campbell, W. M., and Boardman, I. (2005). Advances in
channel compensation for SVM speaker recognition. In Proceedings of ICASSP.
[Strasheim and Brümmer, 2008] Strasheim, A. and Brümmer, N. (2008). SUNSDV system descrip-
tion: NIST SRE 2008. In NIST Speaker Recognition Evaluation Workshop Booklet.
[Tranter and Reynolds, 2006] Tranter, S. and Reynolds, D. (2006). An overview of automatic speaker
diarization systems. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1557–
1565.
[Vair et al., 2006] Vair, C., Colibro, D., Castaldo, F., Dalmasso, E., and Laface, P. (2006). Channel
factors compensation in model and feature domain for speaker recognition. IEEE Odyssey 2006
Speaker and Language Recognition Workshop.
[Vair et al., 2007] Vair, C., Colibro, D., Castaldo, F., Dalmasso, E., and Laface, P. (2007). Loquendo
- Politecnico di Torino’s 2006 NIST speaker recognition evaluation system. In Proceedings of Inter-
speech 2007, pages 1238–1241.
[Valente, 2005] Valente, F. (2005). Variational Bayesian methods for audio indexing. PhD thesis,
Eurecom.
[Vogt et al., 2005] Vogt, R., Baker, B., and Sridharan, S. (2005). Modelling session variability in
text-independent speaker verification. Interspeech, pages 3117–3120.
[Vogt et al., 2008a] Vogt, R., Baker, B., and Sridharan, S. (2008a). Factor analysis subspace estima-
tion for speaker verification with short utterances. In Interspeech, pages 853–856.
[Vogt et al., 2008b] Vogt, R., Lustri, C., and Sridharan, S. (2008b). Factor analysis modelling for
speaker verification with short utterances. In Odyssey: The Speaker and Language Recognition
Workshop.
81
View publication stats

Robust Speaker Recognition Over Varying Channels-R

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Robust Speaker Recognition Over Varying Channels-R

Uploaded by

Copyright:

Available Formats

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

Robust speaker recognition over varying channels–report from JHU workshop

Niko Brummer Douglas A. Reynolds

SEE PROFILE SEE PROFILE

Patrick Kenny Jason Pelecanos

SEE PROFILE SEE PROFILE

Speech Recognition View project

The user has requested enhancement of the downloaded file.

Report from JHU workshop 2008

(1) Brno University of Technology, Czech Republic,

1. significant increasing of the accuracy of current SID systems

3. speaker identification and verification from very short speech segments.

3 Factor analysis based approaches to speaker diarization 13

4 Factor analysis conditionning 20

6 Handling variability with support vector machines 50

7 Comparison of Scoring Methods used in Speaker Recognition with Joint Factor

8 Discriminative optimization of speaker recognition systems 65

9 Summary and conclusions 77

• acoustic environment – Office, car, airport, etc.

• transmission channel – Landline, cellular, VoIP, etc.

• differences in speaker voice – Aging, mood, spoken language, etc.

1.1 Role of NIST evaluations

• combine and improve techniques developed by individual sites

• well defined setup and evaluation framework

• Diarization is an important upstream process for real-world multi-speaker speech

– Baseline segmentation-agglomerative clustering

• Measure performance in terms of DER and effect on speaker detection

For more details, see Chapter 3.

1.2.2 Factor Analysis Conditioning

• A single FA model is sub-optimal across different conditions

• Eg.: different durations, phonetic content and recording scenario

Explore two approaches:

For more details, see Chapter 4.

1.2.3 SVM–JFA and fast scoring

• Redefinition of SVM kernels based on JFA

For more details, see Chapters 5, 6 and 7.

1.2.4 Discriminative System Optimization

For more details, see Chapter 8.

Mki = mk + Uk xi + Vk ys(i) + Dk z ks(i) (2.1)

2.1 Supervector model

the JFA model can be expressed succinctly in supervector form as

Mi = m + Uxi + Vys(i) + Dzs(i) (2.4)

2.2 Generative ML training

2.3 JFA operation

2.4 Gender dependency

Factor analysis based approaches to

3.2 Diarization Systems

3.2.1 Agglomerative Clustering System (Baseline)

0. Initialize leaf clusters of tree with speech segments.

3.2.2 Variational Bayes System

Here s is a randomly chosen speaker dependent supervector; m is a speaker independent supervector

On each iteration of Variational Bayes:

• For each speaker s:

3.2.4 Hybrid System

3.3 Experiment Design

3.3.2 Measures of Performance

Factor analysis conditionning

• Factor Analysis Combination Strategies

• Within Session Variability Modelling

• Multigrained Factor Analysis

4.2 A Phonetic Analysis

feature space feature space

phoneme I phoneme II phoneme III

UUT w2 = (UUT w)T (UUT w) = wT UUT UUT w = wT UUT w,

minimize J(w, ) = ||w̃||22 /2 + C m