Robust Speaker Recognition Over Varying Channels
Lukáš Burget 1 , Niko Brümmer 2 , Douglas Reynolds 3 , Patrick Kenny 4 , Jason Pelecanos 6 ,
Robbie Vogt 7 , Fabio Castaldo 5 , Najim Dehak 4 , Reda Dehak 12 , Ondřej Glembek 1 ,
Zahi N. Karam 3 , John Noecker Jr. 9 , Elly (Hye Young) Na 10 , Ciprian Constantin Costin 11 ,
Valiantsina Hubeika 1 , Sachin Kajarekar 8 , Nicolas Scheffer 8 ,
and Jan “Honza” Černocký (editor) 1
2. decreasing the dependency on the communication channel, the content of the message, and other factors negatively affecting SID performance.
Team members
Team Leader
Lukas Burget burget@fit.vutbr.cz Brno University of Technology
Senior Researchers
Niko Brummer niko.brummer@gmail.com Agnitio
Patrick Kenny pkenny@crim.ca Centre de Recherche en Informatique
de Montreal
Jason Pelecanos jwpeleca@us.ibm.com IBM
Douglas Reynolds dar@sst.ll.mit.edu MIT Lincoln Labs
Robbie Vogt r.vogt@qut.edu.au Queensland University of Technology
Graduate Students
Fabio Castaldo fabio.castaldo@polito.it Polytechnic University of Turin
Najim Dehak Najim.Dehak@crim.ca Ecole de Technologie Superieure
Reda Dehak reda@dehak.org EPITA
Ondrej Glembek glembek@fit.vutbr.cz Brno University of Technology
Zahi Karam zahi@mit.edu Massachusetts Institute of Technology
Undergraduate Students
John Noecker Jr. jnoecker@gmail.com Duquesne University
Elly (Hye Young) Na hna@gmu.edu George Mason University
Ciprian Constantin Costin cip123a@gmail.com The Alexandru Ioan Cuza University
Valiantsina Hubeika xhubei00@stud.fit.vutbr.cz Brno University of Technology
Affiliates
Sachin Kajarekar sachin@speech.sri.com SRI International
Nicolas Scheffer scheffer.nicolas@gmail.com SRI International
Acknowledgements
This research was conducted under the auspices of the 2008 Johns Hopkins University Summer Work-
shop, and partially supported by NSF Grant No IIS-0705708 and by a gift from Google, Inc.
BUT researchers were partly supported by European project AMIDA (IST-033812), by Grant Agency
of Czech Republic under project No. 102/05/0278 and by Czech Ministry of Education under project
No. MSM0021630528. Lukáš Burget was supported by Grant Agency of Czech Republic under
project No. GP102/06/383. The hardware used in this work was partially provided by CESNET
under projects Nos. 162/2005 and 201/2006.
Thanks to Tomáš Kašpárek (BUT), who provided the JHU team with excellent computer support and allowed for efficient use of the BUT computing cluster during the workshop.
Contents
1 Introduction 7
1.1 Role of NIST evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Sub-groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Diarization using JFA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.2 Factor Analysis Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.3 SVM–JFA and fast scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.4 Discriminative System Optimization . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Overview of JFA 10
2.1 Supervector model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Generative ML training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 JFA operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Gender dependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.4.5 Summary and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 Multigrained Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5.2 Multi-Grained Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5 Support vector machines and joint factor analysis for speaker verification 43
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.2 Joint Factor analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 SVM-JFA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.1 GMM Supervector space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.2 Speaker factors space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3.3 Speaker and Common factors space . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.1 Test set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.2 Acoustic features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.3 Factor analysis training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.4 SVM impostors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4.5 Within Class Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5.1 SVM-JFA: GMM supervector space . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5.2 SVM-JFA: speaker factors space . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.5.3 SVM-JFA: speaker and common factors space . . . . . . . . . . . . . . . . . . . 48
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.3.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.3.3 JFA Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.3.4 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.3.5 Hardware and Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.4.1 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Chapter 1
Introduction
The largest challenge to practical use of speaker detection systems is channel/session variability, where
“variability” refers to changes in channel effects between training and successive detection attempts.
Channel/session variability encompasses several factors:
• microphones – Carbon-button, electret, hands-free, array, etc.
• baseline systems that we were trying to extend and improve during the workshop
1.2 Sub-groups
The work in the workshop was split into four work-groups:
1 Annual NIST evaluations of speaker verification technology (since 1995) using a common paradigm for comparing technologies; see http://nist.gov/speech/tests/sre/
1.2.1 Diarization using JFA
Problem Statement
• At one level diarization depends on accurate speaker discrimination for change detection and
clustering
• JFA and Bayesian methods have the promise of providing improvements to speaker diarization
Goals
• Apply diarization systems to summed telephone speech and interview microphone speech
1.2.2 Factor Analysis Conditioning
Goals
• Build FA models specific to each condition and robustly combine multiple models
• Extend the FA model to explicitly model the condition as another source of variability
1.2.3 SVM–JFA and fast scoring
• The Support Vector Machine is a discriminative recognizer which has proved to be useful for SRE
• Parameters of generative GMM speaker models are used as features for linear SVMs (sequence kernels)
• We know Joint Factor Analysis provides higher-quality GMMs, but using these as-is in SVMs has not been so successful.
Goals
• Analysis of the problem
• Application of JFA vectors to recently proposed and closely related bilinear scoring techniques
which do not use SVMs
1.2.4 Discriminative System Optimization
• In both speech and language recognition, the classes (phones, languages) are modeled with generative models, which can be trained with copious quantities of data
• But in speaker recognition, our speaker GMMs have at best a few minutes of training data, typically from only one recording of the speaker
Goals
• Reformulate the speaker recognition problem as binary discrimination between pairs of recordings, which can be (i) of the same speaker, or (ii) of two different speakers
• We now have lots of training data for these two classes and we can afford to train complex
discriminative recognizers
Chapter 2
Overview of JFA
Joint factor analysis (JFA) is a two-level generative model of how different speakers produce speech and
how their (remotely) observed speech may differ on different occasions (or sessions). The hidden deep
level is the joint factor analysis part that models the generation of speaker-and-session-dependent
GMMs. The output level is the GMM generated by the hidden level, which in turn generates the
sequence of feature vectors of a given session.
The GMM part needs no further introduction. As is customary in speaker recognition, all of the
GMMs differ only in the mean vectors of the components [Reynolds et al., 2000]. The component
weights and the variances are the same for all speakers and sessions. The session-dependent GMM
component means are modeled as:

M_{ki} = m_k + U_k x_i + V_k y_{s(i)} + D_k z_{k,s(i)}    (2.1)
Here the indices are: k for the GMM component; i for the session; and s(i) for the speaker in session
i. The system hyperparameters are:
mk , speaker-and-session-independent mean vector;
Uk , rectangular channel-factor loading matrix ;
Vk , rectangular speaker-factor loading matrix ;
Dk , diagonal speaker-residual scaling matrix ;
The hidden speaker and session variables are:
xi , session-dependent vector of channel-factors;
ys , speaker-dependent vector of speaker-factors;
z ks , speaker-and-component-dependent vector of speaker-residuals.
Standard normal distributions are used as a prior for all of these hidden variables.
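As a concrete illustration, the generative model above can be sketched with NumPy. All dimensions here (component count, feature size, factor ranks) are arbitrary toy values, not the workshop configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

C, F = 8, 13        # toy GMM: C components, F-dimensional features
R_s, R_c = 5, 3     # toy speaker- and channel-factor ranks
D_sv = C * F        # supervector dimension

m = rng.normal(size=D_sv)             # speaker/session-independent means
V = rng.normal(size=(D_sv, R_s))      # eigenvoice matrix (speaker loadings)
U = rng.normal(size=(D_sv, R_c))      # eigenchannel matrix (channel loadings)
d = rng.uniform(0.1, 1.0, size=D_sv)  # diagonal of the residual scaling D

# hidden variables, each drawn from its standard normal prior
y = rng.standard_normal(R_s)          # speaker factors (one per speaker)
z = rng.standard_normal(D_sv)         # speaker residuals (one per speaker)
x = rng.standard_normal(R_c)          # channel factors (one per session)

# stacked (supervector) form of the component-mean model: M = m + Vy + Dz + Ux
M = m + V @ y + d * z + U @ x
```

Two sessions of the same speaker share y and z but draw fresh channel factors x, which is what makes M session-dependent.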
We refer to V as the eigenvoice matrix ; to U as the eigenchannel matrix ; and D as the residual scaling
matrix. By also stacking component-dependent vectors into larger vectors, which we shall refer to as
supervectors:
$$M_i = \begin{bmatrix} M_{1i} \\ M_{2i} \\ \vdots \end{bmatrix}, \qquad m = \begin{bmatrix} m_1 \\ m_2 \\ \vdots \end{bmatrix}, \qquad z_s = \begin{bmatrix} z_{1s} \\ z_{2s} \\ \vdots \end{bmatrix}. \qquad (2.3)$$
1. Train a universal background model (UBM) on a large database. The UBM plays two roles:
• Its component means are a good choice to use for the speaker-and-session-independent supervector m; and its variances and weights are a good choice to use for all speaker-and-session-dependent GMM variances and weights.
• It parametrizes a computationally efficient approximation to all GMM log-likelihoods,
used during training and operation of the JFA system. Specifically, all GMM log-
likelihoods are approximated by the EM-algorithm auxiliary function [Minka, 1998], often denoted 'Q-function' in the literature. Informally, given some GMM, we approximate
log p(data|GMM) ≈ Q(UBM, GMM, data). All further processing makes use of this ap-
proximation.
2. Train the eigenvoice matrix V with an EM algorithm designed to optimize a maximum likelihood
criterion over a database of as many speakers as possible. Pool multiple sessions per speaker, to
attenuate intersession variation.
3. Given V as obtained above and with D temporarily set to zero, train the eigenchannel matrix
U with a similar EM algorithm, over a database that has multiple sessions per speaker. This
data should be rich in channel variation. The Mixer Databases are very good for this purpose.
4. Finally (and optionally), train D, with a similar EM-algorithm, on some held-out data.
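The Q-function approximation used throughout the recipe above can be sketched as follows (a toy diagonal-covariance illustration of my own, not the workshop code): the frame responsibilities are computed once against the UBM and then reused to score any adapted GMM.

```python
import numpy as np

def log_gauss(X, means, var):
    """log N(x_t | mu_k, diag(var_k)) for all frames t and components k.
    X: (T, F), means: (C, F), var: (C, F) -> (T, C)."""
    diff2 = ((X[:, None, :] - means[None]) ** 2 / var[None]).sum(-1)
    return -0.5 * (X.shape[1] * np.log(2 * np.pi)
                   + np.log(var).sum(-1)[None] + diff2)

def q_approx(X, weights, ubm_means, var, gmm_means):
    """Q(UBM, GMM, data): EM auxiliary function with responsibilities
    taken from the UBM and component means from the adapted GMM."""
    lg = np.log(weights)[None] + log_gauss(X, ubm_means, var)
    gamma = np.exp(lg - lg.max(1, keepdims=True))
    gamma /= gamma.sum(1, keepdims=True)        # UBM responsibilities
    lg_gmm = np.log(weights)[None] + log_gauss(X, gmm_means, var)
    return float((gamma * lg_gmm).sum())        # auxiliary-function value
```

By Jensen's inequality the value returned never exceeds the exact GMM log-likelihood, which is why it serves as a cheap surrogate for it.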
2. Compute an approximation to the log-likelihood of the target speaker model, given
the test segment data, log p(test segment|M). Good approximations to use here in-
clude [Glembek et al., 2009]:
• The Q-function approximation, where the unknown nuisance variable x is integrated out,
see [Kenny et al., 2007b], equation 19.
• A linear simplification to the Q-function, where a MAP point-estimate of x is used. For
computational efficiency x is estimated relative to the UBM, i.e. with y = 0 and z = 0.
3. Compute the same approximation to the UBM log-likelihood, i.e. with y = 0 and z = 0. The raw
score (or raw log-likelihood-ratio) is now the difference between the target model log-likelihood
and the UBM log-likelihood.
4. Normalize the raw score by applying the following in order: (i) divide by the number of test
frames, (ii) z-norm, (iii) t-norm.
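The ordering in step 4 can be sketched as a small helper (my own illustration; the function and parameter names are hypothetical, and the cohort statistics are assumed to be precomputed):

```python
import numpy as np

def normalize_score(raw_llr, n_frames, znorm_mean, znorm_std, tnorm_scores):
    """Apply, in order: (i) frame normalization, (ii) z-norm, (iii) t-norm.

    znorm_mean/znorm_std: per-model statistics of frame-normalized scores
    against a z-norm impostor cohort (precomputed for the target model).
    tnorm_scores: this test segment's scores against a cohort of impostor
    models, already processed through steps (i) and (ii) the same way.
    """
    s = raw_llr / n_frames                       # (i) score per frame
    s = (s - znorm_mean) / znorm_std             # (ii) z-norm
    return (s - np.mean(tnorm_scores)) / np.std(tnorm_scores)  # (iii) t-norm
```

The order matters: z-norm statistics are collected on frame-normalized scores, and t-norm statistics on z-normed ones.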
2.4 Gender dependency
• Some, like the CRIM system at SRE'08, are trained from the UBM onwards on gender-dependent
data. This gives independent male and female systems, which can be used respectively for all-
male or all-female trials.
• Others, like the BUT system at SRE’08, are trained on mixed data, but then use gender-
dependent ZT-norm cohorts.
Chapter 3
Speaker Diarization
This chapter reports on work examining new approaches to speaker diarization. Four different sys-
tems were developed and experiments were conducted using summed-channel telephone data from the
2008 NIST SRE. The systems are a baseline agglomerative clustering system, a new Variational Bayes
system using eigenvoice speaker models, a streaming system using a mix of low dimensional speaker
factors and classic segmentation and clustering, and a new hybrid system combining the baseline sys-
tem with a new cosine-distance speaker factor clustering. Results are presented using the Diarization
Error Rate as well as by the EER when using diarization outputs for a speaker detection task. The
best configurations of the diarization systems produced DERs of 3.5–4.6%, and we demonstrate a weak correlation between EER and DER.
3.1 Introduction
Audio diarization is the process of annotating an input audio channel with information that attributes
(possibly overlapping) temporal regions of signal energy to their specific sources. These sources can
include particular speakers, music, background noise sources and other signal source/channel char-
acteristics. Diarization systems are typically used as a pre-processing stage for other downstream
applications, such as providing speaker and non-speech annotations to text transcripts or for adapta-
tion of speech recognition systems. In this work we are interested in improving diarization to aid in
speaker recognition tasks where the training and/or the test data consists of speech from more than
one speaker. In particular we focus on two-speaker telephone conversations and multi-microphone recorded interviews as used in the latest NIST Speaker Recognition Evaluation (SRE).1
This chapter reports on work carried out at the 2008 JHU Summer Workshop examining new
approaches to speaker diarization. Four different systems were developed and experiments were con-
ducted using data from the 2008 NIST SRE. Results are presented using a direct measure of diarization
error (Diarization Error Rate) as well as the effect of using diarization outputs for a speaker detection
task (Equal Error Rate). Finally we conclude showing the relation of DER to EER and summarize
the effective components common to all systems.
clustering system and a new hybrid system using elements of the baseline system and newly developed
speaker factor distances.
Each cluster is represented by a single full-covariance Gaussian. Since we have prior knowledge that two speakers are present in the audio, we stop when we reach two clusters.
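As an illustration of the kind of merge criterion such an agglomerative system can use, here is a GLR-style distance between two clusters, each modelled by a single full-covariance Gaussian (a sketch only; the exact criterion used by the baseline is not specified here):

```python
import numpy as np

def glr_distance(X1, X2):
    """GLR-style distance between two clusters of frames, each modelled
    by one full-covariance Gaussian; small values mean similar clusters."""
    def nll(X):
        n, d = X.shape
        cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        _, logdet = np.linalg.slogdet(cov)
        # maximized Gaussian negative log-likelihood of X under its own model
        return 0.5 * n * (logdet + d * (1.0 + np.log(2 * np.pi)))
    return nll(np.vstack([X1, X2])) - nll(X1) - nll(X2)
```

Agglomerative clustering then repeatedly merges the pair with the smallest distance until two clusters remain.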
The last stage is iterative re-segmentation with GMM Viterbi decoding to refine change points and
clustering decisions. Additionally, a form of Baum-Welch re-training of speaker GMMs using segment
posterior-weighted statistics can be used before a final Viterbi segmentation. This step was inspired by the Variational Bayes approach and is also referred to as "soft-clustering."
s = m + V y.
severe constraints on speaker supervectors. Although supervectors typically have tens of thousands of
dimensions, this representation constrains all supervectors to lie in an affine subspace of the supervector
space whose dimension is typically at most a few hundred. The subspace in question is the affine
subspace containing m which is spanned by the columns of V .
In the Variational Bayes diarization algorithm, we start with an audio file in which we assume there are just two speakers, and a partition of the file into short segments, each containing the speech of just
one of the speakers. This partitioning need not be very accurate. A uniform partition into one second
intervals can be used to begin with; this assumption can be relaxed in a second pass.
We define two types of posterior distribution which we refer to as speaker posteriors and segment
posteriors. For each of the two speakers, the speaker posterior is a Gaussian distribution on the vector
of speaker factors which models the location of the speaker in the speaker factor space. The mean
of this distribution can be thought of as a point estimate of the speaker factors and the covariance
matrix as a measure of the uncertainty in the point estimate. For each segment, there are two segment
posteriors q1 and q2 ; q1 is the posterior probability of the event that the speaker in the segment is
speaker 1 and similarly for speaker 2.
The Variational Bayes algorithm consists in estimating these two types of posterior distribution
alternately as explained in detail in [Kenny, 2008]. At convergence, it is normally the case that q1 and
q2 take values of 0 or 1 for each segment, but q1 and q2 are initialized randomly, so that the Variational Bayes algorithm can be thought of as performing a type of soft speaker clustering, as distinct from the hard decision making in the agglomerative clustering phase of the baseline system.
The Variational Bayes algorithm can be summarized as follows:
Begin:
• Partition the file into 1 second segments and extract Baum Welch statistics from each
segment
• Initialize the segment posteriors randomly
• No initialization is needed for the speaker posteriors
End:
• Baum-Welch estimation of speaker GMMs together with iterative Viterbi re-segmentation (as in the baseline system)
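The alternation at the heart of the algorithm can be sketched in a heavily simplified form (my own toy illustration: two speakers, 2-D "speaker factor" observations standing in for per-segment Baum-Welch statistics, and the speaker-posterior covariances of the real algorithm ignored):

```python
import numpy as np

rng = np.random.default_rng(3)

# toy "segments": 2-D speaker-factor observations from two true speakers
true_spk = np.array([[-2.0, 0.0], [2.0, 0.0]])
labels = rng.integers(0, 2, size=40)
obs = true_spk[labels] + rng.normal(scale=0.7, size=(40, 2))

q = rng.dirichlet([1.0, 1.0], size=40)  # random init of segment posteriors
prior_prec, noise_prec = 1.0, 2.0       # standard-normal prior; assumed noise

for _ in range(25):
    # speaker posteriors: Gaussian posterior mean per speaker, with each
    # segment weighted by its current posterior q
    prec = prior_prec + noise_prec * q.sum(axis=0)      # (2,)
    mean = noise_prec * (q.T @ obs) / prec[:, None]     # (2, 2)
    # segment posteriors: responsibility of each speaker for each segment
    ll = -0.5 * noise_prec * ((obs[:, None, :] - mean[None]) ** 2).sum(-1)
    q = np.exp(ll - ll.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)

hard = q.argmax(axis=1)  # at convergence q is near 0/1 per segment
```

Even from a random initialization, the alternation quickly saturates to a hard two-way split of the segments (up to a label swap), which is the soft-clustering behaviour described above.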
In the Variational Bayes system, 39-dimensional feature vectors derived from HLDA transforms of 13 cepstra (including c0) plus single, double and triple deltas are used. The cepstra were processed with short-term (300 frame) Gaussianization. For the re-segmentation, 13 un-normalized cepstra (c0–c12) were used. The eigenvoice analysis used 512-mixture GMMs and 200 speaker factors.
3.2.3 Streaming Systems
In this section we describe another way to combine speaker diarization and joint factor analysis. Speaker
diarization using factor analysis was first introduced in [Castaldo et al., 2008] using a stream-based
approach. This technique performs an on-line diarization where a conversation is seen as a stream of
fixed duration time slices. The system operates in a causal fashion by producing segmentation and
clustering for a given slice without requiring the following slices. Speakers detected in the current slice
are compared with previously detected speakers to determine if a new speaker has been detected or
previous models should be updated.
Given an audio slice, a stream of cepstral coefficients and their first derivatives are extracted.
With a small sliding window (about one second) a new stream of speaker factors (as described in the
previous section) is computed and used to perform the slice segmentation. The dimension of speaker
factor space is quite small (about twenty) with respect to the number used for speaker recognition
(about three hundred) due to the short estimation window.
In this new space, the stream of speaker factors is clustered, yielding a single multivariate Gaussian for each speaker. A BIC criterion is used to determine how many speakers there are in the slice. A Hidden Markov Model (HMM) is built with one state per speaker, each state using that speaker's Gaussian, and a slice segmentation is obtained via the Viterbi algorithm.
In addition to the segmentation, a Gaussian Mixture Model (GMM) in the acoustic space is created
for each speaker found in the audio slice. These models are used in the last step, slice clustering, where
we determine if a speaker in the current audio slice was present in previous slices, or is a new one.
Using an approximation to the Kullback-Leibler divergence, we find the closest speaker model built in
previous slices to each speaker model in the current slice. If the divergence is below a threshold the
previous model is adapted using the model created in the current slice, otherwise the current model
is added to the set of speaker models found in the audio.
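A sketch of this slice-clustering step, using the closed-form symmetric KL divergence between single diagonal-covariance Gaussians as a stand-in for the approximated divergence between the actual speaker GMMs (function names and the threshold are illustrative assumptions):

```python
import numpy as np

def sym_kl_diag(mu1, var1, mu2, var2):
    """Symmetric KL divergence between two diagonal-covariance Gaussians."""
    kl12 = 0.5 * np.sum(np.log(var2 / var1)
                        + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(var1 / var2)
                        + (var2 + (mu2 - mu1) ** 2) / var1 - 1.0)
    return kl12 + kl21

def assign_speaker(model, known, threshold):
    """Slice clustering: return the index of the closest previous speaker
    if within threshold, otherwise register `model` as a new speaker."""
    if known:
        dists = [sym_kl_diag(model[0], model[1], m[0], m[1]) for m in known]
        i = int(np.argmin(dists))
        if dists[i] < threshold:
            return i          # existing speaker: adapt model i in practice
    known.append(model)
    return len(known) - 1     # new speaker added to the set
```

Each model here is a (mean, variance) pair; in the real system the matched previous model would also be adapted with the current slice's model.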
The final segmentation and speakers found from the on-line processing can further be refined using
Viterbi re-segmentation over the entire file.
Figure 3.1: Level cutting and Tree clustering
the data corresponded to one of the speaker detection tasks in the 2008 SRE, so we could measure the
effect of diarization on EER. The test set consists of 2215 files of approximately five minutes duration
each (≈ 200 hours total). To avoid confounding effects of mismatched speech/non-speech detection
on the error measures, all diarization systems used a common set of reference speech activity marks
for processing.
by speech activity detection, are not used. Speaker error measures the percent of time a system
incorrectly associates speech from different speakers as being from a single speaker. In these results
we report the average and standard deviation DER computed over the test set to show both the
average as well as the variation in performance for a given system.
To measure the effect of diarization on a speaker detection task, we used the diarization output
in the recognition phase of one of the summed-channel telephone tasks from the 2008 SRE. In the
3conv-summed task, the speaker models are trained with three single channel conversations and tested
with a summed channel conversation. The diarization output is used to split the test conversation
into two speech files (presumably each from a single speaker) which are scored separately and the
maximum score of the two is the final detection score. A state-of-the-art Joint Factor Analysis (JFA)
3 DER scoring code available at www.nist.gov/speech/tests/rt/2006-spring/code/md-eval-v21.pl
speaker detection system developed by Loquendo [Vair et al., 2007] is used for all diarization systems.
Results are reported in terms of the equal error rate (EER).
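The EER reported throughout can be computed from lists of target and non-target trial scores; a minimal sketch (my own helper, not the workshop's scoring code):

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: operating point where the miss rate equals the
    false-alarm rate as the decision threshold is swept."""
    scores = np.concatenate([target_scores, nontarget_scores])
    is_target = np.concatenate([np.ones(len(target_scores)),
                                np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)
    is_target = is_target[order]
    # accepting everything above scores[i]: targets at or below the
    # threshold are misses, non-targets above it are false alarms
    miss = np.cumsum(is_target) / is_target.sum()
    fa = 1.0 - np.cumsum(1.0 - is_target) / (1.0 - is_target).sum()
    i = int(np.argmin(np.abs(miss - fa)))
    return 0.5 * (miss[i] + fa[i])
```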
3.4 Results
In Table 3.1 we present DER results for some key configurations of the diarization systems. Overall we
see that the final Viterbi re-segmentation significantly helps all diarization systems. For the baseline
system, it was further seen that the soft-clustering, inspired by the Variational Bayes system, reduces
the DER by almost 50%. The Variational Bayes system achieves similarly low DER when a second
pass is added that relaxes the first pass assumption of fixed one second segmentation. The streaming
system had the best performance out of the box, with some further gains with the non-causal Viterbi
re-segmentation. Disappointingly, the hybrid system did not achieve performance better than the original baseline. This may be due to the first-stage baseline clustering biasing the clusters too much
or the inability to reliably extract 200 speaker factors from the small amounts of speech in the selected
clusters.
Table 3.1: Mean and standard deviation of diarization error rates (DER) on the NIST 2008 summed
channel telephone data for various configurations of diarization systems.
System                              mean DER (%)   σ (%)
Baseline + Viterbi                       6.8        12.3
Baseline + soft-cluster + Viterbi        3.5         8.0
Var. Bayes                               9.1        11.9
Var. Bayes + Viterbi                     4.5         8.5
Var. Bayes + Viterbi + 2-pass            3.8         7.6
Stream                                   5.8        11.1
Stream + Viterbi                         4.6         8.8
Hybrid + Viterbi (level cut)            14.6        17.1
Hybrid + Viterbi (tree search)           6.8        13.6
Lastly, in Figure 3.2 we show EER for the 3conv-summed task for different configurations of
the above diarization systems. The end-point DER values of 0% and 35% represent using reference diarization and no diarization, respectively. We see that there is some correlation of EER to DER, but it is relatively weak. It appears that systems with a DER < 10% produce EERs within about 1% of the "perfect" diarization. To sweep out more points with higher DER, we ran the baseline system with no Viterbi re-segmentation (DER=20%). While the EER did increase to 10.5%, it was still better than the no-diarization result of EER=14.1%.
3.5 Conclusions
In this chapter we have reported on a study of several diarization systems developed during the 2008 JHU Summer Workshop. While each of the systems took a different approach to speaker diarization, we found that ideas and techniques proved out in one system could also be successfully applied to other systems. The Viterbi re-segmentation used in the baseline system was a very useful stage for the other systems. The idea of soft-clustering from the Variational Bayes approach was likewise incorporated into the agglomerative clustering baseline system, reducing the DER by almost 50%.
The best configurations of the diarization systems produced DERs of 3.5-4.6% on summed-channel
conversational telephone speech. We further examined the impact of using different diarization systems with varying DERs on a speaker recognition task. While there was some weak correlation of EER to DER, it was not as direct as one would like in order to optimize diarization systems using DER independently of the recognition systems that use their output. In future work we plan on applying these diarization systems to the interview recordings in the 2008 SRE. This new domain will present several new challenges, including variable acoustics due to microphone type and placement as well as different speech styles and dynamics between a face-to-face interviewer and interviewee.

Figure 3.2: EER vs DER for several diarization systems.
Chapter 4
Factor Analysis Conditioning
4.1 Introduction
Factor Analysis (FA) modelling [Kenny, 2006] is a popular and effective mechanism for capturing variabilities in speaker recognition. However, it is recognized that a single FA model is sub-optimal across different conditions, for example when modelling utterances of different durations, phonetic content and recording configurations.
In this chapter we begin to address these conditions by exploring two approaches: (1) building FA models specific to each condition and robustly combining multiple models, and (2) extending the FA model to explicitly model the condition as another source of variability. These approaches guide the study in four areas:
• A Phonetic Analysis
The work stemming from these themes exploits the use of phonetic information in both enrollment
and verification. Figure 4.1 presents the issue of phonetic variability across sessions. In the traditional
factor analysis system, the phonetic variability component is largely ignored and is modelled indirectly
as part of a larger within-session variability process, whether or not the phonetic instances were
observed in all utterances.
Section 4.2 provides an introductory study on the performance of phonetic events with the FA
type system. Section 4.3 discusses the use of different FA configurations (such as stacking and con-
catenation) and their effect on performance. The following section (Section 4.4) then investigates the
issue of factor analysis for varied utterance durations, and finally, Section 4.5 examines one of the
granularity assumptions of the implemented FA model.
Figure 4.1: A drawing indicating the breakdown of speech into phonetic categories in enrollment and
test.
Table 4.1: Performance of systems when trained and tested on broad phonetic categories.
                Vowel (Test)          Consonant (Test)
Enroll       EER (%)   Min DCF     EER (%)   Min DCF
Vowel          4.50    0.0208       12.47    0.0537
Consonant     10.72    0.0521        7.03    0.0336
Table 4.1 shows results on the 2008 NIST Speaker Recognition Evaluation [National Institute of Standards and Technology, 2008] using a standard factor analysis system trained on the broad phonetic groups as classified by the BUT Hungarian Phonetic Recognizer. This result, albeit an extreme example, demonstrates the challenge of mismatched phonetic content. For example, if only consonants are used to enroll and verify a speaker,
the EER is approximately 7% while if only vowels are used in verification, then the EER increases
to more than 12%. Phonetic mismatch is pronounced for short duration utterances and utterances
recorded with a different speech style.
Not only are there performance differences attributed to speech content across enrollment and verification, but there are also performance differences for different phones, as shown in Table 4.2.1
A follow-up plot (Figure 4.2) is provided using the data from Table 4.2 to present the performance
of broad phonetic categories versus their relative duration in the utterance. Interestingly, the vowels
tend to be the best performing, but they also comprise more of the speech in an utterance.
A final experiment examines the performance of fusing the systems from two different phonetic
events (optimally combined by linear fusion). The question this experiment attempts to address is
whether the linear score fusion of two vastly different phonetic categories is more beneficial than the
fusion of two similar phonetic classes. Figure 4.3 plots the performance of the score fusion of two
phone classes versus the total duration of the combined phonetic classes. Intuition would suggest that
phonetic diversity should help, but it was not observed to a significant degree in this experiment.
1 Note that the output from the Hungarian recognizer does not correspond to English phones and may be considered more as an audio tokeniser instead.
Table 4.2: Performance of systems when trained and tested on broad phonetic categories.
                                      DET 1              DET 3
Phoneme   Type       % of speech   EER (%)   DCF     EER (%)   DCF
E         vowel         18.93       12.16   0.0567     8.62   0.0419
O         vowel         10.71       14.57   0.0645    12.30   0.0558
i         vowel          6.85       16.73   0.0749    15.49   0.0696
A:        vowel          5.89       23.31   0.0876    21.79   0.0852
n         nonvowel       5.44       19.08   0.0779    17.23   0.0730
e:        vowel          4.73       25.31   0.0917    22.92   0.0866
k         stop           4.49       25.56   0.0926    22.26   0.0868
z         sibilant       4.25       29.73   0.0980    28.22   0.0971
o         vowel          3.01       25.53   0.0924    25.24   0.0926
t         stop           2.76       27.04   0.0956    24.92   0.0936
s         sibilant       2.74       30.73   0.0965    27.63   0.0908
f         sibilant       2.41       34.43   0.0998    31.42   0.0984
j         nonvowel       2.38       25.00   0.0918    22.41   0.0862
v         sibilant       2.35       33.66   0.1000    30.78   0.0992
m         nonvowel       2.29       21.18   0.0835    18.63   0.0782
S         sibilant       2.21       31.97   0.0959    31.74   0.0981
l         nonvowel       1.99       30.05   0.0974    29.91   0.0955
[Scatter plot: Performance (EER) %, 0–40, versus % of Speech, 0–20]
Figure 4.2: A plot of the phonetic performance of individual phones identified according to broad
phonetic category.
[Plot legend: "vowel with others", "vowel with vowel"]
Figure 4.3: A plot of the performance of fusing two phonetic events from within or across broad phone
categories.
Figure 4.4: Stacked vs. concatenated eigenvectors for 2 phonetic classes. The former enrich the model
by projecting statistics on both classes, thus increasing the rank. The latter produces a more robust
latent variable by tying the classes together, thus increasing the model size.
Joint Factor Analysis
Let us define the notation that will be used throughout this discussion. The JFA framework uses the distribution of an underlying GMM, the universal background model (UBM), with mean m0 and diagonal
covariance Σ0 . Let the number of Gaussians of this model be N and the feature dimension in each
Gaussian be F . A supervector is a vector of the concatenation of the means of a GMM: its dimension
is N F . The speaker component of the JFA model is a factor analysis model on the speaker GMM
supervector. It is composed of a set of eigenvoices and a diagonal model. Precisely, the supervector
ms of a speaker s is governed by,
ms = m0 + V y + Dz (4.1)
where V is a tall matrix of dimension N F × RS , and is related to the eigenvoices (or speaker loadings),
which span a subspace of low-rank RS . D is the diagonal matrix of the factor analysis model of
dimension N F × N F . Two latent variables y and z entirely describe the speaker and are subjected
to the prior N (0, 1). The nuisance (or channel/session) supervector distribution also lies in a low-
dimensional subspace of rank RC . The supervector for an utterance h with speaker s is
mh = ms + U x (4.2)
The matrix U , known as the eigenchannels (or channel loadings), has a dimension of N F × RC .
The loadings U , V , D are estimated from a sufficiently large dataset while the latent variables x, y,
z are estimated for each utterance.
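As an illustration, Equations 4.1 and 4.2 can be exercised directly in NumPy; all dimensions and matrices below are toy stand-ins rather than trained values:

```python
import numpy as np

# Toy dimensions; real systems use e.g. N = 2048 Gaussians and F = 39 features.
N, F = 8, 3
NF = N * F            # supervector dimension
R_S, R_C = 4, 2       # speaker and channel subspace ranks

rng = np.random.default_rng(0)

# Hyper-parameters (random here; in practice estimated from a large dataset)
m0 = rng.normal(size=NF)                      # UBM mean supervector
V = rng.normal(size=(NF, R_S))                # eigenvoices (speaker loadings)
U = rng.normal(size=(NF, R_C))                # eigenchannels (channel loadings)
D = np.diag(rng.uniform(0.1, 1.0, size=NF))   # diagonal residual model

# Latent variables with standard-normal priors; estimated per utterance in practice
y = rng.normal(size=R_S)
z = rng.normal(size=NF)
x = rng.normal(size=R_C)

m_s = m0 + V @ y + D @ z      # Eq. (4.1): speaker supervector
m_h = m_s + U @ x             # Eq. (4.2): utterance supervector
```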
Phonetic Decoder
The phonetic decoder used for these experiments is an open-loop Hungarian phone decoder from
BUT, Brno [Matejka et al., 2006]. The Hungarian language possesses a large phone set and enables
the modeling of more nuances than an English set. This has been particularly useful in language
identification tasks. For this work, we chose to cluster the phonemes into broader phonetic events.
We used two different clusterings, obtained in a supervised fashion from expert knowledge:
To build a phonetically conditioned system, for example a vowel system, we first extract the feature
vectors from an utterance corresponding to the occurrences of vowels in the phone transcription to
obtain phone-conditioned Baum-Welch statistics for the utterance. These statistics are used in exactly
the same fashion as described above to build a full JFA model with phone-conditioned speaker and
channel subspace matrices. The speaker and channel loadings will be subscripted by the notation
adopted for each event in Table 4.3 (for instance, VV will be the speaker loading for the vowel set).
Experimental Protocol
All experiments were performed based on the all trials condition from the NIST-SRE-2006 dataset.
The data set consists of 3616 target trials and 47452 non-target trials. Results are given in terms of
equal error rate (EER) and minimum detection cost function (mDCF) given by NIST.
The factor analysis model uses the following data sets for training:
• The UBM is trained on Switchboard and Mixer data. For simplicity we fixed the UBM for all
phonetic events.
• The eigenvoices and eigenchannels are trained in a gender-independent fashion on the NIST SRE
04 data set, consisting of 304 speakers and 4353 sessions. The diagonal model is trained on 359
utterances coming from 57 speakers from SRE 04 and 05.
• The score normalization data (Z- and Tnorm) was drawn from SRE 04 and 05 with around 300
utterances for each gender.
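The EER and minimum DCF metrics can be swept from raw trial scores as sketched below; this is only an illustrative implementation, not the official NIST scoring tool (the DCF weights C_miss = 10, C_fa = 1, P_target = 0.01 follow the usual NIST cost model):

```python
import numpy as np

def eer_and_min_dcf(target_scores, nontarget_scores,
                    p_target=0.01, c_miss=10.0, c_fa=1.0):
    """EER and minimum DCF obtained by sweeping the score threshold."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]          # sort trials by score
    n_tgt, n_non = labels.sum(), (1 - labels).sum()
    # P_miss(t): targets below the threshold; P_fa(t): non-targets above it
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / n_tgt])
    p_fa = np.concatenate([[1.0], 1.0 - np.cumsum(1 - labels) / n_non])
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[eer_idx] + p_fa[eer_idx]) / 2.0
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return eer, dcf.min()
```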
Concatenation
The first model space approach investigated consists of concatenating parameters of the speaker from
different phone sets. The following experiments investigate at which level this concatenation should
occur. Let us consider the 2-class phone-set {V, C} for this approach. The resulting model supervector
length will thus increase to 2N F . The main advantage of this method is that a single system is used
for the entire phone set.
• Eigenvector concatenation
Table 4.3: Results for the baseline system, as well as for each phonetic group are included. The results
of fusions across phonetic groupings are also shown. Results show that score-level combinations for
the two phonetic sets are similar, but fail to outperform the baseline. [SRE 06, all trials, DCF×10,
EER(%)].
We first concatenate the eigenvectors from different phonetic events during training and testing of the
speaker models. Under this model, the system will estimate a single set of latent variables x, y, z per
utterance, each of them being independent of the class.
m_s = [m0; m0] + [V_V; V_C] y + [U_V; U_C] x + blkdiag(D_G, D_G) z        (4.3)

where [·; ·] denotes vertical stacking. Here, the ranks of the subspaces are the same as in the baseline system, and the D_G matrix is a copy of the D matrix from the baseline system.
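The construction in Equation 4.3 amounts to stacking the per-class terms vertically; a minimal sketch, with a hypothetical helper name and toy matrices:

```python
import numpy as np

def concatenate_models(m0, V_v, V_c, U_v, U_c, D_g):
    """Build the Eq. (4.3) concatenated model for the 2-class {V, C} phone
    set: per-class terms are stacked vertically, so the supervector
    dimension doubles to 2NF while the latent variables y, x, z stay shared."""
    NF = m0.shape[0]
    m0_cat = np.concatenate([m0, m0])    # [m0; m0]
    V_cat = np.vstack([V_v, V_c])        # [V_V; V_C], same rank as baseline
    U_cat = np.vstack([U_v, U_c])        # [U_V; U_C]
    D_cat = np.zeros((2 * NF, 2 * NF))   # blkdiag(D_G, D_G)
    D_cat[:NF, :NF] = D_g
    D_cat[NF:, NF:] = D_g
    return m0_cat, V_cat, U_cat, D_cat
```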
The results in Table 4.4 (first three rows) show a significant degradation for the model concatenation-style combination. It seems that if the subspaces are trained separately, the projection on the
resulting concatenated subspace does not reflect the classes appropriately. This leads to the need to
retrain subspaces explicitly to be tied together. It is important to note that the concatenation of
the channel eigenvectors decreases the performance much more compared to the speaker eigenvectors.
This supports the hypothesis that eigenvoices should be the main focus when using a phonetic GMM
system.
For this experiment, the speaker and channel subspaces are retrained using the concatenated first-
and zero-order statistics from each phonetic event. The results in Table 4.4 show that this approach
performs close to the score-level combination, but fails to outperform it. However, the subspaces are
effectively tied so that a robust estimate of the latent variable can be produced. Consequently, a gain
is observed compared to the systems taken separately.
Tied factor analysis has been used successfully in other fields such as face recognition [Prince and Elder, 2006]. For this approach, the model is the same as in Equation 4.3, but
the eigenvectors for each phonetic event are trained so that the latent variables are tied between the
phonetic events. This approach should be successful for a phonetic system, as the amount of data
for each event can vary, especially for very short conditions. We applied the following algorithm until
convergence:
• Estimate the latent variables for the concatenated Baum-Welch statistics (like in 4.3.3).
• Estimate the matrices separately, on their respective statistics, by maximizing the likelihood of
the data with respect to the latent variables of the previous step.
Table 4.4 shows that retraining the subspaces by concatenating the statistics from each phone set
or by using tied factor analysis leads to similar performance. It seems the EM algorithm used for the
factor analysis model tends to tie the different phonetic events naturally.
Table 4.4: Eigenvector concatenation on the 2-class set. The speaker and the channel subspace used
are shown along with the concatenation type. Results show that the subspaces have to be retrained
to obtain decent performance, using the standard EM or a Tied Factor Analysis approach. [SRE 06,
all trials, DCF×10, EER(%)]
Stacking
Another approach in the model space consists in stacking the eigenvectors of the subspaces together. In
this approach, the dimension of the model remains constant while the rank of the subspaces increases.
This leads to running one system per event before combining them at the score-level.
• Eigenvector stacking
The advantage of this method is its robustness to different stacking configurations. Indeed, the
latent variable estimation is enriched with the information of other events while keeping a good
estimate for the current event. Let us consider two matrices from the 2-class phone set VV and
VC , and their respective latent variables yv , yc . This approach captures cross-correlation between
phonetic events when estimating the latent components. Stacking the eigenvectors for different events
is equivalent to performing a sum in the supervector space. For the 2-class set, the system is expressed
as:
m_h = m0 + [V_V V_C] [y_V; y_C] + [U_V U_C] [x_V; x_C] + D_G z        (4.4)

The D_G matrix is the one from the baseline system. The ranks of the resulting stacked matrices are 240 and 120 for the speaker and the channel, respectively.
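The equivalence between stacking and a sum in the supervector space can be checked numerically; the sketch below uses toy random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
NF = 12
V_v, V_c = rng.normal(size=(NF, 3)), rng.normal(size=(NF, 2))
y_v, y_c = rng.normal(size=3), rng.normal(size=2)

# Eq. (4.4) stacking: loadings are concatenated horizontally, so the model
# dimension stays NF while the subspace rank grows to R_V + R_C.
V_stacked = np.hstack([V_v, V_c])
y_stacked = np.concatenate([y_v, y_c])

# Stacking is a sum in the supervector space:
# [V_V V_C] [y_V; y_C] = V_V y_V + V_C y_C
assert np.allclose(V_stacked @ y_stacked, V_v @ y_v + V_c @ y_c)
```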
• Stacking in the speaker space and channel space
Stacking the channel eigenvectors was already demonstrated to be successful for a different set of
microphones [Kenny et al., 2008c]. Stacking the speaker eigenvectors should be suitable for a phonetic
GMM system for two reasons. Firstly, speaker modeling should profit from correlations between
phonetic events. Secondly, using subspaces from all phonetic events when evaluating a single phonetic
event should increase robustness to errors of the phonetic decoder.
Similarly to the concatenation experiments, results in Table 4.5 tend to show that the relevant
information is contained in the speaker space as stacking in the channel space degrades the results.
This means that a global channel matrix can be estimated and successfully applied to all events.
Therefore, we only present this configuration for the 4-class set. Stacking the speaker eigenvectors is
a strategy that outperforms the score-level combination and gives the results similar to the baseline
non-phonetic system. There is no observed improvement by using the 4-class set over the 2-class one.
Table 4.5: System combination using stacked eigenvectors for the speaker space, channel space or
both. The matrices selected in each configuration are specified. Results tend to show that the relevant
information is contained in the speaker space, as stacking the speaker loadings gives better results
than the score-level fusion. [SRE 06, all trials, DCF×10, EER(%)]
In section 4.3.3, we showed that stacking the matrices for each phonetic event was a successful
approach for a phonetic-based system. One disadvantage of this method, compared to the method of
Section 4.3.3, is the need to run one system for each event.
The phonetic subspaces can, however, be used to generate large factor loading matrices. In the
protocol, around 300 speakers are used to train the eigenvoice matrix. This is also the maximum
number of eigenvoices that can be estimated. For the 4-class phone set, the system has a rank of 480
for the speaker space. This number of eigenvectors cannot be estimated from our data set. However, it
is interesting to use this large eigenvoice matrix for the baseline non-phonetic system (channel matrices
are not used here following the results in Table 4.5). Under this scenario, the standard (non-phonetic)
statistics will be presented to the system while the stacked matrices coming from different phonetic
events are used as eigenvoices. The channel matrix used is the one from the baseline system.
Table 4.6: Performance of the stacked eigenvoices generated from different phonetic events on a non-
phonetic system. Stacked eigenvoices from the 4-class set outperform the baseline. [SRE 06, all trials,
DCF×10, EER(%)]
Results in Table 4.6 show that stacking eigenvoices derived from different phonetic events can be useful for improving performance over the standard baseline system. Using more classes may also improve the performance of the stacked system: indeed, the stacked eigenvoices from the 4-class set outperform both the baseline non-phonetic system and the 2-class system.
4.3.4 Conclusion
This work3 aims to take advantage of the recent developments in Joint Factor Analysis in the context of
a phonetically conditioned GMM speaker verification system. We focused on strategies for combining
the phone-conditioned systems. Our first approach was to perform JFA per class and combine the
systems at the score-level. Our hypothesis is that this approach does not use the data efficiently as the
3 The work, by authors at SRI International, was funded through a development contract with Sandia National Laboratories (#DE-AC04-94AL85000). The views herein are those of the authors and do not necessarily represent the views of the funding agencies.
performance is worse than the baseline. We later employed strategies in the model space that more
robustly estimate the latent variables by taking into account all phonetic events. In section 4.3.3, we
showed that the concatenation of eigenvectors could lead to decent performance provided that the
subspaces are explicitly retrained on the concatenated statistics. In section 4.3.3, we showed that
both factor concatenation and score-level fusion could be outperformed by stacking eigenvectors from
different phonetic events. For the phonetic system, stacking the eigenvoices leads to the greatest
improvement. We also proposed to use this large set of eigenvoices on the baseline system and showed
that it could result in a slight improvement over the traditional baseline system.
While the focus of this work is on phone-conditioned JFA systems, the implications may lead to a
better understanding of the JFA model and a methodology that can be applied to increase robustness
to other kinds of conditions such as language, gender and microphones. Future work will focus on
understanding the differences and overlaps between the global and per-class estimates, in the channel
and the speaker space, and methods to extract more information for a more robust estimate of speaker
models.
System 1 conv 60 sec 20 sec 10 sec
Baseline .0442 .0456 .0608 .0752
Speaker only .0422 .0434 .0571 .0727
Session only .0305 .0373 .0702 .0857
Speaker & Session .0295 .0350 .0671 .0880
Table 4.7: DCF on the female subset of the 2005 NIST SRE common evaluation condition for systems
with and without channel compensation. From [Vogt et al., 2008b].
Subspace Training
System V U EER Min. DCF
Full-length 1 conv 1 conv 13.47% 0.0544
Matched 20 sec 20 sec 12.04% 0.0498
Matched Session 1 conv 20 sec 11.70% 0.0493
Table 4.8: EER and minimum DCF on a modified 20 second train/test condition for the female subset
of the 2005 NIST SRE. Results are presented for systems using subspaces trained on different length
segments. From [Vogt et al., 2008a].
conditions rather than the speaker subspace; additionally, matching the training for the speaker subspace to the evaluation conditions results in degraded performance compared to matching the channel subspace training alone.
From these results it was concluded that the inter-session variability captured in the subspace of
U is actually dependent on the length of the utterances used to train the subspace. More specifically,
shorter utterances show an increase in overall session variability as shown by the measured trace of
the session subspaces for differing lengths in Table 4.10.
Table 4.9: Minimum DCF on the female subset of the 2005 NIST SRE common evaluation for reduced
utterance length conditions. From [Vogt et al., 2008a].
Utt. Length 1 conv 80 sec 40 sec 20 sec 10 sec
tr(U U ∗ ) 105.7 116.9 148.8 213.0 329.8
Table 4.10: Trace of the session subspace covariance with U trained with different length utterances.
From [Vogt et al., 2008a].
This is problematic firstly because we are thus required to train specialized session subspaces for
the range of utterance lengths we are interested in using to extract optimal performance from the JFA
model, but, more importantly, it implies that our assumptions about the nature of the inter-session
variability are flawed.
As noted above, the improved performance of the matched session subspace system indicates that
the characteristics are different, but what exactly is this difference? Looking at the increasing session
variability with shorter utterances, it seems that the consistent, stationary environmental factors may
well still be present as utterances become shorter, but an additional source of variability is becoming
more apparent with reducing utterance length.
One hypothesis for this extra captured variability is the variability introduced by the speech content, that is, the phonetic information encoded in the speech. For the text-independent speaker recognition task, phonetic variation is unwanted variability in general (although it is possible to produce
better, more accurate speaker models that are conditioned on phonetic context; this is the approach
taken for the conditioned Factor Analysis work).
The effect of phonetic content of an utterance on a speaker model will be more pronounced as
training utterances become shorter. Over the typical NIST conversation lengths, there is likely to be
a reasonable coverage of the phonetic space and the effects of phonetic variability will largely average
out. For utterances of only a few seconds in length, however, there will be very poor coverage of the
phonetic space, and differences in the particular observed phones will cause large differences in the
produced speaker model estimate.
Including within-session variability, the complete model for a short segment n of an utterance is

s_n = m + V y + d z + U_I x + U_W w_n

While y, z and x are all held constant for the entire utterance, there will be an independent w_n for each short segment n.
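The extended model can be sketched as follows; the sizes are toy stand-ins (the experiments below use 50 inter-session and 10 within-session factors):

```python
import numpy as np

rng = np.random.default_rng(1)
NF, R_S, R_I, R_W = 24, 5, 4, 2      # toy sizes

m = rng.normal(size=NF)
V = rng.normal(size=(NF, R_S))
d = rng.uniform(0.1, 1.0, size=NF)   # diagonal residual, stored as a vector
U_I = rng.normal(size=(NF, R_I))     # inter-session subspace
U_W = rng.normal(size=(NF, R_W))     # within-session subspace

# y, z and x are held constant for the whole utterance ...
y, z, x = rng.normal(size=R_S), rng.normal(size=NF), rng.normal(size=R_I)

# ... while each short segment n draws an independent w_n
w = [rng.normal(size=R_W) for _ in range(3)]
s = [m + V @ y + d * z + U_I @ x + U_W @ w_n for w_n in w]

# Segment-to-segment differences lie entirely in the within-session subspace
assert np.allclose(s[0] - s[1], U_W @ (w[0] - w[1]))
```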
4.4.3 Implementation
Several systems were developed for comparison in this work:
1. A baseline JFA system, using the standard JFA model.
3. A system implementing the extended JFA model incorporating the within-session variability
modelling.
Details of these systems are presented in the following sections.
[Figure 4.5 plot: "Approx Magnitude of JFA Subspaces"; leading eigenvalues on a log-log scale for the Speaker, Inter-Session and Phonetic subspaces.]
Figure 4.5: Leading eigenvalues of the speaker, inter-session and within-session variability. The within-
session subspace was trained on segments aligned to open-loop phone recogniser transcripts.
4.4.4 Experiments
A system implementing the extended JFA model with within-session variability was evaluated and
compared against the standard and matched-U JFA systems on the NIST SRE 2006 core, 1conv4w-
1conv4w condition. To investigate the performance of the systems with reduced utterance lengths, this
same 1conv4w-1conv4w condition was again utilised; however, both the training and testing utterances
were truncated to produce shorter utterances. 20-second and 10-second conditions for both training
the within-session variability is dominated by phonetic information.
[Figure 4.6 plot: "Approx Magnitude of JFA Subspaces"; eigenvalues on a log-log scale for the Speaker and Inter-Session subspaces and for Phonetic (within-session) subspaces trained on 1 sec, 10 sec and conversation-side segments.]
Figure 4.6: Approximate effective within-session variability for a range of utterance lengths compared
to speaker and inter-session variability.
Table 4.11: Comparison of EER performance for the standard JFA model, matched-length session JFA
model and the extended JFA model incorporating within-session variability modelling on the SRE 06
common evaluation condition.
and testing were added in this way. From the results in [Vogt et al., 2008b] and [Vogt et al., 2008a],
this 10 to 20 seconds range appears to be the range at which the effectiveness of the standard JFA
model is diminishing.
Tables 4.11 and 4.12 present EER and minimum DCF results, respectively, comparing the variants
of the JFA model for the full conversation side and 20- and 10-second truncated training and testing
conditions. All results are English-language trials only. The first and second rows in each table use
the standard JFA model with 50 and 60 session factors, respectively. The third row shows the results
with U matched to the length of utterance used for training and testing. The last row of each table
includes within-session variability modelling with 50 inter-session factors and 10 within-session factors.
As reported in [Vogt et al., 2008a], matching U to the evaluation conditions provides an advantage
over the standard JFA model. The matched system provided better performance in all short conditions
over the baseline, although the improvement for the 10-sec condition is quite modest.
Incorporating within-session variability modelling largely produced similar results to the matched-
U approach, improving on the standard JFA system for all shortened utterances. Additionally, at the
EER operating point this approach gave the best performance at each utterance length, although only
by a small margin. Results were less clear-cut when measured by minimum DCF.
From these results it can be seen that the introduction of within-session factors at least achieved
JFA Model            Dims          1 conv   20 sec   10 sec
U + V + d            50            .0159    .0561    .0819
U + V + d            60            .0156    .0562    .0820
U_Matched + V + d    50            .0159    .0531    .0814
U_I + U_W + V + d    50_I + 10_W   .0170    .0541    .0807
Table 4.12: Comparison of minimum DCF performance for the standard JFA model, matched-length
session JFA model and the extended JFA model incorporating within-session variability modelling on
the SRE 06 common evaluation condition.
Table 4.13: Comparison of EER performance for the standard JFA model, matched-length session JFA
model, a stacked session model and the extended JFA model incorporating within-session variability
modelling on the SRE 06 common evaluation condition with whole conversation side training and
truncated utterances for testing.
one of the stated goals of producing a system that could be effective over a wide range of utterance
lengths. While the matched system used a distinct U matrix for each utterance length tested, the
parameters of the within-session modelling system were consistent across all trials. Thus, the within-
session modelling approach provides a practical advantage over the standard JFA model through its
flexibility.
The second goal of improving performance through more accurately modelling the unwanted variability has not been convincingly achieved with these results. Several factors may contribute to this
outcome. Firstly, the choice of segmentation may not be optimal, but more importantly, the approach
to estimating the subspaces of the extended model used for these experiments was not at all tailored
to the extended model. It is expected that the extended model should at the very minimum require
adjustment to the values of d as less information will be explained as “residual” variability with the
inclusion of within-session factors. The effects of including within-session modelling on the speaker
and inter-session subspaces must also be investigated. Future investigation of segmentation choice
and proper integration of within-session modelling in the subspace estimation process may lead to
significant improvements in performance of this extended model.
Following on from the above results, where utterances of the same length were used for both training and testing, an added complication is introduced when the training and testing utterance lengths
differ. In this case, the optimal matrix U is different for training and testing. Tables 4.13 and 4.14
present results evaluated with a whole conversation for training and 20 or 10 second testing utterances.
Again in these tables the first row is the baseline approach using standard JFA model. The results
in the second row represent a system with U matched to the utterance length for both training and testing.
In this case, due to the full conversation for training and truncated utterances for testing, U differs
from training to testing. Interestingly, while the matched-U approach worked quite well with the
same utterance lengths for both training and testing, it causes a degradation in performance in all
measures compared to the baseline system. Mismatch between the U for training and testing is the
most likely cause of this performance degradation.
To overcome the issue of differing U between training and testing while matching the session
subspace to the utterance length, a stacking approach was investigated. Under this approach, a larger
session subspace was constructed by concatenating the two session matrices matched to both the
JFA Model            Dims          20 sec   10 sec
U + V + d            50            .0293    .0433
U_Matched + V + d    50            .0305    .0441
U_Stacked + V + d    100           .0275    .0421
U_I + U_W + V + d    50_I + 10_W   .0290    .0414
Table 4.14: Comparison of minimum DCF performance for the standard JFA model, matched-length
session JFA model, a stacked session model and the extended JFA model incorporating within-session
variability modelling on the SRE 06 common evaluation condition with whole conversation side training
and truncated utterances for testing.
training and testing conditions. That is, for a 1conv training, 10 second test condition, the U used for
training and testing consists of concatenated matrices matched to the 1 conv and 10 second utterance
lengths. This approach has been successfully employed previously for mixed telephone and distant
microphone conditions in recent SRE.
The third row of Tables 4.13 and 4.14 demonstrates that this stacking approach provides an improvement in all cases over the baseline system, regaining the advantage of the matched approach observed previously, although, again, these gains are modest.
Finally, the last row in Tables 4.13 and 4.14 present the performance of incorporating within-session
modelling. As with the stacking approach, the extended model provides improved performance over
the baseline system in all cases, except for the 10-sec EER where the two are equivalent. The extended
approach is also competitive with the stacked approach as they each provide the best performance
depending on the condition and performance measure.
The results for these experiments again highlight the ability of the extended JFA model to provide
competitive performance across a wide range of operating conditions without having to adjust model
parameters. This flexibility is a major advantage of this approach, especially for situations in which
it is not possible to know the training and testing utterance lengths prior to evaluation or, as in this
case, the utterance lengths are not consistent for training and testing.
and fixed length segments in the range of 0.1–1 seconds of active speech to provide more consistency in
the estimates of wn and less dependency on speech recognition tools. Methods of incorporating within-
session variability in estimating the speaker and inter-session subspaces will also be examined. Previous
work has shown that slight variations in the subspace estimation procedure can make significant
performance differences for the standard JFA model; it is likely that this effect is exacerbated for the
extended model.
Figure 4.7: A hypothetical scenario of two different complexity models used to account for feature
distortions.
the GMM on the left could potentially be modified to a greater extent than by the GMM on the right.
In addition, the GMM on the left may be able to compensate for significantly large channel effects,
while the GMM on the right would tend to compensate for more fine-grained distortions.
A multi-grained model may be able to compensate for session effects that cause large regional
variability and yet also handle small localized distortions. The work regarding the multi-grained
analysis is inspired by Chaudhari et al. [Chaudhari et al., 2000].
We apply a multi-grained framework to both the NAP and the factor analysis approaches to
demonstrate their utility. Please refer to [Campbell et al., 2006a] for the NAP specific details of the
method and to [Kenny, 2006] for the factor analysis specifics. In brief, Figure 4.8 shows the general
process. A low complexity NAP-GMM setup is used to transform the raw features to produce Level-1
(L1) features. These features are further transformed by a more complex Level-2 (L2) NAP-GMM
structure. Note that the features generated by the L2-model may be passed into an SVM or other
alternate classifier.
Although this diagram indicates the case for feature based compensation of the NAP statistics
using models of differing granularity, it was also applied such that the sufficient statistics could be
changed using a factor analysis type model (see Figure 4.9).
With the previous method, NAP was applied in two stages: first, a low complexity model was used to enhance the features; second, a higher complexity model performed an additional NAP transformation on the sufficient statistics that were then scored.
For the NAP case, a feature vector x may be compensated to give x̂ according to the following equation [Vair et al., 2006]:

x̂ = x − Σ_{i=1}^{N} Pr(i|x) S_i        (4.5)
where

Pr(i|x) = w_i g(x|µ_i, Σ_i) / Σ_{j=1}^{N} w_j g(x|µ_j, Σ_j)        (4.6)
The conditional probability of mixture component i given observation x is given by Pr(i|x). The function g(x|µ_i, Σ_i) is the multivariate probability density function of feature vector x for a given mean (µ_i) and diagonal covariance (Σ_i) of mixture component i.
The parameter vector S_i represents the session nuisance contribution for mixture component i and may be calculated as follows:

S_i = (1/√w_i) Σ_i^{1/2} V_i V^T φ        (4.7)
Note that V_i is the sub-matrix of V referring to the nuisance contribution of mixture component i. Specifically, V_i refers to row [(i − 1) × d + 1] through to row [i × d] of matrix V. In addition, the
L2 NAP-GMM is directly scored rather than being used for generating additional features.
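Equations 4.5 to 4.7 can be combined into a single compensation routine; the sketch below is an illustrative NumPy implementation in which the session vector φ (phi) is assumed to have been estimated elsewhere:

```python
import numpy as np

def nap_compensate(x, w, mu, var, V, phi):
    """Sketch of the feature-domain compensation of Eqs. (4.5)-(4.7).

    x   : (F,) raw feature vector
    w   : (N,) GMM mixture weights
    mu  : (N, F) means;  var : (N, F) diagonal covariances
    V   : (N*F, R) nuisance loadings; V_i occupies rows i*F:(i+1)*F
    phi : (N*F,) session nuisance vector, assumed estimated elsewhere
    """
    N, F = mu.shape
    # Eq. (4.6): posteriors Pr(i|x), computed in the log domain for stability
    log_p = np.array([np.log(w[i])
                      - 0.5 * np.sum(np.log(2 * np.pi * var[i])
                                     + (x - mu[i]) ** 2 / var[i])
                      for i in range(N)])
    post = np.exp(log_p - log_p.max())
    post /= post.sum()
    # Eq. (4.7): S_i = (1/sqrt(w_i)) * Sigma_i^(1/2) * V_i * V^T * phi
    proj = V.T @ phi
    S = np.stack([np.sqrt(var[i] / w[i]) * (V[i * F:(i + 1) * F] @ proj)
                  for i in range(N)])
    # Eq. (4.5): subtract the posterior-weighted session offsets
    return x - post @ S
```

With φ = 0 the routine leaves the features untouched, which is a convenient sanity check.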
Figure 4.8: The procedure for performing a two-step Nuisance Attribute Projection.
[Figure 4.9 diagram: raw features feed an L1 global all-speech model that finds FA statistics and compensates them for session effects; the compensated statistics then feed L2 phone-class-specific models (VOW, NON, STP, SIB), which again find and session-compensate FA statistics for use in a kernel or model.]
Figure 4.9: The procedure for performing a two-step factor analysis compensation, firstly using the
statistics over the entire utterance followed by the phone-group specific compensation.
4.5.3 Results
The speaker recognition system is based on a GMM based kernel structure [Reynolds et al., 2000,
Campbell et al., 2006a]. All output scores have ZT-Norm [Reynolds et al., 2000,
Auckenthaler et al., 2000] (enrollment and test utterance score normalization) applied.
We first evaluate the multigrained NAP feature compensation approach. This consists of a 256
mixture component NAP system that is used to transform the cepstral-based features for use by a 1024
mixture component secondary NAP system. The secondary NAP system also incorporates the scoring
(dot-product) component. The results of the multigrained NAP feature compensation approach are
presented in Table 4.15 for the NIST 2008 SRE. Conditions 7 and 8 represent the telephony audio
5 The equations are omitted at this time.
Table 4.15: The NIST 2008 results with and without the multi-grained analysis.

Task                                       Condition 7   Condition 8
Base System with NAP                       0.179         0.182
Base System with Multigrained NAP          0.175         0.166
Broad Phone System with NAP                0.212         0.209
Broad Phone System with Multigrained NAP   0.206         0.190
Table 4.16: The NIST 2006 results with and without the multi-grained analysis compared for broad phonetic groupings.

                             DET 3 - Base          DET 3 - ZTNorm
System         Phone Type    Min DCF   EER         Min DCF   EER
Baseline       NonVowel      0.0888    24.04%      0.0413     9.05%
               Sibilant      0.0988    30.28%      0.0584    13.05%
               Stop          0.0993    33.33%      0.0631    13.81%
               Vowel         0.0604    11.26%      0.0201     3.97%
Hierarchical   NonVowel      0.0852    23.24%      0.0420     9.53%
               Sibilant      0.0994    28.93%      0.0585    14.20%
               Stop          0.0991    33.27%      0.0655    14.63%
               Vowel         0.0482    10.29%      0.0206     3.91%
Baseline       Consonant     0.0839    20.48%      0.0323     6.28%
               Vowel         0.0604    11.26%      0.0201     3.97%
Hierarchical   Consonant     0.0777    18.26%      0.0312     6.45%
               Vowel         0.0482    10.29%      0.0206     3.91%
evaluation for all English trials and all native English trials, respectively. This table includes results for two types of systems: a standard GMM system and a broad phone system. The standard GMM system is of the same configuration as described earlier in this section. The broad phone system
consists of using the compensated features generated by the 256 mixture component system and then
having broad-phone models (which are also NAP compensated) trained on these new features. The
scores from the broad phone models are combined in late fusion. The results show a small improvement
in the performance of the multigrained NAP systems over the standard NAP baselines.
In another experiment (see Table 4.16), the multigrained FA system was evaluated on the NIST 2006 SRE. This consisted of compensating the sufficient statistics using a general GMM and then further compensating them using a broad-phone-specific system. The multigrained approach demonstrated consistent improvements for the 'DET 3-Base' result, but once ZT-Norm was applied ('DET 3-ZTNorm'), the observed benefits were lost. Note also that the fusion of multiple phone systems did not demonstrate an improvement. Effectively compensating the sufficient statistics of the FA model in a multigrained manner remains a challenging task.
4.5.4 Conclusions
This section presented a multi-grained approach to address certain limitations in current compensation
models. Results indicate some gains with the potential for the method being applied to other session
modelling approaches.
4.6 Summary
This chapter presented some of the efforts performed at the JHU workshop on the topic of Factor Analysis Conditioning. The work covered four main areas: a phonetic analysis, Factor Analysis combination strategies, within-session variability modelling, and multigrained Factor Analysis.
Results demonstrate that a conditioned FA model can provide improved performance and that score-level combination may not always be the best method. Including within-session factors in an FA model can reduce the sensitivity to utterance duration and phonetic content variability. Stacking factors across conditions or data subsets can provide additional robustness. Hierarchical modelling for NAP/Factor Analysis also shows promise. These approaches also have applicability to other condition types, such as different languages and microphone types.
Chapter 5
This chapter presents several techniques for combining Support Vector Machines (SVMs) with the Joint Factor Analysis (JFA) model for speaker verification. In this combination, the SVMs are applied to different sources of information produced by the JFA: the Gaussian Mixture Model supervectors and the speaker and common factors. We found that the use of JFA factors gave the best results, especially when within-class covariance normalization is applied in the speaker factor space to compensate for the channel effect. The new combination results are comparable to those of classical JFA scoring techniques.
5.1 Introduction
During the last three years, the Joint Factor Analysis (JFA) [Kenny et al., 2008d] approach has become the state of the art in the speaker verification field. This modeling was proposed to deal with speaker and channel variability in the Gaussian Mixture Model (GMM) [Douglas A. Reynolds, 2000] framework.
At the same time, the application of the Support Vector Machine (SVM) in the GMM supervector space [Campbell et al., 2006a] obtained interesting results, especially when nuisance attribute projection (NAP) was applied to deal with the channel effect. In this approach, the kernel used is based on a linear approximation of the Kullback-Leibler (KL) distance between two GMMs. The speaker GMM mean supervectors were obtained by adapting the Universal Background Model (UBM) supervector to the speaker frames using Maximum A Posteriori (MAP) adaptation [Douglas A. Reynolds, 2000].
In this chapter, we propose to combine the SVM with JFA. We tried two types of combination: the first uses the GMM supervector obtained with JFA as input to the SVM, with the classical linear KL kernel between two supervectors. The second, rather than using the GMM supervectors as features for the SVM, directly uses the information given by the speaker and common factor components (see section 5.2) defined by the JFA model.
The outline of the chapter is as follows. Section 5.2 describes the factor analysis model. In section 5.3, we present the JFA-SVM approach and describe all the kernels used to implement it. The comparison between different results is presented in section 5.5. Section 5.6 concludes the chapter.
dimension F. The GMM for a target speaker is obtained by adapting the UBM mean parameters. In joint factor analysis [Kenny et al., 2008d, Kenny et al., 2007b, Kenny et al., 2007a], the basic assumption is that a speaker- and channel-dependent supervector M can be decomposed into the sum of two supervectors: a speaker supervector s and a channel supervector c,

M = s + c.    (5.1)
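The excerpt gives only the top-level decomposition (5.1). In the usual JFA parameterization, which the truncated text does not spell out, the speaker supervector is itself s = m + Vy + Dz and the channel supervector is c = Ux, with V and U the eigenvoice and eigenchannel matrices, D diagonal, and y, z, x the speaker, common, and channel factors. A toy numpy sketch (all dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

CF = 12          # toy supervector dimension (C Gaussians x F features)
R_V, R_U = 3, 2  # ranks of the speaker (eigenvoice) and channel subspaces

m = rng.normal(size=CF)                       # UBM mean supervector
V = rng.normal(size=(CF, R_V))                # eigenvoice matrix
U = rng.normal(size=(CF, R_U))                # eigenchannel matrix
D = np.diag(rng.uniform(0.1, 0.5, size=CF))   # diagonal residual matrix

y = rng.normal(size=R_V)   # speaker factors
z = rng.normal(size=CF)    # common (residual) factors
x = rng.normal(size=R_U)   # channel factors

s = m + V @ y + D @ z      # speaker supervector
c = U @ x                  # channel supervector
M = s + c                  # speaker- and channel-dependent supervector, eq. (5.1)
```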
5.3 SVM-JFA
The SVM is a classifier used to find a separator between two classes. The main idea of this classifier is to project the input vectors into a high-dimensional space, called the feature space, in order to find a linear separation. This projection is carried out using a mapping function. In practice, SVMs use kernel functions to compute the scalar product in the feature space directly, without defining the mapping function.
In this section, we present several ways to carry out the combination between the SVM and JFA. The first approach is similar to the classical SVM-GMM [Campbell et al., 2006a, Campbell et al., 2006b], where the speaker GMM supervectors are used as input to the SVM. The second set of methods that we tested consists of designing new kernels using the speaker factors, or the speaker and common factors, depending on the configuration of the JFA model.
where w_i and Σ_i are the ith UBM mixture weight and diagonal covariance matrix, and s_i corresponds to the mean of Gaussian i of the speaker GMM. The derived linear kernel is defined as the corresponding inner product of the preceding distance:

K_lin(s, s') = sum_{i=1}^C ( sqrt(w_i) Σ_i^{-1/2} s_i )^t ( sqrt(w_i) Σ_i^{-1/2} s'_i )    (5.5)
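As a concrete illustration, kernel (5.5) can be evaluated on the per-Gaussian means without forming the full supervectors explicitly. The sketch below assumes diagonal UBM covariances; the function name is ours:

```python
import numpy as np

def kl_linear_kernel(s, s_prime, weights, sigma_diag):
    """Linear KL kernel of (5.5) between two GMM mean sets.

    s, s_prime : (C, F) arrays of per-Gaussian mean vectors
    weights    : (C,) UBM mixture weights w_i
    sigma_diag : (C, F) diagonals of the UBM covariances Sigma_i
    """
    # Apply sqrt(w_i) Sigma_i^{-1/2} to every mean vector, then take
    # the inner product of the two transformed supervectors.
    a = np.sqrt(weights)[:, None] * s / np.sqrt(sigma_diag)
    b = np.sqrt(weights)[:, None] * s_prime / np.sqrt(sigma_diag)
    return float(np.sum(a * b))
```

The kernel is symmetric and reduces to a weighted, variance-normalized inner product of the two mean supervectors.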
5.3.3 Speaker and Common factors space
In the case where we have both the speaker and common factors, we proposed and compared two techniques to combine these two sources of information. The first approach is to apply an SVM in each space (the speaker factor space and the common factor space) and then carry out a linear fusion of the two SVM scores. The fusion weights are obtained using logistic regression [Brümmer et al., 2007]. The second approach is to define a new kernel which is a linear combination of two kernels: the first kernel is applied in the speaker factor space, the second in the common factor space. The kernel combination weights are set to maximize the margin between target speaker and impostor utterances. This technique has already been applied in speaker verification [Dehak et al., 2008].
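The two combination strategies can be contrasted in code. The sketch below implements the score-fusion option: a logistic-regression fusion of several score streams trained by plain gradient descent. All names and settings are illustrative; in a real system the weights would be trained on a held-out development set:

```python
import numpy as np

def fuse_scores(S, labels, iters=2000, lr=0.1):
    """Linear score fusion trained by logistic regression (plain
    gradient descent on the log-loss).

    S      : (n_trials, n_systems) matrix of per-system scores
    labels : (n_trials,) 1 for target trials, 0 for impostor trials
    Returns (fused scores, fusion weights).
    """
    n, k = S.shape
    w = np.zeros(k)
    b = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(S @ w + b)))   # posterior of "target"
        w -= lr * (S.T @ (p - labels)) / n        # gradient step on weights
        b -= lr * float(np.mean(p - labels))      # gradient step on offset
    return S @ w + b, w
```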
Table 5.1: Comparison between SVM-JFA in the GMM supervector space and JFA frame-by-frame scoring. The results are given as EER on the core condition of the NIST 2006 SRE.
Table 5.2: Comparison between SVM-JFA in the speaker factor space and in the GMM supervector space. The results are given as EER on the core condition of the NIST 2006 SRE.
5.5 Results
5.5.1 SVM-JFA: GMM supervector space
We start with the results obtained by the SVM-JFA combination when the GMM supervectors are used as input to the SVM. We used GMM supervectors obtained with both JFA configurations (with and without common factors). The results are given in Table 5.1 and are compared to the frame-by-frame JFA scoring technique.
The results show that the performance of the SVM applied in the GMM supervector space is significantly worse than that obtained by conventional frame-by-frame JFA scoring. This can be explained by the fact that the linear KL kernel is not appropriate for GMM supervectors obtained with the JFA model: the assumption of independence between the GMM Gaussians, which holds for MAP adaptation, is not true for adaptation based on eigenvoices. The results also show that the addition of common factors did not improve the results of SVM-JFA compared to JFA scoring.
Table 5.3: Comparison of SVM-JFA in the speaker factor space (with and without WCCN) with two JFA scoring techniques. The results are given as EER on the core condition of the NIST 2006 SRE, English trials.
Three remarks can be made about Table 5.2. First, applying the SVM in the speaker factor space gave better results than applying it in the GMM supervector space. Second, comparing the results of the cosine and Gaussian kernels shows that the speakers are already well separated linearly. Third, t-norm did not give a large improvement for the cosine and Gaussian kernels; however, it helps in the case of the linear kernel.
We now discuss the performance achieved with and without the WCCN technique for the linear and cosine kernels. Table 5.3 compares the results obtained with and without WCCN to the results of two JFA scoring techniques. The first method consists of integrating over the channel factors, as proposed in [Kenny et al., 2008d]. The second is frame-by-frame JFA scoring.
The results in Table 5.3 show that, with WCCN, we achieved a 17% relative improvement for both kernels. The performance obtained with WCCN is very comparable to JFA scoring: we obtained better results than integration over channel factors, and came close to frame-by-frame JFA scoring. An advantage of this new SVM-JFA scoring is that it is faster than the two other techniques.
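The WCCN-normalized cosine kernel used above can be sketched as follows: the within-class covariance is estimated over training speakers with several recordings each, and its inverse Cholesky factor is applied before the cosine similarity. Function names are ours:

```python
import numpy as np

def train_wccn(factors_by_speaker):
    """Estimate the WCCN transform from speaker-factor vectors.

    factors_by_speaker : list of (n_i, d) arrays, one per training speaker
    Returns B such that the cosine kernel is evaluated on B @ y, which
    amounts to using W^{-1} (inverse within-class covariance) as metric.
    """
    d = factors_by_speaker[0].shape[1]
    W = np.zeros((d, d))
    for Y in factors_by_speaker:
        Yc = Y - Y.mean(axis=0)          # centre within each speaker
        W += Yc.T @ Yc / len(Y)
    W /= len(factors_by_speaker)
    L = np.linalg.cholesky(np.linalg.inv(W))  # W^{-1} = L L^T
    return L.T                                # so that B^T B = W^{-1}

def cosine_score(y1, y2, B):
    a, b = B @ y1, B @ y2
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```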
We now compare the results obtained with score fusion and with kernel combination applied to the speaker and common factors. In both fusion techniques, we applied a cosine kernel in the speaker and common factor spaces, and used WCCN to normalize the speaker factor cosine kernel. The results are given in Table 5.4.
Looking at these results, we can conclude that both fusion methods gave equivalent results. However, the kernel combination is more appropriate because we do not need development data to set the kernel weights. The score-fusion results reported in Table 5.4 on the NIST 2006 SRE are not realistic, because we trained and tested the score fusion weights on the same dataset.
We note also that the common factor components give information complementary to the speaker factor components, and the combination of the two improves performance. Comparing the results of the kernel combination method with the other scoring methods leads to the same conclusion as using the SVM in the speaker factor space alone (see section 5.5.2).
Table 5.4: Comparison between score fusion and kernel combination for the SVM-JFA system.
5.6 Conclusion
In this chapter we tested several combinations of a discriminative model, the Support Vector Machine, with a generative model, Joint Factor Analysis, for speaker verification. We found that linear or cosine kernels defined on the speaker and common factors, which are the components of the JFA, gave better results than a linear Kullback-Leibler kernel applied to GMM supervectors, also obtained with the JFA model. We showed that applying within-class covariance normalization in the speaker factor space to compensate for the channel effect gave the best performance.
The results obtained with SVM-JFA using the speaker factors were comparable to those obtained with classical JFA scoring. However, the low dimension of the speaker factor space (usually 300) makes SVM-JFA scoring faster than the other classical techniques.
Chapter 6
In speaker verification we encounter two types of variation: inter-speaker, and intra-speaker. The
former is desired and a good recognizer should exploit it, while the latter is a nuisance and a good
recognizer should suppress it. In this chapter, we will propose variability compensated SVM (VCSVM),
a new framework for handling both of these types of variation in the SVM speaker recognition setup.
6.1 Introduction
Speaker verification using SVMs has proven to be a powerful method, specifically using the GSV kernel [Campbell et al., 2006b] with nuisance attribute projection (NAP) [Solomonoff et al., 2005]. Also, the recent popularity and success of factor analysis [Kenny et al., 2008c] has led to Najim's promising attempts to use speaker factors directly as SVM features. Both NAP projection and speaker factors with SVMs are methods of handling variability in speaker verification: NAP removes undesirable nuisance variability, while using the speaker factors forces the discrimination to be performed based on inter-speaker variability. These successes have led us to propose VCSVM, a new method that handles both inter- and intra-speaker variation, and attempts to do so directly in the SVM optimization. This is done by adding a penalty to the minimization that biases the normal of the hyperplane to be orthogonal to the nuisance subspace, or alternatively orthogonal to the complement of the subspace containing the inter-speaker variation. This bias attempts to ensure that inter-speaker variability is used in the recognition while intra-speaker variability is ignored.
6.2 Motivation
Evidence of the importance of handling variability can be found in the discrepancy in verification
performance between one, three and eight conversation enrollment tasks for the same SVM system;
with performance improving as the number of enrollment utterances increases. One explanation for
this is that when only one target conversation is available to enroll a speaker then the orientation of
the separating hyperplane is set by the impostor utterances. As more target enrollment utterances
are provided the orientation of the separating hyperplane can change drastically, as sketched in Figure
6.1.
The additional information that the extra enrollment utterances provide is intra-speaker variability,
due to channel effects and other nuisance variables. If we could estimate the principal components of
intra-speaker variability for a given speaker then we could force the SVM to not use that variability
in choosing a separating hyperplane as is shown in Figure 6.2 where the main nuisance direction was
removed. However since it is not generally possible to estimate intra-speaker variability for a specific
speaker, we could substitute a global estimate obtained from a large number of speakers; this is exactly what is done in NAP.

Figure 6.1: Different separating hyperplanes obtained with 1, 3, and 8 conversation enrollment.
Figure 6.2: Effect of removing the nuisance direction from the SVM optimization.
where the x_i's are the utterance-specific SVM features (supervectors) and the y_i's are the corresponding labels. Note that the only difference between (6.1) and the standard SVM formulation is the addition of the ξ ||U U^T w||^2 / 2 term, where ξ is a tunable (on some held out set) parameter that regulates the amount of bias desired. If ξ = ∞ then this formulation becomes similar to NAP compensation, and if ξ = 0 then we obtain the standard SVM formulation; Figure 6.3 sketches the separating hyperplane obtained for different values of ξ.
Figure 6.3: How the separating hyperplane varies with different values of ξ.
where the final equality follows from the eigenvectors being orthonormal (U^T U = I). Note that U U^T is a positive semi-definite matrix, since for all x, x^T U U^T x = ||U U^T x||^2, which is the squared 2-norm of the projection of x onto the U subspace, and hence x^T U U^T x ≥ 0 for all x.
We now follow the recipe presented in [Ferrer et al., 2007] to convert this reformulation into a standard SVM with the bias absorbed into the kernel. We begin by rewriting J(w, ε) in (6.1) as:

J(w, ε) = w^T (I + ξ U U^T) w / 2 + C sum_{i=1}^m ε_i,    (6.3)

and since (I + ξ U U^T) is a positive definite symmetric matrix we obtain the following

J(w, ε) = w^T B^T B w / 2 + C sum_{i=1}^m ε_i,    (6.4)

where B can be chosen to be real and symmetric and is invertible. A change of variables w̃ = Bw and x̃ = B^{-T} x allows us to rewrite the optimization in (6.1) as a standard SVM formulation with the following kernel:

K(x_i, x_j) = x_i^T (I + ξ U U^T)^{-1} x_j.    (6.6)
Examining the kernel presented in (6.6), we realize that (I + ξ U U^T) can be very large, e.g. 19456 x 19456 if Gaussian supervectors are used with a 512-mixture UBM and 38-dimensional features. The size of this matrix is of concern, since the kernel requires its inverse. To circumvent this we use the Matrix Inversion Lemma [Brookes, 2006], which states, for appropriately sized matrices A, U, V, and I,

(A + U V)^{-1} = A^{-1} − A^{-1} U (I + V A^{-1} U)^{-1} V A^{-1}.    (6.7)

Applied with A = I, U → ξU and V = U^T, this gives

(I + ξ U U^T)^{-1} = I − ξ U [(1 + ξ) I]^{-1} U^T
                   = I − ξ/(1 + ξ) U U^T,    (6.8)

and substituting into (6.6) yields the kernel

K(x_i, x_j) = x_i^T (I − ξ/(1 + ξ) U U^T) x_j.    (6.9)
Examining (6.9), we notice that when ξ = 0 we recover the standard linear kernel and, more importantly, when ξ = ∞ we recover exactly the kernel suggested in [Solomonoff et al., 2005] for performing NAP channel compensation.
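The kernel in (6.9) never requires the d x d matrix to be formed; only k-dimensional projections onto the nuisance basis are needed. A sketch, assuming U has orthonormal columns as in the text:

```python
import numpy as np

def vcsvm_kernel(x_i, x_j, U, xi):
    """VCSVM kernel K = x_i^T (I - xi/(1+xi) U U^T) x_j of (6.9),
    computed without forming the d x d matrix.

    U  : (d, k) orthonormal basis of the estimated nuisance subspace
    xi : the tunable bias parameter
    """
    shrink = xi / (1.0 + xi)
    return float(x_i @ x_j - shrink * (U.T @ x_i) @ (U.T @ x_j))
```

Setting xi = 0 recovers the linear kernel, and letting xi grow recovers the NAP kernel x_i^T (I - U U^T) x_j.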
An advantage of this VCSVM formulation over NAP is that it does not make a hard decision to completely remove dimensions from the SVM features, but instead leaves that decision to the SVM optimization. This is of particular interest after Najim's experiment of using only the channel factors in an SVM and obtaining a surprisingly good EER. The method effectively allows for using a larger estimate of the nuisance subspace and then letting the SVM optimization decide how much energy to discard from that subspace. VCSVM also provides a way to perform nuisance compensation in SVM spaces of smaller dimension, e.g. those obtained when speaker factors are used as SVM features; in such situations NAP may not be favorable, as it could permanently remove important information.
UT U = Λ, (6.10)
where Λ is a diagonal matrix whose elements are the eigenvalues corresponding to the columns of U.
We can now follow a formulation similar to that of the previous section; the difference shows up in the kernel as the matrix inversion lemma (6.7) is applied:
An extreme example of this, where the whole SVM space is considered to contain nuisance information (i.e. U U^T is full rank), results in a formulation very similar to WCCN normalization [Hatch et al., 2006]. WCCN proposes using the inverse of the intra-speaker covariance matrix (i.e. full-rank U U^T) as a kernel:
However, in practice U U^T is ill-conditioned, due to the noisy estimate and to directions of very small nuisance variability; therefore smoothing is added to the intra-speaker covariance matrix to make inversion possible, and the WCCN-suggested kernel becomes:
Comparing (6.14) with (6.11), we see that they are similar. We should, however, mention that when U U^T spans the full SVM space, ξ (in our implementation) and α (in the WCCN implementation) no longer set the amount of bias desired; instead they ensure that the kernel does not over-amplify directions with small amounts of nuisance variability.
where γ is a tunable (on some held out set) parameter that enforces the amount of bias desired. If γ = ∞ then this formulation becomes similar to just using the speaker factors, and if γ = 0 then we obtain the standard SVM formulation. Note that, since I − V V^T is a projection onto the complement of V, we can replace it by Q Q^T, where Q is a matrix whose columns are the orthonormal eigenvectors that span the complement. With this substitution we obtain a formulation that is almost equivalent to that in (6.1); hence, following the recipe of the previous section, we can again push the bias into the kernel of a standard SVM formulation. The kernel in this case is:

K(x_i, x_j) = x_i^T (I − γ/(1 + γ) Q Q^T) x_j.    (6.16)
Note that we do not have to explicitly compute the orthonormal basis Q, which can be rather large. When γ = ∞, the kernel becomes an inner product between the orthogonal projections of x_i and x_j onto the inter-speaker subspace; recalling that V V^T = V V^T V V^T, we can rewrite the kernel as:

K(x_i, x_j) = (V^T x_i)^T (V^T x_j),    (6.17)

since V is an orthonormal basis. This kernel suggests that when one chooses to perform classification using only the inter-speaker subspace, the resultant kernel is just an inner product between the minivectors (speaker factors). Note however that, when estimating V, one should use UBM-normalized supervectors; similarly, x_i should be UBM-normalized, i.e. x_i = sqrt(λ) Σ^{-1/2} (m_i − m_UBM), where m_i is the supervector of relevance-MAP-adapted means and m_UBM is the supervector of UBM means.
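Kernel (6.16) can likewise be computed without ever building Q, using Q Q^T = I - V V^T. A sketch assuming V has orthonormal columns:

```python
import numpy as np

def inter_speaker_kernel(x_i, x_j, V, gamma):
    """Kernel (6.16) with Q Q^T replaced by I - V V^T, so the large
    complement basis Q is never built.

    V : (d, k) orthonormal eigenvoice basis of the inter-speaker subspace
    """
    full = x_i @ x_j
    in_V = (V.T @ x_i) @ (V.T @ x_j)        # component inside the V subspace
    return float(full - gamma / (1.0 + gamma) * (full - in_V))
```

gamma = 0 gives the plain linear kernel, while gamma -> infinity leaves only the inner product of the projections onto V, i.e. the minivector (speaker factor) kernel.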
where, when ξ = γ, we obtain the inter-speaker result of Section 6.5, and when γ = 0 we obtain the intra-speaker result of Section 6.3. Recasting it as a standard SVM formulation yields the following kernel:

K(x_i, x_j) = x_i^T (I − ξ/(1 + ξ) U U^T − γ/(1 + γ)(I − V V^T − U U^T)) x_j.    (6.21)
where h(·) is the hinge loss. In this formulation the SVM can be thought of as just computing the MAP estimate of w given the training data, where the w^T w term is essentially a Gaussian (N(0, I)) prior and the second term is the log-likelihood of the training data given w. This Gaussian prior on w in the standard SVM does not bias the angle of w in any direction, since the components of w in the prior are independent. When we introduce the bias to handle the variability, this only affects the first term in (6.22) and therefore changes the prior on w in the MAP-estimation interpretation (we will focus on nuisance variability):
maximize l(w | {x_i, y_i}) = −w^T (I + ξ U U^T) w / 2 − C sum_{i=1}^m h(y_i (w^T x_i + b)).    (6.23)
The prior on the MAP estimate of w is still a Gaussian, N(0, (I + ξ U U^T)^{-1}), but with its principal components orthogonal to the nuisance subspace and the variance along the principal components set by ξ. Hence, the prior biases w to be orthogonal to the nuisance subspace. A similar connection can be made for the full setup proposed in Section 6.6.
Figure 6.4: Results on English trials of the NIST SRE-Eval 06 core task with the speaker factor SVM system: EER vs. ξ for equal and non-equal weighting of the nuisance subspace, and various subspace sizes.

The figure shows that unequal weighting of the nuisance directions yields more favorable results than equal weighting. It also shows that VCSVM allows for nuisance compensation in such a small space, while NAP performs poorly, since it completely removes the estimated nuisance dimensions, which are a large percentage of the total dimensionality.
Based on the development results, we chose ξ = 3 and a corank of 50 to present the results on all trials of the Eval 08 core task in Figure 6.5.

Figure 6.5: Detection error plots on all trials of the NIST Eval 08 core task with the speaker factor SVM system, comparing no compensation with VCSVM (ξ = 3, non-equal weighting, corank 50).
6.10 Conclusion
This chapter presented variability compensated SVM (VCSVM), a method for handling both good and bad variability directly in the SVM optimization. This is accomplished by introducing a regularized penalty into the minimization, which biases the classifier to avoid nuisance directions and to use directions of inter-speaker variability. With regard to nuisance compensation, an advantage of our proposed method is that it does not make a hard decision on removing nuisance directions; rather, it decides according to performance on a held-out set. Another benefit is that it allows for unequal weighting of the estimated nuisance directions, e.g. according to their associated eigenvalues. This flexibility allows for improved performance over NAP, increased robustness with regard to the size of the estimated nuisance subspace, and successful nuisance compensation in small SVM spaces. Future work will focus on using this method for handling inter-speaker variability and all variability simultaneously.
Chapter 7
The aim of this chapter is to compare the different log-likelihood scoring methods that different sites used in the latest state-of-the-art Joint Factor Analysis (JFA) speaker recognition systems. The algorithms use various assumptions and have been derived from various approximations of the objective functions of JFA. We compare the techniques in terms of speed and performance, and show that approximations of the true log-likelihood ratio (LLR) may lead to significant speedups without any loss in performance.
7.1 Introduction
Joint Factor Analysis (JFA) has become the state-of-the-art technique for speaker recognition1. It was proposed to model the speaker and session variabilities in the parameter space of the Gaussian Mixture Model (GMM) [Douglas A. Reynolds, 2000]. The variabilities are determined by subspaces in the parameter space, commonly called the hyper-parameters.
Many sites used JFA in the latest NIST evaluations; however, they report their results using different scoring methods ([Kenny et al., 2007b], [Vair et al., 2007], [Brümmer et al., 2007]). The aim of this chapter is to compare these techniques in terms of speed and performance.
7.2.1 Frame by Frame
Frame-by-frame scoring is based on a full GMM log-likelihood evaluation. The log-likelihood of utterance X given model s is computed as an average frame log-likelihood2. It is practically infeasible to integrate out the channel, therefore the MAP point estimate of x is used. The formula is as follows:

log P(X | s) = sum_{t=1}^T log [ sum_{c=1}^C w_c N(o_t; μ_c, Σ_c) ],    (7.1)

where o_t is the feature vector at frame t, T is the length (in frames) of utterance X, C is the number of Gaussians in the GMM, and w_c, μ_c, and Σ_c are the c-th Gaussian weight, mean, and covariance matrix, respectively.
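Equation (7.1), normalized by T as footnote 2 describes, can be evaluated with a numerically stable log-sum-exp over the Gaussians. A sketch assuming diagonal covariances:

```python
import numpy as np

def gmm_frame_loglik(O, weights, means, var_diag):
    """Average per-frame GMM log-likelihood: eq. (7.1) divided by T.

    O        : (T, F) feature frames o_t
    weights  : (C,) mixture weights w_c
    means    : (C, F) Gaussian means mu_c
    var_diag : (C, F) diagonals of the covariances Sigma_c
    """
    diff = O[:, None, :] - means[None, :, :]                  # (T, C, F)
    log_gauss = -0.5 * (np.sum(diff ** 2 / var_diag, axis=2)
                        + np.sum(np.log(2 * np.pi * var_diag), axis=1))
    a = log_gauss + np.log(weights)                           # (T, C)
    m = a.max(axis=1, keepdims=True)                          # stable log-sum-exp
    ll = m[:, 0] + np.log(np.sum(np.exp(a - m), axis=1))
    return float(ll.mean())
```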
As was said in the previous paragraph, it would be difficult to evaluate this formula in the frame-by-frame strategy. However, (7.1) can be approximated by using a fixed alignment of frames to Gaussians, i.e., by assuming that each frame is generated by a single (best scoring) Gaussian. In this case, the likelihood can be evaluated in terms of the sufficient statistics. If the statistics are collected in the Baum-Welch way, the approximation is equal to the GMM EM auxiliary function, which is a lower bound to (7.2). The closed-form (logarithmic) solution is then given as:
log P̃(X | s) = sum_{c=1}^C N_c log [ 1 / ((2π)^{F/2} |Σ_c|^{1/2}) ]
               − (1/2) tr(Σ^{-1} S_s) − (1/2) log |L|
               + (1/2) ||L^{-1/2} U^* Σ^{-1} F_s||^2    (7.3)
where, for the first term, C is the number of Gaussians, N_c is the data count for Gaussian c, F is the feature vector size, and Σ_c is the covariance matrix of Gaussian c. These numbers are equal for the UBM and the target model, thus the whole term cancels out in the computation of the log-likelihood ratio.
For the second term of (7.3), Σ is the block-diagonal matrix of separate covariance matrices for each Gaussian, and S_s is the second order moment of X around speaker s, given as
where S is the CF × CF block-diagonal matrix whose diagonal blocks are the uncentered second order cumulants S_c. This term is independent of the speaker and thus cancels out in the LLR computation (note that this was the only place where second order statistics appeared; they are therefore not needed for scoring). F is a CF × 1 vector obtained by concatenating the first order statistics. N is a CF × CF diagonal matrix whose diagonal blocks are N_c I_F, i.e., the occupation counts for each Gaussian (I_F is the F × F identity matrix).
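The statistics N and F described above are collected per utterance from the frame posteriors. A minimal sketch (function name ours):

```python
import numpy as np

def collect_stats(O, post):
    """Zero- and first-order Baum-Welch statistics for one utterance.

    O    : (T, F) feature frames
    post : (T, C) per-frame Gaussian posteriors (responsibilities)
    Returns N, a (C,) vector of occupation counts N_c, and the first
    order statistics concatenated into a CF-dimensional supervector F.
    """
    N = post.sum(axis=0)            # occupation counts N_c
    F_mat = post.T @ O              # (C, F) per-Gaussian first-order stats
    return N, F_mat.reshape(-1)
```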
2 All scores are normalized by the frame length of the tested utterance, therefore the log-likelihood is an average.
The L in the third term of (7.3) is given as

L = I + U^* Σ^{-1} N U,    (7.5)

where I is a CF × CF identity matrix, U is the eigenchannel matrix, and the rest is as in the second term. The whole term, however, does not depend on the speaker and cancels out in the LLR computation.
In the fourth term of (7.3), let L^{1/2} be a lower triangular matrix such that

L = L^{1/2} L^{1/2*},    (7.6)

i.e., L^{-1/2} is the inverse of the Cholesky decomposition of L.
As was said, terms one and three in (7.3), and the second order statistics S in (7.4), cancel out. The formula for the score is then given as

Q_int(X | s) = tr(Σ^{-1} diag(F s^*))
             − (1/2) tr(Σ^{-1} diag(N s s^*))
             + (1/2) ||L^{-1/2} U^* Σ^{-1} F_s||^2    (7.7)
Note, that when computing the LLR, the Ux in the linear term of (7.8) will cancel out, leaving the
compensation to the quadratic term of (7.8).
m_c = m + c,    (7.12)
M̄ = M − m_c,    (7.13)
F̄ = F − N m_c.    (7.14)
When approximating (7.15) by the first order Taylor series (as a function of M̄), only the linear term
is kept, leading to
Realizing that the channel-compensated UBM is now a vector of zeros, and substituting (7.16) into (7.11), the formula for computing the LLR simplifies to
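In linear scoring, the simplified LLR reduces to a single dot product between the speaker offset and the channel-compensated first-order statistics of (7.14). A sketch assuming diagonal covariances stacked into one vector; all names are ours:

```python
import numpy as np

def linear_score(s, m, sigma_diag, N, F, U, x):
    """Linear-scoring LLR sketch: (s - m)^T Sigma^{-1} F_bar, with
    F_bar = F - N (m + U x) the channel-compensated statistics of (7.14).

    s, m       : (CF,) speaker and UBM mean supervectors
    sigma_diag : (CF,) stacked UBM covariance diagonals
    N          : (CF,) occupation counts expanded per dimension
    F, U, x    : first-order stats, eigenchannel matrix, channel factors
    """
    F_bar = F - N * (m + U @ x)     # compensate the statistics
    return float((s - m) @ (F_bar / sigma_diag))
```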
Figure 7.1: An illustration of the scoring behavior for frame-by-frame, LPT, and linear scoring.
Given the fact that the P̃-function is a lower bound approximation of the real frame-by-frame likelihood function, there are cases when the original LPT function fails. Fig. 7.1 shows that the linear function can sometimes be a better approximation of the full LLR.
7.3 Experimental setup
7.3.1 Test Set
The results of our experiments are reported on the Det1 and Det3 conditions of the NIST 2006 speaker recognition evaluation (SRE) dataset4.
The real-time factor was measured on a special test set, where 49 speakers were tested against 50 utterances. The speaker models were taken from the t-norm cohort, while the test utterances were chosen from the original z-norm cohort, each approximately 4 minutes long, giving 105 minutes in total.
7.3.4 Normalization
All scores, as presented in the previous sections, were normalized by the number of frames in the test utterance. For score normalization (zt-norm), we worked in a gender-dependent fashion. We used 220 female and 148 male speakers for t-norm, and 200 female and 159 male speakers for z-norm. These segments were a subset of the JFA training data set.
7.4 Results
Table 7.1 shows the results without any score normalization. The reason for the loss of performance in the case of LPT scoring could be a bad approximation of the likelihood function around the UBM, i.e., the inability to adapt the model to the test utterance (in the U space only). Fig. 7.1 shows this case.
Table 7.1: Comparison of different scoring techniques in terms of EER and DCF. No score normal-
ization was performed here.
                      Det1             Det3
                      EER    DCF       EER    DCF
Frame-by-Frame        4.70   2.24      3.62   1.76
Integration           5.36   2.46      4.17   1.95
Point estimate        5.25   2.46      4.17   1.96
Point estimate LPT   16.70   6.84     15.05   6.52
Linear                5.53   2.97      3.94   2.35
Table 7.2 shows the results after application of zt-norming. While the frame-by-frame scoring
outperformed all the fast scorings in the un-normalized case, normalization is essential for the other
methods.
Table 7.2: Comparison of different scoring techniques in terms of EER and DCF. zt-norm was used
as score normalization.
                      Det1             Det3
                      EER    DCF       EER    DCF
Frame-by-Frame        2.96   1.50      1.80   0.91
Integration           2.90   1.48      1.78   0.91
Point estimate        2.90   1.47      1.83   0.89
Point estimate LPT    3.98   2.01      2.70   1.36
Linear                2.99   1.48      1.73   0.95
7.4.1 Speed
The aim of this experiment was to show the approximate real-time factor of each of the systems. The time measured included reading the necessary data connected with the test utterance (features, statistics), estimating the channel shifts, and computing the likelihood ratio. Any other time, such as reading of hyper-parameters, models, etc., was not included in the result. Each measurement was repeated 5 times and averaged. Table 7.3 shows the real time of each algorithm. Surprisingly, the integration LLR is faster than the point estimate. This is due to the implementation: the channel compensation term in the integration formula is computed once per utterance, while in the point estimate case, each model needs to be compensated for each trial utterance.
7.5 Conclusions
We have shown a comparison of the different scoring techniques that different sites have recently used in their evaluations. While in most cases the performance does not change dramatically, the speed of evaluation differs substantially. The fastest method is linear scoring: it can be implemented as a simple dot product, allowing for fast scoring of huge problems (e.g., z- and t-norming).
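Because the linear score is a dot product, whole z-norm and t-norm cohorts can be scored with one matrix multiplication. A sketch (names ours):

```python
import numpy as np

def score_matrix(S_off, Fbar_over_sigma):
    """Score every model against every utterance in one matrix product.

    S_off           : (n_models, D) rows of model offsets (s - m)
    Fbar_over_sigma : (n_utts, D) rows of Sigma^{-1} F_bar per utterance
    Returns the (n_models, n_utts) score matrix, e.g. for zt-norm cohorts.
    """
    return S_off @ Fbar_over_sigma.T
```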
Chapter 8
8.1 Introduction
In this chapter we present our work exploring discriminative training techniques for a speaker recognition system, as an alternative to the generative training that constitutes the current state of the art.
For the state of the art — generatively trained joint factor analysis (JFA) — we refer to the introductory chapter 2. In this chapter, we introduce discriminative training, followed by some general discussion. Finally, we present the specific experiments that we did.
Discriminative training of speaker models, as illustrated in figure 8.1, has been around for more
than a decade, and indeed SVM speaker modeling has been a constant feature at the NIST SRE
evaluations since 2003. However, such discriminative training addresses only half of the speaker recognition problem. The full problem is both to (i) distinguish one speaker from others, and to (ii) recognize the same speaker under different circumstances. Traditional SVM speaker modeling addresses only (i), by discriminatively training every speaker model to discriminate one example of the speaker from a pool of background speakers. It does not, however, address problem (ii) in a discriminative way. Instead, state-of-the-art SVM implementations have had to resort to solutions such as nuisance attribute projection (NAP), which are external to the discriminative training; see e.g. [Brümmer et al., 2007] and the references therein.
Discriminative fusion of speaker recognition systems, using neural networks, SVMs or logistic regression [Brümmer et al., 2007], has been very successful in several of the latest NIST SRE evaluations. This technique does use large amounts of training data in a single discriminative optimization, but the number of trained parameters is very small, typically around 10 or fewer. Fusion can improve upon existing systems, but those systems already have to be good, and they have to be complementary. Fusion by itself cannot create good systems, and it cannot make different systems complementary.
The focus of this work is therefore to explore techniques which are more powerful than both of the above-mentioned methodologies. Specifically, we start with some variant of the JFA system (figure 8.2) and discriminatively optimize some of its hyperparameters, as shown in figure 8.3. Parameters that could be optimized include:
• The parameters of feature vector transformations, analogous to techniques such as HLDA and
RASTA.
[Figure: block diagram. Feature extraction of enrollment and test speech; model estimation M = m + Vy + Dz + Ux, driven by ML-optimized system hyper-parameters; match score output.]
Figure 8.2: State-of-the-art generatively trained JFA system.
[Figure: the same block diagram, but with the system hyper-parameters produced by a discriminative optimization of the whole system; model M = m + Vy + Dz + Ux; match score output.]
Figure 8.3: Proposed discriminative training paradigm for speaker recognition.
Shortcomings of the generative ML training are indicated by the ad-hoc normalizations (see step 4 in section 2.3) that have to be used to make generative JFA work, although these normalizations are not derived from the generative models. A favourable outcome of discriminative training of the JFA hyperparameters could therefore be expected to give better accuracy than the ML estimates of those hyperparameters.
Discriminative training has been shown to be able to optimize smaller, simpler, faster systems to
rival the accuracy of larger generatively trained systems. In this work, we concentrated on this aspect,
with a few encouraging results.
8.4 Solutions for discriminative training
In this section, we list possible solutions for several of the challenges mentioned above. Some solutions
are listed here just for reference; some were tried and abandoned for various reasons; and some were
pursued to the stage of obtaining experimental results.
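The objective referred to below as (8.3) is the Cllr of [Brümmer and du Preez, 2006]; assuming its standard definition, it reads

```latex
C_{llr} = \frac{1}{2\log 2}\left(\frac{1}{|T|}\sum_{t\in T}\log\bigl(1+e^{-\lambda_t}\bigr)
\;+\; \frac{1}{|N|}\sum_{t\in N}\log\bigl(1+e^{\lambda_t}\bigr)\right) \qquad (8.3)
```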
where T denotes the subset of all target trials in the training data and N the subset of all non-target
trials and where | · | denotes the number of trials in each subset. We used Cllr in all of our experiments,
to be described below in section 8.5. These experiments differed in the way that the scores λt were
computed from the input data.
Footnote 1: Of course variants exist, e.g. where there are multiple train segments. Here we concentrate on the simplest case of a single train segment.
Multi-class discriminative training
For completeness we mention a somewhat different way to approach discriminative training of a speaker
recognition system. To apply this, we would need to set up the recognizer as a multi-class speaker
identifier which outputs a likelihood/posterior for every one of a closed set of speakers, rather than
as a two-class speaker verifier/detector. However, the plan here is somewhat more complex than
traditional discriminative training of generative models:
1. Choose K speakers, such that there are two or more speech segments for each speaker.
2. Hold out one segment per speaker as the training segment. Denote the training segment for speaker k as Tk, and the set of test segments for this speaker as Sk.
3. The model for speaker k is Mk, a function of the training segment and of θ, the set of hyperparameters to be optimized, so that Mk = train(Tk, θ).
4. The log-likelihood score for speaker k, given test segment sℓ, is λkℓ = score(Mk, sℓ, θ).
5. The objective function is the empirical cross-entropy averaged over all test segments:
```latex
O = -\frac{1}{K}\sum_{k=1}^{K}\frac{1}{\|S_k\|}\sum_{\ell\in S_k}
\left[\lambda_{k\ell} - \log\sum_{j=1}^{K}\exp(\lambda_{j\ell})\right] \qquad (8.4)
```
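As a sketch (in Python rather than the MATLAB used in this work; the helper and variable names are hypothetical), the objective (8.4) can be evaluated as:

```python
import math

def multiclass_xent(scores, test_speaker):
    """Empirical cross-entropy (8.4), averaged over test segments.

    scores[j][l]: log-likelihood score of speaker model j on test segment l.
    test_speaker[l]: index of the true speaker of test segment l.
    """
    K = len(scores)
    # Group test segments by their true speaker (the sets S_k).
    per_speaker = {k: [l for l, s in enumerate(test_speaker) if s == k]
                   for k in range(K)}
    total = 0.0
    for k in range(K):
        for l in per_speaker[k]:
            # log-sum-exp over all K speaker models for this test segment
            log_sum = math.log(sum(math.exp(scores[j][l]) for j in range(K)))
            total += (scores[k][l] - log_sum) / len(per_speaker[k])
    return -total / K
```

For uninformative scores (all equal), the objective is log K, the entropy of a uniform guess over the closed speaker set.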
8.4.2 Regularization
As mentioned above, discriminative optimization of a large parameter set is vulnerable to overtraining.
Overtraining can be controlled in different ways. Examples are [MacKay, 2003, Bishop, 2007]:
Early stopping, where the iterative optimization process is stopped as soon as the performance on
a held-out evaluation database stops improving.
Regularization penalties, where the objective function is modified by the addition of a suitable
weighted penalty function. (This is standard practice with SVM training.) The weight of the
penalty has to be tuned by using a held-out evaluation set. For non-standard optimizations such
as this work, the design of the penalty function can be challenging.
Bayesian methods (see above references), where priors are assigned to the parameters under optimization. There are several challenges associated with such Bayesian methods.
8.4.3 Optimization algorithms
There are many ways to implement algorithms that numerically optimize a given objective function. We studied some of the general-purpose optimization literature, most notably the textbook by Nocedal and Wright [Nocedal and Wright, 2006], but also some other sources, which are mentioned below.
In addition, we worked with the extended Baum-Welch algorithm, which is specifically designed for
discriminative MMI training of generative models in speech recognition.
All of the software for our optimization experiments was written in MATLAB. We tried some experiments with the MATLAB Optimization Toolbox, but found problems scaling its algorithms to our problem. Instead, we hand-coded our own optimization algorithms in MATLAB.
• The most primitive methods use only objective function evaluations and do not rely on the
availability of explicitly calculated derivatives of the objective. Examples of this class are:
The Nelder-Mead simplex method. We did not further consider this method, because it
becomes intractable for large-scale optimization.
SPSA, or simultaneous perturbation stochastic approximation. In this methodology, numerical
derivative approximations are done in suitably randomized directions in search space. This
method is tractable for large optimizations, because only one derivative per iteration is
approximated. It is reported to have convergence rates similar to algorithms that do use
explicit gradient information. We did not try this method, but it may be an interesting
avenue for further investigation. See www.jhuapl.edu/SPSA for further reading.
• More sophisticated methods use explicit first-order derivatives in the form of the gradient of
the objective function. For a large number of variables, the gradient cannot be numerically
approximated with finite difference methods, because this would involve prohibitive numbers
of objective function evaluations, namely one or two evaluations per variable. To use these
methods, one therefore needs explicit (and efficient) analytical implementations of the gradient.
The subject of how to implement gradients will be discussed below in section 8.4.4. In this class
of first-order methods, we investigated the following:
Non-linear conjugate gradient methods form the canonical class of general purpose un-
constrained first-order optimization algorithms.
RPROP uses only the sign of the gradient, and a set of heuristics to adaptively adjust the search step size for every variable. For some objective functions, RPROP has comparable (or superior) performance to conjugate gradient, but it is considerably easier to implement, because it does not use sophisticated line-search techniques. See http://en.wikipedia.org/wiki/Rprop. We used RPROP as our optimization algorithm in the experiment reported in section 8.5.1 below.
Stochastic gradient descent optimization algorithms can be applied to supervised machine
learning objective functions which are sums over very many terms, where each term repre-
sents one training example. This is indeed the case here. The difference between stochastic
gradient descent and more traditional methods is that stochastic gradient adjusts the pa-
rameters being optimized after every example, rather than once after all examples have been
processed. If the examples are repetitive this can give substantial savings in computation.
We tried some preliminary experiments with this methodology, but found no advantage for our problem. One issue that we could not resolve was that per-example updates break the vector pipelining that allows MATLAB code to run efficiently. For further reading see http://leon.bottou.org/research/stochastic.
• The most sophisticated methods use not only gradients, but also some second-order derivative
information, in the form of Hessian-vector products. The second-order information can lead to
fast convergence in convex regions of the objective function, which are typically found close to
minima. We mentioned above that gradients cannot be approximated numerically for large-scale
optimization. However, given analytical solutions for the gradient, the Hessian-vector products
can be accurately and efficiently approximated. A promising method in this class, which we
believe merits further investigation is the Trust-Region-Newton-Conjugate-Gradient method.
See [Nocedal and Wright, 2006] and [Lin et al., 2008].
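Of the methods above, RPROP is the one we actually used; its sign-based step-size heuristics can be sketched as follows (a simplified "RPROP minus" variant in Python; the constants and names are illustrative, not our MATLAB implementation):

```python
def rprop(grad, x, steps=100, d0=0.1, dmin=1e-6, dmax=1.0, up=1.2, down=0.5):
    """Minimize a function given only its gradient, using RPROP heuristics.

    Per-parameter step sizes grow when the gradient keeps its sign and
    shrink when it flips; only the SIGN of the gradient drives the update.
    """
    n = len(x)
    delta = [d0] * n     # per-parameter step sizes
    prev = [0.0] * n     # previous gradient, to detect sign changes
    for _ in range(steps):
        g = grad(x)
        for i in range(n):
            if prev[i] * g[i] > 0:            # same sign: accelerate
                delta[i] = min(delta[i] * up, dmax)
            elif prev[i] * g[i] < 0:          # sign flip: overshoot, back off
                delta[i] = max(delta[i] * down, dmin)
            if g[i] > 0:                      # step against the gradient sign
                x[i] -= delta[i]
            elif g[i] < 0:
                x[i] += delta[i]
            prev[i] = g[i]
    return x
```

The appeal for our problem is exactly what the text notes: no line search, and the magnitude of the (possibly badly scaled) gradient is irrelevant.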
Finite difference methods, as mentioned above, are not tractable for large-scale gradient compu-
tation. However, they do form a very valuable tool for verifying the correctness of gradients
computed by other means and we used these methods for this purpose.
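The finite-difference check described above can be as simple as the following Python sketch (hypothetical names; central differences with a loose tolerance):

```python
def check_gradient(f, grad, x, eps=1e-6, tol=1e-4):
    """Compare an analytic gradient against central finite differences.

    f: objective function of a parameter list; grad: claimed gradient.
    Returns True if every component of grad(x) matches the numerical
    derivative to within tol.
    """
    for i in range(len(x)):
        xp = list(x); xp[i] += eps
        xm = list(x); xm[i] -= eps
        numeric = (f(xp) - f(xm)) / (2 * eps)
        if abs(numeric - grad(x)[i]) > tol:
            return False
    return True
```

This is only tractable for spot-checks at a few points (one or two function evaluations per variable), which is precisely why it is a verification tool rather than a way to compute gradients during optimization.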
Semi-automatic: We did some experiments with semi-automatic ways to achieve more or less the
same effect as reverse-mode automatic differentiation (also similar to back-propagation in neural-
network training), by using differentiation chain-rules for compositions of functions. With this
methodology, the programmer hand-codes the derivatives for simple building blocks and connects
the blocks such that the chain rules are implemented. This is a promising methodology, but we
had problems with the scaling of memory and CPU resources. Computing gradients of nested
multivariate functions involves large matrix multiplications.
Hand-coded derivatives: In the end, we found it best to hand-optimize these matrix computations globally to obtain tractable solutions for computing the required gradients. In doing this we found two tools invaluable. One, as mentioned above, was testing via finite differencing; the other was matrix calculus for finding derivatives of expressions involving vectors and matrices, see http://research.microsoft.com/en-us/um/people/minka/papers/matrix.
8.5 Experiments
In this section, we present details and results for the two sets of experiments that were pursued to a successful conclusion.
8.5.1 Small scale experiment
In this experiment, we drastically reduced the scale of the problem by simplifying the JFA system prior to discriminative training. This simplification was found by other members of the team in their quest to find JFA-based transformations to a feature space amenable to SVM training (see chapter 5). The transform works as follows:
Transform
The transform is a mapping of every input segment to a low-dimensional vector. It is effected by performing model training, as explained in step 1 of section 2.3, on every input segment. This differs from the normal JFA recipe, where only the train segments are treated this way. Then, only the 300-dimensional y-vector is retained.
In the work reported here, this transform was done by the female variant of the CRIM JFA system [Kenny et al., 2008b].
Raw Score
Let y1 represent the train segment and y2 the test segment. Let W be a 300-by-300 positive semi-definite matrix, so that y1′Wy2 is an inner product. The score for the trial is then the normalized inner product:

```latex
s(\mathbf{y}_1, \mathbf{y}_2) =
\frac{\mathbf{y}_1' \mathbf{W}\, \mathbf{y}_2}
     {\sqrt{(\mathbf{y}_1' \mathbf{W}\, \mathbf{y}_1)\,(\mathbf{y}_2' \mathbf{W}\, \mathbf{y}_2)}}
\qquad (8.5)
```
We did not use ZT-norming, because it showed no improvement over the raw score.
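A minimal Python sketch of the score (8.5) (illustrative only; a real implementation would use vectorized linear algebra on the 300-dimensional y-vectors):

```python
import math

def normalized_score(y1, y2, w):
    """Normalized inner-product score: s = y1'Wy2 / sqrt((y1'Wy1)(y2'Wy2)).

    y1, y2: speaker-factor vectors for the train and test segments.
    w: positive semi-definite matrix, given as a list of rows.
    """
    def quad(a, b):
        # bilinear form a' W b
        return sum(a[i] * w[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    return quad(y1, y2) / math.sqrt(quad(y1, y1) * quad(y2, y2))
```

By construction the score lies in [-1, 1]: it is a cosine similarity in the metric induced by W.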
Generative baseline
The parameter W of this score can be informally 'trained' in a generative way (see footnote 2) by estimating the within-class covariance matrix of the y-vectors obtained from a development database with multiple sessions (segments) per speaker. We use the W obtained in this way as our baseline for comparison, and also as the initialization for the discriminative optimization.
Calibration
As mentioned above, to evaluate the discriminative objective function (8.3), we need the score to
act as log-likelihood-ratio. We force the score to behave thus with an affine calibration transform,
the parameters of which are trained discriminatively along with the other parameters. Denoting the
log-likelihood-ratio of a trial t by λt , the affine transform of a raw score st is:
λt = αst + β (8.6)
where the scale parameter α and the offset parameter β are real scalars. The whole discriminative
training problem is therefore to jointly optimize (α, β, W).
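A small Python sketch of evaluating this joint objective: apply the affine calibration (8.6) to the raw scores and compute Cllr (we assume the standard Cllr definition here; names are illustrative):

```python
import math

def calibrated_cllr(raw_scores, is_target, alpha, beta):
    """Apply lambda = alpha*s + beta (8.6), then evaluate Cllr.

    raw_scores: list of raw trial scores s_t.
    is_target:  parallel list of booleans (target vs. non-target trial).
    Uses the standard Cllr definition (an assumption), rescaled to bits.
    """
    tar = [alpha * s + beta for s, t in zip(raw_scores, is_target) if t]
    non = [alpha * s + beta for s, t in zip(raw_scores, is_target) if not t]
    c_tar = sum(math.log1p(math.exp(-lam)) for lam in tar) / len(tar)
    c_non = sum(math.log1p(math.exp(lam)) for lam in non) / len(non)
    return (c_tar + c_non) / (2 * math.log(2))
```

An uninformative system (λ = 0 on every trial) gives Cllr = 1 bit; a well-separated, well-calibrated system approaches 0.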
Parametrization
The 300-by-300 parameter matrix W is constrained to be positive semi-definite. In order to avoid the extra complications of constrained optimization, we reparametrize in terms of an unconstrained 300-by-300 matrix X, so that W = X′X.
The full parameter set for optimization is now (α, β, X). It should be noted that although this parametrization avoids constraints, it forms a non-convex optimization problem with multiple equivalent optima, because for a given W there are many solutions X to W = X′X.
Footnote 2: A formal generative model, with multivariate Gaussians for between- and within-class variation, yields a somewhat more complex score function. We tried using such score functions, but the simple normalized dot-product score performed best.
Optimization
As mentioned above, optimization involves two interrelated challenges: (i) choosing a numerical optimization algorithm and (ii) computing the derivatives required by that algorithm, all subject to limited time, CPU and memory resources. Our first successful solution used the above-mentioned first-order RPROP method, with hand-optimized gradient calculation. Undoubtedly there are better solutions, some of which we will investigate in the future.
Results
We performed our experiments by discriminative training on all of the female JFA development data, including Switchboard, SRE'04 and SRE'05. Using all pairs of speech segments, this gave a training database of almost 200 million detection trials, involving more than 1000 speakers. We tested on the English female subset of the 1conv4w-1conv4w task of NIST SRE'06. The above-mentioned generative baseline gave EER = 2.61%. Discriminative retraining, with early stopping after a few iterations, gave an 11% relative improvement, to EER = 2.33%.
Comment
Although we obtained an improvement, it is clear that this method is very vulnerable to overtraining of its 90000 parameters. If the optimization was iterated further, the training EER dropped to a fraction of a percent, but this did not translate to good performance on independent test data. In the future we would like to use more sophisticated regularization, coupled with more sophisticated optimization algorithms, in order to find well-defined optima of the objective function rather than relying on somewhat ill-defined early stopping.
```latex
\frac{\partial C_{llr}}{\partial\theta} = \frac{1}{2\log 2}
\left(\frac{1}{|T|}\sum_{t\in T}\bigl(1 - P(\text{target}|\lambda_t, 0.5)\bigr)
\frac{\partial\lambda_t}{\partial\theta}
\;-\; \frac{1}{|N|}\sum_{t\in N}P(\text{target}|\lambda_t, 0.5)\,
\frac{\partial\lambda_t}{\partial\theta}\right), \qquad (8.7)
```
where P(target|λt, 0.5) is given by (8.2). By combining (8.6) and (7.17) and differentiating with respect to V, we obtain

```latex
\frac{\partial\lambda_t}{\partial \mathbf{V}} =
\alpha\,\Sigma^{-1}\bar{\mathbf{F}}\,\mathbf{y}^{*} \qquad (8.8)
```
To optimize our objective function, we need to define a set of training trials. In these experiments, each possible pair of two segments from our training set formed a valid trial, where one segment is considered to be the enrollment segment and the other the test segment. This allows us to define a J × J matrix P, where J is the number of segments in the training set and each element of the matrix corresponds to one trial; the row index identifies the test segment and the column index the enrollment segment. Let the element of P corresponding to trial t be (1 − P(target|λt, 0.5))/|T| if the trial is a target trial, and −P(target|λt, 0.5)/|N| if it is a non-target trial. Combining (8.7) and (8.8), making use of the matrix P and taking the gradient of the objective function with respect to V, we obtain
```latex
\nabla C_{llr}(\mathbf{V}) = \frac{1}{2\log 2}\,\alpha\,
\Sigma^{-1}\tilde{\mathbf{F}}\,\mathbf{P}\,\mathbf{Y}^{*} \qquad (8.9)
```
where the columns of matrix F̃ are the vectors of first-order sufficient statistics, F̄, extracted from all segments in the training set (representing the test segments), and the columns of matrix Y are the vectors of speaker factors extracted from all segments in the training set (representing the enrollment segments).
The gradient could now be used to optimize the objective function with standard gradient descent. However, the widely adopted technique for MMI training of the GMMs in HMM systems is Extended Baum-Welch re-estimation [Schluter et al., 2001], which has been shown to converge much faster than gradient descent. In our case, Extended Baum-Welch cannot be adopted in a straightforward way, because of our more complicated model and the simplified linear scoring. Section 2.2.2 of [Schluter et al., 2001] points out a relation between Extended Baum-Welch and gradient descent: the Extended Baum-Welch update of GMM mean vectors can be seen as a gradient descent update with a specific learning rate for each parameter. Inspired by this relation, we propose a similar learning rate specific to each row of the matrix V. Specifically, we multiply the gradient ∇Cllr(V) by the diagonal matrix
L = η diag(ÑP1)Σ, (8.10)
where the columns of matrix Ñ are the vectors of zero-order sufficient statistics, N, extracted from all segments in the training set, η is a parameter-independent learning rate, 1 is a column vector of ones, and diag(·) is an operator that converts a vector into a diagonal matrix. Finally, the matrix V is iteratively updated using the following formula:
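Assuming a plain descent step scaled by L (both the form and the sign convention are assumptions here), the update would read

```latex
\mathbf{V}^{(i+1)} = \mathbf{V}^{(i)} - \mathbf{L}\,\nabla C_{llr}\bigl(\mathbf{V}^{(i)}\bigr)
```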
Table 8.1: Results of the 1st large scale experiment, on SRE 2006 all trials (det1).

EER [%]                                        No norm   ZT-norm
Generative V                                    15.44     11.42
Generative V and U                               6.99      4.07
Discriminative V                                 7.19      5.06
Discriminative V with channel-compensated y      6.80      4.81
Table 8.2: Results of the 2nd large scale experiment, on SRE 2006 all trials (det1).
Table 8.1. Comparing the first and the third lines in the table, we can see that discriminative training provides a substantial improvement in performance. The improvement also holds when zt-normalization is applied, even though the normalization was not considered during the discriminative training.
As mentioned in the previous section, the speaker factors y are always computed using the original ML-trained model. In this case, it is the pure eigenvoice system, with speaker factors y estimated without considering channel variability. The last line of the table shows results obtained with the same discriminatively trained pure eigenvoice system used for testing, where, however, the factors y were obtained from an ML-trained system that models the channel variability. The improved result suggests that a good estimate of y, unaffected by channel variability, may be important. Possibly, in the future, this could also be achieved by means of discriminative training, without explicitly modeling the channel variability.
The second line of the table shows the performance of the ML-trained system making use of eigenchannels both for estimating the speaker factors y and for testing. We can see that discriminative training provides improvements comparable to intersession variability modeling. However, an improvement over the ML-trained system is observed only in the first column of the third row, which corresponds to the result without zt-normalization. When zt-norm is used, the performance of the generative system is superior to that of the discriminatively trained one. Note that zt-norm was not considered during discriminative training. Not having zt-norm incorporated in the discriminative training may force the training to concentrate on problems that can easily be solved by the normalization, which can lead to a suboptimal result.
8.5.4 Conclusion
Discriminative training for speaker identification is a large and difficult problem, but it has the potential for worthwhile gains, with the possibility of more accurate, yet faster and smaller, systems. We have managed to show some proof of concept, but so far without significantly improving on the state-of-the-art. The remaining problems are both practical and theoretical, including the complexity of optimization and principled methods for combating over-training.
Many extensions of our large scale experiments are possible. Besides training the eigenvoices V, the hyperparameters U and D could also be trained discriminatively. In all of our current experiments, we worked with sufficient statistics collected using the UBM. This means that the assignment of frames to Gaussians is fixed, given by the UBM, which was, however, trained using the maximum likelihood criterion. It is quite possible that such an allocation of Gaussians is suboptimal for the task of discriminating between speakers. It would be worthwhile to experiment with discriminative training that has the freedom to change this frame assignment. We have also pointed out the problem of zt-norm not being incorporated in the discriminative training. This could be addressed by making λt in (8.3) the zt-normalized score. However, this makes the computation of our objective function much more complicated.
Chapter 9
In this workshop, several approaches to robust speaker recognition, sharing the same theoretical
background — Joint Factor Analysis (JFA) — were investigated.
In diarization (Chapter 3), we examined the application of JFA and Bayesian methods to diarization. Our approach produced a 3-4% improvement on challenging interview speech.
In Factor Analysis Conditioning (Chapter 4), we explored ways to use JFA to account for non-session variability (phone) and showed robustness using within-session, stacking and hierarchical modeling.
We have also advanced SVM-JFA approaches by developing techniques to use JFA elements in SVM classifiers (Chapters 5 and 6). The results are comparable to the full JFA system, but with fast scoring (Chapter 7) and no score normalization. We concluded that SVM approaches provide better performance when using all JFA factors.
Finally, discriminative system optimization was investigated (Chapter 8). This work focused on means to discriminatively optimize the whole speaker recognition system and successfully demonstrated proof-of-concept experiments.
To conclude, we found JHU 2008 an extremely productive and enjoyable workshop, and we aim to continue collaborating in these problem areas. Cross-site joint efforts will certainly provide big gains in future speaker recognition evaluations and experiments.
Bibliography
[Auckenthaler et al., 2000] Auckenthaler, R., Carey, M., and Lloyd-Thomas, H. (2000). Score normal-
ization for text-independent speaker verification systems. Digital Signal Processing, 10(1/2/3):42–
54.
[Bishop, 2007] Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer.
[Brümmer, 2008] Brümmer, N. (2008). SUN SDV system description for the NIST SRE 2008 evalua-
tion.
[Brümmer et al., 2007] Brümmer, N., Burget, L., Černocký, J., Glembek, O., Grézl, F., Karafiát, M., van Leeuwen, D., Matějka, P., Schwarz, P., and Strasheim, A. (2007). Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2072–2084.
[Brümmer and du Preez, 2006] Brümmer, N. and du Preez, J. (2006). Application-independent eval-
uation of speaker detection. Computer Speech & Language, 20(2-3):230–275.
[Burget et al., 2007] Burget, L., Matejka, P., Glembek, O., Schwarz, P., and Cernocky, J. (2007).
Analysis of feature extraction and channel compensation in GMM speaker recognition system.
IEEE Transactions on Audio, Speech, and Language Processing, 15(7):1979–1986.
[Campbell et al., 2006a] Campbell, W., Sturim, D., Reynolds, D., and Solomonoff, A. (2006a). Svm
Based Speaker Verification using a GMM SuperVector Kernel and NAP Variability Compensation.
In IEEE-ICASSP, Toulouse.
[Campbell et al., 2006b] Campbell, W. M., Sturim, D. E., and Reynolds, D. (2006b). Support Vec-
tor Machines Using GMM Supervectors for Speaker Verification. IEEE Signal Processing Letters,
13(5):308–311.
[Castaldo et al., 2007] Castaldo, F., Colibro, D., Dalmasso, E., Laface, P., and Vair, C. (2007). Com-
pensation of nuisance factors for speaker and language recognition. IEEE Transactions on Audio,
Speech and Language Processing, 15(7):1969–1978.
[Castaldo et al., 2008] Castaldo, F., Colibro, D., Dalmasso, E., Laface, P., and Vair, C. (2008).
Stream-based speaker segmentation using speaker factors and eigenvoices. In Proc. ICASSP, Las
Vegas, Nevada.
[Chaudhari et al., 2000] Chaudhari, U., Navratil, J., and Maes, S. (2000). Transformation enhanced
multi-grained modeling for text independent speaker recognition. ICSLP, 2:298–301.
[Dehak et al., 2009] Dehak, N., Kenny, P., Dehak, R., Glembek, O., Dumouchel, P., Burget, L.,
Hubeika, V., and Castaldo, F. (2009). Support vector machines and joint factor analysis for speaker
verification. In Proc. ICASSP, Taipei, Taiwan.
[Dehak et al., 2007] Dehak, N., Kenny, P., and Dumouchel, P. (2007). Modeling prosodic features with
joint factor analysis for speaker verification. IEEE Transactions on Audio, Speech and Language
Processing, 15:2095–2103.
[Dehak et al., 2008] Dehak, R., Dehak, N., Kenny, P., and Dumouchel, P. (2008). Kernel Combination
for SVM Speaker Verification. In Odyssey Speaker and Language Recognition Workshop 2008,
Stellenbosch, South Africa.
[Ferrer et al., 2007] Ferrer, L., Sonmez, K., and Shriberg, E. (2007). A smoothing kernel for spatially
related features and its application to speaker verification. In Proceedings of Interspeech.
[Glembek et al., 2009] Glembek, O., Burget, L., Dehak, N., Brümmer, N., and Kenny, P. (2009). Comparison of scoring methods used in speaker recognition with joint factor analysis. In Proc. ICASSP, Taipei.
[Hatch et al., 2006] Hatch, A. O., Kajarekar, S., and Stolcke, A. (2006). Within-class covariance
normalization for svm-based speaker recognition. In Proceedings of Interspeech.
[J. Pelecanos, 2006] Pelecanos, J. and Sridharan, S. (2006). Feature warping for robust speaker verification. In Proceedings of Odyssey 2006: The Speaker and Language Recognition Workshop, pages 213–218.
[Kajarekar, 2008] Kajarekar, S. (2008). Phone-based cepstral polynomial SVM system for speaker
recognition. Proceedings of Interspeech 2008.
[Kenny, 2005] Kenny, P. (2005). Joint factor analysis of speaker and session variability: Theory and algorithms. Technical report CRIM-06/08-13, CRIM, Montreal.
[Kenny, 2006] Kenny, P. (2006). Joint factor analysis of speaker and session variability: Theory and
algorithms (draft version). IEEE Speech, Acoustics and Language Processing.
[Kenny, 2008] Kenny, P. (2008). Bayesian analysis of speaker diarization with eigenvoice priors.
[Kenny et al., 2005a] Kenny, P., Boulianne, G., and Dumouchel, P. (2005a). Eigenvoice modeling with
sparse training data. Speech and Audio Processing, IEEE Transactions on, 13(3):345–354.
[Kenny et al., 2005b] Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2005b). Factor
analysis simplified. In Proc. of the International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), pages 637– 640, Toulouse, France.
[Kenny et al., 2007a] Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2007a). Speaker and
Session Variability in GMM-Based Speaker Verification. IEEE Trans. Audio Speech and Language
Processing.
[Kenny et al., 2007b] Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2007b). Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2072–2084.
[Kenny et al., 2008a] Kenny, P., Dehak, N., Dehak, R., Gupta, V., and Dumouchel, P. (2008a). The role of speaker factors in the NIST extended data task. In Odyssey: The Speaker and Language Recognition Workshop.
[Kenny et al., 2008b] Kenny, P., Dehak, N., Ouellet, P., Gupta, V., and Dumouchel, P. (2008b). Development of the primary CRIM system for the NIST 2008 speaker recognition evaluation. In Proc. Interspeech, Brisbane.
[Kenny and Dumouchel, 2004] Kenny, P. and Dumouchel, P. (2004). Experiments in speaker verifi-
cation using factor analysis likelihood ratios. In Odyssey: The Speaker and Language Recognition
Workshop, pages 219–226.
[Kenny et al., 2008c] Kenny, P., Ouellet, P., Dehak, N., Gupta, V., and Dumouchel, P. (2008c). A
Study of Inter-Speaker Variability in Speaker Verification. IEEE Trans. Audio, Speech and Language
Processing, 16(5):980–988.
[Kenny et al., 2008d] Kenny, P., Ouellet, P., Dehak, N., Gupta, V., and Dumouchel, P. (2008d). A
Study of Inter-Speaker Variability in Speaker Verification. IEEE Transactions on Audio, Speech
and Language Processing.
[Lin et al., 2008] Lin, C.-J., Weng, R. C., and Keerthi, S. S. (2008). Trust region newton method for
logistic regression. J. Mach. Learn. Res., 9:627–650.
[MacKay, 2003] MacKay, D. (2003). Information theory, inference and learning algorithms. Cam-
bridge University Press, New York, NY.
[Matejka et al., 2008] Matejka, P., Burget, L., Glembek, O., Schwarz, P., Hubeika, V., Fapso, M., Mikolov, T., Plchot, O., and Cernocky, J. (2008). BUT language recognition system for NIST 2007 evaluations. In Proc. Interspeech.
[Matejka et al., 2006] Matejka, P., Burget, L., Schwarz, P., and Cernocky, J. (2006). Brno University
of Technology System for NIST 2005 Language Recognition Evaluation. Speaker and Language
Recognition Workshop, 2006. IEEE Odyssey 2006, pages 1–7.
[Minka, 1998] Minka, T. (1998). Expectation-maximization as lower bound maximization. Technical
report, Microsoft.
[National Institute of Standards and Technology, 2008] National Institute of Standards and Technol-
ogy (2008). NIST speech group website. http://www.nist.gov/speech.
[Nocedal and Wright, 2006] Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer.
[Pelecanos and Sridharan, 2001] Pelecanos, J. and Sridharan, S. (2001). Feature Warping for Robust
Speaker Verification. In Speaker Odyssey, pages 213–218, Crete, Greece.
[Prince and Elder, 2006] Prince, S. and Elder, J. (2006). Tied factor analysis for face recognition
across large pose changes. In Proceedings of the British Machine Vision Conference, volume 3,
pages 889–898.
[Reynolds et al., 2000] Reynolds, D., Quatieri, T., and Dunn, R. (2000). Speaker verification using
adapted Gaussian mixture models. Digital Signal Processing, 10(1/2/3):19–41.
[Schluter et al., 2001] Schluter, R., Macherey, W., Muller, B., and Ney, H. (2001). Comparison of
discriminative training criteria and optimization methods for speech recognition. Speech Commu-
nication, 34:287–310.
[Schwarz et al., 2004] Schwarz, P., Matějka, P., and Černocký, J. (2004). Towards lower error rates
in phoneme recognition. In International Conference on Text, Speech and Dialogue, pages 465–472.
[Schwarz et al., 2006] Schwarz, P., Matějka, P., and Černocký, J. (2006). Hierarchical structures of
neural networks for phoneme recognition. In Proc. of the International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), pages 325–328, Toulouse, France.
[Sollich, 1999] Sollich, P. (1999). Probabilistic interpretation and Bayesian methods for support
vector machines. In Proceedings of ICANN.
[Solomonoff et al., 2005] Solomonoff, A., Campbell, W. M., and Boardman, I. (2005). Advances in
channel compensation for SVM speaker recognition. In Proceedings of ICASSP.
[Strasheim and Brümmer, 2008] Strasheim, A. and Brümmer, N. (2008). SUNSDV system descrip-
tion: NIST SRE 2008. In NIST Speaker Recognition Evaluation Workshop Booklet.
[Tranter and Reynolds, 2006] Tranter, S. and Reynolds, D. (2006). An overview of automatic speaker
diarization systems. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1557–
1565.
[Vair et al., 2006] Vair, C., Colibro, D., Castaldo, F., Dalmasso, E., and Laface, P. (2006). Channel
factors compensation in model and feature domain for speaker recognition. In IEEE Odyssey 2006:
The Speaker and Language Recognition Workshop.
[Vair et al., 2007] Vair, C., Colibro, D., Castaldo, F., Dalmasso, E., and Laface, P. (2007). Loquendo
- Politecnico di Torino’s 2006 NIST speaker recognition evaluation system. In Proceedings of Inter-
speech 2007, pages 1238–1241.
[Valente, 2005] Valente, F. (2005). Variational Bayesian methods for audio indexing. PhD thesis,
Eurecom.
[Vogt et al., 2005] Vogt, R., Baker, B., and Sridharan, S. (2005). Modelling session variability in
text-independent speaker verification. In Interspeech, pages 3117–3120.
[Vogt et al., 2008a] Vogt, R., Baker, B., and Sridharan, S. (2008a). Factor analysis subspace estima-
tion for speaker verification with short utterances. In Interspeech, pages 853–856.
[Vogt et al., 2008b] Vogt, R., Lustri, C., and Sridharan, S. (2008b). Factor analysis modelling for
speaker verification with short utterances. In Odyssey: The Speaker and Language Recognition
Workshop.