A THESIS
submitted by
S. P. KISHORE
(by Research)
This is to certify that the thesis entitled Speaker Verification Using Autoassociative Neural Network Models submitted by S. P. Kishore to the Indian Institute of Technology, Madras for the award of the degree of Master of Science (by Research) is a bonafide record of research work carried out by him under my supervision. The contents of this thesis, in full or in parts, have not been submitted to any other Institute or University for the award of any degree or diploma.
I would like to thank Jyotsna, Anjani, Kamakshi Prasad, Prasad Reddy, K. Kiran, Nayeem, Nagarajan, Devaraj, Gupta, Venkatesh, Vinod and Anil for their extended cooperation during my stay in the speech lab. I am also grateful to my hostel mates P. Sampath, B. Srikanth and P. Kiran, who went out of their way to help me in the hour of need.

Words alone cannot describe the encouragement, affection and love of my appa Narasimha Murthy, my amma Sumathi Murthy, my brother Prasanth, my sister Praveena and my uncles Archak Vyasraj Achar & Bro., whose benedictions have been with me in completing this course and throughout.
Prahallad Kishore
ABSTRACT
Keywords: speaker verification; autoassociative neural networks; background model; channel and handset effects; equal error rate.
The objective of a speaker verification system is to confirm the identity of a person from his/her voice. Speaker verification can be performed in a text-dependent or text-independent mode. For the text-dependent case, the feature vectors are extracted from the speech signal and are stored as reference templates. For the text-independent case, a model is used to capture the distribution of the feature vectors of a speaker. This thesis presents an approach based on Autoassociative Neural Network (AANN) models for text-independent speaker verification. This approach can be viewed as an alternative to the current approaches based on Gaussian Mixture Models (GMM).

Autoassociative neural network models are feedforward neural networks performing an identity mapping of the input space. In this thesis, the characteristics of AANN models are explained in the perspective of capturing the distribution of feature vectors. The distribution capturing ability of the AANN models is studied using a probability surface derived from the training error surface captured by the network in the input feature space. We illustrate that a three layer AANN model with nonlinear hidden units clusters the input data in a linear subspace, whereas a five layer AANN model clusters the input data in a nonlinear subspace. The probability surface captured by a five layer AANN model is viewed as nonparametric modeling of the input data distribution.

The property of a five layer AANN model to capture the distribution of the given data is exploited to develop a speaker verification system. The proposed system is evaluated on conversational telephone speech for 230 speakers. Performance of the AANN-based speaker verification system is improved by addressing three issues: (1) normalization procedure using a background model, (2) structure of the AANN model, and (3) mismatch between training and testing data. Finally, performance of our AANN-based speaker verification system is compared with that of a GMM-based speaker verification system.
TABLE OF CONTENTS
LIST OF TABLES

4.1 Performance of speaker verification system for 230 speakers. Here BG denotes background model. ... 38
4.2 Performance comparison of one of the best GMM-based speaker verification systems with the AANN-based speaker verification system. ... 40
5.1 A set of 11 different claimant scores obtained for different test utterances. Scores of the genuine claimant models are underlined. Rows 1, 2 and 3 are for the matched conditions of the genuine claimant. Row 4 is for the mismatched condition of the genuine claimant. ... 43
5.2 Performance comparison of speaker verification system using IBM and UBM ... 45
5.3 Performance of speaker verification system measured in EER for different values of K (number of units in the dimension compression hidden layer) ... 48
5.4 Performance of speaker verification system with AANN models of structure 19L 38N 4N 38N 19L using S and N ... 50
5.5 Performance of speaker verification system before and after filtering the cepstral trajectories ... 51
5.6 Performance comparison of speaker verification system using 38-D feature vectors (cepstral and delta cepstral coefficients) and 19-D feature vectors (static cepstral coefficients). ... 52
5.7 Performance comparison of AANN-based speaker verification system with GMM-based speaker verification system on the database of 1000 speakers. These results are taken from [1]. ... 52
5.8 Performance of the online speaker verification system ... 53
1.1 INTRODUCTION
Speech communication is a natural phenomenon among human beings. The intended message is transferred from one person to another through the complex mechanisms of speech production and speech perception. Speech production begins when the intended message, represented in some abstract form in the mind of the speaker, is converted into neural signals. These neural signals control the human vocal system to produce an acoustic wave. This acoustic wave is successfully decoded by the speech perception mechanism of the listener to realize the intended message.
The speech production mechanism is understood better by studying the anatomical structure of the human vocal system shown in Fig.1.1. The human vocal system primarily consists of the vocal tract, nasal cavity and vocal folds. The vocal tract begins at the vocal folds or glottis and ends at the lips. The nasal cavity begins at the velum (soft palate) and ends at the nostrils. It is acoustically coupled to the vocal tract when the velum is lowered. During speech production the vocal folds may be either in a tensed state or in a relaxed state. As air is expelled from the lungs through the trachea, the tensed vocal folds are caused to vibrate. The airflow is chopped into quasi-periodic pulses to excite the vocal tract system. The periodic excitation of the vocal tract system produces voiced sounds. When the vocal folds are relaxed, the airflow must pass through a constriction somewhere along the length of the vocal tract to produce unvoiced sounds. With the time varying excitation, the shape of the vocal tract also varies to produce different speech sounds. Thus, a speech signal consisting of a sequence of sounds can be considered as the result of time varying excitation of a time varying system (the vocal tract).
Along with the intended message, the speech signal also carries information about the speaker. The speech perception mechanism decodes the message, as well as recognizes people based upon their voice characteristics present in the speech signal. An imitation of the latter function by a machine is known as automatic speaker recognition or simply speaker recognition. Speaker recognition can be classified into speaker identification and speaker verification. Given a speech signal, the task of speaker identification is to determine the identity of the speaker, whereas the task of speaker verification is to confirm the identity of the speaker. Speaker recognition can be performed in a text-dependent or text-independent mode. In a text-dependent speaker recognition system, a restriction is imposed on the speaker to utter some fixed words or sentences. Such a restriction is not imposed in a text-independent speaker recognition system. In this thesis, we address some issues related to the development of a text-independent speaker verification system, and focus in particular on the issue of developing a model to capture the characteristics of a speaker present in the speech signal.
1.2 ISSUES IN SPEAKER RECOGNITION
Speaker recognition by a machine involves three stages: (1) extraction of features to represent the speaker information present in the speech signal, (2) probabilistic modeling of speaker features, and (3) decision logic to implement the identification or verification task. The issues involved in each of these stages are discussed below.
In practical applications, the speech signal is transmitted over a communication channel. The characteristics of the telephone channel and handset degrade the performance of a speaker recognition system [10]. Variability in the characteristics of channels and handsets further degrades the performance of a speaker recognition system [11] [12]. The issue of reducing the channel and handset effects on the performance of a speaker recognition system is addressed in the literature at the feature, model and decision levels.
on the performance of the AANN-based speaker verification system is also addressed. Methods are suggested to reduce the effects of channel and handset mismatch between training and testing data. All the studies reported in this thesis are performed on a population of 230 speakers using 1448 test utterances.
verification system is proposed. The effect of the structure of the AANN model is studied in detail. Performance of speaker verification is examined for different numbers of units in the dimension compression layer. To reduce the effects of channel and handset mismatch between training and testing data, a method is proposed to normalize the error obtained by an AANN model. Finally, the performance of AANN models is compared with that of Gaussian mixture models for a database of 1000 speakers.

Chapter 6 concludes the thesis by summarizing the work.
CHAPTER 2
2.1 INTRODUCTION
The focus of this review is on the approaches followed at the feature, model and decision levels for speaker recognition, rather than on the performance of the approaches and population of databases. Articles addressing various issues of speaker recognition can be found in [2] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23].
The organization of this chapter is as follows: In Section 2.2, we discuss suitable feature vectors for representing speaker information present in the speech signal. Text-dependent speaker recognition systems use these feature vectors for nonlinear matching techniques such as Dynamic Time Warping [24]. Text-independent speaker recognition systems use probabilistic models of the feature vectors extracted from the speech signal. The discussion in Section 2.3 is focused on different probabilistic models used to describe the distribution of the feature vectors of a speaker. The merits and demerits of these models are discussed, and the need to explore new methods is discussed in Section 2.4.
Features that are derived from different sources of speaker information present in the speech signal can be grouped into three categories. They are: (1) features related to the anatomical differences in the vocal tract, (2) features related to the excitation source of the vocal tract, and (3) features related to differences in speaker habits.
The anatomical differences relate to the structural differences in the shape and size of the vocal tract, which vary considerably from one speaker to another. The vocal tract shape is difficult to derive from the speech signal [25] [26]. Therefore, the shape of the vocal tract is characterized by the resonances of the vocal tract system. The time varying vocal tract is assumed to be in a quasi-stationary state for a short duration (10-30 ms), and the spectral content of the speech segment is used to characterize the vocal tract shape. The periodic oscillation of the vocal folds is a major source of excitation of the vocal tract system. The individuality of the speaker is associated with the oscillations of these vocal folds. The differences in speaking habits are due to the manner in which speakers have learned to use their speech production mechanism. These differences indicate the temporal variations of the characteristics of different individuals.
2.2.1 Pitch

The rate of vibration of the vocal folds is known as the fundamental frequency or pitch. Pitch information can be extracted from the speech signal using various methods such as zero-crossing, cepstral methods, group delay functions etc. [28] [29] [30]. A discussion of various algorithms for pitch extraction is given in [31]. The individuality of a speaker is associated with the pitch patterns extracted from the speech signal. Atal [32] demonstrated that pitch contours can be used effectively for text-dependent speaker identification. Yegnanarayana et al. [33] used the local fall and rise patterns of pitch, and durational features, for identifying the speaker. In [34], the long-term average value of the fundamental frequency is used for text-independent speaker recognition. Studies have also been made to investigate the usefulness of combining pitch information with spectral features [35] [36]. In [37] a multistage pattern recognition approach is proposed for speaker identification. A two stage classifier with pitch and autocorrelation coefficients is shown to perform better than a single stage classifier using these features together.
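Of the extraction methods mentioned above, the autocorrelation of a voiced frame gives a particularly simple pitch estimate: the lag of the strongest autocorrelation peak within the plausible pitch-period range. The following is a minimal illustrative sketch, not a method from this thesis; the function name and the 50-400 Hz search range are assumptions.

```python
import numpy as np

def estimate_pitch(frame, sr, fmin=50.0, fmax=400.0):
    """Estimate the fundamental frequency of a voiced frame by
    locating the strongest autocorrelation peak within the
    plausible pitch-period range [1/fmax, 1/fmin] seconds."""
    frame = frame - np.mean(frame)
    # One-sided autocorrelation: lags 0 .. len(frame)-1.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)                      # shortest period considered
    lag_max = min(int(sr / fmin), len(ac) - 1)    # longest period considered
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag

# A synthetic 200 Hz tone: the estimate should land near 200 Hz.
sr = 8000
t = np.arange(int(0.04 * sr)) / sr
f0 = estimate_pitch(np.sin(2 * np.pi * 200 * t), sr)
```

On real speech, such a raw estimator would need voicing detection and peak-picking safeguards against pitch halving/doubling.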
2.2.2 Formants

Formants may be described as the resonances of the vocal tract system. They vary in frequency, relative amplitude and bandwidth according to speech and speaker. Extraction of formant frequencies is a difficult problem in speech processing [38] [39]. Their presence in the spectrum envelope as peaks may be masked by the harmonics of the excitation signal, and thus smoothing is required prior to the use of any peak picking algorithms. Formants and their contours have been used for text-dependent speaker recognition studies [40] [41]. Formants corresponding to nasal consonants are found to be effective for speaker recognition [27] [42]. Su et al. used coarticulation between a nasal and the following vowel as an acoustic cue for identifying speakers [43]. However, comparative studies on the efficiency of different features indicate that distances based on formant frequencies contribute little towards discriminating impostors [44].
2.2.3 Long-Term Spectral Features

The idea of using the long-term spectrum is to suppress the spectral details due to the linguistic content of the speech signal. The Fourier transform of a long segment of a speech signal, or averaging of spectral features obtained from short segments of the speech signal, represents the slowly varying components of the utterance. Since the speaker characteristics vary slowly compared to the message part of the utterance, the long-term spectrum can be considered as a speaker-specific feature, independent of the sentence [45] [46] [47]. It has to be noted that the long-term spectrum is not a stable feature due to its dependence on the characteristics of the communication channel [2].
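The averaging variant described above can be sketched as follows. This is an illustrative sketch only; the frame length, hop and window choice are assumptions, not values from the cited studies.

```python
import numpy as np

def long_term_spectrum(signal, frame_len=256, hop=128):
    """Average the magnitude spectra of overlapping short frames.
    Averaging suppresses the fast-varying linguistic detail and
    retains the slowly varying, speaker-dependent spectral shape."""
    window = np.hanning(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.mean(spectra, axis=0)

lts = long_term_spectrum(np.random.randn(4000))
```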
A study into the use of LPC for speaker recognition was carried out by Atal [49]. These coefficients are highly correlated, and the use of all prediction coefficients may not be necessary for the speaker recognition task [50]. Sambur [51] used a method called orthogonal linear prediction to orthogonalize linear prediction coefficients, reflection coefficients and log area coefficients by projecting these coefficients onto the corresponding eigenspaces. It is noted that only a small subset of the resulting orthogonal coefficients exhibits significant variation over the duration of an utterance. It is shown that reflection coefficients are as good as the other feature sets. Naik et al. [52] used principal spectral components derived from linear prediction coefficients for the speaker verification task.
In an early study, Luck [35] used FFT-based cepstral coefficients for speaker verification. Atal [49] explored LPC-derived cepstral coefficients and proved their effectiveness over LPC and other features such as pitch and intensity contours. Furui [53] observed similar performance of speaker verification for LPC-derived and FFT-based cepstral coefficients. LPC-derived cepstral coefficients take less computation time and are used even in recent studies for the speaker recognition task [11] [20] [56].
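The low computation cost of LPC-derived cepstra comes from the standard recursion c_n = a_n + \sum_{k=1}^{n-1} (k/n) c_k a_{n-k}, which avoids any FFT. A small sketch under stated assumptions (the function name and the predictor-coefficient sign convention are assumptions, not the thesis's notation):

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC predictor coefficients a[1..p] to cepstral
    coefficients via c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k},
    with a_n taken as 0 for n > p."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

# Single-pole model 1/(1 - 0.5 z^-1): cepstrum should be 0.5^n / n.
c = lpc_to_cepstrum(np.array([0.5]), 3)
```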
2.2.4.3 Mel-Frequency Cepstral Coefficients

The FFT-based cepstral coefficients are computed by taking the IFFT of the log magnitude spectrum of the speech signal. The mel-warped cepstrum is obtained by inserting an intermediate step of transforming the frequency scale to place less emphasis on high frequencies before taking the IFFT. The mel scale is based on human perception of the frequency of sounds [54]. Most of the current speaker verification systems use mel-frequency cepstral coefficients to represent the speaker information present in the speech signal [11] [19] [57].
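The frequency warping mentioned above is commonly realized with the mapping mel = 2595 log10(1 + f/700), which is roughly linear below 1 kHz and logarithmic above, hence the reduced emphasis on high frequencies. The helper names below are assumptions for illustration:

```python
import numpy as np

def hz_to_mel(f):
    """Map frequency in Hz to the perceptual mel scale (a common
    formula; by construction 1000 Hz maps to about 1000 mel)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```

A full MFCC front end would place triangular filters at equal spacing on this mel axis before the log and IFFT (DCT) steps.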
It is interesting to note that the spectral features which we claim to represent speaker information are borrowed from the speech recognition field [71] [72]. The difference lies in grouping the feature vectors of a class. For speech recognition, the feature vectors of a particular phoneme uttered by several speakers are grouped into one class. For speaker recognition, the feature vectors extracted from the utterance of an individual are grouped into one class. One can argue that it is necessary to derive the features from the speech signal with an objective criterion of representing speaker information. However, the performance of the current speaker recognition systems using spectral features, as reported in [11] and [1], suggests that speaker recognition systems can be used for practical applications, provided a better description is given for the distribution of feature vectors. The next discussion focuses on the distribution modeling of feature vectors for text-independent speaker recognition systems.
ify speaker differences. Such methods may not adequately represent the distribution of feature vectors. Hence, the probability distribution of feature vectors is modeled by parametric or nonparametric methods. Models which assume a probability density function are termed parametric. In nonparametric modeling, minimal or no assumptions are made regarding the probability distribution of feature vectors. In this section, we briefly review the nearest neighbor, Vector Quantization (VQ), Gaussian Mixture Model (GMM) and neural network based approaches for speaker recognition. While GMM is a parametric model, nearest neighbor, VQ and neural network models are treated as nonparametric.
book. Soong et al. [76] employed the VQ technique to represent speaker features. Each speaker is characterized by a VQ codebook of 64 vectors constructed from a large set of short-term spectral vectors obtained from the training utterances. In [77], performance of a VQ based speaker recognition system is evaluated for both text-dependent and text-independent cases. Similar approaches are reported by Helms [78], Dorsey et al. [79], Li [80], Shikano [81], and Buck et al. [82]. In a later study, two VQ codebooks (one for static cepstrals and the other for delta cepstrals), each with 64 entries, are generated and used as a model for each speaker [83]. Matsui et al. [36] used speaker models consisting of two VQ codebooks, one for voiced sounds and the other for unvoiced sounds.
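A speaker codebook of the kind described above can be sketched with plain k-means clustering. This is a simplifying sketch, not any of the cited codebook algorithms; the deterministic initialization and Euclidean distortion measure are assumptions.

```python
import numpy as np

def train_codebook(vectors, k, iters=20):
    """Naive k-means codebook: the speaker is represented by the
    k mean vectors of his/her training feature vectors."""
    # Deterministic spread initialization (k-means++ or LBG splitting
    # would be more robust on real data).
    idx = np.linspace(0, len(vectors) - 1, k, dtype=int)
    codebook = vectors[idx].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = vectors[labels == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

def vq_distortion(vectors, codebook):
    """Average distance of test vectors to their nearest codewords;
    a lower distortion suggests a better speaker match."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return float(d.min(axis=1).mean())
```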
Instead of point to point comparison (the nearest neighbor approach), VQ based approaches represent the speaker features by a set of mean vectors. A still better approach is to model the speaker features with a set of mean and variance vectors. This technique is widely used and is known as Gaussian mixture models [11] [84].
$$p(x \mid \lambda_s) = \sum_{i=1}^{M} \alpha_i^s f_i^s(x)$$

The mixture density function is a weighted linear combination of M component unimodal Gaussian densities f_i^s(\cdot). Each Gaussian density function f_i^s(\cdot) is parameterized by the mean vector \mu_i^s and the covariance matrix C_i^s using

$$f_i^s(x) = \frac{1}{(2\pi)^{n/2}\,|C_i^s|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_i^s)^T (C_i^s)^{-1} (x-\mu_i^s)\right),$$
where (C_i^s)^{-1} and |C_i^s| denote the inverse and determinant of the covariance matrix C_i^s. The mixture weights \alpha_i^s satisfy the constraint \sum_{i=1}^{M} \alpha_i^s = 1. Collectively, the parameters of the speaker model \lambda_s are denoted as \lambda_s = \{\alpha_i^s, \mu_i^s, C_i^s\}, i = 1, \ldots, M. The number of mixture components is chosen empirically for a given data set. The parameters of the GMM are estimated using the iterative expectation-maximization algorithm [5] [85].
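The mixture density above is usually evaluated in the log domain for numerical stability. A minimal sketch for the diagonal-covariance case (the function name and the diagonal restriction are illustrative assumptions; full covariances require determinants and matrix inverses):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | lambda_s) for a diagonal-covariance GMM: a weighted
    linear combination of M unimodal Gaussian densities, combined
    with log-sum-exp to avoid underflow."""
    x = np.asarray(x, dtype=float)
    n = x.size
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        diff = x - np.asarray(mu, dtype=float)
        var = np.asarray(var, dtype=float)
        log_det = np.sum(np.log(var))        # log |C| for diagonal C
        quad = np.sum(diff * diff / var)     # (x - mu)^T C^{-1} (x - mu)
        log_terms.append(np.log(w)
                         - 0.5 * (n * np.log(2.0 * np.pi) + log_det + quad))
    return float(np.logaddexp.reduce(log_terms))
```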
A speaker verification system using GMM can be developed as follows [86]: Feature vectors of a given speaker are used to build a speaker-specific GMM (\lambda_s). The speaker-specific GMM is derived from a universal background model (\lambda_b), which is trained with feature vectors of several speakers. The basic idea behind using a speaker-specific and a universal background model is to accommodate the linguistic variability of the test utterance at the decision level. During the verification phase, the test utterance is given to the universal background model, and a few mixture components which contribute significantly to the likelihood value are noted. The likelihood of the speaker-specific GMM is then computed by considering only the selected mixture components. The claim is rejected or accepted by comparing the log likelihood ratio with the threshold \theta using

$$\ln \frac{p(x \mid \lambda_s)}{p(x \mid \lambda_b)} \;\underset{\text{reject}}{\overset{\text{accept}}{\gtrless}}\; \theta$$
This likelihood ratio is viewed as a means of normalizing the likelihood for the target speaker. It is observed that the performance of a GMM-based speaker verification system is dependent on the number of mixture components and also on the database used to build the universal background model [7] [1].
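The accept/reject rule above can be sketched as follows, assuming per-frame log likelihoods are averaged over the utterance before comparison with the threshold (an assumption for illustration; the function name is also hypothetical):

```python
import numpy as np

def verify_claim(ll_speaker, ll_background, theta=0.0):
    """Accept an identity claim when the average log likelihood
    ratio ln p(x|lambda_s) - ln p(x|lambda_b) exceeds theta."""
    llr = float(np.mean(np.asarray(ll_speaker) - np.asarray(ll_background)))
    return (llr > theta), llr
```

In practice the threshold is chosen from a development set, for example at the equal error rate operating point.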
Theoretical analysis of the multilayer perceptron using sigmoidal activation functions suggests that these networks may not draw closed boundaries in the feature space [93]. This conclusion exposes the inadequacy of neural network models using sigmoidal activation functions for the classification task. But there exist other ANN models (radial basis function networks [94], autoassociative networks [95]) which do not suffer from the disadvantage pointed out in [93]. It is also to be noted that neural network models have not been studied explicitly to capture the distribution of the data (feature vectors of a speaker), although there are attempts to interpret the results of learning in terms of the distribution of the data [89] [9].
Current speaker verification systems operating in text-independent mode use GMM to estimate the probability distribution of the feature vectors of a speaker. The probability distribution estimated by the GMM is constrained by the fact that the shape of the components of the distribution is assumed to be Gaussian, and the number of mixtures is fixed a priori. Feature vectors extracted from the speech signal may have distributions in the feature space which we may not be able to describe accurately using a GMM, with its first and second order statistics and mixture weights. This fact is indicated by the improved performance of GMM-based speaker verification systems with an increase in the number of mixtures [7]. Therefore, it is worth exploring new methods to capture the distribution of the feature vectors of a speaker. In this context, we investigate the potential of nonlinear models such as AANN models.
Efforts to use AANN models for speaker verification can be found in [96] [95]. While the study in [96] highlights the importance of the nonlinear activation function, the effort is restricted to a three layer AANN model and its application to phoneme-based speaker verification. Ikbal et al. [95] have studied the importance of more hidden layers, and the efficiency of AANN models is demonstrated on a small subset of the NTIMIT database. The intuitive arguments of [95] and [97] provide justification for the ability of AANN models to capture a nonlinear subspace. Our studies on the characteristics of the AANN model are focused on the error given by the AANN model for every point in the input space. We show that there is a relation between the distribution of the feature vectors and the training error surface captured by an AANN model in the input feature space. A probability surface is derived from the training error surface captured by the network. The distribution capturing ability of three layer and five layer AANN models is studied using the probability surface captured by these network models. The potential of the five layer AANN model for the task of speaker verification is demonstrated on conversational telephone speech for 230 speakers. The issues of background model, role of hidden units and score normalization are addressed in the context of minimizing the effects of telephone channel and handset characteristics on the performance of speaker verification systems.
CHAPTER 3
NETWORK MODELS
3.1 INTRODUCTION
Autoassociative neural network models are feedforward neural networks performing an identity mapping of the input space [89] [88]. Many applications use five layer models for dimensionality reduction by projecting the input data onto the nonlinear subspace captured by the network. From a different perspective, the AANN models can be used to capture the distribution of input data. In this context, the nonlinear activation function at the hidden units plays a significant role. There exists a relationship between the weights of the network and the principal components of the training data [98]. We establish that there is a relation between the distribution of the given data and the training error surface captured by the network in the input space. We show that the weights of the five layer AANN model indeed capture the distribution of the given data.

This chapter is organized as follows: The relation between principal component analysis and AANN models is discussed in Section 3.2. Characteristics of a three layer AANN model with linear units in the dimension compression layer are discussed in Section 3.3.1. The relation between the training error surface and the distribution of the input data is also discussed. The effect of the nonlinear activation function at the hidden units is studied in Section 3.3.2. Effectiveness of a five layer model to capture a nonlinear subspace is discussed in Section 3.4.
3.2 PRINCIPAL COMPONENT ANALYSIS AND AANN MODELS
algorithm adjusting the weights of a feedforward neural network for identity mapping is also to obtain a minimum value of J [88]. The limitation of PCA to represent an input space using a linear subspace motivated researchers to investigate a method of projecting the input data onto a nonlinear subspace using AANN models [100] [98].
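The link between minimizing J and PCA can be checked numerically: over rank-p linear identity mappings, the minimum of J equals the sum of the discarded eigenvalues of the data covariance, attained by projecting onto the top-p principal subspace. A small demonstration on synthetic data (the data and variable names are illustrative, not from this thesis):

```python
import numpy as np

# Zero-mean 3-D data with one deliberately weak direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500))
X[2] *= 0.1
X -= X.mean(axis=1, keepdims=True)

C = X @ X.T                              # (unnormalized) covariance
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order

p = 2
U = eigvecs[:, -p:]                      # top-p principal directions
Y = U @ (U.T @ X)                        # rank-p linear identity mapping
J = np.sum((X - Y) ** 2)                 # J = ||X - Y||^2
residual = np.sum(eigvals[:-p])          # discarded eigenvalue mass
```

J and residual agree up to floating point, which is why a three layer AANN with linear units cannot do better than PCA.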
Attempts to use nonlinear hidden units in a three layer AANN model did not provide any solution better than PCA [98]. Addition of hidden layers before and after the compression layer projects the input data onto a nonlinear subspace [100]. Many applications use the nonlinear subspace captured by the five layer AANN model for dimensionality reduction [6]. There exists a different perspective in which the characteristics of the AANN model can be used to capture the distribution of the given data.

Studies on three layer AANN models show that the nonlinear activation function at the hidden units clusters the input data in a linear subspace [101]. Theoretically, it was shown that the weights of the network will produce small errors only for a set of points around the training data [101]. When the constraints of the network are relaxed in terms of layers, the network is able to cluster the input data in the nonlinear subspace [95]. Our study on the relation of the training error surface and the data distribution led us to explore the distribution capturing ability of an AANN model [9]. The distribution capturing ability of a three layer and a five layer AANN model is studied using the training error surface realized by the neural network models in the input feature space.
Consider a three layer AANN model with M units in the input and output layers, and p < M units in the hidden layer. Let X = [x_1, x_2, \ldots, x_N] be the M \times N matrix of input vectors, and let Y be the M \times N matrix formed by the vectors realized at the units of the output layer of the network. The match between X and Y is measured in terms of the mean square error J = \|X - Y\|^2, where \|\cdot\|^2 indicates the squared norm. Let W_1^T = [u_{ij}] \in \mathbb{R}^{p \times M} represent the weight matrix connecting the input layer and the hidden layer, and W_2 = [v_{ij}] \in \mathbb{R}^{M \times p} represent the weight matrix connecting the output layer and the hidden layer. For a linear activation function at the input and output units, J = \|X - W_2 F(W_1^T X)\|^2, where F is the activation function at the hidden units.
Fig. 3.1: (a) 2-D data (a 3-D view is shown). (b) The 2-D data shown in (a), repeated. (c) Output of the 3 layer network 2L 1L 2L. (d) Output of the 3 layer network 2L 1N 2L. (e) Probability surface captured by the network 2L 1L 2L. (f) Probability surface captured by the network 2L 1N 2L. Here L denotes a linear unit and N denotes a nonlinear unit.
For illustration, consider a three layer AANN model with one linear unit in the hidden layer. The model is trained with the artificial 2-D data shown in Fig.3.1(a) using the backpropagation learning algorithm in pattern mode [89], [88]. The distribution (shown by solid lines) of the input vectors realized by the AANN model is shown in Fig.3.1(c). From Fig.3.1(c), we observe that the linear subspace captured by the network is along the principal direction of the input data. In order to visualize the distribution better, one can plot the training error for each input data point in the form of some probability surface, as shown in Fig.3.1(e). The training error E_i for data point i in the input space is plotted as f_i = e^{-E_i/\alpha}, where \alpha = 2 is a constant. Note that f_i is not strictly a probability density function, but we call the resulting surface a probability surface. The plot of the probability surface shows larger amplitude for smaller error E_i, indicating a better match of the network for that data point. One can use the probability surface to study the characteristics of the distribution of the input data captured by the network [9].
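The mapping from training errors to the probability surface described above is direct to compute; a minimal sketch (alpha = 2 follows the text; the function name is an assumption):

```python
import numpy as np

def probability_surface(errors, alpha=2.0):
    """f_i = exp(-E_i / alpha): larger values mark points the network
    reconstructs well. Note that f_i is not a true density (it need
    not integrate to one); it is only a surface for visualization."""
    return np.exp(-np.asarray(errors, dtype=float) / alpha)

f = probability_surface([0.0, 2.0, 8.0])
```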
The data set X is said to be \epsilon-autoassociated with the weights of the network if \frac{\|Y - X\|^2}{\|Y\|^2} < \epsilon, where \epsilon \in \mathbb{R}^{+}. This leads to the following:

$$\frac{\|Y - X\|^2}{\|Y\|^2} < \epsilon \;\Rightarrow\; \frac{\|W_2 F(W_1^T X) - X\|^2}{\|W_2 F(W_1^T X)\|^2} < \epsilon$$
We observe that if a linear activation function is used at the hidden units, the common factor gets canceled, and the inequality holds for all values of \epsilon. However, if a nonlinear activation function is used at the hidden units, it is shown in [101] that the above inequality holds for \epsilon < \gamma, where \gamma = (1 + \epsilon)\|W\|^2/\|X\|^2. Thus, only limited points in the input space are \epsilon-autoassociated with the weights of the network. The linear surface captured by the three layer AANN model may not produce a low \epsilon for the training data, and hence the probability surface shown in Fig.3.1(f) does not reflect the distribution of the training data. It is necessary to capture a nonlinear subspace to obtain a low \epsilon for the training data. In the next section, we show the effectiveness of a five layer AANN model to capture nonlinear subspaces.
The five layer AANN model shown in Fig.3.2 performs Nonlinear Principal Component Analysis (NLPCA) [100]. The second and fourth layers of the network have more units than the input layer. The third layer has fewer units than the first or fifth. The activation functions at the third layer may be linear or nonlinear, but the activation functions at the second and fourth layers are essentially nonlinear.
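A forward pass through such a five layer structure (for example the 19L 38N 4N 38N 19L structure used later in this thesis) can be sketched as follows. This is a shape-level illustration with random, untrained weights; the class name, tanh nonlinearity and initialization are assumptions, and no training procedure is shown.

```python
import numpy as np

class FiveLayerAANN:
    """Forward pass of a five layer AANN: nonlinear (tanh) second and
    fourth layers around a narrow linear compression layer; the output
    layer has the same dimension as the input."""
    def __init__(self, sizes, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix per adjacent layer pair.
        self.W = [rng.normal(0.0, 0.1, (m, n))
                  for n, m in zip(sizes[:-1], sizes[1:])]

    def forward(self, x):
        h = np.tanh(self.W[0] @ x)      # expansion layer (nonlinear)
        z = self.W[1] @ h               # compression layer (linear)
        h2 = np.tanh(self.W[2] @ z)     # expansion layer (nonlinear)
        return self.W[3] @ h2           # output layer (linear)

    def error(self, x):
        """Squared reconstruction error, the quantity the probability
        surface is built from."""
        return float(np.sum((x - self.forward(x)) ** 2))

net = FiveLayerAANN([19, 38, 4, 38, 19])
```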
Fig. 3.2: A five layer AANN model, with the input layer (layer 1), the dimension compression layer (layer 3) and the output layer (layer 5).
The function of the five layer AANN model is understood better by splitting the five layers into mapping (layers 1, 2 and 3) and demapping (layers 3, 4 and 5) networks. The mapping network projects the input space R^M onto an arbitrary subspace R^p, where p < M. The mapping function G is nonlinear, and a nonlinear subspace is formed at the third layer. The projection of the nonlinear subspace R^p back onto the input space R^M is performed by the demapping network, and the demapping function H is also nonlinear. The mapping and demapping functions may not be unique for a given data set. This can be observed from Figs.3.3(a) and 3.3(c), where two hypersurfaces are captured by the same five layer network (2L 12N 1N 12N 2L) for the artificial 2-D data of Fig.3.1(b) in two different trials. These hypersurfaces are obtained by plotting the output of the network for all the points in the input space shown in Fig.3.1(b). The probability surfaces corresponding to these two hypersurfaces are shown in Figs.3.3(b) and 3.3(d). Though the nonlinear subspaces captured by the network differ from one training session to another, it should be noted that the probability surfaces captured by the network remain similar across training sessions.
From the above discussion, it is clear that a five layer AANN model is capable of capturing nonlinear subspaces. The ability of a five layer AANN model to perform clustering in the nonlinear subspace can be explored to capture the complex distribution of the data in the input space. Fig.3.4(c) illustrates the distribution capturing ability of a five layer AANN model for the artificial 2-D data shown in Fig.3.4(a). We notice that the five layer AANN model can be used as a nonparametric model to capture the distribution of the given data. The components spanning the nonlinear subspace captured by these models are known as nonlinear principal components. These models differ from other nonlinear methods such as principal curves due to the relationship between the weights of the network and the input data arising from the nonlinear activation function [103]. Apart from these simple illustrations using 2-D data, we also show the potential of the five layer AANN model in situations such as text-independent speaker verification [11] [104].
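The mapping/demapping decomposition can be made concrete as a forward pass through the 2L 12N 1N 12N 2L network used in these illustrations. A minimal sketch with random, untrained weights (the small initialization scale is an assumption; in the thesis the weights come from backpropagation training):

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [2, 12, 1, 12, 2]          # 2L 12N 1N 12N 2L
Ws = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    """Five layer AANN: tanh at layers 2, 3 and 4; linear output layer."""
    h = x
    for i, (W, b) in enumerate(zip(Ws, bs)):
        h = W @ h + b
        if i < 3:                  # layers 2, 3 and 4 are nonlinear (N)
            h = np.tanh(h)
    return h

def G(x):
    """Mapping network (layers 1-3): projects R^2 onto the 1-D subspace."""
    h = np.tanh(Ws[0] @ x + bs[0])
    return np.tanh(Ws[1] @ h + bs[1])

x = np.array([0.3, -0.7])
print(forward(x).shape, G(x).shape)   # (2,) and (1,)
```

The demapping network H is simply the remaining two layers applied to G(x), so forward(x) = H(G(x)).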
Fig. 3.3: A five layer network 2L 12N 1N 12N 2L is trained with the 2-D data shown in Fig.3.1(b) in two different sessions. Hypersurfaces are obtained by plotting the output of the network for all the points in the input space. (a) Hypersurface (solid lines) captured by the network in training session I. The input data used for training is also plotted in the figure. (b) Probability surface captured by the network in training session I. (c) Hypersurface (solid lines) captured by the network in training session II. The input data used for training is also plotted in the figure. (d) Probability surface captured by the network in training session II.
Fig. 3.4: (a) 2-D data. (b) Output of the five layer network 2L 12N 1N 12N 2L. (c) Probability surface captured by the network.
3.5 SUMMARY

In this chapter, the characteristics of three layer and five layer AANN models are described. A three layer AANN model clusters the input data in the linear subspace, whereas a five layer AANN model captures the nonlinear subspace passing through the distribution of the input data. It is shown that the error surface realized by the network in the input space is useful for studying the characteristics of the distribution captured by the network. The distribution capturing ability of a five layer AANN model is illustrated for artificial 2-D data. The following chapters exploit the distribution capturing ability of a five layer AANN model for speaker verification.
CHAPTER 4

4.1 INTRODUCTION

In this chapter, the property of five layer AANN models to capture the distribution of the given data is exploited to build a speaker verification system. Separate AANN models are used to capture the distribution of the feature vectors of each speaker. The development of the AANN-based speaker verification system is described using a database of 230 speakers.
This chapter is organized as follows: The speech database used in this study is described in Section 4.2. The procedure used to extract feature vectors from the speech signal is discussed in Section 4.3. The algorithm used to build a speaker model is explained in Section 4.4. The verification procedure of the AANN-based speaker verification system is discussed in Section 4.5. The performance of a GMM-based speaker verification system is compared with that of the AANN-based speaker verification system in Section 4.6.
4.2 SPEECH DATABASE

The speech corpus used for this study consists of the SWITCHBOARD-2 phase-2 and phase-3 databases of the National Institute of Standards and Technology (NIST). These databases were used for the NIST-99 official speaker recognition evaluation [11]. The database of phase-2 is used for background modeling, and is hence referred to as development data. The performance of the speaker verification system is evaluated on the database of phase-3, which is referred to as evaluation data. The development data consists of 500 speakers (250 male and 250 female), and the evaluation data consists of 539 speakers (230 male and 309 female).
The data provided for each speaker is conversational telephone speech collected from different sessions (conversations), sampled at the rate of 8000 samples/sec. The training data consists of two minutes of speech, collected from two different conversations over the same phone number. The use of the same phone number results in passing the speech data over the same handset and communication channel. Two different types of microphones (also referred to as handsets) are used for collecting the speech data: carbon-button and electret. The performance of the speaker verification system is evaluated on test utterances collected from different recording environments. The duration of a test utterance varies between 3 and 60 seconds. Each test utterance has 11 claimants, where the genuine speaker may or may not be one of the claimants. The gender of the claimants and the speaker of the test utterance is the same; there are no cross-gender trials.
All the studies reported in this thesis are performed on the male subset of 230 speakers, with 1448 male test utterances of the evaluation data. The performance of the speaker verification system is evaluated for the following three conditions:

1. Matched condition: The training and testing data are collected from the same phone number.

2. Channel mismatch condition: The training and testing data are collected from different phone numbers, but it is ensured that the same handset type is used in both cases. The use of different phone numbers results in passing the speech signal over different communication channels.

3. Handset mismatch condition: The training and testing data are collected over different handset types.
4.3 FEATURE EXTRACTION

Speaker information can be extracted both at the segmental and suprasegmental levels. Segmental features are features extracted from short (10-30 ms) segments of the speech signal. Some of the segmental features are linear prediction cepstral coefficients, mel-cepstral coefficients, log spectral energy values, etc. [54]. These features represent the short-term spectra of the speech signal. The spectrum of a speech segment is attributed primarily to the shape of the vocal tract. The spectral information of the same sound uttered by two persons may differ due to differences both in the shapes of their vocal tracts and in the manner in which they produce speech [14]. Comparative studies between spectral features and other features such as the fundamental frequency show that the spectral features seem to provide better discrimination among speakers [34]. In this work, spectral features represented by linear prediction cepstral coefficients are used [53].
The speech signal transmitted over a telephone channel encounters a linear distortion due to the filtering effect of the channel [49] [53]. Linear channel effects are compensated to some extent by removing the mean of the trajectory of each cepstral coefficient. It has been shown that mean subtraction improves the performance significantly when the training and testing data are collected from different channels [49] [53]. But the recognition accuracy is reduced when mean subtraction is used for speaker verification in which the training and testing data are collected from the same channel [49] [53].
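Cepstral mean subtraction as described above removes the per-utterance mean of each cepstral trajectory. A minimal sketch on a matrix of feature vectors (the array shape, rows = frames and columns = cepstral coefficients, is illustrative):

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the mean of each cepstral coefficient trajectory, which
    compensates linear channel effects to some extent."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# 500 frames of 19 cepstral coefficients (random stand-in data)
frames = np.random.default_rng(2).standard_normal((500, 19))
cms = cepstral_mean_subtraction(frames)
print(np.abs(cms.mean(axis=0)).max())   # ~0: each trajectory is now zero mean
```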
The feature vectors extracted from the training data of a speaker are used to train an AANN model using the backpropagation learning algorithm in the pattern mode [89] [88]. The structure of the AANN used for this study is 19L 38N 14N 38N 19L, where L denotes a linear unit and N denotes a nonlinear unit. The integer value indicates the number of units used in that particular layer. By incorporating some of the heuristics described in [89] and [106], the actual algorithm used to train the network is as follows:
NOTATION

The indices i, j and k refer to the different units in the network.

The iteration (time step) is denoted by n.

The symbol e_j(n) refers to the error at the output unit j for iteration n.

The symbol d_j(n) refers to the desired output of unit j for iteration n.

The symbol y_j(n) refers to the actual output of unit j for iteration n.

The symbol w_jk(n) denotes the synaptic weight connecting the output of unit k to the input of unit j at iteration n. The correction applied to this weight at iteration n is denoted by Δw_jk(n).

The induced local field (i.e., weighted sum of all synaptic inputs plus bias) of unit j at iteration n is denoted by v_j(n).

The activation function describing the input-output functional relationship of the nonlinearity associated with unit j is denoted by φ_j(.). For linear activation of unit j, φ_j(v_j(n)) = v_j(n), whereas for nonlinear activation of unit j, φ_j(v_j(n)) = a tanh(b v_j(n)). The values of a and b are taken as 1.7159 and 2/3, respectively (pg. 181 of [89]).

The bias applied to unit j is denoted by b_j; its effect is represented by a synapse of weight w_j0 = b_j connected to a fixed input equal to +1.
ALGORITHM

1. Initialize the weights w_jk connecting unit k to unit j with uniformly distributed random values taken from the interval [-3/sqrt(F_j), +3/sqrt(F_j)], where F_j is the fan-in of unit j (refer to pg. 211 of [106]).

2. Randomly choose an input vector x.

3. Propagate the signal forward through the network.

4. Compute the local gradients δ:

   - For a unit j at the output layer, δ_j = e_j(n) φ'_j(v_j(n)), where φ'_j(.) denotes the first derivative of φ_j(.). Since the activation at the output units is linear in our case, δ_j = e_j, where e_j = d_j - y_j. Moreover, as we are interested in an autoassociative mapping, δ_j = x_j - y_j.

   - For a unit j at a hidden layer, δ_j = φ'_j(v_j(n)) Σ_k δ_k(n) w_kj(n).

5. Update the weights using Δw_ji(n) = η δ_j(n) y_i(n) + α Δw_ji(n-1), where η is the learning rate and α = 0.3 is the momentum factor.

6. Go to step 2 and repeat for the next input vector.
This learning algorithm adjusts the weights of the network to minimize the mean square error obtained for each feature vector. If the adjustment of the weights is done once for all the feature vectors, then the network is said to be trained for one epoch. The mean square error, averaged over all feature vectors, is plotted for successive epochs in Fig.4.1. We observe that the reduction in the average error is negligible after about 60 epochs. Hence, the training of the AANN model was stopped at 60 epochs. A similar behavior is observed with the feature vectors of other speakers. Once an AANN model is trained with the feature vectors of a speaker, we use the AANN as the speaker model. For all the 230 speakers of the evaluation data, the speaker models are built by training one AANN model for each speaker.
During the testing phase, the feature vectors extracted from the test utterance are given to the claimant model to obtain the claimant score. A claimant model is the model of the speaker whose identity is being claimed. The score of a model is defined as

(1/l) Σ_{i=1}^{l} ||x_i - y_i||^2 / ||x_i||^2,

where x_i is the input vector to the model, y_i is the output given by the model, and l is the number of feature vectors of the test utterance. Using 11 claimants for each of the 1448 utterances, a total of 15928 tests are performed on the 230 speaker models. The claimant scores are segregated into genuine scores (scores obtained by the genuine claimants) and impostor scores (scores obtained by the impostor claimants). The genuine scores are further divided into three sets based upon the recording environment between the training and testing data. The distributions of the genuine scores and the impostor scores are shown in Fig.4.2. From the distributions of the claimant scores, a threshold is chosen to accept or reject the claim of the speaker.

Fig. 4.1: Mean square error, averaged over all feature vectors, plotted for successive epochs. This curve demonstrates the convergence of an AANN model for the feature vectors of a speaker.
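The claimant score defined above can be computed directly from the model's reconstructions. A minimal sketch, where the trained AANN is stood in for by an arbitrary callable (the noisy-identity model below is only a placeholder to exercise the function):

```python
import numpy as np

def model_score(model, features):
    """Score of a claimant model: the mean of ||x_i - y_i||^2 / ||x_i||^2
    over the l feature vectors of the test utterance, where y_i = model(x_i)."""
    errs = [np.sum((x - model(x)) ** 2) / np.sum(x ** 2) for x in features]
    return float(np.mean(errs))

rng = np.random.default_rng(4)
feats = rng.standard_normal((100, 19))            # stand-in test utterance
noisy_identity = lambda x: x + 0.05 * rng.standard_normal(19)
print(model_score(noisy_identity, feats))         # small positive score
```

A low score means the model reconstructs the test vectors well, i.e. the utterance matches the distribution the model has captured.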
False Acceptance (FA) and False Rejection (FR) are the two errors that are used in evaluating a speaker verification system. The tradeoff between FA and FR is a function of the decision threshold. The Equal Error Rate (EER) is the value at which the error rates of FA and FR are equal [107]. A weighted sum of the error rates of FA and FR is known as the Detection Cost Function (DCF) and is given by DCF = 0.99 F_a + 0.01 F_r [1] [57]. The percentage values of false acceptance (F_a) and false rejection (F_r) are chosen using a threshold such that the cost function is minimized. For the different testing conditions, the performance of the AANN-based speaker verification system measured in terms of EER and DCF is shown in Table 4.1 under the column base-line. The EER values of 15.07%, 31.49% and 42.57% are for the matched, channel mismatch and handset mismatch conditions, respectively. The high values of the EER reflect the overlapping distributions of the genuine and impostor claimant scores shown in Fig.4.2.

Fig. 4.2: Distributions of the impostor scores and of the genuine scores for the matched, channel mismatch and handset mismatch conditions.

The score obtained by a claimant model is affected by the linguistic content of the test utterance. Hence, the genuine claimant score of one test utterance may lie within the distribution of the impostor claimant scores of another test utterance. To overcome this problem, a relative (normalized) claimant score can be derived to accept or reject the claim of the speaker [108] [109] [84] [65].
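The EER and the cost function DCF = 0.99 F_a + 0.01 F_r can be computed by sweeping the decision threshold over the claimant scores. A sketch, assuming lower scores indicate the genuine speaker (consistent with the score definition above) and using synthetic score distributions in place of real trial scores:

```python
import numpy as np

def eer_and_dcf(genuine, impostor):
    """Sweep a threshold over all scores; a claim is accepted when score < t.
    Returns (EER in %, minimum DCF) with DCF = 0.99*Fa + 0.01*Fr,
    Fa and Fr expressed in percent as in the text."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    fa = np.array([(impostor < t).mean() for t in thresholds]) * 100
    fr = np.array([(genuine >= t).mean() for t in thresholds]) * 100
    i = np.argmin(np.abs(fa - fr))          # point where Fa and Fr cross
    eer = (fa[i] + fr[i]) / 2
    dcf = (0.99 * fa + 0.01 * fr).min()
    return eer, dcf

rng = np.random.default_rng(5)
gen = rng.normal(0.05, 0.01, 200)    # genuine claimants: lower scores
imp = rng.normal(0.09, 0.01, 2000)   # impostor claimants: higher scores
eer, dcf = eer_and_dcf(gen, imp)
print(eer, dcf)
```

The heavy 0.99 weight on F_a in the DCF pushes the minimizing threshold toward very few false acceptances, which is why the DCF threshold generally differs from the EER threshold.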
Table 4.1: Performance of the speaker verification system for 230 speakers. Here BG denotes the background model.

Environment Between       EER                      DCF
Training and Testing      base-line   using BG     base-line   using BG
Fig. 4.3: Distributions of the impostor scores and of the genuine scores, after background normalization, for the matched, channel mismatch and handset mismatch conditions (x-axis: normalized score).
For the different test conditions, the performance of the AANN-based speaker verification system measured in terms of EER and DCF is shown in Table 4.1 under the column using BG. From Table 4.1, we observe that the background normalization of the claimant scores improves the performance of the speaker verification system. The normalization of the claimant scores using a UBM is one way of deriving a relative claimant score. The purpose of this study is mainly to show the importance of a background model for the speaker verification task. Using the background model, an EER of about 10% is observed for matched conditions between the training and testing data. We also observe that the performance of the speaker verification system degrades for mismatch conditions. The degradation in performance under mismatch conditions is due to the characteristics of the channels and handsets. The effect of mismatch between training and testing data on the performance of the speaker models can be observed from the overlapping distributions of the genuine and impostor claimant scores shown in Fig.4.3. The area of the overlapping region increases for mismatch conditions, particularly for the handset mismatch condition.
In Table 4.2, the performance of one of the best GMM-based speaker verification systems using the same database is compared with that of our AANN-based speaker verification system. The results of the GMM-based Oregon Graduate Institute (OGI) speaker verification system are taken from [11]. In the GMM-based approach, speaker models are built with 256 mixture components using 38-dimensional vectors. Each speaker model is derived from a universal background model, which is also a 256 mixture component GMM trained with the feature vectors of 80 speakers. A 38-dimensional vector used for building a GMM consists of 19 mel-cepstral coefficients and 19 delta coefficients. The mel-cepstral coefficients are obtained from the logarithmic filter bank energies whose time trajectories are smoothed over long (1 second) segments using data-driven filters [69]. Since the AANN captures the distribution of the feature vectors of a speaker, it is possible to improve the performance of the AANN-based speaker verification system to match or exceed that of a GMM-based speaker verification system. This issue is addressed in the next chapter.
4.7 SUMMARY

In this chapter, the distribution capturing ability of five layer AANN models is exploited to build a speaker verification system for conversational telephone speech. A five layer AANN model is used to build each speaker model. The performance of the AANN-based speaker verification system is evaluated for 230 speakers. We have observed that the use of a background model improves the performance of the speaker verification system. We have also observed that the performance of the speaker verification system degrades for mismatch conditions between the training and testing data. The performance of an AANN-based speaker verification system may be enhanced by optimizing the parameters related to the background model and the structure of the AANN.
CHAPTER 5

5.1 INTRODUCTION

The significance of a background model for the speaker verification task is understood better by studying the small set of claimant scores given in Table 5.1. The scores obtained by the claimant models for a given test utterance indicate that the genuine claimant score is comparatively lower than the impostor claimant scores (see row 1 of Table 5.1).
Table 5.1: A set of 11 different claimant scores obtained for different test utterances. Scores of the genuine claimant models are underlined. Rows 1, 2 and 3 are for the matched conditions of the genuine claimant. Row 4 is for the mismatch condition of the genuine claimant.

Data   Scores of the claimant models
code   1     2     3     4     5     6     7     8     9     10    11
eaaa   0.053 0.099 0.086 0.092 0.085 0.097 0.094 0.107 0.111 0.104 0.096
eaae   0.096 0.049 0.088 0.060 0.082 0.058 0.070 0.086 0.067 0.065 0.073
eabt   0.067 0.061 0.060 0.050 0.050 0.034 0.047 0.051 0.058 0.047 0.064
efcq   0.076 0.086 0.111 0.096 0.068 0.076 0.120 0.101 0.082 0.122 0.095
But the low score of the genuine claimant of one test utterance may be higher than some of the impostor claimant scores of another test utterance (see rows 1 and 3 of Table 5.1). Thus, normalization of the claimant scores across the test utterances is necessary to measure the performance of speaker verification systems in terms of EER. The problem of normalization of claimant scores is addressed in the literature using background models [108] [65] [86]. We also observe that the performance of the speaker model is poor under degradation of the input, such as mismatch conditions, as seen in row 4, where the genuine claimant score is higher than some of the impostor scores.
As an alternative to the UBM, we propose an approach based on individual background models. This approach makes use of the hypothesis that, for a given test utterance, the genuine claimant may have a low score compared to the other claimant scores. To verify a claim, the decision should be based only on the specified test utterance and the claimant model; it is difficult to use information about the other claimant models for the same test utterance. Therefore, pseudo-claimant models are suggested for normalization. These pseudo-claimant models are known as individual background models (IBM).

The speaker models of the evaluation data are generated independently of the pseudo-claimant models. In the testing phase, the feature vectors of the test utterance are given to the claimant model and to the pseudo-claimant models. The scores of all the pseudo-claimant models are sorted in ascending order. The rank of the claimant model is defined as the position at which the claimant score can be inserted in the sorted list of pseudo-claimant scores. This rank (R) is converted into a normalized score (S_n) using S_n = (β + 1)/R, where β denotes the population of the IBM. The use of this formula converts the rank of the claimant model into some form of confidence. The advantages of this simple approach are the following: (1) The population of the IBM is the only parameter to be chosen a priori, unlike in the case of the UBM. (2) The pseudo-claimant models can be chosen arbitrarily. (3) The normalized claimant score lies in the range [1, β + 1], thus normalizing the scores across the test utterances. (4) Heuristics such as the best, second best, etc., can be applied to the rank obtained by the claimant score to accept or reject the claim [111]. A performance comparison of the AANN-based speaker verification systems using IBM and UBM is shown in Table 5.2. For matched conditions, the IBM approach gave a relative reduction of 24.91% in EER. For the channel and handset mismatch conditions, the performance of the IBM is similar to that of the UBM. The effects of handset mismatch between training and testing data can be compensated to some extent using a handset-dependent background model, as described in the next section.
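The rank-based normalization can be sketched in a few lines (β is used here, as above, for the population of the IBM; lower scores are better, matching the AANN score definition):

```python
import numpy as np

def rank_normalized_score(claimant_score, pseudo_scores):
    """Convert the claimant score into S_n = (beta + 1) / R, where beta is
    the population of the IBM and R is the position at which the claimant
    score would be inserted in the ascending list of pseudo-claimant scores."""
    beta = len(pseudo_scores)
    rank = int(np.searchsorted(np.sort(pseudo_scores), claimant_score)) + 1
    return (beta + 1) / rank

pseudo = [0.08, 0.07, 0.09, 0.10, 0.06]      # 5 pseudo-claimant models
print(rank_normalized_score(0.05, pseudo))   # best rank R=1  -> S_n = 6.0
print(rank_normalized_score(0.20, pseudo))   # worst rank R=6 -> S_n = 1.0
```

S_n thus ranges over [1, β + 1]: a claimant that beats every pseudo-claimant gets the maximum confidence, and one that loses to all of them gets 1.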
Figure: (a) EER and (b) DCF of the speaker verification system as a function of the population of the IBM, for the matched, channel mismatch and handset mismatch conditions.
where μ and σ are the mean and variance of the scores of the pseudo-claimant models. This normalization may be helpful in reducing the rate of false acceptance, and hence may minimize the value of the DCF.
The structure of the AANN model plays an important role in capturing the distribution of the given data. While the hidden units in the mapping and demapping layers are responsible for capturing a nonlinear subspace, the units in the dimension compression layer reduce the dimension of the data. In other words, the number of units in the compression layer determines the number of components captured by the network. The structure of the AANN model used in the previous studies is 19L 38N 14N 38N 19L. Feature vectors extracted from the speech signal are projected onto the subspace spanned by K = 14 components to realize them at the output layer. The effect of changing the number (K) of these components on the performance of the speaker verification system is examined in this section.
Table 5.3: Performance of the speaker verification system measured in EER for different values of K (number of units in the dimension compression hidden layer).

Environment Between     K=14    K=10    K=8     K=6     K=4     K=3     K=2     K=1
Training and Testing
Matched                 8.07%   6.48%   6.45%   6.73%   6.69%   8.34%   10.45%  14.67%
Channel Mismatch        28.50%  21.38%  22.27%  19.31%  18.70%  20.00%  20.18%  24.01%
Handset Mismatch        37.44%  34.43%  30.53%  31.71%  30.26%  28.65%  29.47%  31.36%
to noisy characteristics of the channel and handset. This may be the reason for the improvement in the performance of the speaker verification system as the number of units in the compression layer is reduced from K = 14 to 4. The graceful degradation in the performance of the speaker verification system with still fewer (K < 4) components indicates that there may not be a rigid boundary between the components representing speech and speaker characteristics. If we assume that the feature vectors form clusters representing phonemes, then the speaker characteristics are reflected in the way the phoneme clusters are distributed. The performance of the speaker verification system using even one hidden unit (K = 1) in the compression layer indicates that the 1-D curve passing through the distribution of the speaker's feature vectors is significant enough to provide discrimination among speakers. It is important to note that the AANN models capture these 1-D curves in the higher dimensional space in a nonparametric mode. The results in Table 5.3 are obtained by using a handset-dependent IBM.
Due to the channel and handset effects, there will be a shift in the distribution of the training and testing data. Thus, in mismatch conditions a trained model may give a large error for the test data of the same speaker. The rejection of a genuine claim due to channel or handset mismatch can be avoided either by suitable normalization of the claimant score or by using a set of features less affected by the channel and handset characteristics. In this section, we propose a method of normalizing the score obtained by the AANN model. This method relies on the assumption that the shift in the distribution of the test data of the same speaker may not be significant enough to label it as an impostor utterance. We show that this method yields a significant improvement in the performance of the AANN-based speaker verification system.
Let λ denote a speaker model and I_i denote the score obtained by the model for an utterance (i) which does not belong to the speaker. Let Ī denote the mean of the I_i:

Ī = (1/l) Σ_{i=1}^{l} I_i,

where l is the number of other speakers. Let S be the score obtained by the model for a given test utterance. We define the normalized score as

N = S / Ī = S / ((1/l) Σ_{i=1}^{l} I_i).
The normalized score indicates the closeness of S to Ī. In a mismatch condition, the value of S may be large enough to reject the genuine speaker. But the value of S in these cases may not be too close to the Ī obtained by the same model, and hence the value of N may be a better measure for accepting or rejecting the claim. Using a set of utterances from 25 speakers of the NIST-97 database (the duration of each utterance is half a minute), this normalization procedure is applied to the scores of the claimant and pseudo-claimant models. The performance of the speaker verification system improves significantly, as shown in Table 5.4. The set of 25 speakers' data used for normalization is randomly selected from the NIST-97 database. The use of the NIST-97 database ensures that these utterances do not belong to any of the claimant or pseudo-claimant models. The improved performance of the speaker verification system supports our conjecture that the shift in the distribution of the feature vectors of the test data of a genuine speaker may not be large enough to appear as impostor data to the same speaker model. The score normalization described in this section is applied in all the studies reported hereafter.
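The normalization N = S / Ī can be sketched as follows. The toy model and data below are stand-ins purely to exercise the functions; in the thesis, S comes from a trained AANN and the I_i from the NIST-97 normalization utterances:

```python
import numpy as np

def score(model, features):
    """Mean of ||x - model(x)||^2 / ||x||^2 over the feature vectors."""
    return float(np.mean([np.sum((x - model(x)) ** 2) / np.sum(x ** 2)
                          for x in features]))

def normalized_score(model, test_features, other_utterances):
    """N = S / I_bar: the model's score on the test utterance divided by the
    mean score of the same model on utterances not belonging to the speaker."""
    S = score(model, test_features)
    I_bar = np.mean([score(model, u) for u in other_utterances])
    return S / I_bar

# toy model that reconstructs vectors near a stored speaker mean well
rng = np.random.default_rng(6)
mean_vec = rng.standard_normal(19)
model = lambda x: 0.9 * x + 0.1 * mean_vec        # stand-in for a trained AANN

genuine = mean_vec + 0.1 * rng.standard_normal((50, 19))
others = [rng.standard_normal((50, 19)) for _ in range(25)]
N = normalized_score(model, genuine, others)
print(N)   # < 1: the genuine utterance scores well below the model's typical error
```

A genuine test utterance should give N well below 1 even when channel effects inflate S, since Ī from the same model rises too.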
Table 5.4: Performance of the speaker verification system with AANN models of structure 19L 38N 4N 38N 19L using S and N.

Environment Between Training and Testing     S         N
Matched                                      6.69%     5.81%
Channel Mismatch                             18.70%    14.94%
Handset Mismatch                             30.26%    23.05%
NIST
ondu
ts an annual speaker re
ognition evaluation. Most of the parti
ipants in
these evaluations use GMM-based speaker veri
ation system in whi
h the time se-
quen
es of the feature ve
tors are pro
essed using RASTA-like lters [67℄ [68℄ [69℄. The
ee
t of su
h te
hnique on the performan
e of the AANN-based speaker veri
ation
system is examined in this se
tion.
It may be recalled that during feature extraction a windowing function is used to segment the speech signal into frames of short duration. An all-pole model is then applied to each windowed signal to obtain 16 predictor coefficients. These coefficients are converted into 19 cepstral coefficients to represent the log magnitude spectrum of the speech segment. Thus, an entire signal is represented by a set of time sequences of cepstral coefficients. These time sequences carry information about the phonetic content, speaker characteristics, acoustic distortion and noise. They also carry certain estimation errors, for the following reasons: (1) spectral information based on finite data involves a certain random estimation error; (2) the relative positioning of each frame with respect to pitch periods introduces an estimation error [114]. It has been shown that suitable processing of the time sequences of the spectral parameters improves the performance of speaker verification [68].
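The front end described here can be sketched as follows. The Hamming window and the frame sizes in the usage line are illustrative choices rather than values fixed by the text; each windowed frame would then yield 16 predictor coefficients, converted to 19 cepstral coefficients as in Appendix B:

```python
import numpy as np

def frame_signal(x, frame_size, frame_shift):
    """Segment a speech signal into overlapping windowed frames.

    A Hamming window (an assumed choice) is applied to each frame before
    all-pole modelling, so that every frame contributes one feature vector.
    """
    n_frames = 1 + max(0, (len(x) - frame_size) // frame_shift)
    window = np.hamming(frame_size)
    return np.stack([x[i * frame_shift : i * frame_shift + frame_size] * window
                     for i in range(n_frames)])

# e.g. 25 ms frames with a 10 ms shift at an 8 kHz sampling rate
frames = frame_signal(np.zeros(4000), frame_size=200, frame_shift=80)
```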
The significant variations in the time sequences of the spectral parameters are due to the phonetic content, while the smallest variations may be due to the channel and handset. Thus the time sequences of the spectral parameters are passed through a filter to eliminate some of these unwanted variations [7]. In order to study the effect of such a technique in the AANN-based speaker verification system, the time sequences of the cepstral coefficients are processed: the cepstral trajectories are passed through a lowpass filter with a cutoff frequency of 4 Hz. Note that the filter is applied to each of the trajectories after subtracting the mean value. The performance of the AANN-based speaker verification system is shown in Table 5.5. We observe that the temporal processing of the cepstral coefficients improves the performance of the AANN-based speaker verification system to some extent. The idea of temporal processing is extended further by considering the transitional spectral information. Delta cepstral coefficients are computed from the filtered cepstral trajectories [53]. These delta cepstral coefficients are appended to the 19 static cepstral coefficients to form a 38-dimensional feature vector. The performance of the AANN-based speaker verification system using 38-dimensional feature vectors is shown in Table 5.6. The structure of the AANN used in this experiment is 38L 48N 12N 48N 38L.
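The temporal processing just described can be sketched as follows. The moving-average lowpass filter stands in for the 4 Hz-cutoff filter of the text (at roughly 100 frames/s, a 25-frame window has comparable bandwidth), and the regression window for the deltas is an assumed choice:

```python
import numpy as np

def filter_and_delta(cepstra, smooth_win=25, delta_win=2):
    """Temporal processing of cepstral trajectories (sketch).

    cepstra: (T, D) array, one D-dimensional cepstral vector per frame.
    Each of the D trajectories is mean-subtracted and lowpass filtered,
    then delta coefficients are appended, so 19-D static vectors become
    38-D vectors.
    """
    c = cepstra - cepstra.mean(axis=0)                  # subtract mean value
    kernel = np.ones(smooth_win) / smooth_win           # moving-average lowpass
    smooth = np.column_stack(
        [np.convolve(c[:, d], kernel, mode="same") for d in range(c.shape[1])])
    # regression-based deltas; np.roll wraps at the edges (fine for a sketch)
    num = sum(k * (np.roll(smooth, -k, axis=0) - np.roll(smooth, k, axis=0))
              for k in range(1, delta_win + 1))
    den = 2 * sum(k * k for k in range(1, delta_win + 1))
    return np.hstack([smooth, num / den])
```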
Table 5.5: Performance of the speaker verification system before and after filtering the cepstral trajectories.

    Environment between           EER
    training and testing      After filtering    Before filtering
The techniques discussed so far to enhance the performance of the speaker verification system, developed for 230 speakers of the NIST-99 database, were applied on the evaluation database of
Table 5.6: Performance comparison of the speaker verification system using 38-D feature vectors (cepstral and delta cepstral coefficients) and 19-D feature vectors (static cepstral coefficients).

    Environment between           EER
    training and testing      38-D feature    19-D feature
NIST-2000. The evaluation database of NIST-2000 consists of 1000 speakers (500 male and 500 female) and 6000 test utterances, with 11 claimants for each test utterance [1]. The performance of our AANN-based system and OGI's GMM-based system, as evaluated by NIST for channel and handset mismatch conditions, is shown in Table 5.7. The table shows that the AANN-based system needs to be refined further to match the performance of the GMM-based system. The enhancement may be in the choice of the feature vectors and model parameters.
Table 5.7: Performance comparison of the AANN-based speaker verification system with the GMM-based speaker verification system on the database of 1000 speakers. These results are taken from [1].

    Environment between           EER
    training and testing       AANN      GMM
5.6 ONLINE IMPLEMENTATION OF SPEAKER VERIFICATION SYSTEM
5.7 SUMMARY
CHAPTER 6
In this thesis, an attempt has been made to explore AANN models as an alternative to GMM for text-independent speaker verification. A three-layer AANN model with linear hidden units in the compression layer captures a linear subspace. But if a nonlinear activation function such as tanh(·) is used in the hidden units, the network clusters the input data in the linear subspace. This clustering behavior can be attributed to the thresholding and saturation properties of the nonlinear activation function. A five-layer AANN model with nonlinear units in the hidden layers clusters the input data in a nonlinear subspace. The distribution capturing ability of three- and five-layer AANN models has been studied using a probability surface derived from the training error surface captured by the network in the input feature space. The probability surface captured by a five-layer AANN model can be viewed as nonparametric modeling of the input data distribution. This property of a five-layer AANN model has been exploited to develop a speaker verification system for conversational telephone speech.
In this work, linear prediction cepstral coefficients are used as feature vectors. It is necessary to study the effectiveness of various parametric representations of speech for the purpose of speaker verification.
It is well known that suprasegmental features such as intonation and duration patterns carry significant speaker characteristics and are also robust to channel variations. Therefore, methods have to be found to incorporate the knowledge of these features into a speaker verification system in the AANN framework.
APPENDIX A
The algorithm used in this study to detect the speech frames is based on the amplitude of the speech signal in the time domain. It also assumes a Gaussian distribution for the amplitudes. The speech signal is blocked into frames using the specified framesize and frameshift. The maximum positive amplitude in each frame is determined. The sum of the mean and a fraction (1/10) of the standard deviation of these positive amplitudes is considered as the maximum amplitude value in the speech signal. Ten percent of this maximum amplitude is taken as the threshold for a frame to be considered a speech frame. There is also a condition that at least 10% of the frames should be nonspeech frames. Hence, when the number of nonspeech frames is less than this percentage of the total number of frames, the threshold is progressively increased until the minimum specified number of frames is obtained. Finally, the frames above the threshold are given a weightage of 1, whereas the remaining frames are given a weightage of 0.
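The steps above translate almost directly into code. The factor by which the threshold is raised in each pass is not specified in the text and is an assumption here:

```python
import numpy as np

def speech_frame_weights(signal, frame_size, frame_shift):
    """Amplitude-based speech/nonspeech detection described above.

    Returns a 0/1 weight per frame. The peak positive amplitude of each
    frame is found; the mean plus one-tenth of the standard deviation of
    these peaks estimates the maximum amplitude of the signal, and 10% of
    that estimate is the speech threshold. If fewer than 10% of the frames
    fall below the threshold, the threshold is raised iteratively.
    """
    n = 1 + max(0, (len(signal) - frame_size) // frame_shift)
    peaks = np.array([signal[i*frame_shift : i*frame_shift + frame_size].max()
                      for i in range(n)])
    max_amp = peaks.mean() + peaks.std() / 10.0
    threshold = 0.1 * max_amp
    while (peaks <= threshold).sum() < 0.1 * n:
        threshold *= 1.1               # step size is an assumption
    return (peaks > threshold).astype(int)
```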
APPENDIX B
In LP analysis of speech, an all-pole model is assumed for the system producing the speech signal s(n). A pth-order all-pole model assumes that the sample value at time n can be approximated by a linear combination of the past p samples, i.e.,

    s(n) \approx \sum_{k=1}^{p} a_k s(n-k)                                  (A.1)

If \hat{s}(n) denotes the prediction made by the all-pole model, then the prediction error is given by

    e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)             (A.2)

For a speech frame of size m samples, the mean square of the prediction error over the whole frame is given by

    E = \sum_m e^2(m) = \sum_m \Big[ s(m) - \sum_{k=1}^{p} a_k s(m-k) \Big]^2    (A.3)

The optimal predictor coefficients minimize this mean square error. At the minimum value of E,

    \frac{\partial E}{\partial a_k} = 0, \qquad k = 1, 2, \ldots, p         (A.4)

Differentiating Eqn. A.3 and equating to zero, we get

    R a = r                                                                 (A.5)

where a = [a_1 \; a_2 \; \ldots \; a_p]^T, r = [r(1) \; r(2) \; \ldots \; r(p)]^T, and R is the Toeplitz symmetric autocorrelation matrix.
Solving Eqn. A.5 gives the optimal predictor coefficients. The cepstral coefficients c_m are then obtained from the predictor coefficients through the recursion

    c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m} c_k a_{m-k}, \qquad 1 \le m \le p          (A.14)

    c_m = \sum_{k=1}^{m-1} \frac{k}{m} c_k a_{m-k}, \qquad m > p \;\; (a_k = 0 \text{ for } k > p)    (A.15)
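As a numerical sketch of Appendix B, the normal equations R a = r (Eqn. A.5) can be solved with the autocorrelation method and the resulting predictor coefficients converted to cepstral coefficients with the recursion of Eqns. A.14 and A.15. The 16 predictor and 19 cepstral coefficients match the counts used in Chapter 5; the direct linear solve, rather than the Levinson-Durbin recursion usually preferred, is an illustrative shortcut:

```python
import numpy as np

def lp_coefficients(frame, p):
    """Solve the normal equations R a = r (Eqn. A.5) for the optimal
    predictor coefficients, using the autocorrelation method."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0..N-1
    R = np.array([[ac[abs(i - j)] for j in range(p)] for i in range(p)])
    r = ac[1:p + 1]
    return np.linalg.solve(R, r)

def lp_to_cepstrum(a, n_ceps):
    """Cepstral coefficients from predictor coefficients via the
    recursion of Eqns. A.14 and A.15."""
    p = len(a)
    c = np.zeros(n_ceps + 1)           # c[1..n_ceps]; c[0] unused
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if m - k <= p:             # a_k = 0 for k > p
                acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]

frame = np.random.default_rng(1).standard_normal(400)
a = lp_coefficients(frame, 16)         # 16 predictor coefficients
ceps = lp_to_cepstrum(a, 19)           # 19 cepstral coefficients
```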
BIBLIOGRAPHY
[1] NIST, "Speaker recognition workshop notebook," Proc. NIST 2000 Speaker Recognition Workshop, University of Maryland, USA, Jun. 26-27, 2000.
[2] G. R. Doddington, "Speaker recognition - identifying people by their voices," Proc. IEEE, vol. 73, pp. 1651-1664, Nov. 1985.
[3] C. M. Bishop, Neural Networks for Pattern Recognition. New York: Oxford University Press Inc., 1995.
[4] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: John Wiley & Sons, Inc., 1973.
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Statist. Soc. Ser. B (methodological), vol. 39, pp. 1-38, 1977.
[6] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, pp. 4-37, Jan. 2000.
[7] S. van Vuuren, Speaker Recognition in a Time-Frequency Space. PhD dissertation, Oregon Graduate Institute of Science and Technology, Department of Electrical and Computer Engg., Portland, Mar. 1999.
[8] A. L. Higgins, L. Bahler, and J. Porter, "Voice identification using nonparametric density matching," in Automatic Speech and Speaker Recognition (C.-H. Lee, F. K. Soong, and K. K. Paliwal, eds.), ch. 9, pp. 211-232, Boston: Kluwer Academic, 1996.
[9] B. Yegnanarayana, S. P. Kishore, and A. V. N. S. Anjani, "Neural network models for capturing probability distribution of training data," in Int. Conference on Cognitive and Neural Systems, (Boston), p. 6 (A), 2000.
[10] D. A. Reynolds et al., "The effects of telephone transmission degradations on speaker recognition performance," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 329-332, 1995.
[11] NIST, "Speaker recognition workshop notebook," Proc. NIST 1999 Speaker Recognition Workshop, University of Maryland, USA, Jun. 3-4, 1999.
[12] D. A. Reynolds, "The effects of handset variability on speaker recognition performance: Experiments on the Switchboard corpus," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 113-116, 1996.
[13] A. E. Rosenberg, "Automatic speaker verification: A review," Proc. IEEE, vol. 64, pp. 475-487, Apr. 1976.
[14] B. S. Atal, "Automatic recognition of speakers from their voices," Proc. IEEE, vol. 64, pp. 460-475, Apr. 1976.
[15] D. O'Shaughnessy, "Speaker recognition," IEEE ASSP Magazine, pp. 4-17, 1986.
[16] A. Sutherland and M. Jack, "Speaker verification," in Aspects of Speech Technology (M. A. Jack and J. Laver, eds.), ch. 4, pp. 184-215, Edinburgh: University Press, 1988.
[17] J. M. Naik, "Speaker verification: A tutorial," IEEE Communications Magazine, pp. 42-48, Jan. 1990.
[18] A. E. Rosenberg and F. K. Soong, "Recent research in automatic speaker recognition," in Advances in Speech Signal Processing (S. Furui and M. M. Sondhi, eds.), ch. 3, pp. 701-740, New York: Marcel Dekker, 1992.
[19] H. Gish and M. Schmidt, "Text-independent speaker identification," IEEE Signal Processing Magazine, pp. 18-32, Oct. 1994.
[20] R. J. Mammone, X. Zhang, and R. P. Ramachandran, "Robust speaker recognition: A feature-based approach," IEEE Signal Processing Magazine, vol. 13, pp. 58-71, Sept. 1996.
[21] S. Furui, "An overview of speaker recognition technology," in Automatic Speech and Speaker Recognition (C.-H. Lee, F. K. Soong, and K. K. Paliwal, eds.), ch. 2, pp. 31-56, Boston: Kluwer Academic, 1996.
[22] S. Furui, "Recent advances in speaker recognition," Pattern Recognition Lett., vol. 18, pp. 859-872, 1997.
[23] J. P. Campbell, "Speaker recognition: A tutorial," Proc. IEEE, vol. 85, pp. 1436-1462, Sept. 1997.
[24] H. Sakoe, "Two-level DP-matching - a dynamic programming based pattern matching algorithm for connected word recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 588-595, 1979.
[25] V. N. Sorovin, "Determination of vocal tract shape for vowels," Speech Comm., vol. 11, pp. 71-85, 1992.
[26] V. N. Sorovin, "Inverse problems for fricatives," Speech Comm., vol. 14, pp. 249-262, 1994.
[27] J. J. Wolf, "Efficient acoustic parameters for speaker recognition," J. Acoust. Soc. Amer., vol. 52, no. 6, pp. 2044-2056, 1972.
[28] M. M. Sondhi, "New methods of pitch extraction," IEEE Trans. Acoust., Speech, Signal Processing, vol. AU-16, pp. 262-266, 1968.
[29] A. M. Noll, "Cepstrum pitch determination," J. Acoust. Soc. Amer., vol. 41, no. 2, pp. 293-309, 1970.
[30] T. V. Ananthapadmanabha and B. Yegnanarayana, "Epoch extraction of voiced speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 562-570, 1975.
[31] W. Hess, Pitch Determination of Speech Signals: Algorithms and Devices. Springer-Verlag, 1983.
[32] B. S. Atal, "Automatic speaker recognition based on pitch contours," J. Acoust. Soc. Amer., vol. 52, no. 6, pp. 1687-1697, 1972.
[33] B. Yegnanarayana, B. Madhukumar, and V. Ramachandran, "Robust features for application in speech and speaker recognition," in Proc. ESCA-ETRW on Speech Processing in Adverse Conditions, (Cannes), 1992.
[34] J. D. Markel, B. T. Oshika, and A. H. Gray, "Long-term feature averaging for speaker recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 25, pp. 330-337, Aug. 1977.
[35] J. E. Luck, "Automatic speaker verification using cepstral measurements," J. Acoust. Soc. Amer., vol. 46, no. 4, pp. 1026-1032, 1969.
[36] T. Matsui and S. Furui, "Speaker clustering for speech recognition using the parameters characterizing vocal-tract dimensions," in Proceedings of Int. Conf. Spoken Language Processing, pp. 137-140, 1990.
[37] H. M. Dante and V. V. S. Sharma, "Automatic speaker recognition for a large population," IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 255-263, Jun. 1979.
[38] B. Yegnanarayana, "Formant extraction from linear prediction phase," J. Acoust. Soc. Amer., vol. 63, pp. 1638-1640, 1978.
[39] H. A. Murthy, K. V. Madhu Murthy, and B. Yegnanarayana, "Formant extraction from phase using weighted group delay functions," Electron. Lett., vol. 25, pp. 1609-1611, 1989.
[40] S. K. Das and W. S. Mohn, "A scheme for speech processing in automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. AU-19, pp. 32-43, 1971.
[41] G. Doddington, "A method of speaker verification," J. Acoust. Soc. Amer., vol. 49, p. 139 (A), 1971.
[42] J. W. Glenn and N. Kleiner, "Speaker identification based on nasal phonation," J. Acoust. Soc. Amer., vol. 43, no. 2, pp. 368-372, 1968.
[43] L. S. Su, K. P. Li, and K. S. Fu, "Identification of speakers by the use of nasal coarticulation," J. Acoust. Soc. Amer., vol. 56, pp. 1876-1882, 1974.
[44] R. C. Lummis, "Speaker verification by computer using speech intensity for temporal registration," IEEE Trans. Acoust., Speech, Signal Processing, vol. AU-21, no. 2, pp. 80-89, 1973.
[45] E. Bunge, "Automatic speaker recognition system AUROS for security systems and forensic voice identification," in Proc. 1977 Int. Conf. on Crime Countermeasures, pp. 1-7, 1977.
[46] S. Furui, F. Itakura, and S. Saito, "Talker recognition by long-time averaged speech spectrum," Electron. Commun. Jap., vol. 55-A, pp. 54-61, 1972.
[47] H. Hollien and W. Majewski, "Speaker identification by long-term spectra under normal conditions," J. Acoust. Soc. Amer., vol. 62, no. 4, pp. 975-980, 1977.
[48] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, vol. 63, pp. 561-580, Apr. 1975.
[49] B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," J. Acoust. Soc. Amer., vol. 55, pp. 1304-1312, Jun. 1974.
[50] A. E. Rosenberg and M. Sambur, "New techniques for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 23, no. 2, pp. 169-175, 1975.
[51] M. R. Sambur, "Speaker recognition using orthogonal linear prediction," IEEE Trans. Acoust., Speech, Signal Processing, vol. 24, pp. 283-289, Aug. 1976.
[52] J. Naik and G. R. Doddington, "High performance speaker verification using principal spectral components," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 881-884, 1986.
[53] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 29, pp. 254-272, Apr. 1981.
[54] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, 1993.
[55] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Prentice-Hall, 1978.
[56] M. M. Homayounpour and G. Chollet, "A comparison of some relevant parametric representations for speaker verification," in ESCA Workshop on Speaker Recognition, Identification, and Verification, pp. 185-188, Apr. 1994.
[57] G. R. Doddington, M. A. Przybocki, A. F. Martin, and D. A. Reynolds, "The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective," Speech Comm., vol. 31, pp. 225-254, Jun. 2000.
[58] H. Wakita, "Residual energy of linear prediction applied to vowel and speaker recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 24, pp. 270-271, 1976.
[59] P. Thevenaz and H. Hugli, "Usefulness of the LPC-residue in text-independent speaker verification," Speech Comm., vol. 17, pp. 145-157, 1995.
[60] C. Liu, M. Lin, W. Wang, and H. Wang, "Study of line-spectrum pair frequencies for speaker recognition," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 277-280, 1990.
[61] H. A. Murthy, F. Beaufays, L. P. Heck, and M. Weintraub, "Robust text-independent speaker identification over telephone channels," IEEE Trans. Acoust., Speech, Signal Processing, vol. 7, pp. 554-568, 1999.
[62] Y. Gong and J. Haton, "Nonlinear vector interpolation for speaker recognition," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, vol. 2, (San Francisco, California, USA), pp. 173-176, Mar. 1992.
[63] H. Hermansky and N. Malayath, "Speaker verification using speaker-specific mapping," in RLA2C, (Avignon, France), Apr. 1998.
[64] H. Misra, M. S. Ikbal, and B. Yegnanarayana, "Spectral mapping as a feature for speaker recognition," in National Conference on Communications (NCC), (IIT Kharagpur), pp. 151-156, Jan. 29-31, 1999.
[65] H. Misra, Development of a Mapping Feature for Speaker Recognition. MS dissertation, Indian Institute of Technology, Department of Electrical Engg., Madras, May 1999.
[66] D. A. Reynolds, "Experimental evaluation of features for robust speaker identification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 2, no. 4, pp. 639-643, 1994.
[67] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. 2, no. 4, pp. 578-579, 1994.
[68] S. van Vuuren and H. Hermansky, "Data driven design of RASTA-like filters," in Eurospeech, (Greece), pp. 409-412, 1997.
[69] N. Malayath, Data-Driven Methods for Extracting Features from Speech. PhD dissertation, Oregon Graduate Institute of Science and Technology, Department of Electrical and Computer Engg., Portland, Jan. 2000.
[70] H. Gish, "Robust discrimination in automatic speaker identification," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 289-292, 1990.
[71] B. H. Juang, "Past, present and future of speech processing," IEEE ASSP Magazine, pp. 24-48, May 1998.
[72] N. Morgan and H. Bourlard, "Continuous speech recognition," IEEE ASSP Magazine, pp. 25-42, May 1995.
[73] W. A. Hargreaves and J. A. Starkweather, "Recognition of speaker identity," Language and Speech, vol. 6, pp. 63-67, 1963.
[74] K. P. Li and G. W. Hughes, "Talker differences as they appear in correlation matrices of continuous speech spectra," J. Acoust. Soc. Amer., vol. 55, no. 4, pp. 833-837, 1974.
[75] A. L. Higgins, L. G. Bahler, and J. E. Porter, "Voice identification using nearest-neighbour distance measure," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, vol. 2, pp. 375-378, 1993.
[76] F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang, "A vector quantization approach to speaker recognition," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 387-390, 1985.
[77] A. E. Rosenberg and F. K. Soong, "Evaluation of a vector quantization talker recognition system in text-independent and text-dependent modes," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 873-876, 1986.
[78] R. E. Helms, Speaker Recognition Using Linear Prediction Codebooks. PhD dissertation, Southern Methodist University, 1981.
[79] E. Dorsey and J. Bernstein, "Inter-speaker comparison of LPC acoustic space using a minmax distortion measure," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 16-19, 1981.
[80] K. P. Li and E. H. Wrench Jr., "An approach to text-independent speaker recognition with short utterances," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 555-558, 1983.
[81] K. Shikano, "Text-independent speaker recognition experiments using codebooks in vector quantization," J. Acoust. Soc. Amer., vol. 77, p. S11 (A), 1985.
[82] J. T. Buck et al., "Text-dependent speaker recognition using vector quantization," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 391-394, 1985.
[83] F. K. Soong and A. E. Rosenberg, "On the use of instantaneous and transitional spectral information in speaker recognition," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 877-890, 1986.
[84] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture models," Speech Comm., vol. 17, pp. 91-108, Aug. 1995.
[85] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood and the EM algorithm," SIAM Review, vol. 26, pp. 195-239, 1984.
[86] D. A. Reynolds, "Comparison of background normalization methods for text-independent speaker verification," in Eurospeech, (Greece), pp. 963-966, 1997.
[87] R. P. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Magazine, vol. 4, pp. 4-22, Apr. 1989.
[88] B. Yegnanarayana, Artificial Neural Networks. New Delhi: Prentice-Hall of India, 1999.
[89] S. Haykin, Neural Networks: A Comprehensive Foundation. New Jersey: Prentice-Hall Inc., 1999.
[90] Y. Bennani and P. Gallinari, "Neural networks for discrimination and modelization of speakers," Speech Comm., vol. 17, pp. 159-175, 1995.
[91] Y. Bennani, "Speaker identification through a modular connectionist architecture: Evaluation on the TIMIT database," in ICSLP, pp. 607-610, 1992.
[92] J. Oglesby and J. S. Mason, "Optimisation of neural models for speaker identification," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 261-264, 1990.
[93] M. Gori and F. Scarselli, "Are multilayer perceptrons adequate for pattern recognition and verification?" IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, pp. 1121-1132, Nov. 1998.
[94] J. Oglesby and J. S. Mason, "Radial basis function networks for speaker recognition," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 393-396, 1991.
[95] M. S. Ikbal, H. Misra, and B. Yegnanarayana, "Analysis of autoassociative mapping neural networks," in Int. Joint Conf. on Neural Networks, (Washington, USA), 1999.
[96] M. Gori, L. Lastrucci, and G. Soda, "Autoassociator-based models for speaker verification," Pattern Recognition Lett., vol. 17, pp. 241-250, 1996.
[97] M. S. Ikbal, Autoassociative Neural Network Models for Speaker Verification. MS dissertation, Indian Institute of Technology, Department of Computer Science and Engg., Madras, May 1999.
[98] H. Bourlard and Y. Kamp, "Auto-association by multilayer perceptrons and singular value decomposition," Biol. Cybernet., vol. 59, pp. 291-294, 1988.
[99] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications. New York: John Wiley & Sons Inc., 1996.
[100] M. A. Kramer, "Nonlinear principal component analysis using autoassociative neural networks," AIChE, vol. 37, pp. 233-243, Feb. 1991.
[101] M. Bianchini, P. Frasconi, and M. Gori, "Learning in multilayered networks used as autoassociators," IEEE Trans. Neural Networks, vol. 6, pp. 512-515, Mar. 1995.
[102] P. Baldi and K. Hornik, "Neural networks and principal component analysis: Learning from examples without local minima," IEEE Trans. Neural Networks, vol. 2, pp. 53-58, 1989.
[103] E. C. Malthouse, "Limitations of nonlinear PCA as performed with generic neural networks," IEEE Trans. Neural Networks, vol. 9, pp. 165-173, Jan. 1998.
[104] S. P. Kishore and B. Yegnanarayana, "Speaker verification: Minimizing the channel effects using autoassociative neural network models," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, (Istanbul), pp. 1101-1104, 2000.
[105] D. O'Shaughnessy, Speech Communication: Human and Machine. Addison-Wesley, 1987.
[106] M. H. Hassoun, Fundamentals of Artificial Neural Networks. New Delhi: Prentice-Hall of India, 1998.
[107] J. Oglesby, "What's in a number? Moving beyond the equal error rate," Speech Comm., vol. 17, pp. 193-208, Aug. 1995.
[108] A. L. Higgins and R. E. Wohlford, "A method of text-independent speaker recognition," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 869-872, 1986.
[109] A. E. Rosenberg and S. Parthasarathy, "Speaker background models for connected digit password speaker verification," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 81-84, 1996.
[110] R. A. Finan, A. T. Sapeluk, and R. I. Damper, "Imposter cohort selection for score normalization in speaker verification," Pattern Recognition Lett., vol. 18, pp. 881-888, 1997.
[111] S. P. Kishore and B. Yegnanarayana, "Online text-independent speaker verification system at IITM," in Proceedings of Int. Conf. on Multimedia Processing and Systems, (IIT Madras, INDIA), pp. 178-180, Aug. 2000.
[112] S. P. Kishore and B. Yegnanarayana, "Identification of handset type using autoassociative neural network models," in Fourth Int. Conference on Advances in Pattern Recognition and Digital Techniques, (ISI, Calcutta, INDIA), pp. 353-356, Dec. 1999.
[113] L. P. Heck and M. Weintraub, "Handset-dependent background models for robust text-independent speaker recognition," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, (Munich, Germany), Apr. 1997.
[114] C. Nadeu, P. Paches-Leal, and B.-H. Juang, "Filtering the time sequences of spectral parameters for speech recognition," Speech Comm., vol. 22, pp. 315-332, 1997.
LIST OF PUBLICATIONS
REFEREED JOURNALS

1. S. P. Kishore and B. Yegnanarayana, "Speaker verification using autoassociative neural network models," IEEE Trans. Speech and Audio Processing (communicated).

2. B. Yegnanarayana and S. P. Kishore, "AANN - an alternative to GMM for pattern recognition," Neural Networks (communicated).
PRESENTATIONS IN CONFERENCES
1. S. P. Kishore and B. Yegnanarayana, "Identification of handset type using autoassociative neural network models," in Fourth Int. Conference on Advances in Pattern Recognition and Digital Techniques, (ISI, Calcutta, INDIA), pp. 353-356, Dec. 1999.

2. S. P. Kishore and B. Yegnanarayana, "Speaker verification: Minimizing the channel effects using autoassociative neural network models," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, (Istanbul), pp. 1101-1104, 2000.

3. B. Yegnanarayana, S. P. Kishore, and A. V. N. S. Anjani, "Neural network models for capturing probability distribution of training data," in Int. Conference on Cognitive and Neural Systems, (Boston), p. 6 (A), 2000.

4. S. P. Kishore and B. Yegnanarayana, "Online text-independent speaker verification system at IITM," in Proceedings of Int. Conf. on Multimedia Processing and Systems, (IIT Madras, INDIA), pp. 178-180, 2000.

5. B. Yegnanarayana, K. Sharat Reddy and S. P. Kishore, "Source and system features for speaker recognition using AANN models," in Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, 2001.

6. S. P. Kishore, B. Yegnanarayana and Suryakanth V. Gangashetty, "Online text-independent speaker verification system using autoassociative neural network models," in Proceedings of IEEE Int. Joint Conf. on Neural Networks, 2001.
GENERAL TECHNICAL COMMITTEE
BIODATA