Kishore Prahallad MS Thesis

SPEAKER VERIFICATION USING AUTOASSOCIATIVE
NEURAL NETWORK MODELS
A THESIS
submitted by
S. P. KISHORE
for the award of the degree

of
MASTER OF SCIENCE
(by Resear h)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY, MADRAS.
DECEMBER 2000
THESIS CERTIFICATE
This is to ertify that the thesis entitled Speaker Veri ation Using Autoasso ia-
tive Neural Network Models submitted by S. P. Kishore to the Indian Institute
of Te hnology, Madras for the award of the degree of Master of S ien e (by Resear h) is
a bonade re ord of resear h work arried out by him under my supervision. The on-
tents of this thesis, in full or in parts, have not been submitted to any other Institute
or University for the award of any degree or diploma.
Chennai-36 Prof. B. Yegnanarayana

Date: Dept. of Computer S ien e and Engg.
ACKNOWLEDGEMENTS
First and foremost, I am grateful to my supervisor Prof. B. Yegnanarayana and my

seniors of spee h lab for reating su h a stimulating environment for resear h. My long
intera tions with my professor together with Monday meetings have been a onstant
sour e of energy whi h allowed me to stay in IIT. I am indeed fas inated by his never
say no attitude for a dis ussion, and his never ending analysis on graphs, urves and
plots.
I would like to express my sin ere gratitude to Prof. C. Pandurangan, Head of

Dept. of CSE, and Dr. Sunkendu Das for their ooperation and onstant en ourage-
ment. I also would like to thank Prof. Hynek Hermansky of OGI, Portland USA, for
supporting this work and my stay at IIT Madras.
I onsider myself to be lu ky to have wonderful seniors. My intera tions with my

seniors and fa ulty members Dr. C. Chandra Sekhar and Dr. Hema A. Murthy have
been inspirational and joyful. I an never forget my early lessons in Spee h Pro essing
from P. Satyanarayana Murthy, Neural Networks from Manish Sarkar and System Ad-
ministration from S. Rajendran. My immediate seniors Hemant, Ikbal and Mathew are
in fa t are my mentors in shaping my attitude and aptitude of resear h. The personal
ae tion shown by Hemant, Siva and Ikbal has given quite happy and joyful moments
for this Kid. I am also grateful to Sa hin and Naren of OGI, for all the things they
taught me.
I am grateful to Suryakanth V. Gangashetty, without whom it would not have

been possible for me to run NIST 2000. I also would like to express my gratitude
to Prasanna for orre ting my thesis subse tions, se tions, hapters and drafts. The
feedba k given by Prasanna, Suryakanth, Ramesh and Sharat have helped me quite a
lot in rening this thesis.
I would like to thank Jyotsna, Anjani, Kamakshi Prasad, Prasad Reddy, K. Kiran,
Nayeem, Nagarajan, Devaraj, Gupta, Venkatesh, Vinod and Anil for their extended
ooperation during my stay in spee h lab. I am also grateful to my hostel mates P.
Sampath, B. Srikanth and P. Kiran who went beyond their way to help me in the hour
of need.
Words alone annot des ribe the en ouragement, ae tion and love of my appa
Narasimha Murthy, my amma Sumathi Murthy, my brother Prasanth, my sister
Praveena and my un les Ar hak Vyasraj A har & Bro., whose benedi tions have been
with me in ompleting this ourse and through out.
Prahallad Kishore
iii
ABSTRACT
Keywords: speaker veri ation; autoasso iative neural networks; ba kground model;
hannel and handset ee ts; equal error rate.
The obje tive of a speaker veri ation system is to onrm the identity of a per-
son from his/her voi e. Speaker veri ation an be performed in a text-dependent or
text-independent mode. For the text-dependent ase, the feature ve tors are extra ted
from the spee h signal and are stored as referen e templates. For the text-independent
ase, a model is used to apture the distribution of the feature ve tors of a speaker.
This thesis presents an approa h based on Autoasso iative Neural Network (AANN)
models for text-independent speaker veri ation. This approa h an be viewed as an
alternative to the urrent approa hes based on Gaussian Mixture Models (GMM).
Autoasso iative neural network models are feedforward neural networks perform-
ing an identity mapping of the input spa e. In this thesis, the hara teristi s of AANN
models are explained in the perspe tive of apturing the distribution of feature ve tors.
The distribution apturing ability of the AANN models is studied using a probability
surfa e derived from the training error surfa e in the input feature spa e aptured by
the network. We illustrate that a three layer AANN model with nonlinear hidden units
lusters the input data in a linear subspa e, whereas a ve layer AANN model lus-
ters the input data in a nonlinear subspa e. The probability surfa e aptured by a ve
layer AANN model is viewed as nonparametri modeling of the input data distribution.
The property of a ve layer AANN model to apture the distribution of the given
data is exploited to develop a speaker veri ation system. The proposed system is
evaluated on onversational telephone spee h for 230 speakers. Performan e of the
AANN-based speaker veri ation system is improved by addressing three issues: (1)
Normalization pro edure using ba kground model, (2) Stru ture of the AANN model,
and (3) Mismat h between training and testing data. Finally, performan e of our
AANN-based speaker veri ation system is ompared with that of a GMM-based
speaker veri ation system.
v
TABLE OF CONTENTS
Thesis erti ate i

A knowledgements ii
Abstra t iv
List of Tables ix
List of Figures x
Abbreviations 0
1 INTRODUCTION TO SPEAKER RECOGNITION 1
1.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Issues in Speaker Re ognition . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Extra tion of Speaker Information . . . . . . . . . . . . . . . . . 3
1.2.2 Probabilisti Modeling of Speaker Features . . . . . . . . . . . . 3
1.2.3 De ision Logi to Implement Identi ation or Veri ation . . . . 4
1.3 Issues Addressed in the Thesis . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 APPROACHES FOR SPEAKER RECOGNITION 8
2.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Features to Represent Speaker Information . . . . . . . . . . . . . . . . 8
2.2.1 Pit h . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Formants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Long-Term Spe tral Features . . . . . . . . . . . . . . . . . . . 11
2.2.4 Short-Term Spe tral Features . . . . . . . . . . . . . . . . . . . 11
2.2.4.1 Linear Predi tion CoeÆ ients . . . . . . . . . . . . . . 11
2.2.4.2 Cepstral CoeÆ ients . . . . . . . . . . . . . . . . . . . 12
2.2.4.3 Mel-Frequen y Cepstral CoeÆ ients . . . . . . . . . . 13
2.2.4.4 LP Residual Features . . . . . . . . . . . . . . . . . . 13
2.2.4.5 Other Segmental Features . . . . . . . . . . . . . . . . 13
2.3 Models for Speaker Re ognition . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Approa h of Nearest Neighbor . . . . . . . . . . . . . . . . . . . 15
2.3.2 Ve tor Quantization . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Gaussian Mixture Models . . . . . . . . . . . . . . . . . . . . . 16
2.3.4 Arti ial Neural Network Models . . . . . . . . . . . . . . . . . 17
2.4 Need for New Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 CHARACTERISTICS OF AUTOASSOCIATIVE NEURAL NET-
WORK MODELS 20
3.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Prin ipal Component Analysis and AANN Models . . . . . . . . . . . . 21
3.3 Three layer AANN Model . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.1 Linear A tivation Fun tion at the Hidden Units . . . . . . . . . 22
3.3.2 Nonlinear A tivation Fun tion at the Hidden Units . . . . . . . 24
3.4 Five Layer AANN Model . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 AANN-BASED SPEAKER VERIFICATION SYSTEM 30
4.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2 Des ription of Spee h Database . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Feature Extra tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.1 Prepro essing of Spee h Signal . . . . . . . . . . . . . . . . . . . 32
4.3.2 Extra tion of Linear Predi tion Cepstral CoeÆ ients . . . . . . 32
4.3.3 Cepstral Weighting and Mean Subtra tion . . . . . . . . . . . . 33
4.4 Generation of Speaker Models . . . . . . . . . . . . . . . . . . . . . . . 33
4.5 Veri ation Pro edure . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6 Comparison with GMM-Based speaker veri ation system . . . . . . . 40
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
vii
5 PERFORMANCE ENHANCEMENT OF AANN-BASED SPEAKER
VERIFICATION SYSTEM 42
5.1 Introdu tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Signi an e of Ba kground Model . . . . . . . . . . . . . . . . . . . . . 42
5.2.1 Individual Ba kground Model (IBM) . . . . . . . . . . . . . . . 43
5.2.2 Speaker Model and IBM . . . . . . . . . . . . . . . . . . . . . . 44
5.2.3 Handset-Dependent IBM (HIBM) . . . . . . . . . . . . . . . . . 45
5.2.4 Ee t of Population of IBM . . . . . . . . . . . . . . . . . . . . 46
5.3 Stru ture of AANN Model . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.1 Components Captured by the Network for Speaker Veri ation . 47
5.4 S ore Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5 Temporal Pro essing of Feature Ve tors . . . . . . . . . . . . . . . . . . 50
5.6 Online Implementation of Speaker Veri ation System . . . . . . . . . 53
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6 SUMMARY AND CONCLUSIONS 55
6.1 Dire tions for Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 56
Appendix A 57
Appendix B 58
Bibliography 60
List of Publi ations 68
viii
LIST OF TABLES
4.1 Performan e of speaker veri ation system for 230 speakers. Here BG
denotes ba kground model. . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 Performan e omparison of one of the best GMM-based speaker veri-
ation systems with AANN-based speaker veri ation system. . . . . . 40
5.1 A set of 11 dierent laimant s ores obtained for dierent test utter-
an es. S ores of the genuine laimant models are underlined. Rows 1,
2 and 3 are for the mat hed onditions of the genuine laimant. Row 4
is for the mismat h ondition of the genuine laimant. . . . . . . . . . . 43
5.2 Performan e omparison of speaker veri ation system using IBM and
UBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Performan e of speaker veri ation system measured in EER for dier-
ent values of K (number of units in the dimension ompression hidden
layer) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.4 Performan e of speaker veri ation system with AANN models of stru -
ture 19L38N 4N 38N 19L using S and N . . . . . . . . . . . . . . . . . 50
5.5 Performan e of speaker veri ation system before and after ltering the
epstral traje tories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.6 Performan e omparison of speaker veri ation system using 38-D fea-
ture ve tors ( epstral and delta epstral oeÆ ients) and 19-D feature
ve tors (stati epstral oeÆ ients). . . . . . . . . . . . . . . . . . . . . 52
5.7 Performan e omparison of AANN-based speaker veri ation system
with GMM-based speaker veri ation system on the database of 1000
speakers. These results are taken from [1℄. . . . . . . . . . . . . . . . . 52
5.8 Performan e of the online speaker veri ation system . . . . . . . . . . 53
LIST OF FIGURES
1.1 Graphi al sket h of human vo al system. . . . . . . . . . . . . . . . . 1

3.1 (a) 2-D data (A 3-D view is shown). (b) 2-D data shown in (a) is
repeated. ( ) Output of the 3 layer network 2L 1L 2L. (d) Output of
the 3 layer network 2L 1N 2L. (e) Probability surfa e aptured by the
network 2L 1L 2L. (f) Probability surfa e aptured by the network 2L
1N 2L. Here L denote linear unit and N denote nonlinear unit. . . . . 23
3.2 Five layer AANN model. . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 A ve layer network 2L 12N 1N 12N 2L is trained with the 2-D data
shown in Fig.3.1(b) for two dierent sessions. Hypersurfa es are ob-
tained by plotting the output of the network for all the points in the
input spa e. (a) Hypersurfa e (solid lines) aptured by the network in
training session I. The input data used for training is also plotted in the
gure. (b) Probability surfa e aptured by the network in training ses-
sion I. ( ) Hypersurfa e (solid lines) aptured by the network in training
session II. The input data used for training is also plotted in the gure.
(d) Probability surfa e aptured by the network in training session II. 27
3.4 (a) 2-D data. (b) Output of the 5 layer network 2L 12N 1N 12N 2L. ( )
Probability surfa e aptured by the network. . . . . . . . . . . . . . . 28
4.1 Mean square error, averaged over all feature ve tors, is plotted for su -
essive epo hs. This urve demonstrates the onvergen e of an AANN
model for the feature ve tors of a speaker. . . . . . . . . . . . . . . . . 36
4.2 Distribution of laimant s ores. . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Distribution of laimant s ores after ba kground normalization. . . . . 39
5.1 Ee t of population of IBM on (a) EER and (b) DCF. . . . . . . . . . 46
xi
ABBREVIATIONS
ANN - Arti ial Neural Network

AANN - Autoasso iative Neural Network
CMS - Cepstral Mean Subtra tion
DCF - Dete tion Cost Fun tion
EER - Equal Error Rate
FA - False A eptan e
FR - False Reje tion
GMM - Gaussian Mixture Model
HIBM - Handset-dependent Individual Ba kground Model
IBM - Individual Ba kground Model
LP - Linear Predi tion
LPC - Linear Predi tion CoeÆ ients
NIST - National Institute of Standards and Te hnology
NLPCA - Nonlinear Prin ipal Component Analysis
PCA - Prin ipal Component Analysis
UBM - Universal Ba kground Model
VQ - Ve tor Quantization
CHAPTER 1
INTRODUCTION TO SPEAKER RECOGNITION
1.1 INTRODUCTION
Spee h ommuni ation is a natural phenomenon among human beings. The intended
message is transferred from one person to another through the omplex me hanisms of
spee h produ tion and spee h per eption. Spee h produ tion begins when the intended
message represented in some abstra t form in the mind of the speaker is onverted into
neural signals. These neural signals ontrol the human vo al system to produ e an
a ousti wave. This a ousti wave is su essfully de oded by the spee h per eption
me hanism of the listener to realize the intended message.
Fig. 1.1: Graphi al sket h of human vo al system.
The spee h produ tion me hanism is understood better by studying the anatomi al
stru ture of human vo al system shown in Fig.1.1. The human vo al system primarily
onsists of vo al tra t, nasal avity and vo al folds. The vo al tra t begins at the vo al
folds or glottis and ends at the lips. The nasal avity begins at the velum (soft palate)
1
and ends at the nostrils. It is a ousti ally oupled to the vo al tra t when the velum
is lowered. During spee h produ tion the vo al folds may be either in tensed state or
in relaxed state. As the air is expelled from the lungs through the tra hea, the tensed
vo al folds are aused to vibrate. The air ow is hopped into quasi-periodi pulses to
ex ite the vo al tra t system. The periodi ex itation of the vo al tra t system pro-
du es voi ed sounds. When the vo al folds are relaxed, the air ow must pass through a
onstri tion somewhere along the length of the vo al tra t to produ e unvoi ed sounds.
With the time varying ex itation, the shape of the vo al tra t also varies to produ e
dierent spee h sounds. Thus, a spee h signal onsisting of sequen e of sounds an be
onsidered as a result of time varying ex itation of a time varying system (vo al tra t).
Along with the intended message, spee h signal also arries information about the
speaker. The spee h per eption me hanism de odes the message, as well as re ognizes
people based upon their voi e hara teristi s present in the spee h signal. An imitation
of the latter fun tion by a ma hine is known as automati speaker re ognition or simply
speaker re ognition. Speaker re ognition an be lassied into speaker identi ation
and speaker veri ation. Given a spee h signal, the task of speaker identi ation is
to determine the identity of the speaker, whereas, the task of speaker veri ation is
to onrm the identity of the speaker. Speaker re ognition an be performed in a
text-dependent or text-independent mode. In a text-dependent speaker re ognition
system, a restri tion is imposed on the speaker to utter some xed words or senten es.
Su h restri tion is not imposed in a text-independent speaker re ognition system. In
this thesis, we address some issues related to the development of a text-independent
speaker veri ation system and fo us in parti ular the issue of development of a model
to apture the hara teristi s of a speaker present in the spee h signal.
2
1.2 ISSUES IN SPEAKER RECOGNITION
Speaker re ognition by a ma hine involves three stages. They are, (1) extra tion of fea-
tures to represent the speaker information present in the spee h signal, (2) probabilisti
modeling of speaker features, and (3) de ision logi to implement the identi ation or
veri ation task. The issues involved in ea h of these stages are dis ussed below.
1.2.1 Extra tion of Speaker Information

The primary task in a speaker re ognition system is to extra t features apable of
representing the speaker information present in the spee h signal. It is known that
human beings use high-level features su h as speaker diale t, style of spee h and verbal
mannerisms (for example, use of parti ular words and idioms, or a parti ular kind of
a laugh) to re ognize speakers. Intuitively, it is lear that these features onstitute
important speaker information. The diÆ ulty arises in representing these features
due to limitations of the existing feature extra tion te hniques [2℄. Current speaker
re ognition systems use segmental features su h as vo al tra t shape to represent the
speaker-spe i information. These features show signi ant variations a ross speak-
ers, but they also show onsiderable variations from time to time for a single speaker.
In addition to this, the hara teristi s of the re ording equipment and transmission
path are re e ted in these features [2℄.
1.2.2 Probabilisti Modeling of Speaker Features

For a text-independent speaker re ognition system, the underlying distribution of
speaker features extra ted from the spee h signal is useful. Probabilisti models are
used to estimate this distribution, whi h an be parametri or nonparametri . Para-
metri models assume a probability density fun tion for the underlying distribution of
the feature ve tors. The parameters of the hosen density fun tion are optimized to
provide the best t for the given data. The drawba k of su h an approa h is that the
hosen density fun tion may not provide a good representation for the true distribu-
3
tion of the data. Nonparametri models su h as K-nearest-neighbor approa h make no
assumptions about the distribution of the feature ve tors. They allow omplex forms
of density fun tions, but suer from the requirement that all of the feature ve tors
must be stored [3℄ [4℄. These issues led to a lass of models known as semi-parametri
models, whi h are being used by some of the urrent speaker re ognition systems.
Semi-parametri models approximate the distribution of the given data as a linear
ombination of several basis fun tions. The density fun tion obtained by the linear
ombination of several basis fun tions is known as mixture density. Although the basis
fun tion an be any known density fun tion, the majority of the speaker re ognition
systems use Gaussian fun tion. The two fundamental issues arise in using su h an ap-
proa h are: (1) How to estimate the parameters of the mixture density, and (2) How
to estimate the number of basis fun tions. For the rst question, the standard answer
is expe tation-maximization algorithm [5℄. The se ond question is diÆ ult to answer
[6℄. One has to experimentally determine the number of basis fun tions required to
give satisfa tory performan e on the given data. However, it is observed that the per-
forman e of the speaker re ognition system improves with in rease in the number of
basis fun tions [7℄. This observation indi ates that the distribution of feature ve tors
extra ted from the spee h signal is omplex in nature [8℄ [9℄.
1.2.3 De ision Logi to Implement Identi ation or Veri ation

For speaker identi ation, a spee h utteran e from an unknown speaker is ompared
with the models of all the speakers registered with the system. The unknown speaker is
identied as the speaker whose model mat hes best with the given utteran e. The use
of su h a simple logi leads to: (1) in rease in the identi ation time with in rease in
the population of registered speakers, and (2) possibility of an unknown speaker being
identied as one of the registered speakers. For speaker veri ation, the given spee h
utteran e is ompared with the model of the speaker whose identity is laimed. The
mat h of the model with the given utteran e is ompared with a threshold to a ept
or reje t the laim. The task of de iding the threshold is nontrivial. A high threshold
4
makes it diÆ ult for impostors to be a epted but at the risk of falsely reje ting the
genuine speakers. Conversely, a low threshold enables genuine speakers to be a epted,
but at the risk of a epting the impostors. The threshold an be speaker-dependent
or speaker-independent. Another issue in the text-independent speaker veri ation is
the variation in the hara teristi s of the spee h signal due to linguisti ontent and
transmission media. It is ne essary to onsider these variations at the de ision level.
It is also observed that the veri ation performan e is more satisfa tory for a set of
speakers (referred to as sheeps) ompared to the remaining set of speakers (referred
to as goats) present in a parti ular database. The problem of sheeps and goats has to
be addressed to obtain a uniform performan e for all the speakers [2℄.
In pra ti al appli ations, the spee h signal is transmitted over a ommuni ation
hannel. The hara teristi s of the telephone hannel and handset degrades the per-
forman e of a speaker re ognition system [10℄. Variability in the hara teristi s of
hannels and handsets further degrades the performan e of a speaker re ognition sys-
tem [11℄ [12℄. The issue of redu ing the hannel and handset ee ts on the performan e
of a speaker re ognition system is addressed in the literature at the feature, model and
de ision levels.
1.3 ISSUES ADDRESSED IN THE THESIS
In this thesis, we address the issues involved in developing a text-independent speaker

veri ation system using nonparametri models. A database of onversational tele-
phone spee h is used in this study. The issue of apturing and representing the dis-
tribution of the feature ve tors of a speaker is addressed using Autoasso iative Neural
Network (AANN) models. For a text-independent speaker veri ation system, varia-
tions in the spee h signal arising due to linguisti ontent has to be onsidered at the
de ision level. This issue is addressed by normalizing with a ba kground model in the
AANN-based speaker veri ation system. The issue of hannel and handset ee ts
5
on the performan e of the AANN-based speaker veri ation system is also addressed.
Methods are suggested to redu e the ee ts of hannel and handset mismat h between
training and testing data. All the studies reported in this thesis are performed on a
population of 230 speakers using 1448 test utteran es.
1.4 ORGANIZATION OF THE THESIS
The organization of the thesis is as follows:

Chapter 2 is devoted to a review of the existing approa hes for speaker re ogni-
tion task. Representations suitable for dierent sour es of speaker information
present in the spee h signal are dis ussed. Probabilisti models used to des ribe
the distribution of the feature ve tors of a speaker are analyzed. The need to
investigate new models is dis ussed.
Chapter 3 investigates the potential of AANN model to apture the distribution
of feature ve tors of a speaker. The signi an e of error surfa e realized by a
neural network model in the input feature spa e is explained. A probability
surfa e is derived from the training error surfa e to study the hara teristi s of
an AANN model. The distribution apturing ability of three layer and ve layer
AANN models is dis ussed.
Chapter 4 develops a speaker veri ation system using ve layer AANN models.
The pro edure used to extra t the feature ve tors from the spee h signal is
explained. The algorithm used to train an AANN model is des ribed. The ver-
i ation pro edure of an AANN-based speaker veri ation system is dis ussed.
Performan e of the speaker veri ation system is evaluated on a database of 230
speakers.
Chapter 5 fo uses on enhan ing the performan e of the AANN-based speaker ver-
i ation system. The signi an e of ba kground model for speaker veri ation
task is explained. A ba kground model suitable for the AANN-based speaker
6
veri ation system is proposed. The ee t of the stru ture of AANN model is
studied in detail. Performan e of speaker veri ation is examined for dierent
number of units in the dimension ompression layer. To redu e the ee ts of
hannel and handset mismat h between training and testing data, a method is
proposed to normalize the error obtained by an AANN model. Finally, the per-
forman e of AANN models is ompared with that of Gaussian mixture models
for a database of 1000 speakers.
Chapter 6 on ludes the thesis by summarizing the work.
7
CHAPTER 2
APPROACHES FOR SPEAKER RECOGNITION
2.1 INTRODUCTION
The fo us of this review is on the approa hes followed at the feature, model and de i-
sion levels for speaker re ognition, rather than on the performan e of the approa hes
and population of databases. Arti les addressing various issues of speaker re ognition
an be found in [2℄ [13℄ [14℄ [15℄ [16℄ [17℄ [18℄ [19℄ [20℄ [21℄ [22℄ [23℄.
The organization of this hapter is as follows: In Se tion 2.2, we dis uss suitable
feature ve tors for representing speaker information present in the spee h signal. Text-
dependent speaker re ognition systems use these feature ve tors for nonlinear mat hing
te hniques su h as Dynami Time Warping [24℄. Text-independent speaker re ognition
systems use probabilisti models of the feature ve tors extra ted from the spee h signal.
The dis ussion in Se tion 2.3 is fo ussed on dierent probabilisti models used to
des ribe the distribution of the feature ve tors of a speaker. The merits and demerits
of these models are dis ussed and the need to explore new methods is dis ussed in
Se tion 2.4.
2.2 FEATURES TO REPRESENT SPEAKER INFORMATION
Features that are derived from dierent sour es of speaker information present in the
spee h signal an be grouped into three ategories. They are: (1) features related
to the anatomi al dieren es in the vo al tra t, (2) features related to the ex itation
sour e of the vo al tra t, and (3) features related to dieren es in the speaker habits.
8
The anatomi al dieren es relate to the stru tural dieren es in the shape and size of
the vo al tra t, whi h vary onsiderably from one speaker to another. The vo al tra t
shape is diÆ ult to derive from the spee h signal [25℄ [26℄. Therefore, the shape of
the vo al tra t is hara terized by the resonan es of the vo al tra t system. The time
varying vo al tra t is assumed to be in a quasi-stationary state for a short-duration
(10-30 ms), and the spe tral ontent of the spee h segment is used to hara terize
the vo al tra t shape. The periodi os illation of the vo al folds is a major sour e
of ex itation of the vo al tra t system. The individuality of the speaker is asso iated
with the os illations of these vo al folds. The dieren es in speaking habits are due to
the manner in whi h speakers have learned to use their spee h produ tion me hanism.
These dieren es indi ate the temporal variations of the hara teristi s of dierent
individuals.
The general requirements of the features representing speaker information noted

in [27℄ are as follows:
EÆ ient in representing the speaker-dependent information
Easy to measure
Stable over time
O ur naturally and frequently in spee h
Robust towards degradations of the transmission media
Not to be sus eptible to mimi ry
Among the dierent sour es of speaker information, speaking habit and the style of
an individual are onsidered as higher-level information. These features are hard to
quantify with the existing te hniques. Speaker re ognition systems relies on the fea-
tures derived from the vo al tra t shape and its ex itation hara teristi s to represent
the speaker information [2℄. Following subse tions dis uss some of these features.
9
2.2.1 Pit h
The rate of vibration of vo al folds is known as fundamental frequen y or pit h. Pit h
information an be extra ted from the spee h signal using various methods su h as
zero- rossing, epstral methods, group delay fun tions et . [28℄ [29℄ [30℄. A dis us-
sion of various algorithms for pit h extra tion is given in [31℄. The individuality of a
speaker is asso iated with the pit h patterns extra ted from the spee h signal. Atal [32℄
demonstrated that pit h ontours an be used ee tively for text-dependent speaker
identi ation. Yegnanarayana et al. [33℄ used the lo al fall and rise patterns of pit h,
and durational features for identifying the speaker. In [34℄, the long-term average
value of the fundamental frequen y is used for text-independent speaker re ognition.
Studies have also been made to investigate the usefulness of ombining pit h informa-
tion with spe tral features [35℄ [36℄. In [37℄ a multistage pattern re ognition approa h
is proposed for speaker identi ation. A two stage lassier with pit h and auto orre-
lation oeÆ ients is shown to perform better than a single stage lassier using these
features together.
2.2.2 Formants
Formants may be des ribed as the resonan es of the vo al tra t system. They vary
in frequen y, relative amplitude and bandwidth a ording to spee h and speaker. Ex-
tra tion of formant frequen ies is a diÆ ult problem in spee h pro essing [38℄ [39℄.
Their presen e in the spe trum envelope as peaks may be masked by the harmoni s
of the ex itation signal, and thus smoothing is required prior to the use of any peak
pi king algorithms. Formants and their ontours have been used for text-dependent
speaker re ognition studies [40℄ [41℄. Formants orresponding to nasal onsonants are
found to be ee tive for speaker re ognition [27℄ [42℄. Su et al. used oarti ulation
between nasal and the following vowel as an a ousti ue for identifying speakers [43℄.
However, omparative studies on eÆ ien y of dierent features indi ate that distan es
based on formant frequen ies ontribute little towards dis riminating impostors [44℄.
10
2.2.3 Long-Term Spe tral Features
The idea of using the long-term spe trum is to suppress the spe tral details due to
the linguisti ontent of the spee h signal. The Fourier transform of a long segment of
a spee h signal or averaging of spe tral features obtained from short segments of the
spee h signal, represent slow varying omponents of the utteran e. Sin e the speaker
hara teristi s are varying slowly ompared to the message part of the utteran e, the
long-term spe trum an be onsidered as a speaker-spe i feature, independent of the
senten e [45℄ [46℄ [47℄. It has to be noted that the long-term spe trum is not a stable
feature due to its dependen e on the hara teristi s of the ommuni ation hannel [2℄.
2.2.4 Short-Term Spe tral Features

Short-term spe tral features are obtained from the spee h signal of duration 10-30
ms. They primarily hara terize the vo al tra t shape, whi h may vary signi antly
a ross speakers. These features are realized in dierent forms su h as linear predi tion
oeÆ ients, mel-frequen y epstral oeÆ ients, line spe tral pair et . We brie y review
some of the short-term spe tral features used for speaker re ognition.
2.2.4.1 Linear Predi tion CoeÆ ients

The theory of Linear Predi tion (LP) is losely linked to modeling of the vo al tra t
system, and relies upon the fa t that a parti ular spee h sample may be predi ted by
a linear ombination of previous samples. The number of previous samples used for
predi tion is known as the order of the predi tion. The weights applied to ea h of
the previous spee h samples are known as Linear Predi tion CoeÆ ients (LPC). They
are al ulated so as to minimize the predi tion residual. As a byprodu t of the LP
analysis, re e tion oeÆ ients and log area oeÆ ients are also obtained [48℄.
A study into the use of LPC for speaker re ognition was arried out by Atal [49℄.
These oeÆ ients are highly orrelated, and the use of all predi tion oeÆ ients may
11
not be ne essary for speaker re ognition task [50℄. Sambur [51℄ used a method alled
orthogonal linear predi tion to orthogonalize linear predi tion oeÆ ients, re e tion
oeÆ ients and log area oeÆ ients by proje ting these oeÆ ients onto the orre-
sponding eigenspa es. It is noted that only a small subset of the resulting orthogonal
oeÆ ients exhibits signi ant variation over the duration of an utteran e. It is shown
that re e tion oeÆ ients are as good as the other feature sets. Naik et al. [52℄ used
prin ipal spe tral omponents derived from linear predi tion oeÆ ients for speaker
veri ation task.
2.2.4.2 Cepstral CoeÆ ients

In many appli ations, Eu lidean distan e is used as a measure of similarity/dissimilarity.
The sharp peaks of the LP spe trum may produ e large errors in a similarity test, even
for a slight shift in the position of the peaks. Hen e, linear predi tion oeÆ ients are
onverted into epstral oeÆ ients using a re ursive relation [53℄. Cepstral oeÆ ients
represent the log magnitude spe trum, and the rst few oeÆ ients model the smooth
envelope of the log spe trum [54℄. These oeÆ ients an be obtained either from non-
linear transformation of linear predi tion oeÆ ients or dire tly from the IFFT of log
magnitude spe trum of the spee h signal. In both ases, the pro ess results in de on-
volution of the vo al tra t from the spee h signal [55℄.
In an early study, Lu k [35℄ used FFT-based epstral oeÆ ients for speaker ver-
i ation. Atal [49℄ explored LPC-derived epstral oeÆ ients and proved their ee -
tiveness over LPC and other features su h as pit h and intensity ontours. Furui [53℄
observed a similar performan e of speaker veri ation for LPC-derived and FFT-based
epstral oeÆ ients. LPC-derived epstral oeÆ ients take less omputation time and
are used even in re ent studies for speaker re ognition task [11℄ [20℄ [56℄.
12
2.2.4.3 Mel-Frequen y Cepstral CoeÆ ients
The FFT-based epstral oeÆ ients are omputed by taking IFFT of the log magnitude
spe trum of the spee h signal. The mel-warped epstrum is obtained by inserting
a intermediate step of transforming the frequen y s ale to pla e less emphasis on
high frequen ies before taking the IFFT. The mel s ale is based on human per eption
of frequen y of sounds [54℄. Most of the urrent speaker veri ation systems use
mel-frequen y epstral oeÆ ients to represent the speaker information present in the
spee h signal [11℄ [19℄ [57℄.
2.2.4.4 LP Residual Features

The individuality of a speaker asso iated with the ex itation of the vo al tra t has
been a subje t of interest in speaker re ognition studies. Linear predi tion analysis
models the vo al tra t (system) parameters and hen e the information about the ex-
itation (sour e) of the vo al tra t is present in the residual signal.
Wakita [58℄ reported an experiment using LP residual energy on vowel re ognition

and speaker re ognition in whi h the distan es between vowels from dierent speakers
are ompared. He found that most of the time a vowel produ ed by a speaker is lose
to one of his/her vowel produ tions than to the vowels from the other speaker. A
re ent investigation on the usefulness of LP residual for speaker veri ation task is
reported in [59℄.
2.2.4.5 Other Segmental Features

Liu et al. [60℄ studied the ee tiveness of line spe tral pair representation for speaker
veri ation. Hema et al. [61℄ used a new feature based on spe tral slopes to dis rim-
inate among speakers. Gong and Haton suggested nonlinear ve tor interpolation for
speaker re ognition [62℄. This idea of spe tral mapping has been studied extensively
in [63℄ [64℄ and [65℄.
13
Until now, we have dis ussed various short-term spe tral features used for speaker
re ognition. The extensive usage of spe tral features for speaker re ognition task is
due to their better dis rimination ability over other features su h as pit h. How-
ever, it is shown that the spe tral features (extra ted in dierent forms su h as LPC,
mel-frequen y and line frequen y epstral oeÆ ients) are ae ted by the frequen y
distortions introdu ed by the telephone hannels [66℄. Hen e, many feature-based
ompensation te hniques like Cepstral Mean Subtra tion (CMS), RASTA pro essing,
ltering the spe tral traje tories, ltering the spee h signal with a xed lter are em-
ployed in pra ti al situations [7℄ [49℄ [53℄ [67℄ [68℄ [69℄ [70℄.
It is interesting to note that the spe tral features whi h we laim to represent
speaker information are borrowed from spee h re ognition eld [71℄ [72℄. The dieren e
lies in grouping the feature ve tors of a lass. For spee h re ognition, the feature
ve tors of a parti ular phoneme uttered by several speakers are grouped into one
lass. For speaker re ognition, the feature ve tors extra ted from the utteran e of an
individual are grouped into one lass. One an argue that it is ne essary to derive
the features from the spee h signal with an obje tive riterion of representing speaker
information. However, the performan e of the urrent speaker re ognition systems
using spe tral features as reported in [11℄ and [1℄ suggest that the speaker re ognition
systems an be used for pra ti al appli ations, provided a better des ription is given
for the distribution of feature ve tors. The next dis ussion fo us on the distribution
modeling of feature ve tors for text-independent speaker re ognition systems.
2.3 MODELS FOR SPEAKER RECOGNITION
Early studies on text-independent speaker re ognition used long-term averaging of

feature ve tors to reate referen e templates [34℄ [73℄. In [74℄ the orrelation matri es
derived from the spe tra of relatively long duration of spee h signals are used to spe -
14
ify speaker dieren es. Su h methods may not adequately represent the distribution
of feature ve tors. Hen e, the probability distribution of feature ve tors are modeled
by parametri or nonparametri methods. Models whi h assume a probability density
fun tion are termed parametri . In nonparametri modeling, minimal or no assump-
tions are made regarding the probability distribution of feature ve tors. In this se tion,
we brie y review the nearest neighbor, Ve tor Quantization (VQ), Gaussian Mixture
Model (GMM) and neural network based approa hes for speaker re ognition. While
GMM is a parametri model, nearest neighbor, VQ and neural network models are
treated as nonparametri .
2.3.1 Approa h of Nearest Neighbor

The nearest neighbor rule de ides, whether the given utteran e belongs to the speaker
of its nearest neighbor. In this approa h, the feature ve tors of the registered speakers
are stored as referen e ve tors. During testing, feature ve tors of the test utteran e
are ompared with ea h of the registered speaker's feature ve tors. The speaker of the
test utteran e is the speaker of the feature set, whi h gives the lowest nearest neighbor
distan e. It is shown that su h a simple approa h does provide reasonable performan e
for speaker identi ation [8℄ [75℄.
2.3.2 Ve tor Quantization

Ve tor quantization is similar to nearest neighbor modeling, ex ept that the distan e
is measured to the nearest data representative. The data representative is the entroid
ve tor for a luster of feature ve tors. A set of su h entroid ve tors is known as ode-
book. In a speaker re ognition system, odebooks are formed for ea h speaker using
the referen e feature ve tors. For text-independent ase, odebooks may be formed
from widely varying input text, in the hope that most of the phonemes will be repre-
sented. The a tual task of veri ation is arried out by measuring the distan e of the
feature ve tors extra ted from the test utteran e against the laimed speaker's ode-
15
book. Soong et al. [76℄ employed VQ te hnique to represent speaker features. Ea h
speaker is hara terized by a VQ odebook of 64 ve tors onstru ted from a large set
of short-term spe tral ve tors obtained from the training utteran es. In [77℄, perfor-
man e of VQ based speaker re ognition system is evaluated for both text-dependent
and text-independent ases. Similar approa hes are reported by Helms [78℄, Dorsey et
al. [79℄, Li [80℄, Shikano [81℄, and Bu k et al. [82℄. In a later study, two VQ odebooks
(one for stati epstrals and the other for delta epstrals) ea h with 64 entries are
generated and used as a model for ea h speaker [83℄. Matsui et al. [36℄ used speaker
models onsisting of two VQ odebooks, one for voi ed sounds and the other for un-
voi ed sounds.
Instead of point to point omparison (nearest neighbor approa h), VQ based ap-
proa hes represent the speaker features by a set of mean ve tors. A still better ap-
proa h is to model the speaker features with a set of mean and varian e ve tors. This
te hnique is widely used and is known as Gaussian mixture models [11℄ [84℄.
2.3.3 Gaussian Mixture Models

The basis for using GMM is that the distribution of feature ve tors extra ted from
an individual's spee h data an be modeled by a Gaussian mixture density. For a
M -dimensional feature ve tor denoted as x , the mixture density fun tion for speaker
s is dened as
P
M
p(x=s) = is fis( x)
i=1
The mixture density fun tion is a weighted linear ombination of M omponent uni-
modal Gaussian densities fis(:). Ea h Gaussian density fun tion fis(:) is parameterized
by the mean ve tor si and the ovarian e matrix Cis using
fis (x) =
1
(2 )(n=2) jCis j exp(
1
2
(x si)T (Cis) (x si)),
1
16
where (Cis) and j Cis j denotes the inverse and determinant of the ovarian e ma-
1
trix Cis. The mixture weights is satisfy the onstraint iP is = 1. Colle tively the
M
=1
parameters of the speaker model s are denoted as s = fis;si; Cisg, i = 1 M .
The number of mixture omponents is hosen empiri ally for a given data set. The
parameters of the GMM are estimated using the iterative expe tation-maximization
algorithm [5℄ [85℄.
A speaker veri ation system using GMM an be developed as follows [86℄: Feature
ve tors of a given speaker are used to build a speaker-spe i GMM (s). The speaker-
spe i GMM is derived from an universal ba kground model (b), whi h is trained
with feature ve tors of several speakers. The basi idea behind using a speaker-spe i
and an universal ba kground model is to a ommodate the linguisti variability of the
test utteran e at the de ision level. During the veri ation phase, the test utteran e is
given to the universal ba kground model, and a few mixture omponents whi h on-
tribute signi antly to the likelihood value are noted. The likelihood of speaker-spe i
GMM is then omputed by onsidering only the sele ted mixture omponents. The
laim is reje ted or a epted by omparing the log likelihood ratio with the threshold
using
reje t
ln p(x= s) <
>
p(x= ) a ept
b
This likelihood ratio is viewed as a means of normalizing the likelihood for the
target speaker. It is observed that the performan e of GMM-based speaker veri ation
system is dependent on the number of mixture omponents and also on the database
used to build the universal ba kground model [7℄ [1℄.
2.3.4 Arti ial Neural Network Models

Arti ial Neural Network Models (ANN) onsists of inter onne ted pro essing units,
where ea h unit represent the model of an arti ial neuron, and the inter onne tion be-
tween two units has a weight asso iated with it. ANN models with dierent topologies
17
perform dierent pattern re ognition tasks [87℄ [88℄ [89℄. The apability of a neural
network model to dis riminate between patterns of dierent lasses is exploited for
speaker re ognition task [90℄. A global lassier for N speakers may perform poorly,
as the omplexity of the lassi ation task in reases with the in rease in the value
of N [91℄. Oglesby et al. [92℄ proposed one network model for ea h speaker, whi h
is trained to dis riminate between spee h data of a parti ular speaker and a small
set of other speakers (impostors). Re ent studies exploiting the mapping apability
of neural network models for speaker re ognition task an be found in [62℄ [63℄ [64℄ [65℄.
Theoreti al analysis of multilayer per eptron using sigmoidal a tivation fun tion
suggests that these networks may not draw losed boundaries in the feature spa e [93℄.
This on lusion expose the inadequa y of the neural network models using sigmoidal
a tivation fun tion for lassi ation task. But, there exist other ANN models (radial
basis fun tion networks [94℄, autoasso iative networks [95℄) whi h do not suer from
the disadvantage pointed out in [93℄. It is also to be noted that the neural network
models are not studied expli itly to apture the distribution of the data (feature ve tors
of a speaker), although there are attempts to interpret the results of learning in terms
of distribution of the data [89℄ [9℄.
2.4 NEED FOR NEW MODELS
Current speaker veri ation systems operating in text-independent mode use GMM to
estimate the probability distribution of the feature ve tors of a speaker. The probabil-
ity distribution estimated by the GMM is onstrained by the fa t that the shape of the
omponents of the distribution is assumed to be Gaussian, and the number of mixtures
are xed a priori. Feature ve tors extra ted from spee h signal may have distributions
in the feature spa e whi h we may not be able to des ribe a urately using a GMM,
with its rst and se ond order statisti s and mixture weights. This fa t is indi ated by
the improved performan e of GMM-based speaker veri ation systems with in rease
18
in the number of mixtures [7℄. Therefore, it is worth exploring new methods to apture
the distribution of the feature ve tors of a speaker. In this ontext, we investigate the
potential of nonlinear models su h as AANN models.
Eorts to use AANN models for speaker veri ation an be found in [96℄ [95℄.
While the study in [96℄, highlights the importan e of nonlinear a tivation fun tion, the
eort is restri ted to a three layer AANN model and its appli ation to phoneme-based
speaker veri ation. Ikbal et al. [95℄ have studied the importan e of more hidden layers,
and the eÆ ien y of AANN models is demonstrated on a small subset of NTIMIT
database. The intuitive arguments of [95℄ and [97℄ provide justi ation for the ability
of AANN models to apture a nonlinear subspa e. Our studies on the hara teristi s
of AANN model is fo ussed on the error given by the AANN model for every point
in the input spa e. We show that there is a relation between the distribution of the
feature ve tors and the training error surfa e aptured by an AANN model in the
input feature spa e. A probability surfa e is derived from the training error surfa e
aptured by the network. The distribution apturing ability of three layer and ve
layer AANN models is studied using the probability surfa e aptured by these network
models. The potential of the ve layer AANN model for the task of speaker veri ation
is demonstrated on onversational telephone spee h for 230 speakers. The issues of
ba kground model, role of hidden units and s ore normalization are addressed in the
ontext of minimizing the ee ts of telephone hannel and handset hara teristi s on
the performan e of speaker veri ation systems.
19
CHAPTER 3
CHARACTERISTICS OF AUTOASSOCIATIVE NEURAL
NETWORK MODELS
3.1 INTRODUCTION
Autoasso iative neural network models are feedforward neural networks performing an
identity mapping of the input spa e [89℄ [88℄. Many appli ations use ve layer models
for dimensionality redu tion by proje ting the input data onto the nonlinear subspa e
aptured by the network. From a dierent perspe tive the AANN models an be used
to apture the distribution of input data. In this ontext, the nonlinear a tivation fun -
tion at the hidden units plays a signi ant role. There exists a relationship between
the weights of the network and the prin ipal omponents of the training data [98℄. We
establish that there is a relation between the distribution of the given data and the
training error surfa e aptured by the network in the input spa e. We show that the
weights of the ve layer AANN model indeed apture the distribution of the given data.
This hapter is organized as follows: The relation between prin ipal omponent
analysis and AANN models is dis ussed in Se tion 3.2. Chara teristi s of a three layer
AANN model with linear units in the dimension ompression layer are dis ussed in
Se tion 3.3.1. The relation between training error surfa e and the distribution of the
input data is also dis ussed. The ee t of nonlinear a tivation fun tion at the hidden
units is studied in Se tion 3.3.2. Ee tiveness of a ve layer to apture a nonlinear
subspa e is dis ussed in Se tion 3.4.
20
3.2 PRINCIPAL COMPONENT ANALYSIS AND AANN MODELS
Prin ipal Component Analysis (PCA) is a method of representing the distribution of

a given data in terms of orthogonal omponents [99℄ [89℄ [88℄. These orthogonal om-
ponents a ount for the varian e of the data. The obje tive of PCA is to proje t the
given data onto a linear subspa e spanned by the signi ant orthogonal omponents for
dimension redu tion [6℄. Consider a random ve tor x = [x ; x ; ; xM ℄T with mean
1 2
E fxg = 0 and ovarian e matrix R = E fxxT g, where E is expe tation operator. In

PCA, the obje tive is to nd a transformation ve tor y = W T x, su h that the mean
square error between the re onstru ted ve tor x^ = W y = W (W T x) and the input
ve tor x is minimum. PCA seeks to minimize the mean square re onstru tion error
J = E fkx x^ k g. Note that the obje tive riterion of the ba kpropagation learning
2
algorithm adjusting the weights of a feedforward neural network for identity mapping
is also to obtain a minimum value of J [88℄. Limitation of PCA to represent an input
spa e using a linear subspa e motivated the resear hers to investigate a method of
proje ting the input data onto a nonlinear subspa e using AANN models [100℄ [98℄.
Attempts to use nonlinear hidden units in a three layer AANN model did not
provide any solution better than PCA [98℄. Addition of hidden layers before and after
the ompression layer proje ts the input data onto a nonlinear subspa e [100℄. Many
appli ations use the nonlinear subspa e aptured by the ve layer AANN model for
dimensionality redu tion [6℄. There exists a dierent perspe tive in whi h the hara -
teristi s of the AANN model an be used to apture the distribution of the given data.
Studies on three layer AANN models show that the nonlinear a tivation fun tion
at the hidden units lusters the input data in a linear subspa e [101℄. Theoreti ally,
it was shown that the weights of the network will produ e small errors, only for a
set of points around the training data [101℄. When the onstraints of the network
are relaxed in terms of layers, the network is able to luster the input data in the
21
nonlinear subspa e [95℄. Our study on the relation of the training error surfa e and
data distribution, led us to explore the distribution apturing ability of an AANN
model [9℄. The distribution apturing ability of a three layer and a ve layer AANN
model is studied using the training error surfa e realized by the neural network models
in the input feature spa e.
3.3 THREE LAYER AANN MODEL
Consider a three layer AANN model with M units in the input and output layers, and
p < M units in the hidden layer. Let X = [x ; x ; ; xN ℄ be the M N matrix
1 2
formed by the N input ve tors, and let Y = [y ; y ; ; yN ℄ be the M N matrix

1 2
formed by the ve tors realized at units of the output layer of the network. The mat h
between X and Y is measured in terms of mean square error J = kX Y k , where 2
k:k indi ate squared norm. Let W T = [uij ℄ 2 RpM represent the weight matrix
2
onne ting the input layer and the hidden layer, and W = [vij ℄ 2 RM p represent the
weight matrix onne ting output layer and hidden layer. For linear a tivation fun tion
at the input and output units, J = kX W F (W T X )k , where F is the a tivation
2
fun tion at the hidden units.
3.3.1 Linear A tivation Fun tion at the Hidden Units

For linear a tivation fun tion at the hidden units, F (W T X ) = W T X . Therefore
J = kX W W T X k = kX X k , where = W W T . Sin e the rank of the matrix
2 2
X is p < M , the produ t X minimizing J is the best rank (p) approximation of

X in the Eu lidean spa e. This an be obtained using Singular Value De omposition
(SVD) of X [98℄, [102℄. It is shown that optimal weights of the network minimizing J
orresponds to the prin ipal (singular) ve tors of the ovarian e matrix XX T [98℄. In
other words, the units in the hidden layer apture the linear subspa e spanned by the
rst p prin ipal omponents of the given data.
22
0.1 0.1
10 10
0.05 0.05
0 5 0 5
−4 −4
−2 0 −2 0
0 0
2 −5 2 −5
4 4
(a) (b)
0.1 0.1
10 10
0.05 0.05
0 5 0 5
−4 −4
−2 0 −2 0
0 0
2 −5 2 −5
4 4
( ) (d)
(e) (f)
Fig. 3.1: (a) 2-D data (A 3-D view is shown). (b) 2-D data shown in (a) is repeated. ( )
Output of the 3 layer network 2L 1L 2L. (d) Output of the 3 layer network 2L 1N 2L. (e)
Probability surfa e aptured by the network 2L 1L 2L. (f) Probability surfa e aptured
by the network 2L 1N 2L. Here L denote linear unit and N denote nonlinear unit.
23
For illustration, onsider a three layer AANN model with one linear unit in the
hidden layer. The model is trained with the arti ial 2-D data shown in Fig.3.1(a)
using ba kpropagation learning algorithm in pattern mode [89℄, [88℄. The distribution
(shown by solid lines) of the input ve tors realized by the AANN model is shown
in Fig.3.1( ). From Fig.3.1( ), we observe that the linear subspa e aptured by the
network is along the prin ipal dire tion of the input data. In order to visualize the
distribution better, one an plot the training error for ea h input data point in the
form of some probability surfa e as shown in Fig.3.1(e). The training error Ei for data
point i in the input spa e is plotted as fi = e E = , where = 2 is a onstant. Note
i
that fi is not stri tly a probability density fun tion but we all the resulting surfa e
as probability surfa e. The plot of the probability surfa e shows larger amplitude for
smaller error Ei , indi ating better mat h of the network for that data point. One an
use the probability surfa e to study the hara teristi s of the distribution of the input
data aptured by the network [9℄.
3.3.2 Nonlinear A tivation Fun tion at the Hidden Units

If the a tivation fun tion at the hidden units is nonlinear of the type tanh(:), then its
approximation by a linear fun tion is shown to give a similar result as that of a linear
a tivation fun tion [98℄. It follows that the subspa e formed at the hidden layer is
linear. The linear subspa e aptured by the three layer AANN model with nonlinear
hidden unit is shown in Fig.3.1(d). The ee t of nonlinear a tivation fun tion at the
hidden layer an be observed from the probability surfa e shown in Fig.3.1(f). The
network is able to luster the input data be ause of the nonlinear a tivation fun tion
at the hidden unit. The lustering ability of the AANN model an be explained with
the on ept of -autoasso iation des ribed in [101℄.
The data set X is said to be -autoasso iated with if kYkY Xk k < , where 2 R .
2
2
+
To nd the range of ( 2 R ) for whi h X is -autoasso iated with , onsider the

+
following.
24
kY X k2 < ) kW F (W T X ) X k2 <
kY k2 kW F (W T X )k2
We observe that if a linear a tivation fun tion is used at the hidden units, gets
an eled, and for all values of the inequality holds. However, if a nonlinear a tivation
fun tion is used at the hidden units, it is shown in [101℄ that the above inequality holds
for < , where = (1 + )kW k =kX k . Thus, only limited points in input spa e
2 2
are -autoasso iated with the weights of network. The linear surfa e aptured by the
three layer AANN model may not produ e a low for the training data, and hen e the
probability surfa e shown in Fig.3.1(f) does not re e t the distribution of the training
data. It is ne essary to apture a nonlinear subspa e to obtain a low for the training
data. In the next se tion, we show the ee tiveness of a ve layer AANN model to
apture nonlinear subspa es.
3.4 FIVE LAYER AANN MODEL
The ve layer AANN model shown in Fig.3.2 performs Nonlinear Prin ipal Compo-
nent Analysis (NLPCA) [100℄. The se ond and fourth layers of the network have more
units than the input layer. The third layer has fewer units than the rst or fth.
The a tivation fun tions at third layer may be linear or nonlinear, but the a tivation
fun tions at the se ond and fourth layers are essentially nonlinear.
2 4
Layer 1 0000000
1111111 5
0000000
1111111
0000000
1111111 1111111
0000000
0000000
1111111
0000000
11111113 0000000
1111111
0000000
1111111
0000000
1111111 0000000
1111111
0000000
0000000
1111111
000000
111111 1111111
0000000
1111111 00000000
11111111
0000000
1111111
000000
111111
0000000
1111111 0000000
1111111 11111111
00000000
0000000
1111111
000000
111111 000000
111111
0000000
1111111 000000
111111
0000000
1111111 00000000
11111111
0000000
1111111
00000000
11111111
000000
111111
000000 1111111
111111 000000
111111
0000000
1111111 00000000
11111111
0000000
1111111
00000000
11111111
000000 0000000
111111
000000
111111
000000
111111
0000000
1111111
000000
111111
000000
111111
0000000
1111111
000000
111111
0000000
1111111
00000000
11111111
0000000
1111111
00000000
11111111
00000000
11111111
0000000
1111111
00000000
11111111
000000 1111111
111111 0000000
000000
111111 0000000
1111111
000000
111111
0000000
1111111
0000000
1111111 00000000
11111111
0000000
1111111
00000000
11111111
000000
111111 0000000
1111111
000000
111111
000000
111111 000000
111111
0000000
1111111
0000000
1111111 00000000
11111111
0000000
1111111
00000000
11111111
000000 0000000
111111
000000
1111111
000000
111111
0000000
1111111
000000
111111
0000000
1111111
000000
111111
0000000
1111111
000000
111111
0000000
1111111
0000000
1111111
000000
111111
0000000
1111111
00000000
11111111
0000000
1111111
00000000
11111111
00000000
11111111
0000000
1111111
111111
111111 000000
111111
0000000
000000 1111111
000000
111111
0000000
1111111 0000000
1111111
000000
111111
0000000
1111111 00000000
11111111
00000000
11111111
0000000
1111111
00000000
11111111
000000 111111
111111
000000
111111
000000
111111
000000
11111100
11
0000000
1111111
000000
0000000
1111111
00
11
0000000
1111111
000000
111111
0000000
1111111
0000000
1111111
000000
111111
0000000
1111111
0000000
1111111
000000
111111
0000000
1111111
00000000
11111111
0000000
1111111
00000000
11111111
00000000
11111111
0000000
1111111
00000000
11111111
111111 000000
111111
0000000
000000 1111111
000000
111111
0000000
1111111
00000000
111111 0000000
1111111
000000
111111
0000000
1111111 00000000
11111111
0000000
1111111
00000000
11111111
000000
111111 0000000
1111111
000000
11111111
0000000
1111111
000000
111111 0000000
1111111
000000
111111
0000000
1111111 00000000
11111111
0000000
1111111
00000000
11111111
111111 0000000
000000 1111111
000000
11111111
00
0000000
1111111
000000
111111 0000000
1111111
000000
111111
0000000
1111111 00000000
11111111
0000000
1111111
00000000
11111111
000000
111111 0000000
1111111
000000
111111
0000000
1111111
000000
111111 0000000
1111111
000000
111111
0000000
1111111
000000
111111
0000000
1111111 00000000
11111111
0000000
1111111
00000000
11111111
000000
111111 0000000
1111111
000000
111111
0000000
1111111
000000
111111
0000000 0000000
1111111
000000
111111
0000000
1111111 00000000
11111111
0000000
1111111
00000000
11111111
000000
111111 1111111
000000
111111
0000000
1111111
000000
111111 000000
111111
0000000
1111111
0000000
1111111 00000000
11111111
0000000
111111100
11
00000000
11111111
1 11
00
0 111111
000000
00
11 000000
111111
0000000
1111111
000000
111111 000000
111111
00000000
1111111
0000000
1111111 00000000
11111111
0000000
111111100
11
00000000
11111111
000000 1111111
111111 0000000 000000
111111
0000000
1111111
000000
111111 1 00000000
11111111
0000000
1111111
00000000
11111111
11
00
1 111111
0 000000 000000
111111
0000000
1111111 0000000
1111111
000000
111111 00000000
11111111
0000000
1111111
00000000
11111111
11
00
000000 111111
111111 000000
0000000
1111111 0000000
1111111
000000
11111111
00 00000000
11111111
0000000
1111111
00000000
11111111
000000
111111
000000
111111 1 111111
0 000000
0000000
1111111
000000
111111
0000000
1111111
0000000
1111111
000000
111111
11111111
00
0000000
1111111
000000
00000000
11111111
0000000
1111111
00000000
11111111
00000000
11111111
0000000
1111111
00000000
11111111
000000
111111 0000000
1111111
000000
111111 00000000
11111111
000000 1111111
000000
111111
111111 0000000
000000
111111
0000000
1111111 0000000
1111111
000000
111111 0000000
1111111
00000000
11111111
00000000
11111111
0000000
1111111
00000000
11111111
000000 111111
111111 000000
0000000
1111111 000000
111111 00000000
11111111
0000000
1111111
00000000
11111111
000000
111111
Compression
Input Layer Output Layer
Layer
Fig. 3.2: Five layer AANN model.
25
The fun tion of the ve layer AANN model is understood better by splitting the
ve layers into mapping (layers 1, 2 and 3) and demapping (layers 3, 4 and 5) net-
works. The mapping network proje ts the input spa e RM onto an arbitrary subspa e
Rp , where p < M . The mapping fun tion G is nonlinear, and a nonlinear subspa e is
formed at the third layer. The proje tion of the nonlinear subspa e Rp ba k onto the
input spa e RM is performed by the demapping network, and the demapping fun tion
H is also nonlinear. The mapping and demapping fun tions may not be unique for a
given data. This an be observed from Figs.3.3(a) and 3.3( ), where two hypersurfa es
are aptured by the same ve layer network (2L12N 1N 12N 2L) for the arti ial 2-D
data of Fig.3.1(b) for two dierent trials. These hypersurfa es are obtained by plotting
the output of the network for all the points in the input spa e shown in Fig.3.1(b). The
probability surfa es orresponding to these two hypersurfa es are shown in Figs.3.3(b)
and 3.3(d). Though the nonlinear subspa es aptured by the network dier from one
training session to another, it has to be noted that the probability surfa es aptured by
the network remains similar for dierent training sessions.
From the above dis ussion, it is lear that a ve layer AANN model is apable of
apturing nonlinear subspa es. The ability of a ve layer AANN model to perform
lustering in the nonlinear subspa e an be explored to apture the omplex distribu-
tion of the data in the input spa e. Fig.3.4( ) illustrates the distribution apturing
ability of a ve layer AANN model for the arti ial 2-D data shown in Fig.3.4(a).
We noti e that the ve layer AANN model an be used as a nonparametri model
to apture the distribution of the given data. The omponents spanning the nonlin-
ear subspa e aptured by these models are known as nonlinear prin ipal omponents.
These models dier from other nonlinear methods su h as prin ipal urves due to the
relationship between the weights of the network and the input data arising from the
nonlinear a tivation fun tion [103℄. Apart from these simple illustrations using 2-D
data, we also show the potential of the ve layer AANN model in situations su h as
text-independent speaker veri ation [11℄ [104℄.
26
(a) (b)
3.3: A ve layer network 2L 12N 1N 12N 2L is trained with the 2-D data shown in
( ) (d)
Fig.
Fig.3.1(b) for two dierent sessions. Hypersurfa es are obtained by plotting the output of
the network for all the points in the input spa e. (a) Hypersurfa e (solid lines) aptured
by the network in training session I. The input data used for training is also plotted
in the gure. (b) Probability surfa e aptured by the network in training session I. ( )
Hypersurfa e (solid lines) aptured by the network in training session II. The input data
used for training is also plotted in the gure. (d) Probability surfa e aptured by the
network in training session II.
27
0.1
10
0.05
0 5
−4
−2 0
0
2 −5
4
(a)
0.1
10
0.05
0 5
−4
−2 0
0
2 −5
4
(b)
( )
Fig. 3.4: (a) 2-D data. (b) Output of the 5 layer network 2L 12N 1N 12N 2L.
( ) Probability surfa e aptured by the network.
28
3.5 SUMMARY
In this hapter, the hara teristi s of three layer and ve layer AANN models are de-
s ribed. A three layer AANN model lusters the input data in the linear subspa e,
whereas a ve layer AANN model aptures the nonlinear subspa e passing through
the distribution of the input data. It is shown that the error surfa e realized by the
network in the input spa e is useful to study the hara teristi s of the distribution ap-
tured by the network. The distribution apturing ability of a ve layer AANN model
is illustrated for arti ial 2-D data. The following hapters exploit the distribution
apturing ability of a ve layer for speaker veri ation.
29
CHAPTER 4
AANN-BASED SPEAKER VERIFICATION SYSTEM
4.1 INTRODUCTION
In this hapter, the property of ve layer AANN models to apture the distribution
of the given data is exploited to build a speaker veri ation system. Separate AANN
models are used to apture the distribution of feature ve tors of ea h speaker. Devel-
opment of AANN-based speaker veri ation system is des ribed using a database of
230 speakers.
This hapter is organized as follows: Spee h database used in this study is de-
s ribed in Se tion 4.2. The pro edure used to extra t feature ve tors from the spee h
signal is dis ussed in Se tion 4.3. The algorithm used to build a speaker model is
explained in Se tion 4.4. Veri ation pro edure of AANN-based speaker veri ation
system is dis ussed in Se tion 4.5. Performan e of a GMM-based speaker veri a-
tion system is ompared with that of the AANN-based speaker veri ation system in
Se tion 4.6
4.2 DESCRIPTION OF SPEECH DATABASE
The spee h orpus used for the study onsists of SWITCHBOARD-2 phase-2 and
phase-3 databases of National Institute of Standards and Te hnology (NIST). These
databases are used for the NIST-99 oÆ ial speaker re ognition evaluation [11℄. The
database of phase-2 is used for ba kground modeling, and hen e referred to as de-
velopment data. The performan e of speaker veri ation system is evaluated on the
30
database of phase-3, whi h is referred to as evaluation data. Development data on-
sists of 500 speakers (250 male and 250 female), and evaluation data onsists of 539
speakers (230 male and 309 female).
Data provided for ea h speaker is onversational telephone spee h olle ted from
dierent sessions ( onversations) sampled at the rate of 8000 samples/se . The training
data onsists of two minutes of spee h data, olle ted from two dierent onversations
over the same phone number. The use of the same phone number results in passing
the spee h data over the same handset and ommuni ation hannel. Two dierent
types of mi rophones (also referred as handsets) are used for olle ting the spee h
data. They are arbon-button and ele tret. Performan e of the speaker veri ation
system is evaluated on test utteran es olle ted from dierent re ording environments.
The duration of the test utteran e varies between 3 to 60 se onds. Ea h test utteran e
has 11 laimants, where the genuine speaker may or may not be one of the laimants.
The gender of the laimant and the speaker of the test utteran e is same. There are
no ross gender trials.
All the studies reported in this thesis are performed on the male subset of 230
speakers with 1448 male test utteran es of the evaluation data. Performan e of the
speaker veri ation system is evaluated for the following three onditions:
1. Mat hed ondition: The training and testing data are olle ted from same phone
number.
2. Channel mismat h ondition: The training and testing data are olle ted from
dierent phone numbers, but it is ensured that the same handset type is used
in both the ases. The use of dierent phone numbers results in passing the
spee h signal over dierent ommuni ation hannels.
3. Handset mismat h ondition: The training and testing data are olle ted over
dierent handset types.
31
4.3 FEATURE EXTRACTION
Speaker information an be extra ted both at the segmental and suprasegmental levels.
The segmental features are the features extra ted from short (10-30 ms) segments of
spee h signal. Some of the segmental features are linear predi tion epstral oeÆ ients,
mel- epstral oeÆ ients, log spe tral energy values et . [54℄. These features represent
the short-term spe tra of the spee h signal. The spe trum of a spee h segment is
attributed primarily to the shape of the vo al tra t. The spe tral information of the
same sound uttered by two persons may dier due to both in the shapes of their vo al
tra ts and in the manner in whi h they produ e spee h [14℄. Comparative studies
between spe tral features and other features su h as the fundamental frequen y show
that the spe tral features seem to provide better dis rimination among speakers [34℄.
In this work, spe tral features represented by linear predi tion epstral oeÆ ients are
used [53℄.
4.3.1 Prepro essing of Spee h Signal

The spee h signal x(n) is preemphasized to ountera t the spe tral roll-o due to the
glottal losure in voi ed spee h [105℄.
x(n) = x(n) x(n 1), where = 1.
Dieren ing the spee h signal in time domain, multiplies the signal spe trum with
linear lter to give emphasis to the high frequen y omponents [54℄.
4.3.2 Extra tion of Linear Predi tion Cepstral CoeÆ ients

The hara teristi s of the spee h signal are assumed to be stationary over a short
duration of time (between 10-30 ms) [54℄. The dieren ed spee h signal is segmented
into frames of 27.5 ms using a Hamming window with a shift of 13.75 ms. The silen e
frames are removed using an amplitude threshold (Appendix A). A 16th order Linear
Predi tion (LP) analysis is used to apture the properties of the signal spe trum [48℄.
32
The re ursive relation between the predi tor oeÆ ients and epstral oeÆ ients is
used to onvert the 16 LP oeÆ ients into 19 LP epstral oeÆ ients (Appendix B).
4.3.3 Cepstral Weighting and Mean Subtra tion

The LP epstral oeÆ ients for ea h frame are linearly weighted. The linear weighted
LP epstral oeÆ ients hara terize the dieren ed log magnitude spe trum, where
the spe tral peaks are emphasized [54℄.
The spee h signal transmitted over a telephone hannel en ounters a linear distor-
tion due to ltering ee t of the hannel [49℄ [53℄. Linear hannel ee ts are ompen-
sated to some extent by removing the mean of the traje tory of ea h epstral oeÆ ient.
It has been shown that the mean subtra tion improves the performan e signi antly
when the training and testing data are olle ted from dierent hannels [49℄ [53℄. But
the re ognition a ura y is redu ed when the mean subtra tion is used for speaker
veri ation, in whi h training and testing are olle ted from the same hannel [49℄
[53℄.
4.4 GENERATION OF SPEAKER MODELS
The feature ve tors extra ted from the training data of a speaker are used to train
an AANN model using ba kpropagation learning algorithm in the pattern mode [89℄
[88℄. The stru ture of the AANN used for this study is 19L38N 14N 38N 19L, where
L denote linear unit and N denote nonlinear unit. The integer value indi ates the
number of units used in that parti ular layer. By in orporating some of the heuristi s
des ribed in [89℄ and [106℄, the a tual algorithm used to train the network is as follows:
NOTATION
The indi es i, j and k refer to the dierent units in the network.
The iteration (time step) is denoted by n.
33
The symbol ej (n) refers to the error at the output unit j for iteration n.
The symbol dj (n) refers to the desired output unit j for iteration n.
The symbol yj (n) refers to the a tual output unit j for iteration n.
The symbol wjk(n) denotes the synapti weight onne ting the output of the
unit k to the input of unit j at iteration n. The orre tion applied to this weight
at iteration n is denoted by wjk (n).
The indu ed lo al eld (i.e., weighted sum of all synapti inputs plus bias) of
unit j at iteration n is denoted by vj (n).
The a tivation fun tion des ribing the input-output fun tional relationship of
the nonlinearity asso iated with unit j is denoted by 'j (:). For linear a tivation
of the unit j , 'j (vj (n)) = vj (n), whereas, for nonlinear a tivation of the unit
'j (vj (n)) = a tanh(b vj (n)). The values of a, b are taken as 1:7159 and 2=3
respe tively (pg. 181 of [89℄).
The bias applied to unit j is denoted by bj ; its ee t is represented by a synapse
of weight wj = bj onne ted to a xed input equal to +1.
0
The ith element of the input ve tor is denoted by xi(n).

The learning rate parameter for ea h unit j is denoted by j , where j =
0:04 (1=Fj ). Fj denote the number of inputs (fan-in) for unit j (refer to
pg. 211 of [106℄). The s aling fa tor 0:04 is an empiri ally hosen value.
ALGORITHM
1 Initialize the weights wjk onne ting thequnit j withq the uniformly distributed
random values taken from the set [ 3= (Fj ); +3= (Fj )℄ (refer to pg. 211 of
[106℄).
2 Randomly hoose a input ve tor x
3 Propagate the signal forward through the network
34
4 Compute the lo al gradients Æ
{ For an unit j at the output layer, Æj = ej (n)'j (vj (n)), where 'j (:) denotes
0 0
the rst derivate of 'j (:). Sin e the a tivation at the output units is linear
in our ase Æj = ej , where ej = dj yj . Moreover, as we are interested in
autoasso iative mapping Æj = xj yj .
P
{ For an unit j at the hidden layer, Æj = 'j (vj (n)) Æk (n)wkj (n)
0
5 Update the weights using wji(n) = j Æj (n)yi(n)+wji(n 1), where = 0:3
is the momentum fa tor.
6 Go to step 2 and repeat for the next input ve tor
This learning algorithm adjusts the weights of the network to minimize the mean
square error obtained for ea h feature ve tor. If the adjustments of weights is done for
all the feature ve tors on e, then the network is said to be trained for one epo h. For
su essive epo hs, the mean square error, averaged over all feature ve tors, is plotted in
Fig.4.1. We observe that the redu tion in the average error is negligible after about 60
epo hs. Hen e, the training of the AANN model was stopped at 60 epo hs. A similar
behavior is observed with the feature ve tors of other speakers. On e an AANN model
is trained with the feature ve tors of a speaker, we use the AANN as speaker model.
For all the 230 speakers of the evaluation data, the speaker models are built by training
one AANN model for ea h speaker.
4.5 VERIFICATION PROCEDURE
During testing phase, the feature ve tors extra ted from the test utteran e are given
to the laimant model to obtain the laimant s ore. A laimant model is the model
of the speaker whose identity is being laimed. The s ore of a model is dened as
P̀ kx y k
kx k , where xi is the input ve tor of the model, yi is the output given by
2
1 i i
`i =1
i
2
the model, and ` is the number of feature ve tors of the test utteran e. Using 11
laimants for ea h of the 1448 utteran es, a total of 15928 tests are performed on
35
0.5
0.4
Mean square error

0.3
0.2
0.1
0
0 50 100 150 200
Epochs
Fig. 4.1: Mean square error, averaged over all feature ve tors, is plotted
for su essive epo hs. This urve demonstrates the onvergen e of an AANN
model for the feature ve tors of a speaker.
230 speaker models. The laimant s ores are segregated into genuine s ores (s ores
obtained by the genuine laimants) and impostor s ores (s ores obtained by the im-
postor laimants). The genuine s ores are further divided into three sets based upon
the re ording environment between training and testing data. The distributions of the
genuine s ores and the imposter s ores are shown in Fig.4.2. From the distributions
of the laimant s ores, a threshold is hosen to a ept or reje t the laim of the speaker.
False A eptan e (FA) and False Reje tion (FR) are the two errors that are used
in evaluating a speaker veri ation system. The tradeo between FA and FR is a fun -
tion of the de ision threshold. Equal Error Rate (EER) is the value for whi h the error
rates of FA and FR are equal [107℄. A weighted sum of error rates of FA and FR is
known as Dete tion Cost Fun tion (DCF) and is given by DCF = 0:99 Fa +0:01 Fr
[1℄ [57℄. The per entage values of false a eptan e (Fa ) and false reje tion (Fr ) are
hosen using a threshold su h that the ost fun tion is minimized. For dierent testing
onditions, the performan e of the AANN-based speaker veri ation system measured
in terms of EER and DCF is shown in Table 4.1 under the olumn base-line. The EER
values of 15.07%, 31.49% and 42.57% are for mat hed, hannel mismat h and handset
mismat h onditions, respe tively. The high values of the EER re e t the overlapping
distributions of the genuine and imposter laimant s ores shown in Fig.4.2. The s ore
36
Imposter scores
300
200
100
0
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18
Genuine scores for matched condition
10
0
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18
Genuine scores for channel mismatch
8
6
4
2
0
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18
Genuine scores for handset mismatch
10
0
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18
score
Fig. 4.2: Distribution of laimant s ores.
37
obtained by a laimant model is ae ted by the linguisti ontent of the test utteran e.
Hen e, the genuine laimant s ore of one test utteran e may be within the distribution
of impostor laimant s ores of another test utteran e. To over ome this problem, a
relative (normalized) laimant s ore an be derived to a ept or reje t the laim of the
speaker [108℄ [109℄ [84℄ [65℄.
Table 4.1:Performan e of speaker veri ation system for 230 speakers. Here
BG denotes ba kground model.
Environment Between EER DCF
Training and Testing base-line using BG base-line using BG
Mat hed 15.07% 10.16% 10.0 5.51

Channel Mismat h 31.49% 28.64% 10.0 9.21
Handset Mismat h 42.57% 39.97% 10.0 9.76
To normalize the laimant s ore, the approa h of Universal Ba kground Model

(UBM) is adapted for the AANN-based speaker veri ation system [104℄. The UBM
is also an AANN model trained with the feature ve tors of several speakers. In our ase,
the UBM is trained with the feature ve tors of 250 male speakers of the development
data (refer to Se tion 4.2). We onsidered 400 feature ve tors per speaker to train
the UBM. Ea h speaker model is deriving by adapting the UBM. During the testing
phase, the feature ve tors extra ted from a test utteran e are given to the laimant
model and the UBM to obtain the laimant s ore S and the ba kground s ore Sb,
respe tively. The normalized s ore Sn is obtained as Sn = Sb S [65℄. The use of
this formula onverts the normalized s ore into some form of onden e whi h should
be larger for the genuine laim [65℄. In Fig.4.3, the distribution of the normalized
genuine and impostor laimant s ores are shown. From Figs.4.2 and 4.3, we observe
that normalization of laimant s ores redu e the overlapping tenden y of genuine and
imposter s ore distributions to some extent.
38
Imposter scores
300
200
100
0
−0.1 −0.08 −0.06 −0.04 −0.02 0 0.02 0.04 0.06
Genuine scores for matched condition
15
10
0
−0.1 −0.08 −0.06 −0.04 −0.02 0 0.02 0.04 0.06
Genuine scores for channel mismatch
10
0
−0.1 −0.08 −0.06 −0.04 −0.02 0 0.02 0.04 0.06
Genuine scores for handset mismatch
10
0
−0.1 −0.08 −0.06 −0.04 −0.02 0 0.02 0.04 0.06
normalized score
Fig. 4.3: Distribution of laimant s ores after ba kground normalization.
39
For dierent test onditions, the performan e of the AANN-based speaker veri-
ation system measured in terms of EER and DCF is shown in Table 4.1 under the
olumn using BG. From the Table 4.1, we observe that the ba kground normalization
of the laimant s ores improves the performan e of the speaker veri ation system.
The normalization of the laimant s ores using UBM is one of the ways of deriving a
relative laimant s ore. The purpose of this study is mainly to show the importan e
of ba kground model for speaker veri ation task. Using the ba kground model, an
EER of about 10% is observed for mat hed onditions between training and testing
data. We also observe that the performan e of the speaker veri ation system degrades
for mismat h onditions. The degradation in performan e under mismat h onditions
is due to the hara teristi s of the hannels and handsets. The ee t of mismat h
between training and testing data on the performan e of speaker models an be ob-
served from the overlapping distributions of the genuine and imposter laimant s ores
shown in Fig.4.3. The area of overlapping region in reases for mismat h onditions,
parti ularly for handset mismat h ondition.
4.6 COMPARISON WITH GMM-BASED SPEAKER VERIFICATION

SYSTEM
Table 4.2: Performan e omparison of one of the best GMM-based speaker

veri ation systems with AANN-based speaker veri ation system.
Environment Between EER
Training and Testing GMM AANN
Mat hed 05.01% 10.16%

Channel Mismat h 10.00% 28.64%
Handset Mismat h 21.00% 39.97%
40
In Table 4.2, performan e of one of the best GMM-based speaker veri ation sys-
tems using same database is ompared with that of our AANN-based speaker veri a-
tion system. The results of the GMM-based Oregon Graduate Institute (OGI) speaker
veri ation system are taken from [11℄. In the GMM-based approa h, speaker models
are built with 256 mixture omponents using 38 dimensional ve tors. Ea h speaker
model is derived from an universal ba kground model, whi h is also a 256 mixture om-
ponent GMM trained with feature ve tors of 80 speakers. A 38 dimensional ve tor
used for building a GMM onsist of 19 mel epstral oeÆ ients and 19 delta oeÆ ients.
Mel epstral oeÆ ients are obtained from the logarithmi lter bank energies whose
time traje tories are smoothed over long (1 se ond) segments using data-driven lters
[69℄. Sin e AANN aptures the distribution of the feature ve tors of a speaker, it is
possible to improve the performan e of the speaker veri ation system based on AANN
models to mat h or better than the performan e of a GMM-based speaker veri ation
system. This issue will be addressed in the next hapter.
4.7 SUMMARY
In this hapter, the distribution apturing ability of ve layer AANN models is ex-
ploited to build a speaker veri ation system for onversational telephone spee h. A
ve layer AANN model is used to build ea h speaker model. Performan e of the
AANN-based speaker veri ation system is evaluated for 230 speakers. We have ob-
served that the use of ba kground model improves the performan e of the speaker
veri ation system. We have also observed that the performan e of speaker veri-
ation system degrades for mismat h onditions between training and testing data.
The performan e of an AANN-based speaker veri ation system may be enhan ed by
optimizing the parameters related to the ba kground model and stru ture of AANN.
41
CHAPTER 5
PERFORMANCE ENHANCEMENT OF AANN-BASED
SPEAKER VERIFICATION SYSTEM
5.1 INTRODUCTION
In the previous hapter, we have developed an AANN-based approa h for onversa-

tional telephone spee h. In this hapter, we address the issues related to the ba k-
ground model, stru ture of the AANN, and mismat h between training and testing
data to enhan e the performan e of the AANN-based speaker veri ation system.
This hapter is organized as follows: The signi an e of ba kground models for

speaker veri ation task is dis ussed in Se tion 5.2. A rank-based approa h of in-
dividual ba kground models is proposed. The signi an e of the nonlinear subspa e
aptured by the AANN model for the task of speaker veri ation is studied in Se tion
5.3. Some results on dimension redu tion are also dis ussed. To redu e the ee ts
of mismat h onditions, a method is proposed to normalize the s ore obtained by an
AANN model in Se tion 5.4. Comparison of performan e of the AANN-based speaker
veri ation system with GMM-based speaker system on a database of 1000 speakers
is dis ussed in Se tion 5.5.
5.2 SIGNIFICANCE OF BACKGROUND MODEL
The signi an e of ba kground model for speaker veri ation task is understood better
by studying a small set of laimant s ores given in Table 5.1. The s ores obtained by
the laimant models for a given test utteran e indi ate that the genuine laimant
s ore is omparatively lower than the imposter laimant s ores (see row 1 of Table
42
Table 5.1: A set of 11 dierent laimant s ores obtained for dierent test utteran es.
S ores of the genuine laimant models are underlined. Rows 1, 2 and 3 are for the
mat hed onditions of the genuine laimant. Row 4 is for the mismat h ondition
of the genuine laimant.
Data S ores of the laimant model
ode 1 2 3 4 5 6 7 8 9 10 11
eaaa 0.053 0.099 0.086 0.092 0.085 0.097 0.094 0.107 0.111 0.104 0.096
eaae 0.096 0.049 0.088 0.060 0.082 0.058 0.070 0.086 0.067 0.065 0.073
eabt 0.067 0.061 0.060 0.050 0.050 0.034 0.047 0.051 0.058 0.047 0.064
ef q 0.076 0.086 0.111 0.096 0.068 0.076 0.120 0.101 0.082 0.122 0.095
5.1). But the low s ore of the genuine laimant of a test utteran e may be higher
than some of the imposter laimant s ores of another test utteran e (see rows 1 and
3 of Table 5.1). Thus, normalization of the laimant s ores a ross the test utteran es
is ne essary to measure the performan e of speaker veri ation systems in terms of
EER. The problem of normalization of laimant s ores is addressed in the literature
using ba kground models [108℄ [65℄ [86℄. We also observe that the performan e of the
speaker model is poor due to degradation in the input su h as mismat h onditions as
seen in row 4, where the genuine laimant s ore is higher than some of the impostor
s ores.
5.2.1 Individual Ba kground Model (IBM)

In the studies in Chapter 4, UBM is used to normalize the laimant s ore. This normal-
ization approa h relies on the dieren es in the distribution of feature ve tors aptured
by the UBM and the laimant model. The distribution aptured by the UBM depends
on the number of speakers, number of feature ve tors per speaker and the number of
epo hs used for training. Hen e hoi e of these parameters may ae t the performan e
of the speaker veri ation system [65℄ [1℄.
43
As an alternative to UBM, we propose an approa h based on individual ba kground
models. This approa h makes use of the hypothesis that for a given test utteran e
the genuine laimant may have a low s ore ompared to the other laimant s ores. To
verify a laim, de ision should be based on the spe ied test utteran e and the laimant
model. It is diÆ ult to use information about the other laimant models for the same
test utteran e. Therefore, pseudo- laimant models are suggested for normalization.
These pseudo- laimant models are known as individual ba kground models (IBM).
5.2.2 Speaker Model and IBM

To generate pseudo- laimant models, a set of 92 speakers is sele ted from the devel-
opment data. Ea h pseudo- laimant model is derived by training an AANN with the
feature ve tors extra ted from the utteran e of a speaker in the hosen set. This sub-
set of 92 speakers belong to male set of the development data. The utteran es of 46
speakers are from an ele tret handset, and the utteran es of remaining 46 speakers
are from a arbon-button handset. These speakers are pi ked arbitrarily without any
sele tion riteria su h as ohorts [110℄. The even mixture of male speaker utteran es
olle ted from both the handset types indi ate the generation of gender-dependent and
handset-balan ed ba kground model.
The speaker models of the evaluation data are generated independent of the
pseudo- laimant models. In the testing phase, the feature ve tors of the test utteran e
are given to the laimant model and to the pseudo- laimant models. The s ores of
all the pseudo- laimant models are sorted in the as ending order. The rank of the
laimant model is dened as the position, where the laimant s ore an be inserted in
the sorted list of pseudo- laimant s ores. This rank (R) is onverted into a normalized
s ore (Sn) using, Sn = ( + 1)=R, where denotes the population of the IBM. The
use of this formula onverts the rank of the laimant model into some form of on-
den e. The advantages of this simple approa h are the following: (1) The population
44
of IBM is the only parameter to be hosen a priori, unlike in the ase of UBM. (2)
The pseudo- laimant models an be hosen arbitrarily. (3) The laimant s ore lies in
the range [1; +1℄, thus normalize the s ores a ross the test utteran es. (4) Heuristi s
su h as the best, se ond best et ., an be used for the rank obtained by the laimant
s ore to a ept or reje t the laim [111℄. Performan e omparison of the AANN-based
speaker veri ation systems using IBM and UBM is shown in Table 5.2. For mat hed
onditions, the approa h of IBM gave a relative redu tion of 24.91% in EER. For han-
nel and handset mismat h onditions, the performan e of the IBM is similar to that of
the UBM. The ee ts of handset mismat h between training and testing data an be
ompensated to some extent using handset-dependent ba kground model as des ribed
in the next se tion.
Table 5.2:Performan e omparison of speaker veri ation system

using IBM and UBM
Environment Between EER DCF
Training and Testing UBM IBM HIBM UBM IBM HIBM
Mat hed 10.16% 07.63% 08.07% 5.51 3.20 3.80

Channel Mismat h 28.64% 27.39% 28.50% 9.21 8.25 8.66
Handset Mismat h 39.97% 40.98% 37.44% 9.76 9.74 10.00
5.2.3 Handset-Dependent IBM (HIBM)

Experimental results indi ate that a model trained with the spee h data olle ted
from a handset type may give less error to the test utteran e olle ted from the same
handset type [112℄ [113℄. In su h a ase, the handset mismat h between training and
testing data will reje t the genuine laimants due to the presen e of pseudo- laimant
models, whose training utteran es are olle ted from the same handset type of the test
utteran e. The normalization pro edure of IBM ompares the laimant s ore with
45
the s ores of all the pseudo- laimant models. Instead, if we restri t our omparison
to the pseudo- laimant models, whose utteran es are olle ted over the same handset
of the laimant model, then the performan e of speaker veri ation system improves
as shown in Table 5.2 under the olumn HIBM. For handset mismat h ondition, the
EER is de reased from 40.98% to 37.44%. The in rease in the DCF values an be
attributed to the de rease in the population of IBM.
5.2.4 Ee t of Population of IBM

It is to be re alled that the rank of the laimant model is derived from a set of pseudo-
laimant models whi h are randomly pi ked from the development data. For a given
test utteran e, the probability that the laimant model will get rst rank and hen e
high onden e is 1=( + 1), where is the population of IBM. Thus, hoosing the
IBM population implies that a FA rate of 1=( +1) is in orporated by design. Fig.5.1
shows the ee t of population of IBM on the performan e of the speaker veri ation
system measured in terms of EER and DCF. The de rease in FA due to in rease in the
population of IBM is observed from the DCF urves. The EER urves do not seem
to be ae ted signi antly, as it is a measure of possible trade-o between the rate of
FA and FR. The use of IBM with large population will de rease FA, but at the ost
of in rease in the omputation time to test the pseudo- laimant models.
50 11
Handset Mismatch Handset Mismatch
10
40
9
Channel Mismatch
30 8
Channel
DCF
EER
7 Mismatch
20 6
Matched
5 Matched
10
4
Population of IBM Population of IBM
0 3
0 20 40 60 80 100 0 20 40 60 80 100
(a) (b)
Fig. 5.1: Ee t of population of IBM on (a) EER and (b) DCF.

46
The limitation of rank-based approa h of IBM lies in not onsidering the deviation
of laimant s ore from s ores of pseudo- laimant models. One an also obtain the
normalized s ore (Ns) using Ns = S where S is the s ore of the laimant model,

and , are the mean and varian e of the s ores of pseudo- laimant models. This
normalization may be helpful in redu ing the rate of false a eptan e and hen e may
minimize the value of DCF.
5.3 STRUCTURE OF AANN MODEL
The stru ture of AANN model plays an important role in apturing the distribution of
the given data. While the hidden units in mapping and demapping layer are respon-
sible for apturing a nonlinear subspa e, the units in the dimension ompression layer
redu es the dimension of the data. In other words, the number of units in the ompres-
sion layer determines the number of omponents aptured by the network. Stru ture
of the AANN model used in the previous studies is 19L38N 14N 38N 19L. Feature
ve tors extra ted from the spee h signal are proje ted onto the subspa e spanned by
K = 14 omponents to realize them at the output layer. The ee t of hanging the
number (K ) of these omponents on the performan e of the speaker veri ation system
is examined in this se tion.
5.3.1 Components Captured by the Network for Speaker Veri ation

Speaker veri ation experiments were ondu ted by systemati ally redu ing the num-
ber (K ) of units in the ompression layer from K = 10 to 1. Performan e of the speaker
veri ation system measured in terms of EER for dierent values of K is shown in
Table 5.3. These suggest that even for K = 4 ase the system seems to give reason-
ably good performan e in terms of EER. Experimental studies suggest that signi ant
variation in the spe trum of spee h is due to linguisti omponent [51℄. The next most
signi ant fa tor ontributing to spe tral hanges is due to speaker [51℄. The least
varying omponents in the feature ve tors representing spe trum may be attributed
47
Table 5.3: Performan e of speaker veri ation system measured in EER for dierent
values of K (number of units in the dimension ompression hidden layer)
Training and Testing K = 14 K = 10 K = 8 K = 6 K = 4 K = 3 K = 2 K = 1
Mat hed 8.07% 6.48% 6.45% 6.73% 6.69% 8.34% 10.45% 14.67%
Channel Mismat h 28.50% 21.38% 22.27% 19.31% 18.70% 20.00% 20.18% 24.01%
Handset Mismat h 37.44% 34.43% 30.53% 31.71% 30.26% 28.65% 29.47% 31.36%
to noisy hara teristi s of the hannel and handset. This may be the reason for im-
provement in performan e of the speaker veri ation system as the number of units in
the ompression layer is redu ed from K = 14 to 4. The gra eful degradation in the
performan e of the speaker veri ation system with still fewer (K < 4) omponents
indi ates that there may not be a rigid boundary between the omponents representing
spee h and speaker hara teristi s. If we assume that the feature ve tors form lusters
representing phonemes, then the speaker hara teristi s are re e ted in the way the
phoneme lusters are distributed. Performan e of the speaker veri ation system using
even one hidden unit (K = 1) in the ompression layer indi ates that the 1-D urve
passing through the distribution of the speaker's feature ve tors is signi ant enough
to provide dis rimination among speakers. It is important to note that the AANN
models apture these 1-D urves in the higher dimensional spa e in a nonparametri
mode. The results in Table 5.3 are obtained by using handset-dependent IBM.
5.4 SCORE NORMALIZATION
Due to the hannel and handset ee ts, there will be a shift in the distribution of
training and testing data. Thus, in mismat h onditions a trained model may give
large error for the test data of the same speaker. The reje tion of a genuine laim due
to the hannel or handset mismat h an be avoided either by suitable normalization
48
of the laimant s ore or by using a set of features less ae ted by hannel and handset
hara teristi s. In this se tion, we propose a method of normalizing the s ore obtained
by the AANN model. This method relies on the assumption that the shift in the dis-
tribution of test data of the same speaker may not be signi ant enough to label it as
an imposter utteran e. We show that this method yields signi ant improvement in
the performan e of the AANN-based speaker veri ation system.
Let denote a speaker model and Ii denote the s ore obtained by the model for
an utteran e (i) whi h does not belong to the speaker. Let I denote the mean of Ii.
I = (1=l) Ii
Pl
i =1
where l is the number of other speakers. Let S be the s ore obtained by a model for
a given test utteran e. We dene the normalized s ore
N = S =I = SP .
l
=l)
(1 Ii
i=1
The normalized s ore indi ates the loseness of S to I . In mismat h ondition, the
value of S may be large enough to reje t the genuine speaker. But the value of S in
these ases may not be too lose to I obtained by the same model, and hen e the value
of N may be a better measure to a ept or reje t the laim. Using a set of 25 speakers'
utteran es of NIST-97 database (the duration of ea h utteran e is 1/2 minute), this
normalization pro edure is applied to the s ores of the laimant and pseudo- laimant
models. Performan e of the speaker veri ation system has improved signi antly as
shown in Table 5.4. The set of 25 speakers data used for normalization is randomly
sele ted from the NIST-97 database. The use of NIST-97 database ensures that these
utteran es do not belong to any one of the laimant or pseudo- laimant models. The
improved performan e of the speaker veri ation system support our onje ture that
the shift in the distribution of feature ve tors of the test data for genuine speaker may
not be large enough to be an imposter data for the same speaker model. The s ore
normalization des ribed in this se tion is applied for all the studies reported hereafter.
49
Table 5.4: Performan e of speaker veri ation system with AANN models of
stru ture 19L38N 4N 38N 19L using S and N .
Training and Testing S N
Mat hed 6.69% 5.81%
5.5 TEMPORAL PROCESSING OF FEATURE VECTORS
NIST ondu ts an annual speaker re ognition evaluation. Most of the parti ipants in
these evaluations use GMM-based speaker veri ation system in whi h the time se-
quen es of the feature ve tors are pro essed using RASTA-like lters [67℄ [68℄ [69℄. The
ee t of su h te hnique on the performan e of the AANN-based speaker veri ation
system is examined in this se tion.
It may be re alled that during feature extra tion a windowing fun tion is used
to segment the spee h signal into frames of short duration. Then an all-pole model
is applied on ea h of the windowed signal to obtain 16 predi tor oeÆ ients. These
oeÆ ients are onverted into 19 epstral oeÆ ients to represent the log magnitude
spe trum of the spee h segment. Thus, an entire signal is represented by a set of
time sequen es of epstral oeÆ ients. These time sequen es arry information about
the phoneti ontent, speaker hara teristi s, a ousti distortion and noise. They also
arry ertain estimation errors for the following reasons. (1) Spe tral information
based on nite data involves a ertain random estimation error. (2) Relative position-
ing of ea h frame with respe t to pit h periods introdu es an estimation error [114℄. It
is shown that the suitable pro essing of the time sequen es of the spe tral parameters
improves the performan e of speaker veri ation [68℄.
50
The signi ant variations in the time sequen es of the spe tral parameters are
due to the phoneti ontent and the least variations may be due to the hannel and
handset. Thus the time sequen es of the spe tral parameters are passed through a
lter to eliminate some of these unwanted variations [7℄. In order to study the ee t
of su h te hnique in the AANN-based speaker veri ation system, the time sequen es
of the epstral oeÆ ients are pro essed. The epstral traje tories are passed through
a lowpass lter with a uto frequen y at 4 Hz. Note that the lter is applied on
ea h of the traje tories after subtra ting the mean value. Performan e of the AANN-
based speaker veri ation system is shown in Table 5.5. We observe that the temporal
pro essing of the epstral oeÆ ients improves the performan e of the AANN-based
speaker veri ation system to some extent. The idea of temporal pro essing is ex-
tended further by onsidering the transitional spe tral information. Delta epstral o-
eÆ ients are omputed from the ltered epstral traje tories [53℄. These delta epstral
oeÆ ients are appended to the 19 stati epstral oeÆ ients to form 38-dimensional
feature ve tor. The performan e of the AANN-based speaker veri ation system using
38-dimensional feature ve tors is shown in Table 5.6. The stru ture of the AANN used
in this experiment is 38L48N 12N 48N 38L.
Table 5.5: Performan e of speaker veri ation system before and after ltering
the epstral traje tories
Training and Testing After ltering Before ltering
Mat hed 5.58% 5.81%

The te hniques whi h we dis ussed so far to enhan e the performan e of the speaker
system for 230 speakers of NIST-99 database were applied on the evaluation database of
51
Table 5.6: Performan e omparison of speaker veri ation system using 38-D
feature ve tors ( epstral and delta epstral oeÆ ients) and 19-D feature ve -
tors (stati epstral oeÆ ients).
Training and Testing 38-D feature 19-D feature
Mat hed 5.77% 5.81%

NIST-2000. The evaluation database of NIST-2000 onsists of 1000 speakers (500 male
and 500 female) and 6000 test utteran es, with 11 laimants for ea h test utteran e [1℄.
Performan e of our AANN-based system and OGI's GMM-based system as evaluated
by NIST for hannel and handset mismat h onditions is shown in Table 5.7. The table
shows that AANN-based system needs to be rened further to mat h the performan e
of the GMM-based system. The enhan ement may be in the hoi e of the feature
ve tors and model parameters.
Table 5.7: Performan e omparison of AANN-based speaker veri ation sys-
tem with GMM-based speaker veri ation system on the database of 1000
speakers. These results are taken from [1℄.
Training and Testing AANN GMM

52
5.6 ONLINE IMPLEMENTATION OF SPEAKER VERIFICATION SYS-
TEM
Out of resear h interest, we developed an online version of the AANN-based speaker

veri ation system [111℄. In this version, ea h speaker has a ve layer AANN model
trained with the feature ve tors extra ted from one minute of spee h signal. For
veri ation, we used 10 se onds of data. The laim of a speaker is a epted or reje ted
based on the rank of the laimant model derived from a set of individual ba kground
models. On a pentium-III pro essor, the time taken for training a speaker model
is approximately ve minutes. The time taken for veri ation is approximately two
se onds. The online version of the AANN-based speaker veri ation system helped
us in evaluating the performan e of the speaker veri ation for dierent experiments
su h as babbling, use of dierent languages in training and testing phases, use of
dierent mi rophones et . While we have not ondu ted a systemati evaluation of
this version, the performan e of the online speaker veri ation system observed for a
set of 15 speakers using 25 individual ba kground models is shown in Table 5.8. For
a detailed dis ussion of this version of the AANN-based speaker veri ation system
refer to [111℄.
Table 5.8: Performan e of the online speaker veri ation system

No. of Mi rophone
Claimant tests for training FA FR
& testing in % in %
Genuine 45 Same - 4
Imposter 30 - 0 -
Genuine 45 Dierent - 15
53
5.7 SUMMARY
In this hapter, methods to improve the performan e of the AANN-based speaker

veri ation system are dis ussed. The rank-based approa h of IBM is proposed to
normalize the laimant s ores. It is observed that the approa h of IBM performs
better than that of UBM. The use of fewer units in the dimension ompression layer
is suggested to ompensate for the noise hara teristi s of the hannel and handset.
Performan e of the speaker veri ation is improved signi antly by normalizing the
s ore obtained by an AANN model.
54
CHAPTER 6
SUMMARY AND CONCLUSIONS
In this thesis, an attempt has been made to explore AANN models as an alternative to
GMM for text-independent speaker veri ation. A three layer AANN model with lin-
ear hidden units in the ompression layer aptures a linear subspa e. But if a nonlinear
a tivation fun tion su h as tanh(:) is used in the hidden units, the network lusters
the input data in the linear subspa e. This lustering behavior an be attributed to
the thresholding and saturation properties of the nonlinear a tivation fun tion. A ve
layer AANN model with nonlinear units in the hidden layers lusters the input data in
a nonlinear subspa e. The distribution apturing ability of three and ve layer AANN
models has been studied using a probability surfa e derived from the training error
surfa e aptured by the network in the input feature spa e. The probability surfa e
aptured by a ve layer AANN model an be viewed as nonparametri modeling of the
input data distribution. This property of a ve layer AANN model has been exploited
to develop a speaker veri ation system for onversational telephone spee h.
Performan e of the AANN-based speaker veri ation system an be improved by

addressing the issues related to the ba kground model, stru ture of AANN and mis-
mat h between training and testing data. The rank-based approa h of Individual
Ba kground Models (IBM) an be used to normalize the laimant s ore. This method
involves a set of randomly sele ted pseudo- laimant models. The use of large number
of pseudo- laimants de reases the false a eptan e but at the ost of in rease in the
testing time. Performan e of the speaker veri ation system has been studied by vary-
ing the number of units in the dimension ompression layer. Small number of units
redu e the noise hara teristi s of the hannel and the handset. The ee ts of hannel
55
and handset an be further redu ed by normalizing the s ore obtained by an AANN
model with the s ore obtained by the same model for impostors' data. This method
yields signi ant improvement in the performan e of the AANN-based speaker veri-
ation system. Finally, we have ompared the performan e of AANN-based speaker
veri ation system with that of GMM-based systems on a database of 1000 speakers.
The obje tive of this omparison is to show that AANN models an also be used for
speaker veri ation.
The following are the important on lusions of this work:

The error surfa e realized by an AANN model an be used to study the har-
a teristi s of the distribution of the input data aptured by the network.
A speaker veri ation system for a large database of onversational telephone
spee h an be developed using AANN models.
The hoi e of parameters su h as feature ve tors, initial weights and stru ture
of AANN is not very riti al, as variation of these parameters does not seem to
ae t the performan e of the AANN-based speaker veri ation system abruptly.
6.1 DIRECTIONS FOR FUTURE WORK
In this work linear predi tion epstral oeÆ ients are used as feature ve tors. It
is ne essary to study the ee tiveness of various parametri representations of
spee h for speaker veri ation purpose.
It is well known that suprasegmental features su h as intonation and duration
patterns have signi ant speaker hara teristi s and are also robust for hannel
variations. Therefore methods have to be found to in orporate the knowledge
of these features in a speaker veri ation system in the AANN framework.
56
APPENDIX A
ALGORITHM FOR SPEECH FRAME DETECTION
The algorithm used in this study to dete t the spee h frames is based on the amplitude
of the spee h signal in time-domain. It also assumes Gaussian distribution for the
amplitudes. The spee h signal is blo ked into frames using the spe ied framesize and
frameshift. The maximum positive amplitude in ea h frame is determined. The sum
of mean and a fra tion (1/10) of the standard deviation of these positive amplitudes
is onsidered as the maximum amplitude value in the spee h signal. Ten per ent of
the maximum amplitude is taken as the threshold for a frame to be onsidered as a
spee h frame. There is also a ondition that at least 10% of the frames should be
nonspee h frames. Hen e, when the number of nonspee h frames is less than than this
per entage of the total number of frames, the threshold is progressively in reased till
the minimum spe ied number of frames are obtained. Finally, the frames above the
threshold are given a weightage 1, whereas the remaining frames are given a weightage
of 0.
APPENDIX B
LINEAR PREDICTION ANALYSIS
In LP analysis of spee h, an all-pole model is assumed for the system produ ing spee h
signal s(n). A pth order all-pole model assumes that sample value at time n an be
approximated by linear ombination of past p samples. i.e.,
p
X
s(n) ak s(n k) (A.1)
k=1
If s^(n) denotes the predi tion made by the all-pole model then, the predi tion error is
given by,
p
X
e(n) = s(n) s^(n) = s(n) ak s(n k) (A.2)
k=1
For a spee h frame of size m samples, the mean square of predi tion error over the
whole frame is given by,
X X p
X
E= e (m) =
2
[s(m) ak s(m k)℄ 2
(A.3)
m m k =1
Optimal predi tor oeÆ ients will minimize this mean square error. At minimum value
of E ,
E
= 0 , k = 1; 2; :::p:
ak
(A.4)
Dierentiating Eqn A.3 and equating to zero we get,
Ra=r (A.5)
where, a = [a a ap℄T , r = [r(1) r(2) r(p)℄T , and R is a Toeplitz symmetri
1 2
auto orrelation matrix given by,

2 3
66 r(0) r(1) r(p 1) 7
7
66 r(1) r(0) r(p 2) 7
7
R = 66 ...
... ... (A.6) 7
7
66 7
7
4 5
r(p 1) r(0)
Eqns A.5 an be solved for predi tion oeÆ ients using Durbin's algorithm as follows:
E (0) = r[0℄ (A.7)

PL
r[i℄ j j
i
r[ji 1 ( 1)
j j℄
ki = E i
=1
( 1)
1ip (A.8)
ii = ki (A.9)
ji = ji ki ii j
( 1) ( 1)
(A.10)
E (i) = (1 ki ) E i 2 ( 1)
(A.11)
The above set of equations are solved re ursively for i = 1; 2; :::; p. The nal solution
is given by
am = mp 1mp ( )
(A.12)
where, am 's are linear predi tive oeÆ ients (LPCs).
Cepstral oeÆ ients an be extra ted from the predi tor oeÆ ients using re ursive
algorithm as follows.
0 = ln 2
(A.13)
mX
= am + mk k am
1
m k 1mp (A.14)
k=1
m
X1 k
= a
m k m k m>p (A.15)
k=1
59
BIBLIOGRAPHY
[1℄ NIST, \Speaker re ognition workshop notebook," Pro . NIST 2000 Speaker Re ogni-
tion Workshop, University of Maryland, USA, Jun 26-27 2000.
[2℄ G. R. Doddington, \Speaker re ognition{identifying people by their voi es," Pro .
IEEE, vol. 73, pp. 1651{1664, Nov. 1985.
[3℄ C. M. Bishop, Neural networks for pattern re ognition. New York: Oxford University
Press In ., 1995.
[4℄ R. O. Duda and P. E. Hart, Pattern Classi ation and S ene Analysis. New York:
John Wiley & Sons, In ., 1973.
[5℄ A. P. Dempster, N. M. Laird, and D. B. Rubin, \Maximum likelihood from in omplete
data via the EM algorithm," J. Royal Statist. So . Ser. B. (methodologi al), vol. 39,
pp. 1{38, 1977.
[6℄ A. K. Jain, R. P. W. Duin, and J. Mao, \Statisti al pattern re ognition: A review,"
IEEE Trans. Pattern Analysis and Ma hine Intelligen e, vol. 22, pp. 4{37, Jan. 2000.
[7℄ S. van Vuuren, Speaker Re ognition in a Time-Frequen y Spa e. PhD dissertation,
Orgeon Graduate Institute of S ien e and Te hnology, Department of Ele tri al and
Computer Engg., Portland, Mar. 1999.
[8℄ A. L. Higgins, L. Bahler, and J. Porter, \Voi e identi ation using nonparametri
density mat hing," in Automati Spee h and Speaker Re ognition (C.-H. Lee, F. K.
Soong, and K. K. Paliwal, eds.), h. 9, pp. 211{232, Boston: Kluwer A ademi , 1996.
[9℄ B. Yegnanarayana, S. P. Kishore, and A. V. N. S. Anjani, \Neural network models for
apturing probability distribution of training data," in Int. Conferen e on Cognitive
and Neural Systems, (Boston), p. 6 (A), 2000.
[10℄ D. A. Reynolds and et al., \The ee ts of telephone transmission degradations on
speaker re ognition performan e," in Pro eedings of IEEE Int. Conf. A oust., Spee h,
and Signal Pro essing, pp. 329{332, 1995.
[11℄ NIST, \Speaker re ognition workshop notebook," Pro . NIST 1999 Speaker Re ogni-
tion Workshop, University of Maryland, USA, Jun 3-4 1999.
[12℄ D. A. Reynolds, \The ee ts of handset variability on speaker re ognition performan e:
Experiments on swit h board orpus," in Pro eedings of IEEE Int. Conf. A oust.,
Spee h, and Signal Pro essing, pp. 113{116, 1996.
[13℄ A. E. Rosenberg, \Automati speaker veri ation: A review," Pro . IEEE, vol. 64,
pp. 475{487, Apr. 1976.
60
[14℄ B. S. Atal, \Automati re ognition of speakers from their voi es," Pro . IEEE, vol. 64,
pp. 460{475, Apr. 1976.
[15℄ D. O'Shaughnessy, \Speaker re ognition," IEEE ASSP Magazine, pp. 4{17, 1986.
[16℄ A. Sutherland and M. Ja k, \Speaker veri ation," in Aspe ts of Spee h Te hnology
(M. A. Ja k and J. Laver, eds.), h. 4, pp. 184{215, Edenburgh: University Press, 1988.
[17℄ J. M. Naik, \Speaker veri ation: A Tutorial," IEEE Communi ations Magazine,
pp. 42{48, Jan. 1990.
[18℄ A. E. Rosenberg and F. K. Soong, \Re ent resear h in automati speaker re ognition,"
in Advan es in Spee h signal pro essing (S. Furui and M. M. Sondhi, eds.), h. 3,
pp. 701{740, New York: Mar el Dekker, 1992.
[19℄ H. Gish and M. S hmidt, \Text{independent speaker identi ation," IEEE Signal Pro-
essing Magazine, pp. 18{32, O t. 1994.
[20℄ R. J. Mammone, X. Zhang, and R. P. Rama handran, \Robust speaker re ognition: A
feature{based approa h," IEEE Signal Pro essing Magazine, vol. 13, pp. 58{71, Sept.
1996.
[21℄ S. Furui, \An overview of speaker re ognition te hnology," in Automati Spee h and
Speaker Re ognition (C.-H. Lee, F. K. Soong, and K. K. Paliwal, eds.), h. 2, pp. 31{56,
Boston: Kluwer A ademi , 1996.
[22℄ S. Furui, \Re ent advan es in speaker re ognition," Pattern Re ognition Lett., vol. 18,
pp. 859{872, 1997.
[23℄ J. P. Campbell, \Speaker re ognition: A tutorial," Pro . IEEE, vol. 85, pp. 1436{1462,
Sept. 1997.
[24℄ H. Sakoe, \Two level DP-mat hing - A dynami programming based pattern mat h-
ing algorithm for onne ted word re ognition," IEEE Trans. A oust., Spee h, Signal
Pro essing, vol. ASSP-27, pp. 588{595, 1998.
[25℄ V. N. Sorovin, \Determination of vo al tra t shape for vowels," Spee h Comm., vol. 11,
pp. 71{85, 1992.
[26℄ V. N. Sorovin, \Inverse problems for fri atives," Spee h Comm., vol. 14, pp. 249{262,
1994.
[27℄ J. J. Wolf, \EÆ ient a ousti parameters for speaker re ognition," J. A oust. So .
Amer., vol. 52, no. 6, pp. 2044{2056, 1972.
[28℄ M. M. Sondhi, \New methods of pit h extra tion," IEEE Trans. A oust., Spee h, Signal
Pro essing, vol. AU-16, pp. 262{266, 1968.
[29℄ A. M. Noll, \Cepstrum pit h determination," J. A oust. So . Amer., vol. 41, no. 2,
pp. 293{309, 1970.
61
[30℄ T. V. Ananthapadmanabha and B. Yegnanarayana, \Epo h extra tion of voi ed
spee h," IEEE Trans. A oust., Spee h, Signal Pro essing, vol. ASSP-27, pp. 562{570,
1975.
[31℄ W. Hess, Pit h determination of spee h signals, Algorithms and Devi es. Springer-
Verlag, 1983.
[32℄ B. S. Atal, \Automati speaker re ognition based on pit h ontours," J. A oust. So .
Amer., vol. 52, no. 6, pp. 1687{1697, 1972.
[33℄ B. Yegnanarayana, B. Madhukumar, and V. Rama handran, \Robust features for ap-
pli ation in spee h and speaker re ognition," in Pro . ESCA-ETRW on Spee h Pro .
in AD. CON., (Cannes), 1992.
[34℄ J. D. Markel, B. T. Oshika, and A. H. Gray, \Long-term feature averaging for speaker
re ognition," IEEE Trans. A oust., Spee h, Signal Pro essing, vol. 25, pp. 330{337,
Aug. 1977.
[35℄ J. E. Lu k, \Automati speaker veri ation using epstral measurements," J. A oust.
So . Amer., vol. 46, no. 4, pp. 1026{1032, 1969.
[36℄ T. Matsui and S. Furui, \Speaker lustering for spee h re ognition using the parameters
hara terizing vo al{tra t dimensions," in Pro eedings of Int. Conf. Spoken Language
Pro essing, pp. 137{140, 1990.
[37℄ H. M. Dante and V. V. S. Sharma, \Automati speaker re ognition for a large pop-
ulation," IEEE Trans. A oust., Spee h, Signal Pro essing, vol. 27, pp. 255{263, Jun.
1979.
[38℄ B. Yegnanarayana, \Formant extra tion from linear predi tion phase," J. A oust. So .
Amer., vol. 63, pp. 1638{1640, 1978.
[39℄ H. A. Murthy, K. V. Madhu Murthy, and B. Yegnanarayana, \Formant extra tion from
phase using weighted group delay fun tions," Ele tron. Lett., vol. 25, pp. 1609{1611,
1989.
[40℄ S. K. Das and W. S. Mohn, \A s heme for spee h pro essing in automati speaker
veri ation," IEEE Trans. A oust., Spee h, Signal Pro essing, vol. AU-19, pp. 32{43,
1971.
[41℄ G. Doddington, \A method of speaker veri ation," J. A oust. So . Amer., vol. 49,
p. 139 (A), 1971.
[42℄ J. W. Glenn and N. Kleiner, \Speaker identi ation based on nasal phonation," J.
A oust. So . Amer., vol. 43, no. 2, pp. 368{372, 1968.
[43℄ L. S. Su, K. P. Li, and K. S. Fu, \Identi ation of speakers by the use of nasal oar-
ti ulation," J. A oust. So . Amer., vol. 56, pp. 1876{1882, 1974.
[44℄ R. C. Lummis, \Speaker veri ation by omputer using spee h intensity for tempo-
ral registration," IEEE Trans. A oust., Spee h, Signal Pro essing, vol. AU-21, no. 2,
pp. 80{89, 1973.
62
[45℄ E. Bunge, \Automati speaker re ognition system auros for se urity systems and foren-
si voi e identi ation," in Pro . 1977 Int. Conf. on Crime Countermeasures, pp. 1{7,
1977.
[46℄ S. Furui, F. Itakura, and S. Satio, \Talker re ognition by long-time averaged spee h
spe trum," Ele tron Commun., Jap., vol. 55-A, pp. 54{61, 1972.
[47℄ H. Hollien and W. Majewski, \Speaker identi ation by long-term spe tra under nor-
mal onditions," J. A oust. So . Amer., vol. 62, no. 4, pp. 975{980, 1977.
[48℄ J. Makhoul, \Linear predi tion: A tutorial review," Pro . IEEE, vol. 63, pp. 561{580,
Apr. 1975.
[49℄ B. S. Atal, \Ee tiveness of linear predi tion hara teristi s of the spee h wave for
automati speaker identi ation and veri ation," J. A oust. So . Amer., vol. 55,
pp. 1304{1312, Jun. 1974.
[50℄ A. E. Rosenberg and M. Sambur, \New te hniques for automati speaker veri ation,"
IEEE Trans. A oust., Spee h, Signal Pro essing, vol. 23, no. 2, pp. 169{175, 1975.
[51℄ M. R. Sambur, \Speaker re ognition using orthogonal linear predi tion," IEEE Trans.
A oust., Spee h, Signal Pro essing, vol. 24, pp. 283{289, Aug. 1976.
[52℄ J. Naik and G. R. Doddington, \High performan e speaker veri ation using prin ipal
spe tral omponents," in Pro eedings of IEEE Int. Conf. A oust., Spee h, and Signal
Pro essing, pp. 881{884, 1986.
[53℄ S. Furui, \Cepstral analysis te hnique for automati speaker veri ation," IEEE Trans.
A oust., Spee h, Signal Pro essing, vol. 29, pp. 254{272, Apr. 1981.
[54℄ L. R. Rabiner and B. H. Juang, Fundamentals of Spee h Re ognition. Prenti e-Hall,
1993.
[55℄ L. R. Rabiner and R. W. S hafer, Digital Pro essing of Spee h Signals. Prenti e-Hall,
1978.
[56℄ M. M. Homayounpour and G. Chollet, \A omparison of some relevant parametri
representations for speaker veri ation," in ESCA Workshop on speaker Re ognition,
Identi ation, and Veri ation, pp. 185{188, Apr. 1994.
[57℄ G. R. Doddington, M. A. Pryzbo ki, A. F. Martin, and D. A. Reynolds, \The NIST
speaker re ognition evaluation - Overview, methodology, systems, results, perspe tive,"
Spee h Comm., vol. 31, pp. 225{254, Jun. 2000.
[58℄ H. Wakita, \Residual energy of linear predi tion applied to vowel and speaker re og-
nition," IEEE Trans. A oust., Spee h, Signal Pro essing, vol. 24, pp. 270{271, 1976.
[59℄ P. Thevenaz and H. Hugli, \Usefulness of lp -residue in text-independent speaker ver-
i ation," Spee h Comm., vol. 17, pp. 145{157, 1995.
63
[60℄ C. Liu, M. Lin, W. Wang, and H. Wang, \Study of line-spe trum pair frequen ies for
speaker re ognition," in Pro eedings of IEEE Int. Conf. A oust., Spee h, and Signal
Pro essing, pp. 277{280, 1990.
[61℄ H. A. Murthy, F. Beaufays, L. P. He k, and M. Weintraub, \Robust text-independent
speaker identi ation over telephone hannels," IEEE Trans. A oust., Spee h, Signal
Pro essing, vol. 7, pp. 554{568, 1999.
[62℄ Y. Gong and J. Haton, \Nonlinear ve tor interpolation for speaker re ognition," in
Pro eedings of IEEE Int. Conf. A oust., Spee h, and Signal Pro essing, vol. 2, (San-
Fran is o, California, USA), pp. 173{176, Mar. 1992.
[63℄ H. Hermansky and N. Malaynath, \Speaker veri ation using speaker-spe i map-
ping," in RLA2C, (Avigon,Fran e), Apr. 1998.
[64℄ H. Misra, M. S. Ikbal, and B. Yegnanarayana, \Spe tral mapping as a feature
for speaker re ognition," in National Conferen e on Communi ations(NCC), (IIT,
Kharagpur), pp. 151{156, Jan 29-31 1999.
[65℄ H. Misra, Development of a Mapping Feature for Speaker Re ognition. MS dissertation,
Indian Institute of Te hnology, Department of Ele tri al Engg., Madras, May 1999.
[66℄ D. A. Reynolds, \Experimental evaluation of features for robust speaker identi ation,"
IEEE Trans. A oust., Spee h, Signal Pro essing, vol. 2, no. 4, pp. 639{643, 1994.
[67℄ H. Hermansky and N. Morgan, \RASTA pro essing of spee h," IEEE Trans. A oust.,
Spee h, Signal Pro essing, vol. 2, no. 4, pp. 578{579, 1994.
[68℄ S. van Vuuren and H. Hermansky, \Data driven design of RASTA-like lters," in
Eurospee h, (Gree e), pp. 409{412, 1997.
[69℄ N. Malayath, Data-Driven Methods for Extra ting Features from Spee h. PhD disser-
tation, Orgeon Graduate Institute of S ien e and Te hnology, Department of Ele tri al
and Computer Engg., Portland, Jan. 2000.
[70℄ H. Gish, \Robust dis rimination in automati speaker identi ation," in Pro eedings
of IEEE Int. Conf. A oust., Spee h, and Signal Pro essing, pp. 289{292, 1990.
[71℄ B. H. Juang, \Past, present and future of spee h pro essing," IEEE ASSP Magazine,
pp. 24{48, May 1998.
[72℄ N. Morgan and H. Bourlard, \Continuous spee h re ognition," IEEE ASSP Magazine,
pp. 25{42, May 1995.
[73℄ W. A. Hargreaves and J. A. Starkweather, \Re ognition of speaker identity," Language
and Spee h, vol. 6, pp. 63{67, 1963.
[74℄ K. P. Li and G. W. Hughes, \Talker dieren es as they appear in orrelation matri es
of ontinuous spee h spe tra," J. A oust. So . Amer., vol. 55, no. 4, pp. 833{837, 1974.
64
[75℄ A. L. Higgins, L. G. Bahler, and J. E. Porter, \Voi e identi ation using nearest-
neighbour distan e measure," in Pro eedings of IEEE Int. Conf. A oust., Spee h, and
Signal Pro essing, vol. 2, pp. 375{378, 1993.
[76℄ F. K. Soong, A. E. Rosenberg, L. R. Rabiner, and B. H. Juang, \A ve tor quantization
approa h to speaker re ognition," in Pro eedings of IEEE Int. Conf. A oust., Spee h,
and Signal Pro essing, pp. 387{390, 1985.
[77℄ A. E. Rosenberg and F. K. Soong, \Evaluation of a ve tor quantization talker re ogni-
tion system in a text independent and text dependent modes," in Pro eedings of IEEE
Int. Conf. A oust., Spee h, and Signal Pro essing, pp. 873{876, 1986.
[78℄ R. E. Helms, Speaker re ognition using linear predi tion odebooks. PhD dissertation,
Southern Methodist University, 1981.
[79℄ E. Dorsey and J. Bernstein, \Inter-speaker omparison of lp a ousti spa e using a
minmax distortion measure," in Pro eedings of IEEE Int. Conf. A oust., Spee h, and
Signal Pro essing, pp. 16{19, 1981.
[80℄ K. P. Li and E. H. Wren h Jr., \An approa h to text-independent speaker re ognition
with short utteran es," in Pro eedings of IEEE Int. Conf. A oust., Spee h, and Signal
Pro essing, pp. 555{558, 1983.
[81℄ K. Shikano, \Text-independent speaker re ognition experiments using odebooks in
ve tor quantization," J. A oust. So . Amer., vol. 77, p. S11 (A), 1985.
[82℄ J. T. Bu k and et al., \Text-dependent speaker re ognition using ve tor quantization,"
in Pro eedings of IEEE Int. Conf. A oust., Spee h, and Signal Pro essing, pp. 391{394,
1985.
[83℄ F. K. Soong and A. E. Rosenberg, \On the use of instantaneous and transitional
spe tral information in speaker re ognition," in Pro eedings of IEEE Int. Conf. A oust.,
Spee h, and Signal Pro essing, pp. 877{890, 1986.
[84℄ D. A. Reynolds, \Speaker identi ation and veri ation using gaussian mixture mod-
els," Spee h Comm., vol. 17, pp. 91{108, Aug. 1995.
[85℄ R. A. Redner and H. F. Walker, \Mixture densities, maximum likelihood and the EM
algorithm," SIAM Review, vol. 26, pp. 195{239, 1984.
[86℄ D. A. Reynolds, \Comparison of ba kground normalization methods for text{
independent speaker veri ation," in Eurospee h, (Gree e), pp. 963{966, 1997.
[87℄ R. P. Lippmann, \An introdu tion to omputing with neural nets," IEEE ASSP Mag-
azine, vol. 4, pp. 4{22, Apr. 1989.
[88℄ B. Yegnanarayana, Arti ial Neural Networks. New Delhi: Prenti e-Hall of India,
1999.
[89℄ S. Haykin, Neural networks: A omprehensive foundation. New Jersey: Prenti e-Hall
In ., 1999.
65
[90℄ Y. Bennani and P. Gallinari, \Neural networks for dis rimination and modelization of
speakers," Spee h Comm., vol. 17, pp. 159{175, 1995.
[91℄ Y. Bennani, \Speaker identi ation through a modular onne tionist ar hite ture:
Evaluation on the TIMIT database," in ICSLP, pp. 607{610, 1992.
[92℄ J. Oglesby and J. S. Mason, \Optimisation of neural models for speaker identi ation,"
1990.
[93℄ M. Gori and F. S arselli, \Are multilayer per eptrons adequate for pattern re ognition
and veri ation," IEEE Trans. Pattern Analysis and Ma hine Intelligen e, vol. 20,
pp. 1121{1132, Nov. 1998.
[94℄ J. Oglesby and J. S. Mason, \Radial basis fun tion networks for speaker re ognition,"
1991.
[95℄ M. S. Ikbal, H. Misra, and B. Yegnanarayana, \Analysis of autoasso iative mapping
neural networks," in Int. Joint Conf. on Neural Networks, (Washington, USA), 1999.
[96℄ M. Gori, L. Lastru i, and G. Soda, \Autoasso iator-based models for speaker veri-
ation," Pattern Re ognition Lett., vol. 17, pp. 241{250, 1996.
[97℄ M. S. Ikbal, Autoasso iative Neural Network Models for Speaker Veri ation. MS
dissertation, Indian Institute of Te hnology, Department of Computer S ien e and
Engg., Madras, May 1999.
[98℄ H. Bourlard and Y. Kamp, \Auto-asso iation by multilayer per eptrons and singular
value de omposition," Biol. Cybernet., vol. 59, pp. 291{294, 1988.
[99℄ K. I. Diamantaras and S. Y. Kung, Prin ipal Component Neural networks: Theory and
Appli ations. New York: John Wiley & Sons In ., 1996.
[100℄ M. A. Kramer, \Nonlinear prin ipal omponent analysis using autoasso iative neural
networks," AIChE, vol. 37, pp. 233{243, Feb. 1991.
[101℄ M. Bian hini, P. Fras oni, and M. Gori, \Learning in multilayered networks used as
autoasso iators," IEEE Trans. Neural Networks, vol. 6, pp. 512{515, Mar. 1995.
[102℄ P. Baldi and K. Hornik, \Neural networks and prin ipal omponent analysis: Learning
from examples without lo al minima," IEEE Trans. Neural Networks, vol. 2, pp. 53{58,
1989.
[103℄ E. C. Malthouse, \Limitations of nonlinear PCA as performed with generi neural
networks," IEEE Trans. Neural Networks, vol. 9, pp. 165{173, Jan. 1998.
[104℄ S. P. Kishore and B. Yegnanarayana, \Speaker veri ation: Minimizing the hannel
ee ts using autoasso iative neural network models," in Pro eedings of IEEE Int. Conf.
A oust., Spee h, and Signal Pro essing, (Istanbul), pp. 1101{1104, 2000.
66
[105℄ D. O'Shaughnessy, Spee h Communi ation-Human and ma hine. Addison-Wesley,
1987.
[106℄ M. H. Hassoun, Fundamentals of Arti ial Neural networks. New Delhi: Prenti e-Hall
of India, 1998.
[107℄ J. Oglesby, \What's in a number ? Moving beyond the equal error rate," Spee h
Comm., vol. 17, pp. 193{208, Aug. 1995.
[108℄ A. L. Higgins and R. E. Wohlford, \A method of text-independent speaker re ognition,"
1986.
[109℄ A. E. Rosenberg and S. Parthasarathy, \Speaker ba kground models for onne ted digit
password speaker veri ation," in Pro eedings of IEEE Int. Conf. A oust., Spee h, and
Signal Pro essing, pp. 81{84, 1996.
[110℄ R. A. Finan, A. T. Sapeluk, and R. I. Damper, \Imposter ohort sele tion for s ore
normalization in speaker veri ation," Pattern Re ognition Lett., vol. 18, pp. 881{888,
1997.
[111℄ S. P. Kishore and B. Yegnanarayana, \Online text-independent speaker veri ation
system at IITM," in Pro eedings of Int. Conf. on Multimedia Pro essing and Systems,
(IIT Madras, INDIA), pp. 178{180, Aug. 2000.
[112℄ S. P. Kishore and B. Yegnanarayana, \Identi ation of handset type using autoas-
so iative neural network models," in Fourth Int. Conferen e on Advan es in Pattern
Re ognition and Digital Te hniques, (ISI, Cal utta, INDIA), pp. 353{356, De . 1999.
[113℄ L. P. He k and M. Weintraub, \Handset{dependent ba kground models for robust
text{independent speaker re ognition," in Pro eedings of IEEE Int. Conf. A oust.,
Spee h, and Signal Pro essing, (Muni h, Germany), April 1997.
[114℄ C. Nadeu, P. Pa hes-Leal, and B.-H. Juang, \Filtering the time sequen es of spe tral
parameters for spee h re ognition," Spee h Comm., vol. 22, pp. 315{332, 1997.
67
LIST OF PUBLICATIONS
REFERRED JOURNALS
1. S. P. Kishore and B. Yegnanarayana, \Speaker veri ation using autoasso iative
neural network models," IEEE Trans. Spee h and Audio Pro essing ( ommuni-
ated).
2. B. Yegnanarayana and S. P. Kishore, \AANN - An alternative to GMM for
pattern re ognition," Neural Networks ( ommuni ated).
PRESENTATIONS IN CONFERENCES
1. S. P. Kishore and B. Yegnanarayana, \Identi ation of handset type using au-
toasso iative neural network models," in Fourth Int. Conferen e on Advan es
in Pattern Re ognition and Digital Te hniques, (ISI, Cal utta, INDIA), pp. 353{
356, De . 1999.
2. S. P. Kishore and B. Yegnanarayana, \Speaker veri ation: Minimizing the
hannel ee ts using autoasso iative neural network models," in Pro eedings of
IEEE Int. Conf. A oust., Spee h, and Signal Pro essing, (Istanbul), pp. 1101{
1104, 2000.
3. B. Yegnanarayana, S. P. Kishore, and A. V. N. S. Anjani, \Neural network mod-
els for apturing probability distribution of training data," in Int. Conferen e
on Cognitive and Neural Systems, (Boston), p. 6 (A), 2000.
4. S. P. Kishore and B. Yegnanarayana, \Online text-independent speaker veri-
ation system at IITM," in Pro eedings of Int. Conf. on Multimedia Pro essing
and Systems, (IIT Madras, INDIA), pp. 178{180, 2000.
5. B. Yegnanarayana, K. Sharat Reddy and S. P. Kishore, \Sour e and System
Features for Speaker Re ognition using AANN models", in Pro eedings of IEEE
Int. Conf. A oust., Spee h, and Signal Pro essing, 2001.
6. S. P. Kishore, B. Yegnanarayana and Suryakanth V. Gangashetty, \Online Text-
Independent Speaker Veri ation System using Autoasso iative Neural Network
Models", in Pro eedings of IEEE Int. Joint Conf. on Neural Networks, 2001.
69
GENERAL TECHNICAL COMMITTEE
Head: Prof. C. Pandurangan

Guide: Prof. B. Yegnanarayana
Members:
Dr. D. Janakiram (Dept. of CSE)
Dr. M. Giridhar (Dept. of EE)
70
BIODATA
Name: S.P. Kishore

Roll No: CS-98M14
Date of Birth: 21 Mar 1977
Edu ational Quali ations:
B.E (Computer S ien e) - 1998
De an College of Engineering and Te hnology
Osmania University
Address for orresponden e:
S/0. S. N. Murthy
Venkata Ramana Nilayam
Opp: CSI Chru h
ADONI - 518301
Andhra Pradesh
Tel: 08512 53936
71

Kishore Prahallad MS Thesis

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Kishore Prahallad MS Thesis

Uploaded by

Copyright:

Available Formats

SPEAKER VERIFICATION USING AUTOASSOCIATIVE

NEURAL NETWORK MODELS

for the award of the degree

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Chennai-36 Prof. B. Yegnanarayana

First and foremost, I am grateful to my supervisor Prof. B. Yegnanarayana and my

I would like to express my sin ere gratitude to Prof. C. Pandurangan, Head of

I onsider myself to be lu ky to have wonderful seniors. My intera tions with my

I am grateful to Suryakanth V. Gangashetty, without whom it would not have

Thesis erti ate i

1.1 Graphi al sket h of human vo al system. . . . . . . . . . . . . . . . . 1

ANN - Arti ial Neural Network

INTRODUCTION TO SPEAKER RECOGNITION

Fig. 1.1: Graphi al sket h of human vo al system.

1.2.1 Extra tion of Speaker Information

1.2.2 Probabilisti Modeling of Speaker Features

1.2.3 De ision Logi to Implement Identi ation or Veri ation

1.3 ISSUES ADDRESSED IN THE THESIS

In this thesis, we address the issues involved in developing a text-independent speaker

1.4 ORGANIZATION OF THE THESIS

The organization of the thesis is as follows:

APPROACHES FOR SPEAKER RECOGNITION

2.2 FEATURES TO REPRESENT SPEAKER INFORMATION

The general requirements of the features representing speaker information noted

2.2.4 Short-Term Spe tral Features

2.2.4.1 Linear Predi tion CoeÆ ients

2.2.4.2 Cepstral CoeÆ ients

2.2.4.4 LP Residual Features

Wakita [58℄ reported an experiment using LP residual energy on vowel re ognition

2.2.4.5 Other Segmental Features

2.3 MODELS FOR SPEAKER RECOGNITION

Early studies on text-independent speaker re ognition used long-term averaging of

2.3.1 Approa h of Nearest Neighbor

2.3.2 Ve tor Quantization

2.3.3 Gaussian Mixture Models

2.3.4 Arti ial Neural Network Models

2.4 NEED FOR NEW MODELS

CHARACTERISTICS OF AUTOASSOCIATIVE NEURAL

Prin ipal Component Analysis (PCA) is a method of representing the distribution of

E fxg = 0 and ovarian e matrix R = E fxxT g, where E is expe tation operator. In

3.3 THREE LAYER AANN MODEL

formed by the N input ve tors, and let Y = [y ; y ; ; yN ℄ be the M N matrix

fun tion at the hidden units.

3.3.1 Linear A tivation Fun tion at the Hidden Units

X is p < M , the produ t X minimizing J is the best rank (p) approximation of

3.3.2 Nonlinear A tivation Fun tion at the Hidden Units

To nd the range of  ( 2 R ) for whi h X is -autoasso iated with , onsider the

3.4 FIVE LAYER AANN MODEL

Fig. 3.2: Five layer AANN model.

AANN-BASED SPEAKER VERIFICATION SYSTEM

4.2 DESCRIPTION OF SPEECH DATABASE

4.3.1 Prepro essing of Spee h Signal

4.3.2 Extra tion of Linear Predi tion Cepstral CoeÆ ients

4.3.3 Cepstral Weighting and Mean Subtra tion

4.4 GENERATION OF SPEAKER MODELS

 The ith element of the input ve tor is denoted by xi(n).

4.5 VERIFICATION PROCEDURE

Mean square error

Fig. 4.2: Distribution of laimant s ores.

Mat hed 15.07% 10.16% 10.0 5.51

To normalize the laimant s ore, the approa h of Universal Ba kground Model

Fig. 4.3: Distribution of laimant s ores after ba kground normalization.

4.6 COMPARISON WITH GMM-BASED SPEAKER VERIFICATION

X is p < M , the produ t X minimizing J is the best rank (p) approximation of

To nd the range of ( 2 R ) for whi h X is -autoasso iated with , onsider the

The ith element of the input ve tor is denoted by xi(n).

5.2.4 Ee t of Population of IBM

Fig. 5.1: Ee t of population of IBM on (a) EER and (b) DCF.