
TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS


TEXT-PROMPTED REMOTE SPEAKER AUTHENTICATION

By:
GANESH TIWARI (063/BCT/510)
MADHAV PANDEY (063/BCT/514)
MANOJ SHRESTHA (063/BCT/518)

A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS
AND COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE BACHELOR'S DEGREE IN COMPUTER
ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
LALITPUR, NEPAL


January, 2011
TRIBHUVAN UNIVERSITY
INSTITUTE OF ENGINEERING
PULCHOWK CAMPUS
DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

The undersigned certify that they have read, and recommend to the Institute of Engineering for
acceptance, a project report entitled "Text-Prompted Remote Speaker Authentication" submitted
by Ganesh Tiwari, Madhav Pandey and Manoj Shrestha in partial fulfillment of the requirements
for the Bachelor's degree in Computer Engineering.


__________________________________
Supervisor, Dr. Subarna Shakya
Associate Professor
Department of Electronics and Computer Engineering


__________________________________
Internal Examiner,


_________________________________
External Examiner,

DATE OF APPROVAL:
COPYRIGHT

The authors have agreed that the Library, Department of Electronics and Computer Engineering,
Pulchowk Campus, Institute of Engineering may make this report freely available for inspection.
Moreover, the authors have agreed that permission for extensive copying of this project report for
scholarly purposes may be granted by the supervisors who supervised the project work recorded
herein or, in their absence, by the Head of the Department wherein the project work was done. It
is understood that recognition will be given to the authors of this report and to the Department
of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use
of the material of this project report. Copying, publication, or any other use of this report for
financial gain without the approval of the Department of Electronics and Computer Engineering,
Pulchowk Campus, Institute of Engineering and the authors' written permission is prohibited.
Requests for permission to copy or to make any other use of the material in this report in whole or
in part should be addressed to:

Head
Department of Electronics and Computer Engineering
Pulchowk Campus, Institute of Engineering
Lalitpur, Kathmandu
Nepal

ACKNOWLEDGEMENT

We are very thankful to the Institute of Engineering (IOE), Pulchowk Campus for offering the
course on the major project. We also thank all the teachers and staff of the Department of
Electronics and Computer Engineering who assisted during the project period with suggestions
and lectures on subject matter related to carrying out the project and achieving its goals.
We are very much obliged to Dr. Subarna Shakya, Department of Electronics and Computer
Engineering, IOE Pulchowk Campus, for the inspiration and valuable suggestions that we received
throughout the working period.
We would like to thank the forum members of askmeflash.com, stackoverflow.com, and
dsprelated.com for their quick responses and valuable opinions on our queries.
We also express our gratitude to all the friends and juniors who helped a great deal with training
data collection.
Members of Project
Ganesh Tiwari (063BCT510)
Madhav Pandey (063BCT514)
Manoj Shrestha (063BCT518)

IOE, PULCHOWK CAMPUS

ABSTRACT
A biometric is a physical characteristic unique to each individual. Biometrics have a very useful
application in authentication and access control.
The designed system is a text-prompted version of voice biometrics that incorporates a
text-independent speaker verification system and a speaker-independent speech verification
system, implemented independently. The foundation for this joint system is that the speech signal
conveys both the speech content and the speaker identity. Such systems are more secure against
playback attacks, since the word to speak during authentication is not set in advance.
During the course of the project, various digital signal processing and pattern classification
algorithms were studied. Short-time spectral analysis was performed to obtain MFCC, energy,
and their deltas as features. The feature extraction module is the same for both systems. Speaker
modeling was done with GMM, and a left-to-right discrete HMM with VQ was used for isolated
word modeling. The results of both systems were combined to authenticate the user.
The speech model for each word was pre-trained using utterances of 45 English words. The
speaker model was trained with about 2 minutes of speech from each of 15 speakers. For
individual words, the recognition rate of the speech recognition system is 92% and that of the
speaker recognition system is 66%. For longer utterances (> 5 s), the recognition rate of the
speaker recognition system improves to 78%.


TABLE OF CONTENTS

PAGE OF APPROVAL
COPYRIGHT
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATIONS
1. INTRODUCTION
1.2 Objectives
2. LITERATURE REVIEW
2.1 Pattern Recognition
2.2 Generation of Voice
2.3 Voice as Biometric
2.4 Speech Recognition
2.5 Speaker Recognition
2.5.1. Types of Speaker Recognition
2.5.2. Modes of Speaker Recognition
2.6 Feature Extraction for Speech/Speaker Recognition System
2.6.1. Short Time Analysis
2.6.2. MFCC Feature
2.7 Speaker/Speech Modeling
2.7.1. Gaussian Mixture Model
2.7.2. Hidden Markov Model
2.7.3. K-Means Clustering
3. IMPLEMENTATION DETAILS
3.1 Pre-Processing and Feature Extraction
3.1.1. Capture
3.1.2. End point Detection and Silence Removal
3.1.3. PCM Normalization
3.1.4. Pre-emphasis
3.1.5. Framing and Windowing
3.1.6. Discrete Fourier Transform
3.1.7. Mel Filter
3.1.8. Cepstrum by Inverse Discrete Fourier Transform
3.2 GMM Implementation
3.2.1. Block Diagram of GMM Based Speaker Recognition System
3.2.2. GMM Training
3.2.3. Verification
3.2.4. Performance Measure of Speaker Verification System
3.3 Implementation of HMM for Speech Recognition
3.3.1. Isolated Word Recognition
3.3.2. Application of HMM
3.3.3. Scaling
4. UML CLASS DIAGRAMS OF THE SYSTEMS
5. DATA COLLECTION AND TRAINING
6. RESULTS
7. APPLICATION AREA
8. CONCLUSION
REFERENCES
APPENDIX A: BlazeDS Configuration for Remoting Service
APPENDIX B: Words Used for HMM Training
APPENDIX C: Development Tools and Environment
APPENDIX D: Snapshots of Output GUI
LIST OF FIGURES

Figure 1.1: System Architecture
Figure 1.2: Block Diagram of Text Prompted Speaker Verification System
Figure 2.1: General block diagram of pattern recognition system
Figure 2.2: Vocal Schematic
Figure 2.3: Audio Sample for /i:/ phoneme
Figure 2.4: Audio Magnitude Spectrum for /i:/ phoneme
Figure 2.5: GMM with four Gaussian components and their equivalent model
Figure 2.6: Ergodic Model of HMM
Figure 2.7: Left to Right HMM
Figure 3.1: Pre-Processing and Feature Extraction
Figure 3.2: Input signal to End-point detection system
Figure 3.3: Output signal from End point Detection System
Figure 3.4: Signal before Pre-Emphasis
Figure 3.5: Signal after Pre-Emphasis
Figure 3.6: Frame Blocking of the Signal
Figure 3.7: Hamming window
Figure 3.8: A single frame before and after windowing
Figure 3.9: Equally spaced Mel values
Figure 3.10: Mel Scale Filter Bank
Figure 3.11: Block diagram of GMM based Speaker Recognition System
Figure 3.12: Equal Error Rate (EER)
Figure 3.13: Speech Recognition algorithm flow
Figure 3.14: Pronunciation model of word TOMATO
Figure 3.15: Vector Quantization
Figure 3.16: Average word error rate (for a digits vocabulary) versus the number of states N in the HMM
Figure 3.17: Curve showing tradeoff of VQ average distortion as a function of the size of the VQ, M (shown on a log scale)
Figure 3.18: Forward Procedure - Induction Step
Figure 3.19: Backward Procedure - Induction Step
Figure 3.20: Viterbi Search
Figure 3.21: Computation of ξ_t(i, j)
Figure 4.1: UML diagram of Client System
Figure 4.2: UML Diagram of Server System



LIST OF SYMBOLS AND ABBREVIATIONS

λ GMM/HMM Model
T Threshold
σ² Variance
Λ(X) Likelihood Ratio
μ Mean
π Initial State Distribution
A State Transition Probability Distribution
B Observation Symbol Probability Distribution
C_m Covariance Matrix for m-th Component
q_t State at time t
W_m Weighting Factor for m-th Gaussian Component
x Feature Vector
AIR Adobe Integrated Runtime
DC Direct Current
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DHMM Discrete Hidden Markov Model
DTW Dynamic Time Warping
EM Expectation-Maximization
FAR False Acceptance Rate
FRR False Rejection Rate
GMM Gaussian Mixture Model
HMM Hidden Markov Model
LPC Linear Prediction Coding
MFCC Mel Frequency Cepstral Coefficient
ML Maximum Likelihood
PDF Probability Distribution Function
PLP Perceptual Linear Prediction
RIA Rich Internet Application
RPC Remote Procedure Call
SID Speaker IDentification
TER Total Error Rate
UBM Universal Background Model
UML Unified Modeling Language
VQ Vector Quantization
WTP Web Tool Platform


1. INTRODUCTION
Biometrics is, in the simplest definition, something you are. It is a physical characteristic
unique to each individual, such as a fingerprint, retina, iris, or voice. Biometrics has a very
useful application in security: it can be used to authenticate a person's identity and control
access to a restricted area, based on the premise that the set of these physical characteristics
can be used to uniquely identify individuals.
A speech signal conveys two important types of information: primarily the speech content
and, at a secondary level, the speaker identity. Speech recognizers aim to extract the lexical
information from the speech signal independently of the speaker by reducing the inter-speaker
variability. Speaker recognition, on the other hand, is concerned with extracting the
identity of the person speaking the utterance. Both speech recognition and speaker
recognition are therefore possible from the same voice input.
Text-Prompted Remote Speaker Authentication is a voice biometric system that authenticates
a user, on the basis of the user's input voice, before permitting the user to log into a system.
It is a web application. Voice signal acquisition and feature extraction are done on the client.
Training and authentication, based on the voice features obtained from the client side, are done
on the server. The authentication task is based on the text-prompted version of speaker
recognition, which incorporates both speaker recognition and speech recognition. This joint
implementation includes text-independent speaker recognition and speaker-independent
speech recognition. Speaker recognition verifies whether the speaker is the claimed one,
while speech recognition verifies whether the spoken word matches the prompted word.
The client side is realized in Adobe Flex, whereas the server side is realized in Java.
Communication between these two platforms is made possible by BlazeDS's RPC remote
object.
[Diagram: User -> Browser Application in Client (Flex) -> BlazeDS RPC -> Server (Java)]

Figure 1.1: System Architecture

Mel Frequency Cepstral Coefficients (MFCC) are used as features for both the speech and
speaker recognition tasks. We also combined energy features and the delta and delta-delta
features of energy and MFCC.
After the features are calculated, a Gaussian Mixture Model (GMM) is used for speaker
modeling and a left-to-right Discrete Hidden Markov Model with Vector Quantization
(DHMM/VQ) is used for speech modeling.
Based on the speech model, the system decides whether the uttered speech matches the
prompted text. Similarly, based on the speaker model, the system decides whether the
speaker is the claimed one. The speaker is then authenticated using the combined result of
these two tests.
As shown in Figure 1.2, the feature extraction module is the same for both speech and speaker
recognition, and the two recognition systems are implemented independently of each other.

Figure 1.2: Block Diagram of Text Prompted Speaker Verification System
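
The way the two independent tests might be combined can be sketched as follows. This is a minimal illustration, not the project's server code: the function name `authenticate`, its parameters, and the rule that both tests must pass are assumptions made for the example, since the text only states that the results are combined.

```python
def authenticate(prompted_word, recognized_word, speaker_score, threshold):
    """Accept the user only if both independent tests pass (assumed AND rule).

    prompted_word   -- the word the system asked the user to say
    recognized_word -- output of the speech recognizer (hypothetical input)
    speaker_score   -- verification score from the speaker model (hypothetical)
    threshold       -- acceptance threshold for the speaker score
    """
    speech_ok = recognized_word == prompted_word   # speech verification
    speaker_ok = speaker_score > threshold         # speaker verification
    return speech_ok and speaker_ok


# A correct word spoken by an impostor, or a wrong word spoken by the
# genuine user, is rejected under this rule.
print(authenticate("apple", "apple", 1.4, 0.5))
print(authenticate("apple", "mango", 1.4, 0.5))
```

Requiring both tests to pass is the strictest way to combine them; a replayed recording of the genuine speaker would fail the speech test unless it happened to contain the prompted word.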



1.2 Objectives
The objectives of this project are:
To design and build a speaker verification system
To design and build a speech verification system
To implement these systems jointly to control remote access to a restricted area


2. LITERATURE REVIEW
2.1 Pattern Recognition
Pattern recognition, a branch of artificial intelligence and a sub-field of machine
learning, is the study of how machines can observe the environment, learn to distinguish
patterns of interest from their background, and make sound and reasonable decisions about
the categories of the patterns. A pattern can be a fingerprint image, a handwritten cursive
word, a human face, a speech signal, a sales pattern, etc.
The applications of pattern recognition include data mining, document classification,
financial forecasting, organization and retrieval of multimedia databases, and biometrics
(personal identification based on various physical attributes such as face, retina, speech, ear
and fingerprints).
The essential steps of pattern recognition are: Data Acquisition, Preprocessing, Feature
Extraction, Training and Classification.

Figure 2.1: General block diagram of pattern recognition system
Features are descriptors of a pattern. They must be selected so that they are
discriminative and invariant. They can be represented as a vector, matrix, tree, graph, or
string. Ideally, they are similar for objects in the same class and very different for objects
in different classes.
A pattern class is a family of patterns that share some common properties. Pattern recognition
by machine involves techniques for assigning patterns to their respective classes
automatically and with as little human intervention as possible.
Learning and classification usually use one of the following approaches. Statistical pattern
recognition is based on statistical characterizations of patterns, assuming that the patterns are
generated by a probabilistic system. Syntactical (or structural) pattern recognition is based
on the structural interrelationships of features.
Given a pattern, its recognition/classification may consist of one of the following two tasks,
according to the type of learning procedure: 1) supervised classification (e.g., discriminant
analysis), in which the input pattern is identified as a member of a predefined class; 2)
unsupervised classification (e.g., clustering), in which the pattern is assigned to a previously
unknown class.
2.2 Generation of Voice
Speech begins with the generation of an airstream, usually by the lungs and diaphragm, a
process called initiation. This air then passes through the larynx, where it is modulated
by the glottis (vocal cords). This step is called phonation or voicing, and is responsible for
the generation of pitch and tone. Finally, the modulated air is filtered by the mouth, nose, and
throat, a process called articulation, and the resultant pressure wave excites the air.

Figure 2.2: Vocal Schematic
Depending upon the positions of the various articulators, different sounds are produced.
The position of the articulators can be modeled by a linear time-invariant system whose
frequency response is characterized by several peaks called formants. The change in frequency
of the formants characterizes the phoneme being articulated.

As a consequence of this physiology, we can notice several characteristics of the
frequency-domain spectrum of speech. First of all, the oscillation of the glottis results in an
underlying fundamental frequency and a series of harmonics at multiples of this fundamental.
This is shown in Figures 2.3 and 2.4, where we have plotted a brief audio waveform for the
phoneme /i:/ and its magnitude spectrum. The fundamental frequency (180 Hz) and its
harmonics appear as spikes in the spectrum. The location of the fundamental frequency is
speaker dependent, and is a function of the dimensions and tension of the vocal cords. For
adults it usually falls between 100 Hz and 250 Hz, and the average for females is significantly
higher than that for males.
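
To make the relationship between a fundamental frequency and its harmonics concrete, the sketch below synthesizes a signal with a 180 Hz fundamental and two harmonics, then locates the strongest peaks of its magnitude spectrum. It is illustrative only: the 8000 Hz sample rate, the harmonic amplitudes, and the 400-sample length (chosen so that each DFT bin is 20 Hz wide) are assumptions, and a naive DFT stands in for whatever transform the project actually used.

```python
import cmath
import math

SAMPLE_RATE = 8000                      # Hz (assumed for this demo)
F0 = 180.0                              # fundamental frequency from the text
N_SAMPLES = 400                         # gives 8000 / 400 = 20 Hz per DFT bin
HARMONIC_AMPLITUDES = [1.0, 0.5, 0.25]  # assumed amplitudes of harmonics 1..3


def synthesize(n_samples):
    """Sum of sinusoids at F0, 2*F0 and 3*F0."""
    return [
        sum(a * math.sin(2 * math.pi * F0 * (h + 1) * i / SAMPLE_RATE)
            for h, a in enumerate(HARMONIC_AMPLITUDES))
        for i in range(n_samples)
    ]


def magnitude_spectrum(signal):
    """Naive DFT magnitude for bins 0..N/2 (slow, but fine for a short demo)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(signal))) / n
        for k in range(n // 2 + 1)
    ]


signal = synthesize(N_SAMPLES)
spectrum = magnitude_spectrum(signal)
bin_hz = SAMPLE_RATE / N_SAMPLES

# Indices of the three strongest bins, converted to frequencies in Hz.
strongest = sorted(range(len(spectrum)), key=lambda k: spectrum[k], reverse=True)[:3]
peak_freqs = sorted(k * bin_hz for k in strongest)
print(peak_freqs)
```

The three strongest bins fall at the fundamental and at its second and third multiples, which is the spike pattern the magnitude spectrum of a voiced phoneme shows.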

Figure 2.3: Audio Sample for /i:/ phoneme showing stationary property of phonemes for a short period
Sound is produced in phonemes, which are the building blocks of speech. Each phoneme
resonates at a fundamental frequency and its harmonics, and thus has high energy at those
frequencies; in other words, different phonemes have different formants. This is the feature
that enables the identification of each phoneme at the recognition stage.
For speech recognition, the inter-speaker variations in the features of the speech signal during
the utterance of a word are modeled in word training. For speaker recognition, the
intra-speaker variations in features over long speech content are modeled.

[Figure 2.3 plot: x-axis Samples (0 to 2500), y-axis Amplitude (-0.2 to 0.15)]


Figure 2.4: Audio Magnitude Spectrum for /i:/ phoneme showing fundamental frequency and its harmonics
Besides the configuration of articulators, the acoustic manifestation of a phoneme is affected
by:
Physiology and emotional state of speaker
Phonetic context
Accent
2.3 Voice as Biometric
The underlying premise for voice authentication is that each person's voice differs in pitch,
tone, and volume enough to make it uniquely distinguishable. Several factors contribute to
this uniqueness: the size and shape of the mouth, throat, nose, and teeth (articulators) and the
size, shape, and tension of the vocal cords. The chance that all of these are exactly the same
in any two people is very low.
Voice biometrics has the following advantages over other forms of biometrics:
Natural signal to produce
Low implementation cost, since it does not require a specialized input device
Acceptable to users
Easily combined with other forms of authentication for multifactor authentication
Only biometric that allows users to authenticate remotely
[Figure 2.4 plot: x-axis Frequency (Hz) (0 to 4000), y-axis |Y(f)| (0 to 0.04)]
2.4 Speech Recognition
Speech is the dominant means of communication between humans, and promises to be
important for communication between humans and machines, if it can just be made a little
more reliable.
Speech recognition is the process of converting an acoustic signal to a set of words.
Applications include voice command and control, data entry, voice user interfaces, and
automating the telephone operator's job in telephony. The recognized words can also serve as
the input to natural language processing.
There are two variants of speech recognition, based on the duration of the speech signal:
isolated word recognition, in which each word is surrounded by some sort of pause, and
continuous speech recognition, in which words run into each other and have to be
segmented. The former is much easier than the latter.
Speech recognition is a difficult task because of the many sources of variability associated
with the signal. First, the acoustic realizations of phonemes, the smallest sound units of
which words are composed, are highly dependent on the context. Second, acoustic variability
can result from changes in the environment as well as in the position and characteristics of the
transducer. Third, within-speaker variability can result from changes in the speaker's physical
and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic
background, dialect, and vocal tract size and shape can contribute to cross-speaker variability.
Such variability is modeled in various ways. At the level of signal representation, a
representation that emphasizes speaker-independent features is developed.
2.5 Speaker Recognition
Speaker recognition is the process of automatically recognizing who is speaking on the basis
of individual information included in speech waves. Speaker recognition can be classified
into identification and verification. It has been applied most often as a means of biometric
authentication.
2.5.1. Types of Speaker Recognition
2.5.1.1 Speaker Identification
Speaker identification is the process of determining which registered speaker provided a
given utterance. In a Speaker IDentification (SID) system, no identity claim is provided; the
test utterance is scored against a set of known (registered) references for each potential
speaker, and the one whose model best matches the test utterance is selected.
There are two types of speaker identification task: closed-set and open-set speaker
identification.
In closed-set identification, the test utterance belongs to one of the registered speakers.
During testing, a matching score is estimated for each registered speaker. The speaker
corresponding to the model with the best matching score is selected. This requires N
comparisons for a population of N speakers.
In open-set identification, any speaker can access the system; those who are not registered
should be rejected. This requires another model, referred to as a garbage model, imposter
model, or background model, which is trained with data provided by speakers other than the
registered speakers. During testing, the matching score corresponding to the best speaker
model is compared with the matching score estimated using the garbage model in order to
accept or reject the speaker, making the total number of comparisons equal to N + 1. Speaker
identification performance tends to decrease as the population size increases.
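
The difference between the closed-set and open-set decisions can be sketched as below. The function `identify`, the score dictionary, and the convention that a higher score means a better match are hypothetical choices for illustration; how the scores are produced by the speaker models is outside this sketch.

```python
def identify(speaker_scores, background_score=None):
    """Pick the best-matching registered speaker.

    speaker_scores   -- dict mapping speaker name to matching score
                        (higher is better; hypothetical convention)
    background_score -- score of the garbage/background model; when given,
                        the decision is open-set and may return None
    """
    # N comparisons: one score per registered speaker.
    best = max(speaker_scores, key=speaker_scores.get)
    # One extra comparison (N + 1 in total) against the background model.
    if background_score is not None and speaker_scores[best] <= background_score:
        return None  # unregistered speaker: reject
    return best


scores = {"speaker_a": -10.0, "speaker_b": -8.0, "speaker_c": -12.5}
print(identify(scores))                          # closed-set
print(identify(scores, background_score=-7.5))   # open-set, rejected
```

In the closed-set call the best-scoring speaker is always returned, whereas the open-set call returns `None` whenever the background model explains the utterance at least as well as the best registered speaker.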
2.5.1.2 Speaker Verification
Speaker verification, on the other hand, is the process of accepting or rejecting the identity
claimed by a speaker. During testing, a verification score is estimated using the claimed
speaker model and the anti-speaker model. This verification score is then compared to a
threshold. If the score is higher than the threshold, the speaker is accepted; otherwise, the
speaker is rejected. Thus, speaker verification involves a hypothesis test requiring a simple
binary decision: accept or reject the claimed identity, regardless of the population size. Hence,
the performance is largely independent of the population size, but it depends on the number of
test utterances used to evaluate the performance of the system.
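
A minimal sketch of this binary decision follows. Forming the verification score as a log-likelihood ratio between the claimed speaker model and the anti-speaker model is a common choice but an assumption here, as are the function name and the example numbers.

```python
def verify(log_lik_claimed, log_lik_anti, threshold):
    """Accept the identity claim if the score exceeds the threshold.

    log_lik_claimed -- log-likelihood of the utterance under the claimed
                       speaker's model (hypothetical input)
    log_lik_anti    -- log-likelihood under the anti-speaker model
    threshold       -- decision threshold on the log-likelihood ratio
    """
    score = log_lik_claimed - log_lik_anti  # assumed log-likelihood ratio
    return score > threshold


print(verify(-42.0, -47.5, threshold=2.0))   # ratio 5.5, accepted
print(verify(-42.0, -43.0, threshold=2.0))   # ratio 1.0, rejected
```

Only two models are scored regardless of how many speakers are registered, which is why the decision cost does not grow with the population size.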
2.5.2. Modes of Speaker Recognition
There are three modes in which speaker verification/identification can be done.
2.5.2.1 Text Independent
In text-independent mode, the system relies only on the voice characteristics of the speaker;
the lexical content of the utterance is not used. The system models the characteristics of the
speaker's voice that show up irrespective of what is being said. This mode is used in
surveillance or forensic applications, where there is no control over the speakers accessing the
system. The test utterances can be different from those used for enrollment; hence,
text-independent speaker verification needs a large and rich training data set to model the
characteristics of the speaker's voice and to cover the phonetic space.
Compared with text-dependent mode, a larger training set and longer test segments are
required to appropriately model the feature variations of the user in uttering different phonemes.
2.5.2.2 Text Dependent
In the text-dependent mode of verification, the user is expected to say a pre-determined text,
a voice password. Since recognition is based on the speaker characteristics as well as the
lexical content of the password, text-dependent speaker recognition systems are generally
more robust and achieve good performance. However, such systems are not yet used on a large
scale because of the fear of playback attacks: the system has a priori knowledge of the
password, i.e., the training and test texts are the same. The speaker model encodes the
speaker's voice characteristics associated with the phonemic or syllabic content of the
password.
2.5.2.3 Text-prompted
Both text-dependent and text-independent systems are susceptible to fraud, since for typical
applications the voice of a speaker could be captured, recorded, and reproduced. To limit this
risk, a particular kind of text-dependent speaker verification system based on prompted text
has been developed. The password, i.e., the text to speak, is not pre-determined; rather, the
user is asked to speak a prompted text (digits, a word, or a phrase). If the number of distinct
random passwords is large, a playback attack is not feasible. Hence the text-prompted system
is more secure.
As in the case of text-independent systems, text-prompted systems also need a large and
rich training data set for each registered speaker to create robust speaker-dependent models.
Because of its resistance to playback attacks, we have chosen the text-prompted system.
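
The idea of prompting a random text can be sketched as follows. The vocabulary below is a hypothetical stand-in (the project's actual word list is given in its appendix and is not reproduced here), and the helper names are illustrative. The sketch also shows how quickly the number of distinct prompts grows when several words are prompted in sequence.

```python
import secrets

# Hypothetical stand-in vocabulary; not the project's word list.
VOCABULARY = ["apple", "river", "window", "seven", "garden", "music"]


def choose_prompt(vocabulary, n_words=1):
    """Draw n_words at random (with repetition) using a secure RNG."""
    return [secrets.choice(vocabulary) for _ in range(n_words)]


def prompt_space_size(vocabulary_size, n_words):
    """Number of distinct ordered prompts of n_words words."""
    return vocabulary_size ** n_words


prompt = choose_prompt(VOCABULARY, n_words=3)
print(prompt)
print(prompt_space_size(45, 1))   # single word from a 45-word vocabulary
print(prompt_space_size(45, 3))   # three-word prompts from the same vocabulary
```

With a single word from a 45-word vocabulary an attacker holding one recording guesses the prompt with probability 1/45; prompting three words reduces this to 1/91125, which is the sense in which a larger prompt space makes playback impractical.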
2.6 Feature Extraction for Speech/Speaker Recognition System
Signal representation, or coding of the short-term spectrum into feature vectors, is one of the
most important steps in automatic speaker recognition and continues to be a subject of
research. Many different techniques have been proposed in the literature; they are generally
based on speech production models or speech perception models.
The goal of feature extraction is to transform the input waveform into a sequence of acoustic
feature vectors, each vector representing the information in a small time window of the
signal. Feature extraction transforms the high-dimensional input signal into lower-dimensional
vectors. For speaker recognition purposes, an optimal feature has the following properties:
1. High inter-speaker variation,
2. Low intra-speaker variation,
3. Easy to measure,
4. Robust against disguise and mimicry,
5. Robust against distortion and noise,
6. Maximally independent of the other features.
2.6.1. Short Time Analysis
Spectral analysis of the speech signal is based on classic Fourier analysis. However, the
Fourier transform cannot be applied directly to the whole speech signal, because the signal
cannot be considered stationary owing to constant changes in the articulatory system within
each utterance. To solve this problem, the speech signal is split into a sequence of short
segments such that each one is short enough to be considered pseudo-stationary. The length
of each segment, also called a window or frame, ranges between 10 and 40 ms (in such a short
time period the articulatory system is not able to change significantly). Finally, a feature
vector is extracted from the short-time spectrum in each window. The whole process is
known as short-term spectral analysis.
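
Splitting a signal into short frames can be sketched as below. The 25 ms frame length lies within the 10 to 40 ms range given above; the 10 ms hop (overlapping frames), the 8000 Hz sample rate, and the function name are assumptions for illustration rather than the project's settings.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split samples into fixed-length, possibly overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# One second of dummy samples at an assumed 8000 Hz sample rate.
samples = list(range(8000))
frames = frame_signal(samples, sample_rate=8000)
print(len(frames), len(frames[0]))
```

Each frame would then be windowed and transformed separately, so that one feature vector is produced per frame.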
2.6.2. MFCC Feature
The commonly used feature extraction methods for speech/speaker recognition are LPC (Linear
Prediction Coding), MFCC (Mel Frequency Cepstral Coefficients), and PLP (Perceptual
Linear Prediction). LPC is based on the assumption that a speech sample can be approximated
by a linearly weighted sum of a fixed number of preceding samples. PLP coefficients are
calculated in a similar way to LPC coefficients, but transformations are first carried out on
the spectrum of each window in order to model human hearing behavior.
The most popular feature extraction method, MFCC, mimics human hearing behavior by
emphasizing lower frequencies and de-emphasizing higher frequencies.
The Mel scale, proposed by Stevens, Volkmann, and Newman in 1937, is a perceptual scale of
pitches judged by listeners to be equal in distance from one another.
It is based on an empirical study of human perceived pitch or frequency. Human hearing is
not equally sensitive in all frequency bands: it is less sensitive at higher frequencies, roughly
above 1000 Hz. The Mel is a unit of pitch defined so that pairs of sounds which are
perceptually equidistant in pitch are separated by an equal number of Mels.
The mapping between frequency in Hertz and the Mel scale is approximately linear below
1000 Hz and logarithmic above 1000 Hz:
$$\mathrm{Mel}(x) = 2595 \log_{10}\left(1 + \frac{x}{700}\right)$$
Modeling this property of human hearing during feature extraction improves speech
recognition performance. In MFCC computation, the frequencies output by the DFT are warped
onto the Mel scale; this is implemented by a bank of filters, each of which collects energy
from one frequency band.
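The Hertz-to-Mel mapping and its inverse can be sketched in a few lines of Python. The function names `hz_to_mel` and `mel_to_hz` are illustrative choices, not names from the project code; the constants 2595 and 700 are the ones in the formula above.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """Mel(x) = 2595 * log10(1 + x / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse mapping: x = 700 * (10 ** (mel / 2595) - 1)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# 1000 Hz maps to roughly 1000 mel, the usual anchor point of the scale.
mel_at_1khz = hz_to_mel(1000.0)
```

Equal steps on the Mel axis correspond to progressively wider steps in Hertz, which is why the filters of a Mel filter bank become wider at higher frequencies.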

2.7 Speaker/Speech Modeling
There are various pattern modeling/matching techniques, including Dynamic Time
Warping (DTW), Gaussian Mixture Model (GMM), Hidden Markov Model (HMM),
Artificial Neural Network (ANN), and Vector Quantization (VQ). These are used for both
speech and speaker modeling. The preferred approach is statistical learning. For speaker
recognition, GMM models the variation in a speaker's features over a long sequence of
utterances. For speech recognition, HMM is widely used: it models the Markovian nature of
the speech signal, where each phoneme represents a state and a sequence of such phonemes
represents a word. The sequences of features of such phonemes from different speakers are
modeled by the HMM.
2.7.1. Gaussian Mixture Model
2.7.1.1 Univariate Gaussian
The Gaussian distribution, also known as the normal distribution, is the bell-curve
function. A Gaussian distribution is parameterized by a mean $\mu$ and a variance
$\sigma^2$. The formula for a Gaussian function is:
$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
2.7.1.2 Mixture Model
In statistics, a mixture model is a probabilistic model which assumes that the underlying data
belong to a mixture distribution. The density function of a mixture distribution is a convex
combination (a linear combination in which all coefficients or weights are non-negative and
sum to one) of other probability density functions:
$$p(x) = w_1 p_1(x) + w_2 p_2(x) + \cdots + w_n p_n(x)$$
The individual density functions $p_i(x)$ that are combined to make the mixture density p(x)
are called the mixture components, and the weights $w_1, w_2, \ldots, w_n$ associated with each
component are called the mixture weights or mixture coefficients.
2.7.1.3 Gaussian Mixture Model
A Gaussian Mixture Model (GMM) is a parametric probability density function commonly
used as a model of continuous data, most notably biometric features in speaker
recognition systems, because of its capability of representing a large class of sample
distributions. Like K-Means, a GMM can be regarded as a type of unsupervised learning or
clustering method: the entire set of experimental data is modeled by a mixture of Gaussians.
But unlike K-Means, GMMs are able to build soft clustering boundaries, i.e., points in space
can belong to any class with a given probability.
In a Gaussian mixture distribution, the density function is a convex combination (a linear
combination in which all coefficients or weights sum to one) of Gaussian probability density
functions.

Figure 2.5: GMM with four Gaussian components and their equivalent model
Mathematically, a GMM is the weighted sum of M Gaussian component densities, given by
the equation
$$p(x \mid \lambda) = \sum_{m=1}^{M} w_m\, g_m(x \mid \mu_m, C_m)$$
where
x is a K-dimensional random vector,
$w_m$ are the mixture weights, which show the relative importance of each component and
satisfy the constraint $\sum_{m=1}^{M} w_m = 1$,
$g_m(x \mid \mu_m, C_m)$, m = 1, 2, ..., M, are the component densities, where each component
density is a K-dimensional Gaussian function (pdf) of the form
$$g_m(x \mid \mu_m, C_m) = \frac{1}{\sqrt{(2\pi)^K |C_m|}} \exp\left\{ -\frac{1}{2} (x - \mu_m)^T C_m^{-1} (x - \mu_m) \right\}$$
where
$\mu_m$ is the mean vector, of length K, of the m-th Gaussian pdf,
$C_m$ is the K x K covariance matrix of the m-th Gaussian pdf.
Thus the complete Gaussian Mixture Model is parameterized by the mixture weights, mean
vectors and covariance matrices of all component densities. The parameters are collectively
represented by the notation
$$\lambda = \{w_m, \mu_m, C_m\}, \quad m = 1, 2, \ldots, M$$
These parameters are estimated during training. In a speaker recognition system, each
speaker is represented by a GMM and is referred to by his/her model $\lambda$.
GMM is widely used in speaker modeling and classification because of two important benefits.
First, the individual Gaussian components in a speaker-dependent GMM can be interpreted as
representing broad acoustic classes. A speaker's voice can be characterized by a set of
acoustic classes representing broad phonetic events such as vowels, nasals and fricatives.
These acoustic classes reflect general speaker-dependent vocal tract configurations that
are useful for characterizing speaker identity. The spectral shape of the i-th acoustic class
can in turn be represented by the mean of the i-th component density, and variations of the
average spectral shape can be represented by the covariance matrix. These acoustic classes are
hidden before training. Second, a Gaussian mixture density provides a smooth approximation to
the long-term sample distribution of the training utterances of a given speaker. The unimodal
Gaussian speaker model represents a speaker's feature distribution by a mean vector and a
covariance matrix, and the VQ model represents a speaker's distribution by a discrete set of
characteristic templates. GMM acts as a hybrid between these two models, using a discrete set
of Gaussian functions, each with its own mean and covariance matrix, to allow better
modeling capability.


2.7.2. Hidden Markov Model
In general, a Markov model is a way of describing a process that goes through a series of
states. The model describes all the possible paths through the state space and assigns a
probability to each one. The probability of transitioning from the current state to another one
depends only on the current state, not on any prior part of the path.
HMMs can be applied in many fields where the goal is to recover a data sequence that is not
immediately observable. Common applications include cryptanalysis, speech recognition,
part-of-speech tagging, machine translation, partial discharge, gene prediction, alignment of
bio-sequences, and activity recognition.
2.7.2.1 Discrete Markov Processes
The transition probability $a_{ij}$ for a first-order Markov chain with N distinct states
$S_1, S_2, S_3, \ldots, S_N$ is given by:
$$a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i], \quad 1 \le i, j \le N$$
where $q_t$ is the state at time t.
The state transition coefficients have the following properties (due to standard stochastic
constraints):
$$a_{ij} \ge 0 \quad \forall\, i, j$$
$$\sum_{j=1}^{N} a_{ij} = 1 \quad \forall\, i$$
The transition probabilities for all states in a model can be described by a transition
probability matrix:
$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{bmatrix}_{N \times N}$$
The initial state distribution vector is given by:
$$\pi = \begin{bmatrix} \pi_1 = P(q_1 = 1) \\ \pi_2 = P(q_1 = 2) \\ \vdots \\ \pi_N = P(q_1 = N) \end{bmatrix}_{N \times 1}$$
The stochastic property for the initial state distribution vector is:
$$\sum_{i=1}^{N} \pi_i = 1$$
where $\pi_i$ is defined as:
$$\pi_i = P(q_1 = i), \quad 1 \le i \le N$$
The Markov model can be described by
$$\lambda = (A, \pi)$$
This stochastic process can be called an observable Markov model, since the output of the
process is the set of states at each instant of time, where each state corresponds to a
physical (observable) event.
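As a small illustration of an observable Markov model, the probability of a given state sequence is the initial probability of its first state multiplied by the transition probabilities along the path. The two-state matrix below is made-up example data, and `sequence_probability` is a hypothetical helper name; states are indexed from 0.

```python
import numpy as np

def sequence_probability(states, A, pi):
    """P(q_1, ..., q_T) = pi[q_1] * product over t of A[q_(t-1), q_t]."""
    prob = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        prob *= A[prev, cur]
    return float(prob)

# Illustrative two-state chain; every row of A sums to one.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi = np.array([0.5, 0.5])

# Path 0 -> 0 -> 1 has probability 0.5 * 0.7 * 0.3.
p = sequence_probability([0, 0, 1], A, pi)
```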
2.7.2.3 Hidden Markov Model
The Markov model is too restrictive to be applicable to many problems of interest, so the
concept is extended to the Hidden Markov Model to include the case where the
observation is a probabilistic function of the state. The resulting model is a doubly embedded
stochastic process with an underlying stochastic process that is not observable (i.e. hidden),
but can only be observed through another set of stochastic processes that produce the
sequence of observations. The difference is that in a Markov chain the output state is
completely determined at each time t, whereas in the Hidden Markov Model the state at each
time t must be inferred from the observations.
Elements of HMM
The HMM is characterized by the following:
1) Set of hidden states
$$S = \{S_1, S_2, \ldots, S_N\}$$
and state at time t, $q_t \in S$
2) Set of observation symbols per state
$$V = \{v_1, v_2, \ldots, v_M\}$$
and observation at time t, $O_t \in V$
3) The initial state distribution
$$\pi = \{\pi_i\}, \quad \pi_i = P[q_1 = S_i], \quad 1 \le i \le N$$
4) State transition probability distribution
$$A = \{a_{ij}\}, \quad a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i], \quad 1 \le i, j \le N$$
5) Observation symbol probability distribution in state j
$$B = \{b_j(k)\}, \quad b_j(k) = P[v_k \text{ at } t \mid q_t = S_j], \quad 1 \le j \le N, \ 1 \le k \le M$$
An HMM is typically written as $\lambda = (A, B, \pi)$.
2.7.2.4 Types of HMMs
An ergodic or fully connected HMM has the property that every state can be reached from
every other state in a finite number of steps. In the fully connected case, every
$a_{ij}$ coefficient is positive. For some applications, other types of HMM have been found to
account for the observed properties of the signal being modeled better than the standard
ergodic model.

Figure 2.6: Ergodic Model of HMM
One such model is the left-right model or Bakis model, so called because the underlying state
sequence associated with the model has the property that as time increases the state index
increases (or stays the same), i.e. the states proceed from left to right. Clearly, the
left-right type of HMM has the desirable property that it can readily model signals whose
properties change over time, e.g., speech.
[Figure: four states (State 1 to State 4) with self-transitions a11, a22, a33, a44, forward
transitions a12, a23, a34 and skip transitions a13, a24]
Figure 2.7: Left to Right HMM
The properties of left-right HMMs are:
1) The state transition coefficients have the property
$$a_{ij} = 0, \quad j < i$$
i.e., no transition is allowed to states whose indices are lower than the current state.
2) The initial state probabilities have the property
$$\pi_i = \begin{cases} 1, & i = 1 \\ 0, & i \ne 1 \end{cases}$$
since the state sequence must begin in state 1 and end in state N.
3) The state transition coefficient for the last state in a left-right model is specified as
$$a_{NN} = 1$$
With left-right models, additional constraints are placed on the state transition coefficients to
make sure that large changes in state indices do not occur; hence a constraint of the form
$$a_{ij} = 0, \quad j > i + \Delta$$
is often used. The value of $\Delta$ is 2 in this speech recognition system, i.e., no jumps of
more than 2 states are allowed. The form of the state transition matrix for $\Delta = 2$ and
N = 4 is as follows:
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & 0 \\ 0 & a_{22} & a_{23} & a_{24} \\ 0 & 0 & a_{33} & a_{34} \\ 0 & 0 & 0 & a_{44} \end{bmatrix}$$
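The left-right constraints can be sketched as a routine that builds an initial transition matrix before training. The uniform initialisation of the allowed entries is an assumption made for illustration, and `left_right_transitions` is a hypothetical name; the report does not state how the initial values were chosen.

```python
import numpy as np

def left_right_transitions(n_states: int = 4, delta: int = 2) -> np.ndarray:
    """Left-right (Bakis) transition matrix with uniform allowed entries.

    Entry (i, j) is non-zero only for i <= j <= i + delta, and each row
    sums to one, so the last state has a self-transition of probability 1.
    """
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        last = min(i + delta, n_states - 1)
        A[i, i:last + 1] = 1.0 / (last - i + 1)
    return A

A = left_right_transitions()   # N = 4, delta = 2
```

Entries that start at zero stay at zero under Baum-Welch re-estimation, so the structure imposed here is preserved by training.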

2.7.3. K-Means Clustering
Clustering can be considered the most important unsupervised learning problem; like every
other problem of this kind, it deals with finding structure in a collection of unlabeled data.
A loose definition of clustering could be "the process of organizing objects into groups whose
members are similar in some way". A cluster is therefore a collection of objects which are
similar to one another and dissimilar to the objects belonging to other clusters.
In statistics and machine learning, k-means clustering is a method of cluster analysis which
aims to partition n observations into k clusters in which each observation belongs to the
cluster with the nearest mean.
The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered.
These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation
of the objects into groups from which the metric to be minimized can be calculated.
Both the clustering process and the decoding process require a distance metric or distortion
metric, that specifies how similar two acoustic feature vectors are. The distance metric is used
to build clusters, to find a prototype vector for each cluster, and to compare incoming vectors
to the prototypes. The simplest distance metric for acoustic feature vectors is Euclidean
distance. Euclidean distance is the distance in N-dimensional space between the two points
defined by the two vectors.
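The four steps above, with Euclidean distance as the metric, can be sketched as follows. This is a minimal illustration rather than the project's codebook trainer: the random choice of initial centroids, the handling of empty clusters and the name `k_means` are assumptions.

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Plain k-means with Euclidean distance.

    data: (n, d) array. Returns (centroids, labels).
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()

    def assign(cents):
        # Step 2: index of the nearest centroid for every point.
        dists = np.linalg.norm(data[:, None, :] - cents[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(max_iter):
        labels = assign(centroids)
        # Step 3: recompute each centroid; an empty cluster keeps its old one.
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign(centroids)

points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = k_means(points, k=2)
```

For vector quantization, the returned centroids play the role of the codebook and the label of a feature vector is its discrete observation symbol.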
3. IMPLEMENTATION DETAILS
The implementation of the joint speaker/speech recognition system includes a common
pre-processing and feature extraction module, text-independent speaker modeling and
classification by GMM, and speaker-independent speech modeling and classification by
HMM/VQ.
3.1 Pre-Processing and Feature Extraction
Starting from the capturing of audio signal, feature extraction consists of the following steps
as shown in the block diagram below:
[Block diagram: speech signal, silence removal, pre-emphasis, framing, window, DFT, Mel filter
bank, log, IDFT, CMS, energy, deltas. Output feature: 12 MFCC coefficients and 1 energy, each
with delta and delta-delta values.]
Figure 3.1: Pre-Processing and Feature Extraction
3.1.1. Capture
The first step in processing speech is to convert the analog representation (first air pressure,
and then analog electric signals in a microphone) into a digital signal x[n], where n is an
index over time. Analysis of the audio spectrum shows that nearly all energy resides in the
band between DC and 4 kHz, and beyond 10 kHz there is virtually no energy whatsoever.
Sound format used:
22050 Hz
16-bits, Signed
Little Endian
Mono Channel
Uncompressed PCM
3.1.2. End point detection and Silence removal
The captured audio signal may contain silence at different positions, such as at the beginning
of the signal, between the words of a sentence, and at the end of the signal. If silent frames
are included, modeling resources are spent on parts of the signal which do not contribute to
the identification, so the silence must be removed before further processing.
There are several ways of doing this; the most popular are Short Time Energy and Zero
Crossing Rate. However, both have the limitation that their thresholds must be set on an ad
hoc basis. The algorithm we used [Ref.4] uses statistical properties of the background noise
as well as physiological aspects of speech production, and does not assume any ad hoc
threshold. It assumes that the background noise present in the utterances is Gaussian in nature.
Usually the first 200 ms or more of a speech recording (we used 4410 samples at the sampling
rate of 22050 samples/sec) corresponds to silence (or background noise), because the
speaker takes some time to start reading when recording starts.
Endpoint Detection Algorithm
Step 1: Calculate the mean ($\mu$) and standard deviation ($\sigma$) of the first 200 ms of
samples of the given utterance. The background noise is characterized by this $\mu$ and $\sigma$.
Step 2: Go from the first sample to the last sample of the speech recording. For each sample x,
check whether the one-dimensional Mahalanobis distance $|x - \mu| / \sigma$ is greater than 3.
If it is, the sample is treated as a voiced sample; otherwise it is treated as
unvoiced/silence.
This threshold rejects up to 99.7% of the background noise samples, since
$P[|x - \mu| \le 3\sigma] = 0.997$ for a Gaussian distribution, thus accepting only the voiced
samples.
Step 3: Mark each voiced sample as 1 and each unvoiced sample as 0. Divide the whole speech
signal into 10 ms non-overlapping windows. The complete speech is now represented by only
zeros and ones.
Step 4: Suppose there are M zeros and N ones in a window. If $M \ge N$, convert each of the
ones to zeros, and vice versa. This method is adopted keeping in mind that a speech production
system consisting of vocal cords, tongue, vocal tract etc. cannot change abruptly within a
short time window, taken here as 10 ms.
Step 5: Collect the voiced part only, according to the samples labeled 1 in the windowed
array, and copy it into a new array. This retrieves the voiced part of the original speech
signal.
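The five steps can be sketched in NumPy as follows. This is an illustrative reading of the algorithm, not the project's code: the function name `remove_silence`, the truncation of the window length to a whole number of samples and the guard against a zero standard deviation are assumptions.

```python
import numpy as np

def remove_silence(signal, sample_rate=22050, noise_ms=200, window_ms=10, threshold=3.0):
    """Keep only the samples in windows that are mostly 'voiced'."""
    signal = np.asarray(signal, dtype=float)

    # Step 1: noise statistics from the first 200 ms (4410 samples at 22050 Hz).
    n_noise = int(sample_rate * noise_ms / 1000)
    mu = signal[:n_noise].mean()
    sigma = signal[:n_noise].std() or 1e-12   # assumption: avoid division by zero

    # Steps 2 and 3: label each sample 1 (voiced) or 0 (unvoiced).
    voiced = np.abs(signal - mu) / sigma > threshold

    # Step 4: majority vote inside each non-overlapping 10 ms window.
    win = int(sample_rate * window_ms / 1000)
    for start in range(0, len(signal), win):
        block = voiced[start:start + win]
        n_ones = int(block.sum())
        voiced[start:start + win] = n_ones > len(block) - n_ones

    # Step 5: collect the samples labelled 1.
    return signal[voiced]
```

Because the vote is taken per window, isolated noise spikes inside silence are discarded and brief low-amplitude samples inside speech are kept.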

Figure 3.2: Input signal to End-point detection system

Figure 3.3: Output signal from End-point Detection System
3.1.3. PCM Normalization
The extracted pulse code modulated amplitude values are normalized, to avoid amplitude
variation during capturing.
3.1.4. Pre-emphasis
Usually the speech signal is pre-emphasized before any further processing. If we look at the
spectrum of voiced segments like vowels, there is more energy at lower frequencies than at
higher frequencies. This drop in energy across frequencies is caused by the nature of the
glottal pulse. Boosting the high-frequency energy makes information from the higher
formants more available to the acoustic model and improves phone detection accuracy.
The pre-emphasis filter is a first-order high-pass filter. In the time domain, with input x[n]
and $0.9 \le \alpha \le 1.0$, the filter equation is:
$$y[n] = x[n] - \alpha\, x[n-1]$$
We used $\alpha = 0.95$.
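A minimal sketch of the pre-emphasis filter with the coefficient 0.95 is shown below. Treating the sample before the start of the signal as zero is an assumption, and `pre_emphasis` is an illustrative name.

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                       # assumption: x[-1] is taken as 0
    y[1:] = x[1:] - alpha * x[:-1]
    return y

# A constant (zero-frequency) input is almost completely suppressed.
y = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```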

Figure 3.4: Signal before Pre-Emphasis

Figure 3.5: Signal after Pre-Emphasis
3.1.5. Framing and windowing
Speech is a non-stationary signal, meaning that its statistical properties are not constant
across time. Instead, we want to extract spectral features from a small window of speech that
characterizes a particular sub-phone and for which we can make the (rough) assumption that
the signal is stationary (i.e. its statistical properties are constant within this region).
We used frame blocks of 512 samples (23.22 ms at 22050 Hz) with 50% overlap.
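Frame blocking with 512-sample frames and 50% overlap (a hop of 256 samples) can be sketched as below. Dropping the trailing samples that do not fill a whole frame is an assumption, and `frame_signal` is an illustrative name.

```python
import numpy as np

def frame_signal(x, frame_len=512, overlap=0.5):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    x = np.asarray(x, dtype=float)
    hop = int(frame_len * (1.0 - overlap))
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row r holds samples r*hop ... r*hop + frame_len - 1.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

frames = frame_signal(np.arange(1024))
```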

Figure 3.6: Frame Blocking of the Signal
The rectangular window (i.e., no window) can cause problems when we do Fourier analysis,
because it abruptly cuts off the signal at its boundaries. A good window function has a narrow
main lobe and low side-lobe levels in its frequency response, and shrinks the values of the
signal toward zero at the window boundaries, avoiding discontinuities. The most commonly used
window function in speech processing is the Hamming window, defined as follows:
$$S_w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi (n - 1)}{N - 1}\right), \quad 1 \le n \le N$$
Figure 3.7: Hamming window
The extraction of the signal takes place by multiplying the value of the signal at time n,
$s_{frame}[n]$, by the value of the window at time n, $S_w[n]$:
$$y[n] = S_w[n] \cdot s_{frame}[n]$$
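The window definition and its application to one frame can be sketched as follows; `hamming_window` and `apply_window` are illustrative names. With the 1-based index used in the formula, the result coincides with NumPy's built-in `np.hamming`.

```python
import numpy as np

def hamming_window(N: int) -> np.ndarray:
    """S_w(n) = 0.54 - 0.46 * cos(2*pi*(n-1)/(N-1)) for n = 1..N."""
    n = np.arange(1, N + 1)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * (n - 1) / (N - 1))

def apply_window(frame) -> np.ndarray:
    """y[n] = S_w[n] * s_frame[n] for one frame."""
    frame = np.asarray(frame, dtype=float)
    return frame * hamming_window(len(frame))

w = hamming_window(512)
```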


Figure 3.8: A single frame before and after windowing
3.1.6. Discrete Fourier Transform
The Discrete Fourier Transform (DFT) is the tool for extracting spectral information, i.e.,
how much energy a discrete-time (sampled) signal contains at discrete frequency bands. A DFT
of the windowed signal is used to extract the frequency content (the spectrum) of the current
frame. The input to the DFT is a windowed frame of N samples, and the output, for each of N
discrete frequency bands, is a complex number X[k] representing the magnitude and phase of
that frequency component in the original signal. The magnitude of the k-th frequency bin is
$$bin_k = \left| \sum_{n=1}^{N} y(n)\, e^{-j (n-1) k \frac{2\pi}{N}} \right|, \quad k = 0, 1, 2, \ldots, N-1$$
where y(n) is the windowed frame.
The commonly used algorithm for computing the DFT is the Fast Fourier Transform (FFT).
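The magnitude spectrum of one windowed frame can be obtained with NumPy's FFT. The sketch below also includes a direct evaluation of the DFT sum for comparison; both function names are illustrative. For a real-valued frame only the first N/2 + 1 bins are distinct, which is what `np.fft.rfft` returns.

```python
import numpy as np

def magnitude_spectrum(windowed_frame) -> np.ndarray:
    """Magnitudes of DFT bins 0 .. N/2 of a real frame, computed with the FFT."""
    return np.abs(np.fft.rfft(windowed_frame))

def dft_magnitude_direct(frame) -> np.ndarray:
    """Direct O(N^2) evaluation of the DFT sum, for checking only."""
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    n = np.arange(N)
    k = n[:, None]
    return np.abs((frame * np.exp(-2j * np.pi * k * n / N)).sum(axis=1))

# A cosine with exactly 8 cycles in 64 samples peaks in bin 8.
test_frame = np.cos(2.0 * np.pi * 8 * np.arange(64) / 64)
mag = magnitude_spectrum(test_frame)
```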
3.1.7. Mel Filter
For calculating the MFCC, first a transformation is applied according to the following
formula:
$$\mathrm{Mel}(x) = 2595 \log_{10}\left(1 + \frac{x}{700}\right)$$
where x is the linear frequency.
Then, a filter bank is applied to the amplitude of the mel-scaled spectrum.
The Mel frequency warping is most conveniently done by utilizing a filter bank with filters
centered according to Mel frequencies. The width of the triangular filters varies according to
the Mel scale, so that the log total energy in a critical band around the center frequency is
included. The centers of the filters are uniformly spaced on the Mel scale.

Figure 3.9: Equally spaced Mel values
The result of the Mel filter bank is information about the distribution of energy in each
Mel-scale band. We obtain a vector of outputs, one from each filter, from which the 12
cepstral coefficients are later derived.

Figure 3.10: Triangular filter bank in frequency scale
We have used 30 filters in the filter bank.

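The output of each filter is computed from the spectral bins as follows:
$$fbank_k = \sum_{i=cbin_{k-1}}^{cbin_k} \frac{i - cbin_{k-1} + 1}{cbin_k - cbin_{k-1} + 1}\, bin_i + \sum_{i=cbin_k+1}^{cbin_{k+1}} \left(1 - \frac{i - cbin_k}{cbin_{k+1} - cbin_k + 1}\right) bin_i, \quad k = 1, 2, \ldots, NF-1$$
where
$$cbin_i = \mathrm{round}\left(\frac{f_{c_i}}{f_s} N\right)$$
$$cbin_0 = \mathrm{round}\left(\frac{f_{start}}{f_s} N\right), \quad cbin_{24} = \mathrm{round}\left(\frac{f_s/2}{f_s} N\right) = \frac{N}{2}$$
$$f_{c_i} = \mathrm{Mel}^{-1}\left(\mathrm{Mel}\{f_{start}\} + \frac{\mathrm{Mel}\{f_s/2\} - \mathrm{Mel}\{f_{start}\}}{NF}\, i\right), \quad i = 1, 2, 3, \ldots, NF-1$$
$$\mathrm{Mel}(x) = 2595 \log_{10}\left(1 + \frac{x}{700}\right)$$
$$x = \mathrm{Mel}^{-1}(mel) = 700 \left(10^{mel/2595} - 1\right)$$
Here $bin_i$ is the magnitude of the i-th DFT bin, $f_s$ is the sampling frequency and N is the
DFT length.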
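The filter bank computation can be sketched as below. This is a reading of the formulas under stated assumptions, not the project's implementation: NF is taken to be the number of filters plus one (so 30 filters correspond to NF = 31), the upper edge index is NF, the start frequency defaults to 0 Hz, and the function names are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank_outputs(bins, sample_rate, n_fft, NF, f_start=0.0):
    """Outputs of NF - 1 triangular filters applied to DFT magnitudes.

    bins must hold at least n_fft // 2 + 1 magnitudes (bins 0 .. N/2).
    """
    bins = np.asarray(bins, dtype=float)
    mel_lo, mel_hi = hz_to_mel(f_start), hz_to_mel(sample_rate / 2.0)
    # Edge and center frequencies, uniformly spaced on the Mel scale.
    centers_hz = mel_to_hz(mel_lo + (mel_hi - mel_lo) / NF * np.arange(NF + 1))
    cbin = np.round(centers_hz / sample_rate * n_fft).astype(int)

    out = np.zeros(NF - 1)
    for k in range(1, NF):
        lo, mid, hi = cbin[k - 1], cbin[k], cbin[k + 1]
        rise = np.arange(lo, mid + 1)        # rising edge, weight 1 at the center
        fall = np.arange(mid + 1, hi + 1)    # falling edge
        out[k - 1] = (
            np.sum((rise - lo + 1) / (mid - lo + 1) * bins[rise])
            + np.sum((1.0 - (fall - mid) / (hi - mid + 1)) * bins[fall])
        )
    return out

# Flat spectrum of a 512-point DFT at 22050 Hz, 30 filters.
fbank = mel_filterbank_outputs(np.ones(257), 22050, 512, NF=31)
```

With a flat input spectrum the outputs grow toward higher filters, reflecting the increasing filter width in Hertz.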
3.1.8. Cepstrum by Inverse Discrete Fourier Transform
The cepstrum transform is applied to the filter outputs in order to obtain the MFCC features
of each frame. The triangular filter outputs Y(i), i = 1, 2, ..., M, are compressed using the
logarithm, and the discrete cosine transform (DCT) is applied. Here, M is equal to the number
of filters in the filter bank, i.e., 30.
$$c[n] = \sum_{i=1}^{M} \log\big(Y(i)\big) \cos\left(\frac{\pi n}{M}\left(i - \frac{1}{2}\right)\right)$$
where c[n] is the MFCC vector for each frame.
The resulting vector is called the Mel-frequency cepstrum (MFC), and the individual
components are the Mel-frequency cepstral coefficients (MFCCs). We extracted 12 features
from each speech frame.
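The log compression and cosine transform can be sketched as follows. Taking the 12 coefficients as n = 1 to 12 (leaving out n = 0) and flooring the filter outputs before the logarithm are assumptions; the name `mfcc_from_filterbank` is illustrative.

```python
import numpy as np

def mfcc_from_filterbank(fbank, n_ceps=12):
    """c[n] = sum_i log(Y(i)) * cos(pi * n / M * (i - 1/2)), n = 1..n_ceps."""
    fbank = np.asarray(fbank, dtype=float)
    M = len(fbank)
    log_y = np.log(np.maximum(fbank, 1e-12))   # assumption: floor avoids log(0)
    i = np.arange(1, M + 1)
    n = np.arange(1, n_ceps + 1)[:, None]
    basis = np.cos(np.pi * n / M * (i - 0.5))  # shape (n_ceps, M)
    return basis @ log_y

# A flat filter bank output carries no spectral shape, so all coefficients vanish.
flat_ceps = mfcc_from_filterbank(np.full(30, 2.0))
```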
3.1.9. Post Processing
3.1.9.1 Cepstral Mean Subtraction (CMS)
A speech signal may be subjected to some channel noise when recorded, also referred to as
the channel effect. A problem arises if the channel effect when recording training data for a
given person is different from the channel effect in later recordings when the person uses the
system: a false distance between the training data and the newly recorded data is introduced.
The channel effect is removed by subtracting the mean Mel-cepstrum coefficients from the
Mel-cepstrum coefficients of each frame:
$$mc_j(q) = C_j(q) - \frac{1}{M} \sum_{i=1}^{M} C_i(q), \quad q = 1, 2, \ldots, 12$$
where $C_j(q)$ is the q-th coefficient of frame j and M is the number of frames.
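In array form, cepstral mean subtraction is a single operation over the frame axis. The sketch assumes the coefficients of an utterance are stored one frame per row; `cepstral_mean_subtraction` is an illustrative name.

```python
import numpy as np

def cepstral_mean_subtraction(ceps):
    """Subtract the per-coefficient mean over all frames (rows)."""
    ceps = np.asarray(ceps, dtype=float)
    return ceps - ceps.mean(axis=0, keepdims=True)

normalised = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 6.0]])
```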
3.1.9.2 The energy feature
The energy in a frame is the sum over time of the power of the samples in the frame; thus for
a signal x in a window from time sample $t_1$ to time sample $t_2$, the energy is:
$$Energy = \sum_{t=t_1}^{t_2} x^2[t]$$
3.1.9.3 Delta feature
Another important fact about the speech signal is that it is not constant from frame to frame.
Co-articulation (the influence of a speech sound on an adjacent or nearby speech sound)
can provide a useful cue for phone identity. It can be preserved by using delta features.
Velocity (delta) and acceleration (delta-delta) coefficients are usually obtained from the
static window-based information. The delta and delta-delta coefficients model the speed and
acceleration of the variation of the cepstral feature vectors across adjacent windows.
A simple way to compute deltas would be just to compute the difference between frames;
thus the delta value d(t) for a particular cepstral value c(t) at time t can be estimated as:
$$d(t) = \Delta c_k[i] = c_{k+M}[i] - c_{k-M}[i]$$
The differencing method is simple, but since it acts as a high-pass filtering operation in the
parameter domain, it tends to amplify noise. The solution to this is linear regression, i.e.,
fitting a first-order polynomial; the least-squares solution is easily shown to be of the
following form:
$$\Delta c_k[i] = \frac{\sum_{m=-M}^{M} m\, c_{k+m}[i]}{\sum_{m=-M}^{M} m^2}$$
where M is the regression window size. We used M = 4.
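The regression formula can be sketched as below for a matrix with one frame per row. Repeating the first and last frames to fill the regression window at the edges is an assumption (the report does not say how edges were handled), and `delta` is an illustrative name.

```python
import numpy as np

def delta(features, M=4):
    """Regression-based delta coefficients over a window of +/- M frames."""
    features = np.asarray(features, dtype=float)
    T = len(features)
    # Assumption: edge frames are repeated so every frame has M neighbours.
    padded = np.pad(features, ((M, M), (0, 0)), mode="edge")
    denom = 2.0 * sum(m * m for m in range(1, M + 1))   # sum of m^2 over -M..M
    out = np.zeros_like(features)
    for m in range(1, M + 1):
        # m * c[k+m] + (-m) * c[k-m]
        out += m * (padded[M + m:M + m + T] - padded[M - m:M - m + T])
    return out / denom

# A feature that rises by 3 per frame has a delta of 3 away from the edges.
ramp = 3.0 * np.arange(20)[:, None] * np.ones((1, 2))
d = delta(ramp)
```

Applying the same function to the delta features yields the delta-delta (acceleration) features.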

3.1.9.4 Composition of Feature Vector
We calculated 39 Features from each frame
12 MFCC Features
12 Delta MFCC
12 Delta-Delta MFCC
1 Energy Feature
1 Delta Energy Feature
1 Delta-Delta Energy Feature
3.2 GMM Implementation
It is also important to note that because the component Gaussians act together to
model the overall pdf, full covariance matrices are not necessary even if the features are not
statistically independent: a linear combination of diagonal-covariance Gaussians is
capable of modeling the correlations between feature vector elements. In addition, the use of
diagonal covariance matrices greatly reduces the computational complexity. Hence in our
project, the m-th covariance matrix is
$$C_m = \mathrm{diag}(\sigma^2_{m,1}, \sigma^2_{m,2}, \ldots, \sigma^2_{m,K})$$
where
$\sigma^2_{m,j}$, j = 1, 2, ..., K, are the diagonal elements or variances,
K = number of features in each feature vector.
The effect of using a set of M full-covariance Gaussians can be compensated for by using a
larger set of diagonal-covariance Gaussians (M = 16 in our case). M = 16 is reported in the
literature to work well for speaker modeling.
The component pdfs can now be expressed as
$$g_m(x \mid \mu_m, C_m) = \frac{1}{\sqrt{(2\pi)^K \prod_{j=1}^{K} \sigma^2_{m,j}}} \exp\left\{ -\frac{1}{2} \sum_{j=1}^{K} \left(\frac{x_j - \mu_{m,j}}{\sigma_{m,j}}\right)^2 \right\}$$
where
$\mu_{m,j}$ are the elements of the m-th mean vector $\mu_m$.
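A diagonal-covariance GMM density can be evaluated in the log domain as sketched below. The array layout (one frame per row, one component per row of the mean and variance arrays), the log-sum-exp step and the function names are assumptions for illustration.

```python
import numpy as np

def log_component_densities(X, means, variances):
    """log g_m(x_t) for every frame t and component m; result shape (T, M).

    X: (T, K) frames, means: (M, K), variances: (M, K) diagonal entries.
    """
    K = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]
    log_norm = -0.5 * (K * np.log(2.0 * np.pi) + np.log(variances).sum(axis=1))
    return log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)

def gmm_log_density(X, weights, means, variances):
    """log p(x_t | lambda) for every frame, using log-sum-exp for stability."""
    log_terms = np.log(weights)[None, :] + log_component_densities(X, means, variances)
    top = log_terms.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(log_terms - top).sum(axis=1, keepdims=True)))[:, 0]

# One-dimensional standard normal evaluated at 0: log density = -0.5 * log(2*pi).
value = gmm_log_density(np.zeros((1, 1)), np.array([1.0]),
                        np.zeros((1, 1)), np.ones((1, 1)))[0]
```

Averaging the returned values over the frames of an utterance gives an average log-likelihood score for that utterance.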
3.2.1. Block diagram of GMM based Speaker Recognition System
[Block diagram: Speech, Feature Extraction; Enrollment: Model training, Model DB;
Verification: Matching, Decision, Accepted / Rejected]
Figure 3.11: Block diagram of GMM based Speaker Recognition System
3.2.2. GMM Training
Given the training speech from a speaker, the goal of speaker model training is to estimate
the parameters of the GMM that best match the distribution of the training feature vectors,
and hence to develop a robust model for the speaker. Of the several techniques available for
estimating the parameters of a GMM, the most popular is Maximum Likelihood (ML)
estimation using the Expectation-Maximization (EM) algorithm.
EM is a well-established maximum likelihood algorithm for fitting a mixture model to a set of
training data. It requires an a priori selection of the model order (the number M of components
to be incorporated into the model) and an initial estimate of the parameters before iterating
through the training.
The aim of ML estimation is to maximize the likelihood of the GMM given the
training data. Under the assumption of independent feature vectors, the likelihood of the GMM
for the sequence of T training vectors $X = \{x_1, x_2, \ldots, x_T\}$ can be written as
$$P(X \mid \lambda) = \prod_{t=1}^{T} p(x_t \mid \lambda)$$
In practice, the above computation is done in the log domain to avoid underflow: instead
of multiplying lots of very small probabilities, we add their logarithms.
Thus, the log-likelihood of a model $\lambda$ for a sequence of feature vectors
$X = \{x_1, x_2, \ldots, x_T\}$ is computed as follows:
$$\log P(X \mid \lambda) = \frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid \lambda)$$
Note that in the above equation the average log-likelihood value is used, so as to normalize
out duration effects. Also, since the incorrect assumption of independence underestimates the
actual likelihood value with dependencies, scaling by T can be considered a rough
compensation factor.
Direct maximization of this likelihood function is not possible, as it is a non-linear
function of the parameters $\lambda$. So, the likelihood function is maximized using the
Expectation-Maximization algorithm.
The basic idea of the EM algorithm is, beginning with an initial model $\lambda^{i}$, to
estimate a new model $\lambda^{i+1}$ such that $P(X \mid \lambda^{i+1}) \ge P(X \mid \lambda^{i})$.
The new model $\lambda^{i+1}$ then becomes the initial model for the next iteration, and the
process is repeated until some convergence threshold is reached, i.e.,
$P(X \mid \lambda^{i+1}) - P(X \mid \lambda^{i}) < \epsilon$.
3.2.2.1 The Expectation Maximization Algorithm
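On each iteration, the following formulas are used to estimate the parameters of the new model
$\lambda^{i+1}$; they guarantee a monotonic increase in the likelihood of the model.
Mixture weights:
$$w_m^{i+1} = \frac{1}{T} \sum_{t=1}^{T} \gamma(m \mid x_t, \lambda^{i})$$
Means:
$$\mu_m^{i+1} = \frac{\sum_{t=1}^{T} \gamma(m \mid x_t, \lambda^{i})\, x_t}{\sum_{t=1}^{T} \gamma(m \mid x_t, \lambda^{i})}$$
Variances:
$$\left(\sigma_m^2\right)^{i+1} = \frac{\sum_{t=1}^{T} \gamma(m \mid x_t, \lambda^{i})\, x_t^2}{\sum_{t=1}^{T} \gamma(m \mid x_t, \lambda^{i})} - \left(\mu_m^{i+1}\right)^2$$
where $\gamma(m \mid x_t, \lambda^{i})$ is the probability that the observation $x_t$ was drawn
from the m-th component, and is given by
$$\gamma(m \mid x_t, \lambda^{i}) = \frac{w_m\, g_m(x_t \mid \mu_m, C_m)}{\sum_{l=1}^{M} w_l\, g_l(x_t \mid \mu_l, C_l)}$$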
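One EM iteration for a diagonal-covariance GMM can be sketched as follows. It follows the re-estimation formulas above, with squares taken element-wise; the variance floor, the array layout and the name `em_step` are assumptions for illustration rather than the project's code.

```python
import numpy as np

def em_step(X, weights, means, variances, var_floor=1e-6):
    """One EM iteration. X: (T, K); means, variances: (M, K); weights: (M,).

    Returns the new parameters and the average log-likelihood of X under
    the parameters that were passed in.
    """
    T, K = X.shape
    diff = X[:, None, :] - means[None, :, :]
    log_g = -0.5 * (K * np.log(2.0 * np.pi)
                    + np.log(variances).sum(axis=1)
                    + (diff ** 2 / variances[None, :, :]).sum(axis=2))   # (T, M)
    log_wg = np.log(weights) + log_g
    top = log_wg.max(axis=1, keepdims=True)
    log_px = top + np.log(np.exp(log_wg - top).sum(axis=1, keepdims=True))

    # E-step: gamma[t, m] = probability that x_t was drawn from component m.
    gamma = np.exp(log_wg - log_px)
    n_m = gamma.sum(axis=0)

    # M-step: re-estimate weights, means and (diagonal) variances.
    new_weights = n_m / T
    new_means = (gamma.T @ X) / n_m[:, None]
    new_vars = (gamma.T @ X ** 2) / n_m[:, None] - new_means ** 2
    new_vars = np.maximum(new_vars, var_floor)   # assumption: simple variance floor
    return new_weights, new_means, new_vars, float(log_px.mean())
```

Calling `em_step` repeatedly and stopping when the returned average log-likelihood changes by less than a small threshold gives the training loop described in this section.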

3.2.2.2 Estimation of initial parameters for training:
Commonly, the initializations of the GMM parameters are done as follows:
Mixture weights: 1/mixtureDimension
Mean: Random feature vector from training data
Covariance: generally initialized to 1.
However, it is important to initialize the covariance matrices with rather large variances, to
reduce the risk that the EM training gets stuck in a local maximum. K-Means can be used to
obtain a good initial estimate of the covariance matrix.
We used the method found in [Ref 1], because it is easy to compute.
To set reasonable values for the covariance matrices, we need an estimate of the covariance
of the whole training set, $C_{data}$. Start by estimating the mean of the training data:
$$\mu_{data} = \frac{1}{T} \sum_{t=1}^{T} x_t$$
The j-th diagonal element of $C_{data}$ can be estimated as
$$\sigma^2_{data,j} = \frac{1}{T} \sum_{t=1}^{T} \left(x_{t,j} - \mu_{data,j}\right)^2$$
A measure of the volume that the training data occupies can be given by
$$V_{data} = \sqrt{\prod_{j=1}^{K} \sigma^2_{data,j}}$$
Finally, the covariance can be calculated as
$$\sigma^2_{m,j} = \left( \left( \frac{V_{data}}{M} \right)^{1/K} \right)^2$$
A minimum covariance (threshold) value, derived from the same quantities ($V_{data}$, M and K),
is used to avoid NaN (Not a Number) errors during EM iterations.
Covariance limiting was done as calculated above for each mixture. For simplicity we
initialized the covariance values to be the same for all Gaussian components.

For Training the GMM parameters we used the following constants:
Number of Iterations:
MINIMUM_ITERATION =100;
MAXIMUM_ITERATION =500;
And
Minimum log likelihood change for Convergence:
LOGLIKELIHOOD_CHANGE =0.000001;
3.2.3. Verification
After training, we have a complete model (GMM) for each speaker. The speaker
verification task is a hypothesis testing problem where, based on the input speech
observations, it must be decided whether the claimed identity of the speaker is correct or not.
So, the hypothesis test can be set up as:
$H_0$: the speaker is the claimed speaker
$H_1$: the speaker is an imposter
The decision between these two hypotheses is based on the likelihood ratio given by
$$\frac{P(X \mid \lambda)}{P(X \mid \bar{\lambda})}$$
where $P(X \mid \lambda)$ is the likelihood that the utterance was produced by the claimed
speaker's model $\lambda$, while $P(X \mid \bar{\lambda})$ is the likelihood that the utterance
was produced by the imposter model $\bar{\lambda}$.
Here, the imposter model $\bar{\lambda}$, also called the Universal Background Model (UBM), is
obtained by training on a collection of speech samples from a large number of speakers,
representative of the population of speakers.
The likelihood ratio is often expressed in logarithmic form as
$$\Lambda(X) = \log\left(\frac{P(X \mid \lambda)}{P(X \mid \bar{\lambda})}\right) = \log P(X \mid \lambda) - \log P(X \mid \bar{\lambda})$$
The decision is made as follows:
If $\Lambda(X) < \Lambda_T$, reject the null hypothesis, i.e., the speaker is an imposter.
If $\Lambda(X) > \Lambda_T$, accept the null hypothesis, i.e., the speaker is the claimed one.
where the threshold value $\Lambda_T$ is set in such a way that the error of the system is
minimum, so that, ideally, true claimants are accepted and false claimants are rejected.
3.2.4. Performance measure of Speaker Verification System
In general, the performance of the speaker verification system is determined by the False
Rejection Rate (FRR) and the False Acceptance Rate (FAR).
1) False Rejection Rate (FRR)
FRR is the measure of the likelihood that the system will incorrectly reject an access
attempt by an authorized user. A system's FRR is typically stated as the ratio of the number
of false rejections to the number of verification tests.
2) False Acceptance Rate (FAR)
FAR is the measure of the likelihood that the system will incorrectly accept an access attempt
by an unauthorized user. A system's FAR is usually stated as the ratio of the number of false
acceptances to the number of verification tests.
The Total Error Rate (TER) is the combination of the false rejection and false acceptance
rates, and the requirement of the system is to minimize it. These errors depend on the choice
of threshold value used during verification. At lower threshold values FAR is predominant,
while at higher threshold values FRR is predominant. This dependency of the two errors can be
seen in the figure below. At a certain threshold value the two error rates are equal (the
Equal Error Rate, EER) and the TER is at or near its minimum.
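Given log-likelihood-ratio scores from genuine and imposter trials, the two error rates and an approximate equal-error threshold can be sketched as below. The scores are made-up example data and the function names are illustrative; here each rate is computed over its own class of trials (genuine trials for FRR, imposter trials for FAR), which is an assumption about how the counts are normalised.

```python
import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    """FAR and FRR when a claim is accepted for score > threshold."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    far = float(np.mean(impostor > threshold))   # imposters wrongly accepted
    frr = float(np.mean(genuine <= threshold))   # true claimants wrongly rejected
    return far, frr

def equal_error_threshold(genuine_scores, impostor_scores):
    """Candidate threshold at which |FAR - FRR| is smallest."""
    candidates = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    gaps = []
    for t in candidates:
        far, frr = far_frr(genuine_scores, impostor_scores, t)
        gaps.append(abs(far - frr))
    return float(candidates[int(np.argmin(gaps))])

genuine = [2.0, 1.5, 0.2, 3.0]      # illustrative scores of true claimants
impostor = [-1.0, 0.5, -2.0, -0.3]  # illustrative scores of imposters
eer_threshold = equal_error_threshold(genuine, impostor)
```

Raising the threshold trades false acceptances for false rejections, which is the dependency described above.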

Figure 3.12: Equal Error Rate (EER)


3.3 Implementation of HMM for Speech Recognition
The basic block diagram for isolated word recognition is given below:
[Figure 3.13: Speech recognition algorithm flow. Blocks shown: Speech Signal, Pre-process, MFCC
Features, K-means Clustering, CODEBOOK, Vector Quantization (VQ), Discrete Observation Sequence,
Baum-Welch Algorithm, HMM Model $\lambda = (A, B, \pi)$, Viterbi Algorithm, HMM Recognition,
Recognition Result]
In order to do isolated word speech recognition, we must perform the following:
1) The codebook is generated from the feature vectors of the training data, and vector
quantization uses the codebook to map each feature vector to a discrete observation
symbol.
2) For each word v in the vocabulary, an HMM $\lambda_v$ is built, i.e., we must estimate the
model parameters $(A, B, \pi)$ that optimize the likelihood of the training set observation
vectors for the vth word. In order to make reliable estimates of all model parameters,
multiple observation sequences must be used. The Baum-Welch algorithm is used for
estimation of the HMM parameters.
3) For each unknown word to be recognized, the observation sequence
$O = \{O_1, O_2, \ldots, O_T\}$ is measured via feature analysis of the speech corresponding
to the word; the model likelihoods $P(O \mid \lambda_v)$, $1 \le v \le V$, are calculated for
all possible models; and the word whose model likelihood is highest is selected:
$$v^* = \arg\max_{1 \le v \le V} P(O \mid \lambda_v)$$
The probability computation step is performed using the Viterbi algorithm and
requires on the order of $V \cdot N^2 \cdot T$ computations.
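The final selection in step 3 amounts to an argmax over the per-word model scores. A minimal sketch, assuming the (log-)likelihoods of the unknown utterance under each word model have already been computed; the function name and the scores are hypothetical, and the words are taken from Appendix B:

```python
def recognize(word_scores):
    """Return the vocabulary word whose HMM gives the highest (log-)likelihood."""
    return max(word_scores, key=word_scores.get)


# Hypothetical log-likelihoods of one utterance under three word models
scores = {"Computer": -42.0, "Control": -57.5, "Category": -61.2}
best_word = recognize(scores)
```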


Figure 3.14: Pronunciation model of word TOMATO
The above figure shows the pronunciation model of the word "tomato". The circles represent the
states and the numbers above the arrows represent transition probabilities. The pronunciation
of the same word may differ from person to person; the figure reflects two
pronunciation styles for the same word. So, in order to best model each word, we
need to train the word on as large a set of speakers as possible, so that the model covers the
variation in pronunciation for that word.
Vector Quantization:
HMM is used in speech recognition because a speech signal can be viewed as a piecewise
stationary signal or a short-time stationary signal: over a short time, speech can be approximated
as a stationary process. Each acoustic feature vector represents information such as the
amount of energy in different frequency bands at a particular point in time. The observation
sequence for speech recognition is a sequence of acoustic feature vectors (MFCC vectors)
and the phonemes are the hidden states. One way to make MFCC vectors look like symbols
that we could count is to build a mapping function that maps each input vector into one of a
small number of symbols. This idea of mapping input vectors to discrete quantized symbols
is called vector quantization or VQ.
The type of HMM that models speech signals using the VQ technique to produce the
observations is called a Discrete Hidden Markov Model (DHMM). However, VQ
loses some information from the speech signal, even when the number of codewords is
increased. This loss is due to the quantization error (distortion). The distortion can be
reduced by increasing the number of codewords in the codebook, but it cannot be eliminated.
The long sequence of speech samples will be represented by a stream of indices representing
frames of different window lengths. Hence, VQ is considered a process of redundancy
removal, which minimizes the number of bits required to identify each frame of the speech
signal. In vector quantization, we create the small symbol set by mapping each training
feature vector into a small number of classes, and then we represent each class by a discrete
symbol. More formally, a vector quantization system is characterized by a codebook, a
clustering algorithm, and a distance metric.
A codebook is a list of possible classes, a set of symbols constituting features
$F = \{f_1, f_2, \ldots, f_n\}$. All feature vectors from the training speech data are clustered into
256 classes, thereby generating a codebook with 256 centroids with the help of the K-means
clustering technique. Vector Quantization (VQ) is used to get a discrete observation sequence
from the input feature vectors by applying the distance metric to the codebook.

Figure 3.15: Vector Quantization
As shown in the above figure, to make the feature vectors discrete, each incoming feature
vector is compared with each of the 256 prototype vectors in the codebook. The closest one
(by Euclidean distance) is selected, and the input vector is replaced by the
index of the corresponding centroid in the codebook. In this way all continuous input feature
vectors are quantized to a discrete set of symbols.
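The codebook training and quantization steps can be sketched as follows. This is a plain K-means illustration in pure Python, not the project's code: the function names, the evenly spaced initialization, the fixed iteration count and the tiny two-centroid example are assumptions made here for readability (the system described above uses 256 centroids and MFCC vectors).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, vector):
    """Index of the centroid closest to the vector."""
    return min(range(len(codebook)), key=lambda k: squared_distance(codebook[k], vector))


def train_codebook(vectors, size, iterations=20):
    """Plain K-means returning `size` centroids learned from the training vectors."""
    # Assumed initialization: evenly spaced training vectors serve as starting centroids.
    codebook = [list(vectors[(k * len(vectors)) // size]) for k in range(size)]
    for _ in range(iterations):
        clusters = [[] for _ in range(size)]
        for v in vectors:
            clusters[nearest(codebook, v)].append(v)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def quantize(codebook, vectors):
    """Replace each feature vector by the index of its nearest centroid."""
    return [nearest(codebook, v) for v in vectors]


# Hypothetical 2-dimensional "feature vectors" forming two groups
training = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
codebook = train_codebook(training, size=2)
symbols = quantize(codebook, training)
```

The resulting list of indices is the discrete observation sequence that a discrete HMM consumes.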

3.3.1. Isolated Word Recognition
For isolated word recognition with a distinct HMM designed for each word in the vocabulary,
a left-right model is more appropriate than an ergodic model, since we can then associate
time with model states in a fairly straightforward manner. Furthermore we can envision the
physical meaning of the model states as distinct sounds (e.g., phonemes, syllables) of the
word being modeled.
The issue of the number of states to use in each word model leads to two schools of thought.
One idea is to let the number of states correspond roughly to the number of sounds
(phonemes) within the word; hence models with 2 to 10 states would be appropriate.
The other idea is to let the number of states correspond roughly to the average number of
observations in a spoken version of the word. In this manner each state corresponds to an
observation interval, i.e., about 15 ms for the analysis we use. The former approach is used
in our speech recognition system. Furthermore, we restrict each word model to have the same
number of states; this implies that the models will work best when they represent words with
the same number of sounds.

Figure 3.16: Average word error rate (for a digits vocabulary) versus the number of states N in the HMM
The figure above shows a plot of average word error rate versus N, for the case of recognition of
isolated digits (i.e., a 10-word vocabulary). It can be seen that the error is somewhat
insensitive to N, achieving a local minimum at N=6; however, differences in error rate for
values of N close to 6 are small.
The next issue is the choice of observation vector and the way it is represented. Since we are
representing an entire region of the vector space by a single vector, a distortion penalty is
associated with VQ. It is advantageous to keep the distortion penalty as small as possible.
However, this implies a large codebook, and that leads to problems in implementing
HMMs with a large number of parameters. Although the distortion steadily decreases as the
codebook size M increases, only small decreases in distortion accrue beyond a value of M=32.
Hence HMMs with codebook sizes from M=32 to 256 vectors have been used in speech recognition
experiments. For the discrete symbol models we have used a codebook with M=256 codewords to
generate the discrete symbols.

Figure 3.17: Curve showing tradeoff of VQ average distortion as a function of the size of the VQ, M (shown on
a log scale)
Another main issue is the initialization of the HMM parameters. The parameters that constitute
any model are $\pi$, A, and B. The value of $\pi$ is given by $\pi = [1\ 0\ 0\ 0\ 0\ \ldots\ 0]$, because the
left-right model of HMM is used in our speech recognition system, which always starts in the
first state and ends in the last state. Random values between 0 and 1 are assigned as the
initial values of the elements of the A and B parameters.
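A possible initialization along these lines is sketched below. The function name and the fixed seed are hypothetical, and two details are assumptions added here rather than statements from the text: each row of A and B is normalized so that it sums to 1, and transitions to earlier states are set to zero to reflect the left-right structure.

```python
import random


def init_left_right_hmm(num_states, num_symbols, seed=0):
    """Return (pi, A, B) for a left-right discrete HMM with random initial A and B."""
    rng = random.Random(seed)
    pi = [1.0] + [0.0] * (num_states - 1)  # always start in the first state

    def random_row(length, allowed):
        # Random values in [0, 1) where allowed, zero elsewhere, then normalized (assumption).
        raw = [rng.random() if allowed(k) else 0.0 for k in range(length)]
        total = sum(raw)
        return [x / total for x in raw]

    # Assumption: state i may only stay or move forward (j >= i).
    A = [random_row(num_states, lambda j, i=i: j >= i) for i in range(num_states)]
    B = [random_row(num_symbols, lambda k: True) for _ in range(num_states)]
    return pi, A, B


pi, A, B = init_left_right_hmm(num_states=4, num_symbols=8)
```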
3.3.2. Application of HMM
Given the form of HMM, there are three basic problems of interest that must be solved for the
model to be useful in real-world applications. These problems are the following:

3.3.2.1 Evaluation Problem: Calculating Parameters
Given the observation sequence $O = O_1 O_2 \ldots O_T$ and the Markov model $\lambda = (A, B, \pi)$,
how do we efficiently compute $P(O \mid \lambda)$, the probability of the observation sequence given
the model?
Solution:
The aim of this problem is to find the probability of the observation sequence
$O = (O_1, O_2, \ldots, O_T)$ given the model $\lambda$, i.e. $P(O \mid \lambda)$. Because the
observations produced by states are assumed to be independent of each other and of the time t,
the probability of the observation sequence $O = (O_1, O_2, \ldots, O_T)$ being generated by a
certain state sequence q can be calculated as a product:
$$P(O \mid q, B) = b_{q_1}(O_1) \cdot b_{q_2}(O_2) \cdots b_{q_T}(O_T)$$
And the probability of the state sequence q can be found as:
$$P(q \mid A, \pi) = \pi_{q_1} \cdot a_{q_1 q_2} \cdot a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$$
The aim was to find $P(O \mid \lambda)$, and this probability of O (given the model $\lambda$) is
obtained by summing the joint probability over all possible state sequences q, giving:
$$P(O \mid \lambda) = \sum_{\text{all } q} P(O \mid q, B) \cdot P(q \mid A, \pi)$$
This direct computation has one major drawback: it is infeasible due to the exponential
growth of computations as a function of the sequence length T. To be precise, it needs
$(2T-1)N^T$ multiplications and $N^T - 1$ additions. An excellent tool which cuts the
computational requirements to linear, relative to T, is the well-known forward algorithm. The
forward algorithm needs $N(N+1)(T-1)+N$ multiplications and $N(N-1)(T-1)$ additions.
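To make the cost of the direct computation concrete, the sum over all state sequences can be written out literally. The sketch below is for illustration only; the function name and the small two-state model are hypothetical, observations are assumed to be symbol indices, and the loop visits all $N^T$ paths, so it is usable only for very short sequences.

```python
from itertools import product


def likelihood_by_enumeration(pi, A, B, observations):
    """P(O | model) obtained by summing over every possible state sequence."""
    n_states = len(pi)
    total = 0.0
    for path in product(range(n_states), repeat=len(observations)):
        # Joint probability of this state path and the observations
        p = pi[path[0]] * B[path[0]][observations[0]]
        for t in range(1, len(observations)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][observations[t]]
        total += p
    return total


# Hypothetical 2-state, 2-symbol model
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
p = likelihood_by_enumeration(pi, A, B, [0, 1])
```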
Forward Algorithm
Initially, consider a new forward probability variable $\alpha_t(i)$, at instant t and state i,
defined as:
$$\alpha_t(i) = P(O_1, O_2, O_3, \ldots, O_t, q_t = S_i \mid \lambda)$$
This probability function can be solved for N states and T observations iteratively:
Step 1: Initialization
$$\alpha_1(i) = \pi_i \, b_i(O_1), \quad 1 \le i \le N$$


Figure 3.18: Forward Procedure - Induction Step
Step 2: Induction
$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(O_{t+1}), \quad 1 \le t \le T-1, \quad 1 \le j \le N$$
Step 3: Termination
$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$
This stage is just the sum of all the values of the probability function $\alpha_T(i)$ over all the
states at instant T. This sum represents the probability of the given observations being generated
by the given model, that is, how likely the given model is to produce the given observations.
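The three steps translate almost line by line into code. The following is a minimal unscaled sketch, not the project's implementation; the function name and the two-state example model are hypothetical, and observations are assumed to be codebook indices.

```python
def forward(pi, A, B, observations):
    """Return (alpha, P(O | model)); alpha[t][i] is the forward variable at time t, state i."""
    n = len(pi)
    # Step 1: initialization
    alpha = [[pi[i] * B[i][observations[0]] for i in range(n)]]
    # Step 2: induction
    for t in range(1, len(observations)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][observations[t]]
                      for j in range(n)])
    # Step 3: termination
    return alpha, sum(alpha[-1])


# Hypothetical 2-state, 2-symbol model
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
alpha, likelihood = forward(pi, A, B, [0, 1])
```

Each time step costs on the order of $N^2$ operations, so the total work grows linearly with T.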

Backward Algorithm
This procedure is similar to the forward procedure, but it takes into consideration the state
flow in the backward direction, from the last observation, at instant T, to the first one,
at instant 1. That means that any state is reached from the states that come just
after it in time.
To formulate this approach, let us consider the backward probability function $\beta_t(i)$, which can
be defined as:
$$\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T \mid q_t = S_i, \lambda)$$

Figure 3.19: Backward Procedures - Induction Step
In analogy to the forward procedure, we can solve for $\beta_t(i)$ in the following two steps:
1 - Initialization:
$$\beta_T(i) = 1, \quad 1 \le i \le N$$
These initial values of $\beta$ for all states at instant T are arbitrarily selected.
2 - Induction:
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j), \quad t = T-1, T-2, \ldots, 1, \quad 1 \le i \le N$$
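A matching sketch of the backward procedure is given below, under the same assumptions as before (hypothetical function name and example model, observations given as symbol indices, no scaling). As a sanity check, combining the backward variables at the first instant with the initial distribution gives the same $P(O \mid \lambda)$ as the forward procedure.

```python
def backward(A, B, observations):
    """Return beta, where beta[t][i] is the backward variable at time t, state i."""
    n = len(A)
    T = len(observations)
    # Initialization: beta at the last instant is 1 for every state
    beta = [[1.0] * n for _ in range(T)]
    # Induction: t = T-1, ..., 1 in the text's 1-based notation
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][observations[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    return beta


# Hypothetical 2-state, 2-symbol model
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1]
beta = backward(A, B, obs)
likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
```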


3.3.2.2 Decoding Problem: Finding the best path
Given the observation sequence $O = O_1 O_2 \ldots O_T$ and the Markov model $\lambda = (A, B, \pi)$, find
the optimal state sequence $q = q_1 q_2 \ldots q_T$.
Solution:
The problem is to find the optimal sequence of states, given the observation sequence and the
model. This means that we have to find the optimal state sequence
$Q = (q_1, q_2, q_3, \ldots, q_{T-1}, q_T)$ associated with the given observation sequence
$O = (O_1, O_2, O_3, \ldots, O_{T-1}, O_T)$ presented to the model $\lambda = (A, B, \pi)$. The
criterion of optimality here is to search for a single best state sequence through a modified
dynamic programming technique called the Viterbi Algorithm.
To explain the Viterbi Algorithm, the probability quantity $\delta_t(i)$ is defined, which represents
the maximum probability along the best probable state sequence path of a given observation
sequence after t instants and being in state i. This quantity can be defined mathematically by:
$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P[q_1, q_2, \ldots, q_{t-1}, q_t = S_i, O_1 O_2 \ldots O_t \mid \lambda]$$
The best state sequence is backtracked by another function $\psi_t(j)$. The complete algorithm can
be described by the following steps:
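Step 1: Initialization:
$$\delta_1(i) = \pi_i \, b_i(O_1), \quad 1 \le i \le N$$
$$\psi_1(i) = 0$$
Step 2: Recursion:
$$\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}] \, b_j(O_t), \quad 2 \le t \le T, \quad 1 \le j \le N$$
$$\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}], \quad 2 \le t \le T, \quad 1 \le j \le N$$
Step 3: Termination:
$$P^* = \max_{1 \le i \le N} [\delta_T(i)]$$
$$q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)]$$
Step 4: Path (state sequence) backtracking:
$$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \ldots, 1$$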
The Viterbi Algorithm can also be used to calculate $P(O \mid \lambda)$ approximately, by using
$P^*$ instead.
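The four steps can be sketched as follows. This is an illustrative, unscaled version with a hypothetical function name and example model; states and observation symbols are assumed to be 0-based indices.

```python
def viterbi(pi, A, B, observations):
    """Return (P*, best_path) for a discrete HMM."""
    n = len(pi)
    # Step 1: initialization
    delta = [pi[i] * B[i][observations[0]] for i in range(n)]
    psi = []  # psi[t-1][j] is the best predecessor of state j at time t
    # Step 2: recursion
    for t in range(1, len(observations)):
        best_prev = [max(range(n), key=lambda i, j=j: delta[i] * A[i][j]) for j in range(n)]
        delta = [delta[best_prev[j]] * A[best_prev[j]][j] * B[j][observations[t]]
                 for j in range(n)]
        psi.append(best_prev)
    # Step 3: termination
    last = max(range(n), key=lambda i: delta[i])
    # Step 4: backtracking
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    path.reverse()
    return delta[last], path


# Hypothetical 2-state, 2-symbol model
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
p_star, best_path = viterbi(pi, A, B, [0, 1])
```

For this example $P^*$ is 0.105, while the full sum over all paths gives $P(O \mid \lambda) = 0.2156$, which shows in what sense $P^*$ is only an approximation.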

Figure 3.20: Viterbi Search
3.3.2.3 Training Problem: Estimating the Model Parameters
Given the observation sequence $O = O_1 O_2 \ldots O_T$, estimate the parameters of the model
$\lambda = (A, B, \pi)$ that maximize $P(O \mid \lambda)$.
Solution:
This problem deals with the training issue, which is the most difficult of the three
cases. The task is to adjust the model parameters $(A, B, \pi)$ according to a
certain optimality criterion. The Baum-Welch Algorithm (Forward-Backward Algorithm) is one of
the well-known techniques to solve the problem. It is an iterative method to estimate new
values for the model parameters. To explain the training procedure, first the a posteriori
probability function $\gamma_t(i)$ is defined, the probability of being in state i at instant t, given the
observation sequence O and the model $\lambda$, as:
$$\gamma_t(i) = P(q_t = S_i \mid O, \lambda)$$
$$\gamma_t(i) = \frac{P(O, q_t = S_i \mid \lambda)}{P(O \mid \lambda)}$$
$$\gamma_t(i) = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i) \, \beta_t(i)}$$

Then another probability function $\xi_t(i, j)$ is defined, the probability of being in state i at
instant t and going to state j at instant t+1, given the model $\lambda$ and the observation sequence O.
$\xi_t(i, j)$ can be mathematically defined as:
$$\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda)$$

Figure 3.21: Computation of $\xi_t(i, j)$
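From the definition of the forward and backward variables, we can write $\xi_t(i, j)$ in the form
$$\xi_t(i, j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \alpha_t(i) \, \beta_t(i)}$$
$$\xi_t(i, j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j)}$$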

The relation between $\gamma_t(i)$ and $\xi_t(i, j)$ can be easily deduced from their definitions:
$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$$

Now, if $\gamma_t(i)$ is summed over all instants (excluding instant T), we get the expected number of
times that state $S_i$ has been left, or equivalently the expected number of times this state has been
visited over those instants. On the other hand, if we sum $\xi_t(i, j)$ over all instants (excluding T),
we get the expected number of transitions that have been made from i to j.
From the behavior of $\gamma_t(i)$ and $\xi_t(i, j)$ the following re-estimations of the model parameters
can be deduced:
Initial state distribution:
$$\bar{\pi}_i = \gamma_1(i)$$
Transition probabilities:
$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$
Emission probabilities:
$$\bar{b}_j(k) = \frac{\text{expected number of times in state } s_j \text{ and observing symbol } v_k}{\text{expected number of times in state } s_j}$$
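One re-estimation pass over a single observation sequence can be sketched as below. This is an illustration only, with a hypothetical function name and example model: it uses unscaled forward and backward variables (so it is suitable only for short sequences), it assumes every state has non-zero occupancy, and it does not handle the multiple training sequences mentioned earlier.

```python
def baum_welch_step(pi, A, B, obs):
    """One re-estimation pass; returns (new_pi, new_A, new_B, likelihood under the old model)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    # Forward and backward variables
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                      for j in range(n)])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
    likelihood = sum(alpha[T - 1])
    # State occupancy (gamma) and transition (xi) posteriors
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # Re-estimation formulas
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) / sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) / sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]
    return new_pi, new_A, new_B, likelihood


# Hypothetical 2-state, 2-symbol model and training sequence
pi0 = [0.6, 0.4]
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.5, 0.5], [0.1, 0.9]]
sequence = [0, 1, 1, 0, 1, 0, 0, 1]
pi1, A1, B1, old_likelihood = baum_welch_step(pi0, A0, B0, sequence)
```

Repeating the step with the returned parameters never decreases the likelihood of the training sequence, which is the property that makes the iteration usable as a training procedure.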



3.3.3. Scaling

$\alpha_t(i)$ consists of the sum of a large number of terms. Since the transition matrix elements (a)
and emission matrix elements (b) are less than 1, each term of $\alpha_t(i)$ starts to head
exponentially to zero as t gets large. For large t, the dynamic range of the $\alpha_t(i)$ computation
will exceed the precision range of the computer (even in double precision). The remedy is a scaling
procedure, which is accomplished by multiplying $\alpha_t(i)$ and $\beta_t(i)$ by a scaling factor that is
independent of i (i.e., it depends only on t), with the goal of keeping the scaled $\alpha_t(i)$ within
the dynamic range of the computer for $1 \le t \le T$. At the end of the computation, the scaling
coefficients cancel out exactly.
When using the Viterbi Algorithm, no scaling is required if logarithms are used to give the
maximum likelihood state sequence.
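A scaled forward pass might look like the following sketch. The choice of scaling factor (the sum of the forward variables at each instant) and the function name are assumptions made for illustration. Instead of cancelling the coefficients, this version accumulates their logarithms, which yields $\log P(O \mid \lambda)$ directly.

```python
import math


def scaled_forward_log_likelihood(pi, A, B, observations):
    """Return log P(O | model) using a forward pass rescaled at every time step."""
    n = len(pi)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    log_likelihood = 0.0
    for t in range(len(observations)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][observations[t]]
                     for j in range(n)]
        scale = sum(alpha)                  # depends only on t, not on the state i
        alpha = [a / scale for a in alpha]  # scaled alphas sum to 1 at every instant
        log_likelihood += math.log(scale)
    return log_likelihood


# Hypothetical 2-state, 2-symbol model
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
short_ll = scaled_forward_log_likelihood(pi, A, B, [0, 1])
long_ll = scaled_forward_log_likelihood(pi, A, B, [0, 1] * 2000)
```

For the 4000-symbol sequence the unscaled likelihood is far smaller than the smallest positive double, whereas the log-likelihood remains an ordinary finite number.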
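Read as expected counts, the re-estimation formulas of Section 3.3.2.3 are:
$$\bar{\pi}_i = \text{expected number of times in state } s_i \text{ at time } t = 1$$
$$\bar{a}_{ij} = \frac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i}$$
$$\bar{b}_j(k) = \frac{\sum_{t=1,\ o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$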

4. UML CLASS DIAGRAMS OF THE SYSTEMS
[Figure 4.1: UML diagram of Client System. Classes recoverable from the diagram:
Algorithm (interface): +execute()
PreProcessing: +doPreprocessing(), +doPCMNormalization(); -capturedSignal, -processedSignal
FeatureExtraction: +doFeatureExtraction(), +doCepstralMeanSubtraction(), +combineFeatures(),
+calculateDeltas(), +calculateEnergy(); -inputSignal, -featureVector, -fftPts, -numCepsCoeffs,
-deltaWindowSize, -numOfFilters
Client: +captureSignal(), +preprocess(), +extractFeatures(), +registerUser(), +checkUser()
MFCC: +doCMS()
SoundRecorder: +capture(), +play(); -bitRate, -samplingRate, -audioData, -status
LoginPanel: +getNextWord(); -loginFormData
TrainingPanel: +getNextSentence(); -trainingSentence, -currentUser
RegistrationPanel: +register(), +reset(); -regFormData
FeatureVector: -featureVector, -mfcc
Classes shown without members: WaveData, Windowing, Framing, SilenceRemoval, FFT, DCT,
Pre-Emphasis, Delta, Energy
All association multiplicities shown are 1.]

Figure 4.2: UML diagram of Server System
5. DATA COLLECTION AND TRAINING
For speech modeling, the HMM models were trained using 45 English words (multi-phoneme,
more than 3 phonemes), each spoken three times by 15 speakers. The trained words are listed in
Appendix B.
For speaker modeling, about 2 minutes of spoken data was collected from each of 15 speakers and
the models were trained accordingly.
The speech model for each word is pre-trained.
The speaker model is trained for a new speaker after registration, or when a user wants to re-train.

6. RESULTS
The performance of a speech recognition system is high when the words are trained by
many speakers with variations in tone, pronunciation, speed, etc. Similarly, for speaker
recognition the length of the training voice should be large enough (at least 1 minute) to model
the distinct features of the speaker during pronunciation of different phonemes. According to
Ref [3], GMM identification performance for different amounts of training data and model
orders is shown below:

We could not collect the amount of data required to train the systems fully, so the
system has been trained with the limited data available. Even so, the system performs well on
the available training and test data.
The isolated speech recognition rate is 92%.
The isolated speaker recognition rate is 78% with long test data (greater than
5 seconds).
In the combined system, the same word (the prompted word) is used as the test word for both HMM
and GMM. This word can easily be recognized by the HMM, but the test word is too short
(1-2 seconds) to be recognized reliably by the GMM. So the overall performance of the system
degrades to 66%. Longer words can be used to improve the performance of the joint system.

7. APPLICATION AREA
The voice-based remote authentication has many application domains. Using this system for
multifactor authentication provides an extra level of security.
• Web page login, as an alternative to a password
• In security-critical web apps, for multifactor authentication or re-verification
purposes
o such as password change
o while making transactions in e-banking
o toll fraud prevention
• The standalone version of the proposed system (with some modifications) can be
used for
o Access Control
o Forensics

8. CONCLUSION
The proposed system is an academic project. Various signal processing algorithms for MFCC
feature extraction were studied. Similarly, for speech and speaker modeling and classification,
various machine learning algorithms (GMM, HMM, VQ, K-means clustering) were studied
and implemented successfully. The designed system is trained with limited data. The
performance of the system can be improved by utilizing noise reduction/removal
algorithms and by training with a larger dataset. The performance of the individual systems is
good, and the overall performance of the joint system is good for long utterances.



REFERENCES
1. Assignment 3: GMM Based Speaker Identification EN2300 Speech Signal
Processing, [ www.kth.se/polopoly_fs/1.41342!assignment_03.pdf]
2. Conrad Sanderson, Automatic Person Verification Using Speech and Face
Information - A Dissertation Presented to The School of Microelectronic Engineering
Faculty of Engineering and Information Technology, Griffith University, August
2002, [revised February 2003].
3. Douglas A. Reynolds and Richard C. Rose, Robust text-independent speaker
identification using Gaussian mixture speaker models. IEEE Transactions on Speech
and Audio Processing, 3(1):72-83, 1995.
4. G. Saha, Sandipan Chakroborty, Suman Senapati, A New Silence Removal and
Endpoint Detection Algorithm for Speech and Speaker Recognition Applications,
Department of Electronics and Electrical Communication Engineering, Indian
Institute of Technology, Kharagpur, India.
5. J. P. Campbell, Jr., Speaker recognition: A tutorial. Proc. IEEE, 85(9):1437-1462,
1997.
6. K.R. AidaZade, C. Ardil and S.S. Rustamov, Investigation of Combined use of
MFCC and LPC Features in Speech Recognition Systems, World Academy of
Science, Engineering and Technology, 2006.
7. L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
8. Lasse L. Mølgaard, Kasper W. Jørgensen, Speaker Recognition: Special Course;
IMM-DTU; 2005.
9. Mohamed Faouzi BenZeghibaa, Joint Speech and Speaker Recognition, IDIAP
Research Report, 2005.
10. Robin Teo Choon Guan @ Myo Thant, Majority Rule-Based Non-Intrusive User
Authentication by Speech: Part 2 (Speaker Verification), Thesis, School of Science
and Technology, Sim University, 2009.
11. Shi-Huang Chen and Yu-Ren Luo, Speaker Verification Using MFCC and Support
Vector Machine, Proceedings of the International Multi Conference of Engineers and
Computer Scientists 2009, vol. I, IMECS 2009.
12. Tomi Kinnunen, Spectral Features for Automatic Text-Independent Speaker
Recognition, Licentiate's Thesis, University of Joensuu, Department of Computer
Science, Finland, 2003.
13. Waleed H. Abdulla and Nikola K. Kasabov, The Concepts of Hidden Markov Model
in Speech Recognition, Knowledge Engineering Lab, Department of Information
Science, University of Otago,New Zealand, 1999.


APPENDIX A: BLAZEDS CONFIGURATION FOR REMOTING
SERVICE
The Remoting service is one of the RPC services included in BlazeDS.
We need to configure remoting-config.xml for the endpoint class in Java:
<destination id="mainServer">
<properties>
<source>[myPackage].MainServer</source>
<scope>session</scope>
</properties>
</destination>
In services-config.xml, we have to change the channel-definition endpoints to use the server's
port, host and context.
<channels>
<channel-definition id="my-amf"
class="mx.messaging.channels.AMFChannel">
<endpoint
url="http://{server.name}:{server.port}/PROJECT/messagebroker/amf"
class="flex.messaging.endpoints.AMFEndpoint"/>
</channel-definition>
<!--similarly for id="my-secure-amf", id="my-polling-amf" -->
</channels>
The Flex client application uses the RemoteObject component to access the Java method.
To do so, the client application uses the destination property to specify the destination, which
is described in remoting-config.xml.
<mx:RemoteObject id="remoteObject"
destination="mainServer"
result="resultHandler(event);"
fault="faultHandler(event);"/>
Finally, we call remote methods from ActionScript as:
remoteObject.remoteMethod(params.);
And the result or error is handled by corresponding handler.
APPENDIX B: WORDS USED FOR HMM TRAINING

Acknowledgement Digital Machine
Application Electronics Management
Automatic Engineer Monopoly
Bombardment Fireproof Nightingale
Brotherhood Football Overlapping
Category Formation Parabola
Circumference Fundamental Powerful
Collection Grandfather Production
Communication Horizontal Quantity
Computer Hospital Signature
Confident Hydrogen Technology
Control Hypothesis Testimonial
Cryptography Important Trademark
Database Installment Unpredictable
Differentiation Interchange Zebra Crossing



APPENDIX C: DEVELOPMENT TOOLS AND ENVIRONMENT
Languages:
• Client side (browser app): Adobe Flex, released by Adobe Systems, for cross-platform rich
Internet applications (RIA) based on the Adobe Flash platform.
• Server and standalone apps: Java, a programming language and computing platform first
released by Sun Microsystems, now owned by Oracle.
Tools and Development Environment:
• Web server: Apache Tomcat
• BlazeDS: for remote procedure calls from Adobe Flex to Java. BlazeDS is an open source,
server-based Java remoting and web messaging technology that allows you to connect to
back-end distributed data and push data in real time to Adobe Flex and Adobe Integrated
Runtime (AIR) rich Internet applications (RIA).
• Database: MySQL, to store the speaker data
• Object I/O files: for storing models
• IDE: Eclipse with WTP (Web Tools Platform) and the Flex Builder plug-in, in which an
integrated Java and Adobe Flex project was created.
• TortoiseSVN (local): for version control
• Operating system: Windows 7
• MATLAB: for verification of algorithms and plotting of graphs and outputs
• OC Volume (Orange Cow Corporation): for implementing algorithms such as
FFT, VQ, HMM training and evaluation
APPENDIX D: SNAPSHOTS OF OUTPUT GUI

Login Form:

Speaker Training:

Display after successful login:



Speech Training GUI:




Register Form:


Project Members:

Ganesh Tiwari
Permanent Address: Mangalpur VDC-3, North Anandpur, Chitwan
Email: hsenag_tiwari@yahoo.com; 063bct510@ioe.edu.np
Contact Number: +977-9841020107
Blog: www.ganeshtiwari.com.np; www.ganeshtiwaridotcomdotnp.blogspot.com
Madhav Pandey
Permanent Address: Siddharthnagar Municipality-7, Himalipath, Rupendehi
Email: pandey.madhav@yahoo.com; 063bct514@ioe.edu.np
Contact Number: +977-9841639322
Manoj Shrestha
Permanent Address: Madhyapur Thimi Municipality-12, Shivatole, Bhaktapur
Email: frenmanoj@yahoo.com; 063bct518@ioe.edu.np
Contact Number: +977-9803598691