
Speaker Independent Speech Recognition System

Session 2007-2011

Submitted By:

Zeeshan Nadir 2007-Elect-237

Muhammad Salman 09M/07-Elect-259

Supervised By: Prof. Dr. Haroon A. Babri

Department of Electrical Engineering

University of Engineering and Technology Lahore


Speaker Independent Speech Recognition System

Submitted to the faculty of the Electrical Engineering Department of the


University of Engineering and Technology Lahore
in partial fulfillment of the requirements for the Degree of

Bachelor of Science
in
Electrical Engineering

Internal Examiner External Examiner

Director
Undergraduate Studies

Department of Electrical Engineering


University of Engineering and Technology Lahore

Declaration
I declare that the work contained in this thesis is my own, except where explicitly stated
otherwise. In addition, this work has not been submitted to obtain another degree or
qualification.

Signed: ________________________________

Supervisor: _____________________________

Dated: _________________________________

Acknowledgements

First of all, we would like to thank Allah Almighty, who gave us the courage and strength to
complete this thesis and project on time. We also want to extend our thanks to
Prophet Muhammad (SAW) and his progeny, through whom we received the light of
knowledge. We would like to thank our advisor, Prof. Dr. Haroon Babri, who guided us
through all the problems that we faced while making this project. We want to show our
gratitude to Mr. Shahid Awan, who was a great source of help, guidance and inspiration
for us. Moreover, we are also thankful to Mr. Ehsan Mohsin for sharing his experience
with us and guiding us. Last but not least, we are highly thankful to our families, who have
shown great patience by allowing us to work and stay away from them for many months.

Zeeshan Nadir

Muhammad Salman

This thesis and project is dedicated to our beloved parents and teachers.

Contents
Chapter 1...............................................................................................................................................1
Introduction...........................................................................................................................................1
1.1 Introduction...............................................................................................................................1
1.1.1 Overview of the Project.....................................................................................................3
1.2 Motivation...................................................................................................................................4
1.3 Thesis Organization...................................................................................................................5
Chapter 2...............................................................................................................................................7
Speech Recognition System..................................................................................................................7
2.1 Speech Recognition- Definition and Issues................................................................................7
2.2 Existing Systems........................................................................................................................8
2.3 Problem Statement.....................................................................................................................9
2.4 Objectives of the Project............................................................................................................9
2.5 General Design of System.........................................................................................................9
2.6 Description of Project..............................................................................................................11
Chapter 3.............................................................................................................................................15
Speech Recognition Techniques..........................................................................................................15
3.1 Characteristics of Speech Recognition Systems.......................................................................15
3.2 Techniques in Speech Recognition............................................................................................16
3.2.1 Zero-Crossing and Energy-based Speech recognition.......................................................16
3.2.2 Feature Dependent Speech Recognition System...............................................................17
3.2.3 Template-Based Speech Recognition................................................................................17
3.2.4 Stochastic Speech Recognition Systems...........................................................................18
3.2.5 Knowledge Based Approach.............................................................................................19
3.2.6 Acoustic-Phonetic Approach to Speech Recognition..........................................................19
3.2.7 Artificial Intelligence (AI) Approach to Speech Recognition:............................................20
3.3 Applications of Speech Recognition........................................................................................23
3.3.1 Health Care.......................................................................................................................23
3.3.2 High Performance Fighter Aircraft...................................................................................23
3.3.3 Helicopters..........................................................................................................................24
3.3.4 Battle Management...........................................................................................................24
3.3.5 Telephony and Personal Workstations..............................................................................24
3.3.6 Training Air Traffic Controller....................................................................................24
3.3.7 People with Disabilities...................................................................................................25

vii
3.3.8 Interactive Voice Response (IVR)....................................................................................25
3.3.9 Machine translation...........................................................................................................25
Chapter 4.............................................................................................................................................26
Knowledge Models..............................................................................................................................26
4.1 Acoustic Model........................................................................................................................26
4.1.1 Word model......................................................................................................................26
4.1.2 Phone Model.....................................................................................................................27
4.2 Language Model......................................................................................................................28
4.4 Implementation........................................................................................................................29
Chapter 5.............................................................................................................................................30
Hidden Markov Model........................................................................................................................30
5.1 Signal Models............................................................................................................................30
5.2 Discrete-Time Markov Model.................................................................................................30
5.3 Hidden Markov Model.............................................................................................................31
5.3.1 The Urn and Ball Example................................................................................................32
5.3.2 Types of Hidden Markov Models.....................................................................................33
5.3.3 Elements of Hidden Markov Models................................................................................35
5.4 The Three Basic Problems for HMM.......................................................................................36
5.5 Solutions to the Three Problems of HMM...............................................................38
5.4.1 Solution to Problem 1 (The Evaluation Problem).............................................................38
5.4.2 Solution to Problem 2 ( Decoding Problem).....................................................................46
5.4.3 Solution to Problem 3 (Learning Problem).......................................................................50
Chapter 6.............................................................................................................................................53
Working of the Project........................................................................................................................53
6.1 System Level Block Diagram..................................................................................................53
6.2 Data Acquisition Block............................................................................................................56
6.3 Signal Processing Block..........................................................................................................56
6.3.1 Pre-emphasis Filter...........................................................................................................56
6.3.2 Voice Activation Detection/ End Point Detection.............................................................57
6.3.3 Frame Blocking and Windowing......................................................................................60
6.4 Feature Extraction Block...........................................................................................................62
6.5 Training and Recognition Mode..............................................................................................67
6.5.4 Vector Quantization..........................................................................................................67
6.6 Hardware Block.......................................................................................................................71
Chapter 7.......................................................................................................................72
The Hardware of the Project................................................................................................................72

viii
7.1 Serial Port................................................................................................................................72
7.2 Max 232 IC..............................................................................................................................73
7.3 Microcontroller........................................................................................................................73
7.4 Remote Control........................................................................................................................73
7.5 Wireless Car............................................................................................................................73
8.1 Results.....................................................................................................................................74
8.2 Snapshots of the GUI...............................................................................................................75
8.3 Conclusion...............................................................................................................................77
8.4 Future Work.............................................................................................................................78
Appendix A.........................................................................................................................................79
Vector Quantization in MATLAB.......................................................................................................79
References...................................................................................................................................83

List of Figures

List of Tables

Abbreviations, Definitions, Acronyms

ASR Automatic Speech Recognition

HMM Hidden Markov Model

ANN Artificial Neural Network

VAD Voice Activation Detection

AI Artificial Intelligence

IDCT Inverse Discrete Cosine Transform

MFCC Mel Frequency Cepstrum Coefficients

LPC Linear Predictive Coding

FFT Fast Fourier Transform

DFT Discrete Fourier Transform

LM Language Model

IVR Interactive Voice Response

MT Machine Translation

ML Machine Learning

AFTI Advanced Fighter Technology Integration

Abstract

Automatic Speech Recognition has been a goal of research for more than four decades and has
inspired many science fiction wonders. Automatic Speech Recognition has been implemented using
many different techniques, such as Neural Networks, the Acoustic-Phonetic Approach, Dynamic
Time Warping and Hidden Markov Models. Among all these techniques, Hidden Markov Models
have been the most popular because of their well suited implementation and their ability to model
speech in a realistic way. We have followed the HMM technique because of its accurate modeling of
speech signals and its better accuracy. Both the Hidden Markov Model and the Vector Quantizer
need to be trained for the vocabulary being recognized. Such training results in a distinct Hidden
Markov Model for each word of the vocabulary. Recognition consists of computing the probability
of generating the input test word with each of the Hidden Markov Models that have been created,
and selecting the one that gives the maximum probability. Our project focuses on speech recognition
using HMM techniques that provide speaker independence, do not depend upon the vocabulary
used, provide run-time training and recognition options, the ability to cope with noise using the
concept of voice activation detection, and the ability to update the vocabulary and change the words.
All these features have been included in a very handy and easy to use graphical user interface that
provides the user with the utmost ease. The basic aim was to control a robot in any kind of
environment with speech commands. We have been able to control a small wireless car with speech
commands through serial port interfacing, with the help of the PC's serial port and a microcontroller.

The project has many basic building blocks, both in hardware and software. This thesis describes
all of these in a comprehensive way. The modules employed are the Sound Recorder,
Filtering and Windowing, Feature Extraction, HMM Training, the HMM Recognizer, Serial Port
Interfacing, the Microcontroller, and the Robot. The results of the experiments are given at the
end.

© Zeeshan Nadir

© Muhammad Salman

Printed in Pakistan

Contact Information
Zeeshan Nadir
Department of Electrical Engineering
University of Engineering and Technology Lahore
Pakistan
Ph: +92-346-548-6147

E-mail: znadir@purdue.edu
zeeshan_708@hotmail.com

Muhammad Salman
Department of Electrical Engineering
University of Engineering and Technology Lahore
Pakistan
Ph: +92-323-880-4001
E-mail: salmanakram88@gmail.com

Chapter 1
Introduction
1.1 Introduction
Being able to talk to a machine has always been a fantasy for people. Being able to speak to
your computer and have it recognize and understand what you say would provide a natural and
comfortable form of communication. Such systems also help users interact with
workstations and desktop computers by reducing the amount of typing they have to do.

Generally, a speech recognition system is a system that can recognize what a person is saying and
can then make a decision and act on it. For example, it can be used in the cockpit of an airplane to
do critical tasks while the hands of the pilot are busy; it can similarly be used in a car to play the
DVD and switch on the AC; and it can be used in a cell phone to dial numbers. Another aspect of
speech recognition is to facilitate people who have some kind of disability. For them, speech
recognizers would be of great help, because they would make daily chores a lot easier to do,
controlling domestic appliances like a juicer, a coffee machine, a TV, an air conditioner, etc. Thus
the applications are countless, and one can make full use of such systems in a variety of scenarios.

This thesis particularly focuses on how to control a robot accurately by speech commands that do
not depend upon the speaker, that allow the speaker to use any language of his choice, and that
provide the user with the option of training at run time if he feels that the results are not up to the
mark or that the system is making mistakes in recognizing his speech commands. We are controlling
a car with the help of speech commands; it moves as we keep giving it the commands that are
prefixed in our database. Algorithms of many different kinds have been implemented in MATLAB
to realize the different modules of the project. We have made the implementation easy to use and
have developed it in such a way that it can easily be extended with new features. We can also
implement it on some processor to move it entirely to hardware.

The thesis also covers the hardware portion of the project. Although our prime purpose was to come
up with a robust speech recognition system that can cope with noisy environments and provides the
user with a vast number of options, we were able to finish well within time and include hardware to
provide a demonstration, showing its suitability primarily for home based applications, and for
environments where the hands and eyes of the users are busy and they can just use speech to do
things, as in industry.

One of the most difficult aspects of speech recognition is its interdisciplinary nature, and the
tendency of most researchers to forget this fact and apply a monolithic approach to individual
problems. Consider the disciplines that have been applied to one or more speech recognition
problems.

1. Signal Processing

Signal Processing is the process of extracting the important information from the speech signal.
For speech recognition, we have to extract the information of interest from the speech signal.
Such information is often called the features of the speech signal. We have to use signal
processing techniques to extract these features from the signal in an efficient and robust
manner, so that they can withstand noise and environmental changes. Moreover, we need a lot
of pre-processing and post-processing of speech signals. We also need to implement filters and
characterize the time varying properties of the speech signal by extracting its spectral properties
with signal processing techniques. Thus a lot of signal processing is used in speech recognition.

2. Physics (Acoustics)

“Acoustics is the science of understanding the relationship between the physical speech signal
and the physiological mechanisms (the human vocal tract mechanism) that produced the speech
and with which the speech is perceived (the human hearing mechanism)” [5].

3. Pattern Recognition

Pattern Recognition consists of the set of algorithms used to cluster data to create one or more
prototype patterns of a data ensemble, and to match (compare) a pair of patterns on the basis of the
feature measurements of the patterns [5].

4. Communication and Information Theory

“Communication and information theory provides the procedures for estimating parameters of
statistical models; the methods for detecting the presence of particular speech patterns; and the set
of modern coding and decoding algorithms (including dynamic programming, stack algorithms, and
Viterbi decoding) used to search a large but finite grid for the best recognition sequence of words” [5].

5. Linguistics

Linguistics is the study of human language. It covers language form, language meaning, and
language in context. Language form is concerned with the structure of language: it focuses on
the rules of the language, its morphology, and its syntax. In the case of language meaning, we
are concerned with how humans employ logical structures to convey meaning, assign meanings
to their sentences, and remove confusion and ambiguity. Finally, linguistics includes evolutionary
linguistics, which considers the history and origin of language; it also assesses the social aspects
of language and the change it has undergone over the years. In speech recognition, we are
concerned with the relationship between the sounds and the words of the language. We also have
to take care of the grammar and context of the language.

6. Physiology

Physiology is the science that is concerned with the functions of living things. Human Physiology
is concerned with the mechanical and physical functions of human in good health. It is concerned
with the understanding of the higher order mechanism within the human central nervous system.
We shall be concerned with the human physiology that accounts for the speech production and
perception in human beings [5].

7. Computer Science

Computer science is the science of computation. It includes the study of practical techniques for
developing the different algorithms that are used in practical speech recognition systems.

8. Psychology

Psychology is the science of mental processes in human beings. It is concerned with
understanding the general principles that govern the human psyche, and with understanding the
factors that enable a technology to be used by human beings in practical tasks.

To accomplish such a project with good accuracy, a researcher needs a good grip on all these
subjects, a range which is quite large for one person to master sufficiently. Thus, a researcher
who wants to analyze a series of different problems should have a good understanding of all of
them.

1.1.1 Overview of the Project


Before diving into the details and motivation of the project, we give a quick look at a very
general speech recognition system, so that we can start reading the thesis with an abstract idea
already in mind.

[Figure 1.1: Basic System Architecture of Speech Recognition. The speech capture device feeds a
DSP module; the preprocessed signal is stored, and a pattern matching stage compares it against
reference speech patterns, using acoustic models and language models, to produce the recognized
words.]

This part has focused only on the speech recognition part. The applications can then be attached
after the decision has been made using any kind of technique. The diagram is drawn in a
generalized sense. In our case, once the word is recognized, we utilize the decision to control the
wireless car through its wireless remote control via the serial port of the computer. The serial port
gets the decisions from MATLAB, which sends this data to a microcontroller interfaced to the same
serial port via the RS-232 protocol. This data is then analyzed and the corresponding key of the
wireless remote control is activated. This signal travels through the wireless link, and hence the car
moves according to the spoken direction.

1.2 Motivation
Many people are working on speech recognition systems, and much work has already been done on
them. After 1930, research on speech recognition started at a good pace. People came up with
speech recognizers that primarily focus on one thing. For example, some speech recognizers allow
you to speak continuously and have your speech recognized, but with a high error rate; some give
good accuracy but depend upon the speaker; some use very complex algorithms for pattern
matching, whereas others use a very novel method for vector quantization. Our basic motivation
was to provide a very simple to use Speaker Independent Speech Recognizer that allows the user to
train it at run time, have his words recognized in any language he wants, modify the codebook, plot
the results, see the power of the speech signal, and analyze the noise conditions of the environment.
Thus our focus was to develop a very easy to use speech recognizer, with a graphical user interface
and navigation buttons that allow the user to perform all the tasks with simple mouse clicks.
Following is a high-level abstract diagram of our project.

[Figure 1.2: Abstract Diagram of the Project. The user interacts with a graphical user interface
driving software in MATLAB, which communicates through the serial port with the hardware.]

1.3 Thesis Organization


The thesis is divided into different chapters, treating each part of the project separately in its own
chapter.

Chapter 2 analyzes the problem statement, then gives the system design and a description of the
project. The objectives of the project are discussed, along with other possible applications that may
be developed.

Chapter 3 addresses the different speech recognition techniques that are used for developing
speech recognition systems. The techniques discussed are zero crossing and energy based
speech recognition systems, feature dependent speech recognition systems, template based speech
recognition systems, knowledge based speech recognition, stochastic speech recognition systems,
connectionist speech recognition systems, and dynamic time warping based speech recognition
systems.

Chapter 4 explains different kinds of knowledge models that are used.

Chapter 5 discusses Hidden Markov Models and their use in speech recognition, with regard to our
project. First, the HMM is explained without reference to any specific application and the three
basic problems of HMM are addressed. Then it is discussed with respect to our implementation.

Chapter 6 explains the working of the project. It gives complete detail, right from the process of
speech recording to the recognized word and the movement of the robot (car). Both the software
and the hardware parts are explained.

Chapter 7 explains the hardware portion, the robot being used, and the technique to control it.

Chapter 8 finally shows the results, analysis and conclusion.

The Appendix discusses Vector Quantization in MATLAB using VQDTOOL.

Chapter 2

Speech Recognition System

In the last two decades, attempts have been made to automate the recognition of human speech. The
term "Speech Recognition" is one that covers many different approaches to the problem of
recognizing human speech. It ranges from isolated word speech recognition to continuous speech
recognition, from speaker-dependent recognition to speaker-independent recognition, and from a
small vocabulary to a large vocabulary. The simplest scenario is speaker-dependent, isolated word
recognition on a small vocabulary and the most complex is speaker-independent, continuous speech
recognition on a large vocabulary. In any case, the speech recognition problem, as it has developed
over the years, is a highly computation intensive problem: it requires fast processors and a large
amount of memory, while still providing the user with some special capability and promising to be
usable in real applications. This project has focused on providing the user with a speech recognition
system with an easy to use graphical interface and a fully developed application.

2.1 Speech Recognition- Definition and Issues


In speech recognition today there is no well known software that provides the user with very
simple navigation, an easy to use interface, and language independent and speaker independent
recognition capability. The general problem of automatic speech recognition, for any speaker in any
environment, is still far from solved. But recent years have seen ASR technology mature to the
point where it is viable in certain limited domains. One major application area is home appliances
and home based robots, which typically do not need much vocabulary and have prefixed commands
to operate them. They need isolated word recognition capability to perform different kinds of tasks.
While many tasks are better solved with visual or pointing interfaces, speech has the potential to be
a better interface than the keyboard for tasks where full natural language communication is useful,
or for which keypads are not appropriate.

Speech recognition refers to the ability to listen (input in audio format from some input device,
like a microphone) to spoken words, identify the various sounds present in them, and recognize
them as words of some known language. Speech recognition in the computer system domain may
then be defined as the ability of computer systems to accept spoken words in an audio format such
as wav, generate their content in text format, and be able to use it in different applications. Speech
recognition in computers involves various steps. The major steps required to make computers
perform speech recognition are: Voice Recording, Word Boundary Detection, Feature Extraction,
Vector Quantization, and Recognition using knowledge models.

Word boundary detection is the process of identifying the start and the end of a spoken word in the
given sound signal. At times it becomes difficult to find the word boundary because of variations in
the accents of people, the environment in which they might be speaking, and the duration of the
pauses they give between words while speaking. Reference [4] gives an algorithm for determining
the endpoints of isolated utterances.
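
As a rough illustration of this kind of endpoint detection (a simplified sketch, not the exact
algorithm of [4]), the following MATLAB fragment frames the signal, computes the short-time
energy, and takes the first and last frames whose energy exceeds a simple relative threshold; the
frame length and the 10% threshold are illustrative assumptions.

    % Rough energy-based endpoint detection sketch; frameLen and the 10%
    % threshold are illustrative assumptions, not the tuned values of [4].
    function [s1, s2] = endpoints(x, frameLen)
        nF = floor(numel(x) / frameLen);           % number of complete frames
        E  = zeros(nF, 1);
        for k = 1:nF
            seg  = x((k-1)*frameLen+1 : k*frameLen);
            E(k) = sum(seg.^2);                    % short-time energy of frame k
        end
        active = find(E > 0.1*max(E));             % frames above the threshold
        s1 = (active(1)   - 1)*frameLen + 1;       % first sample of the word
        s2 =  active(end)     *frameLen;           % last sample of the word
    end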

Feature Extraction refers to the process of converting the sound signal into a form suitable for the
following stages to use. Feature extraction may include extracting parameters such as the amplitude
of the signal, the energy of frequencies, etc.

Vector Quantization, like any other kind of quantization, is the process of quantizing vectors,
motivated by the constraints on the amount of memory we have. Although this process discards
some amount of information, the performance penalty is not very high; rather, the benefits that we
get from the limited memory, i.e. more speed and less space, far outweigh the disadvantages, if any.
Vector Quantization is a very efficient representation. If we use Vector Quantization in our speech
recognition system, we need less processing and enjoy more speed. In the process of vector
quantization, we choose a number of different vectors that are good representatives of the spectral
vectors that we are getting from the speech frames. Once these vectors are quantized, the incoming
vectors are compared against them: the distance between the vectors is calculated, and each vector
is mapped to the codeword that is nearest to it according to the measured distances. Thus a
codebook is generated and kept that contains the representative vectors. We only need to look up
this codebook to map a spectral vector.
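
A minimal MATLAB sketch of this codebook lookup follows; the names are illustrative, not the
project's actual code. Each feature vector (a row of X) is mapped to the index of the nearest
codeword (a row of the codebook) by Euclidean distance.

    % Map each feature vector (row of X) to the nearest codeword (sketch).
    function idx = vqMap(X, codebook)
        T   = size(X, 1);
        idx = zeros(T, 1);
        for t = 1:T
            diffs = codebook - repmat(X(t,:), size(codebook,1), 1);
            d     = sum(diffs.^2, 2);          % squared Euclidean distances
            [~, idx(t)] = min(d);              % index of the closest codeword
        end
    end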

Recognition involves mapping the given input (in the form of various features) to one of the known
sounds. This may involve the use of various knowledge models for precise identification and
ambiguity removal.

The recognition system takes help from knowledge models, which include models such as the
phone acoustic model and language models. We have to train the system to generate these models.
For the purpose of training, one needs to show the system a set of inputs and the outputs they
should map to. This is referred to as supervised learning [17].

2.2 Existing Systems


Although some promising solutions are available for speech synthesis and recognition, most of
them are tuned to a specific language or to specific constraints with few options. Most of them
require a lot of configuration before they can be used. There are also projects which have tried to
develop speech recognizers for the Urdu language. ISIP [8] and Sphinx [7] are two very well known
open source speech recognition packages. [9] gives a comparison of public domain software tools
for speech recognition. Some commercial software, like IBM's ViaVoice, is of a similar kind.

2.3 Problem Statement


The aim of this project is to build a speech recognition system that is speaker independent. It should
provide a good level of accuracy for a small vocabulary. It should allow the user to analyze the
feature vectors and speech signals with the help of graphs. The speech recognizer can then be used
in home based appliances or in an environment where the hands of the user are busy. As a
demonstration of a practical application of our project, we have controlled a wireless remote control
car with the help of speech commands.

2.4 Objectives of the Project


The main objectives of this project are as follows:

 To develop and test a speaker independent speech recognition system via Hidden Markov
Modeling.
 To evaluate the performance of the system in terms of time, accuracy and storage.
 To provide the user with an easy to use graphical interface so that he can easily navigate
between different menus.
 To provide the user with the ability to plot the graphs to see the feature vectors, Mel filters
etc.
 To demonstrate that such speech recognizers can be integrated into home based appliances
like juicers, TVs, and DVD players with the help of digital signal processors.

2.5 General Design of System


If visualized as a block diagram, the prepared system has the following components: a sound
recording and word detection component, a feature extraction component, a speech recognition
component, and the acoustic and language models.

[Figure 2.1: Block Diagram of the Recognition System. Sound input enters the sound recorder; the
word detector passes the detected word to feature extraction, whose features go to the recognition
component; the recognition component consults the acoustic model and the language model and
sends its output to the serial port.]

Sound Recording: This component is responsible for taking input from the microphone.

Word Detection Component: This consists of identifying the presence of words. Word detection is
done using the energy and zero crossing rate of the signal. We can use a wave file, or directly feed
the output of sound recording and word detection to the feature extractor. We use a power measure
of the signal to detect the end points of a word.

Feature Extraction Component: This component generates feature vectors from the signals given to
it. In our case, it generates Mel Frequency Cepstrum Coefficients as the features. However, it can
also extract other features of the speech signal, like LPC features. These features uniquely identify
the given sound signal. This module is discussed in detail in Chapter 6.
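
As a sketch of what this component computes for a single frame, consider the following MATLAB
fragment; melFilterbank is a hypothetical helper returning triangular mel filter weights, and the
sizes are illustrative, not the project's exact code.

    % Minimal MFCC sketch for one frame; melFilterbank is a hypothetical helper
    % returning a numFilt x (N/2+1) matrix of triangular mel filter weights.
    function c = frameMFCC(frame, fs, numFilt, numCoef)
        N = numel(frame);
        w = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));       % Hamming window
        P = abs(fft(frame(:).*w)).^2;                   % power spectrum
        P = P(1:floor(N/2)+1);                          % non-negative frequencies
        e = log(melFilterbank(numFilt, N, fs)*P + eps); % log mel filter energies
        n = (0:numFilt-1)'; k = 0:numCoef-1;
        D = cos(pi*(n + 0.5)*k/numFilt);                % DCT-II basis
        c = (e.'*D).';                                  % first numCoef coefficients
    end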

Recognition Component: This is a Hidden Markov Model based recognizer that has a model for
each word in the vocabulary. It is the most important component of the system and is responsible
for finding the best match in the knowledge base for the incoming feature vectors.

Knowledge model: This component consists of a word based acoustic model. The acoustic model is
concerned with how a word sounds. The recognition system makes use of this model while
recognizing the sound signal. We have used a word based model for each word, and the model is a
Hidden Markov Model.

[Figure 2.2: Block Diagram of the Training System. Microphone input feeds the sound recorder; the
word detector passes the word to feature extraction, and the recognition component is trained
against the vocabulary using the acoustic model and the language model.]

2.6 Description of Project


As already mentioned in the introduction, the project is divided into many small modules. The
modules consist of speech recording, signal processing (windowing and framing), feature
extraction, vector quantization, training, recognition, the serial port interface, the microcontroller,
and the wireless car. Thus, along with the speech recognizer, this car application has been included
to come up with an implemented application that can inspire others to develop and integrate such
speech recognizers in home based appliances and applications.

The technique that has been used for speech recognition in this project is Hidden Markov
Modeling. Since the first order hidden Markov model (HMM) has been a tremendously successful,
mathematically established paradigm, which makes it the state of the art in current speech
recognition systems, this thesis bases all its studies and experiments on the HMM. The HMM is a
statistical framework that supports both acoustic and temporal modeling. It is widely used despite
making a number of suboptimal modeling assumptions, which limit its full potential. We
investigate how the model design strategy and the algorithms can be adapted to HMMs. Large
suites of experimental results are presented to show the relative effectiveness of each component.
The HMM is used for recognition of a word irrespective of the speaker. We will use this type of
modeling here.

The project basically has two phases: first the training phase, and then the recognition and
application phase.

In the training phase, we have to create a Hidden Markov Model for each word in the vocabulary.
A Hidden Markov Model basically consists of three parameters, called A, B and Pi.

Although the Hidden Markov Model is not the subject of this chapter, the basic terms used in this
section are explained below.

A and B are matrices. A is called the state transition matrix and B is called the confusion matrix
(although its name might be confusing, it is simple to understand). Pi is a vector. The sizes of
these matrices are as follows.

Size of A = n x n;

Size of B = n x m;

Size of Pi = 1 x n

Where:

n is the number of states in the Hidden Markov Model of each word

m is the possible number of observation symbols for each word

For training, first of all we assume initial values for A, B and Pi. This initial selection has no
specific rules as such. Both the Hidden Markov Model and the Vector Quantizer need to be
trained for the vocabulary being recognized. Such training results in a distinct Hidden Markov
Model for each word of the vocabulary. Recognition consists of computing the probability of
generating the test word with each word model, and choosing the word model that gives the
highest probability. This is shown in the coming diagram.

MFCC vectors are extracted using the feature extraction technique for each word, and then a
codebook is made (not shown in the following diagram), which is used to create a Hidden
Markov Model for each word of the vocabulary. Once a model has been created for each word,
a word is spoken, its MFCC feature vectors are extracted, and a symbol sequence is found using
the codebook. Then the probability of producing this sequence with each of the Hidden Markov
Models is computed. The word whose model gives the maximum probability is selected as the
recognized word. The following diagram explains this.

Figure 2.3: HMM model and probability computation [10]

For each word, its corresponding Hidden Markov Model is stored. Apart from that, a common
codebook is also stored, which contains the quantized spectral vectors for the different words. After
the word has been recognized, the output is displayed on the screen. All this is done with the help
of the graphical user interface. Once the word is recognized, then besides displaying the word on
the screen, a corresponding movement command is given to the robot, since the words used in our
project are Move, Stop, Left, Right and Reverse. Thus the car moves or stops according to the
issued speech command. For this purpose, the serial port interface is used.

After recognizing the word, MATLAB issues a corresponding command to the serial port. This is
received by the microcontroller, which then activates the corresponding button on the wireless
remote control of the car. The information travels over the wireless link and reaches the car, which
ultimately moves or stops the way the user wants. The following diagram shows this scenario.

[Figure 2.4: Description of the hardware portion of the project. MATLAB drives the serial port,
which feeds the microcontroller; the microcontroller actuates the remote control, which commands
the wireless car.]

In the next chapter, speech recognition techniques are analyzed and their pros and cons discussed.
In Chapter 5 we focus on the technique that we have used in our project, which is Hidden Markov
Modeling.

Chapter 3

Speech Recognition Techniques


The first step in solving the speech recognition problem is to understand its complexity. There are
four basic components that should be considered in understanding the operation of a speech
recognition system. First, the recognition system must have some form of encoding for representing
the set of utterances that it will recognize. Second, there must be some means of acquiring and
storing these representations for the words of the vocabulary. Third, during the recognition phase
there must be some type of pattern matching algorithm that compares the representation of a
particular input utterance with the representations present in the vocabulary. Finally, there must be
some kind of user interface to the functions and operation of the speech recognition system, and it
should be easy to use, so that anyone can use it [18].

3.1 Characteristics of Speech Recognition Systems


There are a number of ways to classify speech recognition systems. A very important classification is
whether the system is speaker dependent or speaker independent [6]. Speaker-dependent systems
recognize a particular person's utterances only if that person has previously stored examples of his or
her speech in the system. Speaker-independent systems recognize speech without prior experience
with a particular person.

There are different kinds of speech recognizers that accept discrete, connected, or continuous
speech. Isolated word recognizers recognize words while the speaker has to pause between the
different words he or she speaks. This is adequate for many applications, but it is not a natural
way of communicating. Systems that recognize connected speech are able to accept a sequence of
concatenated words; some words are separated by pauses while others are connected [18]. Thus, in
this case, the task of spotting the beginning and ending of words becomes important for the
recognizer. We solve this task with the Voice Activation Detection algorithm devised in [13].
Finally, in continuous speech recognition, the speaker talks naturally and the system recognizes
strings of words. This kind of system is more complex than all the previous ones because here we
have to model the grammar of a natural language. These rules or models usually render the system
slower than its discrete counterpart [8]. This is shown in the following diagram.

[Figure 3.1: Block Diagram for Continuous Speech Recognizer [10]. Feature analysis feeds a unit
matching system, followed by lexical decoding, syntactic analysis and semantic analysis; these
stages draw on an inventory of speech recognition units, a word dictionary (in terms of the chosen
units), a grammar, and a task model.]

Finally, systems can differ in the size of the vocabulary. Some systems can recognize only a small
set of words [9] (as few as ten digits), while others can recognize larger sets of words, even tens of
thousands [6].

3.2 Techniques in Speech Recognition

There are many different methods of speech recognition. Zero crossing, energy measurement, and
feature extraction algorithms are based on the acoustic phonetic approach. Template matching
algorithms are considered part of the pattern recognition approach, while algorithms that depend on
knowledge sources, the stochasticity of speech signals, and neural networks are based on the
artificial intelligence approach [18]. An important class of stochastic modeling is modeling using
Hidden Markov Models in conjunction with the Viterbi algorithm. Among these, the most popular
and accurate algorithm is template based Dynamic Time Warping [5].

3.2.1 Zero-Crossing and Energy-based Speech recognition

Zero crossing measurement and energy measurement based speech recognition systems require
little computation. Speech signals are divided into frames and then parameters, or features, are
extracted from these frames. These features can be the average zero crossing rate, the density of
zero crossings within frames, the excess threshold duration, the standard deviation and mean of the
zero crossings within frames, and energy estimates for each frame. These parameters are then
compared with fixed thresholds to determine the spoken word. For example, we can separate
different digits according to their zero crossing rates. Choki Ki Chan developed a system based on
zero crossing measurement. It was a speaker dependent, isolated Cantonese digit and word, limited
vocabulary speech recognition (SR) system, developed and implemented on a PC-386 with a
recognition accuracy of 97.2%. It worked reasonably well for isolated digits, but when tested on
isolated words the accuracy dropped to 76% [18].

3.2.2 Feature Dependent Speech Recognition System

Feature-dependent speech systems work on the principles of human speech perception [18]. Some
of the features that these kinds of speech recognizers use are: the frequency locations of the first
few formants, the maximum and minimum frequencies of the first few formants, the duration of
aperiodic energy, formant transitions, and the ratio of high frequency energy to low frequency
energy [18]. “To obtain the above sets of features, parameters such as the spectrum, pitch, zero
crossings, total energy, energy in low, mid, and high frequency bands are produced using signal
processing routines” [18]. Clustering algorithms are used here to group letters into clusters. An
example of such a system is FEATURE, a speaker independent isolated letter recognition system
[18]. This system recognizes an 80 letter vocabulary, generated by 20 different speakers. The
recognition accuracy was 85% in speaker independent mode, while it was 91% when operated in
dynamic adaptation mode [18].

3.2.3 Template-Based Speech Recognition

In this kind of speech recognition, templates of all the words of the vocabulary are made, which can
later be used for comparisons. This database is generated during the training mode. During
recognition, the input speech signal is compared with each of the templates, and the template that
best matches the input speech signal is selected. Since the rate of human speech production varies
considerably, we may need to stretch or compress the time axis between the incoming speech and
the reference template. This can be done efficiently using Dynamic Time Warping (DTW). For this
purpose we may divide the speech signals into different frames and then compare the corresponding
frames. Frame distances between the processed speech frames and those of the reference templates
are summed to provide an overall distance measure of similarity. But instead of taking frames that
correspond exactly in time, a time “warp” of the utterance is done (scaling its length) so that similar
frames in the utterance line up better against the reference frames [11]. The warp is found by
dynamic programming procedures such that the sum of frame distances in the template comparison
is minimized. The distance produced by this warp is chosen as the similarity measure. In the
illustration here, time is on the x axis and the speech frames that make up the test and reference
templates are shown as scalar amplitude values on the y axis. “In practice, they are
multidimensional vectors, and the distance between them is usually taken as the Euclidean
distance” [11].

Figure 3.2: Matching with Dynamic Time Warping [11]

The above figure shows how warping one of the templates improves the match between them. (For
further information, see chapter 10 of O’Shaughnessy [12]).

“In a few algorithms, like Vector Quantization (VQ), it is not necessary to vary the time axis for
each word, even if two words have different utterance lengths. This is performed by splitting the
utterance into several different sections and coding each of the sections separately to generate a
template for the word. Each word has its own template, and therefore this method becomes
impractical as the vocabulary size is increased (> 500 words)” [18].
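
Returning to the warping computation itself, here is a compact MATLAB sketch of the DTW
recursion described above; R and T hold the reference and test templates with one feature vector
per row, and the frame distance is Euclidean. It is a minimal illustration, not a production matcher.

    % Accumulated DTW distance between two frame sequences (sketch).
    function D = dtwDist(R, T)
        nR = size(R, 1);  nT = size(T, 1);
        D  = inf(nR+1, nT+1);  D(1,1) = 0;         % dynamic-programming grid
        for i = 1:nR
            for j = 1:nT
                d = sqrt(sum((R(i,:) - T(j,:)).^2));          % frame distance
                D(i+1, j+1) = d + min([D(i, j+1), D(i+1, j), D(i, j)]);
            end
        end
        D = D(end, end);                           % distance of the best warp
    end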

3.2.4 Stochastic Speech Recognition Systems

Stochastic modeling is another technique for building speech recognition systems. Probabilistic
models of speech are used in this approach to deal with incomplete information or uncertainty. The
most widely used model is the Hidden Markov Model. For example, in our case we use a separate
Hidden Markov Model for each word. The HMM uses states that model generic speech sounds, and
transitions between states with associated transition probabilities, to model the temporal behavior
of speech [18]. The HMM assumes that the speech was produced by a hidden Markov process. To
derive the transition probabilities and find the solutions to the problems of the HMM, we use the
forward-backward algorithm. “Though the HMM approach can give substantially accurate results,
if the time factor is taken into consideration, then algorithms based on template matching using
Vector Quantization or DTW prove to be much faster than the ones based on HMM” [18]. We have
used this type of modeling (HMM) in our speech recognition system.
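
As a concrete sketch of the evaluation side (problem 1 of the HMM), the scaled forward algorithm
below computes the log-likelihood of a discrete observation sequence O (codebook indices) under
a model with transition matrix A (N x N), confusion matrix B (N x M) and initial vector Pi (1 x N).
This is a textbook formulation in MATLAB, not the project's exact code.

    % Scaled forward algorithm: log P(O | A, B, Pi) for a discrete HMM (sketch).
    function logP = forwardLogProb(A, B, Pi, O)
        alpha = Pi .* B(:, O(1)).';                % initialization, 1 x N
        c     = sum(alpha);  alpha = alpha / c;
        logP  = log(c);
        for t = 2:numel(O)
            alpha = (alpha * A) .* B(:, O(t)).';    % induction step
            c     = sum(alpha);  alpha = alpha / c; % scale to avoid underflow
            logP  = logP + log(c);                  % accumulate log scale factors
        end
    end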

3.2.5 Knowledge Based Approach

Knowledge-based speech recognition systems incorporate expert knowledge that is, for example,
derived from spectrograms, linguistics, or phonetics. The goal of a knowledge-based system is to
capture this knowledge using rules or procedures. The disadvantage of these systems is the
difficulty of quantifying expert knowledge and integrating the plethora of knowledge sources [18].
For large vocabulary or continuous speech recognition this becomes a difficult task. HEARSAY,
developed at Carnegie Mellon University on a PDP-11 minicomputer, is an example of a
knowledge-based system. It was a speaker-dependent, continuous-speech recognition system with a
vocabulary of 1011 words [18].

3.2.6 Acoustic-Phonetic Approach to Speech Recognition


Figure 3.3 shows a block diagram of the acoustic-phonetic approach to speech recognition. The
steps involved in this approach are explained one by one below.

 The first step in the processing is the speech analysis system, in which we extract the
important parameters of the speech, called the features, which give us the spectral properties
of the time varying speech signal. The most common techniques of spectral analysis are the
class of filter bank methods (e.g. MFCC) and the class of LPC methods. Both of these
methods provide a spectral description of the speech over time [5].
 The second step is the feature detection stage, in which we convert the spectral measurements
into a set of features that describe the broad acoustic properties of the different phonetic units.
The most common features are nasality, frication, formant locations, voiced-unvoiced
classification, and ratios of high and low frequency energy [5].
 “The third step in the procedure is the segmentation and labeling phase whereby the system
tries to find stable regions (where the features change very little over the region) and then to
label the segmented region according to how well the features within the region match those
of individual phonetic units” [5]. For more information on this, read [5].

The result of the segmentation and labeling step is usually a phonetic lattice, which is shown in
figure 3.4. From this, a lexical access procedure determines the best matching word or sequence of
words. Other types of lattices (e.g., syllable, word) can also be derived by integrating different
kinds of constraints into the control strategy, e.g. vocabulary and syntax constraints [5].

[Figure 3.3: Block diagram of acoustic phonetic speech recognition system [5]. The speech analysis
system (producing formants and pitch) feeds a bank of feature detectors 1 through Q; their outputs
go to segmentation and labeling, governed by a control strategy.]

“The quality of the matching of the features, within a segment, to phonetic units can be used to assign
probabilities to the labels, which then can be used in a probabilistic lexical access procedure. The final
output of the recognizer is the word or word sequence that best matches, in some well defined sense,
the sequence of the phonetic units in the phoneme lattice” [5].

[Figure 3.4: Phonetic lattice for a word string [5]. The lattice shows alternative phone hypotheses
(e.g. AO, R, EH, M, AW, T, SIL, AA, AX, B, OW) arranged along the time axis.]

3.2.7 Artificial Intelligence (AI) Approach to Speech Recognition:

The basic idea of the artificial intelligence approach to speech recognition is to compile and
incorporate knowledge from a variety of knowledge sources. Thus, in the AI approach to
segmentation and labeling, acoustic knowledge is augmented with phonetic knowledge, lexical
knowledge, syntactic knowledge, semantic knowledge, and even pragmatic knowledge. To be more
specific, we first define these different knowledge sources [5]:

 Acoustic knowledge - evidence of which sounds (predefined phonetic units) are spoken on the
basis of spectral measurements.

 Lexical knowledge - the combination of acoustic evidence that maps sounds into words (or,
equivalently, decomposes words into sounds).

 Syntactic knowledge - the combination of words to form grammatically correct sentences.

 Semantic knowledge - understanding of the task domain so as to be able to validate sentences
that are consistent with the task being performed, or which are consistent within the context
of previous sentences.

 Pragmatic knowledge - the inference ability necessary to resolve ambiguity of meaning based
on context.

For further study on these, please refer to [5].

There are several ways of integrating knowledge sources within a speech recognizer. Perhaps the
most standard approach is the “bottom-up” processor, shown in figure 3.6. In this approach the
lowest level processes (e.g., feature detection, phonetic decoding) precede higher level processes
(lexical decoding, language model) in a sequential manner. Thus each stage of the processing is
constrained as little as possible. The alternative approach is the “top-down” processor, in which the
language model generates word hypotheses that are matched against the speech signal, and
syntactically and semantically meaningful sentences are built up on the basis of the word match
scores [5]. Figure 3.5 shows a system that is often implemented in the top-down mode. Here the
unit matching, lexical decoding and syntactic analysis modules are integrated into a consistent
framework [5].

[Figure 3.5: A Top Down Approach to knowledge integration for speech recognition. Feature
analysis feeds a unit matching system, followed by lexical hypotheses, syntactic hypotheses and
semantic hypotheses, drawing on an inventory of speech recognition units, a word dictionary, a
grammar and a task model; an utterance verifier/matcher produces the recognized utterance.]

[Figure 3.6: A Bottom Up Approach to Knowledge Integration for Speech Recognition [5]. The
speech utterance passes through signal processing, feature extraction with voiced/unvoiced/silence
detection, segmentation, labeling, sound merging, word verification and sentence verification, to
yield the recognized word; knowledge sources (sound classification rules, phonotactic rules, lexical
access, language model) support the corresponding stages.]

3.3 Applications of Speech Recognition

Although there is still a lot of room for improvement in speech recognition systems, they have been
employed in different applications. They are mostly employed in industrial applications where the
hands of people are busy doing other things, e.g. in product inspection, inventory control,
command/control, and material handling. Speech recognition also finds frequent application in the
health sciences. In hospitals it can be used to help patients with routine tasks. For example, instead
of calling someone and telling him to bring water, patients can just use specific commands, and
attendants will know that the patient needs water.

Speech recognizers also have potential applications in telephone networks. Telecommunications companies can save millions of dollars by using speech recognizers instead of human operators, since the recognizers can perform very fast and provide quick service to hundreds of customers in parallel.

Some of the typical real world applications are:

3.3.1 Health Care


Speech recognizers can be employed in medical documentation processes. The medical documentation process can be made very fast with the help of speech recognizers, for example when searching patient records, making queries, or filling forms. All these tasks, if performed with speech recognizers, can save a lot of time.

3.3.2 High Performance Fighter Aircraft


Countries like the U.S.A, U.K, and France have done a substantial amount of work on the applications of speech recognition in fighter aircraft. Examples are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 Vista) and the program in France on installing speech recognition systems on Mirage aircraft. Similarly, the U.K is also carrying out a series of programs in this domain. These speech recognizers have been employed to do many common tasks, e.g. setting radio frequencies, commanding an autopilot system, and setting steer-point coordinates. They are also used in setting weapons release parameters and controlling flight displays. Normally the vocabulary being used is very limited, thus providing a great deal of accuracy.

The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent system,
i.e. it requires each pilot to create a template. The system is not used for any safety critical or weapon
critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of
other cockpit functions. Voice commands are confirmed by visual and/or aural feedback. The system
is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign
targets to himself with two simple voice commands or to any of his wingmen with only five
commands.

3.3.3 Helicopters
In helicopters there is a lot of noise, so speech recognition is a big problem in that environment. Moreover, helicopter pilots do not speak through a face mask, which makes speech recognition more difficult. However, since helicopters often have to go on rescue missions, the hands and eyes of the pilots are very busy, and speech recognizers can be of great help in such situations. Examples of research on speech recognition in helicopters are the work by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the UK. France has included speech recognition in the Puma helicopter, and there has also been much useful work in Canada. Results are good, yet with a lot of room for improvement. The tasks that can be performed in helicopters include control of communication radios, setting of navigation systems, and control of an automated target handover system.

3.3.4 Battle Management


In battle management, commanders have to access different databases while their eyes and hands are busy. In this kind of environment, speech recognizers have great benefits. Isolated word recognizers have been integrated into battle environments to provide performance under these conditions. The Defense Advanced Research Projects Agency (DARPA) in the U.S. has worked in this area of natural speech interfaces. Currently, efforts are being made to integrate speech recognition and natural language processing to allow spoken language interaction with a naval resource management system.

3.3.5 Telephony and Personal Workstations


Speech recognition is now becoming common in the field of telephony and in personal workstations. Users can interact with their personal computers with the help of speech recognition software. Although the performance is not yet good, there is a lot of room for improvement. Moreover, speech recognition software is also being used in document processing software to perform many trivial tasks. It is also used in mobile phones to do a series of different tasks like dialing numbers, sending SMS etc. Examples are: Microsoft Corporation (Microsoft Voice Command), Nuance Communications (Nuance Voice Control), Vito Technology (VITO Voice2Go), Speereo Software (Speereo Voice Translator) and SVOX.

3.3.6 Training Air Traffic Controller


Air traffic controllers (ATC) communicate with pilots to guide them while they are flying aircraft. These air traffic controllers have to be trained before they can perform their jobs. For this purpose, persons called pseudo-pilots are employed to train the air traffic controllers; they try to simulate the real scenario. Since the communication carried out between the pilot and the ATC is of a fixed nature, speech recognition software can be employed in this situation to train the ATC. Thus speech recognizers offer the potential to eliminate the need for a person to act as pseudo-pilot, which reduces training and support personnel.

3.3.7 People with Disabilities


Speech recognition software offers great benefits to people who have disabilities. For example, there are people who have to sit in wheelchairs all day. Speech recognizers can be installed to do many of their routine chores. For example, if speech recognizers are installed, they would not have to call someone to open the door, switch on the TV or the AC, etc. Moreover, they would be able to communicate with computers.

3.3.8 Interactive Voice Response (IVR)
Interactive Voice Response (IVR) is extensively used in telecommunications. Using this service, users dial the number of a call center and interact with a computer using the keypad of a mobile phone to query different things, like the balance left, call records, call package information etc. A lot of time can be saved if we install speech recognizers in this area. They also have potential applications in satellite navigation, audio and mobile phone systems.

3.3.9 Machine translation


Machine translation (MT) is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. At its most basic level, it performs simple substitution of words in one natural language for words in another. Here, too, there is a great deal of room for the employment of speech recognition software.

Chapter 4

Knowledge Models
In the knowledge model based approach, first of all some production rules are developed. Knowledge is acquired from the speech spectrogram or from the linguistics of the speech. The production rules which are generated are then later used for classification of the phonemes. During the training stage, features are extracted from the speech signals, and a set of production rules is then generated based on these extracted features. The recognition is then done, on the basis of these rules, using decision trees. For this purpose, the system needs to know how the words sound. During training, using the input speech data, the system generates an acoustic model and a language model. These models are later used in the recognition stage by the system to map a sound to a word or phrase [17].

4.1 Acoustic Model

“Features extracted by the feature extraction module need to be compared against a model to identify the sound that was produced as the word that was spoken” [17]. This model is called the Acoustic Model.

There are two kinds of Acoustic Models:

 Word Model
 Phone Model

4.1.1 Word model


For small vocabularies, we normally use word models. In this model we model the words as a whole; thus each word is modeled separately. If we want our system to recognize a new word, we will have to train the system for that word. For the recognition process, the spoken sound is matched against each of the models to find the best match. This best match is assumed to be the spoken word. We use the input sound files to train an HMM for each word. Figure 4.1 shows a diagram that represents a word based acoustic model [17].

[Block diagram: parallel word models (Word 1, Word 2, ..., Word n) between a start node and an end node, each word modeled by a chain of states S0–S5]

Figure 4.1: Word acoustic model [17]

4.1.2 Phone Model


In the phone model, rather than modeling the whole word, we model only parts of words, generally phones, and then we model the word itself as a sequence of phones. The heard sound is now matched against the parts (in this case the phones) and the parts are recognized. These recognized parts are put together to form a word. For example, the word “Move” is generated by the combination of three phones: m, oo and v. This type of model is generally useful when we need a large vocabulary. The process of adding a new word in this case is easy, because the sounds of the phones are already known to us, and we only need to add the new sequence of phones (of the new word being added) along with its probability to the system [17]. Figure 4.2 shows a diagram of a phone based acoustic model [17].

[Block diagram: parallel word models (Word 1, Word 2, ..., e.g. “The”) between a start node and an end node, each word expanded into its phone sequence (e.g. dh, ah) over states S0–S5]

Figure 4.2: Phone acoustic model [17]

4.2 Language Model


Although there are words that have similar sounding phones, humans generally find it easy to recognize the right word. This is mainly because they know the context which is being discussed, and they also have a pretty good idea about what words or phrases can occur in that context. Providing this context to a speech recognition system is the purpose of the language model. Thus we actually specify extra constraints that enhance the performance of the speech recognizer by discarding those paths in the recognition phase that lead to words that are out of context according to the language. However, this makes the process of speech recognition dependent on the language being used. The language model specifies which words are valid in the language and in what sequences they can occur.

Language models can be classified into several categories:

Uniform Model

In this kind of model each word has equal probability of occurrence.

Stochastic Model

In this kind of model probability of occurrence of a word depends on the words preceding it.

Finite State Languages

In this case language uses a finite state network to define allowed word sequences.

Context Free Grammar

In this case, a context free grammar can be used to encode which kinds of sentences are allowed [17].
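As a small illustration of the stochastic category (our own MATLAB sketch, not taken from [17]; the vocabulary and all probabilities are hypothetical), a bigram language model assigns a word sequence the product of its word-to-word transition probabilities:

% Hypothetical bigram (stochastic) language model: the probability of
% each word depends only on the word preceding it.
vocab  = {'move','stop','left','right'};
Pstart = [0.4 0.2 0.2 0.2];             % P(first word)
Pbi    = [0.1 0.4 0.3 0.2;              % row i: P(next word | vocab{i})
          0.5 0.1 0.2 0.2;
          0.3 0.3 0.2 0.2;
          0.3 0.3 0.2 0.2];
seq = [1 3 2];                          % the sequence "move left stop"
P   = Pstart(seq(1));
for k = 2:length(seq)
    P = P * Pbi(seq(k-1), seq(k));      % chain the bigram probabilities
end
fprintf('P(sequence) = %g\n', P);

A uniform model would simply replace Pstart and every row of Pbi with 1/4 for this four-word vocabulary.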

4.4 Implementation
We have implemented a word acoustic model in our project. The system has a model for each word. We extract the feature vectors of a word to characterize it. When speech is given to the system to recognize, it compares each model with the word and finds the model that most closely matches it (in our case using the Viterbi Algorithm). The HMM that best matches is thus the output, and the word corresponding to that model is displayed. For more details on HMMs, read chapter 5.

Chapter 5

Hidden Markov Model

5.1 Signal Models

In the real world many processes generate observable outputs which we can call signals. These signals can be discrete, continuous, or a combination of both types. Other classifications can be made depending upon the time varying characteristics of the signal and the source of the signal. For example, the signal can be stationary or non-stationary: stationary signals are those whose statistical properties do not change with time, and non-stationary signals are those whose properties do change with time. Similarly, there are signals which are pure, meaning they are generated from a single signal source, and there are signals that are not pure, because they are corrupted by noise that is being added by other sources. Signal models are used to characterize such signals. There are many reasons for using signal models. Two major reasons are: to obtain a theoretical description of a signal processing system, and to learn about the signal source without having the source available [10].

We can model signals depending upon the type of signal and the nature of its generating source. Real world signal models fall into one of two categories: deterministic and statistical. In a deterministic model some specific properties of the signal are known to us, so that the specification and analysis of the signal model is simple; for example, we may know that the signal is a square pulse or a Gaussian pulse. The second broad class of signal models is the statistical models, in which one tries to characterize only the statistical properties of the signals; in this case we do not know the specific properties of the signal. Examples of such statistical models are Gaussian processes, binomial processes, Markov processes and hidden Markov processes. For statistical models we assume that the signal can be characterized as a parametric random process whose parameters can be estimated in a precise, well defined manner [10].

Here we will look at one of these statistical models in detail: the Hidden Markov Model (HMM). We will follow a step by step approach; first of all we shall have a look at discrete Markov models and then we shall study the Hidden Markov Model (HMM).

5.2 Discrete-Time Markov Model

This section will describe the theory of Markov chains. Here the hidden part is not actually hidden, but observable; the system in this section is thereby called an Observable Markov Model. Consider a system that may be described at any time as being in one of a set of N distinct states indexed by 1, 2, ..., N. In the case of a discrete-time observable Markov model, at regularly spaced, discrete times the system undergoes a change of state (possibly back to the same state) according to a set of probabilities associated with the state. The time instants for a state change are denoted by t, whereas the actual state at time t is denoted q_t. In the case of a first order Markov chain, the state transition probabilities do not depend on the whole history of the process, but only upon the preceding state; the choice of state is made purely on the basis of the previous state. Although this may be a gross oversimplification, the model so generated is still of much practical importance. This is the Markov property and is defined as:

P(q_t = j | q_{t-1} = i, q_{t-2} = k, ...) = P(q_t = j | q_{t-1} = i)    (5.1)

We can see that the right hand side of eq. 5.1 is independent of time, which leads to a set of state transition probabilities a_ij of the form:

a_ij = P(q_t = j | q_{t-1} = i), 1 ≤ i, j ≤ N    (5.2)

The state transition probabilities for all the states in a model can be described by a transition
probability matrix:

A = [ a_11 ... a_1N
      ...  ...  ...
      a_N1 ... a_NN ]    (5.4)

The only thing remaining to describe the system is the initial state distribution vector (the probability of starting in each state). This vector is described by:

π = [ π_1, π_2, ..., π_N ]^T, with π_i = P(q_1 = i)    (5.5)

The stochastic property for the initial state distribution vector is:

Σ_{i=1}^{N} π_i = 1    (5.6)

where π_i is defined as

π_i = P(q_1 = i), 1 ≤ i ≤ N    (5.7)

These properties and equations describe a Markov process. The Markov model can be described by A and π. For more reading, refer to chapter 6 of [5].
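As a concrete illustration (our own MATLAB sketch; the three-state model and all numbers are hypothetical), the probability of a given state sequence under a Markov model (A, π) is just the product of the initial probability and the successive transition probabilities:

% Hypothetical 3-state observable Markov model.
A  = [0.6 0.3 0.1;                 % a_ij = P(q_t = j | q_{t-1} = i)
      0.2 0.5 0.3;                 % (each row sums to 1)
      0.1 0.4 0.5];
pp = [0.5 0.3 0.2];                % pi_i = P(q_1 = i)
q  = [1 2 2 3];                    % a state sequence to evaluate
P  = pp(q(1));                     % probability of the initial state
for t = 2:length(q)
    P = P * A(q(t-1), q(t));       % multiply in each transition, eq 5.2
end
fprintf('P(q | A, pi) = %g\n', P);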

5.3 Hidden Markov Model

A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved states. In the model above, the observer can directly see the states. That was the case of a Markov process, but in the case of a hidden Markov process the state is not directly visible; only the output, which depends on the state, is visible to the observer. Therefore the sequence of observations that the observer sees gives some information about the sequence of the underlying hidden states. An HMM is a doubly embedded stochastic process, with an underlying stochastic process that is not observable but can only be observed through another set of stochastic processes that produces the sequence of observations.

Thus, for modeling, we can extend our previous model to the Hidden Markov Model. The extension is that every state is now not deterministic but probabilistic: every state generates an observation according to some probabilistic function, and thus we get a matrix whose size is given by the number of states and the number of observable symbols [10][13]. These observations are denoted by O_t. The production of observations is defined by a set of observation probability measures

b_j(O_t) = P(O_t | q_t = j)    (5.8)

To explain the Hidden Markov Models, we shall present a very famous ball and urn example.

5.3.1 The Urn and Ball Example

Assume that there are N large glass urns in a room. Within each urn is a quantity of colored balls. Assume that there are M distinct colors of balls. As an example, consider a set of N urns containing colored balls of M = 6 different colors (R = red, O = orange, B = black, G = green, B = blue, P = purple) [13].

Figure 5.1: Urn and Ball Example [13]

The steps for generating the observation sequence for the above example are as follows [13]:

1. First of all an initial state is chosen according to the initial state distribution. This is
mathematically equal to choosing q1 = i according to the initial state distribution π (here state
equals urn).
2. Set t=1 (clock, t=1, 2, …. T).
3. Choose a ball from the selected urn (state) according to the symbol probability distribution in state i, i.e., b_i(O_t) (for example, the probability of a red ball in the second urn is 0.44, see Fig. 5.1). This colored ball represents the observation O_t. Put the ball back into the urn.
4. Make a transition to a new state (urn in this case) q_{t+1} = j according to the state transition probability distribution for state i, i.e. a_ij.
5. Set t = t + 1; return to step 3 if t < T; otherwise terminate the procedure.

These steps describe how a Hidden Markov Model works when it is generating an observation sequence. One thing should be noted here: the same colors of balls occur in all the urns, and the difference among the urns lies in the numbers of balls of each color. Thus, if we make a single independent observation, we would not be able to decide which urn the ball has come from. The analogy between the urn and ball example and the speech recognition problem must be noted: in the urn and ball example we have the colored balls as our observations, whereas in speech recognition we have feature vectors as our observations [13].

5.3.2 Types of Hidden Markov Models

There is not just one type of Hidden Markov Model; rather, there are many types. These types can be realized by putting different kinds of constraints on, or making different kinds of simplifications to, the state transition probability matrix A.

The most basic model is the ergodic or fully connected HMM. In this model, every state of the HMM can be reached from every other state of the HMM. In other words, we can say that 0 < a_ij < 1. This is shown in figure 5.2.

The state transition matrix for an ergodic model will have all entries non-zero and not equal to one, as has already been described above. It would be something like this:

A = [ a_11 a_12 a_13 a_14
      a_21 a_22 a_23 a_24
      a_31 a_32 a_33 a_34
      a_41 a_42 a_43 a_44 ]  (4 × 4)

Figure 5.2: Ergodic Model [13]

For some applications, especially those related to isolated word speech recognition, other types of HMM (other than the standard ergodic model) have been found to be pretty useful. One such model is shown in figure 5.3.

Figure 5.3a: Left-Right Model for Δ=1 [13]

Figure 5.3b: Left-Right Model for Δ=2 [13]

This model is called the left-right model or the Bakis model because, as time increases, the state index also increases (or at least stays the same). Thus the left-right type of HMM has the desirable property that it can readily model signals whose properties change over time, e.g. speech. The fundamental property of all left-right models is that any state with an index greater than or equal to that of the current state can be reached, whereas states with an index lower than the current state are impossible to reach. Interested readers should read [10] for more information.
This can be mathematically stated as:

a_ij = 0, j < i    (5.9)

Thus no transitions are allowed to the states whose indices are lower than the current state.

We can put further constraints on the left-right model by defining another equation:

a_ij = 0, j > i + Δ    (5.10)

This means that, besides not being able to go to a state whose index is smaller than the current state’s index, we also cannot go to a state whose index exceeds the index of the current state by more than Δ; in our state transition diagram, no jump greater than Δ is allowed. This is illustrated in figure 5.3. The state transition matrices for the cases Δ = 1 and Δ = 2 are given below.

A = [ a_11 a_12 0    0
      0    a_22 a_23 0
      0    0    a_33 a_34
      0    0    0    a_44 ]    for Δ = 1    (5.10)

A = [ a_11 a_12 a_13 0
      0    a_22 a_23 a_24
      0    0    a_33 a_34
      0    0    0    a_44 ]    for Δ = 2    (5.11)

The choice of a constrained model structure like the left-right model does not affect the coding of the training algorithms, because the state transition probabilities that are set to zero are simply treated as ordinary zeros and nothing special.

5.3.3 Elements of Hidden Markov Models

Any Hidden Markov Model requires three terms to distinguish itself from any other Hidden Markov Model: A, B, and π. The total number of hidden states N can be found just by checking the dimensionality of the matrix A, which has dimension N × N; nevertheless we shall describe it here. The same argument holds for M, the total number of observable outputs in the model, which can be read off from the matrix B, since it has dimension N × M. Thus the following are the terms that describe a Hidden Markov Model [10].

1. N, the number of states in the model. Although in the case of Hidden Markov Models these states are hidden from the user, they have practical significance. Generally the states are interconnected in such a way that any state can be reached from any other state (if no special constraints are put, e.g. the left-right model). We denote the individual states as S = {S1, S2, ..., SN} and the state at time t as qt.

2. M, the number of distinct observable symbols per state. The observation symbols correspond to the physical output of the system being modeled. We denote the individual symbols as O = {o1, o2, ..., oM}.

3. A = {a_ij}, the state transition probability distribution. Here,

a_ij = Pr(q_t = S_j | q_{t-1} = S_i), 1 ≤ i, j ≤ N

For the special case where any state can reach any other state in a single step, we have a_ij > 0 for all i, j. This is called an ergodic model.

4. B = {b_j(k)}, the observation symbol probability distribution in state j, where

b_j(k) = Pr(o_k at t | q_t = S_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M

5. π = {π_i}, the initial state distribution, where

π_i = Pr(q_1 = S_i), 1 ≤ i ≤ N
Given appropriate values of A, B, π, N and M, the HMM can be used to produce an observation sequence

O = o_1, o_2, o_3, ..., o_T

(in which each observation o_t is one of the symbols from O, and T is the number of observations in the sequence).

The observation sequence is produced as follows [10]:

1- Choose an initial state q_1 = S_i according to the initial state distribution π.
2- Set t = 1.
3- Choose O_t = o_k according to the symbol probability distribution in state S_i, i.e. b_i(k).
4- Transit to a new state q_{t+1} = S_j according to the state transition probability distribution for state S_i, i.e. a_ij.
5- Set t = t + 1; return to step 3 if t < T; otherwise terminate the procedure.
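The following MATLAB sketch (our own illustration; the two-state, three-symbol parameters are hypothetical) runs exactly this generation procedure:

% Generate T observations from a discrete HMM (A, B, pi).
A  = [0.7 0.3; 0.4 0.6];          % state transition probabilities a_ij
B  = [0.5 0.4 0.1;                % b_j(k) = P(symbol k | state j)
      0.1 0.3 0.6];
pp = [0.6 0.4];                   % initial state distribution pi
T  = 10;
q  = zeros(1, T);  O = zeros(1, T);
q(1) = find(rand < cumsum(pp), 1);            % step 1: initial state
for t = 1:T
    O(t) = find(rand < cumsum(B(q(t),:)), 1); % step 3: emit a symbol
    if t < T
        q(t+1) = find(rand < cumsum(A(q(t),:)), 1); % step 4: transit
    end
end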

5.4 The Three Basic Problems for HMM


Once we know that a system can be described as an HMM, three problems can be solved. The first two of them are pattern recognition problems: finding the probability of an observed sequence given an HMM (evaluation), and finding the sequence of hidden states that most probably generated an observed sequence (decoding). The third problem is generating an HMM given a sequence of observations (learning). Each of these problems is first described and then its solution is presented.

1. Problem 1 (Evaluation Problem)

Consider the problem where we have a number of HMMs, i.e. a number of triplets (A, B, π) describing different systems, and a sequence of observations. We may want to know which HMM most probably generated the given sequence. Consider the ball and urn example described in section 5.3.1, and suppose we have a sequence of observations of balls being drawn from the urns (i.e. a sequence of colors). Suppose further that we have one small group of N urns and one large group of N urns. The evaluation problem would be to find the probability of each group generating that color sequence and then selecting the group (i.e. the HMM) that gives the highest probability. Mathematically stated, the problem is:

Given the observation sequence O = O_1, O_2, O_3, ..., O_T and a model λ = (A, B, π), how do we efficiently (in the sense that few calculations are required) compute Pr(O | λ), the probability of the observation sequence given the model [10]?

2. Problem 2 (Decoding Problem)

Another related problem, and the one which is usually of most interest, is to find the most probable sequence of hidden states that generated the observed output. In many cases we are interested in the hidden states of the model, since they represent something which is important to us and which is not directly observable. Consider again the urn and ball example: we have a sequence of observations of colored balls, and we may be interested in finding the most probable sequence of urns. Mathematically it can be described as:

Given the observation sequence O = O1, O2, O3… OT and a model λ = (A, B, π), how do we find a
corresponding state sequence Q = q1 q2 … qT which is optimal in some meaningful sense [10]?

Description: This problem is the one in which we attempt to uncover the hidden part of the model, i.e. to find the “correct” state sequence. We usually use an optimality criterion to solve this problem as well as possible. The choice of criterion is a strong function of the intended use of the uncovered state sequence.

We use the Viterbi Algorithm to determine the most probable sequence of hidden states given a sequence of observations and an HMM.

3. Problem 3 (Learning Problem)

The third, and by far the hardest, problem associated with HMMs is to take a sequence of observations (from a known set), known to represent a set of hidden states, and fit the most probable HMM; that is, determine the triplet (A, B, π) that most probably describes what is observed. Thus the learning problem is actually determining the model parameters most likely to have generated a sequence of observations. Mathematically this is described as:

How do we adjust the model parameters λ = (A, B, π) to maximize Pr(O | λ)?

This problem is the one in which we attempt to optimize the model parameters so as to best describe how a given observation sequence comes about. The observation sequence used to adjust the model parameters is called a training sequence, since it is used to train the HMM. The training problem is very important because it lets us create models for physical phenomena and optimize the model parameters [10].
The solutions of these problems are presented in the following section.

5.4 Solutions to the Three Problems of HMM
Although we have tried to provide more detail and understanding, the approach followed here parallels that of [13].

5.4.1 Solution to Problem 1 (The Evaluation Problem)


The aim of this problem is to find the probability of the observation sequence O = (O_1, O_2, ..., O_T) given the model, i.e. Pr(O | λ). Since we have assumed that the observations produced by the states are independent of each other and of the time t, the probability of the observation sequence O = (O_1, O_2, ..., O_T) being generated by a certain state sequence q can be calculated as a product:

P(O | q, B) = b_{q_1}(O_1) · b_{q_2}(O_2) ··· b_{q_T}(O_T)    (5.12)

And we can find the probability of the state sequence as

P(q | A, π) = π_{q_1} · a_{q_1 q_2} · a_{q_2 q_3} ··· a_{q_{T-1} q_T}    (5.13)

The joint probability of O and q, i.e., the probability that O and q occur simultaneously, is simply the product of the above two terms:

P(O, q | λ) = P(O | q, B) · P(q | A, π)    (5.14)
            = π_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) ··· a_{q_{T-1} q_T} b_{q_T}(O_T)
            = π_{q_1} b_{q_1}(O_1) Π_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(O_t)

The aim was actually to find the probability of the observation sequence given the model, mathematically stated as Pr(O | λ). This is done by summing the joint probability over all possible hidden state sequences: we find the probability of the observed sequence assuming a particular hidden state sequence, do the same for all the other state sequences, and finally add them all to find Pr(O | λ). The related equation is:

P(O | λ) = Σ_{all q} P(O | q, B) · P(q | A, π)    (5.15)
         = Σ_{q_1, q_2, ..., q_T} π_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) ··· a_{q_{T-1} q_T} b_{q_T}(O_T)

The interpretation of the computation in 5.15 is as follows: initially, at time t = 1, the process starts by jumping to state q_1 with probability π_{q_1}, and the symbol O_1 is observed with observation probability b_{q_1}(O_1). The clock changes from t to t + 1, a transition from q_1 to q_2 occurs with probability a_{q_1 q_2}, and the symbol O_2 is then observed with probability b_{q_2}(O_2). The process continues in this manner until, at time t = T, the last transition is made from q_{T-1} to q_T with probability a_{q_{T-1} q_T}, and the symbol O_T is observed with probability b_{q_T}(O_T).

“The calculation of Pr(O | λ) according to its definition (5.15) involves on the order of 2T · N^T computations, since at every t = 1, 2, ..., T there are N possible states which can be reached (i.e., there are N^T possible state sequences), and for each such state sequence about 2T calculations are required for each term in the sum of 5.15” [10].

Clearly a more efficient procedure is required for solving this problem. An excellent tool which cuts
the computational requirements to linear, relative to T, is the well known forward algorithm.

5.4.1.1 The Forward Algorithm

The mathematical formulation here is taken from [13].

Consider a forward variable α_t(i), defined as:

α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ)    (5.16)

where t represents time and i is the state. Thus α_t(i) is the probability of the partial observation sequence O_1 O_2 ... O_t (until time t) and of being in state i at time t. The forward variable can be calculated inductively, see Fig. 5.4.

Figure 5.4: Forward Procedure- Induction Step [13]

α_{t+1}(j) is found by summing the forward variable for all N states at time t, each multiplied by its corresponding state transition probability a_ij, and then multiplying by the emission probability b_j(O_{t+1}). This can be done with the following procedure:

1. Initialization

Set t = 1;    (5.17)

α_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N    (5.18)

2. Induction

α_{t+1}(j) = b_j(O_{t+1}) Σ_{i=1}^{N} α_t(i) a_ij, 1 ≤ j ≤ N    (5.19)

3. Update time

Set t = t+1;

Return to step 2, if t < T.

Otherwise terminate the algorithm (go to step 4).

4. Termination

Pr(O | λ) = Σ_{j=1}^{N} α_T(j)    (5.20)

This algorithm requires only on the order of N²T calculations, rather than 2T · N^T calculations [10].

Figure 5.4 arbitrarily shows how state S_3 can be reached at time t + 1 from the N possible states S_i, 1 ≤ i ≤ N, at time t. Since α_t(i) is the probability of the joint event that O_1, O_2, ..., O_t are observed and state S_i is reached at time t, the product α_t(i) a_i3 is the probability that O_1, O_2, ..., O_t are observed and state S_3 is reached at time t + 1 via state S_i at time t. Summing this product over all N possible states S_i, 1 ≤ i ≤ N, at time t results in the probability of S_3 at time t + 1 with all the accompanying previous partial observations. Once this is done, α_{t+1}(3) is obtained by accounting for observation O_{t+1} in state 3, i.e. by multiplying the summed quantity by the probability b_3(O_{t+1}). This computation is performed not just for state 3 but for all states j, 1 ≤ j ≤ N, for a given t. Finally, step 4 gives the desired calculation of Pr(O | λ) as the sum of the terminal forward variables α_T(i) [10]. This is the case since, by definition,

α_T(i) = Pr(O_1 O_2 ... O_T, q_T = S_i | λ)

and hence Pr(O | λ) is just the sum of the α_T(i)’s:

Pr(O | λ) = Σ_{i=1}^{N} α_T(i)    (5.21)
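A minimal MATLAB sketch of the forward algorithm follows (our own formulation of the equations above; variable names are ours, and the scaling of section 5.4.1.3 is deliberately omitted, so it is only numerically safe for short sequences):

function P = forward_prob(A, B, pp, O)
% A: N-by-N transition matrix, B: N-by-M emission matrix,
% pp: 1-by-N initial distribution, O: 1-by-T vector of symbol indices.
[N, ~] = size(B);
T = length(O);
alpha = zeros(N, T);
alpha(:,1) = pp(:) .* B(:, O(1));                 % initialization (5.18)
for t = 1:T-1
    % induction (5.19): sum over i of alpha_t(i)*a_ij, then emit O(t+1)
    alpha(:,t+1) = (A' * alpha(:,t)) .* B(:, O(t+1));
end
P = sum(alpha(:,T));                              % termination (5.20)
end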

5.4.1.2 The Backward Algorithm

The mathematical formulation here is taken from [13].

The recursion described in the forward algorithm can also be done in reverse time, by defining the backward variable β_t(i) as:

β_t(i) = P(O_{t+1} O_{t+2} ... O_T | q_t = S_i, λ)    (5.22)

It is the counterpart of the forward variable: it accounts for the probability of the partial observation sequence from time t + 1 onwards, given that the system is in state S_i at time t. Notice that the forward variable is defined as a joint probability, whereas the backward variable is a conditional probability.
Just like the forward variable, the backward variable can be calculated by induction; see the figure below for the induction of β_t(i).

Figure 5.5: Backward Procedure- Induction Step [13]

The backward algorithm includes the following steps.

1. Initialization

Set t = T − 1;

β_T(i) = 1, 1 ≤ i ≤ N    (5.23)

2. Induction

β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j), 1 ≤ i ≤ N    (5.24)

3. Update Time

Set t = t − 1;
Return to step 2 if t > 0.
Otherwise terminate the algorithm.
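A matching MATLAB sketch of the backward recursion (again our own formulation, without scaling):

function beta = backward_vars(A, B, O)
% Returns the N-by-T matrix of backward variables beta_t(i).
[N, ~] = size(B);
T = length(O);
beta = zeros(N, T);
beta(:,T) = 1;                                    % initialization (5.23)
for t = T-1:-1:1
    % induction (5.24): sum over j of a_ij * b_j(O_{t+1}) * beta_{t+1}(j)
    beta(:,t) = A * (B(:, O(t+1)) .* beta(:,t+1));
end
end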

5.4.1.3 Scaling the Forward and the Backward Variable


We need to scale the variables α_t(i) and β_t(i). The question arises: why do we need scaling? The calculation of α_t(i) and β_t(i) involves multiplications of probabilities. As t grows large, each term of α_t(i) or β_t(i) starts to head exponentially towards zero. For sufficiently large t (e.g., 100 or more), the dynamic range of the α_t(i) and β_t(i) computation will exceed the precision range of essentially any machine. Consider the definition of α_t(i): it consists of a sum of a large number of terms, each of the form

( Π_{s=1}^{t-1} a_{q_s q_{s+1}} ) ( Π_{s=1}^{t} b_{q_s}(O_s) )

with q_t = S_i. Since each a and b term is less than 1, as t gets big each term of α_t(i) heads exponentially to zero. Hence the only reasonable way of performing the computation is to include a scaling procedure [10].

The basic scaling procedure multiplies α_t(i) by a scaling coefficient that is independent of i, with the goal of keeping the scaled α_t(i) within the dynamic range of the computer for 1 ≤ t ≤ T. A similar scaling is done to the β_t(i) coefficients, and at the end of the computation the scaling coefficients cancel out exactly [10]. The mathematical formulation below follows the one given by [10].

To understand this scaling procedure better, consider the reestimation formula for the state transition coefficients a_ij. If we write the formula directly in terms of the forward and backward variables we get

ā_ij = [ Σ_{t=1}^{T-1} α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) ] / [ Σ_{t=1}^{T-1} Σ_{j=1}^{N} α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) ]    (5.25)

Consider the computation of α_t(i). For each t, we first compute α_t(i) according to the induction formula and then multiply it by a scaling coefficient c_t, where

c_t = 1 / Σ_{i=1}^{N} α_t(i)    (5.26)

Thus, for a fixed t, we first compute

α_t(i) = Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(O_t)    (5.27)

then the scaled coefficient set α̂_t(i) is computed as

α̂_t(i) = c_t α_t(i) = [ Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(O_t) ] / [ Σ_{i=1}^{N} Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(O_t) ]    (5.28)

By induction we can write α̂_{t-1}(j) as

α̂_{t-1}(j) = ( Π_{τ=1}^{t-1} c_τ ) α_{t-1}(j)    (5.29)

Hence the unscaled induction step can be written as

α_t(i) = b_i(O_t) Σ_{j=1}^{N} α_{t-1}(j) a_ji, 1 ≤ i ≤ N    (5.30)

The interpretation of (5.28) is that each α_t(i) is effectively scaled by the sum over all states of the α_t(i).

Next we compute the β_t(i) terms from the backward recursion. The only difference here is that we use the same scale factors for each time t for the betas as we used for the alphas. Hence the scaled β’s are of the form

β̂_t(i) = c_t β_t(i)    (5.31)

Since each of the scale factors rescales the magnitude of the α terms to 1, and since the magnitudes of the α and β terms are comparable, using the same scaling factors on the β’s as were used for the α’s is an effective way of keeping the computation within reasonable bounds. Furthermore, in terms of the scaled variables, the reestimation equation (5.25) becomes

ā_ij = [ Σ_{t=1}^{T-1} α̂_t(i) a_ij b_j(O_{t+1}) β̂_{t+1}(j) ] / [ Σ_{t=1}^{T-1} Σ_{j=1}^{N} α̂_t(i) a_ij b_j(O_{t+1}) β̂_{t+1}(j) ]    (5.32)

but each α̂_t(i) can be written as

α̂_t(i) = ( Π_{τ=1}^{t} c_τ ) α_t(i) = C_t α_t(i)    (5.33)

and each β̂_{t+1}(i) can be written as

β̂_{t+1}(i) = ( Π_{τ=t+1}^{T} c_τ ) β_{t+1}(i) = D_{t+1} β_{t+1}(i)    (5.34)

Thus the reestimation equation can now be written as

ā_ij = [ Σ_{t=1}^{T-1} C_t α_t(i) a_ij b_j(O_{t+1}) D_{t+1} β_{t+1}(j) ] / [ Σ_{t=1}^{T-1} Σ_{j=1}^{N} C_t α_t(i) a_ij b_j(O_{t+1}) D_{t+1} β_{t+1}(j) ]    (5.35)

Finally, the term C_t D_{t+1} can be seen to be of the form [10]

C_t D_{t+1} = ( Π_{τ=1}^{t} c_τ )( Π_{τ=t+1}^{T} c_τ ) = Π_{τ=1}^{T} c_τ = C_T    (5.36)

This C_T is independent of t. Hence the terms C_t D_{t+1} cancel out of both the numerator and the denominator of (5.35), and the exact reestimation equation is therefore realized.

The only real change to the HMM procedure because of scaling is the procedure for computing P(O | λ) [10]. We cannot merely sum up the α̂_T(i) terms, since these are already scaled. However, we can use the property that

Π_{t=1}^{T} c_t Σ_{i=1}^{N} α_T(i) = C_T Σ_{i=1}^{N} α_T(i) = 1    (5.37)

Thus we have

Π_{t=1}^{T} c_t · P(O | λ) = 1    (5.38)

or

P(O | λ) = 1 / Π_{t=1}^{T} c_t    (5.39)

or

log P(O | λ) = − Σ_{t=1}^{T} log c_t    (5.40)

Thus the log of P can be computed, but not P since it would be out of the dynamic range of the
machine anyway.

Finally we note that when using the Viterbi algorithm to give the maximum likelihood state sequence, no scaling is required if we use logarithms in the following way. We define

φ_t(i) = max_{q_1, q_2, ..., q_{t-1}} { log P[q_1, q_2, ..., q_t = i, O_1, O_2, ..., O_t | λ] }    (5.41)

and initially set

φ_1(i) = log(π_i) + log[b_i(O_1)]

with the recursion step

φ_t(j) = max_{1 ≤ i ≤ N} [ φ_{t-1}(i) + log a_ij ] + log[b_j(O_t)]    (5.42)

and the termination step

log P* = max_{1 ≤ i ≤ N} [ φ_T(i) ]    (5.43)

Again we arrive at log P* rather than P*, but with significantly less computation and with no numerical problems. (The reader should note that the terms log a_ij can be precomputed and therefore do not cost anything in the computation. Furthermore, the terms log[b_j(O_t)] can be precomputed when a finite observation symbol analysis, e.g. a codebook of observation sequences, is used [10].)

It should be obvious here that the scaling procedure applies to the reestimation of π and B as well. It should also be clear that the scaling procedure does not need to be applied at every time instant; it can be performed whenever desired or necessary, e.g. to prevent underflow. If scaling is not performed at some time instant, the scaling coefficient for that time instant is set to 1 and all the conditions discussed above are still met.
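In code, the scaled recursion is a small change to the forward sketch given earlier; log P(O | λ) is then recovered from equation 5.40 (again our own MATLAB illustration):

function logP = forward_scaled(A, B, pp, O)
% Scaled forward algorithm: alpha is renormalized to sum to 1 at every
% t, and the log-likelihood is recovered as -sum(log c_t), eq (5.40).
[N, ~] = size(B);
T = length(O);
c = zeros(1, T);                       % scaling coefficients c_t
alpha = pp(:) .* B(:, O(1));           % unscaled alpha_1
c(1) = 1 / sum(alpha);
alpha = alpha * c(1);                  % scaled alpha-hat_1
for t = 2:T
    alpha = (A' * alpha) .* B(:, O(t));    % induction on scaled alphas
    c(t) = 1 / sum(alpha);
    alpha = alpha * c(t);              % rescale so sum(alpha) == 1
end
logP = -sum(log(c));                   % eq (5.40)
end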

5.4.2 Solution to Problem 2 (Decoding Problem)

Since there is no single exact solution to the decoding problem, we must follow some optimality criterion. We shall follow the approach adopted by [10]. As already said, unlike problem 1, for which an exact solution can be given, there are several possible ways of solving problem 2, namely finding the “optimal” state sequence associated with the given observation sequence. The difficulty lies with the definition of the optimal state sequence; i.e. there are several possible optimality criteria. For example, one possible optimality criterion is to choose the states q_t which are individually most likely. This optimality criterion maximizes the expected number of correct individual states. To implement this solution to problem 2, we define the variable

γ_t(i) = Pr(q_t = S_i | O, λ)    (5.44)

i.e. the probability of being in state S_i at time t, given the observation sequence O and the model λ. The above equation can be expressed simply in terms of the forward-backward variables, i.e.,

γ_t(i) = α_t(i) β_t(i) / Pr(O | λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)    (5.45)

since α_t(i) accounts for the partial observation sequence O_1, O_2, ..., O_t and state S_i at time t, while β_t(i) accounts for the remainder of the observation sequence O_{t+1} O_{t+2} ... O_T given state S_i at time t. The normalization factor Pr(O | λ) = Σ_{i=1}^{N} α_t(i) β_t(i) makes γ_t(i) a probability measure, so that

Σ_{i=1}^{N} γ_t(i) = 1    (5.46)

Using γ_t(i), we can solve for the individually most likely state q_t at time t as

q_t = argmax_{1 ≤ i ≤ N} [γ_t(i)], 1 ≤ t ≤ T    (5.47)

Although the above equation maximizes the expected number of correct states, there can be problems with the resulting state sequence. For example, when the HMM has state transitions with zero probability (a_ij = 0 for some i and j), the optimal state sequence found this way may not even be a valid state sequence. This is because we have only taken into account the most likely state at every individual time instant, without regard to the probability of occurrence of the sequence of states.

One possible solution to the above problem is to modify the criterion of optimality [10]. For example, one could solve for the state sequence that maximizes the expected number of correct pairs (q_t, q_{t+1}) or triplets (q_t, q_{t+1}, q_{t+2}) of states. Although these criteria might be reasonable for some applications, the most widely used criterion is to find the single best state sequence (path), i.e. to maximize Pr(Q | O, λ), which is equivalent to maximizing Pr(Q, O | λ). Obviously we need an algorithm that guarantees such a path, and fortunately such a method exists: the Viterbi Algorithm, based on the principles of dynamic programming. This algorithm finds application not just in our project but also in a number of other fields, e.g. decoding convolutional codes in satellite communications and in computer storage devices such as hard disk drives.

5.4.2.1 The Viterbi Algorithm

The Viterbi Algorithm is a very famous algorithm for finding the optimal state sequence. We shall follow the approach used by [10]. To find the single best state sequence Q = {q_1 q_2 ... q_T} for the given observation sequence O = {O_1 O_2 ... O_T}, we need to define the quantity

δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} Pr(q_1 q_2 ... q_t = i, O_1 O_2 ... O_t | λ)    (5.48)

i.e. the best score along a single path, at time t, which accounts for the first t observations and ends in state S_i. By induction we have

δ_{t+1}(j) = [ max_i δ_t(i) a_ij ] · b_j(O_{t+1})    (5.49)

Here we account for a state transition and for the probability of the observed token at the current state (time t + 1). To actually retrieve the state sequence, we need to keep track of the argument which maximized the above equation for each t and j; we do this via the array ψ_t(j). The complete procedure for finding the best state sequence can now be stated as follows:

1- Initialization

δ_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N    (5.50)

ψ_1(i) = 0

2- Recursion

δ_t(j) = max_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_ij ] · b_j(O_t), 2 ≤ t ≤ T; 1 ≤ j ≤ N    (5.51)

ψ_t(j) = argmax_{1 ≤ i ≤ N} [ δ_{t-1}(i) a_ij ], 2 ≤ t ≤ T; 1 ≤ j ≤ N

3- Termination

P* = max_{1 ≤ i ≤ N} [ δ_T(i) ]    (5.52)

q_T* = argmax_{1 ≤ i ≤ N} [ δ_T(i) ]

4- Path (state sequence) backtracking

q_t* = ψ_{t+1}(q_{t+1}*), t = T−1, T−2, ..., 1    (5.53)

It should be noted that the Viterbi algorithm is similar in implementation to the forward calculation. The major difference is that the maximization over previous states in (5.51) is used in place of the summing procedure: the forward algorithm explores all paths and adds their probabilities, whereas the Viterbi algorithm keeps the path whose probability is maximum, and at every time instant it depends only on the observed event at time t and the most likely sequence at time t − 1.

5.4.2.2 The Alternative Viterbi Algorithm

Because of its ease of implementation, we have implemented the alternative Viterbi algorithm. The following steps are included in the Alternative Viterbi Algorithm [13]:

1. Preprocessing

π̃_i = log(π_i), 1 ≤ i ≤ N    (5.54)

ã_ij = log(a_ij), 1 ≤ i, j ≤ N

2. Initialization

Set t = 2;

b̃_i(O_1) = log(b_i(O_1)), 1 ≤ i ≤ N    (5.55)

δ̃_1(i) = π̃_i + b̃_i(O_1), 1 ≤ i ≤ N

ψ_1(i) = 0, 1 ≤ i ≤ N

3. Induction

b̃_j(O_t) = log(b_j(O_t)), 1 ≤ j ≤ N    (5.56)

δ̃_t(j) = b̃_j(O_t) + max_{1 ≤ i ≤ N} [ δ̃_{t-1}(i) + ã_ij ], 1 ≤ j ≤ N

ψ_t(j) = argmax_{1 ≤ i ≤ N} [ δ̃_{t-1}(i) + ã_ij ], 1 ≤ j ≤ N

4. Update

Set t = t+1;
Return to step 3 if t≤ T;
Otherwise, terminate the algorithm
(go to step 5)
5. Termination

P̃* = max_{1 ≤ i ≤ N} [ δ̃_T(i) ]    (5.57)

q_T* = argmax_{1 ≤ i ≤ N} [ δ̃_T(i) ]

6. Path (state sequence) backtracking


a. Initialization
Set t= T-1

b. Backtracking

q_t* = ψ_{t+1}(q_{t+1}*)

c. Update time

Set t =t -1

Return to step b if t ≥ 1;

Otherwise, terminate the algorithm

To exemplify how the Alternative Viterbi Algorithm works, an example is given from [13]. Consider a model with N = 3 states and an observation sequence of length T = 8. In the initialization (t = 1), δ̃_1(1), δ̃_1(2) and δ̃_1(3) are found; let us assume that δ̃_1(2) is the maximum. At the next time (t = 2) three variables will be used, namely δ̃_2(1), δ̃_2(2) and δ̃_2(3); let us assume that δ̃_2(1) is now the maximum. In the same manner, the variables δ̃_3(3), δ̃_4(2), δ̃_5(2), δ̃_6(1), δ̃_7(3) and δ̃_8(3) will be the maxima at their times, see Figure 5.6.

Figure 5.6: Example of Viterbi Search [13]

To find the state path, backtracking is used. It begins in the most likely end state and moves towards the start of the observations by selecting, via ψ_t(i), the state at time t − 1 that led to the current state.
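The following MATLAB sketch (our own, with the same matrix conventions as the earlier sketches) implements this alternative (log-domain) Viterbi algorithm with backtracking:

function [q, logPstar] = viterbi_log(A, B, pp, O)
% Log-domain Viterbi: returns the best state path q and its log score.
[N, ~] = size(B);
T  = length(O);
la = log(A);  lp = log(pp(:));             % preprocessing (5.54)
delta = zeros(N, T);  psi = zeros(N, T);
delta(:,1) = lp + log(B(:, O(1)));         % initialization (5.55)
for t = 2:T
    for j = 1:N                            % induction (5.56)
        [best, psi(j,t)] = max(delta(:,t-1) + la(:,j));
        delta(j,t) = best + log(B(j, O(t)));
    end
end
q = zeros(1, T);
[logPstar, q(T)] = max(delta(:,T));        % termination (5.57)
for t = T-1:-1:1
    q(t) = psi(q(t+1), t+1);               % backtracking
end
end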

5.4.3 Solution to Problem 3 (Learning Problem)


Not surprisingly, we have again followed the approach of [10] here. In this problem, we have to maximize the probability of the observation sequence given the model, i.e. P(O | λ), by adjusting the parameters of the model (A, B, π). There is no known way to analytically find the model which maximizes the probability of the observation sequence. In fact, given any finite observation sequence as training data, there is no formal way of estimating the model parameters. We can, however, choose (A, B, π) such that P(O | λ) is locally maximized using an iterative procedure such as the Baum-Welch method. We will discuss one such method here, following [10].

In order to describe the procedure for re-estimation of the HMM parameters, we first define ξ_t(i,j), the probability of being in state S_i at time t and in state S_j at time t + 1, given the model and the observation sequence, i.e.

ξ_t(i,j) = Pr(q_t = S_i, q_{t+1} = S_j | O, λ)

The sequence of events leading to the conditions required is illustrated in Figure 5.7.

Figure 5.7: Computation of ξ_t(i,j), i.e. the Joint Event that the System is in State S_i at time t and State S_j at time t+1 [13]

It should be clear from the definitions of the forward and backward variables that we can write ξ_t(i,j) in the form

ξ_t(i,j) = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)
         = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / [ Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) ]

where the numerator is just Pr(q_t = S_i, q_{t+1} = S_j, O | λ) and the division by P(O | λ) gives the desired probability measure.

Previously we defined γ_t(i) as the probability of being in state S_i at time t, given the observation sequence and the model; hence we can relate γ_t(i) to ξ_t(i,j) by summing over j, giving

γ_t(i) = Σ_{j=1}^{N} ξ_t(i,j)

If we sum γ_t(i) over the time index t, we get a quantity which can be interpreted as the expected (over time) number of times that state S_i is visited, or equivalently, the expected number of transitions made from state S_i (for this interpretation we exclude the time slot t = T from the summation). Similarly, summing ξ_t(i,j) over t (from t = 1 to t = T − 1) can be interpreted as the expected number of transitions from state S_i to state S_j [10]. That is,

Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions from S_i

Σ_{t=1}^{T-1} ξ_t(i,j) = expected number of transitions from S_i to S_j

Using the above formulas we can give a method for reestimation of the parameters of an HMM. A set of reasonable reestimation formulas for A, B and π is:

π̄_i = expected frequency in state S_i at time t = 1 = γ_1(i)

ā_ij = (expected number of transitions from state S_i to state S_j) / (expected number of transitions from state S_i)

b̄_j(k) = (expected number of times in state j and observing symbol v_k) / (expected number of times in state j)

If we define the current model as λ = (A, B, π) and use it to compute the above three equations, then Baum and his colleagues have shown that either the model λ defines a critical point of the likelihood function, in which case Pr(O | λ̄) = Pr(O | λ), i.e. λ̄ = λ, or the model λ̄ is more likely than the model λ in the sense that Pr(O | λ̄) > Pr(O | λ), i.e. we have found a new model λ̄ from which the observation sequence is more likely to have been produced.

“Based on the above procedure, if we iteratively use λ̄ in place of λ and repeat the re-estimation calculation, we can improve the probability of O being observed from the model until some limiting point is reached. The final result of this reestimation procedure is called a maximum likelihood estimate of the HMM” [10].

We should not forget that the forward-backward algorithm leads to a local maximum only, and in most cases the optimization surface is fairly complex and has many local maxima. For further insight into the reestimation procedure, please refer to [14].
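One full reestimation pass can be sketched in MATLAB as follows (our own illustration; it uses the unscaled forward/backward variables for clarity, so for long sequences the scaled versions of section 5.4.1.3 should be substituted):

function [A2, B2, pp2] = baum_welch_step(A, B, pp, O)
% One Baum-Welch iteration: returns reestimated (A, B, pi).
[N, M] = size(B);  T = length(O);
alpha = zeros(N,T);  beta = zeros(N,T);
alpha(:,1) = pp(:) .* B(:,O(1));                 % forward pass
for t = 1:T-1, alpha(:,t+1) = (A'*alpha(:,t)) .* B(:,O(t+1)); end
beta(:,T) = 1;                                   % backward pass
for t = T-1:-1:1, beta(:,t) = A*(B(:,O(t+1)).*beta(:,t+1)); end
P = sum(alpha(:,T));                             % P(O | lambda)
gamma = (alpha .* beta) / P;                     % gamma_t(i), N-by-T
xiSum = zeros(N, N);                             % sum over t of xi_t(i,j)
for t = 1:T-1
    xiSum = xiSum + (alpha(:,t) * (B(:,O(t+1)).*beta(:,t+1))') .* A / P;
end
A2  = xiSum ./ sum(gamma(:,1:T-1), 2);           % a-bar_ij
B2  = zeros(N, M);
for k = 1:M
    B2(:,k) = sum(gamma(:, O == k), 2) ./ sum(gamma, 2);  % b-bar_j(k)
end
pp2 = gamma(:,1)';                               % pi-bar_i = gamma_1(i)
end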

Chapter 6

Working of the Project


The system that we have developed is fairly complex to understand at once, and it contains a lot of modules. As already described in chapter 1, Automatic Speech Recognition requires complex interdisciplinary research; a lot of fields and sciences are involved in it. Moreover, we also have a hardware portion in our project, so we have to describe that as well. We have divided the project into five big modules to make it easy to understand and follow. We shall first draw a system level block diagram showing these big modules, then present another diagram that shows what each of the big modules contains, and finally each of them shall be explained. Although the hardware portion will be described here to some extent, we have reserved a separate chapter for it, so that those readers who are not interested in hardware design can restrict themselves to this chapter only and then jump to Chapter 8 for results and conclusions, while those who are interested in hardware design as well (like we were) can dive into chapter 7 and see the details.

6.1 System Level Block Diagram

[Block diagram: Data Acquisition → Signal Processing → Feature Extraction → Training & Recognition → Hardware Application]

Figure 6.1: Block Diagram Showing the Main Modules of the Project

[Detailed block diagram:
Data Acquisition Block: utterance from the speaker → amplifier → A/D converter
Signal Processing Block: pre-emphasis filter → voice activation detection (VAD)/end point detection → frame blocking → frame windowing
Feature Extraction Block: 512-point DFT → Mel filters → log(m_k) → IDCT → cepstral weighting → delta cepstrum
Training and Recognition Block: vector quantization → codebook generation; training mode → reference templates; recognition mode → test template → word recognition
Hardware Block: serial port of the computer → MAX232 → protocol → microcontroller → wireless link → wireless car remote control]

Figure 6.2: Detailed Diagram Showing the Main Modules of the Project

Now we shall explain each block separately and cover all the important details. However, we shall not dive deep into the details of the hardware block, for that is the subject of the next chapter.

6.2 Data Acquisition Block

The first block of Figure 6.1 is the Data Acquisition Block, which is drawn in more detail in Figure 6.2. As shown in Figure 6.2, first of all the sound is recorded with a simple microphone and then internally converted into digital format with the help of an A/D converter. The A/D converter has a resolution of 16 bits/sample, which is more than enough for our application. The sampling frequency is selected by the user; we select it in MATLAB, and the sampling frequency chosen in this project is 8 kilohertz. Since the maximum frequency in a speech signal is almost 4 kilohertz, 8 kilohertz is a sufficient sampling rate for our application. Our recording time is 1.5 seconds, since we know that all the words that have to be spoken in our project are much shorter than 1.5 seconds. Another advantage of a smaller recording time is that the performance of a speech recognition system improves significantly with a smaller recording time: since the words in the vocabulary are short, if one selects a larger time, e.g. 5 seconds, the speaker will be silent for most of those 5 seconds and the microphone will just record noise. The reader may argue that even in 1.5 seconds the speaker will be silent for, say, half of the time; but here we have used a Voice Activation Detection algorithm [13]. This algorithm detects the end points of the speech signal and nullifies all the noise by simply replacing it with zero (no signal). This gives a great performance benefit even in noisy environments where there is random noise in the background. Obviously, the random noise should not be of large intensity; otherwise it will not be treated as noise by the algorithm.

In MATLAB, the speech signal is recorded by issuing the command “wavrecord”. Its format can be understood by exploring the MATLAB help, or the website www.mathworks.com.
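For concreteness, a minimal sketch of the recording step with the parameters quoted above (wavrecord was available in the MATLAB releases current at the time of this project):

fs = 8000;                              % sampling frequency (8 kHz)
t  = 1.5;                               % recording time in seconds
s  = wavrecord(t*fs, fs, 1, 'double');  % record 12000 mono samples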

6.3 Signal Processing Block

This block itself has many small sub-modules. Each of these sub-modules is very important and needs to be explained under a separate heading.

6.3.1 Pre-emphasis Filter

This filter is used at the start of the signal processing module to spectrally flatten the sound signal. It is actually a high pass filter. Its transfer function is:

H(z) = 1 − 0.95 z^{-1}    (6.1)

s_1(n) = Σ_{k=0}^{M-1} h(k) s(n − k)    (6.2)

The frequency response of this pre-emphasis filter is shown below in figure 6.3.

Figure 6.3: Frequency Response of the Pre-emphasis Filter
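A minimal MATLAB sketch of this step (ours; the random signal is only a placeholder for a recorded utterance):

fs = 8000;                       % sampling frequency of the project
s  = randn(1, 1.5*fs);           % placeholder for a recorded utterance
s1 = filter([1 -0.95], 1, s);    % eq 6.2 with h = [1 -0.95], i.e. H(z)
freqz([1 -0.95], 1, 512, fs);    % plots the response shown in figure 6.3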

6.3.2 Voice Activation Detection/ End Point Detection

As already mentioned in section 6.2, if we use end point detection on the audio signal, the performance of the speech recognizer improves a lot. The problem of locating the endpoints of an utterance in a speech signal is a major problem for the speech recognizer, and inaccurate endpoint detection will decrease its performance. Although the problem of end point detection may seem simple, it is actually not; we must ensure a good SNR to make the task simple. Some of the most commonly used estimates for locating speech are the short term energy estimate, the short term power estimate, and the zero crossing rate. In our implementation we have used the short term power of the signal to locate the speech content within the whole 1.5 second recorded sound signal. Here we have followed the approach used by [13]:

P_{s1}(m) = (1/L) Σ_{n=m-L+1}^{m} s_1²(n)    (6.3)

For each block of L samples, this measure calculates one value. One thing should be noted: the index for this function is m and not n, because the measure does not have to be calculated for every sample (it can, for example, be calculated every 20 ms). The short-term power estimate will increase when speech is present in s_1(n). The measure needs a trigger for making the decision about where the utterance begins and ends. We have used the following function for comparing with the trigger [13]:

W_{s_1}(m) = P_{s_1}(m) \cdot S_c    (6.4)

S_c is a scale factor for avoiding small values; in a typical application we take S_c = 1000.

The trigger for this function is described as:

t_W = \mu_W + \alpha \delta_W    (6.5)

Here \mu_W is the mean and \delta_W the variance of W_{s_1}(m). After some testing, the following
approximation of \alpha gives good voice activation detection at various levels of additive
background noise:

\alpha = 0.2 \, \delta_W^{-0.8}    (6.6)

The voice activation detection function can now be defined as:

VAD(m) = { 1, if W_{s_1}(m) ≥ t_W ;  0, if W_{s_1}(m) < t_W }    (6.7)

VAD(n) is obtained by holding VAD(m) constant over each block of the measure. For example, if the
measure is calculated every 160 samples (block length L = 160 in our case), corresponding to 20 ms
at a sampling rate of 8 kHz, then the first 160 samples of VAD(n) take the value of VAD(m) for m = 1.
Figure 6.4 contains a series of graphs from MATLAB showing how the VAD works.

Figure 6.4: Signal Prior to Multiplication by the VAD Window

In figure 6.4 we can see the signal for the word "Move". The blue colored signal is the word "Move"
and the red colored signal is the VAD window. We can see that the algorithm has correctly detected
where the speech content is. The next figure shows the power variation of the signal, confirming
that the VAD has acted on the basis of the power given in equation 6.3.

Figure 6.5: Power Content of the Input Signal for Word “Move”

We can clearly see in figure 6.5 that there is a significant amount of power in only a limited
portion of the whole 1.5 seconds of the input signal. The algorithm thus correctly locates the
portion of the timeline that effectively contains all the speech content and discards the remaining
portion, which is just background noise. Finally, we would like to see the clean signal, without any
background noise.

Figure 6.6: Clean Signal after multiplication with VAD window, without Background Noise
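
For illustration, the whole VAD computation of equations 6.3 to 6.7 can be sketched in MATLAB as follows; this is a simplified sketch of the approach of [13], and the variable names are ours:

s1 = s1(:).';                                % work on a row vector
L = 160;  Sc = 1000;                         % block length and scale factor
nBlocks = floor(length(s1)/L);
W = zeros(1, nBlocks);
for m = 1:nBlocks
    blk = s1((m-1)*L+1 : m*L);
    W(m) = Sc * mean(blk.^2);                % scaled short-term power, eqs. 6.3 and 6.4
end
alpha = 0.2 * var(W)^(-0.8);                 % eq. 6.6
tW = mean(W) + alpha*var(W);                 % trigger of eq. 6.5
VADn = kron(double(W >= tW), ones(1, L));    % eq. 6.7, held constant over each block
s1 = s1(1:nBlocks*L) .* VADn;                % zero out the non-speech portions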

6.3.3 Frame Blocking and Windowing

After the signal has been pre-emphasized and its end points have been detected, we divide the signal
into frames so that we can exploit the short-time stationarity of speech: a speech signal can be
assumed stationary over a length of about 20 ms in the time domain. We therefore divide it into
blocks of 20 ms. Since our sampling frequency is 8 kHz, each block is 160 samples long. After
dividing the signal into frames of length 160 (i.e. 20 ms each), we

multiply each frame with a Hamming window. The window function is used here to improve the spectral
properties by giving more resolution; we have used a Hamming window in our case. The general block
diagram is as follows:

Figure 6.7: Steps in Frame Blocking and Windowing [13]

Here we can see that the one dimensional signal has now been broken into two dimensions: the index k
indexes the samples within each block, whereas m indexes the successive blocks. In our case, since
the sampling frequency is 8 kHz and the signal length is 1.5 seconds, we get a total of 12000
samples:

samples = 1.5 × 8000 = 12000    (6.8)

Now since each frame is of length 160 and successive frames overlap by 50% (a frame shift of 80
samples, as explained below), a total of 149 frames come out of the 12000 samples. All these values
correspond to the case of our project.

Frame Length = sampling frequency × time of one frame    (6.9)

Frame Length = 8000 × 20 × 10^{-3} = 160

Total No of Frames = (12000 - 160)/80 + 1 = 149    (6.10)

One thing should be noted: in our project we have implemented a 50% overlap between successive
frames (a frame shift of 80 samples) so that we do not miss any spectral properties across the
boundaries. We apply the Hamming window in order to reduce the discontinuity at either end of each
block. The Hamming window function is

w(n) = 0.54 - 0.46 \cos( 2\pi n / (N-1) )    (6.11)

The window that we have used for windowing is shown below:

Figure 6.8: A 160 Point Hamming Window
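
For illustration, frame blocking and windowing can be sketched with the built-in buffer and hamming functions of the Signal Processing Toolbox; for our 12000-sample signal this yields exactly the 149 frames of equation 6.10:

frameLen = 160;  hop = 80;                                % 20 ms frames, 50% overlap
frames = buffer(s1, frameLen, frameLen-hop, 'nodelay');   % one frame per column
win = hamming(frameLen);                                  % the window of eq. 6.11
framesW = frames .* repmat(win, 1, size(frames, 2));      % windowed frames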

6.4 Feature Extraction Block

This is one of the most important blocks of this project, and the one on which the most time has
been spent because of its complexity. It is itself composed of many small parts that together make
up one big block.

In this project we are using Mel Frequency Cepstral Coefficients (MFCC), which represent audio based
on human perception. The human ear hears tones on a linear scale for frequencies below 1 kHz and on
a logarithmic scale for frequencies above 1 kHz; accordingly, the mel-frequency scale has linear
frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. These coefficients have had
great success in speaker recognition applications. They are derived from the Fourier Transform of
the audio clip, but with the frequency bands positioned logarithmically, whereas in the plain
Fourier Transform they are not. Because its frequency bands are positioned logarithmically, MFCC
approximates the human auditory response more closely than other representations, and the
coefficients allow better processing of the data.

In Mel Frequency Cepstral Coefficients, the calculation of the mel cepstrum is the same as for the
real cepstrum, except that the mel cepstrum's frequency scale is warped to correspond to the mel
scale. The mel frequency mapping is shown below:

Figure 6.9: Mel frequency graph

The reason for choosing MFCC has already been given in the opening lines of this article. We have
followed the approach given by [15]. We place 20 triangular filters, linearly spaced in the mel
frequency domain, and convert them to the linear frequency domain by applying the defining equation
relating the mel and linear frequency scales. The filters are given by:

H(k, m) = 0,   for f(k) < f_c(m-1)    (6.12)

H(k, m) = \frac{f(k) - f_c(m-1)}{f_c(m) - f_c(m-1)},   for f_c(m-1) ≤ f(k) < f_c(m)    (6.13)

H(k, m) = \frac{f(k) - f_c(m+1)}{f_c(m) - f_c(m+1)},   for f_c(m) ≤ f(k) < f_c(m+1)    (6.14)

H(k, m) = 0,   for f(k) ≥ f_c(m+1)    (6.15)

The logarithmically spaced filters of our project are shown below:

Figure 6.10: Logarithmically Spaced Triangular Mel Filters

The approach that we have followed is called the Davis approach [15], and it is illustrated below:

Figure 6.11: Davis Implementation of Mel Filters [15]

These 20 filters are spaced according to the mel frequency domain. Each triangular filter produces
one new mel spectrum coefficient, m_k, by summing up its filtered result.

Filter    Lower Limit (Hz)    Upper Limit (Hz)


1 0 154.759
2 77.3795 249.2458
3 163.3126 354.1774
4 258.745 470.7084
5 364.7267 600.121
6 482.4239 743.8391
7 613.1315 903.4442
8 758.2878 1080.6923
9 919.4901 1277.5338
10 1098.5119 1496.1345
11 1297.3232 1738.8999
12 1518.1115 2008.501
13 1763.3063 2307.9044
14 2035.6063 2640.4045
15 2338.0049 3009.6599
16 2673.8324 3419.7335
17 3046.7829 3875.1375
18 3460.9602 4380.8829
19 3920.9215 4942.5344
20 4431.728 5566.272

Table 6.1: Mel Frequency Band Limits for 20 Filters
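
Equations 6.12 to 6.15 translate directly into code. The sketch below builds a 20 x 256 filterbank matrix H; here fc is assumed to be a vector of the 22 band boundary frequencies (so that fc(m), fc(m+1) and fc(m+2) play the roles of f_c(m-1), f_c(m) and f_c(m+1) in the equations), with the band limits of Table 6.1:

K = 256;                                       % DFT bins kept from the 512-point FFT
f = (0:K-1) * 8000/512;                        % linear frequency of each bin (Hz)
H = zeros(20, K);
for m = 1:20
    fl = fc(m);  fm = fc(m+1);  fu = fc(m+2);
    rise = (f >= fl) & (f < fm);
    fall = (f >= fm) & (f < fu);
    H(m, rise) = (f(rise) - fl) ./ (fm - fl);  % eq. 6.13
    H(m, fall) = (f(fall) - fu) ./ (fm - fu);  % eq. 6.14
end                                            % all other bins stay zero (eqs. 6.12 and 6.15)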

As already indicated in the block diagram, there are certain steps we must take to find the MFCC.
These steps are well known and are summarized in the following diagram:

512-point DFT → |X_2(n; m)| → Mel-scaled filter bank → log(m_k) → IDCT → Cepstral Weighting → Delta Cepstrum

Figure 6.12: Summary of Feature Extraction

First of all we take the 512-point FFT of each block of the signal to bring it into the frequency
domain. Since each block has 160 samples, we have to zero pad by 512 - 160 = 352 samples. After
taking the FFT we pass the result through the triangular mel filters, as shown in the following
equation:

m_k = \sum_{n=0}^{N-1} |X_2(n; m)| H_k^{mel}(n)    (6.16)

After this, we take the inverse discrete cosine transform by applying the following equation:

c_s(n; m) = \sum_{k=0}^{N-1} \log(m_k) \cos( \pi k (2n+1) / (2N) ),   n = 0 … N-1    (6.17)

After this, we multiply the cepstrum coefficients with a window function as given by [10]:

W_c(m) = 1 + (Q/2) \sin( \pi m / Q ),   for 1 ≤ m ≤ Q    (6.18)

to give

ĉ_l(m) = c_l(m) · W_c(m)

Finally we have to find the delta cepstrum, which is the time derivative of the sequence of weighted
cepstral vectors. This is approximated by a first-order orthogonal polynomial over a finite length
window of 2K + 1 frames (K = 2 in our case; the interested reader can refer to [10] for details),
centered around the current vector. The cepstral derivative is computed as

Δĉ_l(m) = [ \sum_{k=-K}^{K} k ĉ_{l-k}(m) ] G,   1 ≤ m ≤ Q    (6.19)
where G is a gain term chosen to make the variances of ĉ_l(m) and Δĉ_l(m) equal.

The observation vector O_l used for recognition and training is the concatenation of the weighted
cepstral vector and the corresponding weighted delta cepstrum vector, i.e.,

O_l = { ĉ_l(m), Δĉ_l(m) }

Normally 25 coefficients are taken. Since we used 20 filters, we zero padded at the end of the mel
frequency cepstrum coefficients to get 25 MFCC and 25 Delta MFCC; the Delta MFCC capture the changes
between successive frames.

This concludes the feature extraction block. As an example, the feature vectors of the word "Move"
are plotted in MATLAB in the figure below. There are 149 feature vectors, each of length 50 (25 MFCC
and 25 Delta MFCC). We have also normalized the feature vectors; for details see [13].

Figure 6.13: Feature Vectors of the Word Move
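
For a single windowed frame xw (one column of the frame matrix) and the filterbank matrix H sketched after Table 6.1, the per-frame computation of figure 6.12 can be outlined as follows; dct (Signal Processing Toolbox) stands in here for the cosine transform of equation 6.17:

X = abs(fft(xw, 512));           % 512-point DFT magnitude (zero pads 160 -> 512)
mk = H * X(1:256);               % the 20 mel filter outputs of eq. 6.16
c = dct(log(mk));                % log followed by the cosine transform, eq. 6.17
c = [c; zeros(5, 1)];            % zero pad 20 -> 25 coefficients, as described above
Q = 25;  m = (1:Q).';
Wc = 1 + (Q/2)*sin(pi*m/Q);      % cepstral weighting window of eq. 6.18
ch = c .* Wc;                    % weighted cepstrum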

6.5 Training and Recognition Mode
This module has many small sub modules; all of the Hidden Markov Modeling is actually done in this
module. The models are trained and tested here. We shall explain some sub blocks together, while
emphasizing others by describing them separately under their own headings.

Since we are using a discrete observation symbol density, we shall use the concept of vector
quantization. The approach followed here is similar to that of [10].

6.5.4 Vector Quantization


For the case in which we wish to use an HMM with a discrete observation symbol density, rather than
the continuous vectors above, a vector quantizer (VQ) is required to map each continuous observation
vector into a discrete codebook index. Once the codebook of vectors has been obtained, the mapping
between continuous vectors and codebook indices becomes a simple nearest neighbour computation: the
continuous vector is assigned the index of the nearest codebook vector. Thus the major issue in VQ
is the design of an appropriate codebook for quantization [10].

Fortunately a great deal of work has gone into devising an excellent iterative procedure for designing
codebooks based on having a representative training sequence of vectors. The procedure basically
partitions the training vectors into M disjoint sets (where M is the size of the codebook), represents
each such set by a single vector (v m, 1 ≤ m ≤ M), which is generally the centroid of the vectors in the
training set assigned to the mth region, and then iteratively optimizes the partition and the codebook.
There is a distortion penalty associated with the VQ since we are representing an entire region of the
vector space by a single vector. Clearly it is advantageous to keep the distortion penalty as small as
possible. However, this implies a large size codebook, and that leads to problems in implementing
HMMs with a large number of parameters [10].

6.5.4.1 Theory of Vector Quantization


The approach followed here is that of [16]. Assume we have a training set of MFCC vectors, v_i,
i = 1, 2, …, I, which are a good representation of the types of MFCC vectors that occur when the
words in the vocabulary are pronounced by a wide range of speakers. We want to determine the optimum
set of codebook MFCC vectors, v_m, m = 1, 2, …, M, such that the distortion is minimum.

More formally stated, if we define d(v_R, v_T) as the distance between two MFCC vectors v_R and v_T,
then the goal of vector quantization is to find the set of codebook vectors v_m such that

||D_M|| = \min { (1/I) \sum_{i=1}^{I} \min_m [ d(v_m, v_i) ] }    (6.20)

is satisfied. The quantity ||DM|| is the average distortion of the vector quantizer.

The way in which this equation is solved, for a given value of M, is discussed below. The algorithm
first finds the optimum solution for M = 2 (two codebook entries), then splits each optimum codebook
vector into two components and finds the optimum solution for M̂ = 2M. This procedure iterates until
M is as large as desired. The local distance measure is

d(v_R, v_T) = \frac{v_R V_T v_R'}{v_T V_T v_T'} - 1

where V_T is the autocorrelation matrix of the sequence that gives rise to MFCC vector v_T.

During the course of running the algorithm, several performance criteria are monitored, including

(i) Average distortion, ||DM||.

(ii) Sigma ratio (cluster separation) of the resulting codebook entries, defined as

\sigma = \frac{ (1/M) \sum_{i=1}^{M} (1/(M-1)) \sum_{j=1}^{M} d(v̂_i, v̂_j) }{ ||D_M|| }

where the numerator is the average intercluster distance and ||D_M|| is the average
intracluster distance.

(iii) Cluster cardinality, Ni, defined as the number of tokens in the ith cluster.
(iv) Cluster distortion, di, defined as the average distortion for the ith cluster.

It should be clear that the average distortion ||D_M|| satisfies the relation

||D_M|| = (1/I) \sum_{i=1}^{M} \bar{d}_i N_i

and that the cluster occupancy satisfies the relation

\sum_{i=1}^{M} N_i = I

We have used the vector quantization tool in our case to implement this algorithm. This tool can
be used by typing the command “vqdtool” in MATLAB.

Figure 6.14: VQDTOOL in MATLAB

Since for each block we get a vector of length 50, and for each word we have 149 blocks, each word
yields 149 vectors of length 50. Each of these vectors is mapped to one of the indices in the
codebook, and we get an observation sequence

O = o_1, o_2, o_3, …, o_149
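
Given a codebook matrix C of size 50 x 256 (one codeword per column, e.g. as exported by vqdtool) and the 50 x 149 feature matrix F of an utterance, this mapping is a simple nearest neighbour search; a minimal sketch:

O = zeros(1, size(F, 2));
for t = 1:size(F, 2)
    d = sum((C - repmat(F(:, t), 1, size(C, 2))).^2);   % squared distance to every codeword
    [dmin, O(t)] = min(d);                              % index of the nearest codeword
end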

Once the codebook is prepared, we can work on the training portion of the HMMs. This consists of
optimizing the parameters of each HMM using the Baum Welch criterion [10]. For each word of the
vocabulary we have one Hidden Markov Model, and its parameters A, B, π are optimized using
utterances from a number of speakers. We took inputs from 10 speakers, although the more speakers,
the better (with a large codebook).

Once the HMM of each word is prepared, we can go to the recognition phase, in which a user utters a
word from the vocabulary and it is compared against all the HMMs: the probability of generating the
observation sequence is computed for each HMM, and the HMM (word model) that gives the highest
probability is selected. Training and recognition are summarized below in figure 6.15.

Figure 6.15: Overall Diagram of the Isolated Word Recognizer – using Hidden Markov Model and
Vector Quantization Approach [16]

We have used the alternative Viterbi algorithm for the decision rule, which is simply the Viterbi
algorithm carried out in logarithmic form.
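
A sketch of this logarithmic Viterbi scoring for one word model, with transition matrix A (N x N), discrete emission matrix B (N x M) and initial distribution pi0 (N x 1) as in [10]; MATLAB's log(0) = -Inf handles zero probabilities naturally:

function logP = viterbiLog(A, B, pi0, O)
% Log-domain Viterbi: log-likelihood of the best state path for sequence O.
logA = log(A);  logB = log(B);
delta = log(pi0) + logB(:, O(1));                % initialization
for t = 2:length(O)
    % best predecessor for each state, plus the emission log-probability
    delta = max(repmat(delta, 1, size(A, 1)) + logA, [], 1).' + logB(:, O(t));
end
logP = max(delta);                               % score of the most likely path

Each word model is scored in this way, and the model with the largest logP is selected.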

6.6 Hardware Block

As already mentioned, we shall not explain the hardware portion in great detail in this chapter; we
only describe the most basic things here.

The hardware that is being used here is:

1. A Microcontroller
2. Max 232 IC
3. Serial Port Cable
4. Remote Control

5. Wireless Car

When a word is recognized, a corresponding command is issued to the serial port via MATLAB. A
microcontroller continuously listens to the serial port at the other end and activates the
corresponding button of the remote control based upon the word uttered. We have selected the words
Move, Stop, Left, Right, and Reverse. Moreover, as already mentioned, we have interfaced a car with
the serial port of the computer, which is in turn interfaced with the software in MATLAB. When a
user speaks a word, it is recognized by the system and the car moves accordingly: it moves when the
user says "Move", it stops when the user says "Stop", and so on. This is summarized in the following
block diagram.

MATLAB → Serial Port → Microcontroller → Remote Control → Wireless Car

Figure 6.16: Block Diagram of Hardware Part

Chapter 7

The Hardware of the Project

This is a small chapter, since not much hardware is involved in the project; the basic aim of the
project was to build a Speaker Independent Speech Recognition System with an easy to use GUI.
However, in order to demonstrate that such a system can also be integrated with hardware, for small
home based applications as well as heavy duty industrial ones, we implemented a small hardware
example: controlling a wireless car. We redraw the block diagram to review the components being
used, and then explain them one by one.

Serial Port of Computer → Max 232 → Microcontroller → Remote Control → Wireless Car

Figure 7.1: Block Diagram of Hardware

7.1 Serial Port

The serial port of the computer is accessed from MATLAB. The code for serial port access is:

s = serial('COM14', 'BaudRate', 4800);   % create a serial port object on COM14 at 4800 baud
fopen(s);                                % open the connection
fprintf(s, 'M');                         % send the character corresponding to the word "Move"
fclose(s);                               % close the connection

This code sends the ASCII character 'M' to the microcontroller, corresponding to the word "Move".
When this character is received by the microcontroller, it instructs the remote control to move the
car in the forward direction. Similarly, for reverse we send 'R' and the car moves back.
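
The complete word-to-character mapping can be sketched as below; only 'M', 'R' and 'r' are stated in this chapter, so the characters used for "Stop" and "Left" are illustrative placeholders:

switch recognizedWord                    % the word returned by the recognizer
    case 'Move',    fprintf(s, 'M');     % forward, as above
    case 'Reverse', fprintf(s, 'R');     % backward
    case 'Right',   fprintf(s, 'r');     % turn right (see Section 7.3)
    case 'Stop',    fprintf(s, 'S');     % placeholder, not specified in the text
    case 'Left',    fprintf(s, 'L');     % placeholder, not specified in the text
end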

7.2 Max 232 IC

This IC is the interface between the serial port of the computer and the microcontroller. It
performs the logic level conversion between the two.

7.3 Microcontroller

The microcontroller acts according to what it receives over the serial port from MATLAB. If the
speaker says "Right", the word is recognized and MATLAB sends 'r'; when the microcontroller receives
this, it instructs the remote control to steer the car to the right.

7.4 Remote Control

The remote control of the car is used to give commands to the car; however, its circuit has been
modified according to our needs. Since the remote control operates at 3 volts, we need two power
supplies: a 5 volt supply for the microcontroller and a 3 volt supply for the remote control. We
have used battery cells for the remote and a 5 volt regulated DC supply for the microcontroller. We
used 4 transistors as 4 switches to deliver the commands from the microcontroller pins to the remote
control pins: the microcontroller pins drive the bases of these transistors, and the remote receives
the commands via the transistors.

7.5 Wireless Car

There is nothing special about the wireless car; it operates according to the commands it receives.
The frequency over which it communicates with the remote control is 35 MHz.

Although we have implemented one specific application, once the speech is correctly recognized we
can integrate any application we like (provided the accuracy of the speech recognizer is very good,
as in our case).

Chapter 8

Conclusion and Results

8.1 Results

We have tested our project with a variety of speakers, of both types: those whose voices had been
used to train the models and those whose voices had not. The results were very satisfactory. For
speakers who took part in training the models, accuracy was above 99%; for those who did not,
accuracy was still at least 90%, and this is a conservative figure.

As far as accuracy is concerned, our project shows that HMMs can model the words well and give very
good accuracy. The vector quantizer has not degraded the performance of the speech recognizer to any
great extent, and we are still able to achieve very high accuracy. We have used MFCC because they
correspond to the human perception of sounds.

The amount of memory used by the parameters and the codebook is also modest, equal to 82.33 Kbytes.
The time taken for recognition is less than half a second per word; the time taken for training is
11 seconds per word.
Serial No Parameter Name Result
1 Technique Used for Speech Recognition Hidden Markov Modeling
2 No of Words 5
3 Accuracy- Speaker Dependent 99%
4 Accuracy-Speaker Independent Above 90%
5 Time Required for Training 11 seconds
6 Time Required for Recognition Less than 500 millisecond
7 Memory Required for Codebook 76.3 Kbytes
8 Memory Required for HMM Parameters 6.03 Kbytes
9 Total Memory Required 82.33 Kbytes

Table 8.1: Results of Project

8.2 Snapshots of the GUI

As has already been described, the project has a very simple GUI, so that any user can operate and
navigate it with great ease. Here are some snapshots of the GUI.

Figure 8.1: Welcome Window

Figure 8.2: Main Window

Figure 8.3: Training Window

Figure 8.4: Recognition Window

Figure 8.5: Confirmation Window

8.3 Conclusion

The conclusion of this Bachelor's degree project is that a Speaker Independent Speech Recognition
System has been implemented using Hidden Markov Modeling, together with Vector Quantization and Mel
Frequency Cepstral Coefficients. Apart from that, we have also interfaced hardware with the software
of our project: a wireless car that moves according to the commands issued to it in the form of
speech input. The speech recognizer recognizes the commands and instructs the car accordingly. The
sound recorder of MATLAB works very well with 16 bit resolution and an 8 kHz sampling rate (in our
case). An HMM library for 5 words and a codebook for the MFCC of the words were also built; the
codebook we used holds 256 vectors of length 50 each. The trained models were afterwards used for
recognition and gave very good results. The car moved according to the commands of the user, and we
conclude that any home based or heavy duty industrial application could be interfaced in the same
way. Overall the results are very satisfactory.

8.4 Future Work

We have used a word based acoustic model; however, this model is only good for a limited vocabulary.
If we want to increase the vocabulary, we should move towards a phone based model. There are about
40 phonemes in English, and models should be created for them to enlarge the vocabulary. There is
still a need to increase the vocabulary; if someone can come up with an idea to increase it to a
very large number of words, that would be a big step indeed. Often there are words with very similar
spectral content that degrade the accuracy of the recognizer; those should be removed from the
vocabulary. One should not use synonyms in the vocabulary, as they clearly increase the overhead and
bring nothing new. We have not included the energy information of the frames in our feature vectors,
nor the acceleration coefficients; if this information is added to the codebook, the performance of
the recognizer can be increased drastically. Also, the codebook size used by us is 256; depending on
the speed and time constraints, a bigger codebook can be used to get more accuracy. Moreover, the
distortion measure can be reduced to get a better codebook. Finally, the system can be fully
implemented on a DSP board, as has already been done by some groups.

Appendix A
Vector Quantization in MATLAB

Since we have used the vqdtool of MATLAB for vector quantization, we also include its tutorial here.
The material is taken from the MATLAB help for the reader's convenience. The dialog box of vqdtool
is shown below:

Figure A.1: VQDTOOL of MATLAB

The purpose of the different options is explained below.

Training Set

Enter the samples of the signal you would like to quantize. This data set can be a MATLAB function
or a variable defined in the MATLAB workspace. The typical length of this data vector is 1e5.

Source of initial codebook

Select Auto-generate to have the block choose the initial codebook values. Choose User defined to
enter your own initial codebook values.

Number of levels

Enter the number of codeword vectors, N, in your codebook matrix, where N ≥ 2.

Initial codebook

Enter your initial codebook values. From the Source of initial codebook list, select User defined in
order to activate this parameter. The codebook must have the same number of rows as the training set.
You must provide at least two codeword vectors.

Distortion measure

When you select Squared error, the block finds the nearest codeword by calculating the squared error
(unweighted). When you select Weighted squared error, the block finds the nearest codeword by
calculating the weighted squared error.

Weighting factor

Enter a vector or matrix. The block uses these values to compute the weighted squared error. When
the weighting factor is a vector, its length must be equal to the number of rows in the training set. This
weighting factor is used for each training vector. When the weighting factor is a matrix, it must be the
same size as the training set matrix. The individual weighting factors cannot be negative. The
weighting factor vector or matrix cannot contain all zeros.

Stopping criteria

Choose Relative threshold to enter the maximum acceptable fractional drop in the squared
quantization error. Choose Maximum iteration to specify the number of iterations at which to stop.
Choose Whichever comes first and the block stops the iteration process as soon as either the
relative threshold or the maximum iteration value is attained.

Relative threshold

This parameter is available when you choose Relative threshold or Whichever comes first for the
Stopping criteria parameter. Enter the value that is the maximum acceptable fractional drop in the
squared quantization error.

Maximum iteration

This parameter is available when you choose Maximum iteration or Whichever comes first for the
Stopping criteria parameter. Enter the maximum number of iterations you want the block to perform.

Tie-breaking rules

When a training vector has the same distortion for two different codeword vectors, select Lower
indexed codeword to associate the training vector with the lower indexed codeword, or select Higher
indexed codeword to associate it with the higher indexed codeword.

Codebook update method

When you choose Mean, the new codeword vector is calculated by taking the average of all the
training vector values that were associated with the original codeword vector. When you choose
Centroid, the block calculates the new codeword vector by taking the weighted average of all the
training vector values that were associated with the original codeword vector. Note that if, for the
Distortion measure parameter, you choose Squared error, the Codebook update method parameter is set
to Mean.

Destination

Choose Current model to create a Vector Quantizer block in the model you most recently selected.
Type gcs in the MATLAB Command Window to display the name of your current model. Choose
New model to create a block in a new model file.

Block type

Select Encoder to design a Vector Quantizer Encoder block. Select Decoder to design a Vector
Quantizer Decoder block. Select Both to design a Vector Quantizer Encoder block and a Vector
Quantizer Decoder block.

Encoder block name

Enter a name for the Vector Quantizer Encoder block.

Decoder block name

Enter a name for the Vector Quantizer Decoder block.

Overwrite target block

When you do not select this check box and a Vector Quantizer Encoder and/or Decoder block with the
same block name exists in the destination model, a new Vector Quantizer Encoder and/or Decoder
block is created in the destination model. When you select this check box and a Vector Quantizer
Encoder and/or Decoder block with the same block name exists in the destination model, the
parameters of these blocks are overwritten by new parameters.

Generate Model

Click this button and VQDTool uses the parameters that correspond to the current plots to set the
parameters of the Vector Quantizer Encoder and/or Decoder blocks.

Design and Plot

Click this button to design a quantizer using the parameters on the left side of the GUI and to update
the performance curve and entropy plots on the right side of the GUI.

You must click Design and Plot to apply any changes you make to the parameter values in the
VQDTool GUI.

Export Outputs

Click this button, or press Ctrl+E, to export the Final Codebook, Mean Squared Error, and Entropy
values to the workspace, a text file, or a MAT-file.

References

[1] ISIP internet accessible speech recognition technology

[2] CMU Sphinx- Open Source Speech Recognition Engines

[3] Samudravijaya K and Maria Barot. A comparison of public domain software tools for speech
recognition.

[4] L. R. Rabiner and M. R. Sambur, "An algorithm for determining the endpoints of isolated
utterances," Bell System Technical Journal, vol. 54, pp. 297-315, 1975.

[5] Rabiner-Juang. “Fundamentals of Speech Recognition”.

[6] L. R. Rabiner et al., "Speaker Independent Recognition of Isolated Words using Clustering
Techniques," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-27,
pp. 336-349, August 1979.

[7] C. Rowden, Speech Processing. New York: McGraw-Hill Book Company, 1991.

[8] A. Syrdal, R. Bennett, and S. Greenspan, Applied Speech Technology. Boca Raton, Florida: CRC
Press, 1994.

[9] A. Waibel and K. Lee, Knowledge Based Approaches, Introduction. San Mateo, CA: Morgan Kaufmann
Publishers Inc., 1990, pp. 198-202.

[10] Lawrence R. Rabiner “A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition”.

[11] Richard D. Peacocke and Daryl H. Graf (Bell-Northern Research), "An Introduction to Speech and
Speaker Recognition".

[12] D. O'Shaughnessy, Speech Communication: Human and Machine. Reading, Mass.: Addison Wesley,
1987.

[13] Mikael Nilsson et al. “Speech Recognition using Hidden Markov Model”

[14] L. E. Baum and J. A. Eagon, "An inequality with applications to statistical estimation for
probabilistic functions of a Markov process and to a model for ecology," Bull. Amer. Math. Soc.,
vol. 73, pp. 360-363, 1967.

[15] Sigurdur Sigurdsson et al., "Mel Frequency Cepstral Coefficients: An Evaluation of Robustness
of MP3 Encoded Music".

[16] L. R. Rabiner, S. E. Levinson, and M. M. Sondhi, "On the application of vector quantization
and hidden Markov models to speaker-independent, isolated word recognition".

[17] Shahid Mahmood Awan; “Isolated Urdu Word Speech Recognition (IUWSR) by Hidden
Markov Model (HMM)”

[18] Milan G Mehta; “Speech Recognition System”

[19] IVR Systems [Online]. Available: http://www.ivr-system.net/?paged=2

[20] Hidden Markov Models [Online]. Available:
http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/hmms/s2_pg1.html
