UNIVERSITY OF GONDAR
INSTITUTE OF TECHNOLOGY
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
Focus Area: Communication Engineering

Thesis
Title of Thesis: Voice Recognition Using MFCC Algorithm
Group members: ID No.
1. Natnael Asmamaw 07630/09
2. Sisay Ayehu 01037/09
3. Yidideya Addis 01238/09
4. Zelalem Bogale 01318/09
Submission Date: 25/8/2021
Advisor: Mr. Belayneh Eskeziya
Gondar, Ethiopia
VOICE RECOGNITION SYSTEM USING MFCC ALGORITHM 2013 E.C
DECLARATION OF AUTHORSHIP
We declare that this project, titled ‘Voice Recognition Using MFCC Algorithm’, and the work
presented in it are our own. We confirm that:
• This work was done wholly or mainly while in candidature for a bachelor's degree at this
University.
• No part of this project has previously been submitted for a degree or any other
qualification at this University.
• We have consulted the published work of others in preparing the literature review.
Group members: signature
1. Natnael Asmamaw
2. Sisay Ayehu
3. Yidideya Addis
4. Zelalem Bogale
UOG,IOT , ECEG BSc thesis for communication Eng. Page I

APPROVAL
This is to certify that the thesis entitled “VOICE RECOGNITION SYSTEM” is submitted in partial
fulfillment of the requirements for the award of the Bachelor of Science Degree in Electrical and
Computer Engineering (Communication stream) at the University of Gondar. It is a record of the
team members' own work, carried out during the academic year 2021 G.C. under the supervision and
guidance of Mr. Belayneh Eskeziya. The extent and sources of information derived from the
existing literature have been indicated throughout the thesis at the appropriate places. To the best
of their knowledge, the matter embodied in the thesis has not been submitted to any other
university or institute for the award of any degree.
Project Adviser: Signature Date
1. Mr. Belayneh Eskeziya
Focus area coordinator
2. Mr. Thomas Worku
Department head
3. Mr. Mekete Asmare
Team Members: Signature Date
1. Natnael Asmamaw
2. Sisay Ayehu
3. Yidideya Addis
4. Zelalem Bogale

ACKNOWLEDGMENT
Many individuals have contributed to this thesis from its beginning to its final stage.
We express our deep gratitude and sincere thanks to the School of Electrical and Computer
Engineering for giving us this opportunity and for providing a project room.
We would like to express our special thanks to Mr. Belayneh Eskeziya and all of our
instructors for their valuable recommendations and for giving us the golden opportunity to do this
final thesis on voice recognition using the MFCC algorithm.
Last but not least, we want to thank our friends, who valued our hard work and
encouraged us, and finally God, who made all things possible, for giving us help and patience
through such a hard time.

ABSTRACT
It is easy for humans to recognize a familiar voice, but using a computer program to identify a voice
by comparing it with others is a herculean task. The difficulty arises when developing the
algorithm that recognizes the human voice: it is impossible to say a word exactly the
same way on two different occasions, and computer analysis of human speech gives different
interpretations depending on the speed of speech delivery. This thesis gives a detailed
description of the process behind the implementation of an effective voice recognition algorithm.
The algorithm uses the discrete Fourier transform to compare the frequency spectra of two voice
samples, because the spectrum remains largely unchanged when the speech is slightly varied.
Markov's inequality is then used to determine whether the two voices came from the same
person. The algorithm is implemented and tested using MATLAB.
Keywords: Markov's inequality, discrete Fourier transform, frequency spectra, voice
recognition.

Table of Contents
DECLARATION OF AUTHORSHIP
APPROVAL
ACKNOWLEDGMENT
ABSTRACT
List of Figures
List of Acronyms
CHAPTER ONE
1. INTRODUCTION
1.1 Background of Thesis
1.1.1 Human Voice
1.1.2 The Speech Signal
1.1.3 Speech Synthesis
1.1.4 Speech Analysis
1.1.5 Voice Recognition
1.1.6 Text-Dependent
1.1.7 Text Independent
1.1.8 Voice Recognition Techniques
1.1.8.1 Template Matching
1.1.8.2 Feature Analysis
1.2 Statement of the Problem
1.3 Objectives of the Thesis
1.3.1 General Objective
1.3.2 Specific Objective
1.4 Methodology
1.5 Scope of the Thesis
1.6 Organization of the Project
CHAPTER TWO
2. LITERATURE REVIEW
2.1 Signal Sampling
2.2 The Characteristic Parameters of Speech Signal
2.3 MFCC
2.4 GMM (Gaussian Mixture Model)
2.5 Linear Predictive Coding
CHAPTER THREE
3. SYSTEM DESIGN AND ANALYSIS
3.1 Speech Recognition Basics
3.1.1 Utterance
3.2 The Feature Extraction
3.2.1 Frame Blocking
3.2.2 Windowing
3.2.3 FFT
3.2.4 Mel Frequency Cepstrum
3.2.5 Cepstrum
3.2.6 MFCC
3.2.7 MFCC Approach
CHAPTER FOUR
4. RESULT AND DISCUSSION
CHAPTER FIVE
5. CONCLUSION AND RECOMMENDATION
5.1 Applications
5.2 Recommendation
REFERENCES
APPENDIX
APPENDIX A: MFCC MATLAB Code
APPENDIX B: Testing Code
APPENDIX C: Voice Recording MATLAB Code
APPENDIX D: Training and Testing Code

List of Figures
Figure 1.1 Schematic Diagram of the Speech Production/Perception Process
Figure 1.2 Block diagram of the system
Figure 1.3 Speaker Identification Training
Figure 1.4 Speaker Identification Testing
Figure 3.1 Utterance of voice
Figure 3.2 Feature extraction process
Figure 3.3 Filter bank on Mel frequency
Figure 3.4 Cepstrum signal line
Figure 3.5 Mel frequency mapped
Figure 3.6 MFCC approach
Figure 3.7 MFCC approach algorithm
Figure 3.8 FFT approach algorithm
Figure 4.1 Voice training waveform
Figure 4.2 Voice testing phase waveform

List of Acronyms:
ASR: Automatic speech recognition
DCT: Discrete cosine transform
DTW: Dynamic time warping
EGG: Electroglottograph
FFT: Fast Fourier transform
FT: Fourier transform
HMM: Hidden Markov model
LPC: Linear predictive coding
LT: Linear predictive
MFCC: Mel frequency cepstral coefficient
OCR: Optical character recognition
PC: Personal computer
PIN: Personal identification number
SNR: Signal-to-noise ratio

CHAPTER ONE
1. INTRODUCTION
1.1 Background of Thesis
Voice recognition, or voice authentication, is an automated method of identifying the
person who is speaking from the characteristics of their voice. Voice is one of many
forms of biometrics used to identify an individual and verify their identity. Humans naturally
recognize a familiar voice, but getting a computer to do the same is a more difficult task, because
it is impossible to say a word exactly the same way on two different occasions.
Advances in computing capability have led to more effective ways of recognizing the human
voice using feature extraction. Voice recognition is a highly effective
biometric technique that can be used for telephone banking and for forensic investigation by
law enforcement agencies. Speech recognition technology is the process of extracting
characteristic information from a person's speech, processing it on a computer, and
recognizing the content of the speech. It is interdisciplinary: modern speech recognition
draws on many fields, such as signal
processing, information theory, phonetics, linguistics and artificial intelligence. Over the past
few decades, scholars have done much research on speech recognition technology. With the
development of computers, microelectronics and digital signal processing, speech
recognition now plays an important role. Using a speech recognition system not only
improves the efficiency of daily life, but also makes people's lives more diversified.
Meanwhile, vector quantization (VQ) theory was introduced, and linear prediction technology
matured. In the 1980s, artificial neural network (ANN) technology was
applied successfully in the field of speech recognition. Artificial neural
networks offered a new approach to voice recognition research, with the
advantages of non-linearity, robustness, fault tolerance and the ability to learn. At the same
time, continuous speech recognition algorithms were proposed, which moved
speech recognition research from the micro level to the macro level.

1.1.1 Human Voice

The voice consists of sound made by a human being using the vocal folds for talking, singing,
laughing, crying, screaming, etc. The human voice is specifically that part of human sound
production in which the vocal folds are the primary sound source. The mechanism for generating
the human voice can be subdivided into three parts: the lungs, the vocal folds within the larynx, and
the articulators.
1.1.2 The Speech Signal

Human communication can be viewed as a complete process running from speech
production to speech perception between the talker and the listener; see Figure 1.1.
Figure 1.1 Schematic Diagram of the Speech Production/Perception Process
The process has five elements: A. speech formulation, B. human vocal mechanism, C. acoustic air,
D. perception of the ear, and E. speech comprehension. The first element (A. speech formulation) is
associated with the formulation of the speech signal in the talker's mind. This formulation is
used by the human vocal mechanism (B. human vocal mechanism) to produce the actual speech
waveform. The waveform is transferred via the air (C. acoustic air) to the listener. During this
transfer the acoustic wave can be affected by external sources, for example noise, resulting in a
more complex waveform. When the wave reaches the listener's hearing system (the ears), the
listener perceives the waveform (D. perception of the ear) and the listener's mind (E. speech
comprehension) starts processing this waveform to comprehend its content, so that the listener
understands what the talker is trying to say. One issue in speech recognition is how to
“simulate” the way the listener processes the speech produced by the talker. Several actions
take place in the listener's head and hearing system while speech signals are processed. The
perception process can be seen as the inverse of the speech production process. The basic
theoretical unit for describing how linguistic meaning is brought to the formed speech in the mind
is called the phoneme. Phonemes can be grouped based on the properties of either the time
waveform or the frequency characteristics, and classified into the different sounds produced by the human
vocal tract.
Speech is a time-varying signal with the following properties:
• It is a well-structured communication process.
• It depends on known physical movements.
• It is composed of known, distinct units (phonemes).
• It is different for every speaker.
• It may be fast, slow, or varying in speed.
• It may have high pitch or low pitch, or be whispered.
• It is subject to widely varying types of environmental noise.
• It may not have distinct boundaries between units (phonemes).
• It has an unlimited number of words.
1.1.3 Speech Synthesis:

Speech synthesis is the artificial production of human speech. A text-to-speech (TTS) system
converts normal language text into speech; other systems render symbolic linguistic
representations such as phonetic transcriptions into speech. Synthesized speech can also be created
by concatenating pieces of recorded speech that are stored in a database. Systems differ in the
size of the stored speech units: a system that stores phones or diphones provides the largest
output range, but may lack clarity. For specific usage domains, the storage of entire words or
sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of
the vocal tract and other human voice characteristics to create a completely "synthetic" voice
output. The quality of a speech synthesizer is judged by its similarity to the human voice and by
its ability to be understood. An intelligible text-to-speech program allows people with visual
impairments or reading disabilities to listen to written works on a home computer. Many
computer operating systems have included speech synthesizers since the early 1980s.
1.1.4 Speech Analysis:

Voice problems that require voice analysis most commonly originate from the vocal cords, since
they are the sound source and are thus most actively subject to tiring. However, analysis of the vocal
cords is physically difficult. The location of the vocal cords effectively prohibits direct
measurement of movement. Imaging methods such as X-rays or ultrasound do not work because
the vocal cords are surrounded by cartilage, which distorts image quality. Movements of the vocal
cords are rapid (fundamental frequencies are usually between 80 and 300 Hz), which prevents the
use of ordinary video. High-speed video provides an option, but in order to see the vocal cords
the camera has to be positioned in the throat, which makes speaking rather difficult. The most
important indirect methods are inverse filtering of sound recordings and electroglottography
(EGG). In inverse filtering, the speech sound is recorded outside the mouth and then
filtered by a mathematical method to remove the effects of the vocal tract. This method produces
an estimate of the waveform of the pressure pulse, which in turn indirectly indicates the
movements of the vocal cords. The other indirect method is the electroglottograph,
which operates with electrodes attached to the subject's throat close to the vocal cords. Changes
in the conductivity of the throat indicate how large a portion of the vocal cords is
touching. It thus yields one-dimensional information about the contact area. Neither
inverse filtering nor EGG is sufficient to describe the glottal movement completely; both
provide only indirect evidence of that movement.
1.1.5 Voice Recognition

Voice recognition (sometimes referred to as speaker recognition) is the identification of the
person who is speaking by extracting features of their voice, so that a questioned voice print can be
compared against a known voice print. In this technology, sounds, words or phrases
spoken by humans are converted into electrical signals, and these signals are transformed into
coding patterns to which meaning has been assigned. There are two major applications of voice
recognition technologies and methodologies. The first is voice verification or authentication,
in which the speaker claims a certain identity and the voice is used to
verify this claim. The second is voice identification, which is the task of determining an unknown
speaker's identity. Put another way, voice verification is one-to-one matching, where one
speaker's voice is matched to one template or voice print, whereas voice identification is
one-to-many matching, where the speaker's voice is compared against many voice templates. A speaker
recognition system has two phases: enrollment and verification. During enrollment, the
speaker's voice is recorded and typically a number of features are extracted to form a voice print
or template. In the verification phase, a speech sample or “utterance” is compared against a
previously created voice print. For identification systems, the utterance is compared against
multiple voice prints in order to determine the best match, while verification systems compare an
utterance against a single voice print.
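The difference between one-to-one verification and one-to-many identification can be illustrated with a minimal Python sketch. This is not the thesis's MATLAB implementation: the feature vectors, the speaker names, the Euclidean distance measure and the threshold value are all assumptions chosen for illustration only.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def verify(utterance, claimed_print, threshold):
    """One-to-one: accept only if the utterance is close enough to ONE voice print."""
    return distance(utterance, claimed_print) <= threshold

def identify(utterance, voice_prints):
    """One-to-many: return the enrolled speaker whose voice print is nearest."""
    return min(voice_prints, key=lambda name: distance(utterance, voice_prints[name]))

# Hypothetical voice prints created during enrollment (values are made up).
voice_prints = {
    "speaker_1": [1.0, 2.0, 3.0],
    "speaker_2": [4.0, 0.5, 1.5],
}
utterance = [1.1, 2.1, 2.8]  # features of a new utterance (also made up)

print(verify(utterance, voice_prints["speaker_1"], threshold=0.5))
print(verify(utterance, voice_prints["speaker_2"], threshold=0.5))
print(identify(utterance, voice_prints))
```

Verification needs a threshold because it must be able to reject an impostor, whereas identification, as sketched here, always returns the closest enrolled speaker.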
Voice recognition systems can also be divided into two categories: text-independent and
text-dependent.
1.1.6 Text-Dependent:
In a text-dependent system, the text must be the same for enrollment and verification. Shared secret
passwords, PINs or knowledge-based information can be used in order to create a
multi-factor authentication scenario.
1.1.7 Text Independent:

Text-independent systems are most often used for speaker identification, as they require very
little cooperation from the speaker. In this case the text used during enrollment is different from the
text used during verification. In fact, enrollment may happen without the user's knowledge, as is
the case in many forensic applications.
1.1.8 Voice Recognition Techniques

The most common approaches to voice recognition can be divided into two classes: Template
Matching and Feature Analysis.
1.1.8.1 Template Matching:

Template matching is the simplest technique and has the highest accuracy when used properly,
but it also suffers from the most limitations. As with any approach to voice recognition, the first
step is for the user to speak a word or phrase into a microphone. The electrical signal from the
microphone is digitized by an analog-to-digital (A/D) converter and stored in memory. To
determine the "meaning" of this voice input, the computer attempts to match the input with a
digitized voice sample, or template, that has a known meaning. This technique is closely analogous
to traditional command input from a keyboard. The program contains the input template
and attempts to match this template with the actual input using a simple conditional statement.
This type of system is known as "speaker-dependent", and recognition accuracy can be about 98
percent.
1.1.8.2 Feature Analysis:

A more general form of voice recognition is available through feature analysis, and this technique
usually leads to "speaker-independent" voice recognition. Instead of trying to find an exact or
near-exact match between the actual voice input and a previously stored voice template, this
method first processes the voice input using Fourier transforms or linear predictive coding
(LPC), then attempts to find characteristic similarities between the expected inputs and the
actual digitized voice input. These similarities will be present for a wide range of speakers, and
so the system need not be trained by each new user. The types of speech differences that the
speaker-independent method can deal with, but which pattern matching would fail to handle,
include accents and varying speed of delivery, pitch, volume and inflection.
Speaker-independent speech recognition has proven to be very difficult, with some of the greatest hurdles
being the variety of accents and inflections used by speakers of different nationalities.
Recognition accuracy for speaker-independent systems is somewhat lower than for
speaker-dependent systems, usually between 90 and 95 percent. We have implemented the template matching
technique.
This approach has been intensively studied and is also the backbone of most voice recognition
products on the market.
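The Fourier-transform step of feature analysis can be sketched as follows. This is a minimal Python illustration rather than the thesis's MATLAB code: the synthetic tones stand in for real voice frames, the circular shift stands in for a small timing difference between two recordings, and the function names are hypothetical.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each DFT bin, computed directly from the DFT definition."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

n = 32
tone = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]   # 4 cycles per frame
delayed = tone[5:] + tone[:5]                                   # same tone, shifted in time
other = [math.sin(2 * math.pi * 7 * t / n) for t in range(n)]  # a different tone

# Sample-by-sample comparison is sensitive to the time shift ...
time_domain_gap = euclidean(tone, delayed)
# ... while the magnitude spectra of the shifted frames still agree.
same_tone_gap = euclidean(dft_magnitudes(tone), dft_magnitudes(delayed))
different_tone_gap = euclidean(dft_magnitudes(tone), dft_magnitudes(other))

print(time_domain_gap, same_tone_gap, different_tone_gap)
```

In this toy case the spectral distance between the two copies of the same tone is essentially zero even though their samples differ, which is why spectra are a more forgiving basis for comparison than raw waveforms. Real speech frames are not periodic within the frame, so the agreement would only be approximate.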
1.2 Statement of the Problem

The main problem that drives this thesis is the limitation of existing security access systems.
Many application areas, such as time and attendance systems, access control systems, telephone
banking and broking, biometric login to telephone-aided shopping systems, information and
reservation services, security control for confidential information, and forensics,
have security problems. There is therefore a need to address security in a systematic manner, which we
have tried to do with our system.

1.3 Objectives of the Thesis
1.3.1 General Objective

The objective of the project is to design a voice recognition model that can be applied to
different applications.
1.3.2 Specific Objective

• To record voice directly in MATLAB.
• To use the MFCC and FFT feature extraction techniques.
• To compare the recorded voice with stored samples.
• To detect the speaker.
1.4 Methodology
Our work was organized and carried out through a sequence of stages. First,
we reviewed the related literature. Then we drew up the general block diagram of our
system, which enables us to analyze each component of the system easily, as shown in the
figures that follow.
Speech communication has evolved to be efficient and robust, and it is clear that the route to
computer-based speech recognition is the modeling of the human system. Unfortunately, from a
pattern recognition point of view, humans recognize speech through a very
complex interaction between many levels of processing, using syntactic information as well as very
powerful low-level pattern classification and processing. Powerful classification algorithms and
sophisticated front ends are, in the final analysis, not enough; many other forms of knowledge,
e.g. linguistic, semantic and pragmatic, must be built into the recognizer. Nor, even at a lower
level of sophistication, is it sufficient merely to generate “a good” representation of speech (i.e. a
good set of features to be used in a pattern classifier); the classifier itself must have a
considerable degree of sophistication. It is the case, however, that poor features do not effectively discriminate
between classes and, further, that the better the features, the easier the classification task.
Automatic speech recognition is therefore an engineering compromise between the ideal, i.e. a
complete model of the human, and the practical, i.e. the tools that science and technology
provide and that costs allow. At the highest level, all speaker recognition systems contain two
main modules (refer to Figure 1.2): feature extraction and feature matching. Feature extraction is the
process that extracts a small amount of data from the voice signal that can later be used to
represent each speaker. Feature matching is the actual procedure of identifying the unknown
speaker by comparing features extracted from his/her voice input with those from a set of
known speakers. We will discuss each module in detail in later sections.
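The two modules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the features used here (mean frame log-energy and zero-crossing rate) are simple stand-ins for the MFCC features the thesis develops later, the synthetic tones stand in for recorded voices, and all function and speaker names are hypothetical.

```python
import math

def extract_features(signal, frame_len=8):
    """Feature extraction: reduce a signal to [mean frame log-energy, zero-crossing rate]."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    log_energy = [math.log(sum(s * s for s in frame) + 1e-12) for frame in frames]
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))
    return [sum(log_energy) / len(log_energy), crossings / (len(signal) - 1)]

def match(features, known):
    """Feature matching: name of the known speaker with the nearest feature vector."""
    def gap(name):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(features, known[name])))
    return min(known, key=gap)

def toy_voice(freq, amp, n=64):
    """Synthetic stand-in for a recorded voice signal (not real speech)."""
    return [amp * math.sin(2 * math.pi * freq * t / n + 0.3) for t in range(n)]

# Features of the known speakers, computed once and stored.
known = {
    "speaker_1": extract_features(toy_voice(3, 1.0)),
    "speaker_2": extract_features(toy_voice(9, 0.4)),
}

# Unknown input: the same "voice" as speaker_1, recorded slightly quieter.
unknown = toy_voice(3, 0.9)
print(match(extract_features(unknown), known))
```

Each 64-sample signal is reduced to two numbers, which reflects the idea that feature extraction keeps only a small amount of data per speaker. Matching then works on those numbers rather than on the raw samples.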
Figure 1.2 Block diagram of the system
[Block diagram: input speech -> feature extraction -> similarity against the reference models of speaker #1 to speaker #N -> maximum selection -> identification result (speaker ID)]
Figure 1.3 Speaker Identification Training

[Block diagram: input speech -> feature extraction -> similarity -> decision -> verification result (accept/reject); the claimed speaker ID selects the reference model (speaker #M), and a threshold feeds the decision]
Figure 1.4 Speaker Identification Testing
All recognition systems have to serve two different phases. The first is referred to as the
enrollment session or training phase, while the second is referred to as the operation session
or testing phase. In the training phase, each registered speaker has to provide samples of their
speech so that the system can build or train a reference model for that speaker. In speaker
verification systems, a speaker-specific threshold is also computed from the training samples.
During the testing (operational) phase (see Figure 1.3), the input speech is matched with the
stored reference model(s) and a recognition decision is made.
Speech recognition is a difficult task and is still an active research area. Automatic speech
recognition works on the premise that a person’s speech exhibits characteristics that are unique
to the speaker. However, this task is challenged by the high variability of input speech
signals. The principal source of variance is the speaker himself. Speech signals in training and
testing sessions can differ greatly because of many factors, such as voices changing with time,
health conditions (e.g. the speaker has a cold), speaking rates, etc. There are also other
factors, beyond speaker variability, that present a challenge to speech recognition technology.
Examples of these are acoustical noise and variations in recording environments (e.g. the
speaker uses different telephone handsets). The challenge is to make the system “robust”. So
what characterizes a “robust” system?
When people use an automatic speech recognition (ASR) system in a real environment, they always
hope it can achieve recognition performance as good as that of human ears, which constantly
adapt to environmental characteristics such as the speaker, the background noise and the
transmission channel. Unfortunately, at present, the capacity of machines to adapt to unknown
conditions is much poorer than ours. In fact, the performance of speech recognition systems
trained with clean speech may degrade significantly in the real world because of the mismatch
between the training and testing environments. If the recognition accuracy does not degrade very
much under mismatched conditions, the system is called “robust”.
1.5 Scope of the Thesis

We assumed that the voice recognition security system has its own database, so we did not
develop one. We also assumed that the testing phase and training phase are matched directly, so
that there is no delay in comparing two voices; this is therefore not a limitation. Because of
our limited database knowledge, we demonstrate the system only by recording a testing voice and
a training voice and matching them.
1.6 Organization of the Project

Chapter one is the introductory part of our project. It gives an overview of the work,
including the statement of the problem, the project objective, the methodologies we used, the
project scope and the limitations we had. Chapter two covers related work and gives an overview
of the literature.
Chapter three describes the details of our system design and implementation, including a short
description of the components we used in our design. Chapter four presents the results and
their discussion. Chapter five gives the conclusion and recommendations.

CHAPTER TWO
2. LITERATURE REVIEW
Speech is the most natural way of communicating for human beings. While this has been true
since the dawn of civilization, the invention and widespread use of the telephone, television,
radio and audio storage media has given more importance to voice communication and voice
processing. Advances in digital signal processing technology have enabled the use of speech in
many different areas of application, such as speech compression, enhancement, synthesis and
recognition.
A speech recognition or Automatic Speech Recognition (ASR) system converts the acoustic signal
(audio) to a machine-readable format. ASR recognizes the words, and these words serve as input
for a particular application, either as commands or for document preparation. Nowadays there is
great interest in designing an intelligent machine that can recognize the spoken word,
understand its meaning and take corresponding actions. One of the most difficult aspects of
performing research in speech recognition by machine is its interdisciplinary nature. The early
studies focused on a monolithic approach to individual problems.
2.1 Signal Sampling

A speech signal has two properties that matter for sampling. First, it changes with time but
shows short-time characteristics, which means that the signal is stable over a very short
period of time. Second, the spectral energy of the human speech signal is normally concentrated
between 0 and 4000 Hz.
Speech is an analog signal when spoken by a human, and it is converted to a digital signal when
it is input to a computer. This conversion rests on the most basic theory of signal processing:
signal sampling. Sampling converts the continuous-time analog speech signal X(t) into the
discrete-time signal X(n) while keeping the characteristics of the original signal. To carry
out this discretization, the Nyquist theorem is adopted. The theorem requires that the sampling
frequency Fs be at least twice the highest frequency Fmax in the signal for the signal to be
sampled and rebuilt, which can be written as Fs ≥ 2*Fmax. It gives the lowest sampling frequency
that still captures enough of the characteristics of the original analog signal.
An unnecessarily high sampling frequency produces too much data (N = T/Δt samples for a signal
of length T), which increases the workload of the computer and takes too much storage. On the
contrary, the discrete-time signal will not represent the characteristics of the original
signal if the sampling frequency is too low and the sampling points are insufficient. [5]
So we use about 8000 Hz as the sampling frequency, according to the Nyquist theorem
Fs ≥ 2*Fmax.
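As a rough illustration of this trade-off, the following MATLAB sketch computes the minimum sampling frequency and the number of samples N = T/Δt. The values Fmax = 4000 Hz and T = 1 s are assumptions chosen to match the 0-4000 Hz band mentioned above; they are not measurements from this project.

```matlab
% Minimal sketch (assumed values): Nyquist rate and sample count.
Fmax = 4000;            % highest speech frequency of interest in Hz (assumed)
Fs   = 2 * Fmax;        % lowest sampling frequency allowed by Fs >= 2*Fmax
T    = 1;               % signal duration in seconds (assumed)
dt   = 1 / Fs;          % sampling interval
N    = round(T / dt);   % number of samples, N = T/dt

% A 3 kHz tone lies below Fs/2, so sampling it at Fs does not alias it.
n = 0:N-1;
x = sin(2*pi*3000*n/Fs);
fprintf('Fs = %d Hz, N = %d samples\n', Fs, N);
```

Doubling Fs would double N, and with it the storage and processing cost, without adding information about a signal limited to 4000 Hz.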
2.2 The Characteristic Parameters of Speech Signal

Before speech can be recognized, the characteristic parameters of the input speech signal need
to be extracted. The purpose of characteristic parameter extraction is to analyze the speech
signal, remove the redundant information that has nothing to do with speech recognition and
obtain the important information. Generally speaking, there are two kinds of characteristic
parameters: those in the time domain and those in a transform domain. The advantage of
time-domain characteristic parameters is their simple calculation. However, they cannot be
compressed and are not suitable for characterizing the amplitude spectrum.
So far, speech signal characteristic parameters are almost all based on short-time spectrum
estimation. In this research we studied two such parameters that can be used in MATLAB: the
Linear Predictive Coefficients (LPC) and the Mel Frequency Cepstral Coefficients (MFCC). Linear
predictive analysis is one of the most important and widely used speech analysis techniques.
The importance of this method is grounded both in its ability to provide accurate estimates of
the speech parameters and in its relative speed of computation. Linear predictive analysis is
based on the assumption that speech can be characterized by a predictor model which looks at
past values of the output alone; hence it is an all-pole model in the Z-transform domain.
The method of Mel Frequency Cepstral Coefficients is also a powerful technique, and is
calculated on the mel scale. Before calculating the MFCCs, the whole speech signal is divided
into frames, and a Hamming window and Fast Fourier Transform are applied to each frame. The
power spectrum is segmented into a number of critical bands by means of a filter bank, which
typically consists of overlapping triangular filters that adapt the frequency resolution to the
properties of the human ear. The discrete cosine transform applied to the logarithm of the
filter-bank outputs gives the raw MFCC vector. The MFCCs therefore imitate the perception
behavior of the ear and give better identification than LPC.
2.3 MFCC
The Mel Frequency Cepstral Coefficient (MFCC) is a characteristic parameter widely used in
speech recognition and speaker recognition. Before MFCCs, researchers mostly used Linear
Prediction Coefficients or Linear Prediction Cepstral Coefficients as the characteristic
parameters of the speech signal. MFCCs represent the short-time power spectrum of a speech
signal. They are based on a linear cosine transform of the log power spectrum on a nonlinear
mel scale of frequency: the DCT converts the log mel spectrum back into the time domain. The
result is a set of acoustic vectors.
MFCCs are commonly used as characteristic parameters in speech recognition algorithms. In the
theory of MFCC, the critical band is a very important concept that solves the problem of
frequency division; it is also an important indicator of the mel frequency. The purpose of
introducing the critical bandwidth is to describe the masking effect. When two similar or
identical pitches sound at the same time, the human ear cannot distinguish them and perceives
only one pitch. Two pitches are perceived separately only when the difference between the two
frequencies is larger than a certain bandwidth, which is called the critical bandwidth. Within
the critical bandwidth, if the sound pressure of a speech signal with noise is constant, then
its loudness is also constant. But once the noise bandwidth exceeds the critical bandwidth, the
loudness changes noticeably.
2.4 GMM (Gaussian Mixture Model)

The Gaussian Mixture Model (GMM) can be seen as a probability density function. GMMs are widely
used in many fields, such as recognition, prediction and clustering analysis. The parameters of
a GMM are estimated from the training data by the Expectation-Maximization (EM) algorithm or by
Maximum A Posteriori (MAP) estimation. Compared with many other models and methods, the GMM has
many advantages, hence it is widely used in the speech recognition field. In fact, the GMM can
be seen as a continuous HMM (CHMM) with only one state, yet it cannot simply be regarded as a
hidden Markov model (HMM), because the GMM completely avoids segmenting the system into states.
Compared with the HMM, the GMM makes research on speech recognition simpler without loss of
performance. Compared with the VQ method, the GMM describes the distribution of the
characteristic parameters in feature space by a weighted sum of several Gaussian density
functions, and from this point of view the GMM is more accurate. It is very hard to model the
human organs of pronunciation directly, but we can build a model that expresses the process of
sound production as a probability model for speech, and the GMM is a probability model that
meets this requirement.
The formula of an M-order Gaussian mixture model can be defined as:
$$p(x_t) = \sum_{i=1}^{M} \alpha_i \, p_i(x_t) \qquad (2)$$
where x_t is a D-dimensional vector of speech characteristic parameters, p_i(x_t) is the i-th
component of the Gaussian mixture model, namely the probability density function of each
component, α_i is the weight coefficient of p_i(x_t), and M is the order of the Gaussian
mixture model, namely the number of probability density functions. The weights and component
densities satisfy:
$$\sum_{i=1}^{M} \alpha_i = 1 \qquad (2.1)$$
$$p_i(x_t) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{(x_t - u_i)^T \, \Sigma_i^{-1} \, (x_t - u_i)}{2} \right\} \qquad (2.2)$$
Thus each component p_i(x_t) of the Gaussian mixture model is described by its mean vector u_i
and covariance matrix Σ_i.
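To make equations (2), (2.1) and (2.2) concrete, the sketch below evaluates a two-component mixture density at a single feature vector. All numbers (weights, means, covariances and the test vector) are assumed purely for illustration and do not come from the project's data.

```matlab
% Minimal sketch of equations (2), (2.1) and (2.2); every value is assumed.
D     = 2;                                  % feature dimension
alpha = [0.6 0.4];                          % mixture weights, sum to 1 as in (2.1)
u     = [0 0; 3 3]';                        % column i is the mean vector u_i
Sigma = cat(3, eye(2), [2 0.5; 0.5 1]);     % Sigma(:,:,i) is the covariance of component i
xt    = [1; 1];                             % one feature vector x_t

p = 0;
for i = 1:numel(alpha)
    d = xt - u(:, i);
    S = Sigma(:, :, i);
    % Component density p_i(x_t), equation (2.2)
    pi_x = exp(-0.5 * (d' * (S \ d))) / ((2*pi)^(D/2) * sqrt(det(S)));
    % Weighted sum over the components, equation (2)
    p = p + alpha(i) * pi_x;
end
```

Solving `S \ d` instead of forming the inverse of the covariance matrix is a common numerical choice; the result is the same quadratic form as in (2.2).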
EM is the abbreviation of expectation maximization, an iterative method that searches for the
maximum likelihood estimate of the parameters in a statistical model. Each iteration of the EM
algorithm has two steps. The first is the expectation (E) step, which builds a function for the
expectation of the log-likelihood. The second is the maximization (M) step, which computes the
parameters that maximize the expected log-likelihood found in the E step. In many research
fields, expectation maximization is the most popular technique for calculating the parameters
of a parametric mixture model distribution. Counting initialization, the algorithm has three
stages: initialization, the expectation step and the maximization step.
Each class j of the M clusters is described by a parameter vector (θ), composed of the mean
(µ_j) and the covariance matrix (Σ_j). Initially, the implementation can generate the values of
the mean (µ_j) and of the covariance matrix (Σ_j) randomly. The EM algorithm aims to approximate
the parameter vector (θ) of the real distribution.
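The expectation step estimates the probability that each element belongs to each cluster,
p(c_j | x_k). Each element is described by an attribute vector (x_k). With initial guesses for
the parameters of the mixture model, the probability of hidden state i can be defined as:
$$P(i_t = i \mid x_t, \lambda) = \frac{p_i \, p(x_t \mid i_t = i, \lambda)}{p(x_t \mid \lambda)} = \frac{p_i \, b_i(x_t)}{\sum_{m=1}^{M} p_m \, b_m(x_t)} \qquad (2.3)$$
where p_i is the weight of component i and b_i(x_t) is its density, i.e. the quantities written
as α_i and p_i(x_t) in equation (2).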
The maximization step estimates the parameters of the probability distribution of each class
for the next iteration. First, the mean (µ_j) of class j is computed as the mean of all points,
weighted by the relevance degree of each point. The advantage of the GMM is that a sample point
is not just given a hard label, but also receives a probability for each class, which is
important information. The amount of calculation in each step of the GMM is large, and because
the solution is based on the EM algorithm, it may fall into a local extremum, depending on the
initial values.
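The E and M steps can be sketched for the simplest case, a one-dimensional mixture of two Gaussians. The toy samples, the initial guesses and the fixed count of 20 iterations are all assumptions made for illustration; a real system would run this on MFCC vectors with a convergence test.

```matlab
% Minimal sketch of EM for a 1-D, two-component GMM (assumed toy data).
x  = [-2.1 -1.9 -2.0 -1.8 -2.2 2.9 3.1 3.0 3.2 2.8]';   % N-by-1 samples
w  = [0.5 0.5];         % initial weights (assumed)
mu = [-1 1];            % initial means (assumed)
s2 = [1 1];             % initial variances (assumed)
gauss = @(x, m, v) exp(-(x - m).^2 ./ (2*v)) ./ sqrt(2*pi*v);

logL = zeros(1, 20);
for it = 1:20
    % E step: responsibility of each component for each sample, as in (2.3)
    b   = [gauss(x, mu(1), s2(1)), gauss(x, mu(2), s2(2))];   % N-by-2 densities
    num = b .* repmat(w, numel(x), 1);
    px  = sum(num, 2);                    % mixture density of each sample
    r   = num ./ repmat(px, 1, 2);        % each row sums to 1
    logL(it) = sum(log(px));              % log-likelihood before the update

    % M step: re-estimate weights, means and variances from the responsibilities
    Nk = sum(r, 1);
    w  = Nk / numel(x);
    mu = (r' * x)' ./ Nk;
    s2 = sum(r .* (repmat(x, 1, 2) - repmat(mu, numel(x), 1)).^2, 1) ./ Nk;
end
```

With these well separated samples the means settle near -2 and 3. Starting both means at the same value would leave the two components identical, which is one way the dependence on the initial values shows up.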
2.5 Linear Predictive Coding

LPC starts with the assumption that a speech signal is produced by a buzzer at the end of a tube
(voiced sounds), with occasional added hissing and popping sounds. Although apparently crude,
this model is actually a close approximation to the reality of speech production. The glottis (the

space between the vocal cords) produces the buzz, which is characterized by its intensity
(loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is
characterized by its resonances, which are called formants. Hisses and pops are generated by the
action of the tongue, lips and throat during sibilants and plosives. LPC analyzes the speech signal
by estimating the formants, removing their effects from the speech signal, and estimating the
intensity and frequency of the remaining buzz. The process of removing the formants is called
inverse filtering, and the remaining signal after the subtraction of the filtered modeled signal is
called the residue. The numbers which describe the intensity and frequency of the buzz, the
formants, and the residue signal, can be stored or transmitted somewhere else. LPC synthesizes
the speech signal by reversing the process: use the buzz parameters and the residue to create a
source signal, use the formants to create a filter (which represents the tube), and run the source
through the filter, resulting in speech. Because speech signals vary with time, this process is done
on short chunks of the speech signal, which are called frames; generally 30 to 50 frames per
second give intelligible speech with good compression.
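The analysis and inverse filtering steps can be sketched on a synthetic signal. The example below is only an illustration under stated assumptions: the "tube" is a second-order all-pole filter with made-up coefficients, the "buzz" is a single pulse, and the predictor is estimated with the autocorrelation method, which is one common way to do LPC analysis and not necessarily the one used elsewhere in this work.

```matlab
% Minimal sketch: 2nd-order linear prediction by the autocorrelation method.
% The signal is synthetic and the filter coefficients are assumed.
a_true = [1.2 -0.8];                    % x(n) = 1.2*x(n-1) - 0.8*x(n-2) + e(n)
e = [1; zeros(399, 1)];                 % single excitation pulse (the "buzz")
x = filter(1, [1 -a_true], e);          % all-pole synthesis filter (the "tube")

p = 2;                                  % predictor order
r = zeros(p + 1, 1);
for k = 0:p
    r(k + 1) = sum(x(1:end-k) .* x(1+k:end));   % autocorrelation at lag k
end
a_est = toeplitz(r(1:p)) \ r(2:p+1);    % solve the normal equations R*a = r

% Inverse filtering: remove the estimated resonance to recover the residue.
residue = filter([1 -a_est'], 1, x);
```

Here the residue is again a single pulse, so storing the two coefficients and the residue is enough to rebuild `x` by running the residue back through the synthesis filter.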

CHAPTER THREE
3. SYSTEM DESIGN AND ANALYSIS

Voice recognition for authentication allows users to identify themselves using nothing but
their voices. This can be much more convenient than traditional means of authentication, which
require carrying a key or remembering a PIN. There are a few distinct ways of using the human
voice for authentication, i.e. there are different kinds of voice recognition systems for
authentication purposes.
A banking application is one example: during a phone call handled by an automatic dialog
system, it may be necessary to provide the account information of the caller. The speech
recognition part of the system first recognizes the number that has been spoken. The same speech
signal is then used by the speaker authentication part of the system to check whether the
biometric template of the account holder matches the voice characteristics.
Security agencies have several means of collecting information. One of these is electronic
eavesdropping on telephone and radio conversations. As this results in large quantities of
data, filter mechanisms must be applied in order to find the relevant information.
3.1 Speech Recognition Basics

3.1.1 Utterance
An utterance is the vocalization (speaking) of a word or words that represent a single meaning to
the computer. Utterances can be a single word, a few words, a sentence, or even multiple
sentences.
Figure 3.1 Utterance of voice

3.2 The Feature Extraction

This stage is known as the front-end processing of speech. The main objective of feature
extraction is to simplify recognition by summarizing the vast amount of speech data without
losing the acoustic properties that define the speech. A schematic diagram of the steps is
shown in Figure 3.2.
Deriving the acoustic characteristics of the speech signal is called feature extraction.
Feature extraction is used in both the training and testing phases. It consists of the
following steps:
1. Frame blocking
2. Windowing
3. FFT (Fast Fourier Transform)
4. Mel-frequency wrapping
5. Cepstrum (Mel frequency cepstral coefficients)
[Block diagram: input signal -> frame blocking -> windowing -> FFT -> Mel-frequency wrapping -> cepstrum]
Figure 3.2 Feature extraction process
3.2.1 Frame Blocking

Investigations show that the characteristics of a voice signal remain fixed over a short enough
time interval (the signal is called quasi-stationary). For this reason, speech signals are
processed in short time intervals. The signal is divided into frames, with sizes usually
between 30 and 100 milliseconds. Each frame overlaps the previous frame by a predefined amount.
The overlap scheme aims to smooth the transition from frame to frame.
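A minimal sketch of frame blocking follows. The frame length of 300 samples (30 ms at 10 kHz) and the hop of 100 samples are assumed values; the placeholder signal simply holds its own sample indices so that the overlap is easy to see.

```matlab
% Minimal sketch of frame blocking; frame length and hop size are assumed.
fs    = 10000;                      % sampling frequency in Hz
x     = (1:2500)';                  % placeholder signal (its own sample indices)
N     = 300;                        % samples per frame, 30 ms at 10 kHz
shift = 100;                        % hop size, so frames overlap by N - shift samples
nFrames = floor((numel(x) - N) / shift) + 1;

frames = zeros(N, nFrames);         % one frame per column
for k = 1:nFrames
    start = (k - 1) * shift + 1;
    frames(:, k) = x(start : start + N - 1);
end
```

Storing one frame per column means that the later windowing and FFT steps can be applied column by column.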

3.2.2 Windowing
The second step is to window all frames. This is done to eliminate discontinuities at the edges
of the frames. If the window function is defined as w(n), 0 ≤ n ≤ N-1, where N is the number of
samples per frame, the resulting signal is y(n) = x(n) w(n). Generally, Hamming windows are
used.
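The windowing step can be sketched as follows. The Hamming window is written out from its standard formula, w(n) = 0.54 - 0.46 cos(2πn/(N-1)), so that no toolbox function is needed; the frame length N = 300 and the constant test frame are assumptions.

```matlab
% Minimal sketch of windowing one frame with a Hamming window (assumed N).
N = 300;                                    % samples per frame
n = (0:N-1)';
w = 0.54 - 0.46 * cos(2*pi*n / (N-1));      % Hamming window w(n), 0 <= n <= N-1
x = ones(N, 1);                             % stand-in frame with a constant value
y = x .* w;                                 % windowed frame y(n) = x(n) w(n)
```

The window falls to 0.08 at both edges, which is what suppresses the discontinuities between the end of one frame and the start of the next.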
3.2.3 FFT
The next step is to take the fast Fourier transform of each frame. The FFT is a fast algorithm
for computing the discrete Fourier transform, and it converts each frame from the time domain
to the frequency domain.
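As a small illustration, the sketch below takes the FFT of one frame containing a 1 kHz tone and locates the spectral peak. The frame length of 250 samples and the tone frequency are assumed so that the tone falls exactly on an FFT bin.

```matlab
% Minimal sketch: FFT of one frame and the frequency of its strongest bin.
fs = 10000;                       % sampling frequency in Hz
N  = 250;                         % samples in the frame (assumed)
n  = (0:N-1)';
x  = sin(2*pi*1000*n/fs);         % 1 kHz test tone (assumed)

X   = fft(x);                     % complex spectrum of the frame
mag = abs(X(1:N/2+1));            % magnitudes of the non-negative frequencies
f   = (0:N/2)' * fs / N;          % frequency of each bin in Hz
[~, idx] = max(mag);
peakHz = f(idx);                  % frequency of the strongest component
```

The bin spacing is fs/N = 40 Hz, so a frame of this length cannot separate components that are closer together than that.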
3.2.4 Mel Frequency Cepstrum

The human ear perceives frequency non-linearly. Research shows that the scale is linear up to
1 kHz and logarithmic above. The mel-scale (from "melody") filter bank, which characterizes the
frequency resolution of the human ear, is used as a set of band-pass filters at this stage of
identification. The signal of each frame is passed through the mel-scale band-pass filters to
mimic the human ear. As mentioned previously, psychophysical studies have shown that human
perception of the frequency content of sounds for speech signals does not follow a linear
scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is
measured on a scale known as the “mel scale”. The mel frequency scale has linear frequency
spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a point of reference, the pitch
of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. We can
therefore use the following approximate formula to calculate the mels for a given frequency f
in Hz:
$$\mathrm{mel}(f) = 2595 \, \log_{10}\left(1 + \frac{f}{700}\right) \qquad (3.1)$$
One approach to simulating the subjective spectrum is to use a filter bank, with one filter for
each desired mel frequency component. Each filter has a triangular band-pass frequency
response, and the spacing as well as the bandwidth is determined by a constant mel frequency
interval. The modified spectrum of S(ω) thus consists of the output power of these filters when
S(ω) is the input. The number of mel cepstral coefficients, K, is usually chosen as 20. Note
that this filter bank is applied in the frequency domain, so it simply amounts to applying the
triangle-shaped windows of Figure 3.3 to the spectrum. A useful way of thinking about this
mel-warped filter bank is to see each filter as a histogram bin (where bins overlap) in the
frequency domain. A useful and efficient way to implement this is to consider these triangular
filters on the mel scale, where they are in effect equally spaced filters.
Figure 3.3 Filter bank on the mel frequency scale
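A triangular mel filter bank of this kind can be sketched as below. The choices K = 20 filters, a 256-point FFT and fs = 10 kHz are assumptions, and the flat power spectrum is only a stand-in for the spectrum of a real frame.

```matlab
% Minimal sketch of a triangular mel filter bank; K, fs and nfft are assumed.
fs = 10000;  nfft = 256;  K = 20;
hz2mel = @(f) 2595 * log10(1 + f / 700);
mel2hz = @(m) 700 * (10.^(m / 2595) - 1);

melEdges = linspace(0, hz2mel(fs/2), K + 2);    % equally spaced on the mel scale
hzEdges  = mel2hz(melEdges);                    % the same points back in Hz
f = (0:nfft/2) * fs / nfft;                     % frequencies of the FFT bins

H = zeros(K, numel(f));                         % one filter per row
for k = 1:K
    lo = hzEdges(k);  mid = hzEdges(k+1);  hi = hzEdges(k+2);
    rising  = (f - lo) / (mid - lo);
    falling = (hi - f) / (hi - mid);
    H(k, :) = max(0, min(rising, falling));     % triangle that peaks at mid
end

% Filter-bank output power for a power spectrum S (flat stand-in spectrum).
S = ones(numel(f), 1);
energies = H * S;
```

Because the edges are equally spaced in mel, the filters become wider in Hz as the centre frequency rises, which is the histogram-bin picture described above.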
3.2.5 Cepstrum
The name cepstrum was derived from "spectrum" by reversing its first four letters. The cepstrum
is the Fourier transform of the logarithm, with unwrapped phase, of the Fourier transform of
the signal.
• Mathematically: cepstrum of signal = FT(log(FT(the signal)) + j2πm), where m is the integer
required to properly unwrap the angle or imaginary part of the complex logarithm function.
• Algorithmically: signal -> FT -> log -> phase unwrapping -> FT -> cepstrum.
The real cepstrum uses the logarithm function defined for real values, while the complex
cepstrum uses the complex logarithm function defined for complex values. The real cepstrum uses
only the information in the magnitude of the spectrum, whereas the complex cepstrum holds
information about both the magnitude and the phase of the initial spectrum, allowing the
reconstruction of the signal. The cepstrum can be calculated in many ways. Some of them need a
phase-warping algorithm, others do not. The following figure shows the cepstrum signal line.
[Block diagram: Fourier transform -> log -> discrete cosine transform]
$$s(n) = e(n) * \theta(n) \qquad (3.2)$$
Figure 3.4 Cepstrum signal line
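The reason the cepstrum is useful for a signal of the form (3.2) is that the logarithm turns the convolution into a sum. The sketch below checks this for the real cepstrum, computed as the inverse FFT of the log magnitude spectrum (a common practical form). The excitation, the impulse response and the use of circular convolution are assumptions made to keep the example short and exact.

```matlab
% Minimal sketch: the real cepstrum of a convolution is the sum of the cepstra.
N = 64;
n = (0:N-1)';
e = zeros(N, 1);  e(1) = 1;  e(9) = 0.5;    % excitation: a pulse and a weaker echo (assumed)
theta = 0.9 .^ n;                           % decaying impulse response (assumed)
s = real(ifft(fft(e) .* fft(theta)));       % circular convolution s = e * theta, as in (3.2)

rceps = @(x) real(ifft(log(abs(fft(x)))));  % real cepstrum from the log magnitude
c_s     = rceps(s);
c_e     = rceps(e);
c_theta = rceps(theta);
additivityError = max(abs(c_s - (c_e + c_theta)));
```

In the cepstral domain the slowly varying part (the impulse response) and the excitation therefore appear as separate additive terms, which is what allows them to be told apart.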
3.2.6 MFCC
In this paper we use Mel frequency cepstral coefficients (MFCCs). MFCCs are coefficients that
represent audio based on perception, and they have had great success in speaker recognition
applications. They are derived from the Fourier transform of the audio clip. In this technique
the frequency bands are positioned logarithmically, whereas in the plain Fourier transform the
frequency bands are spaced linearly. Because the frequency bands are positioned
logarithmically, MFCC approximates the response of the human auditory system more closely than
linearly spaced bands do. These coefficients enable better data processing. The mel cepstrum
computed for the MFCCs is the same as the real cepstrum, except that its frequency scale is
warped to the corresponding mel scale.

Figure 3.5 Mel frequency mapping
$$m = 1127.01048 \, \log_e\left(1 + \frac{f}{700}\right) \qquad (3.3)$$
$$f = 700 \left(e^{m / 1127.01048} - 1\right) \qquad (3.4)$$
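Equations (3.3) and (3.4) can be checked numerically with two anonymous functions. The function names `hz2mel` and `mel2hz` are our own labels, and the 1 kHz test frequency is chosen because it should map to about 1000 mels.

```matlab
% Minimal sketch of the mel mapping (3.3) and its inverse (3.4).
hz2mel = @(f) 1127.01048 * log(1 + f / 700);      % (3.3); log is the natural logarithm
mel2hz = @(m) 700 * (exp(m / 1127.01048) - 1);    % (3.4)

m1000 = hz2mel(1000);                 % about 1000 mels for a 1 kHz tone
fBack = mel2hz(m1000);                % maps back to 1000 Hz

% The same mapping written with a base-10 logarithm and the constant 2595.
m10 = 2595 * log10(1 + 1000 / 700);
```

The two forms agree because 1127.01048 multiplied by ln(10) is approximately 2595; in MATLAB `log` is the natural logarithm, so the constant 2595 must be paired with `log10`.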
3.2.7 MFCC Approach

A block diagram of the structure of an MFCC processor is shown in Figure 3.6. The speech input
is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to
minimize the effects of aliasing in the analog-to-digital conversion. Signals sampled at this
rate can capture all frequencies up to 5 kHz, which covers most of the energy of sounds
generated by humans. The main purpose of the MFCC processor is to mimic the behavior of the
human ear. In addition, MFCCs are shown to be less susceptible to the variations mentioned
earlier than the speech waveforms themselves.

[Block diagram: continuous speech -> silence detection -> windowing -> FFT -> spectrum -> Mel-frequency warping -> mel spectrum -> cepstrum -> mel cepstrum]
Figure 3.6 MFCC Approach
We first stored the speech signal as a 10000-sample vector. We observed in our experiment that
the actual uttered speech, after eliminating the static portions, came to about 2500 samples,
so we carried out silence detection with a simple threshold technique to extract the actual
uttered speech. What we wanted to achieve was a voice-based biometric system capable of
recognizing isolated words. As our experiments revealed, almost all the isolated words were
uttered within 2500 samples. When we passed this speech signal through an MFCC processor, it
split the signal up in the time domain using overlapping windows, each with about 250 samples.
Thus, when we convert this into the frequency domain, we have only about 250 spectrum values
under each window. This implied that converting it to the mel scale would be redundant, as the
mel scale is linear up to 1000 Hz. So we eliminated the block that did the mel warping and
directly used the overlapping triangular windows in the frequency domain. We obtained the
energy within each triangular window, followed by the DCT of their logarithms, to achieve good
compaction within a small number of coefficients, as described by the MFCC approach.
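The simple threshold technique for silence detection can be sketched as follows. The synthetic 10000-sample recording with 2500 voiced samples mirrors the sizes quoted above, but the 200 Hz tone and the threshold of 5% of the peak amplitude are assumptions; the threshold actually used in our experiment is not stated here.

```matlab
% Minimal sketch of threshold-based silence detection (synthetic recording).
fs = 10000;
voiced = 0.5 * sin(2*pi*200*(0:2499)'/fs);          % 2500 samples standing in for the word
x = [zeros(3000, 1); voiced; zeros(4500, 1)];       % 10000-sample recording with silence

thr    = 0.05 * max(abs(x));           % amplitude threshold: 5% of the peak (assumed)
active = find(abs(x) > thr);           % samples whose magnitude exceeds the threshold
speech = x(active(1) : active(end));   % keep everything from first to last active sample
```

Keeping the whole span between the first and last active samples, instead of only the samples above the threshold, preserves the zero crossings inside the word.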

MFCC approach algorithm
1. Start.
2. Record your voice for training, and record the voice of the person to be tested.
3. Apply silence detection to each recording.
4. Window each recording.
5. Convert the speech to the frequency domain with the FFT.
6. Define the overlapping triangular windows.
7. Determine the energy within each window.
8. Determine the DCT of the spectrum energy (mel frequency cepstrum).
9. Determine the mean square error between the training and testing results.
10. If the error is < 1.5, the speakers are the same user; otherwise they are not the same user.
11. End.
Figure 3.7 MFCC approach algorithm
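The last steps of this algorithm, the DCT of the log window energies followed by the mean square error test against 1.5, can be sketched as below. The window energies are made-up stand-ins, and the DCT is written as an unnormalised DCT-II matrix so that no toolbox function is required; only the threshold of 1.5 comes from the flow chart.

```matlab
% Minimal sketch of the DCT and mean-square-error decision (stand-in energies).
K = 20;                                  % number of triangular windows (assumed)
[n, k] = meshgrid(0:K-1, 0:K-1);
C = cos(pi * (n + 0.5) .* k / K);        % unnormalised DCT-II matrix, row k+1 is basis k

trainEnergy = linspace(1, 5, K)';        % stand-in energies from the training voice
testSame    = trainEnergy * 1.05;        % nearly the same spectrum: same speaker
testOther   = flipud(trainEnergy) * 3;   % different spectral shape: another speaker

feat = @(E) C * log(E);                  % DCT of the log energies
mse  = @(a, b) mean((a - b).^2);         % mean square error between feature vectors
threshold = 1.5;                         % decision threshold from the flow chart

sameUser  = mse(feat(trainEnergy), feat(testSame))  < threshold;
otherUser = mse(feat(trainEnergy), feat(testOther)) < threshold;
```

Because the DCT here is unnormalised, the numerical size of the error depends on K, so a threshold such as 1.5 is only meaningful for one fixed choice of scaling and number of windows.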

FFT approach algorithm
1. Start.
2. Training: record your voice 10 times to create the database.
3. Save these recordings to an 8820*20 matrix.
4. Truncate the rows of the matrix to a small length.
5. Convert the individual columns to the frequency domain.
6. Normalize the spectra of each recording.
7. Find the standard deviations.
8. Testing: record the voice of the person to be recognized.
9. Window the recording, convert the speech with the FFT and normalize it.
10. Compare both standard deviations.
11. Confirm whether the user's voice is within two standard deviations. If yes, permission to access is granted; if no, access is not permitted (denied).
Figure 3.8 FFT approach algorithm
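One possible reading of the two-standard-deviation check is sketched below: a test spectrum is accepted when every frequency bin lies within the training mean plus or minus two standard deviations. This interpretation, the synthetic spectra and the small sizes (8 bins, 10 recordings) are all assumptions made for illustration.

```matlab
% Minimal sketch of the two-standard-deviation check (synthetic spectra).
nBins = 8;  nRec = 10;
base  = (1:nBins)';                                    % stand-in magnitude spectrum
train = zeros(nBins, nRec);
for r = 1:nRec
    train(:, r) = base + 0.1 * sin(r + (1:nBins)');    % small variation between recordings
end
train = train ./ repmat(max(train, [], 1), nBins, 1);  % normalise each recording to peak 1

mu = mean(train, 2);                                   % mean spectrum per frequency bin
sd = std(train, 0, 2);                                 % standard deviation per frequency bin

% Accept when every bin of the normalised test spectrum is within mean +/- 2*sd.
accept = @(s) all(abs(s / max(s) - mu) <= 2 * sd + 1e-12);

genuineAccepted  = accept(2.5 * mu);        % louder copy of the average spectrum
impostorAccepted = accept(flipud(base));    % spectrum with a different shape
```

Normalising every spectrum to a peak of 1 before comparing makes the decision independent of how loudly the word was spoken.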

CHAPTER FOUR
4. RESULT AND DISCUSSION
When the voice recorded in the training phase matches the voice recorded in the testing phase
within the minimum requirement (the threshold), the result grants permission to access the
system. A third set, the numbers from zero to nine, has also been used for testing.
The MATLAB code implementing the GMM and MFCC algorithms was first simulated on the first
sample set, the words “Yes” and “No”, to show that the algorithm works for different isolated
words. The first sample set was recorded from two people, with 2 different samples from each
person, giving a total of 4 samples. The output figures for each person’s speech in the left
column indicate that the error between the training sample and the testing samples is very
small, which shows that the algorithm performs well in recognizing the first word. The
remaining plots are the errors of different test samples in one file; the content of the speech
can still be discerned quickly when the speech content changes. The simulation results show
that the algorithm reaches a high, though not perfect, success probability for two words with
different pronunciations in the absence of noise.
When the voice recorded in the testing phase does not match the voice recorded in the training
phase, the result is that access to the system is denied.
The waveform of the voice recorded in the training phase
In this project, we recorded two different sets of speech samples. The first set is "YES" and
"NO" from 2 persons, which have different pronunciations, while the second set is "ON" and
"OFF" from 2 persons, which have similar pronunciations.

Figure 4.1 Voice training waveform
The figures are simulated on the second sample set, the test words “On” and “Off”, which was
also recorded from two people, taking 2 different samples from each person for a total of 4
samples. The three output figures show the corresponding error between the trained signal and
the reference signal. The pronunciations of "ON" and "OFF" are almost the same, which makes
them very hard to distinguish by common methods. The left column shows the same behavior as the
experiment on “Yes” and “No”: the error between the training sample and the testing samples is
again quite small. In this case it can be concluded that whether the word pair has similar
pronunciations or not has no influence on the algorithm. Meanwhile, the right column of the
figure shows the errors of different test samples in one file; the content of the speech can
again be discerned quickly when the speech content changes, and the success probability is
again not perfect.
When noise is added to the signal, as shown in Tables 7 and 8, the recognition performance
changes with the SNR level. It can be observed that the performance for a word pair of
similarly pronounced monosyllables like "ON" and "OFF" is so poor in a very noisy environment
that the total success probability is only 25% when the SNR equals 1, but it is very stable at
higher SNR levels.

The waveform of the voice recorded in the testing phase
Figure 4.2 Voice testing phase waveform

CHAPTER FIVE
5. CONCLUSION AND RECOMMENDATION
The goal of this paper was to create a voice recognition system and apply it to the speech of
an unknown speaker, by investigating the extracted features of the unknown speech and comparing
them with the stored extracted features for each different voice in order to identify the
unknown voice. Feature extraction is done using MFCC (Mel Frequency Cepstral Coefficients).
Both the MFCC approach and the FFT approach were tried, but they could not achieve the required
result.
5.1 Applications
One application area is single-purpose command and control systems, such as voice dialing for
cellular, home and office phones, where multi-function computers (PCs) are redundant.
Some of the applications of speaker verification systems are:
• Time and Attendance Systems
• Access Control Systems
• Telephone-Banking/Broking
• Biometric Login to telephone aided shopping systems
• Information and Reservation Services
• Security control for confidential information
• Forensic purposes
Voice-based telephone dialing is one of the applications we simulated. The key focus of this
application is to aid the physically challenged in executing a mundane task like telephone
dialing. Here the user initially trains the system by uttering the digits from 0 to 9. Once the
system has been trained, it can recognize the digits uttered by the user who trained it. This
system also adds some inherent security, as a system based on the cepstral approach is speaker
dependent. The algorithm was run on a particular speaker and the MFCC coefficients were
determined. The algorithm was then applied to a different speaker and the mismatch was clearly
observed. Thus the inherent security provided by the system was confirmed.
Systems have also been designed that incorporate both speech and speaker recognition. Typically
a user has two levels of check. She/he first has to speak the right password to gain access to
a system. The system verifies not only that the correct password has been said but also the
authenticity of the speaker. The ultimate goal is to have a system that combines speech, iris
and fingerprint recognition to implement access control.
5.2 Recommendation
This paper focused on “Isolated voice Recognition”. We feel the idea can be extended to
“Continuous voice Recognition” and ultimately to a language-independent recognition
system based on algorithms which make these systems robust. Statistical models could be used
together with the MFCC features for this purpose.
It is important to note that this design is limited by the database used: when we simulated the
project, it did not exactly recognize the required output.
The speech detection used in this work is based only on the frame energy, which is not good
for a noisy environment with low SNR. In such conditions the error rate in determining the
beginning and end of speech segments increases greatly, which directly degrades recognition
performance in the pattern recognition stage. So a more effective detection method should be
used. One option is a statistical approach that finds a distribution which can separate the
noise and speech from each other.
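One simple way to make energy-based detection less sensitive to noise is sketched below. It assumes the first few frames of a recording contain only background noise, estimates the mean and standard deviation of their energies, and marks as speech every frame whose energy exceeds the mean by k standard deviations. The frame length, the number of noise frames, and k are illustrative values and were not taken from this project.

```python
import statistics


def frame_energies(samples, frame_len):
    """Sum of squared samples for each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples, frame_len=100, noise_frames=5, k=3.0):
    """Return (start, end) sample indices of the detected speech, or None.

    The threshold adapts to the noise level instead of being a fixed constant.
    """
    energies = frame_energies(samples, frame_len)
    noise = energies[:noise_frames]
    if not noise:
        return None
    threshold = statistics.mean(noise) + k * statistics.pstdev(noise)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```

Because the threshold follows the measured noise statistics, the same code can be used at different noise levels. It still relies on the assumption that the recording starts with silence.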

APPENDIX
APPENDIX: A
MFCC MATLAB CODE
fs = 10000;                               % Sampling frequency (Hz)
t = hamming(4000);                        % Hamming window to smooth the speech signal
w = [t ; zeros(6000,1)];
f = (1:10000);
mel = 2595 * log10(1 + f / 700);          % Linear to Mel frequency scale conversion
tri = triang(100);
nwin = 20;                                % Number of overlapping triangular windows
win = zeros(10000, nwin);                 % Column k holds window k for frequency domain analysis
for k = 1:nwin
    offset = 50 * (k - 1);                % Each window is shifted by 50 bins from the previous one
    win(offset + 1 : offset + 100, k) = tri;
end
x = wavrecord(1 * fs, fs, 'double');      % Record and store the uttered speech
plot(x);
wavplay(x, fs);
i = 1;
while i < length(x) && abs(x(i)) < 0.05   % Silence detection
    i = i + 1;
end
x(1 : i) = [];                            % Remove the leading silence
x(6000 : 10000) = 0;                      % Zero the tail and restore a length of 10000 samples
x1 = x .* w;
mx = fft(x1);                             % Transform to frequency domain
nx = abs(mx(floor(mel)));                 % Mel warping
nx = nx ./ max(nx);
sx = zeros(1, nwin);
for k = 1:nwin
    % Determine the energy of the signal within each window
    % by summing the square of the magnitude of the spectrum
    sx(k) = sum((nx .* win(:, k)) .^ 2);
end
sx = log(sx);
dx = dct(sx);                             % Determine DCT of log of the spectrum energies
fid = fopen('sample.dat', 'w');
fwrite(fid, dx, 'real*8');                % Store this feature vector as a .dat file
fclose(fid);
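To check the arithmetic of the Mel conversion and of the final log and DCT steps by hand, a minimal Python sketch (standard library only) is given here. It assumes the common form of the Mel formula with a base-10 logarithm and an orthonormal DCT-II, which to our understanding is the scaling MATLAB's dct uses by default. The function names are illustrative and are not part of the project code.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the Mel scale (base-10 logarithm form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def dct_ii(values):
    """Orthonormal DCT-II of a sequence, computed directly from its definition."""
    n = len(values)
    result = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values)
        )
        result.append(scale * total)
    return result


def cepstral_coefficients(band_energies):
    """Take the log of each band energy, then the DCT of the resulting vector."""
    return dct_ii([math.log(e) for e in band_energies])
```

With this form of the formula, 1000 Hz maps to approximately 1000 Mel, which is a convenient sanity check for the conversion.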

APPENDIX: B
Testing Code:
fs = 10000;                               % Sampling frequency (Hz)
t = hamming(4000);                        % Hamming window to smooth the speech signal
w = [t ; zeros(6000,1)];
f = (1:10000);
mel = 2595 * log10(1 + f / 700);          % Linear to Mel frequency scale conversion
tri = triang(100);
nwin = 20;                                % Number of overlapping triangular windows
win = zeros(10000, nwin);                 % Column k holds window k for frequency domain analysis
for k = 1:nwin
    offset = 50 * (k - 1);                % Each window is shifted by 50 bins from the previous one
    win(offset + 1 : offset + 100, k) = tri;
end
y = wavrecord(1 * fs, fs, 'double');      % Store the uttered password for authentication
i = 1;
while i < length(y) && abs(y(i)) < 0.05   % Silence detection
    i = i + 1;
end
y(1 : i) = [];                            % Remove the leading silence
y(6000 : 10000) = 0;                      % Zero the tail and restore a length of 10000 samples
y1 = y .* w;
my = fft(y1);                             % Transform to frequency domain
ny = abs(my(floor(mel)));                 % Mel warping
ny = ny ./ max(ny);
sy = zeros(1, nwin);
for k = 1:nwin
    % Determine the energy of the signal within each window
    % by summing the square of the magnitude of the spectrum
    sy(k) = sum((ny .* win(:, k)) .^ 2);
end
sy = log(sy);
dy = dct(sy);                             % Determine DCT of log of the spectrum energies
fid = fopen('sample.dat', 'r');
dx = fread(fid, 20, 'real*8');            % Obtain the feature vector for the password
fclose(fid);                              % evaluated in the training phase
dx = dx.';
MSE = sum((dx - dy) .^ 2) / 20;           % Determine the mean squared error
if MSE < 1.5
    fprintf('\n\nYou are the same user\n\n');
    % Grant = wavread('Grant.wav');       % "Access Granted" is output if within threshold
    % wavplay(Grant);
else
    fprintf('\n\nYou are not the same user\n\n');
    % Deny = wavread('Deny.wav');         % "Access Denied" is output in case of a failure
    % wavplay(Deny);
end
APPENDIX: C
Voice Recording MATLAB Code:
for i = 1:10
    file = sprintf('%s%d.wav', 'g', i);
    input('You have 2 seconds to say your name. Press enter when ready to record--> ');
    y = wavrecord(88200, 44100);          % 2 seconds at 44.1 kHz
    sound(y, 44100);
    wavwrite(y, 44100, file);
end
APPENDIX: D
Training and Testing Code:
name = input('Enter the name that must be recognized -- > ', 's');
ytemp = zeros(88200, 20);
r = zeros(10, 1);
for j = 1:10
    file = sprintf('%s%d.wav', 'g', j);
    [t, fs] = wavread(file);
    s = abs(t);
    start = 1;
    last = 88200;
    % Find the start of speech, keeping up to 7000 samples before it
    for i = 1:88200
        if s(i) >= .1 && i <= 7000
            start = 1;
            break
        end
        if s(i) >= .1 && i > 7000
            start = i - 7000;
            break
        end
    end
    % Find the end of speech, keeping up to 7000 samples after it
    for i = 1:88200
        k = 88201 - i;
        if s(k) >= .1 && k >= 81200
            last = 88200;
            break
        end
        if s(k) >= .1 && k < 81200
            last = k + 7000;
            break
        end
    end
    r(j) = last - start;
    ytemp(1 : last - start + 1, 2 * j) = t(start:last);
    ytemp(1 : last - start + 1, 2 * j - 1) = t(start:last);
end
% The rows of the matrix are truncated to the smallest length
% of the 10 recordings.
y = zeros(min(r), 20);
for i = 1:20
    y(:, i) = ytemp(1:min(r), i);
end
% Convert the individual columns into the frequency domain by applying the
% Fast Fourier Transform. Then take the modulus squared of all the entries
% in the matrix.
fy = fft(y);
fy = fy .* conj(fy);
% Normalize the spectra of each recording and place into the matrix fn.
% Only frequencies up to 600 are needed to represent the speech of most
% humans.
fn = zeros(600, 20);
for i = 1:20
    fn(1:600, i) = fy(1:600, i) / sqrt(sum(abs(fy(1:600, i)).^2));
end
% Find the average vector pu
pu = zeros(600, 1);
for i = 1:20
    pu = pu + fn(1:600, i);
end
pu = pu / 20;
% Normalize the average vector
tn = pu / sqrt(sum(abs(pu).^2));
% Find the standard deviation
std = 0;
for i = 1:20
    std = std + sum(abs(fn(1:600, i) - tn).^2);
end
std = sqrt(std / 19);
%%%%%%%% Verification Process %%%%%%%%
input('You will have 2 seconds to say your name. Press enter when ready')
% Record voice and confirm if the user is happy with their recording
usertemp = wavrecord(88200, 44100);
sound(usertemp, 44100);
rec = input('Are you happy with this recording? \nPress 1 to record again or just press enter to proceed--> ');
while rec == 1
    rec = 0;
    input('You will have 2 seconds to say your name. Press enter when ready')
    usertemp = wavrecord(88200, 44100);
    sound(usertemp, 44100);
end
Train.m
function code = train(traindir, n)
k = 16;                                   % number of centroids required
for i = 1:n                               % train a VQ codebook for each speaker
    file = sprintf('%s%d.wav', traindir, i);
    disp(file);
    [s, fs] = wavread(file);
    v = mfcc(s, fs);                      % Compute MFCCs
    code{i} = vqlbg(v, k);                % Train VQ codebook
end
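
Once a codebook has been trained for each speaker, an unknown utterance is typically assigned to the speaker whose codebook gives the smallest average distortion. The sketch below illustrates that matching step in Python under stated assumptions: feature vectors and codebooks are plain lists of numbers, Euclidean distance is the distortion measure, and all names are hypothetical. It is only an illustration, not the project's own test routine.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_distortion(vectors, codebook):
    """Mean distance from each feature vector to its nearest codeword."""
    return sum(min(euclidean(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify_speaker(vectors, codebooks):
    """Return the key of the codebook with the lowest average distortion.

    `codebooks` maps a speaker label to a list of codewords (centroids).
    """
    return min(codebooks, key=lambda name: average_distortion(vectors, codebooks[name]))
```

In practice a rejection threshold on the minimum distortion would also be needed, so that a speaker who is not in the database is not matched to the nearest enrolled one.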
