INSTITUTE OF TECHNOLOGY
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
Gondar, Ethiopia
VOICE RECOGNITION SYSTEM USING MFCC ALGORITHM 2013 E.C
DECLARATION OF AUTHORSHIP
We declare that this project, titled ‘voice recognition using MFCC algorithm’, and the work
presented in it are our own. We confirm that:
• This work was done wholly or mainly while in candidature for a bachelor degree at this
University.
• No part of this project has previously been submitted for a degree or any other
qualification at this University.
• We have reviewed related literature from the work of others.
1. Natnael Asmamaw
2. Sisay Ayehu
3. Yidideya Addis
4. Zelalem Bogale
APPROVAL
This is to certify that the thesis entitled “VOICE RECOGNITION SYSTEM”, submitted in partial
fulfillment of the requirements for the award of the Bachelor of Science Degree in Electrical and
Computer Engineering (communication stream) at the University of Gondar, is a record of
their work, carried out by them during the academic year 2021 G.C. under the supervision and
guidance of Mr. Belayneh Eskeziya. The extent and source of information derived from the
existing literature have been indicated throughout the thesis at the appropriate places. To the best
of their knowledge, the matter embodied in the thesis has not been submitted to any other
university / institute for the award of any degree.
Department head
1. Natnael Asmamaw
2. Sisay Ayehu
3. Yidideya Addis
4. Zelalem Bogale
ACKNOWLEDGMENT
Many individuals have contributed, from the very beginning, to bringing this final thesis to its
final stage.
We express our deep sense of gratitude and sincere thanks to the School of Electrical and Computer
Engineering for giving us this opportunity and for providing a project room.
We would like to express our special thanks and gratitude to Mr. Belayneh Eskeziya and all of our
instructors for their valuable recommendations and for giving us the golden opportunity to do this
final thesis on voice recognition using the MFCC algorithm.
Last but not least, we want to thank our friends, who valued our hard work and
encouraged us, and finally God, who made all things possible, for giving us help and patience
through such a hard time.
ABSTRACT
It is easy for humans to recognize a familiar voice, but using computer programs to identify a voice
when compared with others is a herculean task. This is due to the problems encountered
when developing an algorithm to recognize the human voice. It is impossible to say a word in
exactly the same way on two different occasions, and computer analysis of human speech gives
different interpretations depending on the varying speed of speech delivery. This research paper
gives a detailed description of the process behind the implementation of an effective voice
recognition algorithm. The algorithm utilizes the discrete Fourier transform to compare the
frequency spectra of two voice samples, because the spectrum remains unchanged when the speech
is slightly varied.
Markov’s inequality is then used to determine whether the two voices came from the same
person. The algorithm is implemented and tested using MATLAB.
Keywords: Markov’s inequality, discrete Fourier transform, frequency spectra, voice
recognition.
Table of Contents
DECLARATION OF AUTHORSHIP
APPROVAL
ACKNOWLEDGMENT
ABSTRACT
List of Figures
List of Acronyms
CHAPTER ONE
1. INTRODUCTION
1.1 Background of Thesis
1.1.1 Human Voice
1.1.2 The Speech Signal
1.1.3 Speech Synthesis
1.1.4 Speech Analysis
1.1.5 Voice Recognition
1.1.6 Text-Dependent
1.1.7 Text Independent
1.1.8 Voice Recognition Techniques
1.1.8.1 Template Matching
1.1.8.2 Feature Analysis
1.2 Statement of the Problem
1.3 Objectives of the Thesis
1.3.1 General Objective
1.3.2 Specific Objective
1.4 Methodology
1.5 Scope of the Thesis
1.6 Organization of the Project
CHAPTER TWO
2. LITERATURE REVIEW
2.1 Signal Sampling
2.2 The Characteristic Parameters of Speech Signal
List of Figures
Figure 1.1 Schematic Diagram of the Speech Production/Perception Process
Figure 1.2 Block diagram of the system
Figure 1.3 Speaker Identification Training
Figure 1.4 Speaker Identification Testing
Figure 3.1 Utterance of voice
Figure 3.2 Feature extraction process
Figure 3.3 Filter Bank on Mel frequency
Figure 3.4 Cepstrum signal line
Figure 3.5 Mel frequency mapped
Figure 3.6 MFCC Approach
Figure 3.7 MFCC approach Algorithm
Figure 3.8 FFT approach Algorithm
Figure 4.1 Voice training wave form
Figure 4.2 Voice testing phase wave form
List of Acronyms:
CHAPTER ONE
1. INTRODUCTION
1.1 Background of Thesis
The speech production and perception process consists of five different elements: A. speech
formulation, B. human vocal mechanism, C. acoustic air, D. perception of the ear, and E. speech
comprehension. The first element (A. speech formulation) is associated with the formulation of
the speech signal in the talker’s mind. This formulation is used by the human vocal mechanism
(B. human vocal mechanism) to produce the actual speech waveform. The waveform is
transferred via the air (C. acoustic air) to the listener. During this transfer the acoustic wave can
be affected by external sources, for example noise, resulting in a more complex waveform. When
the wave reaches the listener’s hearing system (the ears), the listener perceives the waveform
(D. perception of the ear), and the listener’s mind (E. speech comprehension) starts processing
this waveform to comprehend its content, so that the listener understands what the talker is
trying to say. One issue in speech recognition is to “simulate” how the listener processes the
speech produced by the talker. Several actions take place in the listener’s head and hearing
system during the processing of speech signals. The perception process can be seen as the
inverse of the speech production process. The basic theoretical unit for describing how
linguistic meaning is brought to the formed speech, in the mind, is called the phoneme.
Phonemes can be grouped based on the properties of either the time waveform or the frequency
characteristics, and classified into the different sounds produced by the human vocal tract.
Speech is a time-varying signal.
It is a well-structured communication process that depends on known physical movements and is
composed of known, distinct units (phonemes). It is different for every speaker; may be fast,
slow, or varying in speed; may have high pitch, low pitch, or be whispered; has widely varying
types of environmental noise; may not have distinct boundaries between units (phonemes); and
has an unlimited number of words.
speaker’s identity. Put another way, voice verification is one-to-one matching, where one
speaker’s voice is matched to one template or voice print, whereas voice identification is
one-to-many matching, where the speaker’s voice is compared against many voice templates. A
speaker recognition system has two phases: enrollment and verification. During enrollment, the
speaker’s voice is recorded and typically a number of features are extracted to form a voice print
or template. In the verification phase, a speech sample or “utterance” is compared against a
previously created voice print. Identification systems compare the utterance against
multiple voice prints in order to determine the best match, while verification systems compare an
utterance against a single voice print.
Voice recognition systems can also be categorized into two types: text-independent and
text-dependent.
1.1.6 Text-Dependent:
This means the text must be the same for enrollment and verification. Shared secret
passwords, PINs, or knowledge-based information can be employed in order to create a
multi-factor authentication scenario.
to the traditional command inputs from a keyboard. The program contains the input template
and attempts to match this template with the actual input using a simple conditional statement.
This type of system is known as "speaker dependent", and recognition accuracy can be about 98
percent.
1.4 Methodology
Our work is organized and accomplished through a sequence of stages. First,
we reviewed the related literature. Then we made the general block diagram of our
system, which enables us to easily analyze each component of the system, as shown in the
following figures.
Speech communication has evolved to be efficient and robust, and it is clear that the route to
computer-based speech recognition is the modeling of the human system. Unfortunately, from a
pattern recognition point of view, humans recognize speech through a very
complex interaction between many levels of processing, using syntactic information as well as very
powerful low-level pattern classification and processing. Powerful classification algorithms and
sophisticated front ends are, in the final analysis, not enough; many other forms of knowledge,
e.g. linguistic, semantic and pragmatic, must be built into the recognizer. Nor, even at a lower
level of sophistication, is it sufficient merely to generate “a good” representation of speech (i.e. a
good set of features to be used in a pattern classifier); the classifier itself must have a
considerable degree of sophistication. It is the case, however, that poor features do not effectively
discriminate between classes and, further, that the better the features, the easier the classification
task. Automatic speech recognition is therefore an engineering compromise between the ideal, i.e.
a complete model of the human, and the practical, i.e. the tools that science and technology
provide and that costs allow. At the highest level, all speaker recognition systems contain two
main modules (refer to Fig 1.1): feature extraction and feature matching. Feature extraction is the
process that extracts a small amount of data from the voice signal that can later be used to
represent each speaker. Feature matching involves the actual procedure to identify the unknown
speaker by comparing extracted features from his/her voice input with the ones from a set of
known speakers. We will discuss each module in detail in later sections.
[Figure: block diagrams of the system. First diagram: input speech, feature extraction, similarity against reference models (speaker #1 to speaker #N), maximum selection, identification result (speaker ID). Second diagram: speaker ID, reference model (speaker #M), threshold.]
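The two matching modes in the block diagrams can be illustrated with a minimal Python sketch. It assumes each reference model is a single feature vector and that similarity is measured by Euclidean distance, so "maximum selection" becomes picking the smallest distance; the function names, the speaker labels and the numbers are hypothetical and are not taken from the thesis.

```python
import math

def euclidean(a, b):
    """Distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(features, reference_models):
    """One-to-many matching: return the enrolled speaker whose model is closest."""
    return min(reference_models,
               key=lambda name: euclidean(features, reference_models[name]))

def verify(features, claimed_model, threshold):
    """One-to-one matching: accept only if the distance is within the threshold."""
    return euclidean(features, claimed_model) <= threshold

# Hypothetical enrolled models (made-up numbers, not thesis data).
models = {"speaker_1": [1.0, 2.0, 3.0], "speaker_2": [4.0, 0.5, 1.0]}
utterance = [1.1, 1.9, 3.2]

best = identify(utterance, models)                                # closest model
accepted = verify(utterance, models["speaker_1"], threshold=0.5)  # claimed identity
```

A distance is used here only because it is simple to compute; any similarity score would work as long as identification picks the best score and verification compares one score against a speaker-specific threshold.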
All recognition systems have to serve two different phases. The first is referred to as the
enrollment sessions or training phase, while the second is referred to as the operation sessions
or testing phase. In the training phase, each registered speaker has to provide samples of their
speech so that the system can build or train a reference model for that speaker. In the case of
speaker verification systems, a speaker-specific threshold is also computed from the training
samples. During the testing (operational) phase (see Figure 1.3), the input speech is matched with
the stored reference model(s) and a recognition decision is made. Speech recognition is a difficult
task and it is still an active research area. Automatic speech recognition works on the premise
that a person’s speech exhibits characteristics that are unique to the speaker. However, this task
is challenged by the high variance of input speech signals. The principal source of
variance is the speaker himself. Speech signals in training and testing sessions can be greatly
different due to many factors, such as people’s voices changing with time, health conditions (e.g.
the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker
variability, that present a challenge to speech recognition technology. Examples of these are
acoustical noise and variations in recording environments (e.g. the speaker uses different
telephone handsets). The challenge is to make the system “robust”. So what characterizes a
“robust” system?
When people use an automatic speech recognition (ASR) system in a real environment, they
always hope it can achieve recognition performance as good as that of the human ear, which can
constantly adapt to environmental characteristics such as the speaker, the background noise and
the transmission channels. Unfortunately, at present, the capacity of machines to adapt to unknown
conditions is far poorer than ours. In fact, the performance of speech
recognition systems trained with clean speech may degrade significantly in the real world
because of the mismatch between the training and testing environments. If the recognition
accuracy does not degrade very much under mismatched conditions, the system is called “robust”.
CHAPTER TWO
2. LITERATURE REVIEW
Speech is the most natural way of communicating for human beings. While this has been true
since the dawn of civilization, the invention and widespread use of the telephone, television,
radio and audio storage media has given more importance to the communication of voice
and to voice processing. Advances in digital signal processing technology have led to the use of
speech processing in many different areas of application, such as speech compression,
enhancement, synthesis and recognition.
A speech recognition or Automatic Speech Recognition (ASR) system converts the acoustic
signal (audio) to a machine-readable format. ASR recognizes the words, and these words
serve as input for a particular application, either as commands or for document
preparation. Nowadays there is great interest in designing an intelligent machine that can recognize
the spoken word, understand its meaning and take corresponding actions. One of the most
difficult aspects of performing research in speech recognition by machine is its interdisciplinary
nature. The early studies focused on a monolithic approach to individual problems.
An inappropriately high sampling frequency leads to sampling too much data (N = T/△t for a
signal of a certain length T), which unnecessarily increases the workload of the computer and
takes up too much storage. On the contrary, the discrete-time signal will not represent the
characteristics of the original signal if the sampling frequency is too low and the sampling points
are insufficient. [5] So we always use about 8000 Hz as the sampling frequency, in accordance
with the Nyquist theorem, F ≥ 2 * Fmax.
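As a worked illustration of these two relations, take a one-second recording as an assumed example duration (the duration is chosen for illustration and is not a figure from this thesis):

\[ N = \frac{T}{\Delta t} = T F = 1\,\text{s} \times 8000\,\text{Hz} = 8000 \text{ samples}, \qquad F_{\max} \le \frac{F}{2} = 4000\,\text{Hz} \]

Doubling the sampling frequency to 16000 Hz would double N for the same T, which is the extra storage and workload described above, while frequency content above 4000 Hz cannot be represented at 8000 Hz.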
transformation applied to the logarithm of the filter-bank outputs results in the raw MFCC vector
triangular filters. So the MFCC imitates the perception behavior of the ear and gives better
identification than LPC.
2.3 MFCC
The Mel Frequency Cepstral Coefficient is a characteristic parameter that is widely used in speech
recognition and speaker recognition. Before Mel Frequency Cepstral Coefficients,
researchers mostly used Linear Prediction Coefficients or Linear Prediction Cepstral
Coefficients as the characteristic parameters of the speech signal. Mel Frequency Cepstral
Coefficients are a representation of the short-time power spectrum of a speech signal; they are
calculated by applying the DCT to convert back into the time domain, that is, a linear cosine
transform of a log power spectrum on a nonlinear mel scale of frequency. The result is a set of
acoustic vectors.
MFCCs are commonly used as characteristic parameters in speech recognition algorithms. In the
theory of MFCC, the critical band is a very important concept which solves the problem of
frequency division; it is also an important indicator of Mel frequency. The purpose of
introducing the critical bandwidth is to describe the masking effect. When two similar or identical
pitches are voiced at the same time, the human ear cannot distinguish the difference and perceives
only one pitch. The condition under which two pitches can be perceived is that the difference
between the two frequencies is larger than a certain bandwidth, which is called the critical bandwidth.
Within the critical bandwidth, if the sound pressure of a speech signal with noise is constant, then
the loudness of the speech signal with noise is also constant. But once the noise bandwidth goes
beyond the critical bandwidth, the loudness changes noticeably.
seen as a CHMM with only one state, yet it cannot simply be regarded as a hidden Markov
model (HMM), because the GMM completely avoids the segmentation of the system state.
Compared with the HMM, the GMM makes research on speech
recognition simpler, with no loss in performance. Compared with the VQ method, the Gaussian
mixture model calculates the spatial distribution of the characteristic parameters as the weighted
sum of multiple Gaussian density functions; from this point of view, the
GMM is more accurate and superior. It is very hard to model the process of the human
pronunciation organs exactly, but we can simulate a model which expresses the process of sound
production by building a probability model for speech processing, and the Gaussian
mixture model is exactly a probability model that satisfies this condition.
The formula of M order gaussian mixture model can be defined as:
\[ p(x_t) = \sum_{i=1}^{M} \alpha_i \, p_i(x_t) \tag{2} \]
\[ \sum_{i=1}^{M} \alpha_i = 1 \tag{2.1} \]
Thus each element 𝑝𝑖(𝑥𝑡) of the Gaussian mixture model can be described by its mean value and
covariance.
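A minimal Python sketch of the mixture density in equation (2) follows. It assumes one-dimensional features so that each component is described by a mean and a variance rather than a full covariance matrix; the function names and the example parameters are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """Weighted sum of M Gaussian components, as in equation (2)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        # Constraint (2.1): the mixture weights must sum to one.
        raise ValueError("mixture weights must sum to 1")
    return sum(a * gaussian_pdf(x, m, v)
               for a, m, v in zip(weights, means, variances))

# Hypothetical two-component mixture (made-up parameters).
weights, means, variances = [0.4, 0.6], [0.0, 3.0], [1.0, 0.5]
density = gmm_pdf(0.0, weights, means, variances)
```

For real MFCC vectors each component would use a mean vector and a covariance matrix, but the weighted-sum structure is the same.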
EM is the abbreviation of expectation maximization, which is an iterative method. The
EM algorithm searches for the maximum likelihood estimate of the parameters in statistical models.
Each iteration can be divided into two steps. The first is the expectation (E) step, which
generates a function for the expectation of the log-likelihood.
The second is the maximization (M) step, which computes the parameters that maximize
the expected log-likelihood found in the E step. In many research fields, expectation
maximization is the most popular technique used to calculate the parameters of a
parametric mixture model distribution. Counting the starting point, it is an iterative algorithm
with three steps: initialization, the expectation step and the maximization step.
Each class j of the M clusters is characterized by a parameter vector (θ), composed of the mean
(µ𝑗 ) and the covariance matrix (𝑝𝑗 ). At initialization, the implementation can randomly
generate the initial values of the mean (µ𝑗 ) and of the covariance matrix (𝑝𝑗 ). The EM algorithm
aims to approximate the parameter vector (θ) of the real distribution.
The expectation step is responsible for estimating the probability that each element belongs to
each cluster, p(𝑐𝑗 |𝑥𝑘 ). Each element is described by an attribute vector (𝑥𝑘 ). With initial guesses
for the parameters of our mixture model, the probability of hidden state 𝑖 can be defined as.
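In the standard formulation of EM for Gaussian mixtures, which is assumed here rather than taken from this text, the E-step probability is each weighted component density divided by the total mixture density, and the M step re-estimates weights, means and variances from those probabilities. A minimal one-dimensional Python sketch with hypothetical names and made-up data:

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_step(data, weights, means, variances):
    """One EM iteration for a one-dimensional Gaussian mixture."""
    # E step: probability that each point belongs to each cluster.
    resp = []
    for x in data:
        joint = [a * gaussian_pdf(x, m, v)
                 for a, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M step: re-estimate the parameters from the soft assignments.
    new_w, new_m, new_v = [], [], []
    for j in range(len(weights)):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mean)
        new_v.append(max(var, 1e-6))  # floor keeps the variance from collapsing
    return new_w, new_m, new_v

# Made-up data with two clusters; fixed (not random) initial guesses.
data = [0.9, 1.0, 1.1, 1.2, 0.8, 4.9, 5.0, 5.1, 5.2, 4.8]
weights, means, variances = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]
for _ in range(50):
    weights, means, variances = em_step(data, weights, means, variances)
```

The initial values are fixed here so that the run is repeatable; a random initialization, as described in the text, only changes the starting point of the same loop.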
space between the vocal cords) produces the buzz, which is characterized by its intensity
(loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is
characterized by its resonances, which are called formants. Hisses and pops are generated by the
action of the tongue, lips and throat during sibilants and plosives. LPC analyzes the speech signal
by estimating the formants, removing their effects from the speech signal, and estimating the
intensity and frequency of the remaining buzz. The process of removing the formants is called
inverse filtering, and the remaining signal after the subtraction of the filtered modeled signal is
called the residue. The numbers which describe the intensity and frequency of the buzz, the
formants, and the residue signal, can be stored or transmitted somewhere else. LPC synthesizes
the speech signal by reversing the process: use the buzz parameters and the residue to create a
source signal, use the formants to create a filter (which represents the tube), and run the source
through the filter, resulting in speech. Because speech signals vary with time, this process is done
on short chunks of the speech signal, which are called frames; generally 30 to 50 frames per
second give intelligible speech with good compression.
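The text does not name a method for estimating the filter that models the formants. A common choice, assumed here, is the autocorrelation method with the Levinson-Durbin recursion, which returns predictor coefficients and the energy of the residue for one frame. The function names and the test signal are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] with x[n] ~ sum a[k] * x[n-k].

    Returns the coefficients and the remaining prediction-error (residue) energy.
    """
    if r[0] == 0.0:
        raise ValueError("frame has no energy")
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a[1:], err

# Made-up frame: a decaying signal where each sample is 0.9 times the previous one.
frame = [0.9 ** n for n in range(200)]
coeffs, residual_energy = levinson_durbin(autocorrelation(frame, 2), 2)
```

For this frame the first coefficient comes out close to 0.9 and the second close to zero, which matches how the signal was built; real speech frames would typically use a higher order.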
CHAPTER THREE
Deriving the acoustic characteristics of the speech signal is called feature extraction. Feature
extraction is used in both the training and testing phases. It consists of the following steps:
1. Frame blocking 2. Windowing 3. FFT (Fast Fourier Transform) 4. Mel-frequency wrapping
5. Cepstrum (Mel frequency cepstral coefficients)
[Figure 3.2 Feature extraction process: input signal, frame blocking, windowing, FFT, Mel-frequency wrapping, cepstrum]
3.2.2 Windowing
The second step is to window all frames. This is done to eliminate discontinuities at the edges
of the frames. If the window function is defined as w(n), 0 ≤ n ≤ N-1, where N is the number
of samples per frame, the resulting signal will be y(n) = x(n) w(n). Generally Hamming
windows are used.
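Frame blocking followed by windowing can be sketched in a few lines of Python. The frame length of 250 samples follows the figure quoted later in this thesis; the hop of 125 samples (50% overlap), the function names and the test signal are assumptions made for illustration.

```python
import math

def frame_blocking(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n_samples):
    """Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)), 0 <= n <= N - 1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def apply_window(frame, window):
    """y(n) = x(n) w(n) for one frame."""
    return [x * w for x, w in zip(frame, window)]

# Made-up test signal: 1000 samples of a sine wave.
signal = [math.sin(2.0 * math.pi * 5.0 * n / 100.0) for n in range(1000)]
frames = frame_blocking(signal, frame_len=250, hop=125)
window = hamming(250)
windowed = [apply_window(f, window) for f in frames]
```

The window tapers each frame to a small value at both edges, which is what removes the discontinuities mentioned above.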
3.2.3 FFT
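The next step is to take the fast Fourier transform of each frame. This transformation is a fast way
of computing the discrete Fourier transform, and it changes the domain from time to frequency.
\[ X_k = \sum_{n=0}^{N-1} x_n \, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1 \tag{3.1} \]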
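The relation between the discrete Fourier transform and its fast version can be checked with a short Python sketch. The direct sum costs N squared operations, while the recursive radix-2 version shown here needs a frame length that is a power of two; both functions and the test frame are illustrative, not the thesis code.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform: X[k] = sum_n x[n] exp(-2j pi k n / N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft(x):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

# Made-up frame: a cosine with exactly 3 cycles in 16 samples.
frame = [math.cos(2.0 * math.pi * 3.0 * t / 16.0) for t in range(16)]
spectrum = fft(frame)
magnitudes = [abs(v) for v in spectrum]
```

The magnitude spectrum of this frame peaks at bins 3 and 13, the positive and negative frequency of the cosine.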
One approach to simulating the subjective spectrum is to use a filter bank, with one filter for each
desired mel-frequency component. The filter bank has a triangular band-pass frequency
response, and the spacing as well as the bandwidth is determined by a constant mel-frequency
interval. The modified spectrum of S(ω) thus consists of the output power of these filters when
S(ω) is the input. The number of mel cepstral coefficients, K, is usually chosen as 20. Note that
this filter bank is applied in the frequency domain, so it simply amounts to applying the
triangle-shaped windows of Figure 4.2 to the spectrum. A useful way of thinking about this
mel-warped filter bank is to see each filter as a histogram bin (where bins overlap) in the
frequency domain. A useful and efficient way to implement this is to consider these triangular
filters on the Mel scale, where they are in effect equally spaced filters.
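The "equally spaced on the Mel scale" view leads to a compact implementation, sketched below in Python. It assumes the mel mapping with the constants 1127.01048 and 700 used in this thesis, K = 20 filters, and triangles that each overlap their neighbours by half; the function names and the flat test spectrum are hypothetical.

```python
import math

def hz_to_mel(f):
    """Mel value of a frequency in Hz."""
    return 1127.01048 * math.log(1.0 + f / 700.0)

def mel_filter_weights(freq_hz, n_filters, f_max_hz):
    """Response of every triangular filter at one frequency.

    Filter centres are equally spaced on the mel axis between 0 and f_max_hz.
    """
    step = hz_to_mel(f_max_hz) / (n_filters + 1)
    m = hz_to_mel(freq_hz)
    return [max(0.0, 1.0 - abs(m - i * step) / step)
            for i in range(1, n_filters + 1)]

def filter_bank_energies(power_spectrum, sample_rate, n_filters=20):
    """Output power of each filter for a spectrum covering 0 .. sample_rate / 2."""
    f_max = sample_rate / 2.0
    n_bins = len(power_spectrum)
    energies = [0.0] * n_filters
    for k, p in enumerate(power_spectrum):
        f = k * f_max / (n_bins - 1)
        for i, w in enumerate(mel_filter_weights(f, n_filters, f_max)):
            energies[i] += w * p
    return energies

# Made-up flat power spectrum with 129 bins at an 8000 Hz sampling rate.
energies = filter_bank_energies([1.0] * 129, sample_rate=8000)
```

Each spectrum bin contributes to at most two neighbouring filters, which is the overlapping-histogram picture described above.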
3.2.5 Cepstrum
The name cepstrum was derived from spectrum by reversing its first four letters.
We can say that the cepstrum is the Fourier transform of the logarithm, with unwrapped
phase, of the Fourier transform.
Here m is the integer required to properly unwrap the angle or imaginary part of the complex
logarithm function.
The real cepstrum uses the logarithm function defined for real values, while the
complex cepstrum uses the complex logarithm function defined for complex values.
The real cepstrum uses only the information of the magnitude of the spectrum, whereas the complex
cepstrum holds information about both the magnitude and the phase of the initial spectrum, allowing
the reconstruction of the signal. We can calculate the cepstrum in many ways. Some of them need a
phase-warping algorithm, others do not. The following figure shows the pipeline from signal to cepstrum.
[Figure 3.4 Cepstrum signal line: Fourier transform, log]
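The real cepstrum, which needs only the magnitude of the spectrum, can be sketched in Python as transform, logarithm of the magnitude, and a second transform. The second transform is taken here as the inverse DFT, which is the usual convention and is stated as an assumption; the small floor that avoids the logarithm of zero and the function names are also hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct (or inverse) discrete Fourier transform of a short frame."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(frame, floor=1e-12):
    """Inverse transform of the log magnitude spectrum of one frame."""
    spectrum = dft(frame)
    log_magnitude = [math.log(max(abs(v), floor)) for v in spectrum]
    return [c.real for c in dft(log_magnitude, inverse=True)]

# Made-up frame: a scaled unit impulse, whose spectrum is flat.
cepstrum = real_cepstrum([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

For this flat spectrum all of the cepstrum sits in the first coefficient, which equals the logarithm of the scale factor.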
3.2.6 MFCC
In this paper we use Mel frequency cepstral coefficients. Mel frequency cepstral
coefficients are coefficients which represent audio based on perception. These coefficients have had
great success in speaker recognition applications. They are derived from the Fourier
transform of the audio clip. In this technique the frequency bands are positioned logarithmically,
whereas in the Fourier transform the frequency bands are not positioned logarithmically. As the
frequency bands are positioned logarithmically in MFCC, it approximates the response of the
human system more closely than any other system. These coefficients enable better data
processing. In the Mel frequency cepstral coefficients, the calculation of the Mel cepstrum is the
same as for the real cepstrum, except that the frequency scale of the Mel cepstrum is warped
to correspond to the Mel scale.
\[ m = 1127.01048 \, \log_e\!\left(1 + \frac{f}{700}\right) \tag{3.3} \]
\[ f = 700 \left(e^{m / 1127.01048} - 1\right) \tag{3.4} \]
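A quick numerical check of equations (3.3) and (3.4), using 1000 Hz as an example frequency:

\[ m(1000\,\text{Hz}) = 1127.01048 \, \ln\!\left(1 + \frac{1000}{700}\right) \approx 1127.01048 \times 0.8873 \approx 1000 \text{ mel} \]

\[ f(1000 \text{ mel}) = 700 \left(e^{1000 / 1127.01048} - 1\right) \approx 700 \times (2.4286 - 1) \approx 1000\,\text{Hz} \]

The two mappings are inverses of each other, and the constants place 1000 Hz at about 1000 mel. Below that point the scale is close to linear, and above it the scale grows logarithmically.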
[Figure: block diagram: continuous speech, silence detection, windowing, FFT]
We first stored the speech signal as a 10000-sample vector. It was observed from our experiment
that the actual uttered speech, excluding the static portions, came to about 2500 samples, so
we carried out silence detection using a simple threshold technique to extract the actual
uttered speech. What we wanted to achieve was a voice-based biometric system
capable of recognizing isolated words, and, as our experiments revealed, almost all the isolated
words were uttered within 2500 samples. But when we passed this speech signal through an MFCC
processor, it split the signal up in the time domain using overlapping windows, each with about 250
samples. Thus, when we convert this into the frequency domain, we have only about 250 spectrum
values under each window. This implied that converting it to the Mel scale would be redundant,
as the Mel scale is linear up to 1000 Hz. So we eliminated the block which did the Mel warping
and directly used the overlapping triangular windows in the frequency domain. We obtained the
energy within each triangular window, followed by the DCT of their logarithms, to achieve good
compaction within a small number of coefficients, as described by the MFCC approach.
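Two pieces of this procedure, threshold-based silence detection and the DCT of the log window energies, can be sketched in Python. The threshold value, the trimming rule (keep everything between the first and last sample above the threshold), the DCT-II form and all names are assumptions for illustration, not the thesis implementation.

```python
import math

def remove_silence(samples, threshold):
    """Keep the span between the first and last sample above the threshold."""
    active = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not active:
        return []
    return samples[active[0]:active[-1] + 1]

def dct_ii(values):
    """Unnormalised DCT-II: c[k] = sum_i v[i] cos(pi k (2i + 1) / (2n))."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, v in enumerate(values))
            for k in range(n)]

def log_energy_dct(energies, n_coeffs):
    """DCT of the log of the energies in each triangular window."""
    logs = [math.log(e + 1e-12) for e in energies]  # offset avoids log(0)
    return dct_ii(logs)[:n_coeffs]

# Made-up recording with silence at both ends.
recording = [0.0, 0.01, 0.5, -0.7, 0.02, 0.6, 0.0, 0.0]
speech = remove_silence(recording, threshold=0.1)

# Made-up window energies; equal energies give a single non-zero coefficient.
coefficients = log_energy_dct([math.e] * 4, n_coeffs=3)
```

The DCT packs most of the information of slowly varying log energies into the first few coefficients, which is the compaction mentioned above.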
[Figure 3.7 MFCC approach Algorithm (flowchart): silence detection, windowing, convert speech into FFT, determine energy within each window, determine DCT of spectrum energy, check whether it is < 1.5 (yes/no)]
[Figure 3.8 FFT approach Algorithm (flowchart): silence detection, convert individual columns into frequency domain, find the standard deviations, confirm whether user voice is within two standard deviations (yes/no)]
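The last decision in the flowchart, confirming that the user's voice lies within two standard deviations, might look like the following Python sketch. It assumes that each enrolled utterance is reduced to a short feature vector and that the check is applied to every feature separately; both assumptions, and all names and numbers, are hypothetical.

```python
import statistics

def within_two_sigma(test_value, training_values):
    """True if test_value lies within two standard deviations of the training mean."""
    mean = statistics.mean(training_values)
    sigma = statistics.stdev(training_values)
    return abs(test_value - mean) <= 2.0 * sigma

def accept_speaker(test_features, training_features):
    """Accept only if every feature of the test utterance passes the check.

    training_features holds one feature vector per enrolled utterance.
    """
    columns = list(zip(*training_features))
    return all(within_two_sigma(t, col)
               for t, col in zip(test_features, columns))

# Made-up training vectors from three enrolment utterances.
training = [[1.0, 10.0], [1.2, 10.4], [0.8, 9.6]]
granted = accept_speaker([1.3, 10.5], training)   # close to the enrolled voice
denied = accept_speaker([2.0, 10.0], training)    # first feature too far away
```

At least two enrolment utterances are needed, since a standard deviation cannot be estimated from a single sample.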
CHAPTER FOUR
In the results, when the voice recorded in the training phase matches the voice
recorded for the testing phase within the minimum requirement, or threshold, the result grants
permission to access the system. The third set, of numbers from zero to nine, has also been utilized for
testing.
The figures were generated with MATLAB code implementing the GMM and MFCC algorithms and
are simulated based on the first sample set, for the words “Yes” and “No”, where the algorithm
works for different isolated words. The first sample set was recorded from two people, with
2 different samples from each person, giving a total of 4 samples. The
output figures of each person’s speech in the left column indicate that the error between the
training sample and the testing samples is very small, which reveals that the algorithm performs
well in recognizing the first word, while the upper plots show the errors of different test samples in
one file; we can still discern the content of the speech quickly when the speech content is
changing. The simulation results show that the algorithm reaches a high, though not perfect,
success probability for two words with different pronunciations without
noise.
Whereas if the voice recorded in the testing phase does not match the voice recorded in
the training phase, the result is that access to the system is denied.
Waveform of the voice recorded in the training phase
In this project, the authors recorded two different sets of speech samples. The first set is " YES "
and " NO " from 2 persons, which have different pronunciations, while the second set is " ON "
and " OFF " from 2 persons, which have almost the same pronunciation.
The figures are simulated based on the second sample set, for the test words “On” and “Off”, which
were also recorded from three people by taking 2 different samples from each person,
with 4 different samples in total. The three output figures show the corresponding error between
the trained signal and the reference signal. As we know, the pronunciations of "ON " and " OFF"
are almost the same, which makes them very hard to distinguish by common methods. The left
column shows a similar outcome to the experiment on “Yes” and “No”: the error between the training
sample and the testing samples is also quite small. In this case it can be concluded that whether or
not the word pair has similar pronunciations has no influence on the algorithm. Meanwhile, the
right column of the figure shows the errors of different test samples in one file; we can also discern
the content of the speech quickly when the speech content is changing, and here too the success
probability is not perfect.
When noise is added to the signal, as shown in Tables 7 and 8, the recognition performance
changes with the SNR level. It can be observed that the performance for a word pair of similarly
pronounced monosyllables such as "ON" and "OFF" is so poor in a high-noise environment that the
total probability of success is only 25% when the SNR is equal to 1, but it is very stable at
larger SNR levels.
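A noisy test signal at a chosen SNR can be produced as in the minimal sketch below. It is not the code behind Tables 7 and 8. It assumes additive white Gaussian noise and assumes that the SNR value is given in dB (the text does not state the unit); the clean signal is a hypothetical test tone rather than a recorded word.

```matlab
fs = 10000;                          % sampling frequency used in the appendices
t  = (0:fs-1)' / fs;                 % one second of sample times
x  = 0.5 * sin(2*pi*200*t);          % hypothetical clean tone, not a real recording
snr_dB = 1;                          % assumed to be in dB

sigPower   = mean(x.^2);
noisePower = sigPower / 10^(snr_dB/10);       % noise power needed for the target SNR
noise = randn(size(x));
noise = noise * sqrt(noisePower / mean(noise.^2));  % rescale to exactly that power
y = x + noise;                       % noisy signal passed to the recognizer

% Check: SNR actually obtained, in dB.
measured_dB = 10 * log10(mean(x.^2) / mean((y - x).^2));
```

Repeating the recognition test over a range of `snr_dB` values would give one row of a success-probability table per SNR level.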
Waveform of the voice recorded in the testing phase
CHAPTER FIVE
The goal of this paper was to create a voice recognition system and apply it to the speech of an
unknown speaker. The extracted features of the unknown speech are investigated and then compared
with the stored extracted features for each different voice in order to identify the unknown
voice. The feature extraction is done using MFCC (Mel Frequency Cepstral Coefficients). The
combined approach provides us with a faster speaker identification process than the MFCC-only
approach or the FFT approach. However, it could not achieve the required result.
5.1 Applications
Single-purpose command and control systems, such as voice dialing for cellular, home, and office
phones, are applications where multi-function computers (PCs) are redundant.
Some of the applications of speaker verification systems are:
• Time and Attendance Systems
• Telephone-Banking/Broking
• Forensic purposes
Voice-based telephone dialing is one of the applications we simulated. The key focus of this
application is to aid the physically challenged in executing a mundane task like telephone
dialing. Here the user initially trains the system by uttering the digits from 0 to 9. Once the
system has been trained, it can recognize the digits uttered by the user who trained it. This
system also adds some inherent security, as a system based on the cepstral approach is speaker
dependent. The algorithm was run on a particular speaker and the MFCC coefficients were
determined. The algorithm was then applied to a different speaker, and the mismatch was clearly
observed. Thus the inherent security provided by the system was confirmed.
Presently, systems have also been designed which incorporate both speech and speaker recognition.
Typically a user passes two levels of check. She/he first has to speak the right password to gain
access to a system. The system verifies not only that the correct password has been said but also
the authenticity of the speaker. The ultimate goal is to have a system which performs speech,
iris, and fingerprint recognition to implement access control.
5.2 Recommendation
This paper focused on “Isolated voice Recognition”. But we feel the idea can be extended to
“Continuous voice Recognition” and, ultimately, to a Language Independent Recognition System
based on algorithms which make these systems robust, such as statistical models used together
with MFCC features.
It is important to note that this design is limited by the knowledge in its database; when we
simulated the project, it did not exactly recognize the required output.
The detection used in this work is based only on the frame energy in MFCC, which is not good
for a noisy environment with low SNR. The error rate in determining the beginning and end of
speech segments will increase greatly, which directly influences the recognition performance in
the pattern recognition stage. So, a more effective detection method should be used. One such
method could be a statistical approach that finds a distribution which can separate the noise
and the speech from each other.
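The idea can be illustrated with the rough sketch below, which is one simple option and not the detector used in this work. It computes the energy of each frame and then sets the speech threshold from the statistics of the noise rather than from a fixed value. The frame length, the assumption that the first ten frames contain only noise, the three-standard-deviation rule, and the synthetic signal are all hypothetical choices made for illustration.

```matlab
fs = 10000;                                   % sampling frequency used in the appendices
frameLen = 200;                               % assumed 20 ms frames
n = (0:fs-1)';
x = 0.01 * randn(fs, 1);                      % synthetic background noise
burst = 3001:6000;                            % synthetic "speech" occupies frames 16 to 30
x(burst) = x(burst) + 0.5 * sin(2*pi*200*n(burst)/fs);

% Split into non-overlapping frames and compute the energy of each frame.
numFrames = floor(length(x) / frameLen);
frames = reshape(x(1:numFrames*frameLen), frameLen, numFrames);
energy = sum(frames.^2, 1);

% Estimate the noise energy distribution from the leading frames
% (assumed to be silence) and place the threshold above it.
noiseEnergy = energy(1:10);
threshold = mean(noiseEnergy) + 3 * std(noiseEnergy);

isSpeech   = energy > threshold;              % frame-by-frame speech decision
firstFrame = find(isSpeech, 1, 'first');      % estimated beginning of speech
lastFrame  = find(isSpeech, 1, 'last');       % estimated end of speech
```

Because the threshold follows the measured noise level, it adapts when the background becomes louder, although isolated noise frames can still be flagged and a more complete statistical model of noise and speech would be needed at very low SNR.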
REFERENCES
APPENDIX
APPENDIX: A
MFCC MATLAB CODE
fs = 10000;% Sampling Frequency
t = hamming(4000);% Hamming window to smooth the speech signal
w = [t ; zeros(6000,1)];
f = (1:10000);
mel(f) = 2595 * log10(1 + f / 700);% Linear to Mel frequency scale conversion (base-10 logarithm)
tri = triang(100);
win1 = [tri ; zeros(9900,1)];% Defining overlapping triangular windows for
win2 = [zeros(50,1) ; tri ; zeros(9850,1)];% frequency domain analysis
win3 = [zeros(100,1) ; tri ; zeros(9800,1)];
win4 = [zeros(150,1) ; tri ; zeros(9750,1)];
win5 = [zeros(200,1) ; tri ; zeros(9700,1)];
win6 = [zeros(250,1) ; tri ; zeros(9650,1)];
win7 = [zeros(300,1) ; tri ; zeros(9600,1)];
win8 = [zeros(350,1) ; tri ; zeros(9550,1)];
win9 = [zeros(400,1) ; tri ; zeros(9500,1)];
win10 = [zeros(450,1) ; tri ; zeros(9450,1)];
win11 = [zeros(500,1) ; tri ; zeros(9400,1)];
win12 = [zeros(550,1) ; tri ; zeros(9350,1)];
win13 = [zeros(600,1) ; tri ; zeros(9300,1)];
win14 = [zeros(650,1) ; tri ; zeros(9250,1)];
win15 = [zeros(700,1) ; tri ; zeros(9200,1)];
win16 = [zeros(750,1) ; tri ; zeros(9150,1)];
win17 = [zeros(800,1) ; tri ; zeros(9100,1)];
win18 = [zeros(850,1) ; tri ; zeros(9050,1)];
win19 = [zeros(900,1) ; tri ; zeros(9000,1)];
win20 = [zeros(950,1) ; tri ; zeros(8950,1)];
x = wavrecord(1 * fs, fs,'double');% Record and store the uttered speech
plot(x);
wavplay(x);
i = 1;
while abs(x(i)) <0.05% Silence detection
i = i + 1;
end
x(1 : i) = [];
x(6000 : 10000) = 0;
x1 = x.* w;
mx = fft(x1);% Transform to frequency domain
nx = abs(mx(floor(mel(f))));% Mel warping
nx = nx./ max(nx);
nx1 = nx.* win1;
nx2 = nx.* win2;
nx3 = nx.* win3;
nx4 = nx.* win4;
nx5 = nx.* win5;
nx6 = nx.* win6;
nx7 = nx.* win7;
nx8 = nx.* win8;
nx9 = nx.* win9;
nx10 = nx.* win10;
nx11 = nx.* win11;
nx12 = nx.* win12;
nx13 = nx.* win13;
nx14 = nx.* win14;
nx15 = nx.* win15;
nx16 = nx.* win16;
nx17 = nx.* win17;
nx18 = nx.* win18;
nx19 = nx.* win19;
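The twenty overlapping windows defined one by one above could also be built in a loop, as in the sketch below. It is an alternative layout under stated assumptions, not part of the original program: the triangle is written out by hand so that no toolbox is needed and is intended to match triang(100), the matrix W and the names hop, numWin and bandEnergy are hypothetical, and a constant vector stands in for the normalized spectrum nx.

```matlab
N = 10000;  L = 100;  hop = 50;  numWin = 20;

% Even-length triangular window written out explicitly.
k = (1:L)';
tri = [(2*k(1:L/2) - 1) / L; 2 - (2*k(L/2+1:L) - 1) / L];

% Column m of W plays the role of win<m>: the triangle shifted by (m-1)*hop samples.
W = zeros(N, numWin);
for m = 1:numWin
    first = (m - 1) * hop + 1;
    W(first:first+L-1, m) = tri;
end

nx = ones(N, 1);                              % placeholder for the normalized spectrum
% Energy in each band, analogous to sum(ny1.^2), sum(ny2.^2), ... in the testing code.
bandEnergy = sum((repmat(nx, 1, numWin) .* W).^2, 1);
```

Keeping the windows as the columns of one matrix replaces the twenty separate variables with a single array, and the same loop covers any number of bands.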
APPENDIX: B
Testing Code:
fs = 10000;% Sampling Frequency
t = hamming(4000);% Hamming window to smooth the speech signal
w = [t ; zeros(6000,1)];
f = (1:10000);
mel(f) = 2595 * log10(1 + f / 700);% Linear to Mel frequency scale conversion (base-10 logarithm)
tri = triang(100);
win1 = [tri ; zeros(9900,1)];% Defining overlapping triangular windows for
win2 = [zeros(50,1) ; tri ; zeros(9850,1)];% frequency domain analysis
win3 = [zeros(100,1) ; tri ; zeros(9800,1)];
win4 = [zeros(150,1) ; tri ; zeros(9750,1)];
win5 = [zeros(200,1) ; tri ; zeros(9700,1)];
win6 = [zeros(250,1) ; tri ; zeros(9650,1)];
win7 = [zeros(300,1) ; tri ; zeros(9600,1)];
win8 = [zeros(350,1) ; tri ; zeros(9550,1)];
win9 = [zeros(400,1) ; tri ; zeros(9500,1)];
win10 = [zeros(450,1) ; tri ; zeros(9450,1)];
win11 = [zeros(500,1) ; tri ; zeros(9400,1)];
win12 = [zeros(550,1) ; tri ; zeros(9350,1)];
win13 = [zeros(600,1) ; tri ; zeros(9300,1)];
win14 = [zeros(650,1) ; tri ; zeros(9250,1)];
win15 = [zeros(700,1) ; tri ; zeros(9200,1)];
win16 = [zeros(750,1) ; tri ; zeros(9150,1)];
win17 = [zeros(800,1) ; tri ; zeros(9100,1)];
win18 = [zeros(850,1) ; tri ; zeros(9050,1)];
win19 = [zeros(900,1) ; tri ; zeros(9000,1)];
win20 = [zeros(950,1) ; tri ; zeros(8950,1)];
y = wavrecord(1 * fs, fs,'double');%Store the uttered password for authentication
i = 1;
while abs(y(i)) < 0.05% Silence Detection
i = i + 1;
end
y(1 : i) = [];
y(6000 : 10000) = 0;
y1 = y.* w;
my = fft(y1);% Transform to frequency domain
ny = abs(my(floor(mel(f))));% Mel warping
ny = ny./ max(ny);
ny1 = ny.* win1;
ny2 = ny.* win2;
ny3 = ny.* win3;
ny4 = ny.* win4;
ny5 = ny.* win5;
ny6 = ny.* win6;
ny7 = ny.* win7;
ny8 = ny.* win8;
ny9 = ny.* win9;
ny10 = ny.* win10;
ny11 = ny.* win11;
ny12 = ny.* win12;
ny13 = ny.* win13;
ny14 = ny.* win14;
ny15 = ny.* win15;
ny16 = ny.* win16;
ny17 = ny.* win17;
ny18 = ny.* win18;
ny19 = ny.* win19;
ny20 = ny.* win20;
sy1 = sum(ny1.^ 2);
sy2 = sum(ny2.^ 2);
sy3 = sum(ny3.^ 2);
sy4 = sum(ny4.^ 2);
sy5 = sum(ny5.^ 2);
%wavplay(Deny);
end
APPENDIX: C
Voice Recording MATLAB Code:
for i = 1:10
    file = sprintf('%s%d.wav','g',i);
    input('You have 2 seconds to say your name. Press enter when ready to record--> ');
    y = wavrecord(88200,44100);% Record 2 seconds at 44.1 kHz
    sound(y,44100);
    wavwrite(y,44100,file);
end
APPENDIX: D
Training and Testing Code:
name = input ('Enter the name that must be recognized -- >','s');
ytemp = zeros (88200,20);
r = zeros (10,1);
for j = 1:10
    file = sprintf('%s%d.wav','g',j);
    [t, fs] = wavread(file);
    s = abs (t);
    start = 1;
    last = 88200;
    % Find the first loud sample and start 7000 samples before it.
    for i = 1:88200
        if s (i) >= .1 && i <= 7000
            start = 1;
            break
        end
        if s (i) >= .1 && i > 7000
            start = i - 7000;
            break
        end
    end
    % Find the last loud sample and stop 7000 samples after it.
    for i = 1:88200
        k = 88201 - i;
        if s (k) >= .1 && k >= 81200
            last = 88200;
            break
        end
        if s (k) >= .1 && k < 81200
            last = k + 7000;
            break
        end
    end
    r (j) = last - start;
    ytemp (1: last - start + 1,2 * j) = t (start:last);
    ytemp (1: last - start + 1,(2*j - 1)) = t (start:last);
end
% The rows of the matrix are truncated to the smallest length
% of the 10 recordings.
y = zeros (min (r),20);
for i = 1:20
    y (:,i) = ytemp (1:min (r),i);
end
% Convert the individual columns into frequency
% domain by applying the Fast Fourier Transform.
% Then take the modulus squared of all the entries in the matrix.
fy = fft (y);
fy = fy.*conj (fy);
% Normalize the spectra of each recording and place into the matrix fn.
% Only frequencies up to 600 are needed to represent the speech of most
% humans.
fn = zeros (600,20);
for i = 1:20
    fn (1:600,i) = fy (1:600,i)/sqrt(sum (abs (fy (1:600,i)).^2));
end
% Find the average vector pu
pu = zeros (600,1);
for i = 1:20
    pu = pu + fn (1:600,i);
end
pu = pu/20;
% Normalize the average vector