
UNIVERSITY OF GAZIANTEP

SPEECH RECOGNITION

EEE 499 GRADUATION PROJECT

IN

ELECTRICAL & ELECTRONICS ENGINEERING

SUBMITTED TO

Doç. Dr. ERGUN ERÇELEBİ

BY

HÜSEYİN ÇELİK

Fall 2008
ABSTRACT

In this study, the properties of the human voice and the issue of speech recognition have been studied, and a software speech recognition tool was designed with the MATLAB program.

The tool was designed to be command-based and speaker-dependent, with a small vocabulary of at most 50 voice commands.

The project is mainly about the process of defining a mathematical equivalent model of a voice command, which must be unique to that voice command. This is followed by creating a library (database) of the models of the voice commands that are going to be used. The recognition process compares the unique model of the spoken voice command with the models in the library (database); after this comparison, the best acceptable match that is above a recognition threshold percentage is assigned as the recognized command.

In this project, two different methods were used for defining an equivalent model of a voice command:

• Linear Predictive Coding Method (LPC)


• Mel Frequency Cepstrum Coefficients Method (MFCC)

ÖZET

In this study, the properties of the human voice and the basic principles of speech recognition were examined, and a speech recognition program was developed on the computer using MATLAB.

The developed program was designed to be command-based and speaker-dependent, covering a small vocabulary of 50 voice commands or fewer.

In general, the project is about constructing a mathematical equivalent model of a voice command that is unique to that command. A library (database) is then built from the equivalent models of the commands to be used in the program. In the recognition stage, the equivalent model of the voice command to be recognized is compared with the models in the library, and the command that gives the best match is displayed on the screen as the recognized command.

In this project, two different methods were tried for constructing the mathematical equivalent model of a voice command:

• Linear Prediction Analysis Method (LPC)

• Mel (Melody) Frequency Cepstral Coefficients Analysis Method (MFCC)

TABLE OF CONTENTS

ABSTRACT …………………………………………………………………………....ii

ÖZET …………………………………………………………………………………..iii

1 CHAPTER I : INTRODUCTION AND OBJECTIVES .................................. 1

2 CHAPTER II : BASIC ACOUSTICS AND SPEECH SIGNAL ..................... 3


2.1 The Speech Signal ............................................................................................ 3
2.2 Speech Production ............................................................................................ 5
2.3 Properties of Human Voice............................................................................... 6

3 CHAPTER III : SPEECH RECOGNITION...................................................... 7


3.1 Speech Recognition Tool.................................................................................. 8
3.2 Main Block Diagram of Speech Recognition Tool .......................................... 9
3.3 Speech Processing........................................................................................... 10
3.3.1 Speech Representation in Computer Environment ................................. 10
3.3.2 Symbolic Representation of a Speech Signal ......................................... 12
3.3.2.1 Pre-Works on the Recorded Sound................................ 13
3.3.2.2 Feature Extracting........................................................................... 16
3.3.2.3 Fingerprint Calculation ................................................................... 23
3.4 Fingerprint Processing ................................................................................... 24
3.4.1 Fingerprints Library................................................................................ 25
3.4.2 Fingerprints Comparison ........................................................................ 25
3.4.3 Decision .................................................................................................. 30

4 CHAPTER IV : Project Demonstration with Code Explanations ................. 31


4.1 Training Part and Building Commands Database........................................... 31
4.1.1 Matlab Functions ( .m files).................................................................... 34
4.2 Recognition Part ............................................................................................. 36
4.2.1 Matlab Functions ( .m files).................................................................... 39

5 CHAPTER V : CONCLUSION......................................................................... 41

6 APPENDICES ...................................................................................................... 43
6.1 Appendix 1 : References............................................................................... 43
6.2 Appendix 2 : Test Results............................................................................. 45
6.3 Appendix 3 : Matlab Codes .......................................................................... 56
1 CHAPTER I : INTRODUCTION AND OBJECTIVES

Speech is the most natural way for humans to communicate. While this has been true since the dawn of civilization, the invention and widespread use of the telephone, audio-phonic storage media, radio, and television have given even further importance to speech communication and speech processing.

The advances in digital signal processing technology have led to the use of speech processing in many different application areas such as speech compression, enhancement, synthesis, and recognition. In this project, the issue of speech recognition is studied and a speech recognition system is developed with the MATLAB program.

Speech recognition can simply be defined as the representation of a speech signal via a limited number of symbols. The aim here is to find the written equivalent of the signal, and each voice command must have a unique equivalent model. This is the main and the most important part of this study.

Speech recognition presents great advantages for human-computer interaction. It is easy to obtain speech data, and it does not require special skills such as using a keyboard, entering data by clicking buttons in GUI programs, and so on. Transferring text data into electronic media using speech is about 8-10 times faster than handwriting, and about 4-5 times faster than the most skilled typist using a keyboard. Moreover, the user can continue entering text while moving or doing any work that requires her to use her hands. Since a microphone or a telephone can be used, it is more economical to enter data, and it is possible to enter data from a remote point via telephone.

In this project,

• A speaker-dependent, small-vocabulary, isolated-word speech recognition system for noise-free environments was developed.
• To make the program practical and suitable for industrial use, an executable Windows file (.exe) with a graphical user interface (GUI) was built.
• One of the main advantages of the designed tool is that you do not need to press a button each time you are going to say a voice command. The tool provides the user with a continual recording and recognition process.
• It also provides the user with an easy way of using the tool. New voice commands can easily be added to the system, and the tool shows the steps of the processes.
• The featuring method and its parameters (order, frame length, recognition threshold percentage) can be easily changed.

The project is reported in the following chapters:

• Chapter I gives an introduction to the project and its objectives.
• Chapter II introduces the general view of a speech signal, speech production, and the properties of the human voice.
• Chapter III is devoted to speech processing and the feature extraction techniques used in speech recognition, and introduces the Speech Recognition Tool.
• Chapter IV presents the project demonstration, a sample training and recognition session using the developed GUI application, with several screenshots.
• Chapter V gives the conclusion, discusses the project, and mentions future work.
• Appendices: References, Test Results, Matlab Codes.

2 CHAPTER II : BASIC ACOUSTICS AND SPEECH SIGNAL

As relevant background to the field of speech recognition, this chapter discusses how the speech signal is produced and perceived by human beings. This is an essential subject that has to be considered before one can decide which approach to use for speech recognition.

2.1 The Speech Signal

Human communication can be seen as a comprehensive process from speech production to speech perception between the talker and the listener; see Figure II.1.

Figure II.1 - Schematic Diagram of the Speech Production/Perception Process.

It involves five different elements: A. Speech formulation, B. Human vocal mechanism, C. Acoustic air, D. Perception by the ear, E. Speech comprehension.

The first element (A. Speech formulation) is associated with the formulation of the speech signal in the talker's mind. This formulation is used by the human vocal mechanism (B. Human vocal mechanism) to produce the actual speech waveform. The waveform is transferred via the air (C. Acoustic air) to the listener. During this transfer the acoustic wave can be affected by external sources, for example noise, resulting in a more complex waveform. When the wave reaches the listener's hearing system (the ears), the listener perceives the waveform (D. Perception by the ear), and the listener's mind (E. Speech comprehension) starts processing this waveform to comprehend its content, so the listener understands what the talker is trying to tell him.

One issue in speech recognition is to "simulate" how the listener processes the speech produced by the talker. Several actions take place in the listener's head and hearing system during the processing of speech signals. The perception process can be seen as the inverse of the speech production process.

The basic theoretical units for describing how linguistic meaning is brought to the formed speech, in the mind, are called phonemes. Phonemes can be grouped based on the properties of either the time waveform or the frequency characteristics, and classified into the different sounds produced by the human vocal tract.

Speech:
• is a time-varying signal,
• is a well-structured communication process,
• depends on known physical movements,
• is composed of known, distinct units (phonemes),
• is different for every speaker,
• may be fast, slow, or varying in speed,
• may have high pitch, low pitch, or be whispered,
• is subject to widely-varying types of environmental noise,
• may not have distinct boundaries between units (phonemes),
• has an unlimited number of words.

2.2 Speech Production

To be able to understand how the production of speech is performed, one needs to know how the human vocal mechanism is constructed; see Figure II.2. The most important parts of the human vocal mechanism are the vocal tract together with the nasal cavity, which begins at the velum. The velum is a trapdoor-like mechanism that is used to formulate nasal sounds when needed. When the velum is lowered, the nasal cavity is coupled together with the vocal tract to formulate the desired speech signal. The cross-sectional area of the vocal tract is limited by the tongue, lips, jaw, and velum and varies from 0 to 20 cm².

When humans produce speech, air is expelled from the lungs through the trachea.
The air flowing from the lungs causes the vocal cords to vibrate and by forming the
vocal tract, lips, tongue, jaw and maybe using the nasal cavity, different sounds can be
produced.

Figure II.2 - Human Vocal Mechanism

2.3 Properties of Human Voice

One of the most important parameters of sound is its frequency. Sounds are discriminated from each other with the help of their frequencies. When the frequency of a sound increases, the sound gets high-pitched and piercing; when the frequency decreases, the sound gets deeper.

Sound waves are waves that arise from the vibration of materials. The highest frequency that a human can produce is about 10 kHz, and the lowest is about 70 Hz. These are the maximum and minimum values; this frequency interval changes from person to person. The magnitude of a sound is expressed in decibels (dB).

Normal human speech has a frequency interval of 100 Hz - 3200 Hz, and its magnitude is in the range of 30 dB - 90 dB.

The human ear can perceive sounds in the frequency range between 16 Hz and 20 kHz, and a frequency change of about 0.5% is the sensitivity limit of the human ear.

Speaker Characteristics,

• Due to differences in vocal tract length, male, female, and children's speech are different.

• Regional accents appear as differences in resonant frequencies, durations, and pitch.

• Individuals have resonant frequency patterns and duration patterns that are unique (allowing us to identify the speaker).

• Training on data from one type of speaker automatically "learns" that group's or person's characteristics, and makes recognition of other speaker types much worse.

3 CHAPTER III : SPEECH RECOGNITION

The main goal of a speech recognition system is to substitute for a human listener, although it is very difficult for an artificial system to achieve the flexibility offered by the human ear and human brain. Thus, speech recognition systems need to have some constraints. For instance, the number of words is a constraint for a word-based recognition system. In order to increase the performance of the recognition, the process is dealt with in parts, and research is concentrated on those parts.

Speech recognition is the process of extracting linguistic information from speech signals. The linguistic information, which is the most important element of speech, is called phonetic information.

Although this project is a command-based speech recognition system, it is very difficult to identify a voice command by investigating it as a whole unit. Here, I preferred a finer-grained approach, as is done in phoneme-based systems. A single word (voice command) consists of phonemes; that is why we are going to investigate the voice commands in 30 ms intervals.

The working principle of speech recognition systems is roughly based on the comparison of input data to prerecorded patterns. These patterns are the equivalent models of the voice commands saved in the training process. By this comparison, the pattern to which the input data is most similar is accepted as the symbolic representation of the data. The preprocessing step is called Feature Extraction. First, short-time feature vectors are obtained from the input speech data, and then these vectors are compared to the patterns classified prior to comparison. The feature vectors extracted from the speech signal are required to best represent the speech data, to be of a size that can be processed efficiently, and to have distinct characteristics. Thus, obtaining a very clear distinction of speech is the main goal of feature vector extraction.

3.1 Speech Recognition Tool

My Speech Recognition Tool consists of 4 main parts.

Figure III.1 – 4 main parts of a speech recognition system

In the Training process, the voice commands that are going to be used in recognition are defined to the system.

The second part is creating a library for the trained voice commands; they are stored in the library.

The third part is the Recognition process: you say a command, and the fingerprint of this command is compared with the fingerprints of the commands in the library. The best acceptable match that is above a threshold percentage is assigned as the recognized command.

The last part is Processing: while training the system, you can also assign functionalities to the commands, and processing is the part where the function of the recognized command is executed.

3.2 Main Block Diagram of Speech Recognition Tool

[Block diagram: Microphone → Recorded Sound → Pre-Works → Pure Speech Signal → Feature Extracting → Current FingerPrint → FingerPrints Comparison (against the Database FingerPrints in the FingerPrints Library) → Comparison Distances → Decision → Best Match (Recognized → Speech Processing → Function of the Command → Process Return) or Under Matched (Not Recognized)]

Figure III.10 - Main Block Diagram of Speech Recognition Tool

3.3 Speech Processing

3.3.1 Speech Representation in Computer Environment

The speech sound of a human is brought into the computer through a microphone. Here the microphone input of the computer is used as an analog input unit. The sound waves are captured by the microphone as an analog input, and then the analog speech signal is converted to a digital signal. In this project, a sampling frequency of 8000 Hz and 256 quantization levels (8 bits/sample) are used.

The speech signal and all its characteristics can be represented in two different domains: the time domain and the frequency domain.

See Figure III.2,

III.2 a) Representation of a speech signal in time domain

III.2 b) Representation of a speech signal in frequency domain

III.2 c) Spectrogram of a speech signal

Figure III.2 - Representation of speech signal in computer environment

3.3.2 Symbolic Representation of a Speech Signal

This is the part where we obtain the unique equivalent model of a speech signal.

First, the sound is recorded with the microphone; then the digitized recorded signal is processed and modified. Finally, a unique equivalent model is obtained, which we call the "fingerprint".

Figure III.3 - Processes applied for the symbolic representation of a speech signal.

The first part applies some pre-processes to the recorded signal. The aim is to obtain a pure speech signal that is purified from noise, with the silence parts removed from the whole recording. This is one of the most important and difficult parts of the project. The start and end points of the speech must be found correctly: there should not be any silence at the beginning or at the end of the speech, and the final speech signal (voice command) must consist only of the unvoiced and voiced parts of the speech. See Figure III.4.

The second part is obtaining the fingerprint of the voice command. The signal is split into small parts (frames) that are 30 ms in length, and all these frames are investigated separately by looking into their time-domain, frequency-domain, and power-domain properties. Some modifications are applied to these frames; finally, all the frames are combined together and represent the fingerprint of the voice command, which is expected to be unique for that voice command.

3.3.2.1 Pre-Works on the Recorded Sound

Figure III.4 - Block Diagram of Pre-Works on the recorded Speech Signal.

A) Band-Pass Filter

Because of the electrical layout of the computer and the environment, a strong noise called 50 Hz noise occurs, and we must reject it. As noted earlier, the frequency of the human voice lies in the range 100 Hz - 3200 Hz, so a band-pass filter with cutoff frequencies of 70 Hz and 3200 Hz is used. An FIR-type digital filter is used.
Another method is also used in this part to remove the background noise: the signal is converted to a .wav file and reconstructed. This method is really effective for rejecting the background noise.
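As a rough illustration (my own sketch, not the project's exact filter design; the filter order of 128 is an assumption), such a band-pass FIR filter can be built in MATLAB with fir1:

    Fs = 8000;                    % sampling frequency used in the project
    Wn = [70 3200] / (Fs/2);      % cutoffs normalized to the Nyquist frequency
    b  = fir1(128, Wn);           % two-element Wn gives a band-pass FIR design
    y  = filter(b, 1, x);         % x: the recorded speech vector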

B) Pre-emphasis Filter

The digitized speech signal is also processed by a first-order digital network in order to spectrally flatten the signal. The unvoiced frames have high frequency but low energy; in order to investigate unvoiced frames we use a pre-emphasis filter. This filter is easily implemented in the time domain by taking a difference:

Y(n) = S(n) − 0.97 · S(n − 1)

Now we have a pure voice command "Y(n)", and it is ready for the feature extraction process.
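In MATLAB, this difference equation is a one-line first-order FIR filter (my own illustration):

    % Pre-emphasis: y(n) = s(n) - 0.97*s(n-1)
    y = filter([1 -0.97], 1, s);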

C) Short-Time Energy

After filtering the signal, we find its short-time energy. This process is required for defining the starting and ending points of the speech.
The signal is separated into frames of 40 samples each, and the energy of each frame is found by summing the absolute values of 40 consecutive samples. This process continues over the whole signal.

The Energy of the signal is found with the below equation,

P(n) = \frac{1}{2 \cdot 40} \left( \sum_{i=1}^{40} \left| X\left(i + 40(n-1)\right) \right| \right)^{2}, \qquad n = 1, 2, 3, \ldots, \frac{\operatorname{length}(X)}{40}

D) Start – End Point Detection

After finding the short-time energy of the signal, we find a threshold value from the mean value of the energy and we reject the frames that are below this value, because the threshold is the value at which the signal passes from the silence part to the speech part. We are interested in the voiced part of the recorded signal; the rest of the signal does not contain any required information. Moreover, any silence part at the beginning or at the end of the signal will cause recognition failures, so finding the correct points for the beginning and end of the signal is very important.

I use this equation to find a good threshold level,

\text{Threshold} = \frac{1}{4} \cdot \frac{1}{l} \sum_{n=1}^{l} P(n), \qquad l = \frac{\operatorname{length}(X)}{40}

and the frames whose energy exceeds the threshold are kept as the speech signal,

S(i) = X\left[\, 40 \cdot \left( P(n) > \text{Threshold} \right) \,\right], \qquad i = 1, 2, 3, \ldots
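A compact MATLAB sketch of this energy-based trimming (my own rendering of the steps above, not the project's exact Featuring.m code; x is the filtered recording):

    frameLen = 40;
    nFrames  = floor(length(x) / frameLen);
    x = x(1:nFrames*frameLen);                    % drop the leftover samples
    P = zeros(1, nFrames);
    for n = 1:nFrames
        seg  = x((n-1)*frameLen + (1:frameLen));  % 40 consecutive samples
        P(n) = sum(abs(seg))^2 / (2*frameLen);    % short-time energy P(n)
    end
    T = mean(P) / 4;                              % threshold = mean energy / 4
    mask = kron(P > T, ones(1, frameLen));        % expand frame decisions to samples
    s = x(logical(mask));                         % keep only above-threshold frames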

3.3.2.2 Feature Extracting

This stage is often referred to as the speech-processing front end. The main goal of feature extraction is to simplify recognition by summarizing the vast amount of speech data without losing the acoustic properties that define the speech. See Figure III.5.

Figure III.5 - The Block Diagram of Feature Extracting Process.

A) Frame Blocking

Investigations show that speech signal characteristics stay stationary over a sufficiently short time interval (this is called quasi-stationary). For this reason, speech signals are processed in short time intervals. This time interval must be chosen very carefully: the properties of the sound should not change much within it, yet it should be long enough to give sufficient information about that frame. So the signal is divided into frames of about 30 ms length; with an 8000 Hz sampling frequency, 30 ms = 240 samples. Each frame overlaps its previous frame by a predefined size, defined here as half of the frame length. The goal of the overlapping scheme is to smooth the transition from frame to frame. See Figure III.6.
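A sketch of this frame blocking in MATLAB (illustrative only; the variable names are my own, and s is the trimmed speech from the previous sketch):

    N   = 240;                      % frame length: 30 ms at 8000 Hz
    hop = N/2;                      % half-frame overlap between frames
    idx = 1:hop:length(s)-N+1;      % start index of each frame
    frames = zeros(numel(idx), N);
    for k = 1:numel(idx)
        frames(k, :) = s(idx(k) : idx(k)+N-1);
    end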

B) Frame Windowing

Each frame is multiplied by an N-sample window W(n); here I used a Hamming window. This window is used to minimize the adverse effects of chopping an N-sample section out of the running speech signal: while creating the frames, chopping N samples out of the running signal may distort the signal parameters, and windowing minimizes this effect. Windowing also smooths the side-band lobes of the formant frequencies and eliminates discontinuities at the edges of the frames. See Figure III.7.

W(n) = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right), \qquad 0 \le n \le N-1

Each frame is multiplied element by element by the window function:

S(n) = X(n) \cdot W(n)
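Continuing the frame-blocking sketch above, the windowing step is a simple element-wise product:

    % Hamming window applied to every frame (element-wise multiplication)
    w = 0.54 - 0.46*cos(2*pi*(0:N-1)/(N-1));
    framesW = frames .* repmat(w, size(frames, 1), 1);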

Figure III.6 - Frame blocking: the speech signal is separated into 7 frames, each T ms in length.

Figure III.7 - Windowing: the frames are multiplied by the window function.

C) Featuring Method

This is the part where the framed and windowed speech signal is converted to its symbolic representation vector (fingerprint vector). Here we have two different methods: the MFCC method and the LPC method.

1) MFCC – Mel Frequency Cepstrum

The MFC coefficients are obtained from the log magnitude spectrum of the signal, mapped onto the mel frequency scale, by taking the Discrete Cosine Transform.

Figure III.8 – Block Diagram of Obtaining Mel Frequency Cepstral Coefficients.

a) Fast Fourier Transform ( FFT )

The next important step in the processing of the signal is to obtain a frequency spectrum of each block. The information in the frequency spectrum is often enough to identify the frame. The purpose of the frequency spectrum is to identify the formants, which are the peaks in the frequency spectrum. One method to obtain a frequency spectrum is to apply an FFT to each block. The resulting information can be examined manually to find the peaks, but it is quite noisy, which makes the task difficult for a computer. The FFT of each frame is obtained from the formula below:

X(\omega) = \frac{1}{T} \sum_{n=1}^{T} X(n)\, e^{-j\omega n}

b) Mel-frequency Wrapping

The human ear perceives frequencies non-linearly. Research shows that the scaling is linear up to 1 kHz and logarithmic above that. The Mel-scale (melody scale) filter bank, which characterizes how the human ear perceives frequency, is shown in Figure III.9. It is used as a band-pass filter bank for this stage of identification. The signal of each frame is passed through the mel-scaled band-pass filters to mimic the human ear.

m = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right), \qquad f \to \text{normal frequency (Hz)}, \quad m \to \text{mel-scaled frequency}

Figure III.9 - Mel-Scaled Filter Bank.

c) Mel Frequency Cepstral Coefficients

As the final step, each frame is inverse Fourier transformed to take it back to the time domain. Instead of the inverse FFT, the Discrete Cosine Transform is used, as it is more appropriate.

The discrete form for a signal x(n) is defined as

X_k = \sum_{n=0}^{N-1} X_n \cos\left[ \frac{\pi}{N} \left( n + \frac{1}{2} \right) k \right], \qquad k = 0, 1, 2, \ldots, N-1

As a result of this process, the Mel-Frequency Cepstral Coefficients are obtained. These coefficients are called feature vectors. In this project, 12 mel frequency cepstral coefficients and 12 delta cepstral coefficients are generated per frame, and these are used as the feature matrix; I call this 12th-order MFCC. This is the default value for the MFCC order, and the order can be easily changed from the control panel (GUI). After the cepstral coefficients were generated, a cepstral mean normalization (CMN) was done to get rid of the bias present across the coefficients. The fingerprint matrix is an n-by-m matrix, consisting of n frames with m coefficients in each frame. Here m is equal to 2 times the order; the default is m = 2 × 12 = 24.
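As an illustration, this whole MFCC pipeline can be driven by the melcepst function of the VOICEBOX toolbox cited in the references; the exact call below is my assumption, not necessarily the project's:

    fs = 8000;
    c  = melcepst(s, fs, 'Md', 12);             % 12 mel cepstra per frame, 'd' appends 12 deltas
    c  = c - repmat(mean(c, 1), size(c, 1), 1); % cepstral mean normalization (CMN)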

2) LPC – Linear Predictive Coding

LPC models the speech production process as a linear sum of earlier samples: a digital filter driven by an excitation signal. An alternative explanation is that linear prediction filters attempt to predict future values of the input signal based on past samples.

Linear Predictive Coding is thus a method to model the vocal tract filter. This vocal tract filter is the model H(z) below; it is an all-pole filter, consisting only of poles.

X(n) \to H(z) \to \tilde{X}(n)

H(z) = \frac{1}{A(z)}, \qquad A(z) = 1 - \sum_{j=1}^{p} a_j z^{-j}, \qquad p \to \text{order}

a = [\, 1,\ a(2),\ a(3),\ \ldots,\ a(p+1) \,]

\tilde{X}(n) = -a(2)\, X(n-1) - a(3)\, X(n-2) - \ldots - a(p+1)\, X(n-p)

With these equations, LPC estimates the current value x[n] from previous values of the sequence. In this project an order of 11 is used, which is the order of the digital filter that describes the featuring vector; the order can be easily changed from the control panel (GUI). After finding the LPC filter coefficients for each frame, these coefficients are converted to a digital FIR-type filter, and the fingerprint matrix is created (an n-by-m matrix), consisting of n frames with m coefficients in each frame. Here m is equal to the frame length (samples per frame) divided by 12; the default is m = 240 / 12 = 20.
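A per-frame sketch of one plausible reading of this step (the conversion of the LPC coefficients to a fixed-length FIR response via freqz is my assumption, not the project's documented code):

    p = 11;                              % LPC order (project default)
    m = N / 12;                          % 240/12 = 20 coefficients per frame
    F = zeros(size(framesW, 1), m);
    for k = 1:size(framesW, 1)
        a = lpc(framesW(k, :), p);       % [1, a(2), ..., a(p+1)]
        F(k, :) = abs(freqz(a, 1, m))';  % m-point response of the FIR filter A(z)
    end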

3.3.2.3 Fingerprint Calculation

Fingerprint matrices consist of n frames with m coefficients in each frame.

1) MFCC – Mel Frequency Cepstral Coefficients

F_{MFCC} = \begin{bmatrix} f_{11} & f_{12} & f_{13} & \cdots & f_{1m} \\ f_{21} & f_{22} & & & \vdots \\ f_{31} & & \ddots & & \vdots \\ \vdots & & & \ddots & \vdots \\ f_{n1} & \cdots & \cdots & \cdots & f_{nm} \end{bmatrix}_{n \times m}

2) LPC – Linear Predictive Coding Coefficients

F_{LPC} = \begin{bmatrix} f_{11} & f_{12} & f_{13} & \cdots & f_{1m} \\ f_{21} & f_{22} & & & \vdots \\ f_{31} & & \ddots & & \vdots \\ \vdots & & & \ddots & \vdots \\ f_{n1} & \cdots & \cdots & \cdots & f_{nm} \end{bmatrix}_{n \times m}

3.4 Fingerprint Processing

After the featuring vector (fingerprint) of a voice command has been obtained, the system is ready for the recognition process.

[Diagram: Current FingerPrint and the Database FingerPrints (FingerPrints Library) → FingerPrints Comparison → Comparison Distances → Decision → Best Match (Recognized → Function of the Command → Process Return) or Under Matched (Not Recognized)]

Figure III.10 - Fingerprint Processing and Recognition Process.

3.4.1 Fingerprints Library

As the system is a command-based speech recognition system, the voice commands that are going to be used must be determined before starting the recognition process. Each voice command is recorded 3 times, and 3 patterns are obtained for that command. These three patterns are then saved to the library under the name of that command, numbered 1 to 3.
This process is done for each voice command that is going to be used. Finally, we have 3 times n (the number of voice commands) fingerprints saved to the library.

3.4.2 Fingerprints Comparison

This is the part where the fingerprint of the current voice command (the command that is to be recognized) is compared with the fingerprints in the library. Both the current and the library featuring vectors are matrices, and the comparison is done by calculating the squared Euclidean distances between the current and the library fingerprint matrices, frame by frame: each row (frame) of the current fingerprint is compared with every row (frame) of the library fingerprint. Finally, after the overall comparison of the current fingerprint and a library fingerprint, one comparison matrix is obtained for each comparison.
The fingerprints will have different numbers of frames, because the voice commands do not all have equal lengths in the time domain; as mentioned before, the frame number represents the length of the voice command. Even the patterns of the same command will have different lengths, and so different numbers of frames. The thing that does not change is the number of coefficients in each frame, because all the fingerprints are obtained with the same method and the same order.
With the default column counts, the MFCC method generates 24 coefficients per frame and the LPC method generates 20 coefficients per frame.

Here is an example in which a current fingerprint is compared with one of the fingerprints in the library.

F_C = \begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & c_{34} \end{bmatrix}_{3 \times 4} \quad \text{Current fingerprint matrix: 3 frames, 4 coefficients in each frame}

F_L = \begin{bmatrix} l_{11} & l_{12} & l_{13} & l_{14} \\ l_{21} & l_{22} & l_{23} & l_{24} \\ l_{31} & l_{32} & l_{33} & l_{34} \\ l_{41} & l_{42} & l_{43} & l_{44} \\ l_{51} & l_{52} & l_{53} & l_{54} \end{bmatrix}_{5 \times 4} \quad \text{Library fingerprint matrix: 5 frames, 4 coefficients in each frame}

The comparison starts with the fingerprint that has the smaller size. Here the current fingerprint has a size of (3 × 4), which is smaller than the size of the library fingerprint (5 × 4), so we start with the current fingerprint. We then represent the fingerprints in the form below,

F_C = \begin{bmatrix} \text{CFrame 1} \\ \text{CFrame 2} \\ \text{CFrame 3} \end{bmatrix} = \begin{bmatrix} C_1 \\ C_2 \\ C_3 \end{bmatrix}, \qquad F_L = \begin{bmatrix} \text{LFrame 1} \\ \text{LFrame 2} \\ \text{LFrame 3} \\ \text{LFrame 4} \\ \text{LFrame 5} \end{bmatrix} = \begin{bmatrix} L_1 \\ L_2 \\ L_3 \\ L_4 \\ L_5 \end{bmatrix}

C_1 = [\, c_{11}\ \ c_{12}\ \ c_{13}\ \ c_{14} \,], \quad \ldots, \quad C_3 = [\, c_{31}\ \ c_{32}\ \ c_{33}\ \ c_{34} \,]

L_1 = [\, l_{11}\ \ l_{12}\ \ l_{13}\ \ l_{14} \,], \quad \ldots, \quad L_5 = [\, l_{51}\ \ l_{52}\ \ l_{53}\ \ l_{54} \,]

And the comparison matrix is ( \| denotes the comparison operator):

D = \begin{bmatrix} (C_1 \| L_1) & (C_1 \| L_2) & (C_1 \| L_3) & (C_1 \| L_4) & (C_1 \| L_5) \\ (C_2 \| L_1) & (C_2 \| L_2) & (C_2 \| L_3) & (C_2 \| L_4) & (C_2 \| L_5) \\ (C_3 \| L_1) & (C_3 \| L_2) & (C_3 \| L_3) & (C_3 \| L_4) & (C_3 \| L_5) \end{bmatrix}

(C_1 \| L_1) = \sum_{n=1}^{4} (c_{1n} - l_{1n})^2, \qquad (C_x \| L_y) = \sum_{n=1}^{4} (c_{xn} - l_{yn})^2

D = \begin{bmatrix} d_{11} & d_{12} & d_{13} & d_{14} & d_{15} \\ d_{21} & d_{22} & d_{23} & d_{24} & d_{25} \\ d_{31} & d_{32} & d_{33} & d_{34} & d_{35} \end{bmatrix}_{r \times c}, \qquad d_{xy} = (C_x \| L_y)

The first r columns form the square part of the matrix; its diagonal holds the frame-to-frame comparisons, so d_{nn} is the comparison distance between the nth frames of the current and library fingerprints. The remaining columns form the extension part, which arises because the frame numbers are not equal.
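A direct MATLAB sketch of this comparison matrix (my own rendering, with FC an r-by-m current fingerprint and FL a c-by-m library fingerprint):

    r = size(FC, 1);  c = size(FL, 1);
    D = zeros(r, c);
    for x = 1:r
        for y = 1:c
            D(x, y) = sum((FC(x, :) - FL(y, :)).^2);  % squared Euclidean distance
        end
    end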

After finding the comparison matrix in the above form, we perform an optimization, because the fingerprints of the same voice command will have different numbers of frames while containing the same information overall; the optimization is done in order to minimize this time-warping effect. This process also produces much higher distances when the frame numbers of the fingerprints are not equal; in general, patterns of the same voice command have similar frame numbers.

The optimization is done with the technique below,

D_{optimum} = [\, D_1\ \ D_2\ \ D_3\ \ D_4\ \ D_5 \,]

For the square part:

D_1 = \min(d_{11}, d_{12}), \qquad D_2 = \min(d_{21}, d_{22}, d_{23}), \qquad D_3 = \min(d_{32}, d_{33}, d_{34})

D_n = \min\left(d_{n(n-1)},\ d_{nn},\ d_{n(n+1)}\right), \qquad n = 1, 2, \ldots, r

The distance is the minimum of the diagonal element and the elements immediately to its left and right.

For the extension part:

D_4 = D_5 = 3 \times \operatorname{mean}(d_{34}, d_{35})

D_n = 3 \times \operatorname{mean}\left(d_{r(r+1)}, \ldots, d_{rc}\right), \qquad n = r+1, \ldots, c

The distance is equal to 3 times the mean of the distances between the last frame of the current fingerprint and the extra frames of the library fingerprint (or vice versa).

The final distance is the square root of the sum of the optimum distance values:

D_{final} = \sqrt{\sum_{n=1}^{c} D_n}
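Continuing the sketch, the optimization and final distance might look like this (variable names are my own; it assumes r <= c, i.e. the current fingerprint is the smaller one):

    Dopt = zeros(1, c);
    for n = 1:r
        lo = max(n-1, 1);  hi = min(n+1, c);
        Dopt(n) = min(D(n, lo:hi));          % diagonal element and its neighbours
    end
    if c > r
        Dopt(r+1:c) = 3 * mean(D(r, r+1:c)); % penalty for the extra frames
    end
    Dfinal = sqrt(sum(Dopt));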

3.4.3 Decision

After comparing the current fingerprint with all the library fingerprints and finding the distances, we obtain 3 different distance values for each voice command in the library. All the distances are then combined, and we find the minimum of them, because the minimum distance gives us the best match; it means that the current fingerprint is most similar to that command. When we find the best fingerprint match, we also find what was said in the current voice command, if it exists in the library.

Here there seems to be a problem. Our system is a command-based system with a limited number of voice commands. If a command that exists in the library is said to the tool, there is no problem: the best match will most probably (~95%) be what we expected. But what if a command that the system does not know is said to the tool? Again, there is going to be a minimum distance, the tool will determine it as the best match, and the name of the matched library command will be shown in the listbox. But the result is going to be wrong, because that new command does not exist in the library. In order not to fall into this mistake, we should determine a matching threshold level in the form of a percentage. Before we assign the command as recognized or unrecognized, we first look at this percentage level (the matching percentage). If the level is above the determined value, the command is assigned as recognized, and its name is shown in the listbox with its matching percentage. If the level is below the determined value, the command is assigned as not recognized, and "???" is shown in the listbox.

\text{Matching Percentage} = \frac{D_{final}}{\displaystyle \sum_{j=1}^{c} \sum_{i=1}^{r} \left[\text{Library FingerPrint}\right]_{ij}}

The advantage of determining a matching threshold percentage is that commands that are not in the library are not recognized. However, the technique I used for this process is only valid for small libraries of about 10 to 20 commands, because as the number of commands in the library increases, the probability of matching the current fingerprint also increases. After this part, the tool returns to the beginning and listens for new voice commands, automatically and continually.
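A decision-step sketch (my own rendering; the variable names and the orientation of the percentage test are assumptions based on the description above):

    [bestDist, bestIdx] = min(finalDistances);   % over all 3*n library patterns
    matchPct = matchingPct(bestIdx);             % matching percentage of the best match
    if matchPct >= recognitionThreshold          % e.g. 20 (%) in the tests
        result = commandNames{ceil(bestIdx/3)};  % 3 patterns per command
    else
        result = '???';                          % command not in the library
    end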

4 CHAPTER IV : Project Demonstration with Code Explanations

4.1 Training Part and Building Commands Database

Matlab m files used in this part

• Training.m, Training.fig
• Record_Training.m
• Featuring.m
• Melcepst_met.m
• Lpc_met.m
• Plotts.m
• Save_Mat.m
• Lib_Commands.mat

We start the demonstration chapter with the Training process: we introduce the voice commands to the tool and save them to the "Library" folder.

To start the training tool, we run the "Training.m" function. This function has a graphical user interface; see Figure IV.1.

Figure IV.1 - A screenshot of the Training GUI.

• Listbox: Previously recorded voice commands. There are 3 patterns of each of them saved in the "Library" folder.
• Remove Button: Removes the selected command from the listbox and deletes its 3 patterns from the library.
• Record Button: Records a new voice command with the name entered in the Command textbox. When it is pressed, it gives the user 3 consecutive recording processes.
• Featuring Method Panel: Applies the entered parameters to the recorded speech signal while finding its fingerprint.
• Featuring Method Selection Box: 1. MFCC Method, 2. LPC Method
• Order Textbox: Order of the applied method.
• Frame Length Textbox: Length of the frames, in ms, for the framing process.
• Okay Button: Saves the current data to the library.
• Reject Button: Rejects the current data.
• Play Button: Plays back the currently recorded speech.
• Featuring Matrix Size Textbox: Displays the size of the current fingerprint (#frames × #coefficients).

When the Record button is pressed, the recording loop starts:

[Call sequence: Record_Training.m → Featuring.m → Melcepst_met.m (Method 1) or Lpc_met.m (Method 2) → Plotts.m → Save_Mat.m → back to Record_Training.m]

Figure IV.2 - Running sequence of the Matlab functions used in Training Process.

4.1.1 Matlab Functions ( .m files)

Training.m

This is the main function of the Training process. All the sub-functions are called inside this function and are arranged in a sequence. Configuring and initializing the tool are also done in this part.

Record_Training.m

The recording process is operated in this function. When the Record button is pressed, the system enters a recording loop, and recording continues until 3 separate patterns of a voice command have been saved to the library.
Recording sound from the microphone is realized by creating an analog input object, which delivers the recorded signal in blocks of 1000 samples. If the energy of a block is greater than the defined threshold value, the tool appends blocks end to end to a variable until a block arrives whose energy is smaller than this threshold. After obtaining the speech signal, it is filtered, normalized, and sent to the "Featuring.m" function.
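A sketch of such a recording loop with the legacy Data Acquisition Toolbox (the names and the threshold test are assumptions, not the project's exact code):

    ai = analoginput('winsound');                % microphone via the sound card
    addchannel(ai, 1);
    set(ai, 'SampleRate', 8000, 'SamplesPerTrigger', inf);
    start(ai);
    speech = [];
    while true
        blk = getdata(ai, 1000);                 % read 1000 samples at a time
        if sum(abs(blk)) > energyThreshold       % speech block: keep it
            speech = [speech; blk];
        elseif ~isempty(speech)
            break;                               % first quiet block after speech
        end
    end
    stop(ai); delete(ai);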

Featuring.m

In this part, the recorded speech signal is first converted to a pure speech signal by finding the start and end points of the speech; the silence parts are removed from the whole recorded signal. This is realized by finding the energy of the speech and then removing the frames below the defined threshold level.
After obtaining the pure voice signal, its fingerprint is found with the selected method and method settings. Then the data and variables are sent to the function "Plotts.m".
In the training tool, the featuring process is executed just for inspecting the obtained fingerprint and certifying whether it is suitable or not.

Plotts.m

This is the function where all the results are shown on the GUI panel. The recorded signal, its energy, the start and end points, the extracted pure speech (command) signal, the featuring matrix, and its size are plotted. If the user certifies that the data are suitable, he presses the Okay button, and the function "Save_Mat.m" is called, which saves the voice command to the library. If the user does not certify that the data are suitable, he presses the Reject button, and all the current data and variables are cleared. In both cases, pressing the Okay or Reject button restarts the recording process from the beginning automatically. It stops once the Okay button has been pressed 3 times, which means 3 separate patterns of the new voice command have been saved to the library, or it stops automatically if nothing is said for 6.25 seconds.

Save_Mat.m

When it is called, it saves the currently recorded command with the name entered in the Command textbox. The signal is recorded to a .mat file under that name, with a pattern number (1, 2, or 3) appended.

Melcepst_met.m - Lpc_met.m

If the featuring method selected on the GUI panel is "1 - MFCC", the function "Melcepst_met.m" is called, and if it is "2 - LPC", the function "Lpc_met.m" is called for the feature extraction process in the "Featuring.m" function.

Lib_Commands.mat

The names of the saved voice commands are stored in this file and are displayed in the listbox on the GUI.

4.2 Recognition Part

Matlab m files used in this part

• SpeechRecognition.m, SpeechRecognition.fig
• Record_Recognition.m
• Featuring.m
• Melcepst_met.m
• Lpc_met.m
• Compare.m
• Disteusq.m
• Library_Call.m
• Lib_Commands.mat

After building the fingerprints database, the tool is ready for the recognition process.

To start the Speech Recognition Tool, we run the "SpeechRecognition.m" function. This function has a graphical user interface; see Figure IV.3.

Figure IV.3 - A screenshot of the Speech Recognition GUI.

• Listbox: Names of the voice commands that exist in the library. When a command is recognized, its name is highlighted in the listbox; if it is not recognized, the "???" line is highlighted.
• Mode Selection Box: Selection of the process, 1 - Recognition, 2 - Training.
• Featuring Method Panel: Applies the entered parameters to the recorded speech signal while finding its fingerprint. The parameters are the same as in the Training tool.
• Recognition Level Textbox: Defines the recognition-level threshold percentage.
• Start Button: Starts the recognition process and the listening for voice commands to be recognized.
• Stop Button: Stops all the processes.
• Energy Textbox: Shows that the recording process is running and displays the energy of the recorded frames.
• Recognition Result and Matching Level Textbox: Displays the currently recognized command and its matching percentage.

When the Start button is pressed, the recording loop and the recognition process start.

Figure IV.4 - Running sequence of the Matlab functions used in the Recognition process.

4.2.1 Matlab Functions ( .m files)

SpeechRecognition.m

This is the main function of the Recognition process. Configuring and initializing
the tool are done in this part.

Record_Recognition.m

The recording process is operated in this function; all the sub-functions are called inside it and are arranged in a sequence. If Training mode is selected and the Start button is pressed, the "Training.m" function is called. If Recognition mode is selected and the Start button is pressed, the system enters an infinite recording loop, and recording continues until the Stop button is pressed or nothing is said for 10 seconds. Before starting the analog input object, the "Library_Call.m" function is called to load the database voice commands.

Recording sound and the other processes are the same as in the Training process ("Record_Training.m"), as they should be, because the processes applied to a speech signal must be exactly the same in the Training and Recognition processes. After obtaining the speech signal, it is filtered, normalized, and sent to the "Featuring.m" function.

Library_Call.m

All the voice commands stored as .mat files in the "Library" folder are read, and their fingerprints are found by calling the function "Featuring.m"; these fingerprints are then assigned to variables to be used in the comparison ("Compare.m") process.

Featuring.m

After obtaining the pure voice signal, its fingerprint is found with the selected method and method settings. Then the data and variables are sent to the function "Compare.m".

Melcepst_met.m - Lpc_met.m

If the featuring method selected on the GUI panel is "1 - MFCC", the function "Melcepst_met.m" is called, and if it is "2 - LPC", the function "Lpc_met.m" is called for the feature extraction process in the "Featuring.m" function.

Compare.m

This is the function where the current fingerprint is compared with the database fingerprints. For the comparison process, the "Disteusq.m" function is executed for each comparison. After obtaining the comparison results (the distances), it is decided which comparison result gives the minimum distance, which means it is the best match. If this match satisfies the required conditions, the name of that library command is displayed with its matching percentage on the GUI, and its name is also highlighted in the listbox.

Disteusq.m

This function is used for finding the Euclidean distances between two fingerprints. The output is a single number that represents the difference between the two input matrices.
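For reference, disteusq comes from the VOICEBOX toolbox cited in the references; a call of the following form (the mode string is my assumption of the one used here) yields the pairwise distances between frames:

    % Squared Euclidean distances between every frame pair ('x' mode)
    D = disteusq(FC, FL, 'x');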

5 CHAPTER V : CONCLUSION

In this project, the basic principles and properties of human speech were investigated, and digital signal processing techniques for speech signals were studied. Finally, a speaker-dependent, small-vocabulary, isolated-word speech recognition system for noise-free environments was developed, and an executable Windows file (.exe) with a graphical user interface (GUI) was built to make the program practical and suitable for industrial use.

One of the main advantages of the designed tool is that you do not need to press a button each time you are going to say a voice command: the tool provides the user with a continual recording and recognition process. It also provides an easy way of using the tool; new voice commands can easily be added to the system, and the tool shows the steps of the processes. The featuring method and its parameters (order, frame length, recognition threshold percentage) can be easily changed on the panel.

Also, the tool can be used in a dictation process with some small changes in the algorithm; this is going to be one of the future works of this project.

Two different methods were used for feature extraction: the first is Mel Frequency Cepstral Coefficients (MFCC) and the second is the Linear Predictive method (LPC). These two methods were tried with different parameters, and tests with different characteristics were made. It was observed that the MFC coefficients are a more robust, reliable feature set for speech recognition than the LPC coefficients, because of the sensitivity of the low-order MFC coefficients to overall spectral slope and of the high-order MFC coefficients to noise. It was also seen that LPC approximates speech linearly at all frequencies, whereas MFCC is more robust and also takes into account the psychoacoustic properties of the human auditory system. For both methods it was seen that when the order is increased, the recognition efficiency increases. More training patterns for each voice command also give better results, and the talking style of the speaker does not affect recognition very much.

Considering the recognition time and the optimization parameters, the orders were chosen as 12 (12 MFCC and 12 delta coefficients) for the MFCC method and 11 for the LPC method. In order not to make the system work slowly, 3 patterns were taken from each voice command.

The optimum frame length was found to be 30 ms. Speech shorter than 30 ms does not include enough information, while a frame longer than 30 ms spans a stretch of signal over which the properties change too much; 30 ms was therefore found to be the optimum, containing exactly what is needed.

After the overall tests and observations (see Appendix 2), it was seen that environmental noise had a bad effect on recognition. The best results were obtained from test 6 (see Appendix 2, Table 6), which was run with one speaker, 50 library commands, 50 testing commands, 5 tries (5 × 50 = 250 commands), 3 patterns for each command, a 20% recognition-level threshold percentage, and 24 MFC (12 mel and 12 delta) coefficients. An overall efficiency of 96.8% was obtained, which means that in total 250 voice commands were tried and only 8 of them were recognized wrongly.

Another good result was obtained with test 3 (see Appendix 2, Table 3). Two patterns were taken from each of 3 speakers for each voice command, and a library was created with 30 voice commands, in total 6 × 30 = 180 patterns. The system was tested by 3 speakers with these 30 commands, with a 20% recognition-level percentage and 24 MFC coefficients. The overall result is 95%, which means that in total 180 voice commands were tried and only 9 of them were recognized wrongly. In this test it was observed that taking patterns from different speakers makes the system less dependent on speaker characteristics, so the speaker-dependent system becomes more nearly independent of the speaker or speakers.

These results seem good, but in fact they are not. When I first started this project, the aim was to obtain an efficiency of 99 percent; this is the minimum value for a speech recognition system to be used in commercial or industrial applications. I know that this difference occurred because of my own algorithms, such as the start and end point detection algorithm, the comparison algorithm, and the others I used in this project. These algorithms are still in the testing stage; I will improve them, or replace some of them, in the future works of this project.

6 APPENDICES

6.1 Appendix 1 : References

[1] Lawrence R. Rabiner, Ronald W. Schafer, "Digital Processing of Speech Signals", Prentice Hall, New Jersey, 1978

[2] Lawrence Rabiner and Biing-Hwang Juang , “Fundamentals of Speech Recognition”,


Prentice Hall, New Jersey, 1993

[3] D.Raj Reddy, “Invited papers presented at the 1974 IEEE symposium”,
ACADEMIC PRESS, 1975

[4] James L. Flanagan, Lawrence R. Rabiner, "Speech Synthesis", Bell Laboratories, Murray Hill, 1973

[5] Gérard Blanchet, Maurice Charbit, “Digital Signal and Image Processing using
MATLAB”, ISTE Ltd, 2006

[6] Jaan Kiusalaas , “Numerical Methods in Engineering with MATLAB” , Cambridge


University Press 2005

[7] Prof. Dr. H. G. Tillmann , "An Introduction to Speech Recognition" , Institut für
Phonetik und Sprachliche Kommunikation, University of Munich, Germany.
http://www.speech-recognition.de/slides.html

[8] Mike Brookes , “VOICEBOX : Speech Processing Toolbox for MATLAB” ,


Department of Electrical & Electronic Engineering, Imperial College ,
http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html

[9] “MATLAB 7 - Creating Graphical User Interfaces” , Mathworks

[10] Matlab 7 , Help Folder

[11] Halil İbrahim BÜLBÜL, Abdulkadir KARACI, "Speech Command Recognition In Computer: Pattern Recognition Method", Kastamonu Education Journal, March 2007, Vol. 15, No. 1

[12] Nursel YALÇIN , “Speech Recognition Theory And Techniques” , Kastamonu


Education Journal , March 2008 , Vol:16 No:1

[13] Cemal HANİLÇİ, Figen ERTAŞ, "On the Parameters of Text-Independent Speaker Identification Using Continuous HMMs", Uludağ University Engineering Faculty Journal, Vol. 12, No. 1, 2007

[14] Dr.-Ing. Bernd Plannerer , “Automatic Speech Recognition”

[15] T. Thrasyvoulou , S. Benton , “ Speech parameterization using the Mel scale Part II ”

[16] Wilson Clark , “CSE552/652 - Hidden Markov Models for Speech Recognition” ,
Department of Computer Science and Engineering OGI School of Science &
Engineering , http://www.cse.ogi.edu/class/cse552/

[17] Tony Robinson, "Speech Analysis", http://mi.eng.cam.ac.uk/~ajr/SA95/SpeechAnalysis.html

[18] Ozan MUT , MS Thesis , Gebze High Technology Institute, Computer


Engineering Faculty

[19] Volkan Tunalı , “A Speaker Dependent, Large Vocabulary, Isolated Word Speech
Recognition System For Turkish” , Ms Thesis, Marmara University Institute For
Graduate Studies in Pure and Applied Sciences

[20] Seydi Vakkas ÜSTÜN, "Yapay Sinir Ağları Kullanarak Türkçedeki Sesli Harflerin Tanınması" (Recognition of Vowels in Turkish Using Artificial Neural Networks), MS thesis, Yıldız Technical University Institute of Engineering and Science, Istanbul, Turkey

6.2 Appendix 2 : Test Results

Framing length is selected to be 30 ms in all tests.

Tests with one speaker, 20 library commands, 30 testing commands, 50% recognition-level threshold percentage:

• Test with 1 pattern, 24 MFCC coefficients (12 Mel - 12 Delta) - See Table 1

• Test with 3 patterns, 24 MFCC coefficients (12 Mel - 12 Delta) - See Table 2

Tests with three speakers, 30 library commands, 30 testing commands, 6 (3x2) patterns for each command, 20% recognition-level threshold percentage:

• Test with 24 MFCC coefficients (12 Mel - 12 Delta) - See Table 3

• Test with 11 LPC coefficients - See Table 4

Tests with one speaker, 50 library commands, 50 testing commands, 3 patterns for each command, 20% recognition-level threshold percentage:

• Test with 16 MFCC coefficients (8 Mel - 8 Delta) - See Table 5

• Test with 24 MFCC coefficients (12 Mel - 12 Delta) - See Table 6

• Test with 32 MFCC coefficients (16 Mel - 16 Delta) - See Table 7

• Test with 7 LPC coefficients - See Table 8

• Test with 11 LPC coefficients - See Table 9

• Test with 15 LPC coefficients - See Table 10

Table 1

Commands / Test No 1 2 3 4 5 Total


1 sol sol stop sol sol sol 80%
2 sağ sağ geri sağ stop sağ 60%
3 ileri geri ileri ileri geri ileri 60%
4 geri geri ileri geri geri geri 80%
5 dur ??? dur close dur dur 60%
6 left left left ??? left left 80%
7 right right right right right right 100%
8 go go go go go go 100%
9 back back ??? back back back 80%
10 stop stop stop stop stop stop 100%
11 aç aç aç aç aç aç 100%
12 kapat kapat kapat kapat kapat kapat 100%
13 open open ??? open open open 80%
14 close close close close close close 100%
15 forward close forward forward forward forward 80%
16 backward ??? samet backward kapat backward 40%
17 ali ali ali ali ali ali 100%
18 samet samet samet samet samet samet 100%
19 fatma ??? ??? fatma ??? fatma 40%
20 ayşe ayşe ayşe ayşe ayşe ayşe 100%
Total 80% 65% 90% 80% 100% 83%
21 aşağı ??? ??? ??? ??? ??? 100%
22 yukarı ??? ??? ??? ??? ??? 100%
23 hızlı ??? close ??? ??? ??? 80%
24 yavaş yavaş ??? ??? ??? left 60%
25 fast back back ??? ??? back 40%
26 slow close ??? close ??? ??? 60%
27 gel ??? left ??? left ??? 60%
28 git ??? ??? geri ??? geri 60%
29 merve ??? ??? ??? ??? ??? 100%
30 hüseyin ??? ileri ??? ??? ??? 80%
Total 76.7% 63.3% 86.7% 83.3% 90% 80%

Table 2

Commands / Test No 1 2 3 4 5 Total
1 sol sol sol sol sol sol 100%
2 sağ sağ sağ sağ sağ sağ 100%
3 ileri ileri ileri geri ileri ileri 80%
4 geri geri geri geri geri geri 100%
5 dur dur dur dur dur dur 100%
6 left left left left left left 100%
7 right right right right right right 100%
8 go go go go go go 100%
9 back back back back back back 100%
10 stop stop stop stop stop stop 100%
11 aç aç aç back aç aç 80%
12 kapat kapat kapat kapat kapat kapat 100%
13 open open open open open open 100%
14 close close close close close close 100%
15 forward forward forward forward forward forward 100%
16 backward backward backward backward backward backward 100%
17 ali ali ali ali ali ali 100%
18 samet samet samet samet samet samet 100%
19 fatma ??? fatma fatma fatma fatma 80%
20 ayşe ayşe ayşe ayşe ayşe ayşe 100%
Total 95% 100% 90% 100% 100% 97%
21 aşağı ??? ??? ayşe ??? ??? 80%
22 yukarı ??? ali ??? ??? ??? 80%
23 hızlı close ??? ??? ??? ??? 80%
24 yavaş ??? ??? ??? ??? ??? 100%
25 fast back ??? back back ??? 40%
26 slow close ??? close close ??? 40%
27 gel ??? left ??? left ??? 60%
28 git ??? ??? ??? ??? ??? 100%
29 merve ileri ??? ??? ileri ??? 60%
30 hüseyin ??? ??? ??? ??? ??? 100%
Total 83.3% 93.3% 83.3% 86.7% 100% 89.3%

Table 3

Speaker 1 Speaker 2 Speaker 3

Commands / Test No 1 2 1 2 1 2 Total
1 sol sol sol sol sol sol sol 100%
2 sağ sağ sağ sağ sağ sağ sağ 100%
3 ileri ileri ileri geri ileri ileri ileri 83.3%
4 geri geri geri geri geri geri geri 100%
5 dur dur dur dur dört dur dur 83.3%
6 left left left left left left left 100%
7 right right right right right right right 100%
8 go go go go go go go 100%
9 back dur back dört back back back 66.6%
10 stop stop stop stop stop stop stop 100%
11 aç aç aç aç aç stop aç 83.3%
12 kapat kapat kapat kapat kapat kapat kapat 100%
13 open open open open open open open 100%
14 close close close close close close close 100%
15 forward forward forward forward forward forward forward 100%
16 backward backward backward backward backward backward backward 100%
17 ali ali ali ali ali ali ali 100%
18 samet samet samet samet samet samet samet 100%
19 fatma fatma fatma fatma fatma fatma fatma 100%
20 ayşe ayşe ayşe ayşe ayşe ayşe ayşe 100%
21 bir bir geri bir bir bir geri 66.6%
22 iki iki iki iki iki iki iki 100%
23 üç üç üç üç üç üç üç 100%
24 dört dört dört dört dört dört dört 100%
25 beş üç beş beş beş beş beş 83.3%
26 altı altı altı altı altı altı altı 100%
27 yedi yedi yedi geri yedi yedi yedi 83.3%
28 sekiz sekiz sekiz sekiz sekiz sekiz sekiz 100%
29 dokuz dokuz dokuz dokuz dokuz dokuz dokuz 100%
30 on on on on on on on 100%
Total (per test) 93.3% 96.6% 90% 96.6% 96.6% 96.6%
Total (per speaker) 95% 93.3% 96.6%; Overall 95%

Table 4

Speaker 1 Speaker 2 Speaker 3

Commands / Test No 1 2 1 2 1 2 Total
1 sol forward sol sol forward ??? sol 50%
2 sağ sağ sağ sağ sağ sağ sağ 100%
3 ileri ileri geri ileri geri ileri ileri 66.6%
4 geri geri geri geri geri geri geri 100%
5 dur dur dur dur dur dur dur 100%
6 left left left left left left left 100%
7 right right right right right right right 100%
8 go go dört go dört right go 50%
9 back back back back back back back 100%
10 stop stop stop stop stop stop stop 100%
11 aç aç aç aç aç aç aç 100%
12 kapat kapat kapat kapat kapat kapat kapat 100%
13 open open open open open open open 100%
14 close close dur right close right close 50%
15 forward forward forward forward forward forward forward 100%
16 backward backward backward backward backward backward backward 100%
17 ali ali ali samet ali ali aç 66.6%
18 samet samet samet samet samet samet samet 100%
19 fatma open fatma fatma fatma fatma fatma 83.3%
20 ayşe ayşe ayşe ayşe ayşe ayşe ayşe 100%
21 bir bir bir bir bir bir bir 100%
22 iki iki iki iki iki iki iki 100%
23 üç üç üç üç üç üç üç 100%
24 dört dört dört dört dört dört dört 100%
25 beş beş beş beş beş beş beş 100%
26 altı altı samet altı altı altı altı 83.3%
27 yedi geri yedi geri geri yedi yedi 50%
28 sekiz sekiz sekiz sekiz sekiz sekiz sekiz 100%
29 dokuz ??? close dokuz dokuz dokuz close 50%
30 on sol sol on ??? on on 50%
Total per test 83.3% 80% 90% 83.3% 90% 93.3%
Total per speaker 81.6% 86.6% 91.6% (overall 86.6%)

Table 5

No. Command Test 1 Test 2 Test 3 Test 4 Test 5 Total
1 sol sol sol sol sol sol 100%
2 sağ sağ sağ sağ sağ one 80%
3 ileri ileri geri geri ileri ileri 60%
4 geri geri geri geri geri geri 100%
5 dur dur dur dur two dur 80%
6 left left left left left left 100%
7 right right right ten right right 80%
8 go go go go go go 100%
9 back back back dört dört back 60%
10 stop stop stop stop stop stop 100%
11 aç aç five aç aç aç 80%
12 kapat kapat kapat kapat kapat kapat 100%
13 open open open open open open 100%
14 close close close close close close 100%
15 forward forward forward forward forward forward 100%
16 backward backward backward backward backward backward 100%
17 yukarı yukarı yukarı dört yukarı yukarı 80%
18 aşağı aşağı aşağı aşağı aşağı aşağı 100%
19 hızlı hızlı hızlı hızlı hızlı hızlı 100%
20 yavaş yavaş yavaş yavaş yavaş yavaş 100%
21 one on one close one one 60%
22 two two two two two two 100%
23 three three three three three three 100%
24 four sol four four close four 60%
25 five five five five five five 100%
26 six six six six six bir 80%
27 seven seven seven seven seven seven 100%
28 eight eight eight eight eight eight 100%
29 nine nine nine nine nine nine 100%
30 ten ten ten ten ten ten 100%
31 bir bir six bir bir bir 80%
32 iki iki hüseyin iki iki iki 80%
33 üç üç üç üç üç üç 100%
34 dört dört dört dört dört dört 100%
35 beş beş beş beş beş beş 100%
36 altı altı altı altı altı altı 100%
37 yedi yedi yedi yedi yedi yedi 100%
38 sekiz sekiz sekiz sekiz sekiz sekiz 100%
39 dokuz dokuz dokuz dokuz dokuz dokuz 100%
40 on on on on on on 100%
41 hüseyin six hüseyin hüseyin hüseyin hüseyin 80%
42 mustafa mustafa mustafa ??? sağ mustafa 60%
43 samet samet samet samet samet samet 100%
44 ali ali ali ali ali ali 100%
45 hasan hasan hasan hasan hasan hasan 100%
46 fatma fatma fatma altı fatma fatma 80%
47 merve merve merve merve merve merve 100%
48 eda eda eda eda eda eda 100%
49 ayşe ayşe ayşe ayşe ayşe ayşe 100%
50 hüsniye hüsniye bir hüsniye hüsniye hüsniye 80%
Total 94% 90% 86% 90% 96% 91.2%

Table 6

No. Command Test 1 Test 2 Test 3 Test 4 Test 5 Total
1 sol sol sol sol sol sol 100%
2 sağ close sağ sağ sağ sağ 80%
3 ileri ileri ileri ileri ileri ileri 100%
4 geri geri geri geri geri geri 100%
5 dur dur dur dur dur dur 100%
6 left left left left left left 100%
7 right right right right right right 100%
8 go go go go go go 100%
9 back back back back dört back 80%
10 stop stop stop stop stop stop 100%
11 aç aç aç aç aç aç 100%
12 kapat kapat kapat kapat kapat kapat 100%
13 open open open open open open 100%
14 close close close close close close 100%
15 forward forward forward forward forward forward 100%
16 backward backward backward backward backward backward 100%
17 yukarı yukarı yukarı yukarı yukarı yukarı 100%
18 aşağı aşağı aşağı aşağı aşağı aşağı 100%
19 hızlı hızlı hızlı hızlı hızlı hızlı 100%
20 yavaş yavaş yavaş yavaş seven yavaş 80%
21 one one one one one one 100%
22 two two two two two two 100%
23 three three three three three three 100%
24 four four four four close four 80%
25 five five five five five five 100%
26 six six bir six six six 80%
27 seven seven seven seven seven seven 100%
28 eight eight eight eight eight eight 100%
29 nine nine nine nine nine nine 100%
30 ten ten ten ten ten ten 100%
31 bir bir bir bir bir bir 100%
32 iki iki iki iki iki iki 100%
33 üç üç üç üç üç üç 100%
34 dört dört dört dört dört dört 100%
35 beş beş beş beş beş beş 100%
36 altı altı altı altı altı altı 100%
37 yedi yedi yedi yedi yedi yedi 100%
38 sekiz sekiz sekiz sekiz sekiz sekiz 100%
39 dokuz dokuz dokuz dokuz dokuz dokuz 100%
40 on on on on on on 100%
41 hüseyin hüseyin six hüseyin hüseyin hüseyin 80%
42 mustafa mustafa sağ mustafa mustafa mustafa 80%
43 samet samet samet samet samet samet 100%
44 ali ali ali ali ali ali 100%
45 hasan hasan hasan seven hasan hasan 80%
46 fatma fatma fatma fatma fatma fatma 100%
47 merve merve merve merve merve merve 100%
48 eda eda eda eda eda eda 100%
49 ayşe ayşe ayşe ayşe ayşe ayşe 100%
50 hüsniye hüsniye hüsniye hüsniye hüsniye hüsniye 100%
Total 98% 94% 98% 94% 100% 96.8%

Table 7

No. Command Test 1 Test 2 Test 3 Test 4 Test 5 Total
1 sol sol on sol sol sol 80%
2 sağ close sağ sağ sağ sağ 80%
3 ileri ileri ileri ileri ileri ileri 100%
4 geri geri geri geri geri geri 100%
5 dur dur dur dur dur dur 100%
6 left left left left left left 100%
7 right right right right right right 100%
8 go go go go go go 100%
9 back back ten dört dört back 60%
10 stop stop stop stop stop stop 100%
11 aç aç aç aç aç aç 100%
12 kapat kapat kapat kapat kapat kapat 100%
13 open open open open open open 100%
14 close close close close close close 100%
15 forward forward forward forward forward forward 100%
16 backward backward backward backward backward backward 100%
17 yukarı yukarı yukarı yukarı dört yukarı 80%
18 aşağı aşağı aşağı aşağı aşağı aşağı 100%
19 hızlı hızlı hızlı hızlı hızlı hızlı 100%
20 yavaş seven yavaş yavaş yavaş yavaş 80%
21 one one one one one one 100%
22 two two two two left two 80%
23 three three three three three three 100%
24 four on four four four one 60%
25 five five five five five five 100%
26 six six six six six six 100%
27 seven seven seven seven seven seven 100%
28 eight eight eight eight eight eight 100%
29 nine nine nine nine nine nine 100%
30 ten ten ten ten ten ten 100%
31 bir six bir bir bir bir 80%
32 iki iki iki iki iki iki 100%
33 üç üç üç üç üç üç 100%
34 dört dört dört dört dört dört 100%
35 beş beş beş beş beş beş 100%
36 altı altı altı altı altı altı 100%
37 yedi yedi yedi yedi yedi yedi 100%
38 sekiz sekiz sekiz sekiz sekiz sekiz 100%
39 dokuz dokuz dokuz dokuz dokuz dokuz 100%
40 on on on on on on 100%
41 hüseyin hüseyin hüseyin hüseyin eight hüseyin 80%
42 mustafa hasan mustafa kapat mustafa mustafa 60%
43 samet samet samet samet samet samet 100%
44 ali ali ali ali ali ali 100%
45 hasan hasan hasan hasan hasan hasan 100%
46 fatma fatma fatma fatma fatma fatma 100%
47 merve merve merve merve merve merve 100%
48 eda eda eda eda eda eda 100%
49 ayşe ayşe ayşe ayşe ayşe ayşe 100%
50 hüsniye hüsniye hüsniye hüsniye bir hüsniye 80%
Total 90% 96% 96% 90% 98% 94%

Table 8

No. Command Test 1 Test 2 Test 3 Test 4 Test 5 Total
1 sol sol close close sol on 40%
2 sağ sağ stop sağ sağ sağ 80%
3 ileri ileri ileri geri ileri geri 60%
4 geri bir geri bir geri geri 60%
5 dur six yedi dur dur dur 60%
6 left left left nine left left 80%
7 right right ten ten right ten 40%
8 go close seven go ten go 40%
9 back back dört dört back back 60%
10 stop stop stop stop stop stop 100%
11 aç aç aç aç aç aç 100%
12 kapat kapat kapat kapat kapat kapat 100%
13 open open open open open open 100%
14 close close close close close close 100%
15 forward forward four forward forward four 60%
16 backward kapat backward backward kapat kapat 40%
17 yukarı ali yukarı two yukarı yukarı 60%
18 aşağı aşağı aşağı aşağı aşağı aşağı 100%
19 hızlı hızlı dokuz dokuz hızlı hızlı 60%
20 yavaş yavaş yavaş yavaş yavaş yavaş 100%
21 one close one close one one 60%
22 two two two two two two 100%
23 three three six three six three 60%
24 four four on four sol four 60%
25 five five five five five five 100%
26 six six six six six six 100%
27 seven dört seven seven seven seven 80%
28 eight eight six eight eight eight 80%
29 nine nine nine nine nine nine 100%
30 ten dört ten ten dört dört 40%
31 bir bir bir six geri bir 60%
32 iki bir sekiz iki bir iki 40%
33 üç üç üç üç eight üç 80%
34 dört dört dört dört dört dört 100%
35 beş beş back left beş back 40%
36 altı open altı open altı altı 60%
37 yedi yedi yedi yedi geri geri 60%
38 sekiz three sekiz sekiz sekiz three 60%
39 dokuz dokuz dur dokuz dokuz dokuz 80%
40 on on four four on on 60%
41 hüseyin hüseyin hüseyin hüseyin hüseyin hüseyin 100%
42 mustafa mustafa sağ mustafa ??? open 40%
43 samet samet altı samet samet samet 80%
44 ali ali ali ali samet ali 80%
45 hasan dört hasan hasan hasan stop 60%
46 fatma fatma fatma fatma fatma fatma 100%
47 merve merve merve merve merve merve 100%
48 eda eda eda eda eda eda 100%
49 ayşe ayşe altı ayşe ayşe ayşe 80%
50 hüsniye bir hüsniye six hüsniye six 40%
Total 74% 64% 72% 76% 78% 72.8%

Table 9

No. Command Test 1 Test 2 Test 3 Test 4 Test 5 Total
1 sol close sol sol sol on 60%
2 sağ sağ sağ sağ stop sağ 80%
3 ileri geri ileri ileri ileri geri 60%
4 geri geri geri geri bir geri 80%
5 dur dur yedi dur six dur 60%
6 left left left left left nine 80%
7 right right ten right ten right 60%
8 go go ten seven go go 60%
9 back back dört back dört back 60%
10 stop stop stop stop stop stop 100%
11 aç aç aç aç aç aç 100%
12 kapat kapat kapat kapat kapat kapat 100%
13 open open open seven open open 80%
14 close close close close close close 100%
15 forward four four forward forward forward 80%
16 backward backward kapat backward kapat backward 60%
17 yukarı ali yukarı yukarı two yukarı 60%
18 aşağı aşağı aşağı aşağı aşağı aşağı 100%
19 hızlı hızlı hızlı dokuz hızlı dokuz 60%
20 yavaş yavaş yavaş yavaş yavaş yavaş 100%
21 one close one one close one 60%
22 two two two two two two 100%
23 three three three six three three 80%
24 four four sol sol four four 60%
25 five five five five five five 100%
26 six six six six six six 100%
27 seven seven seven seven dört seven 80%
28 eight eight eight eight eight six 80%
29 nine nine nine nine nine nine 100%
30 ten ten dört ten dört ten 60%
31 bir bir geri bir bir bir 80%
32 iki iki bir iki bir iki 80%
33 üç üç üç sekiz üç üç 80%
34 dört dört dört dört dört dört 100%
35 beş beş left back beş beş 60%
36 altı altı altı open altı open 60%
37 yedi yedi yedi geri yedi geri 60%
38 sekiz three sekiz sekiz three sekiz 60%
39 dokuz dokuz dur dokuz dokuz dokuz 80%
40 on four on on four on 60%
41 hüseyin hüseyin ??? eight hüseyin hüseyin 60%
42 mustafa sağ mustafa sağ mustafa ??? 40%
43 samet samet samet altı samet altı 60%
44 ali ali ali samet ali ali 80%
45 hasan hasan dört stop hasan hasan 60%
46 fatma fatma fatma altı fatma fatma 80%
47 merve merve merve merve merve merve 100%
48 eda eda eda eda eda eda 100%
49 ayşe ayşe ayşe ayşe ayşe altı 80%
50 hüsniye hüsniye six hüsniye hüsniye hüsniye 80%
Total 80% 74% 70% 74% 80% 75.6%

Table 10

No. Command Test 1 Test 2 Test 3 Test 4 Test 5 Total
1 sol sol sol sol ??? sol 80%
2 sağ stop sağ samet close sağ 40%
3 ileri ileri geri ileri geri ileri 60%
4 geri six geri geri geri geri 80%
5 dur left dur dur left dur 60%
6 left ??? left left left left 80%
7 right ten ten right right right 60%
8 go go go go go ten 80%
9 back back back dört back back 80%
10 stop stop stop stop ??? stop 80%
11 aç aç aç back aç aç 80%
12 kapat kapat kapat kapat kapat close 80%
13 open open forward open open open 80%
14 close close close close close close 100%
15 forward four forward forward four forward 60%
16 backward backward backward kapat backward backward 80%
17 yukarı yukarı two two yukarı yukarı 60%
18 aşağı aşağı aşağı aşağı aşağı aşağı 100%
19 hızlı hızlı hızlı hızlı dokuz hızlı 80%
20 yavaş yavaş yavaş close yavaş yavaş 80%
21 one close one one one one 80%
22 two two two two two two 100%
23 three three six three three six 60%
24 four four on four four on 60%
25 five four five five five five 80%
26 six six six six six six 100%
27 seven back seven seven seven seven 80%
28 eight eight eight six six eight 60%
29 nine nine nine nine nine nine 100%
30 ten ten ten dört dört ten 60%
31 bir bir geri bir geri bir 60%
32 iki iki close sekiz iki iki 60%
33 üç üç üç üç üç üç 100%
34 dört dört dört dört dört dört 100%
35 beş beş left beş beş left 60%
36 altı open altı open altı altı 80%
37 yedi yedi yedi yedi yedi geri 80%
38 sekiz sekiz sekiz sekiz sekiz sekiz 100%
39 dokuz dokuz dokuz dur dur dokuz 60%
40 on four on on on four 60%
41 hüseyin hüseyin sekiz hüseyin close hüseyin 60%
42 mustafa mustafa mustafa sağ ??? ??? 40%
43 samet samet samet altı samet samet 80%
44 ali ali samet ali ali ali 80%
45 hasan hasan hasan stop hasan hasan 80%
46 fatma fatma fatma fatma fatma fatma 100%
47 merve merve merve merve merve merve 100%
48 eda eda eda eda eda eda 100%
49 ayşe ayşe ayşe ayşe ayşe altı 80%
50 hüsniye six hüsniye hüsniye six hüsniye 60%
Total 76% 78% 72% 72% 82% 76%

6.3 Appendix 3 : MATLAB Codes

Training.m

function varargout = Training(varargin)

gui_Singleton = 1;
gui_State = struct('gui_Name', mfilename, ...
'gui_Singleton', gui_Singleton, ...
'gui_OpeningFcn', @Training_OpeningFcn, ...
'gui_OutputFcn', @Training_OutputFcn, ...
'gui_LayoutFcn', [] , ...
'gui_Callback', []);
if nargin && ischar(varargin{1})
gui_State.gui_Callback = str2func(varargin{1});
end

if nargout
[varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});
else
gui_mainfcn(gui_State, varargin{:});
end

% --------------------------------------------------------------------

function Training_OpeningFcn(hObject, eventdata, handles, varargin)

handles.output = hObject;
guidata(hObject, handles);

% Configure and initialize the GUI objects.


set(handles.record_button,'Enable','on');
set(handles.text6,'Enable','on');
set(handles.okay_button,'Enable','off');
set(handles.reject_button,'Enable','off');
set(handles.sound_button,'Enable','off');
set(handles.text8,'Enable','off');
set(handles.text9,'Enable','off');
set(handles.text10,'Enable','off');
set(handles.text11,'Enable','off');
set(handles.energy_text,'Enable','off');
set(handles.text4,'Enable','on');
set(handles.text22,'Visible','off');
set(handles.size_text,'Visible','off');

handles.Fs = 8000 ; % sampling frequency


handles.nBits = 8 ; % bits/sample

% creating an analog input object from sound card


ai = analoginput('winsound', 0);
addchannel(ai, 1);

% Configure the analog input object.


set(ai, 'SampleRate', handles.Fs);
set(ai, 'BitsPerSample', 2*handles.nBits);
set(ai, 'BufferingConfig',[1000,30]);
set(ai, 'SamplesPerTrigger', 1000);
set(ai, 'TriggerRepeat', 1);

% Microphone Setup
start(ai);
test = getdata(ai);

test = test-mean(test);
st = sum(abs(test));
stop(ai);
handles.th = 3*st; % threshold level for recording
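% (st above is the absolute-sum energy of one 1000-sample, i.e. 125 ms,
% frame of ambient noise, so only frames roughly three times louder than
% the room noise will trigger a recording)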
clear ai;

fid = load('Lib_Commands.mat'); % assign names of the library commands to a variable.

set(handles.listbox1,'String',fid.fid) % initializing the commands listbox

handles.cc = 0 ;
handles.aa = 1 ;

handles.method = 1; % initializing the selected featuring method

guidata(hObject, handles);

% --------------------------------------------------------------------

function varargout = Training_OutputFcn(hObject, eventdata, handles)

varargout{1} = handles.output;

% when record button is pressed

function record_button_Callback(hObject, eventdata, handles)

handles.Method = get(handles.popupmenu2, 'Value') ; % getting the number of the selected method (MFCC-1, LPC-2)

name = get(handles.edit3,'String'); % name of the new command

handles.Framelen = str2double(get(handles.frame_text,'String')); % getting the frame length of the signal (in milliseconds)

handles.Order = str2double(get(handles.order_text,'String')); % getting the order of the featuring method

handles.cc = 0 ;
handles.aa = 1 ;

set(handles.text10,'Enable','on');
set(handles.text11,'Enable','on');
set(handles.text10,'String',handles.aa);
set(handles.text11,'String',name);

set(handles.text22,'Visible','on');
set(handles.size_text,'Visible','on');

guidata(hObject, handles);

% starts the recording process for the training commands, returns the
% fingerprint of the current command and its properties.
[handles] = Record_Training(hObject, eventdata, handles) ;

% assigning the variables to use in the other functions


handles.name = name ;

Plotts(hObject, eventdata, handles) % plotting the data.

guidata(hObject, handles);

% --------------------------------------------------------------------

% when okay button is pressed

function okay_button_Callback(hObject, eventdata, handles)

handles.Method = get(handles.popupmenu2, 'Value') ;

name = get(handles.edit3,'String');

handles.Framelen = str2double(get(handles.frame_text,'String'));
handles.Order = str2double(get(handles.order_text,'String'));

guidata(hObject, handles);

cc = handles.cc ;
namee = handles.name ;

if cc <= 2

cc = cc + 1 ;

Save_Mat(handles.recorded_signal, namee, cc) ; % saves the voice command pattern to the library as a .mat file

if cc == 3 % if 3 patterns are obtained for a voice command , the system stops


set(handles.record_button,'Enable','on');
set(handles.text6,'Enable','on');
set(handles.text8,'Enable','off');
set(handles.text9,'Enable','off');
set(handles.energy_text,'Enable','off');
set(handles.text10,'Enable','off');
set(handles.text11,'Enable','off');
set(handles.okay_button,'Enable','off');
set(handles.reject_button,'Enable','off');
set(handles.sound_button,'Enable','off');
set(handles.text4,'Enable','on');
cc = 0;

name = get(handles.edit3,'String');

% adding the new command to the listbox


fid = load('Lib_Commands.mat');
lfid = length(fid.fid);
fid.fid{lfid+1}=name;
set(handles.listbox1,'String',fid.fid )
fid = cellstr(fid.fid) ;
save('Lib_Commands.mat','fid')

handles.cc = cc ;
guidata(hObject, handles);

return

else

handles.cc = cc ;
handles.aa = handles.aa + 1 ;

set(handles.text10,'Enable','on');
set(handles.text11,'Enable','on');
set(handles.text10,'String',handles.aa);
set(handles.text11,'String',handles.name);

% recording the next pattern (until 3 patterns obtained)


[handles ] = Record_Training(hObject, eventdata, handles) ;

Plotts(hObject, eventdata, handles) % plotting the data.

guidata(hObject, handles);

end

else % finishes recording

set(handles.record_button,'Enable','on');

set(handles.text6,'Enable','on');
set(handles.text8,'Enable','off');
set(handles.text9,'Enable','off');
set(handles.energy_text,'Enable','off');
set(handles.text10,'Enable','off');
set(handles.text11,'Enable','off');
set(handles.okay_button,'Enable','off');
set(handles.reject_button,'Enable','off');
set(handles.sound_button,'Enable','off');
set(handles.text4,'Enable','on');
cc = 0;

handles.cc = cc ;
guidata(hObject, handles);

end

% --------------------------------------------------------------------

% when reject button pressed


function reject_button_Callback(hObject, eventdata, handles)

% clears the current command and starts rerecording

handles.Method = get(handles.popupmenu2, 'Value') ;

name = get(handles.edit3,'String');

handles.Framelen = str2double(get(handles.frame_text,'String'));
handles.Order = str2double(get(handles.order_text,'String'));

guidata(hObject, handles);

[handles ] = Record_Training(hObject, eventdata, handles) ;

Plotts(hObject, eventdata, handles) % plotting the data.

guidata(hObject, handles);

% --------------------------------------------------------------------

% --------------------------------------------------------------------
function sound_button_Callback(hObject, eventdata, handles)

% plays back the current command


wavplay(handles.Command_Signal,8000)

pause(0.3)
% --------------------------------------------------------------------

% when remove button pressed


function rmv_button_Callback(hObject, eventdata, handles)

% removes the selected command from the listbox and library

value = get(handles.listbox1,'Value');
value = int8(value);

% re-edits the library and the commands listbox

fid = load('Lib_Commands.mat');
name = fid.fid{value};

[stat,mess]=fileattrib('Library');

lib_path = mess.Name;

delete([lib_path '\' name '1.mat'])
delete([lib_path '\' name '2.mat'])
delete([lib_path '\' name '3.mat'])

fid.fid(value)='';
fid=fid.fid;
save('Lib_Commands.mat','fid')

set(handles.listbox1,'Value',(value-1))
set(handles.listbox1,'String',fid)
% --------------------------------------------------------------------
function popupmenu2_Callback(hObject, eventdata, handles)

selection = get(handles.popupmenu2, 'Value') ;


switch selection
case 1
set(handles.order_text,'String',12); % default order values for featuring methods
case 2
set(handles.order_text,'String',9);
end
% --------------------------------------------------------------------
function popupmenu2_CreateFcn(hObject, eventdata, handles)
if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
function energy_text_Callback(hObject, eventdata, handles)
function energy_text_CreateFcn(hObject, eventdata, handles)
if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
function listbox1_Callback(hObject, eventdata, handles)
function listbox1_CreateFcn(hObject, eventdata, handles)
if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
function edit3_Callback(hObject, eventdata, handles)
function edit3_CreateFcn(hObject, eventdata, handles)
if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
function frame_text_Callback(hObject, eventdata, handles)
function frame_text_CreateFcn(hObject, eventdata, handles)
if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
function order_text_Callback(hObject, eventdata, handles)
function order_text_CreateFcn(hObject, eventdata, handles)
if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------

SpeechRecognition.m

function varargout = SpeechRecognition(varargin)

gui_Singleton = 1;
gui_State = struct('gui_Name', mfilename, ...
'gui_Singleton', gui_Singleton, ...
'gui_OpeningFcn', @SpeechRecognition_OpeningFcn, ...
'gui_OutputFcn', @SpeechRecognition_OutputFcn, ...
'gui_LayoutFcn', [] , ...
'gui_Callback', []);
if nargin && ischar(varargin{1})
gui_State.gui_Callback = str2func(varargin{1});
end

if nargout
[varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});
else
gui_mainfcn(gui_State, varargin{:});
end

% --------------------------------------------------------------------
% --------------------------------------------------------------------

function SpeechRecognition_OpeningFcn(hObject, eventdata, handles, varargin)

handles.output = hObject;
guidata(hObject, handles);

% Configure and initialize the GUI objects.


set(handles.popupmenu2,'Visible','on');
set(handles.start_button,'Visible','on');
set(handles.stop_button,'Enable','off');
set(handles.text4,'Visible','off');
set(handles.text5,'Visible','off');
set(handles.text23,'Visible','off');
set(handles.size_text,'Visible','off');
set(handles.text28,'Visible','off');
set(handles.energy_text,'Visible','off');

% initializing the commands listbox


fid = load('Lib_Commands.mat');
ldf = length(fid.fid)+1;
fid.fid(ldf)={'???'};
fid=fid.fid;
set(handles.listbox1,'String',fid)

handles.output = hObject;
guidata(hObject, handles);

% --------------------------------------------------------------------
% --------------------------------------------------------------------
function varargout = SpeechRecognition_OutputFcn(hObject, eventdata, handles)

varargout{1} = handles.output;
% --------------------------------------------------------------------
% --------------------------------------------------------------------

function popupmenu2_Callback(hObject, eventdata, handles)

selection = get(handles.popupmenu2, 'Value') ;

switch selection

case 2 % Training Mode

set(handles.stop_button,'Enable','off');
set(handles.text4,'Visible','off');
set(handles.text5,'Visible','off');
set(handles.energy_text,'Visible','off');
set(handles.text23,'Visible','off');
set(handles.size_text,'Visible','off');
set(handles.text28,'Visible','off');

case 1 % Recognizing mode

set(handles.stop_button,'Enable','off');
set(handles.text4,'Visible','off');
set(handles.text5,'Visible','off');
set(handles.energy_text,'Visible','off');

% initialize the commands listbox


fid = load('Lib_Commands.mat');
lf = length(fid.fid);
fid = load('Lib_Commands.mat');
ldf = length(fid.fid)+1;
fid.fid(ldf)={'???'};
fid=fid.fid;
set(handles.listbox1,'String',fid)

end

% --------------------------------------------------------------------
% --------------------------------------------------------------------
function popupmenu2_CreateFcn(hObject, eventdata, handles)

if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
% --------------------------------------------------------------------

% when start button pressed


function start_button_Callback(hObject, eventdata, handles)

selection = get(handles.popupmenu2, 'Value') ;

switch selection

case 2

set(handles.stop_button,'Enable','off');
set(handles.text4,'Visible','off');
set(handles.text5,'Visible','off');
set(handles.energy_text,'Visible','off');

Training ; % run Training function

case 1

set(handles.stop_button,'Enable','on');
set(handles.start_button,'Enable','off');
set(handles.text4,'Visible','on');
set(handles.text5,'Visible','on');
set(handles.text23,'Visible','on');
set(handles.size_text,'Visible','on');
set(handles.text28,'Visible','on');
set(handles.energy_text,'Visible','on');

handles.Fs = 8000 ;
handles.nBits = 8 ;

% get the properties of featuring process
handles.Framelen = str2double(get(handles.len_text,'String'));
handles.Order = str2double(get(handles.order_text,'String'));
handles.Method = get(handles.popupmenu3, 'Value') ;
handles.level = str2double(get(handles.level_text,'String'));

handles.output = hObject;
guidata(hObject, handles);

% starts listening for the voice commands


handles = Record_Recognition(hObject, eventdata, handles) ;

end

% --------------------------------------------------------------------

function stop_button_Callback(hObject, eventdata, handles)

%stops recording

handles.output = hObject;

stop(handles.ai);

guidata(hObject, handles);

set(handles.stop_button,'Enable','off');
set(handles.start_button,'Enable','on');

% --------------------------------------------------------------------

function close_button_Callback(hObject, eventdata, handles)

handles.output = hObject;
guidata(hObject, handles);

selection = questdlg(['Close ' get(handles.figure1,'Name') '?'],...
                     ['Close ' get(handles.figure1,'Name') '...'],...
                     'Yes','No','Yes');
if strcmp(selection,'No')
return;
end

stop(handles.ai);
delete(handles.ai);
handles = rmfield(handles,'ai'); % remove the deleted analog input object from handles

handles.output = hObject;
guidata(hObject, handles);

close(handles.figure1)

% --------------------------------------------------------------------
function popupmenu3_Callback(hObject, eventdata, handles)

selectionn = get(handles.popupmenu3, 'Value') ;

switch selectionn

case 1
set(handles.order_text,'String',12); % default order values for featuring methods
case 2
set(handles.order_text,'String',9);

end

% --------------------------------------------------------------------

function popupmenu3_CreateFcn(hObject, eventdata, handles)

if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
function order_text_Callback(hObject, eventdata, handles)

function order_text_CreateFcn(hObject, eventdata, handles)

if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
function energy_text_Callback(hObject, eventdata, handles)

function energy_text_CreateFcn(hObject, eventdata, handles)

if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
function recog_text_Callback(hObject, eventdata, handles)

function recog_text_CreateFcn(hObject, eventdata, handles)

if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
function listbox1_Callback(hObject, eventdata, handles)

function listbox1_CreateFcn(hObject, eventdata, handles)

if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
function len_text_Callback(hObject, eventdata, handles)

function len_text_CreateFcn(hObject, eventdata, handles)

if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------
function level_text_Callback(hObject, eventdata, handles)

function level_text_CreateFcn(hObject, eventdata, handles)

if ispc
set(hObject,'BackgroundColor','white');
else
set(hObject,'BackgroundColor',get(0,'defaultUicontrolBackgroundColor'));
end
% --------------------------------------------------------------------

Record_Training.m

function [handles ] = Record_Training(hObject, eventdata, handles)

set(handles.record_button,'Enable','off');
set(handles.text6,'Enable','off');
set(handles.okay_button,'Enable','off');
set(handles.reject_button,'Enable','off');
set(handles.sound_button,'Enable','off');
set(handles.text8,'Enable','on');
set(handles.text9,'Enable','on');
set(handles.energy_text,'Enable','on');
set(handles.text4,'Enable','off');
guidata(hObject, handles);

% setting the initial values


n = 0 ; z = 0 ; data = 0 ; audio_prev = 0 ; signal = 0 ; FngrPrnts = 0 ; stop_rec = 0 ;
Fs = handles.Fs;
nbits = handles.nBits;
th = handles.th;

b1 = fir1(250,[2*70/Fs 2*3200/Fs]); % Speech filter coefficients 70Hz-3200Hz


b2 = [1 -0.97]; % Preemphasis filter coeff.
% creating an analog input object from sound card
ai = analoginput('winsound', 0);
addchannel(ai, 1);

% Configure the analog input object.


set(ai, 'SampleRate', Fs);
set(ai, 'BitsPerSample', 2*nbits);
set(ai, 'BufferingConfig',[1000,30]);
set(ai, 'SamplesPerTrigger', 1000);
set(ai, 'TriggerRepeat', inf);
start(ai); % starts recording process

while 1

pause(0.01) % Waiting time in each loop to give enough time for setting GUI objects.
% Does not cause any sample loss in recording process.
% Analog input object works asynchronously.
handles.ai=ai; % Assigning the analog object to use its properties in the other functions
guidata(hObject, handles);
audio = getdata(ai); % getting the recorded data (1000 samples)

EE = (sum( abs( audio - mean(audio) ) )); % frame energy after removing the DC value
set(handles.energy_text, 'String', EE);

if EE >= th % if the energy of the frame is greater than th, the system senses
% that the speaker is saying a command.

n = n+1 ;
% frames that have an energy greater than th are arranged
% in a row and attached to each other.

if n==1
data = [audio_prev ; audio] ;
else
data = [data ; audio] ;
end
else

if n > 0 % this is the point where the command ends.


% ( while n is greater than zero and if the energy of the current frame )
% ( is less than th, this means the speaker finished saying the command )

stop_rec = stop_rec+1;

data = [data ; audio ] ;

if (stop_rec>1)

stop(ai);
delete(ai)
clear ai

% Filtering process
signal_filtered1 = filter(b1,1,data); % Band-Pass 70Hz-3200Hz

signal_filtered2 = filter(b2,1,signal_filtered1); % Pre-Emphasis

signal = (20/th)*signal_filtered2(350:end); % Amplifying

wavwrite(signal,8000,8,'command.wav') ; % Converting the signal from 16 bits to 8 bits/sample


recorded_signal = wavread('command.wav');
delete command.wav

% Sending the Signal to Featuring process


[ FngrPrnts,Command_Signal,Paudio,start_point,end_point ] = Featuring(handles,recorded_signal) ;

guidata(hObject, handles);

break
end

else % The part which no command is said

audio_prev = audio ;
z = z+1 ;
end

end

if (z >= 50) % if 50 consecutive frames all have energy less than th, the system stops.
% 50*1000/8000 = 6.25 seconds
% if nothing is said within 6.25 seconds the system stops

set(handles.popupmenu2,'Visible','on');
set(handles.start_button,'Visible','on');
set(handles.stop_button,'Visible','off');
set(handles.text1,'Visible','off');
set(handles.text2,'Visible','off');
set(handles.text4,'Visible','off');
set(handles.text5,'Visible','off');
set(handles.energy_text,'Visible','off');

% Resetting the variables to their initialized values.


n = 0 ; z = 0 ; data = 0 ; audio_prev = 0 ; signal = 0 ; FngrPrnts = 0 ;

stop(ai)
delete(ai)
clear ai

guidata(hObject, handles);
break
end
end

handles.recorded_signal = recorded_signal;
handles.Paudio = Paudio;
handles.Command_Signal = Command_Signal;
handles.FngrPrnts = FngrPrnts;
handles.start_point = start_point;
handles.end_point = end_point;
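The loop above is, in essence, an energy-gated recorder: 125 ms frames (1000 samples at 8 kHz) are appended to the command buffer while their DC-removed absolute-sum energy stays above th, and recording stops once the energy falls back below th. The following minimal offline sketch reproduces that gating on a synthetic signal; all names and values here are illustrative only, and the leading-frame bookkeeping and GUI handling of the real code are omitted.

% Offline sketch of the energy gating in Record_Training (illustrative only)
Fs = 8000; frameLen = 1000;                          % 125 ms frames
x = [0.01*randn(4000,1); 0.5*sin(2*pi*440*(0:7999)'/Fs); 0.01*randn(4000,1)];

st = sum(abs(x(1:frameLen) - mean(x(1:frameLen))));  % ambient-noise frame energy
th = 3*st;                                           % same calibration as above

data = [];
for k = 1:floor(length(x)/frameLen)
    frame = x((k-1)*frameLen+1 : k*frameLen);
    EE = sum(abs(frame - mean(frame)));              % DC-removed frame energy
    if EE >= th
        data = [data; frame];                        % frame belongs to the command
    elseif ~isempty(data)
        data = [data; frame];                        % keep one trailing frame
        break                                        % the command has ended
    end
end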

Record_Recognition.m

function handles = Record_Recognition(hObject, eventdata, handles)

% Assigning the fingerprints of the library commands to variables.


[ lib_com1,lib_com2,lib_com3,lf ] = Library_Call(hObject, eventdata, handles) ;

% setting the initial values


n = 0 ; z = 0 ; data = 0 ; audio_prev = 0 ; signal = 0 ; FngrPrnts = 0 ; stop_rec = 0 ;

Fs = handles.Fs;
nBits = handles.nBits;

b1 = fir1(250,[2*70/Fs 2*3200/Fs]); % Speech filter coefficients 70Hz-3200Hz


b2 = [1 -0.97]; % Preemphasis filter coeff.

% creating an analog input object from sound card


ai = analoginput('winsound', 0);
addchannel(ai, 1);

% Configure the analog input object.


set(ai, 'SampleRate', Fs);
set(ai, 'BitsPerSample', 2*nBits);
set(ai, 'BufferingConfig',[1000,30]);
set(ai, 'SamplesPerTrigger', 1000);
set(ai, 'TriggerRepeat', 1);

% Microphone Setup
start(ai);
test = getdata(ai);
test = test-mean(test);
st = sum(abs(test));
stop(ai);
th = 3*st; % recording threshold

% Recording Process starts


set(ai, 'TriggerRepeat', inf);
start(ai);

while (1)

pause(0.01) % Waiting time in each loop to give enough time for setting GUI objects.
% Does not cause any sample loss in recording process.
% Analog input object works asynchronously.

handles.ai=ai; % Assigning the analog object to use its properties in the other functions
guidata(hObject, handles);

audio = getdata(ai); % getting the recorded data (1000 samples)

EE = (sum( abs( audio - mean(audio) ) )); % frame energy after removing the DC value
set(handles.energy_text, 'String', EE);

if EE >= th % if the energy of the frame is greater than th, the system senses
% that the speaker is saying a command.

n = n+1 ;

% frames that have an energy greater than th are arranged
% in a row and attached to each other.

if n==1
data = [audio_prev ; audio] ;
else
data = [data ; audio] ;
end

else

if n > 0 % this is the point where the command ends.


% ( while n is greater than zero and if the energy of the current frame )
% ( is less than th, this means the speaker finished saying the command )

stop_rec = stop_rec+1;

data = [data ; audio ] ;

if (stop_rec>1) % one frame in the middle of the signal is acceptable.

% Filtering process
signal_filtered1 = filter(b1,1,data); % Band-Pass 70Hz-3200Hz

signal_filtered2 = filter(b2,1,signal_filtered1); % Pre-Emphasis

signal = (20/th)*signal_filtered2(350:end); % Amplifying

wavwrite(signal,8000,8,'command.wav') % Converting the signal from 16 bits to 8 bits/sample


recorded_signal = wavread('command.wav');
delete command.wav

% Sending the Signal to Featuring process.


[ FngrPrnts,Command_Signal,Paudio,start_point,end_point ] = Featuring(handles, recorded_signal) ;

% Printing size of the fingerprint.


[r,c]=size(FngrPrnts);
set(handles.size_text,'String',[ num2str(r) ' x ' num2str(c) ]);

% Comparing the current fingerprint with the library fingerprints


% The decision and resulting processes are realized in this function.
Compare( FngrPrnts,lib_com1,lib_com2,lib_com3,lf,hObject, eventdata, handles ) ;

% Resetting the variables to their initialized values.


n = 0 ; z = 0 ; data = 0 ; audio_prev = 0 ; signal = 0 ; FngrPrnts = 0 ;audio=0;stop_rec = 0;
end

else
audio_prev = audio ; % The part which no command is said
z = z+1 ;
end
end

if (z >= 80) % if 80 consecutive frames all have energy less than th, the system stops.
% 80*1000/8000 = 10 seconds
% if nothing is said within 10 seconds the system stops automatically.

stop(ai)
delete(ai)
clear ai;
set(handles.popupmenu2,'Visible','on');
set(handles.start_button,'Enable','on');
set(handles.stop_button,'Enable','off');
set(handles.text4,'Visible','off');
set(handles.text5,'Visible','off');
set(handles.energy_text,'Visible','off');
set(handles.text23,'Visible','off');
set(handles.size_text,'Visible','off');
set(handles.text28,'Visible','off');

% Resetting the variables to their initialized values.


n = 0 ; z = 0 ; data = 0 ; audio_prev = 0 ; signal = 0 ; FngrPrnts = 0 ;
break

end

end

Save_Mat.m

function Save_Mat(recorded_signal,name,cc)

% saves the voice command with a pattern number

[stat,mess]=fileattrib('Library');

lib_path = mess.Name;

number = int2str(cc) ;

name = char(name);

save([lib_path '\' name number],'recorded_signal')
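As a usage note, Save_Mat stores each training pattern in the Library folder as a MAT-file named after the command plus its pattern number, containing the variable recorded_signal. A hypothetical call (the command name 'sol' and the random signal are examples only, and a Library folder must already exist next to the code):

recorded_signal = randn(4000,1);       % stand-in for a recorded command
Save_Mat(recorded_signal, 'sol', 1);   % -> Library\sol1.mat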

Plotts.m

function Plotts(hObject, eventdata, handles)

set(handles.record_button,'Enable','off');
set(handles.text6,'Enable','off');
set(handles.text8,'Enable','off');
set(handles.text9,'Enable','off');
set(handles.energy_text,'Enable','off');
set(handles.okay_button,'Enable','on');
set(handles.reject_button,'Enable','on');
set(handles.sound_button,'Enable','on');
set(handles.text4,'Enable','off');

lre = length(handles.recorded_signal);
lpe = length(handles.Paudio);
min_r = min(handles.recorded_signal);
max_r = max(handles.recorded_signal);
l_ff = length(handles.FngrPrnts');

% Printing size of the fingerprint.


[r,c]=size(handles.FngrPrnts);
set(handles.size_text,'String',[ num2str(r) ' x ' num2str(c) ]);

% plots the recorded signal,its power and start and end points on the same graph.
axes(handles.axes3); cla;
plot(40*(0:lpe-1),handles.Paudio,'r','LineWidth',2) , hold on
plot((0:lre-1),handles.recorded_signal) , hold on
plot(handles.start_point*ones(1,length(min_r:1/1000:max_r)),((min_r:1/1000:max_r)),'k','LineWidth',2) , hold on
plot(handles.end_point*ones(1,length(min_r:1/1000:max_r)),((min_r:1/1000:max_r)),'k','LineWidth',2) , hold off

% plots the pure command signal


axes(handles.axes4); cla;
plot(handles.Command_Signal)

% plots the fingerprint of the command signal


axes(handles.axes2); cla;
plot(handles.FngrPrnts')
if handles.Method == 1
axis([1 l_ff+0.1 -7.1 7.1])
else
axis([1 l_ff+0.1 0 20.1])
end

Featuring.m

function [ FngrPrnts,Command_Signal,Paudio,start_point,end_point ] = Featuring(handles,recorded_signal)

signal = 0.5*recorded_signal/max(abs(recorded_signal)) ; % Normalizing the signal for power calculation

lx=length(signal); nn=0;

for c=1:40:lx-39
nn=nn+1;
Paudio(nn) = sum(( abs( signal(c:c+39) ) )).^2 ;
end

Paudio = sqrt(Paudio/40)/2;
mean_p = mean(Paudio)/3.5;

zz = 0; gg = 0; g = nn+1;
stopp1 = 0 ; stopp2 = 0 ;
startpoint = 1; endpoint = 15;

for f=1:nn

g = g-1;

if ((stopp1==0)&&(f<=nn-3)) % start point searching - 4 frames that have energy greater than mean value

if ( (Paudio(f)>=mean_p) && (Paudio(f+1)>=mean_p) && ...
     (Paudio(f+2)>=mean_p) && (Paudio(f+3)>=mean_p) )

startpoint = f ; stopp1 = 1;

end

end

if ((stopp2==0)&&(g>=4)) % end point searching - 4 frames that have energy greater than the mean value

if ( (Paudio(g-3)>=mean_p) && (Paudio(g-2)>=mean_p) && ...
     (Paudio(g-1)>=mean_p) && (Paudio(g)>=mean_p) )

endpoint = g ; stopp2 = 1;

end

end

if ((stopp1==1)&&(stopp2==1))
break
end

end

Fs = handles.Fs ; Order = handles.Order ; Framelen = handles.Framelen;

if ( ( startpoint >= endpoint ) || ( stopp1 ~= 1 ) || ( stopp2 ~= 1 ) ) % fallback values when no valid command is found

start_point = 1; end_point = 1000-352;

Command_Signal = recorded_signal( start_point : end_point ) ;


FngrPrnts = zeros(10,2*Order);

else

start_point = 40*startpoint - 20 ;
end_point = 40*endpoint ;

Command_Signal = recorded_signal( start_point : end_point ) ;

if (handles.Method == 1) % MFCC method

len = Fs*(Framelen)/1000 ;
FngrPrnts = Melcepst_met(Command_Signal,Fs,Order,len);
ss = size(FngrPrnts);

elseif (handles.Method == 2) % LPC method

len = Fs*(Framelen)/1000 ;
FngrPrnts = Lpc_met(Command_Signal,Fs,Order,len);
ss = size(FngrPrnts);

end

end
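A note on the arithmetic above: Paudio is computed over 40-sample blocks, i.e. 5 ms at Fs = 8000 Hz, so startpoint and endpoint are 5 ms frame indices. The mapping back to sample indices works as in the following sketch (the frame numbers are illustrative):

% Frame-to-sample mapping used in Featuring (illustrative values)
Fs = 8000;
startFrame = 37;                      % first of 4 consecutive above-mean frames
endFrame   = 180;                     % last of 4 consecutive above-mean frames
start_point = 40*startFrame - 20;     % = 1460 samples, about 0.18 s
end_point   = 40*endFrame;            % = 7200 samples, about 0.90 s
duration    = (end_point - start_point)/Fs;   % about 0.72 s of pure command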

Library_Call.m

function [ lib_com1,lib_com2,lib_com3,lf ] = Library_Call(hObject, eventdata, handles)

[stat,mess]=fileattrib('Library');

lib_path = mess.Name;

fid = load('Lib_Commands.mat');

lf = length(fid.fid);

for f = 1:lf

lib_wav11{f} = load([lib_path '\' char(fid.fid(f)) '1' ]);


lib_wav22{f} = load([lib_path '\' char(fid.fid(f)) '2' ]);
lib_wav33{f} = load([lib_path '\' char(fid.fid(f)) '3' ]);

end

for f = 1:lf

lib_wav1{f} = lib_wav11{f}.recorded_signal ;
lib_wav2{f} = lib_wav22{f}.recorded_signal ;
lib_wav3{f} = lib_wav33{f}.recorded_signal ;

end

for f = 1:lf

lib_com1{f} = Featuring(handles, lib_wav1{f} ) ;


lib_com2{f} = Featuring(handles, lib_wav2{f} ) ;
lib_com3{f} = Featuring(handles, lib_wav3{f} ) ;

end

Lpc_met.m

function FngrPrnts = Lpc_met(Command_Signal,Fs,nLPC,len)

data_frame = enframe(Command_Signal,hamming(len),floor(len/2)); % framing the input with a hamming window
[r,c] = size(data_frame); % r = number of frames, c = number of samples in each frame

for n = 1:r

a_lpc = lpc(data_frame(n,:),nLPC) ; % LPC filter coeffs.


FngrPrnts1(n,:) = abs(freqz(1,a_lpc,ceil(len/12))); % LPC filter transfer function, 20 samples (coefficients) per frame

end

FngrPrnts = FngrPrnts1;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function f=enframe(x,win,inc)

nx=length(x(:));
nwin=length(win);
if (nwin == 1)
len = win;
else
len = nwin;
end
nf = fix((nx-len+inc)/inc);
f=zeros(nf,len);
indf= inc*(0:(nf-1)).';
inds = (1:len);
f(:) = x(indf(:,ones(1,len))+inds(ones(nf,1),:));
if (nwin > 1)
w = win(:)';
f = f .* w(ones(nf,1),:);
end
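A minimal usage sketch for Lpc_met, assuming the GUI defaults (Order = 9 for the LPC method) and a 30 ms frame length, which makes ceil(len/12) equal to the 20 samples per frame mentioned in the comment above; the input vector is a random stand-in, not project data:

Fs = 8000; Framelen = 30; Order = 9;      % assumed default parameters
len = Fs*Framelen/1000;                   % = 240 samples per frame
x = randn(4000,1);                        % stand-in for a trimmed command signal
F = Lpc_met(x, Fs, Order, len);
size(F)                                   % [32 20] : 32 frames x ceil(240/12) samples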

Melcepst_met.m

function FngrPrnts = Melcepst_met(s,fs,nc,n)

fh=0.5; fl=0;
inc=floor(n/2); % overlap 50%
p = floor(3*log(fs)); % number of filters

z=enframe(s,hamming(n),inc); % framing the input with hamming window

f=rfft(z.'); % discrete fourier transform

[m,a,b]=melbankm(p,n,fs,fl,fh); % Mel Filters

pw=f(a:b,:).*conj(f(a:b,:));
pth=max(pw(:))*1E-20;
ath=sqrt(pth);
y=log(max(m*abs(f(a:b,:)),ath)+eps);

c=rdct(y).'; % discrete Cosine transform

nf=size(c,1); nc=nc+1;

if p>nc
c(:,nc+1:end)=[];
elseif p<nc
c=[c zeros(nf,nc-p)];
end

c(:,1)=[]; nc=nc-1;

vf=(4:-1:-4)/60; af=(1:-1:-1)/2;
ww=ones(5,1);
cx=[c(ww,:); c; c(nf*ww,:)];
vx=reshape(filter(vf,1,cx(:)),nf+10,nc); vx(1:8,:)=[];
ax=reshape(filter(af,1,vx(:)),nf+2,nc);
ax(1:2,:)=[];
c=[c ax];

FngrPrnts = c ;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function f=enframe(x,win,inc)

nx=length(x(:));
nwin=length(win);
if (nwin == 1)
len = win;
else
len = nwin;
end

nf = fix((nx-len+inc)/inc);
f=zeros(nf,len);
indf= inc*(0:(nf-1)).';
inds = (1:len);
f(:) = x(indf(:,ones(1,len))+inds(ones(nf,1),:));
if (nwin > 1)
w = win(:)';
f = f .* w(ones(nf,1),:);
end

function y=rfft(x,n,d)

s=size(x);
if prod(s)==1
y=x;
else
if nargin <3
d=find(s>1); d=d(1);
if nargin<2
n=s(d);
end
end
if isempty(n)
n=s(d);
end
y=fft(x,n,d);
y=reshape(y,prod(s(1:d-1)),n,prod(s(d+1:end)));
s(d)=1+fix(n/2);
y(:,s(d)+1:end,:)=[];
y=reshape(y,s);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function y=rdct(x,n,a,b)

fl=size(x,1)==1;
if fl x=x(:); end
[m,k]=size(x);
if nargin<2 n=m; end
if nargin<4 b=1;
if nargin<3 a=sqrt(2*n); end
end
if n>m x=[x; zeros(n-m,k)];
elseif n<m x(n+1:m,:)=[];
end

x=[x(1:2:n,:); x(2*fix(n/2):-2:2,:)];
z=[sqrt(2) 2*exp((-0.5i*pi/n)*(1:n-1))].';
y=real(fft(x).*z(:,ones(1,k)))/a;
y(1,:)=y(1,:)*b;
if fl y=y.'; end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function [x,mn,mx]=melbankm(p,n,fs,fl,fh)

f0=700/fs; fn2=floor(n/2);
lr=log((f0+fh)/(f0+fl))/(p+1);
bl=n*((f0+fl)*exp([0 1 p p+1]*lr)-f0); b2=ceil(bl(2));
b3=floor(bl(3)); b1=floor(bl(1))+1;
b4=min(fn2,ceil(bl(4)))-1; pf=log((f0+(b1:b4)/n)/(f0+fl))/lr; fp=floor(pf);
pm=pf-fp;
k2=b2-b1+1; k3=b3-b1+1; k4=b4-b1+1;
r=[fp(k2:k4) 1+fp(1:k3)];
c=[k2:k4 1:k3];
v=2*[1-pm(k2:k4) pm(1:k3)];
mn=b1+1;
mx=b4+1;
if nargout > 1
x=sparse(r,c,v);
else
x=sparse(r,c+mn-1,v,p,1+fn2);
end
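A matching sketch for Melcepst_met under the same assumed defaults (Order = 12 for the MFCC method): with Fs = 8000 the filterbank has p = floor(3*log(8000)) = 26 filters, and each frame yields 12 mel-cepstral coefficients plus the 12 acceleration coefficients appended at the end, i.e. 2*Order columns per frame, consistent with the zeros(10,2*Order) fallback fingerprint in Featuring.m.

Fs = 8000; Framelen = 30; Order = 12;     % assumed default parameters
len = Fs*Framelen/1000;                   % 240-sample (30 ms) frames
x = randn(4000,1);                        % stand-in for a trimmed command signal
F = Melcepst_met(x, Fs, Order, len);
size(F)                                   % [32 24] : 12 MFCCs + 12 accel. coeffs per frame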

Disteusq.m

function dif=Disteusq(x,y)

[rx,cx] = size(x); [ry,cy] = size(y);

if ((rx*cx)>(ry*cy))
aa=x; bb=y; x = bb; y = aa; % to find the distance always in the same form rx <= ry
end

[nx,p]=size(x); ny=size(y,1);

% calculate the distances between the frames


if p>1
z=permute(x(:,:,ones(1,ny)),[1 3 2])-permute(y(:,:,ones(1,nx)),[3 1 2]);
d=sum(z.*conj(z),3);
else
z=x(:,ones(1,ny))-y(:,ones(1,nx)).';
d=z.*conj(z);
end

[r,c] = size(d); m = min([r c]); ds = abs(r-c);

diff(1) = min( d(1,1:2) ) ;

for n = 2 : m-1
diff(n) = min( d( n , ( (n-1):(n+1) ) ) ) ; % frame to frame comparison with warping
end

diff(m) = min( d( m,((m-1):m) ) ) ;

if ( r ~= c )

diff(m) = min( d( m,((m-1):(m+1)) ) ) ;

for n = (m+1):(m+ds)

diff(n)= 3*mean( d( m , ((m+1):(m+ds)) ) ); % extra frame distances.

end
end

dif = sum(diff);
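Disteusq thus returns a single dissimilarity score: it builds the squared Euclidean distance matrix between the frames of the two fingerprints, walks the diagonal while letting each frame match its immediate neighbours (a one-step time warping), and charges the surplus frames of the longer fingerprint at three times their mean distance. A usage sketch with random stand-in fingerprints:

A = randn(30, 24);            % e.g. a 30-frame MFCC fingerprint
B = randn(34, 24);            % a 34-frame fingerprint of the same command
d = Disteusq(A, B);           % scalar dissimilarity; smaller means a better match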

Compare.m

function Compare( FngrPrnts,lib_com1,lib_com2,lib_com3,lf,hObject, eventdata, handles )

level = str2double(get(handles.level_text,'String')); % get the recognition level threshold percentage

for n = 1:lf
Ediff1(n) = Disteusq( FngrPrnts , lib_com1{n} ) ; % Calculate the distances between the current
Ediff2(n) = Disteusq( FngrPrnts , lib_com2{n} ) ; % fingerprint and the library fingerprints
Ediff3(n) = Disteusq( FngrPrnts , lib_com3{n} ) ;
end

min1 = min( Ediff1 ) ; min2 = min( Ediff2 ) ; min3 = min( Ediff3 ) ;


min_all = min( [ min1 min2 min3 ] ); % minimum distance of all

if min_all == min1
p = find( Ediff1 == min1); comm = lib_com1{p}; % find the listbox index value where the min. distance occurs
elseif min_all == min2
p = find( Ediff2 == min2); comm = lib_com2{p};
elseif min_all == min3
p = find( Ediff3 == min3); comm = lib_com3{p};
end

error1 = min_all / sum(sum(abs(comm))) ; % squared error
error2 = sqrt(min_all) / sum(sum(abs(comm))) ; % true (root) error
err = ( error1 + error2 ) / 2; % final error ('err' avoids shadowing MATLAB's error function)
matched = ceil(100 - ( 100 * err ) ); % matching percentage

if (matched < level) % match below the recognition threshold

set(handles.listbox1,'Value',(lf+1)); % highlight "???"


com_name = get(handles.listbox1,'String');
set(handles.text28,'String',[ ''' ' com_name{lf+1} ' ''' ' , ' '% ' num2str(matched) ' Matched' ]);

else % good match

set(handles.listbox1,'Value',p); % highlight the name of the recognized command


com_name = get(handles.listbox1,'String');
set(handles.text28,'String',[ ''' ' com_name{p} ' ''' ' , ' '% ' num2str(matched) ' Matched' ]);
guidata(hObject, handles);

end
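To make the scoring concrete, a worked example with illustrative numbers: both error terms are normalized by the total absolute magnitude of the best-matching library fingerprint, averaged, and mapped to a percentage.

% Worked example of the matching score (numbers are illustrative)
min_all = 4; total = 200;             % best distance and sum(sum(abs(comm)))
error1  = min_all/total;              % 0.020
error2  = sqrt(min_all)/total;        % 0.010
err     = (error1 + error2)/2;        % 0.015
matched = ceil(100 - 100*err);        % 99, i.e. a 99% match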
