Dhruvin M. Lingaria
August 2015
Developing a smart home environment for assistive living requires great effort. The key element of the smart environment is a ubiquitous voice user interface with several additional capabilities, such as gesture recognition, which can be a new feature of voice-controlled devices. Many identification technologies are used in current intelligent guard systems; relative to other techniques, voice recognition offers a natural, hands-free means of control. The assistive device project has incorporated voice recognition technology to perform GSM calling. An Arduino UNO creates the interface between the voice module and the GSM module SIM900. The platform was developed using inexpensive hardware and software elements available on the market. The assistive device showed high robustness for people with disabilities. Sample voice commands were stored in the temporary memory of the ATMEGA 328P when field tests with several sets of voice commands were done. The GSM module SIM900 could easily connect to the local cellular network carriers; hence, voice-recognized GSM calling was successfully performed.
A PROJECT REPORT
Presented to the Department of Electrical Engineering
California State University, Long Beach
In Partial Fulfillment
of the Requirements for the Degree
Master of Science in Electrical Engineering
Committee Members:
Christopher Druzgalski, Ph.D. (Chair)
Anastasios Chassiakos, Ph.D.
James Ary, Ph.D.
College Designee:
Antonella Sciortino, Ph.D.
By
Dhruvin M Lingaria
B.E., 2012, Rizvi College of Engineering, Mumbai, India
August 2015
ProQuest Number: 1600584
Published by ProQuest LLC (2015). Copyright of the Dissertation is held by the Author.
Copyright 2015
Dhruvin M Lingaria
ALL RIGHTS RESERVED
ACKNOWLEDGEMENTS
I am thankful to God for the wellbeing and good health that were necessary to complete this project. I express my sincere gratitude to my advisor, Christopher Druzgalski, Ph.D., Professor of Biomedical Engineering, for arranging all the necessary facilities for the research. I place on record my sincere thanks to Dr. Anastasios Chassiakos for valuable guidance, sharing expertise, and encouragement extended to me. I take this opportunity to express gratitude to all of the department faculty members for their help and support. I also show my gratitude to one and all who directly or indirectly lent their hand in this venture.
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS...…………………………………………………….............iii
LIST OF TABLES…………………………………………………………………..........vi
LIST OF FIGURES………………………………………………………………….......vii
LIST OF ABBREVIATIONS……………………………………………….………........ix
CHAPTER
1. INTRODUCTION…………………………………………….……………….……..1
2. SYSTEM DESIGN…………..……………...…………………………………..........2
2.1. Procedure................................................................................…………...........2
2.1.1. Arduino Uno……………………….…………………………...…….......3
2.1.2. GSM Shield………………………………...…………….…….…………6
2.1.3. GSM Antenna.............................................................................................8
2.1.4. Voice Module...................................................................................….......9
2.2. Integrating GSM Shield with the Arduino Uno.…………………………......10
2.3. Integrating Voice Module with Arduino Uno.………………………….........11
2.4. Flow Chart: Voice Recognized Telephone Calling……………………….....12
4. VOICE RECOGNITION……………………………………………………….......16
6. RESULTS………………….………………………………………….………......62
7.1. Applications….………………………………………….…………………...68
7.2. Future Scope…………….………………………………….……………......70
8. CONCLUSION…………………….……………………………….…………….71
REFERENCES……………………….…………………………………………….72
LIST OF TABLES
TABLE Page
Data………………………………………………………………………………...63
LIST OF FIGURES
FIGURE Page
5. GSM antenna……………………………………..………………………..……..……8
30. SIM900………………………………………………………………………..…….58
39. Assistive device using voice recognition for GSM voice call………………………67
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
Voice technology is of enormous benefit for people with physical disabilities. People with different kinds of disabilities may benefit from various kinds of speech and voice technologies. The assistive device developed in this project allows physically disabled people to use their voice to initiate a GSM call. It is a very robust product that can be used in any environment to suit the user. Voice module V3 records the voice of different users to recognize the voice. Every user speaks the numbers 0 to 9 to train the voice module. These voice samples are stored in the voice module library. Each user has his or her own voice samples from which the voice module recognizes the voice. There are variations in the voices of each user; moreover, each user produces a slightly different voice sample every time they speak the same number. So, training the voice module on different samples of voice is very crucial. Once the voice is recognized, the Arduino UNO interfaces with the GSM module to search for the cellular network. As soon as the uplink connection to the cellular tower is created, the Arduino UNO commands the GSM module to make a call; this call can be to an emergency number, allowing disabled people to use emergency services. This project can be helpful to many disabled people who want to talk to doctors far away from them or need medical assistance.
CHAPTER 2
SYSTEM DESIGN
2.1. PROCEDURE
Selection of the Components:
Selecting the components required to perform the project is one of the most important steps in developing any product. Depending on the resources required, the product can be expensive or economical, and this factor decides whether the project can go into production. The components used are:
1. Arduino UNO
2. GSM shield (SIM900)
3. Voice module V3
4. Bread board
5. Connecting wires
6. Resistors
7. SIM card
8. Microphone
The components for the assistive device are easily available, cost effective, and efficient to handle. They do not require regular maintenance, so they can be used regularly under robust conditions. As per the project requirement, the components use 5 V and 2 A to perform the task. The voice module and GSM module are the added features of the assistive device that give the required output. Assembling all the components together, the resultant output has multiple applications in the assistive living environment.
Description:
The Arduino UNO is a microcontroller board based on the ATmega328. It has 14 digital input/output pins (of which 6 can be used as PWM outputs), a 16 MHz ceramic resonator, 6 analog inputs, a USB connection, an ICSP header, a power jack, and a reset button. Simply connect it to a computer with a USB cable or power it with an AC-to-DC adapter or battery to get started. The UNO differs from all preceding boards in that it does not use the FTDI USB-to-serial driver chip. Instead, it features the Atmega16U2 (Atmega8U2 up to version R2) programmed as a USB-to-serial converter. Arduino UNO board version 2 has a resistor connecting the 8U2 HWB line to ground, making it easier to put into DFU mode. Arduino UNO board version 3 has the following new features: the 1.0 pinout, which adds SDA and SCL pins near the AREF pin and two other new pins near the RESET pin; the IOREF pin, which helps shields adapt to the voltage provided by the board; a second new pin that is not connected and is reserved for future purposes; a stronger RESET circuit; and the ATmega16U2 replacing the 8U2. "UNO" means one in Italian.
TABLE 1. Technical description of Arduino UNO [2]
Microcontroller: ATmega328
Operating Voltage: 5 V
Input Voltage (recommended): 7-12 V
Input Voltage (limits): 6-20 V
Digital I/O Pins: 14 (of which 6 provide PWM output)
Analog Input Pins: 6
DC Current per I/O Pin: 40 mA
DC Current for 3.3V Pin: 50 mA
Flash Memory: 32 KB, of which 0.5 KB used by bootloader
SRAM: 2 KB (ATmega328)
EEPROM: 1 KB (ATmega328)
Clock Speed: 16 MHz
Length: 68.6 mm
Width: 53.4 mm
Weight: 25 g
2.1.2 GSM Shield
Description:
The GSM shield allows an Arduino board to make/receive voice calls, to connect to
internet, and to send/receive SMS messages. The GSM shield uses a radio modem
SIM900. It is possible to communicate with the board using AT commands. The GSM
library has numerous methods for communication with the shield. The shield uses digital
pins 2 and 3 for software serial communication with the SIM900. Pin 2 is connected to
the SIM900's TX pin and pin 3 to its RX pin. The modem's power key pin is connected to an Arduino digital pin so that the sketch can power the modem on in software. The maximum GPRS data uplink and downlink transfer speed is 85.6 kbps. Interfacing with the cellular network requires a SIM card provided by a network operator. The most recent version of the board uses the 1.0 pinout found on rev 3 of the Arduino UNO.
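The shield is ultimately driven by standard GSM AT commands over the software serial link. As an illustration only, the helper below builds the two command strings involved in voice calling (`ATD<number>;` dials a voice call, `ATH` hangs up); actually transmitting them over pins 2/3 is hardware-specific and is deliberately left out.

```python
# Sketch of the GSM AT commands used for voice calling with a SIM900.
# Only the command strings are built here; sending them over the
# software serial link (pins 2 and 3) depends on the wiring above.

def dial_command(number: str) -> str:
    """ATD<number>; initiates a voice call (the trailing ';' marks voice)."""
    digits = "".join(ch for ch in number if ch.isdigit() or ch == "+")
    return f"ATD{digits};\r"

def hangup_command() -> str:
    """ATH terminates the active call."""
    return "ATH\r"

print(dial_command("911"))   # ATD911;
```

In the assistive device, the recognized digits collected from the voice module would be passed to a helper like `dial_command` before being written to the modem.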
2.1.3 GSM Antenna
Description:
GSM systems have specific antenna design requirements because GSM technology
provides the capability for global communications between wireless carriers. The antennas that make this possible are technologically advanced. The phones and the towers themselves use antennas to communicate with each other. The constant development of technology means that antenna design companies have to work hard not only to keep up with the demand for innovations but to produce them as well. GSM has already generated newer and improved generations in the form of 3G and 4G technologies such as UMTS, EDGE, HSDPA, and LTE, while competing with CDMA-based standards.
2.1.4 Voice Module
Description:
The module can identify voices and receives configuration commands, to which it responds through a serial port interface. With the help of this module one can control a car or other electrical devices by voice. The module can store 80 voice instructions, divided into groups of up to 7 instructions each. The voice instructions are first recorded group by group. Once that is done, one group is imported by serial command before the module can recognize the 7 voice instructions within that group; to use the instructions of another group, that group must be imported first. The module is speaker dependent: if a different speaker speaks the voice instruction in place of the trained user, the module will not identify the instruction.
Technical Parameters:
Voltage: 4.5-5.5 V
Current: <40 mA
Digital interface: 5 V TTL-level UART
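The group-based instruction handling described above can be sketched as a small model; the group numbers and instruction labels below are hypothetical, since the real records are trained by voice rather than by text.

```python
# Sketch: 80 trainable voice instructions organized in groups, with
# only one imported group (up to 7 instructions) active at a time,
# mirroring the module's import-by-serial-command behavior.

class VoiceModuleModel:
    GROUP_SIZE = 7

    def __init__(self):
        self.groups = {}     # group number -> list of instruction labels
        self.active = None   # currently imported group, if any

    def record(self, group, labels):
        if len(labels) > self.GROUP_SIZE:
            raise ValueError("a group holds at most 7 instructions")
        self.groups[group] = list(labels)

    def import_group(self, group):
        # corresponds to the serial command that loads one group
        self.active = group

    def recognize(self, spoken):
        # only instructions in the imported group can be recognized
        if self.active is None:
            return None
        labels = self.groups.get(self.active, [])
        return spoken if spoken in labels else None

m = VoiceModuleModel()
m.record(1, ["dial", "send", "cancel"])
m.import_group(1)
print(m.recognize("dial"))    # dial
print(m.recognize("hello"))   # None
```

The key design consequence is visible in `recognize`: an instruction recorded in a non-imported group is simply never matched, which is why the firmware must switch groups before expecting their commands.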
The GSM module has to be integrated with the Arduino UNO so that calling from the patient's device can be initiated. The GSM module has a SIM card inserted in it. From a computer or laptop, a phone call can then be made to any other phone on the GSM network; the relevant pins in the GSM shield are activated when a call or SMS is sent through it. Once the GSM module is integrated with the Arduino UNO, the voice module has to be programmed with the Arduino UNO. The voice is recognized and amplified for the system to use it for the phone call. The method of recognition of the voice is as follows:
1. Say "Dial"
2. The number spoken is repeated back
3. Say "Send"
4. Dialing begins
5. The number is dialed
6. Call in process
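The dial/confirm/send sequence above can be sketched as a small state machine; the state names and word handling here are illustrative, not taken from the actual firmware.

```python
# Sketch of the voice-dialing flow as a state machine:
# IDLE -> "dial" -> COLLECTING digits -> "send" -> DIALING.

class DialFlow:
    def __init__(self):
        self.state = "IDLE"
        self.digits = []

    def hear(self, word):
        if self.state == "IDLE" and word == "dial":
            self.state = "COLLECTING"
        elif self.state == "COLLECTING" and word.isdigit():
            self.digits.append(word)   # module repeats the digit back
        elif self.state == "COLLECTING" and word == "send":
            self.state = "DIALING"
        return self.state

    def number(self):
        return "".join(self.digits)

flow = DialFlow()
for w in ["dial", "9", "1", "1", "send"]:
    flow.hear(w)
print(flow.state, flow.number())   # DIALING 911
```

Unrecognized words simply leave the state unchanged, which matches the behavior of a speaker-dependent module that ignores untrained utterances.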
2.4 Flow Chart: Voice recognized telephone calling
CHAPTER 3
SPEECH PROCESSING
Speech processing is the study of speech signals and of the methods used to process and/or analyze them. Applications of speech processing include:
Speech Coding
Speech Recognition
Speech Enhancement
The speech production process is initiated when the speaker formulates a message in
his/her mind to transmit to the listener via speech. The next step in the process is the
conversion of the message into a language code. This corresponds to converting the message into a set of phoneme sequences for the sounds to be produced, together with prosody markers denoting the pitch, duration, and loudness of the sounds [6].
FIGURE 9. Schematic diagram of the speech production/speech perception process [6].
The rate of discrete symbol information in the crude message text is rather low: about 50 bits per second, corresponding to about 8 sounds per second where each sound is drawn from a set of roughly 64 phonemes (about 6 bits per sound). Once the language code is converted, with the inclusion of prosody information, the information rate rises to about 200 bps. In the next stage the representation of the information in the signals becomes continuous, with an equivalent rate of about 2,000 bps at the neuromuscular control level and about 30,000-50,000 bps at the acoustic signal level. On the perception side, the continuous information rate at the basilar membrane is in the range of 30,000-50,000 bps, while at the neural transduction stage it is about 2,000 bps. The higher-level processing within the brain converts the neural signals to a discrete representation, which ultimately corresponds to the message comprehended by the listener.
3.2 Basic Assumption of Speech Processing
Speech processing systems rest on the basic assumption that the source of excitation and the vocal tract system are independent. Hence, it is appropriate to model the source of excitation and the vocal tract system separately. In continuous speech the vocal tract changes shape slowly and gradually, so it is reasonable to assume that the vocal tract has fixed characteristics over a short interval (on the order of 10-30 ms).
CHAPTER 4
VOICE RECOGNITION
Voice recognition is "the technology by which sounds, words or phrases spoken by humans are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned" [7]. The notion could more generally be called "sound recognition"; the emphasis is on the human voice because we most often and most typically use our voices to communicate our ideas to others in our surroundings. In the situation of a virtual environment, the user would probably gain the feeling of immersion and of being part of the simulation if they could use their most common form of communication, the voice. Due to the fundamental difference between human speech and the more traditional forms of computer input, it is difficult to use voice as an input to computer simulation. While computer programs are commonly designed to produce an explicit and well-defined response upon receiving the proper input, the human voice and spoken words are anything but precise. Each human voice is different, and identical words can have different meanings if spoken with different inflections or in different contexts. Several approaches have been tried, with varying degrees of success; an important dimension distinguishing them is the size of the recognizable vocabulary.
4.1.1. Features
The voice recognition technology adopted by the intelligent access system is based on the SPCE061A single chip. The system hardware is built around this chip and supports the voice recognition and the assistive control circuit. The system software consists of the voice recognition module, the voice training module, the voice-playing module, the speech data processing module, and the cipher input/output module. The voice module handles collecting and extracting the voice data, speech recognition, and voice playing, as well as initializing the system and the identification training. Following voice recognition theory, feature extraction, pretreatment, and pattern matching are performed.
Depending on purpose and function, voice recognition is classified into two types: text-dependent and text-independent. The text-dependent type requires users to pronounce stated contents, from which each speaker's speech model is built up accurately, and the recognition effect is very good. For text-independent recognition, pronunciation matters more than the textual content, so it is difficult to build accurate speech models; however, because customers can use the system conveniently, it can be applied widely. By usage, the technology is further classified into speaker identification and speaker verification. The former picks out, from several samples, the speaker to whom a voice belongs; the latter judges whether or not an identified voice comes from a certain speaker, so its output has only two kinds of result, yes or no. The central processor of this system is the SPCE061A single chip. Text-dependent speaker verification is accomplished on the chip, and then the corresponding commands and operations are carried out. The system is mainly made up of a speaker identification module and a gating circuit.
In training, the voice of the speaker enters the voice signal collection circuit through a microphone; the collected voice signals are then processed by the voice processing circuit, and the characteristic parameters of the speaker are extracted and saved. At identification time, the voice that needs to be identified is matched against the information in the machine's database [11].
FIGURE 10. Block diagram of the system: the microphone and switch circuit feed the SPCE061A single chip, which connects to SPR4096 flash storage, the control circuit, a speaker, and a keyboard for the number dialed.
The hardware part of this system includes the SPCE061A single chip, the audio output circuit, the voice recognition circuit, the FLASH circuit, the audio input circuit, and the keyboard circuit. The main mission of the hardware is to change voice signals into digital signals, collect samples, upload, identify, and play the voice data [10]. The SPCE061A chip has a system frequency of 0.375-49.152 MHz and a wide supply-voltage range. It integrates an ADC, dual-channel 10-bit DAC audio output, and a Watch Dog Timer (WDT) in the single chip. The interruption controller can handle three kinds of FIQ interruption, eleven kinds of IRQ interruption, and one soft interruption controlled by the instruction BREAK. The single chip provides voice processing functions and abundant C function libraries, making it very suitable for implementing voice recognition products.
FIGURE 11. SPCE061A Pin Diagram. www.go-gddq.com.
The principle of the voice recognition circuit is that voice signals are analyzed by the intelligent system after the voice is captured. Firstly, noise is filtered out and the useful components of the voice signals are extracted through a filter group; then the signals are processed and characteristic parameters such as energy and zero-crossing rate are computed. After analysis and processing, the voice signals are pattern-matched against the voice data in the voice database, and lastly the voice recognition result is output according to the match result. The basic structure of the voice recognition circuit thus comprises the filter group, the feature analysis stage, the voice database, and the output of the recognition result.
The chip SPCE061A adopts a low supply voltage in order to reduce power consumption. SPCE061A has two power supplies: the internal power supply VDD and the I/O power supply VDDH. The I/O power supply is 5 V, and the internal power supply is 3.3 V. The main motive for reducing the internal supply is to lower the power consumption and working temperature of the single chip. Though the voltage range of SPCE061A is very wide, to make the chip run more stably and satisfy the voltage demand of the I/O ports and outside parts, the power supply circuit shown in figure 13 is used. AC 220 V is converted into DC 5 V, which supplies the modules and every I/O port inside the system; DC 5 V is then converted into DC 3.3 V by a TR1972-33 regulator [10].
FIGURE 13. Power supply circuit [10].
Firstly, voice signals are pretreated and amplified properly; secondly, the analog signals are converted into digital signals so that digital equipment can process them conveniently; then characteristics are extracted so that a few characteristic parameters can represent the voice signal. Lastly, different treatments are adopted according to the task. Voice recognition can be divided into two stages: the training stage and the identification stage. In the training stage the voice signals expressed by characteristic parameters are processed, and standard data that capture the common characteristics of the basic identification units are obtained. Reference templates are formed from these data, and the reference template database is formed after all reference templates of the identified basic units are combined together. In the identification stage the voice signals to be identified, after the same feature extraction, are matched against the reference template database.
FIGURE 14. The structure of voice signal process [10].
A. Pretreatment:
Noise seriously disturbs the processing and identification of voice signals, so the noise must be removed first. The input analog voice signals from the microphone must be sampled and quantized in order to obtain digital voice signals. Before converting voice signals into digital signals, it is necessary to filter them and counter disturbances: in filtering, the signal components and noise above half the sampling frequency are removed. Cleaner voice signals are obtained, and low-frequency disturbance is then filtered through pre-emphasis, which improves the voice signals by cutting out DC drift and enhancing the components useful for subsequent processing.
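The DC-removal and high-frequency emphasis step described above is commonly realized as a first-order pre-emphasis filter y[n] = x[n] - a*x[n-1]; the coefficient 0.97 below is a typical textbook value, assumed here for illustration.

```python
# Pre-emphasis filter sketch: attenuates DC drift and low-frequency
# disturbance while boosting the high-frequency speech components.

def pre_emphasis(signal, alpha=0.97):
    out = [signal[0]]                       # first sample passes unchanged
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

dc = [1.0] * 5                              # a pure DC (constant) signal
print(pre_emphasis(dc))                     # tail samples shrink to ~0.03
```

Feeding a constant signal through the filter shows the DC-suppression property directly: every sample after the first is reduced to 1 - 0.97 = 0.03 of its original value.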
B. Characteristic extraction:
The system adopts an evaluation method that uses the contrast between characteristic parameters. The basic idea is to extract group characteristic parameters from a voice segment of the same speaker, that is, to map the segment onto a dot of a multi-dimensional space. Different voices from the same speaker will produce different dots in the characteristic space, whose distribution can be described by a multi-variable probability density function. For single pronunciations from the same speaker, these dots are relatively concentrated, while the distributions from different speakers lie farther apart; thus the group characteristic parameters can describe the voiceprint of speakers effectively. According to this principle, for a single parameter, the F ratio between the two kinds of dispersion can be used as an effective measurement rule: the F ratio contrasts the dispersion between different speakers with the self-dispersion of each speaker. The bigger the F ratio of a characteristic parameter, the more the between-speaker dispersion exceeds the within-speaker dispersion on average. Therefore the recognition system adopts parameters with bigger F ratios, and the system capability is improved [10].
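The F ratio above compares between-speaker dispersion to within-speaker dispersion for a single parameter. A minimal sketch, with made-up sample values standing in for the measured characteristic:

```python
# F ratio sketch for one characteristic parameter:
# (variance of the per-speaker means) / (average within-speaker variance).
# A larger F means the parameter separates speakers better.

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def f_ratio(samples_per_speaker):
    speaker_means = [mean(s) for s in samples_per_speaker]
    between = variance(speaker_means)
    within = mean([variance(s) for s in samples_per_speaker])
    return between / within

# Hypothetical parameter values for three speakers:
speakers = [[1.0, 1.1, 0.9], [3.0, 3.2, 2.8], [5.1, 4.9, 5.0]]
print(f_ratio(speakers) > 1.0)   # True: speakers are well separated
```

With these toy values each speaker's samples cluster tightly around distinct means, so the F ratio is large; a parameter whose per-speaker clouds overlap would score near or below 1 and would be discarded.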
C. Module match:
Matching of characteristic parameters is performed by one of several algorithms. Typical methods are the vector quantization algorithm, the Gaussian mixture model algorithm, the dynamic time warping (DTW) algorithm, and the artificial neural network algorithm. Each of these methods has both advantages and weaknesses. When the DTW algorithm is applied to the identification of long utterances, the matching computation becomes too large; but the algorithm is simple and effective for short utterances (valid voice shorter than about 3 seconds). So the method is especially applicable to a text-dependent speaker recognition system. The system recognizes a speaker's identity using characteristics extracted from their voice; a speaker recognition system exploits the speaker-specific information included in the speech signals. Speaker recognition technology makes it possible to use the speaker's voice to verify their identity and control access to various services such as database access services, voice mail, banking by telephone, information services, and security control. Speaker recognition methods can be divided into text-dependent (speech-dependent) and text-independent (speech-independent) techniques: the former discriminate the users based on the same spoken letters, words, or numbers, while the latter do not rely on a definite utterance. As in any pattern recognition system, the speaker recognition system consists of a feature extraction part and a classification part. Speaker recognition can further be divided into two categories, supervised and unsupervised recognition, depending on its character. In this project we consider identification based on the word the speaker pronounced; speaker verification then represents the procedure of accepting or rejecting the claimed identity.
The speech signal is divided into overlapping frames of 256 samples, with an overlap of 128 samples; each frame is multiplied by a Hamming window of length 256 samples. The FFT (Fast Fourier Transform) computes the spectrum of each windowed frame, and the cepstrum of each windowed frame s[n] is then derived. Then we translate the regular frequencies to a scale that is more appropriate for speech: the Mel scale approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. The cepstrum and the Mel-frequency cepstrum differ in that, in the MFC, the frequency bands are equally spaced on the Mel scale. The Mel-frequency cepstral coefficients (MFCCs) are computed as follows:
1. Take the Fourier transform of a windowed excerpt of the signal.
2. Map the powers of the spectrum onto the Mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the Mel frequencies.
4. Take the DCT (Discrete Cosine Transform) of the list of Mel log powers, as if it were a signal.
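The four steps above can be sketched end to end in pure Python. The sample rate (8 kHz), the filter count (10), and the naive DFT in place of an FFT are all simplifying assumptions for illustration; a real implementation would use an FFT library and more filters.

```python
import math

# Sketch of the MFCC pipeline on 256-sample frames with 128 overlap.

def frames(signal, size=256, hop=128):
    return [signal[i:i + size]
            for i in range(0, len(signal) - size + 1, hop)]

def hamming(frame):
    N = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
            for n, x in enumerate(frame)]

def power_spectrum(frame):
    # Naive DFT (an FFT would be used in practice); bins 0..N/2.
    N = len(frame)
    spec = []
    for k in range(N // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * n / N) for n, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * n / N) for n, x in enumerate(frame))
        spec.append(re * re + im * im)
    return spec

def mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_filterbank(spec, rate=8000, nfilt=10):
    # Triangular filters equally spaced on the Mel scale (step 2).
    nfft = (len(spec) - 1) * 2
    points = [mel(0) + i * (mel(rate / 2) - mel(0)) / (nfilt + 1)
              for i in range(nfilt + 2)]
    hz = [700.0 * (10 ** (m / 2595.0) - 1.0) for m in points]
    bins = [int(nfft * f / rate) for f in hz]
    energies = []
    for i in range(1, nfilt + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        e = 0.0
        for k in range(lo, hi):
            if k < mid and mid != lo:
                w = (k - lo) / (mid - lo)
            elif k >= mid and hi != mid:
                w = (hi - k) / (hi - mid)
            else:
                w = 0.0
            e += w * spec[k]
        energies.append(e + 1e-10)   # avoid log(0)
    return energies

def dct(xs):
    # DCT-II of the Mel log powers (step 4).
    N = len(xs)
    return [sum(x * math.cos(math.pi * k * (n + 0.5) / N)
                for n, x in enumerate(xs)) for k in range(N)]

def mfcc(frame):
    spec = power_spectrum(hamming(frame))               # step 1
    logs = [math.log(e) for e in mel_filterbank(spec)]  # steps 2-3
    return dct(logs)                                    # step 4

signal = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(1024)]
print(len(frames(signal)))           # 7 frames of 256 samples
print(len(mfcc(frames(signal)[0])))  # 10 coefficients per frame
```

Note how the 128-sample hop makes consecutive frames overlap by half, exactly as described above: a 1024-sample signal yields 7 frames rather than 4.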
A sequence of MFCCs is thus obtained for each frame, and each MFCC set serves as a melodic cepstral acoustic vector. Melodic cepstral acoustic vectors can perform as feature vectors, but we need more powerful speech features. Hence, the MFCC acoustic vectors undergo a derivation process: the first-order derivatives of the Mel cepstral coefficients are computed as the delta Mel cepstral coefficients (DMFCCs). Then the delta-delta Mel-frequency cepstral coefficients (DDMFCCs) are computed as the derivatives of the DMFCCs, being the second-order derivatives of the MFCCs; with them we model the intra-speaker variability. The computed DDMFC coefficients show how fast the voice of a speaker is changing in time. A DDMFC acoustic vector is thus obtained for each frame of the initial speech signal S. Each acoustic vector is composed of 256 samples, but the speech information is mainly encoded in the first 12 coefficients. So, we truncate each vector at its first 12 samples; the truncated vector is a good voice discriminator and can therefore be used successfully as a vocal feature vector for speaker recognition. Each acoustic matrix has 12 rows and a number of columns depending on the length of the vocal signal S. Therefore, because of their different sizes, these speech feature vectors cannot be compared using linear metrics, such as the widely known Euclidean distance. A solution would be to transform the acoustic matrices, through re-sampling or padding with zero values, so that they get the same dimensions and the Euclidean metric could be used; the disadvantage of this approach is the possible loss of valuable speech information from the feature vectors. There are many other possible speech feature vectors that can be obtained with this delta-delta Mel-cepstral analysis; for example, a vocal feature vector for signal S could be made from statistical values computed for each DDMFCC (or MFCC) acoustic vector of the signal.
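The first- and second-order derivative steps can be sketched as frame-to-frame differences of the coefficient sequence; the symmetric-difference form below, with edge frames clamped, is one common formulation assumed here for illustration.

```python
# Delta (DMFCC) and delta-delta (DDMFCC) sketch: derivatives of the
# MFCC sequence across frames, approximated by symmetric differences.

def deltas(coeff_frames):
    """coeff_frames: list of per-frame coefficient vectors (equal length)."""
    out = []
    for t in range(len(coeff_frames)):
        prev = coeff_frames[max(t - 1, 0)]
        nxt = coeff_frames[min(t + 1, len(coeff_frames) - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

mfccs = [[0.0], [1.0], [4.0], [9.0]]   # toy 1-coefficient frames
dmfcc = deltas(mfccs)                  # first-order derivative
ddmfcc = deltas(dmfcc)                 # second-order derivative
print([v[0] for v in dmfcc])           # [0.5, 2.0, 4.0, 2.5]
```

Applying `deltas` once yields the DMFCCs and applying it again yields the DDMFCCs, mirroring how the second-order coefficients capture the acceleration of spectral change described above.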
For the words spoken by the users, template matching is the most effective technique for the text-dependent recognition process; dynamic time warping (DTW) algorithms or hidden Markov model (HMM) methods are used extensively for voice recognition.
The DDMFCC-based feature extraction is performed and the feature vectors are obtained. Thus V(S), the feature vector of speech signal S, can be computed as the truncated 12-row delta-delta Mel cepstral acoustic matrix. Another featuring solution tested is computing V(S) as the mean of the DDMFCC matrix; we then obtain V(S) as a one-dimensional vector containing the mean values of the columns of the acoustic matrix. The feature extraction process is applied to the input speech signals, the feature set {V(S1), ..., V(Sn)} being obtained. From the collection of spoken words, a training set related to the same speech is obtained, provided by the registered speakers and filtered for noise removal. A vocal prototype is assigned to each speech signal during training, and the feature training sets are obtained by computing feature vectors from these prototypes. Consider N registered speakers; then Sp = {Sp1, ..., SpN}, where each Spi = {Si1, ..., Si n(i)} represents the set of signal prototypes corresponding to the ith speaker. For each Sij, where i = 1, ..., N and j = 1, ..., n(i), the vocal feature extraction is performed, and the obtained sequence {V(S11), ..., V(S1 n(1)), ..., V(SN1), ..., V(SN n(N))} represents the feature training set of the classifiers.
The next step is the minimum-distance classification procedure. We consider N classes, one per registered speaker. The algorithm places each input vocal sequence Si in the class of the speaker with the smallest mean distance between the feature vector of the input signal and that speaker's prototype vectors; the closest speaker is thus selected, where d represents the metric. The classification result, consisting of N classes of speech utterances, also represents the speaker identification result: the correct speaker is thus identified for each input. The next stage of the recognition process, speaker verification, has to decide whether the identified speaker is the one who really produced the utterance. We propose a threshold-based verification procedure, performed within each resulting speaker class: each mean distance computed in any class must not exceed a specially chosen threshold value. Any threshold-based recognition approach implies the task of choosing a proper threshold; here we choose the overall maximal distance between any two prototype vectors belonging to the same training feature subset. Thus, a satisfactory threshold is obtained from the following equation:

T = max over i of ( max over j, k of d(V(Si j), V(Si k)) )
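The minimum-mean-distance classification and the max-prototype-distance threshold can be sketched as below; the Euclidean metric and the toy two-dimensional vectors are assumptions made purely for illustration.

```python
import math

# Sketch: assign an input feature vector to the speaker whose prototypes
# have the smallest mean distance; reject as unregistered when that
# distance exceeds T, the overall maximum distance between any two
# prototypes of the same speaker's training subset.

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify(x, prototypes, threshold):
    """prototypes: dict speaker -> list of prototype vectors."""
    means = {spk: sum(dist(x, p) for p in ps) / len(ps)
             for spk, ps in prototypes.items()}
    best = min(means, key=means.get)
    return best if means[best] <= threshold else "unregistered"

def auto_threshold(prototypes):
    return max(dist(p, q)
               for ps in prototypes.values()
               for p in ps for q in ps)

protos = {"spk1": [[0.0, 0.0], [0.2, 0.0]],
          "spk2": [[5.0, 5.0], [5.2, 5.0]]}
T = auto_threshold(protos)                 # 0.2 with these prototypes
print(classify([0.1, 0.0], protos, T))     # spk1
print(classify([9.0, 9.0], protos, T))     # unregistered
```

Because T is derived from within-speaker spread only, an input far from every training subset fails the threshold test and is rejected, which is exactly the verification behavior described in the text.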
A high recognition rate, approximately 85%, has been reached by our speech-dependent voice recognition system in a test with 5 input vocal utterances and 3 registered speakers. The speech input signals and their corresponding feature vectors are represented in the next two figures. All of these vocal utterances contain a single word: hello.
FIGURE 16. Experimental Vocal feature vectors [12].
FIGURE 17. Training set and the corresponding feature set [12].
Using the registered values we obtain the identification result: Speaker 1 = {S1, S4}, Speaker 2 = {S3, S4}, and Speaker 3 = {S2}. Computing the threshold value T = 1.3915, we obtain the recognition: Speaker 1 = {S1, S4}, Speaker 2 = {S3}, and Speaker 3 = {S2}.

Speech-independent recognition does not constrain the spoken text, ensuring that the entire vocal range is captured; it is therefore useful for non-cooperative subjects, for example those in surveillance systems. Effective speech-independent recognition methods are based on Vector Quantization (VQ) or the Gaussian Mixture Model (GMM): the VQ-based methods are non-parametric approaches, while the GMM method is a parametric technique in which K Gaussian distributions are used. We utilize the same delta-delta Mel cepstral analysis for the feature extraction part of this recognition system, with 12-dimensional feature vectors, each vector V(S) being computed as the truncated DDMFCC acoustic matrix. The sequence of speech signals to be recognized, {S1, ..., Sn}, is no longer characterized by the same speech. A similar minimum-mean-distance classifier is used, with a different training set: we consider a large set of spoken letters and words, containing most of the English-language phonemes. Each registered speaker provides this speech several times, so the same text is obtained from all the prototype signals of Sp, and the classification equation identifies the speaker. We again provide a threshold-based verification technique, though not an automatic one as in the previous case: T is a threshold value chosen analytically so as to satisfy the necessary condition, where C1, ..., CN represent the identified voice classes. Many numerical tests using this approach were performed, and a high voice recognition rate was obtained [12].
FIGURE 21. Prototype vocal signals and their feature vectors [12].
Figure 21 shows the prototype speech signals and their corresponding DDMFCC-based speech feature vectors. The mean distance values between the input feature vectors and the training feature subsets are then computed. Using the values registered in TABLE 2 we obtain the identification result: Speaker 1 = {S2, S6, S9}, Speaker 2 = {S1, S3, S7}, and Speaker 3 = {S4, S5, S8}. With the threshold value T = 7.67, we obtain the final recognition: Speaker 1 = {S2, S6, S9}, Speaker 2 = {S1, S3, S7}, Speaker 3 = {S4, S8}, and unregistered Speaker = {S5}. This is the voice recognition technique used in this project.
4.2.2. Dynamic Time Warping Algorithm (DTW):
Dynamic Time Warping is an algorithm that calculates an optimal warping path between two time series, together with the distance between them. Suppose we have two numerical sequences (a1, a2, ..., an) and (b1, b2, ..., bm); the two sequences may differ in length. The algorithm starts by calculating the local distances between the elements of the two sequences; here the absolute difference between the values of the two elements (Euclidean distance) is used, which results in a matrix of local distances d(i, j). Starting from the local distance matrix, a matrix of minimal accumulated distances is then built:

aij = d(i, j) + min(a(i-1)(j-1), a(i-1)j, ai(j-1))

where aij is the minimal distance between the subsequences (a1, a2, ..., ai) and (b1, b2, ..., bj). A warping path through the minimal distance matrix runs from element a11 to element anm and consists of those aij elements that formed the anm distance. The equation below gives the global warp cost of the two sequences.
GWC = (1/p) * (w1 + w2 + ... + wp)

where wi are the elements that belong to the warping path and p is the number of wi elements. The calculations made for two short sequences are shown in the accompanying figure, including the highlighted warping path.
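The DTW steps above — local distances, accumulated minimal-distance matrix, warping path, and global warp cost — can be sketched as:

```python
# Dynamic Time Warping between two numeric sequences, following the
# steps above: local (absolute-difference) distances, an accumulated
# minimal-distance matrix, and the global warp cost along the path.

def dtw(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    # Accumulated minimal-distance matrix, 1-indexed with an INF border.
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])        # local distance d(i, j)
            acc[i][j] = local + min(acc[i - 1][j - 1],   # match
                                    acc[i - 1][j],       # insertion
                                    acc[i][j - 1])       # deletion
    # Recover the warping path by backtracking from (n, m) to (1, 1).
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i, j))
        moves = {(i - 1, j - 1): acc[i - 1][j - 1],
                 (i - 1, j): acc[i - 1][j],
                 (i, j - 1): acc[i][j - 1]}
        i, j = min(moves, key=moves.get)
    path.append((1, 1))
    p = len(path)
    # Global warp cost: accumulated distance along the path divided by p.
    return acc[n][m] / p

cost = dtw([1, 2, 3], [1, 2, 2, 3])
```

Two sequences that differ only by a repeated element warp onto each other with zero cost, which is exactly why DTW tolerates the varying speaking rates mentioned above.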
However, even though voice recognition is done partly in the frequency domain, a still unknown brain-like algorithm would have to be discovered to explain how the voice is divided into syllables and phonemes for recognition. Since there are too many unknown facts about how the brain recognizes the voice through different paths and processes, it may still be better to approach the problem with a probabilistic algorithm than with an analytic one. For this reason, two different voice recognition algorithms have been studied; the feature common to both is the extraction of feature parameters from the speech signal. The NN (Neural Network) recognition algorithm first applies the feature parameters of an unknown new syllable or word to a huge coefficient matrix. Recognition with a neural network makes the whole learning process time-consuming because of this large coefficient matrix: if a new speech signal is added to the recognition algorithm, the entire process must be repeated from the beginning, which is a serious drawback. In the second method, the HMM (Hidden Markov Model) recognition algorithm, voice feature parameters are generated for every new input voice signal and used in the learning process to create a new HMM model. With an HMM model created for every word, all of these models are compared with the test word during the testing phase to find the matching voice sample. The disadvantage of the HMM approach is that for every new voice added, a new individual HMM model must be created and compared with all the existing HMM models to find a match, which slows down recognition speed. On the other hand, the HMM method is fast in initial training: when new voice information is added to the HMM database, only the new voice is used in the training process to create a new HMM model. Compared to the neural network algorithm, the HMM algorithm therefore provides a faster training process for a large number of speech samples.
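The per-word HMM scoring described above can be sketched with the classic forward algorithm. The two-state models and discrete observation symbols below are toy values invented for illustration, not trained speech models:

```python
# Toy illustration of HMM-based word scoring: each word has its own
# small discrete HMM; a test observation sequence is scored against
# every model with the forward algorithm, and the best-scoring word wins.
# All probabilities and symbols here are made up for illustration.

def forward(obs, pi, A, B):
    """P(obs | model) via the forward algorithm.
    pi: initial state probs, A: transition matrix, B: emission probs."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(n)) * B[t][o]
                 for t in range(n)]
    return sum(alpha)

# Two word models over observation symbols 0 and 1.
model_on  = dict(pi=[1.0, 0.0],
                 A=[[0.6, 0.4], [0.0, 1.0]],
                 B=[{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}])
model_off = dict(pi=[1.0, 0.0],
                 A=[[0.6, 0.4], [0.0, 1.0]],
                 B=[{0: 0.1, 1: 0.9}, {0: 0.8, 1: 0.2}])

def recognize(obs, models):
    # Compare the test sequence against every word's HMM, pick the best.
    return max(models, key=lambda w: forward(obs, **models[w]))

word = recognize([0, 0, 1, 1], {"On": model_on, "Off": model_off})
```

Adding a new word only requires training one new model, which is the scalability advantage noted above; the cost is that recognition must score the input against every stored model.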
4.3 Train
Arduino VR Module
5V ---------> 5V
1 ---------> TX
0 ---------> RX
Choose the right Arduino board (Tools -> Board; UNO recommended) and the right serial port.
Open the Serial Monitor, set the baud rate to 115200, and set the line ending to Newline or Both NL & CR.
FIGURE 22. Test on Serial Monitor [9].
settings. Write the settings command and press Send to enter it.
FIGURE 24. Settings of the voice module [9].
Train the Voice Recognition Module. Train record 0 with the signature "On" by sending the sigtrain 0 On command. When the Serial Monitor prints "Speak now", speak your word (it can be any word, though a meaningful one such as "On" is recommended), and when the Serial Monitor prints "Speak again", speak the word again. If the two utterances match, the Serial Monitor prints "Success" and record 0 is trained; if they do not match, repeat the process until it succeeds.
When training, the two LEDs on the Voice Recognition Module can help guide your training process. After the train command is sent, the SYS_LED blinks to prompt you to get ready; speak as soon as the STATUS_LED lights up, and when the STATUS_LED goes off the recording is finished. When the training is successful, the SYS_LED blinks again and this cycle repeats. When a record passes, the SYS_LED and STATUS_LED blink together.
FIGURE 25. Input “Sigtrain 0 On” in serial monitor [9].
Train another record. Send the sigtrain 1 Off command to train record 1 with the signature "Off". Choose your favorite words to train (any word will do, though a meaningful word is recommended).
FIGURE 26. Input “Sigtrain 1 Off” in serial monitor [9].
FIGURE 28. Recognize the voice input [9].
Training is finished. The train sample sketch also supports several other commands [9].
4.4 Protocol
Base format: 1. Control frames, 2. Return frames.
Codes:
1. FRAME CODE
AA --> Frame Head
0A --> Frame End
2. CHECK
00 --> Check System Settings
01 --> Check Recognizer
02 --> Check Record Train Status
03 --> Check Signature of One Record
3. SYSTEM SETTINGS
10 --> Restore System Settings
11 --> Set Baud Rate
12 --> Set Output IO Mode
13 --> Set Output IO Pulse Width
14 --> Reset Output IO
15 --> Set Power on Auto Load
4. RECORD OPERATION
20 --> Train One Record or Records
21 --> Training of One Record and Set Signature
22 --> Set Signature for Record
5. RECOGNIZER CONTROL
30 --> Load a Record
31 --> Clear Recognizer
32 --> Group Control
4.5 Details
Use the "Check System Settings" command to check the current settings of the Voice Recognition Module, including serial baud rate, output IO mode, output IO pulse width, auto load, and the group function.
Format:
| AA | 02 | 00 | 0A |
Return:
| AA | 08 | 00 | STA | BR | IOM | IOPW | AL | GRP | 0A |
STA : Trained status (0-untrained 1-trained FF-record value out of range)
BR: Baud rate (0,3-9600 1-2400 2-4800 4-19200 5-38400)
IOM: Output IO Mode (0-Pulse 1-Toggle 2-Clear 3-Set)
IOPW: Output IO Pulse Width(Pulse Mode) (1~15)
AL: Power on auto load (0-disable 1-enable)
GRP: Group control by external IO (0-disable 1-system group 2-user group)
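The frame layout above can be exercised off-device. The helper below builds the 4-byte "Check System Settings" request and unpacks a return frame; the sample payload bytes are made up for illustration, not a real capture:

```python
# Build and parse Voice Recognition Module frames for the
# "Check System Settings" command (frame head AA, frame end 0A).
# The sample return payload below is illustrative, not a real capture.

FRAME_HEAD, FRAME_END = 0xAA, 0x0A

def build_frame(cmd, payload=b""):
    """| AA | LEN | CMD | payload... | 0A |
    LEN counts the command byte, the payload, and the end byte."""
    length = 1 + len(payload) + 1
    return bytes([FRAME_HEAD, length, cmd]) + payload + bytes([FRAME_END])

def parse_settings_return(frame):
    """Unpack | AA | 08 | 00 | STA | BR | IOM | IOPW | AL | GRP | 0A |."""
    assert frame[0] == FRAME_HEAD and frame[-1] == FRAME_END
    assert frame[2] == 0x00  # Check System Settings return code
    sta, br, iom, iopw, al, grp = frame[3:9]
    return {"trained": sta, "baud": br, "io_mode": iom,
            "pulse_width": iopw, "auto_load": al, "group": grp}

request = build_frame(0x00)                  # | AA | 02 | 00 | 0A |
sample = bytes([0xAA, 0x08, 0x00, 1, 0, 0, 1, 0, 0, 0x0A])
settings = parse_settings_return(sample)
```

Note that the length byte 02 in the request and 08 in the return both follow the same rule: command byte plus payload plus the trailing 0A.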
Use the "Check Record Train Status" command to check whether a record is trained.
Format:
Check all records
| AA | 03 | 02 | FF| 0A |
Check specified records
| AA | 03+n | 02 | R0 | ... | Rn | 0A |
Return:
| AA | 5+2n | 02 | N | R0 | STA | ... | Rn | STA | 0A |
N: number of trained records.
R0 ~ Rn: record.
STA: trained status (0-untrained 1-trained FF-record value out of range)
This command is used to set the baud rate of the Voice Recognition Module; the new setting takes effect after the module is restarted.
Format:
| AA | 03 | 11 | BR | 0A |
Return:
| AA | 03 | 11 | 00 | 0A |
BR: Serial baud rate. (0-9600 1-2400 2-4800 3-9600 4-19200 5-38400)
This command is used to set the output IO mode of the Voice Recognition Module; it takes effect immediately after the instruction executes.
Format:
| AA | 03 | 12 | MODE | 0A |
Return:
| AA | 03 | 12 | 00 | 0A |
MODE: Output IO mode. (0-pulse mode 1-Toggle 2-Set 3-Clear)
Use this command to set the output IO pulse width of the Voice Recognition Module; it takes effect immediately after the instruction executes. The pulse width is used when the output IO mode is "Pulse".
Format:
| AA | 03 | 13 | LEVEL | 0A |
Return:
| AA | 03 | 13 | 00 | 0A |
LEVEL: pulse width level. Details:
- 00 10ms
- 01 15ms
- 02 20ms
- 03 25ms
- 04 30ms
- 05 35ms
- 06 40ms
- 07 45ms
- 08 50ms
- 09 75ms
- 0A 100ms
- 0B 200ms
- 0C 300ms
- 0D 400ms
- 0E 500ms
- 0F 1s
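The level-to-milliseconds table above is a simple lookup; a sketch that converts a level to its pulse width and builds the corresponding | AA | 03 | 13 | LEVEL | 0A | frame:

```python
# Pulse-width level table from the command above, as a lookup that
# converts a level (0x00-0x0F) to milliseconds and builds the
# | AA | 03 | 13 | LEVEL | 0A | frame for it.

PULSE_WIDTH_MS = [10, 15, 20, 25, 30, 35, 40, 45,
                  50, 75, 100, 200, 300, 400, 500, 1000]

def set_pulse_width_frame(level):
    if not 0x00 <= level <= 0x0F:
        raise ValueError("pulse width level must be 0x00..0x0F")
    return bytes([0xAA, 0x03, 0x13, level, 0x0A])

frame = set_pulse_width_frame(0x0B)   # 200 ms pulse width
ms = PULSE_WIDTH_MS[0x0B]
```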
Use this command to reset the output IO; it can also generate a user-defined pulse when the output IO is in set/clear mode.
Format:
| AA| 03 | 14 | FF | 0A | (reset all output IO)
| AA| 03+n | 14 | IO0 | ... | IOn | 0A | (reset output IOs)
Return:
| AA | 03 | 14 | 00 | 0A |
IOn: output IO number
11. Train One Record or Records (20)
Return:
| AA | 04+SIGLEN | 22 | 00 | RECORD | SIG | 0A | (Set signature return)
| AA | 04 | 22 | 00 | RECORD | 0A | (Delete signature return)
SIG: signature string
SIGLEN: signature string length
Load records (1~7) into the recognizer of the Voice Recognition Module; after execution, the module starts recognizing immediately.
Format:
| AA| 2+n | 30 | R0 | ... | Rn | 0A |
Return:
| AA| 2+n | 30 | N | R0 | STA0 | ... | Rn | STAn | 0A |
N: number of records loaded successfully.
R0~Rn: record.
STA0~STAn: load result (0-Success FF-Record value out of range FE-Record untrained FD-Recognizer full FC-Record already in recognizer)
Format:
| AA | 02 | 31 | 0A |
Return:
| AA | 03 | 31 | 00 | 0A |
Set the group control mode (disable, system, user). If the group control function is enabled (system or user), the Voice Recognition Module is controlled by the external control IO.
Format:
| AA| 04 | 32 | 00 | MODE | 0A |
MODE: new group control mode. (00-disable 01-system 02-user FF-check)
Return:
| AA| 03 | 32 | 00 | 0A |
or
| AA| 05 | 32 | 00 | FF | MODE | 0A | (check command return)
The Prompt command is used only by the Voice Recognition Module to return data while the user trains a voice command.
Format:
NONE
Return:
| AA | 07 | 0A | RECORD | PROMPT | 0A |
RECORD: record which is in training
PROMPT: prompt string
24. The Voice Recognized command is used only by the Voice Recognition Module to return data when a voice is recognized.
Format:
NONE
Return:
| AA | 07 | 0D | 00 | GRPM | R | RI | SIGLEN | SIG | 0A |
GRPM: group mode indicate. (FF-not in group mode 00~0A-system group mode 80~87-
user group mode)
R: record which is recognized.
RI: Recognizer index value.
SIGLEN: signature length of the recognized record.
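A sketch of parsing this return frame on the host side, so the recognized record and its signature can be acted on; the sample frame below is constructed for illustration, not a real capture:

```python
# Parse the "voice recognized" return frame
# | AA | LEN | 0D | 00 | GRPM | R | RI | SIGLEN | SIG... | 0A |.
# The sample frame below is constructed for illustration only.

def parse_recognized(frame):
    assert frame[0] == 0xAA and frame[-1] == 0x0A
    assert frame[2] == 0x0D  # "voice recognized" return code
    grpm, record, index, siglen = frame[4:8]
    sig = frame[8:8 + siglen].decode("ascii")
    return {"group_mode": grpm, "record": record,
            "index": index, "signature": sig}

sig = b"On"
sample = bytes([0xAA, 0x07 + len(sig), 0x0D, 0x00,
                0xFF, 0x00, 0x00, len(sig)]) + sig + bytes([0x0A])
result = parse_recognized(sample)
```

In the assistive device, a parsed record number like this is what triggers the Arduino to command the GSM module to place the call.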
CHAPTER 5
GSM is an acronym that stands for Global System for Mobile Communications. In 1984, GSM was developed as a standard for a digital cellular mobile telephone system that could be used across Europe, and mobile services now use GSM as an international standard. Subscribers can easily roam worldwide and access any GSM network thanks to the high mobility it offers. GSM offers much higher capacity than the older analog systems: a larger number of subscribers can be served because the radio spectrum is allocated more optimally. Voice communications, Short Message Service (SMS), fax, voice mail, and supplemental services such as call forwarding and caller ID are some of the services offered by GSM. The most common frequency bands on which GSM works are 450 MHz, 850 MHz, 900 MHz, 1800 MHz, and 1900 MHz; some bands also have Extended GSM (EGSM) to increase the amount of spectrum available. GSM works with Time Division Multiple Access (TDMA) and Frequency Division Multiple Access (FDMA) [14].
FIGURE 29. Use of GSM in biomedical applications [14].
5.1 GSM Module – SIM900
are 850/900/1800/1900 MHz. The module can be used not only to access the Internet, but also for voice communication and SMS, provided it is connected to a microphone and a small loudspeaker. The dimensions of the GSM SIM900 module are 0.94 inches x 0.94 inches x 0.12 inches, with L-shaped contacts placed on all four sides so that the module can be soldered both at the bottom and on the sides. An ARM926EJ-S processor controls phone communication, data communication over an integrated TCP/IP stack, and communication through a UART and a TTL serial interface. The processor also internally manages the communication between the module and the circuit interfaced with the cell phone itself. A SIM card (3 V or 1.8 V) needs to be attached to the outer surface of the SIM900 module. The SIM900 device integrates an SPI bus, a PWM module, an A/D converter, an I²C interface, an RTC, and an analog interface. The radio section is GSM phase 2/2+ compatible and is either class 4 (2 W) at 850/900 MHz or class 1 (1 W) at 1800/1900 MHz. The TTL serial interface is in charge of communicating all the data relative to received SMS messages and the data that comes in during TCP/IP sessions in GPRS; GPRS class 10 (max. 85.6 kbps) determines the data rate. The TTL serial interface also monitors the circuit commands coming from the PIC that controls the remote control. The module can draw up to 0.8 A and must be supplied with a continuous voltage between 3.4 and 4.5 V during transmission [11].
FIGURE 30. SIM900 [11].
5.2 GSM call processing
Call processing consists of the different steps that set up, maintain, and end a call. The American National Standard for Telecommunications describes call processing in a telecom switching system from the acceptance of an incoming call through the final disposition of the call: the end-to-end sequence of operations performed by a network from the instant a call attempt is initiated until the instant the call release is completed.
Initialization:
The first part of mobile call processing is initialization: when a valid handset is switched on, the mobile gets a connection to a nearby cell site so that the cellular network can check the account. The system checks a frequency list contained in its SIM card, the removable memory chip in the handset. The bit streams carried on these frequencies are checked, searching for a Broadcast Control Channel, or BCCH, within one of them. Each BCCH transmits a unique data marker, so the mobile knows when it has found its channel. This is a big difference between AMPS and GSM: with AMPS, a dedicated radio frequency in each cell is used to set up calls, while with GSM any frequency can carry set-up information. It is the channel within the data stream that is important to find, not a specific radio frequency. A base station's Broadcast Control Channel continuously sends out identifying information about its cell site, such as the area code for the network and whether the cell uses frequency hopping. The BCCH is not a dedicated radio frequency; it is represented by a channel within the bit stream carried by any of the frequencies in a cell.
The mobile acts as a scanning radio, checking for any base station signals within range. It goes through each BCCH frequency, testing reception and measuring the received level for each channel. After this test, the GSM system decides which cell site should take the call; that is usually the cell site with the strongest signal.
Once homed in on the Broadcast Control Channel, the mobile monitors the ongoing data stream from the base station, searching the BCCH for a frequency control burst, or frequency control channel burst (FCCB). This burst of 142 bits, with 3 tail bits in front and behind, is a distinctive marker announcing that synchronization bits will follow. With those bits, a wireless connection is set up between the mobile and the cellular system; once that is done, the mobile and base station can communicate and everything can start working.
The digital signature the mobile searches for in the BCCH is one burst among many within a single GSM TDMA frame. Bits resemble single pulses of electrical energy, like the single dashes of a Morse code key. Morse code uses long and short pulses of energy to represent letters; the pulses used in digital radio do the same thing, but with uniform length. Voice and data are represented as groups of bits, and bits are also used for signaling.
GSM is a time-multiplexed system. Many calls share the same frequency, divided by time; if the calls are cars in a long lane, a new call must fit into the frequency band like every third car. The synchronization bits provide the mobile and base station with exact timing details for the coming conversation. Once the handset is assigned a place in this digital freight train, it can send and receive information. Call handling involves two parts: the radio part and the network part.
The radio subsystem, sometimes called the air interface, covers how a radio connection from the handset to the cell site is set up, maintained, and later torn down. The network subsystem, or switching element, decides who gets on the system and how the call is set up and terminated; it also determines which services and resources the system can use. These two parts each work to help the other out, and a call would not go through if they did not work together in synchronization [15].
CHAPTER 6
RESULTS
Voice samples for speaker independent recognition are considered in the table below.
TABLE 3. Voice samples for speaker independent voice recognition from experimental data
Number of Speaker | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Sample 7 | Sample 8 | Sample 9
FIGURE 35. Experimental output waveform of Number ‘0’.
From the above waveform, I was able to find the relation of speech with time and frequency. Sampling of the voice was required to recognize the voices of different speakers. Voice samples of the numbers 0 through 9 were recorded for each speaker so that their voices could be recognized and the Arduino UNO could issue the corresponding command.
The Arduino UNO analyzed the data for the GSM module to dial the number. The GSM module checks for a cellular network provider before initiating the call. Once the network is found by the GSM module, the corresponding status data appears on the serial monitor. Once the network is connected, the provider allows the call to be initiated.
FIGURE 38. Serial monitor showing experimental status of Call Ready.
A period of time is needed for the GSM module to locate and log on to the cellular network. The Call Ready status shows that the search for a communication channel is over and a call can be made to the other device. The GSM module SIM900 mostly uses the 900 MHz band for communication, though it searches other frequency channels too. Each command sent to the module is a string of text; these AT commands are considered the modem language.
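The modem dialogue itself is plain AT command text. A minimal helper that formats the standard dial and hang-up commands might look like this (the phone number is a placeholder):

```python
# Format the standard GSM AT commands used to place and end a voice
# call on a SIM900-class modem. Commands are terminated with CR;
# the phone number below is a placeholder.

def dial_command(number):
    """ATD<number>; - the trailing semicolon requests a voice call."""
    return f"ATD{number};\r"

def hangup_command():
    """ATH ends the current call."""
    return "ATH\r"

cmd = dial_command("+15551234567")
```

On the device, these strings would be written to the serial line connected to the SIM900, which replies with status lines such as OK.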
The full connection of the assistive device using voice recognition for GSM calling is shown below.
FIGURE 39. Assistive device using voice recognition for GSM voice call.
CHAPTER 7
7.1 Applications
Healthcare: ASR for doctors, in order to create patient records automatically.
Voice recognition with a GSM module for emergency calls helps doctors reach the patient as soon as possible. The number of fatalities among physically handicapped people increases every year, and the assistive device can prove vital in keeping track of this. In emergency cases, patients require immediate attention from the doctor, so a single phone call from the assistive device can save time and notify the doctor.
Help for the disabled (especially to access the web and control the computer):
In today's world, web access and control of a computer are important in any occupation. Disabled people have difficulty accessing their laptops and computers, so operating electronic devices by voice would let them work efficiently without problems. With more research, assistive devices that improve the working environment for the disabled can be an innovative contribution to the field of biomedical engineering.
Military:
Voice recognition keeps communication simple; it is also used in fighter planes, where the pilot's hands are too busy to type. Voice recognition and GSM are vital for transferring data over the channel. Other applications include:
- Voice prompter
- Information services
- Agent technology
- Customer care
7.2 Future scope
In the United States of America and Japan, Arduino-based robots are quite famous for their facial expression and mirroring properties. Creating an emotional bond with the machine is one of the goals of the human-robot interface. Body language and facial expressions that a voice recognition system can read can also be used for threat detection.
If you smile at a robot while you are having a conversation and it smiles back at you, an emotional bond with the human is created during the conversation. The system might start adjusting to your behavior: it may mirror the user's responses, reciprocate an angry response, or work to defuse the situation. It all depends on the machine's programming, so that all functions can be performed accurately. Because of these advances, potential applications and trends keep moving forward.
CHAPTER 8
CONCLUSION
With a computer, multimedia hardware, and relevant technical papers in the public domain, it is possible to build a voice recognition system for physically handicapped people. The accuracy is usually in the mid-80% range, as long as the environment is quiet. The key properties of the proposed platform are scalability and universality. The Mel-frequency Cepstral Coefficient method was applied for voice recognition, and the results were effective. Training the voice module on voice samples increased the accuracy of voice recognition. Even though voice is volatile and the waveform changes every time the same word is spoken, the voice module V3 could meticulously match the voice with the samples in its memory. The GSM module SIM900 proved to be one of the most effective and economical components for connecting to the GSM network. Integrating the voice module V3 and the GSM module SIM900 with the Arduino UNO produces an assistive device that can be used for emergency calls and for getting immediate help. The platform is composed of easy-to-get and relatively cheap hardware elements.
REFERENCES
[10] Bo Cui and Tongze Xue, "Design and realization of an intelligent access control
system based on voice recognition," in Computing, Communication, Control, and
Management, 2009. CCCM 2009. ISECS International Colloquium on, 2009, pp.
229-232.
[13] Soon Suck Jarng, "HMM voice recognition algorithm coding," in Information
Science and Applications (ICISA), 2011 International Conference on, 2011, pp. 1-7.