Dhruvin M. Lingaria
August 2015
Developing a smart home environment for assistive living requires great effort. The key element of the smart environment is a ubiquitous voice user interface with several additional capabilities, such as gesture recognition, which can be a new feature of voice-controlled devices. Many identification technologies are used in current intelligent guard systems; relative to other techniques, voice recognition offers a natural, hands-free means of control. The assistive device project has incorporated voice recognition technology to perform GSM calling. An Arduino UNO creates the interface between the voice module and the GSM module SIM900. The platform was developed using inexpensive hardware and software elements available on the market. The assistive device showed high robustness for people with disabilities. Sample voice commands were stored in the temporary memory of the ATMEGA 328P when field tests with several sets of voice commands were done. The GSM module SIM900 could easily connect to the local cellular network carriers; hence, voice-recognized GSM calling was successfully performed.
A PROJECT REPORT
Presented to the Department of Electrical Engineering
California State University, Long Beach
In Partial Fulfillment
of the Requirements for the Degree
Master of Science in Electrical Engineering
Committee Members:
Christopher Druzgalski, Ph.D. (Chair)
Anastasios Chassiakos, Ph.D.
James Ary, Ph.D.
College Designee:
Antonella Sciortino, Ph.D.
By
Dhruvin M Lingaria
B.E., 2012, Rizvi College of Engineering, Mumbai, India
August 2015
ProQuest Number: 1600584
Published by ProQuest LLC (2015). Copyright of the Dissertation is held by the Author.
Copyright 2015
Dhruvin M Lingaria
ALL RIGHTS RESERVED
ACKNOWLEDGEMENTS
I am thankful to God for the wellbeing and good health that were necessary to complete this project. I express my sincere gratitude to my advisor, Christopher Druzgalski, Ph.D., Professor of Biomedical Engineering, for arranging all the necessary facilities for the research. I place on record my sincere thanks to Dr. Anastasios Chassiakos for valuable guidance, sharing expertise, and encouragement extended to me. I take this opportunity to express gratitude to all of the department faculty members for their help and support. I also show my gratitude to one and all who directly or indirectly lent their hand in this venture.
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS...…………………………………………………….............iii
LIST OF TABLES…………………………………………………………………..........vi
LIST OF FIGURES………………………………………………………………….......vii
LIST OF ABBREVIATIONS……………………………………………….………........ix
CHAPTER
1. INTRODUCTION…………………………………………….……………….……..1
2. SYSTEM DESIGN…………..……………...…………………………………..........2
2.1. Procedure................................................................................…………...........2
2.1.1. Arduino Uno……………………….…………………………...…….......3
2.1.2. GSM Shield………………………………...…………….…….…………6
2.1.3. GSM Antenna.............................................................................................8
2.1.4. Voice Module...................................................................................….......9
2.2. Integrating GSM Shield with the Arduino Uno.…………………………......10
2.3. Integrating Voice Module with Arduino Uno.………………………….........11
2.4. Flow Chart: Voice Recognized Telephone Calling……………………….....12
4. VOICE RECOGNITION……………………………………………………….......16
6. RESULTS………………….………………………………………….………......62
7.1. Applications….………………………………………….…………………...68
7.2. Future Scope…………….………………………………….……………......70
8. CONCLUSION…………………….……………………………….…………….71
REFERENCES……………………….…………………………………………….72
LIST OF TABLES
TABLE Page
Data………………………………………………………………………………...63
LIST OF FIGURES
FIGURE Page
5. GSM antenna……………………………………..………………………..……..……8
30. SIM900………………………………………………………………………..…….58
39. Assistive device using voice recognition for GSM voice call………………………67
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
Voice technology is of enormous benefit for people with physical disabilities. People with different kinds of disabilities may benefit from various kinds of speech and voice technologies. The assistive device developed in this project allows physically disabled people to use their voice to initiate a GSM call. It is a very robust product that can be used in any environment to suit the user. Voice module V3 records the voice of different users to recognize the voice. Every user speaks the numbers 0 to 9 to train the voice module. These voice samples are stored in the voice module library. Each user has his or her own voice samples from which the voice module recognizes the voice. There are variations in the voices of each user; moreover, each user produces a slightly different voice sample every time they speak the same number. So, training the voice module on different samples of voice is very crucial. Once the voice is recognized, the Arduino UNO interfaces with the GSM module to search for the cellular network. As soon as the uplink connection to the cellular tower is created, the Arduino UNO commands the GSM module to make a call; this call can be to an emergency number, allowing disabled people to use emergency services. This project can be helpful to many disabled people who want to talk to doctors far away from them or need medical assistance.
CHAPTER 2
SYSTEM DESIGN
2.1. PROCEDURE
Selection of the Components:
Selecting the components required to perform the project is one of the most important steps in developing any product. Depending on the resources required, the product can be expensive or economical, and this factor decides whether the project can go into production. The components used are:
1. Arduino UNO
2. GSM shield (SIM900)
3. Voice module V3
4. Bread board
5. Connecting wires
6. Resistors
7. SIM card
8. Microphone
The components for the assistive device are easily available, cost effective, and efficient to handle. They do not require regular maintenance, so they can be used regularly under robust conditions. As per the project requirement, the components use 5 V and 2 A to perform the task. The voice module and GSM module are the added features of the assistive device that give the required output. Assembling all the components together, the resultant output has multiple applications in the assistive living environment.
Description:
The Arduino UNO is a microcontroller board based on the ATmega328. It has 14 digital input/output pins (of which 6 can be used as PWM outputs), a 16 MHz ceramic resonator, 6 analog inputs, a USB connection, an ICSP header, a power jack, and a reset button. Simply connect it to a computer with a USB cable or power it with an AC-to-DC adapter or battery to get started. The UNO differs from all preceding boards in that it does not use the FTDI USB-to-serial driver chip. Instead, it features the Atmega16U2 (Atmega8U2 up to version R2) programmed as a USB-to-serial converter. Arduino UNO board version 2 has a resistor connecting the 8U2 HWB line to ground, making it easier to put into DFU mode. Arduino UNO board version 3 has the following new features: the 1.0 pinout, which adds SDA and SCL pins near the AREF pin and two other new pins near the RESET pin; the IOREF pin, which helps shields adapt to the voltage provided by the board; a second new pin that is not connected and is reserved for future purposes; a stronger RESET circuit; and the ATmega16U2 replacing the 8U2. "UNO" means one in Italian.
TABLE 1. Technical description of Arduino UNO [2]
Microcontroller: ATmega328
Operating Voltage: 5 V
Input Voltage (recommended): 7-12 V
Input Voltage (limits): 6-20 V
Digital I/O Pins: 14 (of which 6 provide PWM output)
Analog Input Pins: 6
DC Current per I/O Pin: 40 mA
DC Current for 3.3V Pin: 50 mA
Flash Memory: 32 KB, of which 0.5 KB used by bootloader
SRAM: 2 KB (ATmega328)
EEPROM: 1 KB (ATmega328)
Clock Speed: 16 MHz
Length: 68.6 mm
Width: 53.4 mm
Weight: 25 g
2.1.2 GSM Shield
Description:
The GSM shield allows an Arduino board to make/receive voice calls, to connect to
internet, and to send/receive SMS messages. The GSM shield uses a radio modem
SIM900. It is possible to communicate with the board using AT commands. The GSM
library has numerous methods for communication with the shield. The shield uses digital
pins 2 and 3 for software serial communication with the SIM900. Pin 2 is connected to
the SIM900's TX pin and pin 3 to its RX pin. The modem's power key pin is connected to an Arduino digital pin so that the sketch can power the modem on in software. The maximum GPRS data uplink and downlink transfer speed is 85.6 kbps. Interfacing with the cellular network requires a SIM card provided by a network operator. The most recent version of the board uses the 1.0 pinout found on rev 3 of the Arduino UNO.
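The shield is ultimately driven by standard GSM AT commands over the software serial link. As an illustration only, the helper below builds the two command strings involved in voice calling (`ATD<number>;` dials a voice call, `ATH` hangs up); actually transmitting them over pins 2/3 is hardware-specific and is deliberately left out.

```python
# Sketch of the GSM AT commands used for voice calling with a SIM900.
# Only the command strings are built here; sending them over the
# software serial link (pins 2 and 3) depends on the wiring above.

def dial_command(number: str) -> str:
    """ATD<number>; initiates a voice call (the trailing ';' marks voice)."""
    digits = "".join(ch for ch in number if ch.isdigit() or ch == "+")
    return f"ATD{digits};\r"

def hangup_command() -> str:
    """ATH terminates the active call."""
    return "ATH\r"

print(dial_command("911"))   # ATD911;
```

In the assistive device, the recognized digits collected from the voice module would be passed to a helper like `dial_command` before being written to the modem.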
2.1.3 GSM Antenna
Description:
GSM systems have specific antenna design requirements because GSM technology
provides the capability for global communications between wireless carriers. The antennas that make this possible are technologically advanced. The phones and the towers themselves use antennas to communicate with each other. The constant development of technology means that antenna design companies have to work hard not only to keep up with the demand for innovations but to produce them as well. GSM has already generated newer and improved generations in the form of 3G and 4G technologies such as UMTS, EDGE, HSDPA, and LTE, while competing with CDMA-based standards.
2.1.4 Voice Module
Description:
The module can identify voices and receives configuration commands, to which it responds through a serial port interface. With the help of this module one can control a car or other electrical devices by voice. The module can store 80 voice instructions, divided into groups of up to 7 instructions each. The voice instructions are first recorded group by group. Once that is done, one group is imported by serial command before the module can recognize the 7 voice instructions within that group; to use the instructions of another group, that group must be imported first. The module is speaker dependent: if a different speaker speaks the voice instruction in place of the trained user, the module will not identify the instruction.
Technical Parameters:
Voltage: 4.5-5.5 V
Current: <40 mA
Digital interface: 5 V TTL-level UART
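The group-based instruction handling described above can be sketched as a small model; the group numbers and instruction labels below are hypothetical, since the real records are trained by voice rather than by text.

```python
# Sketch: 80 trainable voice instructions organized in groups, with
# only one imported group (up to 7 instructions) active at a time,
# mirroring the module's import-by-serial-command behavior.

class VoiceModuleModel:
    GROUP_SIZE = 7

    def __init__(self):
        self.groups = {}     # group number -> list of instruction labels
        self.active = None   # currently imported group, if any

    def record(self, group, labels):
        if len(labels) > self.GROUP_SIZE:
            raise ValueError("a group holds at most 7 instructions")
        self.groups[group] = list(labels)

    def import_group(self, group):
        # corresponds to the serial command that loads one group
        self.active = group

    def recognize(self, spoken):
        # only instructions in the imported group can be recognized
        if self.active is None:
            return None
        labels = self.groups.get(self.active, [])
        return spoken if spoken in labels else None

m = VoiceModuleModel()
m.record(1, ["dial", "send", "cancel"])
m.import_group(1)
print(m.recognize("dial"))    # dial
print(m.recognize("hello"))   # None
```

The key design consequence is visible in `recognize`: an instruction recorded in a non-imported group is simply never matched, which is why the firmware must switch groups before expecting their commands.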
The GSM module has to be integrated with the Arduino UNO so that calling from the patient's device can be initiated. The GSM module has a SIM card inserted in it. From a computer or laptop, a phone call can then be made to any other phone on the GSM network; the relevant pins in the GSM shield are activated when a call or SMS is sent through it. Once the GSM module is integrated with the Arduino UNO, the voice module has to be programmed with the Arduino UNO. The voice is recognized and amplified for the system to use it for the phone call. The method of recognition of the voice is as follows:
1. Say "Dial"
2. The number spoken is repeated back
3. Say "Send"
4. Dialing begins
5. The number is dialed
6. Call in process
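The dial/confirm/send sequence above can be sketched as a small state machine; the state names and word handling here are illustrative, not taken from the actual firmware.

```python
# Sketch of the voice-dialing flow as a state machine:
# IDLE -> "dial" -> COLLECTING digits -> "send" -> DIALING.

class DialFlow:
    def __init__(self):
        self.state = "IDLE"
        self.digits = []

    def hear(self, word):
        if self.state == "IDLE" and word == "dial":
            self.state = "COLLECTING"
        elif self.state == "COLLECTING" and word.isdigit():
            self.digits.append(word)   # module repeats the digit back
        elif self.state == "COLLECTING" and word == "send":
            self.state = "DIALING"
        return self.state

    def number(self):
        return "".join(self.digits)

flow = DialFlow()
for w in ["dial", "9", "1", "1", "send"]:
    flow.hear(w)
print(flow.state, flow.number())   # DIALING 911
```

Unrecognized words simply leave the state unchanged, which matches the behavior of a speaker-dependent module that ignores untrained utterances.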
2.4 Flow Chart: Voice recognized telephone calling
CHAPTER 3
SPEECH PROCESSING
Speech processing is the study of speech signals and of the methods used to process and/or analyze them. Applications of speech processing include:
Speech Coding
Speech Recognition
Speech Enhancement
The speech production process is initiated when the speaker formulates a message in
his/her mind to transmit to the listener via speech. The next step in the process is the
conversion of the message into a language code. This corresponds to converting the message into a set of phoneme sequences for the sounds to be produced, together with prosody markers denoting the pitch, duration, and loudness of the sounds [6].
FIGURE 9. Schematic diagram of the speech production/speech perception process [6].
The rate of discrete symbol information in the crude message text is rather low: about 50 bits per second, corresponding to about 8 sounds per second where each sound is drawn from a set of roughly 64 phonemes (about 6 bits per sound). Once the language code is converted, with the inclusion of prosody information, the information rate rises to about 200 bps. In the next stage the representation of the information in the signals becomes continuous, with an equivalent rate of about 2,000 bps at the neuromuscular control level and about 30,000-50,000 bps at the acoustic signal level. On the perception side, the continuous information rate at the basilar membrane is in the range of 30,000-50,000 bps, while at the neural transduction stage it is about 2,000 bps. The higher-level processing within the brain converts the neural signals to a discrete representation, which ultimately corresponds to the message comprehended by the listener.
3.2 Basic Assumption of Speech Processing
Speech processing systems rest on the basic assumption that the source of excitation and the vocal tract system are independent. Hence, it is appropriate to model the source of excitation and the vocal tract system separately. In continuous speech the vocal tract changes shape slowly and gradually, so it is reasonable to assume that the vocal tract has fixed characteristics over a short interval (on the order of 10-30 ms).
CHAPTER 4
VOICE RECOGNITION
Voice recognition is "the technology by which sounds, words or phrases spoken by humans are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned" [7]. The notion could more generally be called "sound recognition"; the emphasis is on the human voice because we most often and most typically use our voices to communicate our ideas to others in our surroundings. In the situation of a virtual environment, the user would probably gain the feeling of immersion and of being part of the simulation if they could use their most common form of communication, the voice. Due to the fundamental difference between human speech and the more traditional forms of computer input, it is difficult to use voice as an input to computer simulation. While computer programs are commonly designed to produce an explicit and well-defined response upon receiving the proper input, the human voice and spoken words are anything but precise. Each human voice is different, and identical words can have different meanings if spoken with different inflections or in different contexts. Several approaches have been tried, with varying degrees of success; an important dimension distinguishing them is the size of the recognizable vocabulary.
4.1.1. Features
The voice recognition technology adopted by the intelligent access system is based on the SPCE061A single chip. The system hardware is built around this chip and supports the voice recognition and the assistive control circuit. The system software consists of the voice recognition module, the voice training module, the voice-playing module, the speech data processing module, and the cipher input/output module. The voice module handles collecting and extracting the voice data, speech recognition, and voice playing, as well as initializing the system and the identification training. Following voice recognition theory, feature extraction, pretreatment, and pattern matching are performed.
Depending on purpose and function, voice recognition is classified into two types: text-dependent and text-independent. The text-dependent type requires users to pronounce stated contents, from which each speaker's speech model is built up accurately, and the recognition effect is very good. For text-independent recognition, pronunciation matters more than the textual content, so it is difficult to build accurate speech models; however, because customers can use the system conveniently, it can be applied widely. By usage, the technology is further classified into speaker identification and speaker verification. The former picks out, from several samples, the speaker to whom a voice belongs; the latter judges whether or not an identified voice comes from a certain speaker, so its output has only two kinds of result, yes or no. The central processor of this system is the SPCE061A single chip. Text-dependent speaker verification is accomplished on the chip, and then the corresponding commands and operations are carried out. The system is mainly made up of a speaker identification module and a gating circuit.
In training, the voice of the speaker enters the voice signal collection circuit through a microphone; the collected voice signals are then processed by the voice processing circuit, and the characteristic parameters of the speaker are extracted and saved. At identification time, the voice that needs to be identified is matched against the information in the machine's database [11].
FIGURE 10. Block diagram of the system: the microphone and switch circuit feed the SPCE061A single chip, which connects to SPR4096 flash storage, the control circuit, a speaker, and a keyboard for the number dialed.
The hardware part of this system includes the SPCE061A single chip, the audio output circuit, the voice recognition circuit, the FLASH circuit, the audio input circuit, and the keyboard circuit. The main mission of the hardware is to change voice signals into digital signals, collect samples, upload, identify, and play the voice data [10]. The SPCE061A chip has a system frequency of 0.375-49.152 MHz and a wide supply-voltage range. It integrates an ADC, dual-channel 10-bit DAC audio output, and a Watch Dog Timer (WDT) in the single chip. The interruption controller can handle three kinds of FIQ interruption, eleven kinds of IRQ interruption, and one soft interruption controlled by the instruction BREAK. The single chip provides voice processing functions and abundant C function libraries, making it very suitable for implementing voice recognition products.
FIGURE 11. SPCE061A Pin Diagram. www.go-gddq.com.
The principle of the voice recognition circuit is that voice signals are analyzed by the intelligent system after the voice is captured. Firstly, noise is filtered out and the useful components of the voice signals are extracted through a filter group; then the signals are processed and characteristic parameters such as energy and zero-crossing rate are computed. After analysis and processing, the voice signals are pattern-matched against the voice data in the voice database, and lastly the voice recognition result is output according to the match result. The basic structure of the voice recognition circuit thus comprises the filter group, the feature analysis stage, the voice database, and the output of the recognition result.
The chip SPCE061A adopts a low supply voltage in order to reduce power consumption. SPCE061A has two power supplies: the internal power supply VDD and the I/O power supply VDDH. The I/O power supply is 5 V, and the internal power supply is 3.3 V. The main motive for reducing the internal supply is to lower the power consumption and working temperature of the single chip. Though the voltage range of SPCE061A is very wide, to make the chip run more stably and satisfy the voltage demand of the I/O ports and outside parts, the power supply circuit shown in figure 13 is used. AC 220 V is converted into DC 5 V, which supplies the modules and every I/O port inside the system; DC 5 V is then converted into DC 3.3 V by a TR1972-33 regulator [10].
FIGURE 13. Power supply circuit [10].
Firstly, voice signals are pretreated and amplified properly; secondly, the analog signals are converted into digital signals so that digital equipment can process them conveniently; then characteristics are extracted so that a few characteristic parameters can represent the voice signal. Lastly, different treatments are adopted according to the task. Voice recognition can be divided into two stages: the training stage and the identification stage. In the training stage the voice signals expressed by characteristic parameters are processed, and standard data that capture the common characteristics of the basic identification units are obtained. Reference templates are formed from these data, and the reference template database is formed after all reference templates of the identified basic units are combined together. In the identification stage the voice signals to be identified, after the same feature extraction, are matched against the reference template database.
FIGURE 14. The structure of voice signal process [10].
A. Pretreatment:
Noise seriously disturbs the processing and identification of voice signals, so the noise must be removed first. The input analog voice signals from the microphone must be sampled and quantized in order to obtain digital voice signals. Before converting voice signals into digital signals, it is necessary to filter them and counter disturbances: in filtering, the signal components and noise above half the sampling frequency are removed. Cleaner voice signals are obtained, and low-frequency disturbance is then filtered through pre-emphasis, which improves the voice signals by cutting out DC drift and enhancing the components useful for subsequent processing.
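The DC-removal and high-frequency emphasis step described above is commonly realized as a first-order pre-emphasis filter y[n] = x[n] - a*x[n-1]; the coefficient 0.97 below is a typical textbook value, assumed here for illustration.

```python
# Pre-emphasis filter sketch: attenuates DC drift and low-frequency
# disturbance while boosting the high-frequency speech components.

def pre_emphasis(signal, alpha=0.97):
    out = [signal[0]]                       # first sample passes unchanged
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

dc = [1.0] * 5                              # a pure DC (constant) signal
print(pre_emphasis(dc))                     # tail samples shrink to ~0.03
```

Feeding a constant signal through the filter shows the DC-suppression property directly: every sample after the first is reduced to 1 - 0.97 = 0.03 of its original value.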
B. Characteristic extraction:
The system adopts an evaluation method that uses the contrast between characteristic parameters. The basic idea is to extract group characteristic parameters from a voice segment of the same speaker, that is, to map the segment onto a dot of a multi-dimensional space. Different voices from the same speaker will produce different dots in the characteristic space, whose distribution can be described by a multi-variable probability density function. For single pronunciations from the same speaker, these dots are relatively concentrated, while the distributions from different speakers lie farther apart; thus the group characteristic parameters can describe the voiceprint of speakers effectively. According to this principle, for a single parameter, the F ratio between the two kinds of dispersion can be used as an effective measurement rule: the F ratio contrasts the dispersion between different speakers with the self-dispersion of each speaker. The bigger the F ratio of a characteristic parameter, the more the between-speaker dispersion exceeds the within-speaker dispersion on average. Therefore the recognition system adopts parameters with bigger F ratios, and the system capability is improved [10].
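The F ratio above compares between-speaker dispersion to within-speaker dispersion for a single parameter. A minimal sketch, with made-up sample values standing in for the measured characteristic:

```python
# F ratio sketch for one characteristic parameter:
# (variance of the per-speaker means) / (average within-speaker variance).
# A larger F means the parameter separates speakers better.

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def f_ratio(samples_per_speaker):
    speaker_means = [mean(s) for s in samples_per_speaker]
    between = variance(speaker_means)
    within = mean([variance(s) for s in samples_per_speaker])
    return between / within

# Hypothetical parameter values for three speakers:
speakers = [[1.0, 1.1, 0.9], [3.0, 3.2, 2.8], [5.1, 4.9, 5.0]]
print(f_ratio(speakers) > 1.0)   # True: speakers are well separated
```

With these toy values each speaker's samples cluster tightly around distinct means, so the F ratio is large; a parameter whose per-speaker clouds overlap would score near or below 1 and would be discarded.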
C. Module match:
Matching of characteristic parameters is performed by one of several algorithms. Typical methods are the vector quantization algorithm, the Gaussian mixture model algorithm, the dynamic time warping (DTW) algorithm, and the artificial neural network algorithm. Each of these methods has both advantages and weaknesses. When the DTW algorithm is applied to the identification of long utterances, the matching computation becomes too large; but the algorithm is simple and effective for short utterances (valid voice shorter than about 3 seconds). So the method is especially applicable to a text-dependent speaker recognition system. The system recognizes a speaker's identity using characteristics extracted from their voice; a speaker recognition system exploits the speaker-specific information included in the speech signals. Speaker recognition technology makes it possible to use the speaker's voice to verify their identity and control access to various services such as database access services, voice mail, banking by telephone, information services, and security control. Speaker recognition methods can be divided into text-dependent (speech-dependent) and text-independent (speech-independent) techniques: the former discriminate the users based on the same spoken letters, words, or numbers, while the latter do not rely on a definite utterance. As in any pattern recognition system, the speaker recognition system consists of a feature extraction part and a classification part. Speaker recognition can further be divided into two categories, supervised and unsupervised recognition, depending on its character. In this project we consider identification based on the word the speaker pronounced; speaker verification then represents the procedure of accepting or rejecting the claimed identity.
The speech signal is divided into overlapping frames of 256 samples, with an overlap of 128 samples; each frame is multiplied by a Hamming window of length 256 samples. The FFT (Fast Fourier Transform) computes the spectrum of each windowed frame, and the cepstrum of each windowed frame s[n] is then derived. Then we translate the regular frequencies to a scale that is more appropriate for speech: the Mel scale approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. The cepstrum and the Mel-frequency cepstrum differ in that, in the MFC, the frequency bands are equally spaced on the Mel scale. The Mel-frequency cepstral coefficients (MFCCs) are computed as follows:
1. Take the Fourier transform of a windowed excerpt of the signal.
2. Map the powers of the spectrum onto the Mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the Mel frequencies.
4. Take the DCT (Discrete Cosine Transform) of the list of Mel log powers, as if it were a signal.
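The four steps above can be sketched end to end in pure Python. The sample rate (8 kHz), the filter count (10), and the naive DFT in place of an FFT are all simplifying assumptions for illustration; a real implementation would use an FFT library and more filters.

```python
import math

# Sketch of the MFCC pipeline on 256-sample frames with 128 overlap.

def frames(signal, size=256, hop=128):
    return [signal[i:i + size]
            for i in range(0, len(signal) - size + 1, hop)]

def hamming(frame):
    N = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
            for n, x in enumerate(frame)]

def power_spectrum(frame):
    # Naive DFT (an FFT would be used in practice); bins 0..N/2.
    N = len(frame)
    spec = []
    for k in range(N // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * n / N) for n, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * n / N) for n, x in enumerate(frame))
        spec.append(re * re + im * im)
    return spec

def mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_filterbank(spec, rate=8000, nfilt=10):
    # Triangular filters equally spaced on the Mel scale (step 2).
    nfft = (len(spec) - 1) * 2
    points = [mel(0) + i * (mel(rate / 2) - mel(0)) / (nfilt + 1)
              for i in range(nfilt + 2)]
    hz = [700.0 * (10 ** (m / 2595.0) - 1.0) for m in points]
    bins = [int(nfft * f / rate) for f in hz]
    energies = []
    for i in range(1, nfilt + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        e = 0.0
        for k in range(lo, hi):
            if k < mid and mid != lo:
                w = (k - lo) / (mid - lo)
            elif k >= mid and hi != mid:
                w = (hi - k) / (hi - mid)
            else:
                w = 0.0
            e += w * spec[k]
        energies.append(e + 1e-10)   # avoid log(0)
    return energies

def dct(xs):
    # DCT-II of the Mel log powers (step 4).
    N = len(xs)
    return [sum(x * math.cos(math.pi * k * (n + 0.5) / N)
                for n, x in enumerate(xs)) for k in range(N)]

def mfcc(frame):
    spec = power_spectrum(hamming(frame))               # step 1
    logs = [math.log(e) for e in mel_filterbank(spec)]  # steps 2-3
    return dct(logs)                                    # step 4

signal = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(1024)]
print(len(frames(signal)))           # 7 frames of 256 samples
print(len(mfcc(frames(signal)[0])))  # 10 coefficients per frame
```

Note how the 128-sample hop makes consecutive frames overlap by half, exactly as described above: a 1024-sample signal yields 7 frames rather than 4.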
A sequence of MFCCs is thus obtained for each frame, and each MFCC set serves as a melodic cepstral acoustic vector. Melodic cepstral acoustic vectors can perform as feature vectors, but we need more powerful speech features. Hence, the MFCC acoustic vectors undergo a derivation process: the first-order derivatives of the Mel cepstral coefficients are computed as the delta Mel cepstral coefficients (DMFCCs). Then the delta-delta Mel-frequency cepstral coefficients (DDMFCCs) are computed as the derivatives of the DMFCCs, being the second-order derivatives of the MFCCs; with them we model the intra-speaker variability. The computed DDMFC coefficients show how fast the voice of a speaker is changing in time. A DDMFC acoustic vector is thus obtained for each frame of the initial speech signal S. Each acoustic vector is composed of 256 samples, but the speech information is mainly encoded in the first 12 coefficients. So, we truncate each vector at its first 12 samples; the truncated vector is a good voice discriminator and can therefore be used successfully as a vocal feature vector for speaker recognition. Each acoustic matrix has 12 rows and a number of columns depending on the length of the vocal signal S. Therefore, because of their different sizes, these speech feature vectors cannot be compared using linear metrics, such as the widely known Euclidean distance. A solution would be to transform the acoustic matrices, through re-sampling or padding with zero values, so that they get the same dimensions and the Euclidean metric could be used; the disadvantage of this approach is the possible loss of valuable speech information from the feature vectors. There are many other possible speech feature vectors that can be obtained with this delta-delta Mel-cepstral analysis; for example, a vocal feature vector for signal S could be made from statistical values computed for each DDMFCC (or MFCC) acoustic vector of the signal.
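The first- and second-order derivative steps can be sketched as frame-to-frame differences of the coefficient sequence; the symmetric-difference form below, with edge frames clamped, is one common formulation assumed here for illustration.

```python
# Delta (DMFCC) and delta-delta (DDMFCC) sketch: derivatives of the
# MFCC sequence across frames, approximated by symmetric differences.

def deltas(coeff_frames):
    """coeff_frames: list of per-frame coefficient vectors (equal length)."""
    out = []
    for t in range(len(coeff_frames)):
        prev = coeff_frames[max(t - 1, 0)]
        nxt = coeff_frames[min(t + 1, len(coeff_frames) - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

mfccs = [[0.0], [1.0], [4.0], [9.0]]   # toy 1-coefficient frames
dmfcc = deltas(mfccs)                  # first-order derivative
ddmfcc = deltas(dmfcc)                 # second-order derivative
print([v[0] for v in dmfcc])           # [0.5, 2.0, 4.0, 2.5]
```

Applying `deltas` once yields the DMFCCs and applying it again yields the DDMFCCs, mirroring how the second-order coefficients capture the acceleration of spectral change described above.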
For the words spoken by the users, template matching is the most effective technique for the text-dependent recognition process; dynamic time warping (DTW) algorithms or hidden Markov model (HMM) methods are used extensively for voice recognition.
The DDMFCC-based feature extraction is performed and the feature vectors are obtained. Thus V(S), the feature vector of speech signal S, can be computed as the truncated 12-row delta-delta Mel cepstral acoustic matrix. Another featuring solution tested is computing V(S) as the mean of the DDMFCC matrix; we then obtain V(S) as a one-dimensional vector containing the mean values of the columns of the acoustic matrix. The feature extraction process is applied to the input speech signals, the feature set {V(S1), ..., V(Sn)} being obtained. From the collection of spoken words, a training set related to the same speech is obtained, provided by the registered speakers and filtered for noise removal. A vocal prototype is assigned to each speech signal during training, and the feature training sets are obtained by computing feature vectors from these prototypes. Consider N registered speakers; then Sp = {Sp1, ..., SpN}, where each Spi = {Si1, ..., Si n(i)} represents the set of signal prototypes corresponding to the ith speaker. For each Sij, where i = 1, ..., N and j = 1, ..., n(i), the vocal feature extraction is performed, and the obtained sequence {V(S11), ..., V(S1 n(1)), ..., V(SN1), ..., V(SN n(N))} represents the feature training set of the classifiers.
The next step is the minimum-distance classification procedure. We consider N classes, one per registered speaker. The algorithm places each input vocal sequence Si in the class of the speaker with the smallest mean distance between the feature vector of the input signal and that speaker's prototype vectors; the closest speaker is thus selected, where d represents the metric. The classification result, consisting of N classes of speech utterances, also represents the speaker identification result: the correct speaker is thus identified for each input. The next stage of the recognition process, speaker verification, has to decide whether the identified speaker is the one who really produced the utterance. We propose a threshold-based verification procedure, performed within each resulting speaker class: each mean distance computed in any class must not exceed a specially chosen threshold value. Any threshold-based recognition approach implies the task of choosing a proper threshold; here we choose the overall maximal distance between any two prototype vectors belonging to the same training feature subset. Thus, a satisfactory threshold is obtained from the following equation:

T = max over i of ( max over j, k of d(V(Si j), V(Si k)) )
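The minimum-mean-distance classification and the max-prototype-distance threshold can be sketched as below; the Euclidean metric and the toy two-dimensional vectors are assumptions made purely for illustration.

```python
import math

# Sketch: assign an input feature vector to the speaker whose prototypes
# have the smallest mean distance; reject as unregistered when that
# distance exceeds T, the overall maximum distance between any two
# prototypes of the same speaker's training subset.

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify(x, prototypes, threshold):
    """prototypes: dict speaker -> list of prototype vectors."""
    means = {spk: sum(dist(x, p) for p in ps) / len(ps)
             for spk, ps in prototypes.items()}
    best = min(means, key=means.get)
    return best if means[best] <= threshold else "unregistered"

def auto_threshold(prototypes):
    return max(dist(p, q)
               for ps in prototypes.values()
               for p in ps for q in ps)

protos = {"spk1": [[0.0, 0.0], [0.2, 0.0]],
          "spk2": [[5.0, 5.0], [5.2, 5.0]]}
T = auto_threshold(protos)                 # 0.2 with these prototypes
print(classify([0.1, 0.0], protos, T))     # spk1
print(classify([9.0, 9.0], protos, T))     # unregistered
```

Because T is derived from within-speaker spread only, an input far from every training subset fails the threshold test and is rejected, which is exactly the verification behavior described in the text.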
A high recognition rate, approximately 85%, has been reached by our speech-dependent voice recognition system in a test with 5 input vocal utterances and 3 registered speakers. The speech input signals and their corresponding feature vectors are represented in the next two figures. All of these vocal utterances contain a single word: hello.
FIGURE 16. Experimental Vocal feature vectors [12].
FIGURE 17. Training set and the corresponding feature set [12].
Using the registered values we obtain the identification result: Speaker 1 = {S1, S4}, Speaker 2 = {S3, S4}, and Speaker 3 = {S2}. Computing the threshold value T = 1.3915, we obtain the recognition: Speaker 1 = {S1, S4}, Speaker 2 = {S3}, and Speaker 3 = {S2}.

Speech-independent recognition does not constrain the spoken text, ensuring that the entire vocal range is captured; it is therefore useful for non-cooperative subjects, for example those in surveillance systems. Effective speech-independent recognition methods are based on Vector Quantization (VQ) or the Gaussian Mixture Model (GMM): the VQ-based methods are non-parametric approaches, while the GMM method is a parametric technique in which K Gaussian distributions are used. We utilize the same delta-delta Mel cepstral analysis for the feature extraction part of this recognition system, with 12-dimensional feature vectors, each vector V(S) being computed as the truncated DDMFCC acoustic matrix. The sequence of speech signals to be recognized, {S1, ..., Sn}, is no longer characterized by the same speech. A similar minimum-mean-distance classifier is used, with a different training set: we consider a large set of spoken letters and words, containing most of the English-language phonemes. Each registered speaker provides this speech several times, so the same text is obtained from all the prototype signals of Sp, and the classification equation identifies the speaker. We again provide a threshold-based verification technique, though not an automatic one as in the previous case: T is a threshold value chosen analytically so as to satisfy the necessary condition, where C1, ..., CN represent the identified voice classes. Many numerical tests using this approach were performed, and a high voice recognition rate was obtained [12].
FIGURE 21. Prototype vocal signals and their feature vectors [12].
Figure 21 shows the prototype speech signals and their corresponding DDMFCC-based speech feature vectors. The mean distance values between the input feature vectors and the training feature subsets are then computed. Using the values registered in TABLE 2 we obtain the identification result: Speaker 1 = {S2, S6, S9}, Speaker 2 = {S1, S3, S7}, and Speaker 3 = {S4, S5, S8}. With the threshold value T = 7.67, we obtain the final recognition: Speaker 1 = {S2, S6, S9}, Speaker 2 = {S1, S3, S7}, Speaker 3 = {S4, S8}, and unregistered Speaker = {S5}. This is the voice recognition technique used in this project.
4.2.2. Dynamic Time Warping Algorithm (DTW):
Dynamic Time Warping is an algorithm that calculates an optimal warping path between two time series, together with the distance between them. Suppose we have two numerical sequences (a1, a2, ..., an) and (b1, b2, ..., bm); the two sequences may differ in length. The algorithm starts by calculating the local distances between the elements of the two sequences; here the absolute difference between the values of the two elements (Euclidean distance) is used, which results in a matrix of local distances d(i, j). Starting from the local distance matrix, a matrix of minimal accumulated distances is then built:

aij = d(i, j) + min(a(i-1)(j-1), a(i-1)j, ai(j-1))

where aij is the minimal distance between the subsequences (a1, a2, ..., ai) and (b1, b2, ..., bj). A warping path through the minimal distance matrix runs from element a11 to element anm and consists of those aij elements that formed the anm distance. The equation below gives the global warp cost of the two sequences.
GWC = (1/p) * (w1 + w2 + ... + wp)

where wi are the elements that belong to the warping path and p is the number of wi elements. The calculations made for two short sequences are shown in the accompanying figure, including the highlighted warping path.
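The DTW steps above — local distances, accumulated minimal-distance matrix, warping path, and global warp cost — can be sketched as:

```python
# Dynamic Time Warping between two numeric sequences, following the
# steps above: local (absolute-difference) distances, an accumulated
# minimal-distance matrix, and the global warp cost along the path.

def dtw(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    # Accumulated minimal-distance matrix, 1-indexed with an INF border.
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])        # local distance d(i, j)
            acc[i][j] = local + min(acc[i - 1][j - 1],   # match
                                    acc[i - 1][j],       # insertion
                                    acc[i][j - 1])       # deletion
    # Recover the warping path by backtracking from (n, m) to (1, 1).
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i, j))
        moves = {(i - 1, j - 1): acc[i - 1][j - 1],
                 (i - 1, j): acc[i - 1][j],
                 (i, j - 1): acc[i][j - 1]}
        i, j = min(moves, key=moves.get)
    path.append((1, 1))
    p = len(path)
    # Global warp cost: accumulated distance along the path divided by p.
    return acc[n][m] / p

cost = dtw([1, 2, 3], [1, 2, 2, 3])
```

Two sequences that differ only by a repeated element warp onto each other with zero cost, which is exactly why DTW tolerates the varying speaking rates mentioned above.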
However, even though voice recognition is done partly in the frequency domain, a still unknown brain-like algorithm would have to be discovered to explain how the voice is divided into syllables and phonemes for recognition. Since there are too many unknown facts about how the brain recognizes the voice through different paths and processes, it may still be better to approach the problem with a probabilistic algorithm than with an analytic one. For this reason, two different voice recognition algorithms have been studied; the feature common to both is the extraction of feature parameters from the speech signal. The NN (Neural Network) recognition algorithm first applies the feature parameters of an unknown new syllable or word to a huge coefficient matrix. Recognition with a neural network makes the whole learning process time-consuming because of this large coefficient matrix: if a new speech signal is added to the recognition algorithm, the entire process must be repeated from the beginning, which is a serious drawback. In the second method, the HMM (Hidden Markov Model) recognition algorithm, voice feature parameters are generated for every new input voice signal and used in the learning process to create a new HMM model. With an HMM model created for every word, all of these models are compared with the test word during the testing phase to find the matching voice sample. The disadvantage of the HMM approach is that for every new voice added, a new individual HMM model must be created and compared with all the existing HMM models to find a match, which slows down recognition speed. On the other hand, the HMM method is fast in initial training: when new voice information is added to the HMM database, only the new voice is used in the training process to create a new HMM model. Compared to the neural network algorithm, the HMM algorithm therefore provides a faster training process for a large number of speech samples.
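The per-word HMM scoring described above can be sketched with the classic forward algorithm. The two-state models and discrete observation symbols below are toy values invented for illustration, not trained speech models:

```python
# Toy illustration of HMM-based word scoring: each word has its own
# small discrete HMM; a test observation sequence is scored against
# every model with the forward algorithm, and the best-scoring word wins.
# All probabilities and symbols here are made up for illustration.

def forward(obs, pi, A, B):
    """P(obs | model) via the forward algorithm.
    pi: initial state probs, A: transition matrix, B: emission probs."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(n)) * B[t][o]
                 for t in range(n)]
    return sum(alpha)

# Two word models over observation symbols 0 and 1.
model_on  = dict(pi=[1.0, 0.0],
                 A=[[0.6, 0.4], [0.0, 1.0]],
                 B=[{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}])
model_off = dict(pi=[1.0, 0.0],
                 A=[[0.6, 0.4], [0.0, 1.0]],
                 B=[{0: 0.1, 1: 0.9}, {0: 0.8, 1: 0.2}])

def recognize(obs, models):
    # Compare the test sequence against every word's HMM, pick the best.
    return max(models, key=lambda w: forward(obs, **models[w]))

word = recognize([0, 0, 1, 1], {"On": model_on, "Off": model_off})
```

Adding a new word only requires training one new model, which is the scalability advantage noted above; the cost is that recognition must score the input against every stored model.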
4.3 Train
Arduino VR Module
5V ---------> 5V
1 ---------> TX
0 ---------> RX
Choose the right Arduino board (Tools -> Board; UNO recommended) and the right serial port.
Open the Serial Monitor, set the baud rate to 115200, and set the line ending to Newline or Both NL & CR.
FIGURE 22. Test on Serial Monitor [9].
settings. Write the settings command and press Send to enter it.
FIGURE 24. Settings of the voice module [9].
Train the Voice Recognition Module. Train record 0 with the signature "On" by sending the sigtrain 0 On command. When the Serial Monitor prints "Speak now", speak your word (it can be any word, though a meaningful one such as "On" is recommended), and when the Serial Monitor prints "Speak again", speak the word again. If the two utterances match, the Serial Monitor prints "Success" and record 0 is trained; if they do not match, repeat the process until it succeeds.
When training, the two LEDs on the Voice Recognition Module can help guide your training process. After the train command is sent, the SYS_LED blinks to prompt you to get ready; speak as soon as the STATUS_LED lights up, and when the STATUS_LED goes off the recording is finished. When the training is successful, the SYS_LED blinks again and this cycle repeats. When a record passes, the SYS_LED and STATUS_LED blink together.
FIGURE 25. Input “Sigtrain 0 On” in serial monitor [9].
Train another record. Send the sigtrain 1 Off command to train record 1 with the signature "Off". Choose your favorite words to train (any word will do, though a meaningful word is recommended).
FIGURE 26. Input “Sigtrain 1 Off” in serial monitor [9].
FIGURE 28. Recognize the voice input [9].
Training is finished. The train sample sketch also supports several other commands [9].
4.4 Protocol
Base format: 1. Control frames, 2. Return frames.
Codes:
1. FRAME CODE
AA --> Frame Head
0A --> Frame End
2. CHECK
00 --> Check System Settings
01 --> Check Recognizer
02 --> Check Record Train Status
03 --> Check Signature of One Record
3. SYSTEM SETTINGS
10 --> Restore System Settings
11 --> Set Baud Rate
12 --> Set Output IO Mode
13 --> Set Output IO Pulse Width
14 --> Reset Output IO
15 --> Set Power on Auto Load
4. RECORD OPERATION
20 --> Train One Record or Records
21 --> Training of One Record and Set Signature
22 --> Set Signature for Record
5. RECOGNIZER CONTROL
30 --> Load a Record
31 --> Clear Recognizer
32 --> Group Control
4.5 Details
Use the "Check System Settings" command to check the current settings of the Voice Recognition Module, including serial baud rate, output IO mode, output IO pulse width, auto load, and the group function.
Format:
| AA | 02 | 00 | 0A |
Return:
| AA | 08 | 00 | STA | BR | IOM | IOPW | AL | GRP | 0A |
STA : Trained status (0-untrained 1-trained FF-record value out of range)
BR: Baud rate (0,3-9600 1-2400 2-4800 4-19200 5-38400)
IOM: Output IO Mode (0-Pulse 1-Toggle 2-Clear 3-Set)
IOPW: Output IO Pulse Width(Pulse Mode) (1~15)
AL: Power on auto load (0-disable 1-enable)
GRP: Group control by external IO (0-disable 1-system group 2-user group)
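The frame layout above can be exercised off-device. The helper below builds the 4-byte "Check System Settings" request and unpacks a return frame; the sample payload bytes are made up for illustration, not a real capture:

```python
# Build and parse Voice Recognition Module frames for the
# "Check System Settings" command (frame head AA, frame end 0A).
# The sample return payload below is illustrative, not a real capture.

FRAME_HEAD, FRAME_END = 0xAA, 0x0A

def build_frame(cmd, payload=b""):
    """| AA | LEN | CMD | payload... | 0A |
    LEN counts the command byte, the payload, and the end byte."""
    length = 1 + len(payload) + 1
    return bytes([FRAME_HEAD, length, cmd]) + payload + bytes([FRAME_END])

def parse_settings_return(frame):
    """Unpack | AA | 08 | 00 | STA | BR | IOM | IOPW | AL | GRP | 0A |."""
    assert frame[0] == FRAME_HEAD and frame[-1] == FRAME_END
    assert frame[2] == 0x00  # Check System Settings return code
    sta, br, iom, iopw, al, grp = frame[3:9]
    return {"trained": sta, "baud": br, "io_mode": iom,
            "pulse_width": iopw, "auto_load": al, "group": grp}

request = build_frame(0x00)                  # | AA | 02 | 00 | 0A |
sample = bytes([0xAA, 0x08, 0x00, 1, 0, 0, 1, 0, 0, 0x0A])
settings = parse_settings_return(sample)
```

Note that the length byte 02 in the request and 08 in the return both follow the same rule: command byte plus payload plus the trailing 0A.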
Use the "Check Record Train Status" command to check whether a record is trained.
Format:
Check all records
| AA | 03 | 02 | FF| 0A |
Check specified records
| AA | 03+n | 02 | R0 | ... | Rn | 0A |
Return:
| AA | 5+2n | 02 | N | R0 | STA | ... | Rn | STA | 0A |
N: number of trained records.
R0 ~ Rn: record.
STA: trained status (0-untrained 1-trained FF-record value out of range)
This command is used to set the baud rate of the Voice Recognition Module; the new setting takes effect after the module is restarted.
Format:
| AA | 03 | 11 | BR | 0A |
Return:
| AA | 03 | 11 | 00 | 0A |
BR: Serial baud rate. (0-9600 1-2400 2-4800 3-9600 4-19200 5-38400)
This command is used to set the output IO mode of the Voice Recognition Module; it takes effect immediately after the instruction executes.
Format:
| AA | 03 | 12 | MODE | 0A |
Return:
| AA | 03 | 12 | 00 | 0A |
MODE: Output IO mode. (0-pulse mode 1-Toggle 2-Set 3-Clear)
Use this command to set the output IO pulse width of the Voice Recognition Module; it takes effect immediately after the instruction executes. The pulse width is used when the output IO mode is "Pulse".
Format:
| AA | 03 | 13 | LEVEL | 0A |
Return:
| AA | 03 | 13 | 00 | 0A |
LEVEL: pulse width level. Details:
- 00 10ms
- 01 15ms
- 02 20ms
- 03 25ms
- 04 30ms
- 05 35ms
- 06 40ms
- 07 45ms
- 08 50ms
- 09 75ms
- 0A 100ms
- 0B 200ms
- 0C 300ms
- 0D 400ms
- 0E 500ms
- 0F 1s
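The level-to-milliseconds table above is a simple lookup; a sketch that converts a level to its pulse width and builds the corresponding | AA | 03 | 13 | LEVEL | 0A | frame:

```python
# Pulse-width level table from the command above, as a lookup that
# converts a level (0x00-0x0F) to milliseconds and builds the
# | AA | 03 | 13 | LEVEL | 0A | frame for it.

PULSE_WIDTH_MS = [10, 15, 20, 25, 30, 35, 40, 45,
                  50, 75, 100, 200, 300, 400, 500, 1000]

def set_pulse_width_frame(level):
    if not 0x00 <= level <= 0x0F:
        raise ValueError("pulse width level must be 0x00..0x0F")
    return bytes([0xAA, 0x03, 0x13, level, 0x0A])

frame = set_pulse_width_frame(0x0B)   # 200 ms pulse width
ms = PULSE_WIDTH_MS[0x0B]
```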
Use this command to reset the output IO; it can also generate a user-defined pulse when the output IO is in set/clear mode.
Format:
| AA| 03 | 14 | FF | 0A | (reset all output IO)
| AA| 03+n | 14 | IO0 | ... | IOn | 0A | (reset output IOs)
Return:
| AA | 03 | 14 | 00 | 0A |
IOn: output IO number
11. Train One Record or Records (20)
Return:
| AA | 04+SIGLEN | 22 | 00 | RECORD | SIG | 0A | (Set signature return)
| AA | 04 | 22 | 00 | RECORD | 0A | (Delete signature return)
SIG: signature string
SIGLEN: signature string length
Load records (1~7) into the recognizer of the Voice Recognition Module; after execution, the module starts recognizing immediately.
Format:
| AA| 2+n | 30 | R0 | ... | Rn | 0A |
Return:
| AA| 2+n | 30 | N | R0 | STA0 | ... | Rn | STAn | 0A |
N: number of records loaded successfully.
R0~Rn: record.
STA0~STAn: load result (0-Success FF-Record value out of range FE-Record untrained FD-Recognizer full FC-Record already in recognizer)
Format:
| AA | 02 | 31 | 0A |
Return:
| AA | 03 | 31 | 00 | 0A |
Set the group control mode (disable, system, user). If the group control function is enabled (system or user), the Voice Recognition Module is controlled by the external control IO.
Format:
| AA| 04 | 32 | 00 | MODE | 0A |
MODE: new group control mode. (00-disable 01-system 02-user FF-check)
Return:
| AA| 03 | 32 | 00 | 0A |
or
| AA| 05 | 32 | 00 | FF | MODE | 0A | (check command return)
The Prompt command is used only by the Voice Recognition Module to return data while the user trains a voice command.
Format:
NONE
Return:
| AA | 07 | 0A | RECORD | PROMPT | 0A |
RECORD: record which is in training
PROMPT: prompt string
24. The Voice Recognized command is used only by the Voice Recognition Module to return data when a voice is recognized.
Format:
NONE
Return:
| AA | 07 | 0D | 00 | GRPM | R | RI | SIGLEN | SIG | 0A |
GRPM: group mode indicate. (FF-not in group mode 00~0A-system group mode 80~87-
user group mode)
R: record which is recognized.
RI: Recognizer index value.
SIGLEN: signature length of the recognized record.
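A sketch of parsing this return frame on the host side, so the recognized record and its signature can be acted on; the sample frame below is constructed for illustration, not a real capture:

```python
# Parse the "voice recognized" return frame
# | AA | LEN | 0D | 00 | GRPM | R | RI | SIGLEN | SIG... | 0A |.
# The sample frame below is constructed for illustration only.

def parse_recognized(frame):
    assert frame[0] == 0xAA and frame[-1] == 0x0A
    assert frame[2] == 0x0D  # "voice recognized" return code
    grpm, record, index, siglen = frame[4:8]
    sig = frame[8:8 + siglen].decode("ascii")
    return {"group_mode": grpm, "record": record,
            "index": index, "signature": sig}

sig = b"On"
sample = bytes([0xAA, 0x07 + len(sig), 0x0D, 0x00,
                0xFF, 0x00, 0x00, len(sig)]) + sig + bytes([0x0A])
result = parse_recognized(sample)
```

In the assistive device, a parsed record number like this is what triggers the Arduino to command the GSM module to place the call.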
CHAPTER 5
GSM is an acronym that stands for Global System for Mobile Communications. In 1984, GSM was developed as a standard for a digital cellular mobile telephone system that could be used across Europe, and mobile services now use GSM as an international standard. Subscribers can easily roam worldwide and access any GSM network thanks to the high mobility it offers. GSM offers much higher capacity than the older analog systems: a larger number of subscribers can be served because the radio spectrum is allocated more optimally. Voice communications, Short Message Service (SMS), fax, voice mail, and supplemental services such as call forwarding and caller ID are some of the services offered by GSM. The most common frequency bands on which GSM works are 450 MHz, 850 MHz, 900 MHz, 1800 MHz, and 1900 MHz; some bands also have Extended GSM (EGSM) to increase the amount of spectrum available. GSM works with Time Division Multiple Access (TDMA) and Frequency Division Multiple Access (FDMA) [14].
FIGURE 29. Use of GSM in biomedical applications [14].
5.1 GSM Module – SIM900
are 850/900/1800/1900 MHz. The module can be used not only to access the Internet, but also for voice communication and SMS, provided it is connected to a microphone and a small loudspeaker. The dimensions of the GSM SIM900 module are 0.94 inches x 0.94 inches x 0.12 inches, with L-shaped contacts placed on all four sides so that the module can be soldered both at the bottom and on the sides. An ARM926EJ-S processor controls phone communication, data communication over an integrated TCP/IP stack, and communication through a UART and a TTL serial interface. The processor also internally manages the communication between the module and the circuit interfaced with the cell phone itself. A SIM card (3 V or 1.8 V) needs to be attached to the outer surface of the SIM900 module. The SIM900 device integrates an SPI bus, a PWM module, an A/D converter, an I²C interface, an RTC, and an analog interface. The radio section is GSM phase 2/2+ compatible and is either class 4 (2 W) at 850/900 MHz or class 1 (1 W) at 1800/1900 MHz. The TTL serial interface is in charge of communicating all the data relative to received SMS messages and the data that comes in during TCP/IP sessions in GPRS; GPRS class 10 (max. 85.6 kbps) determines the data rate. The TTL serial interface also monitors the circuit commands coming from the PIC that controls the remote control. The module can draw up to 0.8 A and must be supplied with a continuous voltage between 3.4 and 4.5 V during transmission [11].
FIGURE 30. SIM900 [11].
5.2 GSM call processing
Call processing consists of the different steps that set up, maintain, and end a call. The American National Standard for Telecommunications describes call processing in a telecom switching system from the acceptance of an incoming call through the final disposition of the call: the end-to-end sequence of operations performed by a network from the instant a call attempt is initiated until the instant the call release is completed.
Initialization:
The first part of mobile call processing is initialization: when a valid handset is switched on, the mobile gets a connection to a nearby cell site so that the cellular network can check the account. The system checks a frequency list contained in its SIM card, the removable memory chip in the handset. The bit streams carried on these frequencies are checked, searching for a Broadcast Control Channel, or BCCH, within one of them. Each BCCH transmits a unique data marker, so the mobile knows when it has found its channel. This is a big difference between AMPS and GSM: with AMPS, a dedicated radio frequency in each cell is used to set up calls, while with GSM any frequency can carry set-up information. It is the channel within the data stream that is important to find, not a specific radio frequency. A base station's Broadcast Control Channel continuously sends out identifying information about its cell site, such as the area code for the network and whether the cell uses frequency hopping. The BCCH is not a dedicated radio frequency; it is represented by a channel within the bit stream carried by any of the frequencies in a cell.
The mobile acts as a scanning radio, checking for any base station signals within range. It goes through each BCCH frequency, testing reception and measuring the received level for each channel. After this test, the GSM system decides which cell site should take the call; that is usually the cell site with the strongest signal.
Once homed in on the Broadcast Control Channel, the mobile monitors the ongoing data stream from the base station, searching the BCCH for a frequency control burst, or frequency control channel burst (FCCB). This burst of 142 bits, with 3 tail bits in front and behind, is a distinctive marker announcing that synchronization bits will follow. With those bits, a wireless connection is set up between the mobile and the cellular system; once that is done, the mobile and base station can communicate and everything can start working.
The digital signature the mobile searches for in the BCCH is one burst among many within a single GSM TDMA frame. Bits resemble single pulses of electrical energy, like the single dashes of a Morse code key. Morse code uses long and short pulses of energy to represent letters; the pulses used in digital radio do the same thing, but with uniform length. Voice and data are represented as groups of bits, and bits are also used for signaling.
GSM is a time-multiplexed system. Many calls share the same frequency, divided by time; if the calls are cars in a long lane, a new call must fit into the frequency band like every third car. The synchronization bits provide the mobile and base station with exact timing details for the coming conversation. Once the handset is assigned a place in this digital freight train, it can send and receive information. Call handling involves two parts: the radio part and the network part.
The radio subsystem, sometimes called the air interface, covers how a radio connection from the handset to the cell site is set up, maintained, and later torn down. The network subsystem, or switching element, decides who gets on the system and how the call is set up and terminated; it also determines which services and resources the system can use. These two parts each work to help the other out, and a call would not go through if they did not work together in synchronization [15].
CHAPTER 6
RESULTS
Voice samples for speaker independent recognition are considered in the table below.
TABLE 3. Voice samples for speaker independent voice recognition from experimental data
Number of Speaker | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Sample 7 | Sample 8 | Sample 9
FIGURE 35. Experimental output waveform of Number ‘0’.
From the above waveform, I was able to find the relation of speech with time and frequency. Sampling of the voice was required to recognize the voices of different speakers. Voice samples of the numbers 0 through 9 were recorded for each speaker so that their voices could be recognized and the Arduino UNO could issue the corresponding command.
The Arduino UNO analyzed the data for the GSM module to dial the number. The GSM module checks for a cellular network provider before initiating the call. Once the network is found by the GSM module, the corresponding status data appears on the serial monitor. Once the network is connected, the provider allows the call to be initiated.
FIGURE 38. Serial monitor showing experimental status of Call Ready.
A period of time is needed for the GSM module to locate and log on to the cellular network. The Call Ready status shows that the search for a communication channel is over and a call can be made to the other device. The GSM module SIM900 mostly uses the 900 MHz band for communication, though it searches other frequency channels too. Each command sent to the module is a string of text; these AT commands are considered the modem language.
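The modem dialogue itself is plain AT command text. A minimal helper that formats the standard dial and hang-up commands might look like this (the phone number is a placeholder):

```python
# Format the standard GSM AT commands used to place and end a voice
# call on a SIM900-class modem. Commands are terminated with CR;
# the phone number below is a placeholder.

def dial_command(number):
    """ATD<number>; - the trailing semicolon requests a voice call."""
    return f"ATD{number};\r"

def hangup_command():
    """ATH ends the current call."""
    return "ATH\r"

cmd = dial_command("+15551234567")
```

On the device, these strings would be written to the serial line connected to the SIM900, which replies with status lines such as OK.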
The full connection of the assistive device using voice recognition for GSM calling is shown below.
FIGURE 39. Assistive device using voice recognition for GSM voice call.
CHAPTER 7
7.1 Applications
Healthcare: ASR for doctors, in order to create patient records automatically.
Voice recognition with a GSM module for emergency calls helps doctors reach the patient as soon as possible. The number of fatalities among physically handicapped people increases every year, and the assistive device can prove vital in keeping track of this. In emergency cases, patients require immediate attention from the doctor, so a single phone call from the assistive device can save time and notify the doctor.
Help for the disabled (especially to access the web and control the computer):
In today's world, web access and control of a computer are important in any occupation. Disabled people have difficulty accessing their laptops and computers, so operating electronic devices by voice would let them work efficiently without problems. With more research, assistive devices that improve the working environment for the disabled can be an innovative contribution to the field of biomedical engineering.
Military:
Voice recognition keeps communication simple; it is also used in fighter planes, where the pilot's hands are too busy to type. Voice recognition and GSM are vital for transferring data over the channel. Other applications include:
- Voice prompter
- Information services
- Agent technology
- Customer care
7.2 Future scope
In the United States of America and Japan, Arduino-based robots are quite famous for their facial expression and mirroring properties. Creating an emotional bond with the machine is one of the goals of the human-robot interface. Body language and facial expressions that a voice recognition system can read can also be used for threat detection.
If you smile at a robot while you are having a conversation and it smiles back at you, an emotional bond with the human is created during the conversation. The system might start adjusting to your behavior: it may mirror the user's responses, reciprocate an angry response, or work to defuse the situation. It all depends on the machine's programming, so that all functions can be performed accurately. Because of these advances, potential applications and trends keep moving forward.
CHAPTER 8
CONCLUSION
With a computer, multimedia hardware, and relevant technical papers in the public domain, it is possible to build a voice recognition system for physically handicapped people. The accuracy is usually in the mid-80% range, as long as the environment is quiet. The key properties of the proposed platform are scalability and universality. The Mel-frequency Cepstral Coefficient method was applied for voice recognition, and the results were effective. Training the voice module on voice samples increased the accuracy of voice recognition. Even though voice is volatile and the waveform changes every time the same word is spoken, the voice module V3 could meticulously match the voice with the samples in its memory. The GSM module SIM900 proved to be one of the most effective and economical components for connecting to the GSM network. Integrating the voice module V3 and the GSM module SIM900 with the Arduino UNO produces an assistive device that can be used for emergency calls and for getting immediate help. The platform is composed of easy-to-get and relatively cheap hardware elements.
REFERENCES
[10] Bo Cui and Tongze Xue, "Design and realization of an intelligent access control
system based on voice recognition," in Computing, Communication, Control, and
Management, 2009. CCCM 2009. ISECS International Colloquium on, 2009, pp.
229-232.
[13] Soon Suck Jarng, "HMM voice recognition algorithm coding," in Information
Science and Applications (ICISA), 2011 International Conference on, 2011, pp. 1-7.