
ABSTRACT

ASSISTIVE VOICE RECOGNITION DEVICE FOR GSM CALLING USING ARDUINO


UNO
By

Dhruvin M. Lingaria

August 2015

Developing a smart home environment for assistive living requires great effort. The key element of the smart environment is a ubiquitous voice user interface, potentially extended with additional capabilities such as gesture recognition, which can be a new feature of voice-controlled devices. Many identification technologies are used in current intelligent guard systems; relative to other techniques, voice recognition is generally regarded as one of the most convenient and safe. This assistive device project incorporated voice recognition technology to perform GSM calling. An Arduino UNO microcontroller creates the interface between the voice module and the GSM module SIM900. The platform was developed using inexpensive hardware and software elements available on the market, and the assistive device proved highly robust for people with disabilities. Sample voice commands were stored in the temporary memory of the ATmega328P while field tests with several sets of voice commands were performed. The GSM module SIM900 could easily connect to the local cellular network carriers. Hence, voice-recognized emergency calling can be the future of the biomedical field.


ASSISTIVE VOICE RECOGNITION DEVICE FOR GSM CALLING USING ARDUINO
UNO

A PROJECT REPORT
Presented to the Department of Electrical Engineering
California State University, Long Beach

In Partial Fulfillment
of the Requirements for the Degree
Master of Science in Electrical Engineering

Committee Members:
Christopher Druzgalski, Ph.D. (Chair)
Anastasios Chassiakos, Ph.D.
James Ary, Ph.D.

College Designee:
Antonella Sciortino, Ph.D.

By
Dhruvin M Lingaria
B.E., 2012, Rizvi College of Engineering, Mumbai, India
August 2015
ProQuest Number: 1600584


Published by ProQuest LLC (2015). Copyright of the Dissertation is held by the Author.

Copyright 2015
Dhruvin M Lingaria
ALL RIGHTS RESERVED
ACKNOWLEDGEMENTS

I am thankful to God for the well-being and good health that were necessary to complete this project. I am grateful to my parents for their support, eternal encouragement, and attention. I wish to express my sincere thanks to Dr. Christopher Druzgalski, Professor of Biomedical Engineering, for arranging all the necessary facilities for the research and for his valuable guidance, shared expertise, and encouragement. I place on record my sincere thanks to Dr. Anastasios Chassiakos, Chair of the Electrical Engineering Department, for his continuous encouragement.

I take this opportunity to express gratitude to all of the department faculty members for their help and support. I also show my gratitude to one and all who, directly or indirectly, have lent their hand in this venture.

TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS...…………………………………………………….............iii

LIST OF TABLES…………………………………………………………………..........vi

LIST OF FIGURES………………………………………………………………….......vii

LIST OF ABBREVIATIONS……………………………………………….………........ix

CHAPTER

1. INTRODUCTION…………………………………………….……………….……..1

2. SYSTEM DESIGN…………..……………...…………………………………..........2

2.1. Procedure................................................................................…………...........2
2.1.1. Arduino Uno……………………….…………………………...…….......3
2.1.2. GSM Shield………………………………...…………….…….…………6
2.1.3. GSM Antenna.............................................................................................8
2.1.4. Voice Module...................................................................................….......9
2.2. Integrating GSM Shield with the Arduino Uno.…………………………......10
2.3. Integrating Voice Module with Arduino Uno.………………………….........11
2.4. Flow Chart: Voice Recognized Telephone Calling……………………….....12

3. OVERVIEW OF APPLICABLE SPEECH PROCESSING.....................................13

3.1. Information Rate of the Speech Signal…………………………………….....14


3.2. Basic Assumption of Speech Processing…….……….....................................15


4. VOICE RECOGNITION……………………………………………………….......16

4.1. Voice Recognition V3 (WIP)…….……….………………………………......17


4.1.1. Features……………………………………….…………………………..17
4.1.2. SPCE061A Single Chip……………………………….………………….17
4.2. Voice Recognition Techniques……..…………………………………..…….26
4.2.1. A Mel-Cepstral Vocal Sound Analysis Approach……………………….27
4.2.2. Dynamic Time Warping Algorithm (DTW)……………………...….......37
4.2.3. HMM Voice Recognition Algorithm Coding….………………………....38
4.3. Train..…….…………………………………………………….…………......39
4.4. Protocol.……………………………………………………….……………...45
4.5. Details…………………………………………………...……………………46

5. GSM APPLICABILITY TO THE ASSISTIVE DEVICE......................................55

5.1. GSM Module – SIM900………………………………...………...………….57


5.2. GSM Call Processing…………….…………………...……….……………...59

6. RESULTS………………….………………………………………….………......62

7. APPLICATIONS AND FUTURE SCOPE………………….…............................68

7.1. Applications….………………………………………….…………………...68
7.2. Future Scope…………….………………………………….……………......70

8. CONCLUSION…………………….……………………………….…………….71

REFERENCES……………………….…………………………………………….72

LIST OF TABLES

TABLE Page

1. Technical Description Of Arduino UNO……………....…..………………..….........5

2. Speaker Dependent Sample Values From Experimental Data ……….........……....62

3. Voice Samples for Speaker Independent Voice Recognition from Experimental

Data………………………………………………………………………………...63

LIST OF FIGURES

FIGURE Page

1. Arduino UNO front…………………………..………………………….......................3

2. Arduino UNO back……………………………..……………………………….…......3

3. GSM shield front…………………………..…………………………………...…........6

4. GSM shield back………………………………………..………………………..….....6

5. GSM antenna……………………………………..………………………..……..……8

6. Front side of voice module…………………………..…………………………..….....9

7. Backside of voice module………………………..………………………………...…..9

8. Flow chart for voice recognized telephone calling….............................................…..12

9. Schematic diagram of the speech production/speech perception process…................14

10. Frame of the voice control system………………………...…………………….......19

11. SPCE061A pin diagram…………..………………………………………….……...21

12. The voice recognition block diagram…………………................................……….22

13. Power supply circuit……………………………..………………….…..………......23

14. The structure of voice signal process………………….…………………...…..……24

15. Experimental data of input speech signals………….……………………...…...…...31

16. Experimental vocal feature vectors……………………………………….…...…….32

17. Training set and the corresponding feature set………………………….…………..32

18. Experimental output of word ‘ready’…………..……………..………….…….……33



19. Experimental output of word ‘call’………...……………………….……….………33

20. Experimental input waveforms vocal utterances…........………………………...….35

21. Protocol vocal signals and their feature vectors………………………………...…..36

22. Test on serial monitor……...…….……………………………...………………..…40

23. Input command “settings” in serial monitor…………………………..………….....40

24. Settings of the voice module….…….………………………………………….……41

25. Input “Sigtrain 0 on” in serial monitor…...................................................................42

26. Input “Sigtrain 1 off” in serial monitor…………............................………………...43

27. Load 0 and 1 of voice samples…..……………………………….……………….....43

28. Recognize the voice input....…………….…………………………………….…….44

29. Use of GSM in biomedical…………..………………………….…………….…….56

30. SIM900………………………………………………………………………..…….58

31. SIM900 pin diagram…………………..………………….……….………..……….58

32. Burst of bits…………………………….………….……………………..………….60

33. Graph of speaker dependent voice samples………………………...…………….....62

34. Graphical representation of samples of speaker independent voice recognition…....63

35. Experimental output waveform of number ‘0’…..………………………….………64

36. Experimental output waveform of number ‘1’…..……….…………………………64

37. Serial monitor data for GSM network……..…………………….………………….65

38. Serial monitor showing experimental status of call ready…………………………..66

39. Assistive device using voice recognition for GSM voice call………………………67
LIST OF ABBREVIATIONS


GUI Graphical User Interface

SAPI Speech Application Programming Interface

MSS Microsoft Speech Server

ICSP In-Circuit Serial Programming

SDA Serial Data Line

SCL Serial Clock Line

GSM Global System for Mobile Communications

IMSI International Mobile Subscriber Identity

WDT Watchdog Timer

MFCC Mel-Frequency Cepstral Coefficients

DDMFCC Delta-Delta Mel-Frequency Cepstral Coefficients

DTW Dynamic Time Warping

HMM Hidden Markov Model

BCCH Broadcast Control Channel

CHAPTER 1

INTRODUCTION

Voice technology is of enormous benefit to people with physical disabilities, and people with different kinds of disabilities may benefit from various kinds of speech and voice processing technologies. In this project I created an assistive device that helps physically disabled people use their voice to initiate a GSM call. It is a robust product that can be used in any environment to suit the user. The Voice Module V3 records the voices of different users in order to recognize them. Every user speaks the numbers 0 to 9 to train the voice module, and these voice samples are stored in the voice module library. Each user has his or her own voice samples, from which the voice module recognizes the voice. There are variations between the voices of different users, and each user produces a slightly different voice sample every time the same number is spoken; therefore, training the voice module with several samples of each command is crucial. Once the voice is recognized, the Arduino UNO interfaces with the GSM module to search for the cellular network. As soon as the uplink connection to the cellular tower is established, the Arduino UNO commands the GSM module to make a call; this call can be to an emergency number, allowing disabled people to reach emergency services. This project can be helpful to many disabled people who need to talk to doctors far away from them or need medical services as soon as possible [1].

CHAPTER 2

SYSTEM DESIGN

2.1. PROCEDURE

Selection of the Components:

Selecting the components required for the project is one of the most important steps in developing any product. Depending on the resources required, the product can be expensive or economical, and this factor determines whether the project can be produced on an industrial basis. The required components are as follows:

1. Arduino UNO

2. GSM Module SIM900

3. Voice recognition module (V3)

4. Bread board

5. Connecting wires

6. Resistors

7. SIM card

8. Microphone

The components for the assistive device are easily available, cost effective, and easy to handle. They do not require regular maintenance, so they can be used regularly under robust conditions. As per the project requirements, the components use 5 V at 2 A to perform the task. The voice module and GSM module are the added features of the assistive device that produce the required output. With all the components assembled together, the resulting device has multiple applications in the biomedical field. The required components are described below.

2.1.1 Arduino UNO

FIGURE 1. Arduino UNO Front.

FIGURE 2. Arduino UNO Back.

Description:

The Arduino UNO is a microcontroller board based on the ATmega328. It has 14 digital input/output pins (of which 6 can be used as PWM outputs), a 16 MHz ceramic resonator, 6 analog inputs, a USB connection, an ICSP header, a power jack, and a reset button. It has everything needed to support the microcontroller; simply connect it to a computer with a USB cable or power it with an AC-to-DC adapter or battery to get started.

The UNO differs from all preceding boards in that it does not use the FTDI USB-to-serial driver chip. Instead, it features the ATmega16U2 (ATmega8U2 up to version R2) programmed as a USB-to-serial converter.

Arduino UNO board revision 2 has a resistor pulling the 8U2 HWB line to ground, making it easier to put into DFU mode. Arduino UNO board revision 3 has the following new features: the 1.0 pinout, which adds SDA and SCL pins near the AREF pin and two other new pins near the RESET pin; the IOREF pin, which helps shields adapt to the voltage provided by the board, so that shields can work both with 5 V AVR-based boards and with future boards; a second new pin that is not connected and is reserved for future purposes; a stronger RESET circuit; and the ATmega16U2 replacing the 8U2. "Uno" means one in Italian. The Uno is the latest version in the line of USB Arduino boards [2].

TABLE 1. Technical description of Arduino UNO [2]

Microcontroller ATmega328
Operating Voltage 5 V
Input Voltage (recommended) 7-12 V
Input Voltage (limits) 6-20 V
Digital I/O Pins 14 (of which 6 provide PWM output)
Analog Input Pins 6
DC Current per I/O Pin 40 mA
DC Current for 3.3V Pin 50 mA
Flash Memory 32 KB (ATmega328), of which 0.5 KB used by bootloader
SRAM 2 KB (ATmega328)
EEPROM 1 KB (ATmega328)
Clock Speed 16 MHz
Length 68.6 mm
Width 53.4 mm
Weight 25 g

2.1.2 GSM Shield

FIGURE 3. GSM Shield Front.

FIGURE 4. GSM Shield Back.

Description:

The GSM shield allows an Arduino board to make and receive voice calls, to connect to the internet, and to send and receive SMS messages. The shield is built around the SIM900 radio modem, and it is possible to communicate with the board using AT commands. The GSM library has numerous methods for communicating with the shield. The shield uses digital pins 2 and 3 for software serial communication with the SIM900: pin 2 is connected to the SIM900's TX pin and pin 3 to its RX pin, and the modem's power-key pin is connected to Arduino pin 7. The SIM900 is a quad-band GSM/GPRS modem; it supports TCP/UDP and HTTP protocols through a GPRS connection, with a maximum GPRS uplink and downlink transfer speed of 85.6 kbps. Interfacing with the cellular network requires a SIM card provided by a network operator. The most recent version of the shield uses the 1.0 pinout found on rev 3 of the Arduino Uno board [3].
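Because the shield is driven with standard GSM AT commands over the software serial port, originating a voice call reduces to writing a short command string to the modem. The following minimal host-side sketch (an illustration, not the report's firmware; the helper names are assumptions) shows how the voice-call dial and hang-up commands are composed:

```cpp
#include <string>

// Compose the GSM AT command that originates a voice call.
// In the standard GSM AT command set used by the SIM900,
// "ATD<number>;" dials a voice call (the trailing ';' marks a
// voice rather than data call), terminated by a carriage return.
std::string dialCommand(const std::string& number) {
    return "ATD" + number + ";\r";
}

// "ATH" hangs up the active call.
std::string hangupCommand() {
    return "ATH\r";
}
```

In a sketch, the resulting string would be written to the SoftwareSerial port attached to digital pins 2 and 3.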

2.1.3 GSM Antenna

FIGURE 5. GSM Antenna.

Description:

GSM systems have specific antenna design requirements because GSM technology provides the capability for global communication between wireless carriers. The antennas that make this possible are technologically advanced: the phones and the towers themselves use antennas to communicate with each other. The constant development of the technology means that antenna design companies have to work hard not only to keep up with the demand for innovations but to produce them as well. GSM has already generated newer and improved generations in the form of 3G and 4G technologies such as UMTS, EDGE, HSDPA, and LTE, while the competing CDMA protocol has moved to EV-DO [4].


2.1.4 Voice Module

FIGURE 6. Front side of Voice module.

FIGURE 7. Backside of Voice Module.

The module identifies voice commands and receives configuration commands, responding through its serial port interface. With the help of this module we can control a car or other electrical devices by voice. The module can store 80 voice instructions, divided into groups of 7 instructions each. The voice instructions are first recorded group by group; then one group is imported by serial command before the module can recognize the 7 voice instructions within that group. To use instructions from a different group, that group must first be imported. The module is speaker dependent: if a different speaker speaks the voice instruction in your place, the module will not identify it. Note that good recognition accuracy requires a good microphone.

Technical Parameters:

Voltage: 4.5-5.5V

Current: <40mA

Digital Interface: 5V (TTL level UART interface)

Analog Interface: 3.5mm mono-channel microphone connector + microphone pin

interface

Size: 30mm x 47.5mm

Recognition accuracy: 99% (under ideal environment) [5]
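The grouping rule above (80 stored records, with only one 7-instruction group active at a time) can be modeled in a few lines. This is a sketch of the constraint only, with illustrative names, not the module's real serial protocol:

```cpp
// Minimal model of the voice module's grouping rule: 80 stored
// records organized in groups of 7, and only the currently
// imported group can be recognized at any given moment.
class VoiceRecognizerModel {
public:
    static constexpr int kRecords = 80;
    static constexpr int kGroupSize = 7;

    // Mimics the serial "import group" command that must be sent
    // before the records in that group become recognizable.
    void importGroup(int group) { loaded_ = group; }

    // A record index is recognizable only if it exists and
    // belongs to the currently loaded group.
    bool canRecognize(int record) const {
        return record >= 0 && record < kRecords &&
               record / kGroupSize == loaded_;
    }

private:
    int loaded_ = 0;
};
```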

2.2 Integrating GSM Shield with the Arduino UNO

The GSM module has to be integrated with the Arduino UNO so that calling from the patient's device can be initiated. The GSM module has a SIM card inserted in it, so the connected computer or laptop can place a phone call to any other phone on the GSM network. The relevant pins of the GSM shield are activated when a call or SMS is handled by the shield.

2.3 Integrating Voice Module with Arduino UNO

Once the GSM module is integrated with the Arduino UNO, the voice module has to be programmed to work with the Arduino UNO. The voice is recognized and amplified so that the system can use it for the phone call. The method of recognizing the voice is as follows.

Steps for calling:

1. Say "Dial".
2. The system prompts for the phone number.
3. Say the phone digits one at a time.
4. After the last digit, say "Verify"; the number is repeated back.
5. Say "Send"; the system responds "Dialing", the number is dialed, and the call is in process.
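The spoken dialog above can be sketched as a small state machine that collects digits between "Dial" and "Verify" and releases the number on "Send". The class below is an illustration of that flow under assumed behavior, not the report's firmware:

```cpp
#include <cctype>
#include <string>

// Illustrative state machine for the spoken dialing dialog:
// "Dial" starts digit collection, "Verify" closes it, and
// "Send" completes the flow and yields the collected number.
class DialDialog {
public:
    // Returns the dialed number once "Send" completes the flow,
    // or an empty string while the dialog is still in progress.
    std::string onWord(const std::string& w) {
        if (w == "Dial") {
            digits_.clear();
            collecting_ = true;
            verified_ = false;
        } else if (w == "Verify") {
            collecting_ = false;
            verified_ = true;
        } else if (w == "Send" && verified_) {
            return digits_;
        } else if (collecting_ && w.size() == 1 &&
                   std::isdigit(static_cast<unsigned char>(w[0]))) {
            digits_ += w;  // accumulate one spoken digit
        }
        return "";
    }

private:
    std::string digits_;
    bool collecting_ = false;
    bool verified_ = false;
};
```

On a complete sequence such as Dial, 9, 1, 1, Verify, Send, the final call returns the assembled number "911".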

2.4 Flow Chart: Voice recognized telephone calling

FIGURE 8. Flow chart for voice recognized telephone calling.


http://www.ti.com.cn/cn/lit/an/spra144/spra144.pd

CHAPTER 3

OVERVIEW OF APPLICABLE SPEECH PROCESSING

Speech processing is the application of digital signal processing (DSP) techniques to process and/or analyze speech signals. Applications of speech processing include:

Speech coding

Speech recognition

Speaker verification/identification

Speech enhancement

Speech synthesis (text-to-speech conversion)

The speech production process is initiated when the speaker formulates a message in his or her mind to transmit to the listener via speech. The next step in the process is the conversion of the message into a language code. This corresponds to converting the message into a set of phoneme sequences corresponding to the sounds to be made, together with the pitch, duration, and loudness of those sounds [6].

FIGURE 9. Schematic diagram of the speech production/speech perception process [6].

3.1 Information Rate of the Speech Signal

The rate of discrete symbol information in the raw message text is rather low: about 50 bits per second, corresponding to about 8 sounds per second, where each sound is one of about 50 distinct symbols.

Once the message is converted into the language code, with the inclusion of prosody information, the information rate rises to about 200 bps. In the next stage the representation of the information becomes continuous, with an equivalent rate of about 2,000 bps at the neuromuscular control level and about 30,000-50,000 bps at the acoustic signal level. On the perception side, the continuous information rate at the basilar membrane is likewise in the range of 30,000-50,000 bps, while at the neural transduction stage it is about 2,000 bps. The higher-level processing within the brain converts the neural signals to a discrete representation, which is eventually decoded into a low-bit-rate message [6].
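The 50 bps figure follows directly from the symbol rate: 8 sounds per second, each drawn from roughly 50 symbols, carries about 8 × log2(50) ≈ 45 bits per second. A one-line check:

```cpp
#include <cmath>

// Information rate of a stream of discrete symbols: the number of
// symbols per second times the bits carried by each symbol,
// log2(alphabetSize). For speech, 8 sounds/s from ~50 symbols
// gives roughly 45 bps, i.e. "about 50 bits per second".
double symbolRateBps(double soundsPerSec, double alphabetSize) {
    return soundsPerSec * std::log2(alphabetSize);
}
```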

3.2 Basic Assumption of Speech Processing

Speech processing systems rest on the basic assumption that the source of excitation and the vocal tract system are independent. Hence, it is appropriate to model the excitation source and the vocal tract system separately. In continuous speech the vocal tract changes shape slowly and gradually, so it is reasonable to assume that the vocal tract has fixed characteristics over a time interval on the order of 10 ms [6].
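This quasi-stationarity assumption is what justifies short-time analysis: the signal is cut into frames of roughly 10 ms, each treated as having fixed characteristics. A minimal framing helper (non-overlapping frames for simplicity; practical front ends usually overlap frames, and the 8 kHz / 80-sample numbers below are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Split a signal into fixed, non-overlapping analysis frames.
// At an 8 kHz sampling rate, a 10 ms frame is 80 samples.
std::vector<std::vector<double>> frameSignal(
        const std::vector<double>& x, std::size_t frameLen) {
    std::vector<std::vector<double>> frames;
    for (std::size_t i = 0; i + frameLen <= x.size(); i += frameLen)
        frames.emplace_back(x.begin() + i, x.begin() + i + frameLen);
    return frames;
}
```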

CHAPTER 4

VOICE RECOGNITION

Voice recognition is "the technology by which sounds, words or phrases spoken by humans are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned" [7]. The notion could more generally be called "sound recognition"; the emphasis is on the human voice because we most often and most typically use our voices to communicate our ideas to others in our surroundings. In a virtual environment, the user would more readily gain the feeling of immersion and of being part of the simulation if they could use their most common form of communication, the voice. Owing to the fundamental difference between human speech and the more traditional forms of computer input, it is difficult to use voice as an input to a computer simulation. While computer programs are commonly designed to produce an explicit and well-defined response upon receiving the proper input, the human voice and spoken words are anything but precise: each human voice is different, and identical words can have different meanings if spoken with different inflections or in different contexts. Several approaches have been tried, with varying degrees of success, to overcome these difficulties [8]. Performance in voice recognition can be improved by taking into account the following criteria:

Size of the recognizable vocabulary.

Degree of spontaneity of the speech to be recognized.

Dependence on or independence of the speaker.

Time to set the system in motion.

Time for the system to accommodate new speakers.

Decision and recognition time.

Recognition rate (expressed per word or per sentence) [7].

4.1. Voice Recognition V3 (WIP)

4.1.1. Features

Group control and an external group-select pin

Autoloads records at power-on

Recognizes a maximum of 7 voice commands at the same time

Signature function helps identify each voice record

Stores a maximum of 80 voice records

LED indicator [9].

4.1.2. SPCE061A Single Chip

The voice recognition technology adopted by the intelligent access system is based on the SPCE061A single chip. The system hardware is made up of:

The SPCE061A single chip

The power and gating circuit

The extended memory SPR4096

The voice input and output circuit

The key technologies are the application of the SPCE061A single chip to voice recognition and the design of the assistive control circuit. The system software consists of the voice recognition module, the voice training module, the voice-playing module, the speech data processing module, and the cipher input/output module. The voice module handles collecting and extracting the voice data, speech recognition, and voice playing, as part of initializing the system and the identification training. Following voice recognition algorithm theory, the feature extraction, pretreatment, and pattern matching are then analyzed [10].

The classification of voice recognition:

Because the purposes and functions differ, recognition is classified into speaker recognition and speech recognition. Speaker recognition is itself classified into two types, text dependent and text independent. A text-dependent system requires pronunciation to follow stated contents, so that each person's speech model can be built accurately; identification also requires users to pronounce the stated contents, and the effect is very good. In a text-independent system the pronunciation matters more than the textual content, so it is more difficult to build up speech models, but because customers can use such a system conveniently, it can be applied widely. By usage, the systems are further classified into speaker identification and speaker confirmation. The former picks out the voice to be identified from several samples; the latter judges whether or not an identified voice comes from a certain speaker, so its output has two kinds of result, yes or no. The central processor of this system is the SPCE061A single chip. Text-dependent speaker verification is accomplished on the chip, and then the corresponding commands and operations are carried out. The system is mainly made up of the speaker identification module and the gating circuit.

In training, the voice of the speaker enters the voice signal collection circuit through a microphone; the collected voice signals are then processed by the voice processing circuit, and the characteristic parameters of the speaker are extracted and saved. Finally, a database of the speakers' characteristic parameters is formed. In identification, the voice to be identified is matched against the information in that database of speaker characteristic parameters, and the output circuits drive the gating electrical machine [11].

FIGURE 10. Frame of the voice control system [10]. (Block diagram: the microphone and keyboard feed the SPCE061A single chip, which works with the SPR4096 memory, the switch circuit, and the control circuit to drive the speaker and the dialed-number output.)

The hardware of this system includes the SPCE061A single chip, the audio output circuit, the voice recognition circuit, the FLASH circuit, the audio input circuit, the keyboard circuit, etc. The main mission of the hardware is to convert voice signals into digital signals, collect samples, and upload, identify, and play the voice data [10].

Hardware design of the system:

A. CPU core circuit:

The SPCE061A chip has a system frequency of 0.375-49.152 MHz and a voltage range of 2.6-5.5 V. The single chip provides 32 multifunction programmable I/O ports, a 7-channel 10-bit voltage ADC, two 16-bit timers/counters, a microphone input with automatic gain, a dual-channel 10-bit DAC audio output, and a watchdog timer (WDT). The interrupt controller can handle three kinds of FIQ interrupt, eleven kinds of IRQ interrupt, and one software interrupt controlled by the BREAK instruction. The single chip provides voice processing functions and abundant C function libraries, making it very suitable for implementing voice recognition products.

FIGURE 11. SPCE061A Pin Diagram. www.go-gddq.com.

B. Voice recognition circuit:

The principle of the voice recognition circuit is that voice signals are analyzed by the intelligent system after the voice is captured. First, noise is filtered out and the useful components of the voice signals are extracted through a filter group; then the signals are processed and selected by matching computations on PARCOR (partial correlation) coefficients, linear prediction coefficients, zero-crossing counts, etc. After analysis and processing, the voice signals undergo pattern matching against the voice data in the voice database, and finally the voice recognition result is output according to the match result. The basic structure of the voice recognition circuit is shown in figure 12.

FIGURE 12. The voice recognition block diagram [10]. (Voice signals pass through noise filtering and a filter group computing PARCOR coefficients, linear prediction coefficients, and zero-crossing counts; the result is pattern-matched against the voice database to produce the recognition output.)

C. Power supply circuit:

The SPCE061A chip adopts a low supply voltage in order to reduce power consumption. The SPCE061A has two power supplies: the core supply VDD and the I/O supply VDDH. The I/O supply is 5 V and the core supply is 3.3 V; the main motive for reducing the core supply is to lower the power consumption and working temperature of the single chip. Although the voltage range of the SPCE061A is very wide, to make the chip run more stably and satisfy the voltage demands of the I/O ports and external parts, the power supply circuit shown in figure 13 is used. AC 220 V is converted into DC 5 V by a 7805 voltage regulator module; DC 5 V supplies power for the voice recognition module and all I/O ports inside the system, and DC 5 V is converted into DC 3.3 V by a TR1972-33 [10].

FIGURE 13. Power supply circuit [10].

1. Processing of the voice signals:

First, the voice signals are pretreated and properly amplified; second, the analog signals are converted into digital signals so that digital equipment can process them conveniently; then features are extracted so that a few characteristic parameters can represent the voice signal. Finally, different treatments are adopted according to the task. Voice recognition can be divided into two stages: the training stage and the identification stage.

In the training stage the voice signals, expressed by characteristic parameters, are processed to obtain reference data that capture the common characteristics of the basic identification units. Reference templates are formed from these data, and the reference template database is formed after all reference templates of the identified basic units are combined. In the identification stage the voice signals to be identified are, after feature extraction, compared with the reference templates one by one.

FIGURE 14. The structure of voice signal process [10].

A. Voice signal pretreatment:

Noise seriously disturbs the processing and identification of voice signals, so the noise must be dealt with first. The analog voice signals input from the microphone must be sampled and quantized in order to obtain digital voice signals. Before converting voice signals into digital form, it is necessary to filter them against disturbance: signal components and noise beyond half the sampling frequency are filtered out, yielding a cleaner signal. Low-frequency disturbance, especially at 50 Hz or 60 Hz, is then removed through pre-emphasis (fore-aggravation) filtering, which boosts the high-frequency voice components, cuts out DC drift, and improves the energy of the clean voice signal by confining random noise [10].
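The pre-emphasis step described above is conventionally a first-order high-pass filter, y[n] = x[n] − a·x[n−1], which attenuates low-frequency drift and boosts the higher frequencies of the speech spectrum. The coefficient 0.95 below is a common textbook choice, not a value stated in the report:

```cpp
#include <vector>

// Pre-emphasis high-pass filter: y[n] = x[n] - a * x[n-1].
// A constant (DC) input is almost cancelled, while rapid
// sample-to-sample changes pass through nearly unchanged.
std::vector<double> preEmphasize(const std::vector<double>& x,
                                 double a = 0.95) {
    std::vector<double> y(x.size());
    for (std::size_t n = 0; n < x.size(); ++n)
        y[n] = x[n] - (n > 0 ? a * x[n - 1] : 0.0);
    return y;
}
```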

B. Feature extraction:

The system adopts an evaluation method that uses the contrast between the dispersion among different speakers and the self-dispersion of each speaker as characteristic parameters. The basic idea is to extract a group of characteristic parameters from a voice segment of one speaker, that is, to map the segment onto a dot in a multidimensional space. Different voices from the same speaker will produce different dots in the characteristic space, whose distribution can be described by a multivariate probability density function. For single pronunciations from the same speaker these dots are relatively concentrated, while the distributions of pronunciations from different speakers lie farther apart, so the group of characteristic parameters can effectively describe the voiceprint of speakers. According to this principle, for a single parameter, the F ratio between the two kinds of distribution parameters can be used as an effective measurement rule: the F ratio expresses the contrast between the dispersion among different speakers and the self-dispersion of each speaker. The bigger the F ratio of a characteristic parameter, the more the between-speaker dispersion exceeds the within-speaker dispersion on average. Therefore the recognition system adopts parameters with a bigger F ratio, and the system capability is improved [10].
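For one scalar feature, the F ratio above can be computed as the spread of the per-speaker means (between-speaker dispersion) divided by the average spread around each speaker's own mean (within-speaker dispersion). A simplified illustration using plain population variances, not the report's exact formulation:

```cpp
#include <vector>

// F ratio for a single scalar feature: variance of the speakers'
// mean values divided by the average within-speaker variance.
// Larger values mean the feature separates speakers better.
double fRatio(const std::vector<std::vector<double>>& speakers) {
    std::vector<double> means;
    double within = 0.0;
    for (const auto& s : speakers) {
        double m = 0.0;
        for (double v : s) m += v;
        m /= s.size();
        means.push_back(m);
        double var = 0.0;                  // this speaker's spread
        for (double v : s) var += (v - m) * (v - m);
        within += var / s.size();
    }
    within /= speakers.size();             // average within-speaker variance
    double gm = 0.0;                       // grand mean of speaker means
    for (double m : means) gm += m;
    gm /= means.size();
    double between = 0.0;                  // variance of speaker means
    for (double m : means) between += (m - gm) * (m - gm);
    between /= means.size();
    return between / within;
}
```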

C. Module match:

At present, research on module matching methods based on various characteristic parameters is increasingly deep. Typical methods are the vector quantization algorithm, the Gaussian mixture model algorithm, the dynamic time warping (DTW) algorithm, and the artificial neural network algorithm. Each of these methods has both advantages and weaknesses. When the DTW algorithm is applied to the identification of long utterances, the cost of template matching is too great, but the algorithm is simple and effective for short utterances (valid voice shorter than 3 seconds). So the method is especially applicable to a text-dependent speaker recognition system, and the system in this paper adopts the DTW algorithm [10].

4.2. Voice Recognition Techniques

Voice recognition represents the computational task of validating users' claimed identity using characteristics extracted from their voices. A speaker recognition system has to be able to recognize who is speaking on the basis of individual information included in the speech signals. Speaker recognition technology makes it possible to use the speakers' voices to verify their identity and control access to various services such as database access, voice mail, banking by telephone, information services, security control for confidential information areas, and otherwise inaccessible computers. Speaker recognition methods can be divided into text-dependent (speech-dependent) and text-independent (speech-independent) techniques. The former discriminate among users on the basis of the same spoken letters, words, or numbers, whereas the latter do not rely on a specific utterance. As in any pattern recognition system, the speaker recognition system consists of a feature extraction part and a classification part. Speaker recognition can also be divided into two categories, supervised and unsupervised recognition. This project considers the supervised case, and a supervised voice recognition system was therefore developed.

Speaker recognition also encompasses both identification and verification of voices. Speaker identification is the method of deciding which registered speaker pronounced the word; speaker verification is the procedure of accepting or rejecting the identity claim of a previously identified speaker [12].

4.2.1 A Mel-Cepstral vocal sound analysis approach:

Consider a vocal signal S to be analyzed. First, we perform a short-time analysis on it: the speech signal is divided into overlapping frames of 256 samples, with an overlap of 128 samples. Each resulting segment is then windowed by multiplying it with a Hamming window of length 256 samples. The FFT (Fast Fourier Transform) computes the spectrum of each windowed sequence. The cepstrum of each windowed frame s[n] is then computed by applying the inverse Fourier transform to its log spectrum.

Then, we translate the regular frequencies to a scale that is more appropriate for speech. The Mel scale approximates the response of the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum. The cepstrum and the Mel-frequency cepstrum (MFC) differ in that, in the MFC, the frequency bands are equally spaced on the Mel scale. The Mel-frequency cepstral coefficients (MFCCs) are commonly obtained as follows:

1. Take the FFT of a windowed signal.

2. Map the powers of the spectrum onto the Mel scale, using triangular overlapping windows.

3. Take the log of the powers at each of the Mel frequencies.

4. Take the DCT (Discrete Cosine Transform) of the list of Mel log powers, as if it were a signal.

5. The MFCCs are the amplitudes of the resulting spectrum.

A sequence of MFCCs is thus obtained for each frame, and each MFCC set serves as a melodic cepstral acoustic vector. These acoustic vectors could be used as feature vectors directly, but more powerful speech features are needed. Hence, the MFCC acoustic vectors undergo a derivation process. The first-order derivatives of the Mel cepstral coefficients are computed as the delta Mel cepstral coefficients (DMFCCs); the delta-delta Mel-frequency cepstral coefficients (DDMFCCs) are then the derivatives of the DMFCCs, i.e., the second-order derivatives of the MFCCs. These Mel-cepstral coefficients are derived in order to model the intra-speaker variability: the computed DDMFC coefficients show how fast the voice of a speaker changes in time. A DDMFC acoustic vector is thus obtained for each frame of the initial speech signal S. Each acoustic vector is composed of 256 samples, but the speech information is mainly encoded in the first 12 coefficients. So, we truncate each vector to its first 12 samples, then position it as a column of a matrix. This truncated DDMFCC acoustic matrix represents a powerful

voice discriminator, therefore it can be successfully used as a vocal feature vector for speaker recognition. Each acoustic matrix has 12 rows and a number of columns depending on the length of the vocal signal S. Because of their different sizes, these speech feature vectors cannot be compared using linear metrics such as the widely known Euclidean distance. A solution would be to transform the acoustic matrices, through re-sampling or padding with zero values, so that they get the same dimensions and the Euclidean metric can be used; the disadvantage of this approach is the possible loss of valuable speech information from the feature vectors. There are many other possible speech feature vectors that can be obtained with this delta-delta Mel-cepstral analysis. For example, a vocal feature vector for signal S could be made from statistical values computed for each DDMFCC (or MFCC) acoustic vector of this sound signal [12].

Text-dependent voice recognition:

Text-dependent (speech-dependent) recognition techniques differentiate users on the basis of the same spoken words. Template matching is the most effective technique for the text-dependent recognition process; dynamic time warping (DTW) algorithms or hidden Markov model (HMM) methods are used extensively for this kind of voice recognition.

The DDMFCC-based feature extraction is performed and the feature vectors are obtained. Thus V(S), the feature vector of speech signal S, can be computed as the truncated 12-row delta-delta Mel cepstral acoustic matrix. Another feature solution tested is computing V(S) as the mean of the DDMFCC matrix; we then obtain V(S) as a one-dimensional vector containing the mean values of the columns of the acoustic matrix. Consider a sequence of same-speech input vocal utterances to be recognized, {S1, ..., Sn}. The feature extraction process is then applied to them, yielding the feature set {V(S1), ..., V(Sn)}.

The next stage is speaker classification based on pattern recognition. The voice identification system is accompanied by a supervised classifier; a minimum mean distance classification approach is proposed. First, a set of registered speakers is created. By collecting spoken words related to the same speech, provided by these speakers and filtered for noise removal, a training set is obtained. A vocal prototype is assigned to each speech signal during training, and the feature training sets are obtained by computing feature vectors from these prototypes. Consider N registered speakers; the resulting training set receives the form Sp = {Sp1, ..., SpN}, where each Spi = {Si1, ..., Si n(i)} represents the set of signal prototypes corresponding to the ith speaker. For each Sij, where i = 1, ..., N and j = 1, ..., n(i), the vocal feature extraction is performed, and the obtained sequence {V(S11), ..., V(S1 n(1)), ..., V(SN1), ..., V(SN n(N))} represents the feature training set of the classifier.

The next step is the minimum distance classification procedure. We consider N classes, one for each of the N registered speakers. The algorithm assigns each input vocal sequence Si to the class of the speaker with the smallest mean distance between the feature vector of the input signal and his prototype vectors, where 'd' represents the metric used. The classification result, consisting of N classes of speech utterances, also represents the speaker identification result: the closest registered speaker is thus identified for each input. In the next stage of the recognition process, speaker verification, the system has to decide whether the identified speaker is the one who really produced the utterance. We propose a threshold-based verification procedure to be performed within each resulting speaker class: each mean distance computed in a class must not exceed a specially chosen threshold value. Any threshold-based recognition approach implies the task of choosing a proper threshold value. We propose an automatic threshold detection method that takes as the threshold the overall maximal distance between any two prototype vectors belonging to the same training feature subset.

A high recognition rate, approximately 85%, has been reached by this speech-dependent voice recognition system, tested with 5 input vocal utterances and 3 registered speakers. The speech input signals and their corresponding feature vectors are represented in the next two figures. Each of these vocal utterances contains a single word: hello.

FIGURE 15. Experimental data of input speech signals [12].

FIGURE 16. Experimental Vocal feature vectors [12].

FIGURE 17. Training set and the corresponding feature set [12].

Using the values registered, we obtain the identification result: Speaker 1 = {S1, S5}, Speaker 2 = {S3, S4}, and Speaker 3 = {S2}. Computing the threshold value T = 1.3915, we get the final recognition: Speaker 1 = {S1, S5}, Speaker 2 = {S3}, Speaker 3 = {S2}, and Unregistered Speaker = {S4}.

FIGURE 18. Experimental output of word ‘READY’.

FIGURE 19. Experimental output of word ‘CALL’.


Text-independent voice recognition:

Text-independent (speech-independent) recognition systems require large volumes of training data to ensure that the entire vocal range is captured. Thus, they are useful for uncooperative subjects, for example those in surveillance systems. The most successful text-independent recognition methods are based on Vector Quantization (VQ) or the Gaussian Mixture Model (GMM). The VQ-based methods are non-parametric approaches which use VQ codebooks consisting of a small number of representative feature vectors, while the GMM-based methods are parametric techniques that represent each speaker by a mixture of K Gaussian distributions. We utilize the same delta-delta Mel cepstral analysis for the feature extraction part of this recognition system. Classical voice recognition techniques, like those based on Vector Quantization, produce MFCC-based one-dimensional feature vectors; here, bi-dimensional feature vectors are used, each vector V(S) being computed as the truncated DDMFCC acoustic matrix. The sequence of speech signals to be recognized, {S1, ..., Sn}, is no longer characterized by the same speech. A similar minimum mean distance based classifier is used, but with a different training set: we consider a large set of spoken letters/words, covering most of the English-language phonemes. Each registered speaker provides this speech several times, so the same text is obtained from all the prototype signals of Sp, and the speaker is identified by the same minimum-distance criterion. A threshold-based verification technique is again provided, but not an automatic one as in the previous case: T is a threshold value chosen analytically, such that each within-class mean distance does not exceed it, where C1, ..., Cn represent the identified voice classes. Many numerical tests using this approach were performed, and a high voice recognition rate was obtained [12].

FIGURE 20. Experimental Input waveforms vocal utterances [12].

FIGURE 21. Prototype vocal signals and their feature vectors [12].

The figure shows the prototype speech signals and their corresponding DDMFCC-based speech feature vectors. Then, the mean distance values between the input feature vectors and the training feature subsets are computed. Using the values registered in TABLE 2, we obtain the identification result: Speaker 1 = {S2, S6, S9}, Speaker 2 = {S1, S3, S7}, and Speaker 3 = {S4, S5, S8}. With the threshold value T = 7.67, we obtain the final recognition: Speaker 1 = {S2, S6, S9}, Speaker 2 = {S1, S3, S7}, Speaker 3 = {S4, S8}, and Unregistered Speaker = {S5}. This is the voice recognition technique used in this project for the assistive device [12].

4.2.2. Dynamic Time Warping Algorithm (DTW):

The Dynamic Time Warping algorithm (DTW) [Sakoe & Chiba] calculates an optimal warping path between two time series; the algorithm computes both the distance between the two series and the warping path itself.

Suppose we have two numerical sequences (a1, a2, ..., an) and (b1, b2, ..., bm); the lengths of the two sequences may differ. The algorithm starts by calculating the local distances between the elements of the two sequences; different distance measures can be used, and here the absolute distance between the values of the two elements (Euclidean distance) is chosen. This results in a matrix of distances having n lines and m columns. Starting with the local distance matrix, the minimal-distance alignment between the sequences is then determined by a dynamic programming algorithm with the optimization criterion aij = d(ai, bj) + min(a(i-1)j, ai(j-1), a(i-1)(j-1)), where aij is the minimal cumulative distance between the subsequences (a1, a2, ..., ai) and (b1, b2, ..., bj). A warping path through the minimal distance matrix is a path from the a11 element to the anm element consisting of those aij elements that have formed the anm distance. The global warp cost of the two sequences is GC = (1/p)(w1 + ... + wp), where the wi are the elements that belong to the warping path and p is the number of wi elements. The calculations made for two short sequences, including the highlight of the warping path, are shown in figure 1 of the cited work.

4.2.3. HMM Voice Recognition Algorithm Coding:

Even though voice recognition is done partly in the frequency domain, a still-unknown, brain-like algorithm would be needed to explain how the voice is divided into syllables and phonemes for recognition. Since there are too many unknowns about how the brain recognizes the voice through different paths and processes, it may be better to approach the problem with a probabilistic algorithm than with an analytic one. For this reason, two different voice recognition algorithms have been studied; the feature common to both is extracting the feature parameters of the speech signal. The NN (Neural Network) recognition algorithm first generates a large coefficient matrix through training on the characteristic feature parameters representing syllables or words, then calculates an output index by directly applying the feature parameters of an unknown new syllable or word to this coefficient matrix. Recognition with a neural network is time-consuming because the whole learning process must be run to build the large coefficient matrix, and if a new speech signal is added, the entire process must be repeated from the beginning. In the second method, the HMM (Hidden Markov Model) recognition algorithm, voice feature parameters are generated for every new input voice signal and used in the learning process to create a new HMM model. With an HMM model created for every word, all these models are compared with the test word during the testing phase to find the matching voice sample. The disadvantage of the HMM approach is that a new individual HMM model must be created for every new voice added, and each test utterance must be compared against all existing HMM models to find a match, which slows down the recognition speed. On the other hand, the HMM method is fast in initial training: when new voice information is added to the HMM database, only the new voice is used in the training process to create the new HMM model. Compared to the neural network algorithm, for a large number of speech samples the HMM algorithm provides a higher speech recognition rate [13].

4.3 Train

The connections between the Voice Recognition V3 Module and the Arduino are as follows:

Arduino VR Module

5V ---------> 5V

1 ---------> TX

0 ---------> RX

GND ---------> GND

Open vr_sample_train (File -> Examples -> VoiceRecognitionV3 -> vr_sample_train). Choose the right Arduino board (Tool -> Board; UNO recommended) and the right serial port. Click the Upload button and wait until the sketch is uploaded. Open the Serial Monitor, set the baud rate to 115200, and set the line ending to Newline or Both NL & CR.

FIGURE 22. Test on Serial Monitor [9].

Send the settings command (case insensitive) to check the Voice Recognition Module settings: type settings in the input field and press Send.

FIGURE 23. Input Command “settings” in serial monitor [9].

FIGURE 24. Settings of the voice module [9].

Train the Voice Recognition Module. Train record 0 with the signature "On" by sending the sigtrain 0 On command. When the Serial Monitor prints "Speak now", speak your voice command (it can be any word; a meaningful word is recommended, e.g., 'On' here), and when the Serial Monitor prints "Speak again", speak the same word again. If the two utterances match, the Serial Monitor prints "Success" and record 0 is trained; if they do not match, repeat the process until you get success.

When training, the two LEDs on the Voice Recognition Module can help your training process. After sending the train command, the SYS_LED blinks to prompt you to get ready; speak as soon as the STATUS_LED lights up, and the recording finishes when the STATUS_LED goes off. If that step succeeded, the SYS_LED blinks again and the cycle repeats. Passed: SYS_LED and STATUS_LED blink together. Failed: SYS_LED and STATUS_LED blink together quickly.

FIGURE 25. Input “Sigtrain 0 On” in serial monitor [9].

Train another record. Send the sigtrain 1 Off command to train record 1 with the signature "Off". Choose your favorite word to train (it can be any word; a meaningful word is recommended, e.g., 'Off' here).

FIGURE 26. Input “Sigtrain 1 Off” in serial monitor [9].

Send the load 0 1 command to load the two trained voice records into the recognizer.

FIGURE 27. Load 0 and 1 of voice samples [9].

FIGURE 28. Recognize the voice input [9].

Training is now finished. The train sample also supports several other commands [9]:

COMMAND   FORMAT                EXAMPLE          COMMENT
1. train     train (r0) (r1) ...   train 0 2 45     Train records
2. load      load (r0) (r1) ...    load 0 51 2 3    Load records
3. clear     clear                 clear            Remove all records in recognizer
4. record    record (r0)           record 0         Check record train status
5. vr        vr                    vr               Check recognizer status
6. getsig    getsig (r)            getsig 0         Get signature of record (r)
7. sigtrain  sigtrain (r) (sig)    sigtrain 0 ZERO  Train one record (r) with signature (sig)
8. settings  settings              settings         Check current system settings

4.4 Protocol
Base Format

1. Control

| Head (0AAH) | Length | Command | Data | End (0AH) |

Length = number of bytes of (Length + Command + Data)

2. Return

| Head (0AAH) | Length | Command | Data | End (0AH) |

Length = number of bytes of (Length + Command + Data)

NOTE: The Data area differs from command to command.

Code

ALL CODES ARE IN HEXADECIMAL FORMAT

1. FRAME CODE
AA --> Frame Head
0A --> Frame End

2. CHECK
00 --> Check System Settings
01 --> Check Recognizer
02 --> Check Record Train Status
03 --> Check Signature of One Record

3. SYSTEM SETTINGS
10 --> Restore System Settings
11 --> Set Baud Rate
12 --> Set Output IO Mode
13 --> Set Output IO Pulse Width
14 --> Reset Output IO
15 --> Set Power on Auto Load

4. RECORD OPERATION
20 --> Train One Record or Records
21 --> Train One Record and Set Signature
22 --> Set Signature for Record
5. RECOGNIZER CONTROL
30 --> Load a Record
31 --> Clear Recognizer
32 --> Group Control

6. THESE 3 CODES ARE ONLY USED FOR RETURN MESSAGES


0A --> Prompt
0D --> Voice Recognized
FF --> Error [10]

4.5 Details

1. Check System Settings (00)

Use the "Check System Settings" command to check the current settings of the Voice Recognition Module, including serial baud rate, output IO mode, output IO pulse width, auto load, and group function.
Format:
| AA | 02 | 00 | 0A |
Return:
| AA | 08 | 00 | STA | BR | IOM | IOPW | AL | GRP | 0A |
STA : Trained status (0-untrained 1-trained FF-record value out of range)
BR: Baud rate (0,3-9600 1-2400 2-4800 4-19200 5-38400)
IOM: Output IO Mode (0-Pulse 1-Toggle 2-Clear 3-Set)
IOPW: Output IO Pulse Width(Pulse Mode) (1~15)
AL: Power on auto load (0-disable 1-enable)
GRP: Group control by external IO (0-disable 1-system group 2-user group)

2. Check Recognizer (01)

Use the "Check Recognizer" command to check the recognizer of the Voice Recognition Module.


Format:
| AA | 02 | 01 | 0A |
Return:
| AA | 0D | 01 | RVN | VRI0 | VRI1 | VRI2 | VRI3 | VRI4 | VRI5 | VRI6 | RTN | VRMAP
| GRPM | 0A |
RVN: number of valid records in the recognizer (maximum 7)
VRIn (n=0~6): record at recognizer index n
RTN: total number of records in the recognizer.
VRMAP: valid record bit map for VRI0~VRI6.
GRPM: group mode indicate. (FF-not in group mode 00~0A-system group 80~87-user
group mode)

3. Check Record Train Status (02)

Use "Check Record Train Status" command to check if the record is trained.
Format:
Check all records
| AA | 03 | 02 | FF| 0A |
Check specified records
| AA | 03+n | 02 | R0 | ... | Rn | 0A |
Return:
| AA | 5+2n | 02 | N | R0 | STA | ... | Rn | STA | 0A |
*N: number of trained records.
**R0 ~ Rn: record.
STA: trained status (0-untrained 1-trained FF-record value out of range)

4. Check Signature of One Record (03)

Use this command to check the signature of one record.


Format:
| AA | 03 | 03 | Record | 0A |
Return:
| AA | 03 | 03 | Record | SIGLEN | SIGNATURE | 0A |
SIGLEN: signature string length
SIGNATURE: signature string

5. Restore System Settings (10)

Use this command to restore the Voice Recognition Module settings to their defaults.


Format:
| AA | 02 | 10 | 0A |
Return:
| AA | 03 | 10 | 00 | 0A |

6. Set Baud Rate (11)

Use this command to set the baud rate of the Voice Recognition Module; the setting takes effect after the module is restarted.
Format:
| AA | 03 | 11 | BR | 0A |
Return:
| AA | 03 | 11 | 00 | 0A |
BR: Serial baud rate. (0-9600 1-2400 2-4800 3-9600 4-19200 5-38400)

7. Set Output IO Mode (12)

Use this command to set the Voice Recognition Module output IO mode; it takes effect immediately after the instruction execution.
Format:
| AA | 03 | 12 | MODE | 0A |
Return:
| AA | 03 | 12 | 00 | 0A |
MODE: Output IO mode. (0-pulse mode 1-Toggle 2-Set 3-Clear)

8. Set Output IO Pulse Width (13)

Use this command to set the output IO pulse width of the Voice Recognition Module; it takes effect immediately after the instruction execution. The pulse width is used when the output IO mode is "Pulse".

Format:
| AA | 03 | 13 | LEVEL | 0A |
Return:
| AA | 03 | 13 | 00 | 0A |
LEVEL: pulse width level. Details:

- 00 10ms
- 01 15ms
- 02 20ms
- 03 25ms
- 04 30ms
- 05 35ms
- 06 40ms
- 07 45ms
- 08 50ms
- 09 75ms
- 0A 100ms
- 0B 200ms
- 0C 300ms
- 0D 400ms
- 0E 500ms
- 0F 1s

9. Reset Output IO (14)

Use this command to reset the output IO, or to generate a user-defined pulse on an output IO in set/clear mode.
Format:
| AA| 03 | 14 | FF | 0A | (reset all output IO)
| AA| 03+n | 14 | IO0 | ... | IOn | 0A | (reset output IOs)
Return:
| AA | 03 | 14 | 00 | 0A |
IOn: number of output io

10. Set Power On Auto Load (15)

Use this command to enable or disable "Power On Auto Load" function.


Format:
| AA| 03 | 15 | 00 | 0A | (disable auto load)
| AA| 03+n | 15 | BITMAP | R0 | ... | Rn | 0A | (set auto load)
Return:
| AA| 04+n | 15 | 00 |BITMAP | R0 | ... | Rn | 0A | (set auto load)
BITMAP: Record bitmap.{ 0 (zero record, disable auto load), 01 (one record), 03 (two
records), 07 (three records), 0F (four records), 1F (five records), 3F (six record), 7F
(seven records) }
R0~Rn: Record

11. Train One Record or Records (20)

Train records; several records can be trained at one time.


Format:
| AA| 03+n | 20 | R0 | ... | Rn | 0A |
Return:
| AA| LEN | 0A | RECORD | PROMPT | 0A |
| AA| 05+2n | 20 | N | R0 | STA0 | ... | Rn | STAn | SIG | 0A |
*SIG: signature string
**PROMPT: prompt string
Rn: Record
STA: train result (0-Success 1-Timeout 2-Record value out of range)
N: number of train success

12. Train One Record and Set Signature (21)

Train one record and set its signature, one record at a time.


Format:
| AA| 03+SIGLEN | 21 | RECORD | SIG | 0A | (Set signature)
Return:
| AA| LEN | 0A | RECORD | PROMPT | 0A | (train prompt)
| AA| 05+SIGLEN | 21 | N | RECORD | STA | SIG | 0A |
SIG: signature string
PROMPT: prompt string
STA: train result(0-Success 1-Timeout 2-Record value out of range)
N: number of train success

13. Set Signature for Record (22)

Set a record signature, one record at a time.


Format:
| AA | 03+SIGLEN | 22 | RECORD | SIG | 0A | (Set signature)
| AA | 03 | 22 | RECORD | 0A | (Delete signature)

Return:
| AA | 04+SIGLEN | 22 | 00 | RECORD | SIG | 0A | (Set signature return)
| AA | 04 | 22 | 00 | RECORD | 0A | (Delete signature return)
SIG: signature string
SIGLEN: signature string length

14. Load a Record or Records to Recognizer (30)

Load records (1~7) into the recognizer of the Voice Recognition Module; after execution, the Voice Recognition Module starts to recognize immediately.
Format:
| AA| 2+n | 30 | R0 | ... | Rn | 0A |
Return:
| AA| 2+n | 30 | N | R0 | STA0 | ... | Rn | STAn | 0A |
N: number of records loaded successfully
R0~Rn: Record
STA0~STAn: Load result (0-Success FF-Record value out of range FE-Record untrained FD-Recognizer full FC-Record already in recognizer)

15. Clear Recognizer (31)

Stop recognizing, and empty recognizer of Voice Recognition Module.

Format:
| AA | 02 | 31 | 0A |
Return:
| AA | 03 | 31 | 00 | 0A |

16. Group Control (32)

17. Group select

Set the group control mode (disable, system, or user). If the group control function is enabled (system or user), the voice recognition module is controlled by the external control IO.
Format:
| AA| 04 | 32 | 00 | MODE | 0A |
MODE: new group control mode. (00-disable 01-system 02-user FF-check)

Return:
| AA| 03 | 32 | 00 | 0A |
or
| AA| 05 | 32 | 00 | FF | MODE | 0A | (check command return)

18. Set user group

Set user group content (record).


Format:
| AA| 03 | 32 | 01 | UGRP | 0A | (Delete UGRP)
| AA| LEN | 32 | 01 | UGRP | R0 | ... | Rn | 0A | (Set UGRP)
UGRP: user group number
R0~Rn: record index number (n=0,1,...,6)
Return:
| AA| 03 | 32 | 00 | 0A | (Success return)

19. Load system group

Load a system group into the recognizer; this command clears the recognizer first.


Format:
| AA| 04 | 32 | 02 | SGRP | 0A |
Return:
| AA| 04 | 32 | SGRP | VRI0 | VRI1 | VRI2 | VRI3 | VRI4 | VRI5 | VRI6 | RTN | VRMAP
| GRPM | 0A |
SGRP: System group number.
VRIn (n=0~6): record at recognizer index n
RTN: total number of records in recognizer.
VRMAP: valid record bit map for VRI0~VRI6.
GRPM: group mode indicate. (00~0A-system group)

20. Load user group

Load a user group into the recognizer; this command clears the recognizer first.


Format:
| AA| 04 | 32 | 03 | UGRP | 0A |
Return:
| AA| 04 | 32 | UGRP | VRI0 | VRI1 | VRI2 | VRI3 | VRI4 | VRI5 | VRI6 | RTN | VRMAP | GRPM | 0A |
UGRP: User group number.
VRIn (n=0~6): record at recognizer index n
RTN: total number of records in recognizer.
VRMAP: valid record bit map for VRI0~VRI6.
GRPM: group mode indicate. (00~0A-system group)

21. Check user group

Check user group content.


Format:
| AA| 04 | 32 | 04 | 0A | (check all user group)
or
| AA| 04 | 32 | 04 | UGRP0| ... | UGRPn | 0A | (check user group)
Return:
| AA | 0A | 32 | UGRP | R0 | R1 | R2 | R3 | R4 | R5 | R6 | 0A |
UGRP: User group number.
R0~R6: Any record.

22. Prompt (0A)

The Prompt command is only used by the Voice Recognition Module to return data while the user trains a voice command.
Format:
NONE

Return:
| AA | 07 | 0A | RECORD | PROMPT | 0A |
RECORD: record which is in training
PROMPT: prompt string

23. Voice Recognized (0D)

The Voice Recognized command is only used by the Voice Recognition Module to return data when a voice is recognized.
Format:
NONE
Return:
| AA | 07 | 0D | 00 | GRPM | R | RI | SIGLEN | SIG | 0A |
GRPM: group mode indicate. (FF-not in group mode 00~0A-system group mode 80~87-
user group mode)
R: record which is recognized.
RI: Recognizer index value.
SIGLEN: signature length of the recognized record.

0: no signature, no SIG area


SIG: signature content

25. Error (FF)

The ERROR command is used to return the error status of the Voice Recognition Module.


Format:
NONE
Return:
| AA | 03 | FF | ECODE | 0A |

CHAPTER 5

GSM APPLICABILITY TO THE ASSISTIVE DEVICE

GSM is an acronym that stands for Global System for Mobile Communications. In 1984, GSM, a digital cellular network, was developed as a standard for a mobile telephone system that could be used across Europe, and it has since become an international standard for mobile service. Because of the high mobility it offers, subscribers can easily roam worldwide and access any GSM network. GSM offers much higher capacity than the older analog systems: a more optimal allocation of the radio spectrum allows a larger number of subscribers. Voice communications, the Short Message Service (SMS), fax, voice mail, and supplementary services such as call forwarding and caller ID are some of the services offered by GSM. The most common frequency bands GSM works on are 450 MHz, 850 MHz, 900 MHz, 1800 MHz, and 1900 MHz; some bands also have Extended GSM (EGSM) to increase the amount of spectrum available. GSM works with Time Division Multiple Access (TDMA) and Frequency Division Multiple Access (FDMA) [14].

FIGURE 29. Use of GSM in biomedical applications [14].

5.1 GSM Module – SIM900

A GSM/GPRS-compatible quad-band cell phone module works in the 850/900/1800/1900 MHz frequency ranges. It can be used not only to access the Internet, but also for voice communication and SMS, provided it is connected to a microphone and a small loudspeaker. The dimensions of the GSM SIM900 module are 0.94 in x 0.94 in x 0.12 in, with L-shaped contacts placed on four sides so that they can be soldered both at the bottom and on the sides. An ARM926EJ-S processor controls phone communication, data communication over an integrated TCP/IP stack, and a UART with a TTL serial interface; the processor also internally manages the module's communication with the circuit interfaced with it. A SIM card (3 V or 1.8 V) needs to be attached to the outer surface of the SIM900 module. The SIM900 device integrates an SPI bus, a PWM module, an A/D converter, an I²C bus, an RTC, and an analog interface. The radio section is GSM phase 2/2+ compatible and is either class 4 (2 W) at 850/900 MHz or class 1 (1 W) at 1800/1900 MHz. The TTL serial interface is in charge of communicating all the data relative to received SMSs and the data that come in during TCP/IP sessions in GPRS; the data rate is determined by GPRS class 10 (max. 85.6 kbps). The same TTL serial interface receives the commands for the circuit from the controlling microcontroller, which can be either standard AT or AT-enhanced SIMCOM commands. The module absorbs a maximum of 0.8 A during transmission and is supplied with a continuous voltage ranging between 3.4 and 4.5 V [11].

FIGURE 30. SIM900 [11].

FIGURE 31. SIM900 Pin Diagram [11].

5.2 GSM call processing

Call processing consists of the different steps that set up, maintain, and end a call. The American National Standard for Telecommunications glossary defines call processing as the sequence of operations performed by a switching system from the acceptance of an incoming call through the final disposition of the call: the end-to-end sequence of operations performed by a network from the instant a call attempt is initiated until the instant the call release is completed.

Initialization:

The first part of mobile call processing is initialization. The mobile gets a connection to a nearby cell site so that the cellular network can check the account; once a valid telephone number and account are verified, the call proceeds.

We need a connection to the cellular system; that is, we need a frequency to transmit on. The mobile checks a frequency list contained in its SIM card, the removable memory chip in the system. The bit streams carried by these frequencies are then checked, searching for a Broadcast Control Channel (BCCH) within one of them. Each BCCH transmits a unique data marker, so the mobile knows when it has found its channel. This is a big difference between AMPS and GSM: with AMPS, a dedicated radio frequency in each cell is used to set up calls, while with GSM any frequency can carry the setup information. It is the channel within the data stream that is important to find, not a specific radio frequency. A base station's Broadcast Control Channel continuously sends out identifying information about its cell site, such as the area code for the current location, the network identity, information on surrounding cells, and the use of frequency hopping. The BCCH is not a dedicated radio frequency; it is the channel within the bit stream carried by any of the frequencies in a cell.

The mobile acts as a scanning radio, checking any base station signals within range. It goes through each BCCH frequency, testing reception and measuring the received level for each channel. After this test, the GSM system decides which cell site should take the call; that is usually the cell site that delivered the highest signal strength to the system.
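The strongest-cell selection just described can be sketched as follows. The `BcchMeasurement` structure and the ARFCN/level values in the usage are illustrative placeholders, not data from this project.

```cpp
#include <cassert>
#include <vector>

// One BCCH carrier candidate: an ARFCN (absolute radio frequency
// channel number) and the received signal level measured on it.
struct BcchMeasurement {
    int arfcn;
    double levelDbm;
};

// Pick the carrier with the strongest received level, mimicking the
// cell-selection step described above. Returns -1 if nothing was heard.
int selectStrongestCell(const std::vector<BcchMeasurement>& scans) {
    int best = -1;
    double bestLevel = -1e9;
    for (const auto& m : scans) {
        if (m.levelDbm > bestLevel) {
            bestLevel = m.levelDbm;
            best = m.arfcn;
        }
    }
    return best;
}
```

For example, given scans of ARFCNs 1, 62, and 77 at -95.0, -70.5, and -88.0 dBm, the selection would settle on ARFCN 62, the least-attenuated carrier.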

FIGURE 32. Burst of bits [15].

Once homed in on the Broadcast Control Channel, the mobile monitors the ongoing data stream from the base station, searching for a frequency correction burst on the frequency correction channel (FCCH). The burst carries 142 bits with 3 tail bits in front and behind. This distinctive burst announces that synchronization bits will follow; with those bits, a wireless connection is set up between the mobile and the cellular system. Once that is done, the mobile and base station can communicate and everything can start working.
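A minimal sketch of recognizing that burst, assuming it is already demodulated into a 148-bit vector (3 tail bits + 142 fixed bits + 3 tail bits). On the FCCH the fixed bits are all zeros, which is what makes the burst so distinctive; the function name is my own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A GSM burst carries 148 bits: 3 tail bits, 142 payload bits, and 3
// more tail bits. On the FCCH the 142 fixed bits are all zeros, so
// detecting the burst reduces to checking that run of zeros.
bool isFrequencyCorrectionBurst(const std::vector<int>& burst) {
    if (burst.size() != 148) return false;
    for (std::size_t i = 3; i < 145; ++i) {  // skip the 3 tail bits each side
        if (burst[i] != 0) return false;
    }
    return true;
}
```

In a real receiver this pattern is found in the analog domain (the all-zero payload modulates to a pure tone offset from the carrier), but the bit-level view above matches the description in the text.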

The digital signature the mobile is searching for in the BCCH is one burst among many within a single GSM TDMA frame. Bits resemble single pulses of electrical energy, much like the dashes of a Morse code key. Morse code uses long and short pulses of energy to represent letters; the pulses used in digital radio do the same thing, but with uniform length. Voice and data are represented as groups of bits, and bits are also used for signaling.

GSM is a time-multiplexed system. Many calls share the same frequency, separated in time, like cars in a long lane; a new call must fit somewhere in the frequency band, like every third car. The synchronization bits provide the mobile and the base station with exact timing details for the coming conversation. Once our system is assigned a place in this digital freight train, it can send and receive information. Two parts make this possible: the radio part and the network part.
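The timing arithmetic behind that "place in the digital freight train" can be illustrated with the standard GSM numbers: 8 timeslots per TDMA frame and a frame period of 60/13 ms (about 4.615 ms). The helper below is a sketch of that bookkeeping, not code from the project.

```cpp
#include <cassert>

// GSM divides each carrier into TDMA frames of 8 timeslots. A frame
// lasts 60/13 ms, so one timeslot lasts 15/26 ms (about 0.577 ms).
// Given a frame number and an assigned slot, return the offset (ms)
// at which the mobile may transmit its burst.
double burstStartMs(long frameNumber, int timeslot) {
    const double frameMs = 60.0 / 13.0;   // TDMA frame duration
    const double slotMs = frameMs / 8.0;  // one of 8 timeslots
    return frameNumber * frameMs + timeslot * slotMs;
}
```

So a mobile assigned timeslot 4 keys up roughly 2.31 ms into each frame and must stay silent the rest of the time, which is why several calls can ride the same frequency.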

The radio subsystem, sometimes called the air interface, governs how we set up, maintain, and later tear down a radio connection from the mobile to the cell site. The network subsystem, or switching element, decides who gets on the system, how to set up the call, and how to terminate it; it also determines what services and resources the system can use. These two parts each work to help the other out, and a call will not go through if they do not work together in synchronization [15].

CHAPTER 6

RESULTS

TABLE 2. Speaker Dependent Sample Values from experimental data

No. of Speaker   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
      1           0.8512     0.7495     1.5512     1.6914     1.2527
      2           1.0957     0.935      1.1123     1.4814     1.4456
      3           1.3115     0.2556     1.5252     1.5013     1.5123

FIGURE 33. Graph of speaker dependent voice samples.

Voice samples for speaker-independent recognition are considered in the table below.

TABLE 3. Voice samples for speaker independent voice recognition from experimental data

Number of Speaker   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6   Sample 7   Sample 8   Sample 9
        1            5.6678     4.1606     5.9543     6.0646    10.6853     5.1522     7.5031     8.7624     3.2465
        2            3.4676     7.2581     3.8976     4.7342     9.7366     5.7855     4.8871     7.9964     8.0245
        3            6.2476     6.4327     4.3671     2.9857     8.6545     6.3879    10.8775     5.8976     4.9082

FIGURE 34. Graphical representation of samples of speaker independent voice recognition.

FIGURE 35. Experimental output waveform of number ‘0’.

FIGURE 36. Experimental output waveform of number ‘1’.

From the above waveforms, the relation of speech to time and frequency could be observed. Sampling of the voice was required to recognize the voices of different speakers. Voice samples of the numbers 0 through 9 were recorded for each speaker, so that the Arduino UNO could instruct the GSM module to begin the call initialization procedure.

The Arduino UNO analyzed the data for the GSM module to dial the number. The GSM module checks for a cellular network provider before initiating the call. Once the network is found by the GSM module, the following data appear on the serial monitor of the Arduino UNO.

FIGURE 37. Serial Monitor Data for GSM network.

Once the network is connected, the network provider allows the call to be initiated. The GSM module initialization steps are shown in the serial monitor.

FIGURE 38. Serial monitor showing experimental status of Call Ready.

A period of time is provided for the GSM module to locate and log on to the cellular network. The "Call Ready" status shows that the search for a communication channel is over and a call can be made to the other device. The GSM module SIM900 mostly uses the 900 MHz band for communication, but it searches other frequency channels too. Each command sent to the module is a string of text; these AT commands are considered the modem language.
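A minimal sequence of such AT commands for placing a voice call might look like the sketch below. The commands themselves (AT, AT+CREG?, ATD...;, ATH) are standard modem language; the helper function and the placeholder phone number in the usage are illustrative, not taken from the project.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build the minimal AT-command sequence for a voice call:
// probe the modem, check network registration, dial, and hang up.
// The trailing ';' on ATD selects a voice call rather than data.
std::vector<std::string> voiceCallSequence(const std::string& number) {
    return {
        "AT",                  // is the modem responding?
        "AT+CREG?",            // registered on the cellular network?
        "ATD" + number + ";",  // dial the number as a voice call
        "ATH"                  // hang up when finished
    };
}
```

On the device, each string would be framed with a carriage return, written to the module's serial port, and the next command sent only after the expected response arrives.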

The full connection of the assistive device using voice recognition for GSM calling is shown below.

FIGURE 39. Assistive device using voice recognition for GSM voice call.

CHAPTER 7

APPLICATIONS AND FUTURE SCOPE

7.1 Applications

The applications of the assistive device using voice recognition are as follows:

Healthcare (ASR for doctors to create patient records automatically):

Voice recognition with the GSM module for emergency calls helps doctors reach the patient as soon as possible. The number of fatalities among physically handicapped people increases every year, and the assistive device will prove vital in keeping track of this. In emergency cases, patients require immediate attention from the doctor, so a single phone call from the assistive device can save time and notify the doctor.

Help for the disabled (especially to access the web and control the computer):

In today's world, web access and computer control are important in any occupation. Disabled people have difficulty accessing their laptops and computers, so their voice can help them operate electronic devices, enabling them to work efficiently without problems. By improving the working environment for the disabled, assistive devices can, with more research, be innovative in the biomedical field.

Military:

Handheld devices provide speech-to-speech translation for the basic purpose of making communication simple. They are also used in fighter planes, where the pilot's hands are too busy to type. Voice recognition and GSM are vital for transferring data over the channel and keeping it encrypted; since information in military operations has to be kept secret, the assistive device can be useful here as well.

There are many other applications where the assistive device can be appropriate:

Automation of operator services

Automation of directory assistance

Voice banking services

Voice prompter

Directory assistance call completion

Reverse directory assistance

Information services

Agent technology

Customer care

7.2 Future scope

In the United States of America and Japan, Arduino-based robots are quite popular for their facial expressions and mirroring properties. Creating an emotional bond with the machine is one of the goals of the human-robot interface. Body language and facial expressions that a voice recognition system can read can also be used for threat assessment. Human workers can be replaced by efficient voice-controlled robots at airports and border crossings.

If you smile at a robot while you are having a conversation and it smiles back at you, this creates an emotional bond with the human during the conversation. The system might start adjusting to your behavior: it may mirror your responses if you are pleased with the robot, reciprocate an angry response, or work to defuse the situation. All of this depends on the machine's programming, so that all the functions can be performed accurately. Because of these advances, the potential applications and trends are moving forward.

CHAPTER 8

CONCLUSION
With a computer, multimedia hardware, and relevant technical papers in the public domain, I have designed a reasonable GSM calling method using a speaker-dependent voice recognition system for physically handicapped people. The accuracy is usually in the mid-80% range, as long as the environment is quiet. The key properties of the proposed platform are scalability and universality. The Mel-frequency cepstral coefficient method was applied for voice recognition, and the results were effective. Training the voice module on samples of the voice increased the recognition accuracy. Given that voice is volatile and the waveform changes every time the same word is spoken, the voice module V3 could still meticulously match the voice against the samples in memory. The GSM module SIM900 proved to be one of the most effective and economical components for connecting to the GSM network. Integrating the voice module V3 and the GSM module SIM900 with the Arduino UNO produces an assistive device that can be used for emergency calls and for getting immediate help. The platform is composed of readily available and relatively cheap hardware elements.

REFERENCES


[1] V. Rudzionis, R. Maskeliunas and K. Driaunys, "Voice controlled environment for
the assistive tools and living space control," in Computer Science and Information
Systems (FedCSIS), 2012 Federated Conference on, 2012, pp. 1075-1080.

[2] Arduino. (2012). {http://www.arduino.cc/en/Main/arduinoBoardUno}

[3] Arduino. (2012). {http://arduino.cc/en/Main/ArduinoGSMShield}

[4] MobileMark. (2011). {http://www.mobilemark.com/gsm-external-antennas.htm}

[5] Skelectronics. (2011). {http://www.nskelectronics.com/files/vr3_manual.pdf}

[6] Youtube. (2010). {https://www.youtube.com/watch?v=Xjzm7S__kBU&list=PL8}

[7] Revistaie. (2010). {http://revistaie.ase.ro/content/46/s%20-%20furtuna.pdf}

[8] Voice Recognition. (2010). {http://www.hitl.washington.edu/research.html}

[9] Github. (2010). {https://github.com/elechouse/VoiceRecognitionV3}

[10] Bo Cui and Tongze Xue, "Design and realization of an intelligent access control
system based on voice recognition," in Computing, Communication, Control, and
Management, 2009. CCCM 2009. ISECS International Colloquium on, 2009, pp.
229-232.

[11] OpenElectronics. (2009). {http://www.open-electronics.org/gsm-remote-control-part-4}

[12] T. Barbu, "Comparing various voice recognition techniques," in Speech Technology
and Human-Computer Dialogue, 2009. SpeD '09. Proceedings of the 5th
Conference on, 2009, pp. 1-6.

[13] Soon Suck Jarng, "HMM voice recognition algorithm coding," in Information
Science and Applications (ICISA), 2011 International Conference on, 2011, pp. 1-7.

[14] Pennula. (2011). {http://www.pennula.de/datenarchiv/gsm-for-dummies.pdf}

[15] Privateline. (2008). {http://www.privateline.com/PCS/callprocessGSM.html}


[16] Z. Fanfeng, "Application research of voice control in reading assistive device for
visually impaired persons," in Multimedia Information Networking and Security
(MINES), 2010 International Conference on, 2010, pp. 14-16.
