
ABSTRACT

Speech recognition is the ability of a system to identify the words and sentences of
human spoken language and convert them into a machine-readable form. Numerous speech
recognition assistants are available in the market, but they are not compatible with stammered
speech or with the speech of people who have difficulty speaking English
fluently. These speech impediments significantly affect the performance of speech recognition
systems, thereby limiting the ability of people with such impairments to
use these tools. It has been found that most of the devices
available in the market often fail on the speech of people with impairments
when tested against various speech disorders.
Our project concentrates on people who take somewhat longer to express
words in a particular language and on people who have a natural
stammer in their speech. This will help the large population with such issues. We would like to
improve these systems further by:
 Making the system capable enough to let people with language
difficulties easily use these products for faster input.
 Letting the system take input from people who have a natural
stammering problem.

TABLE OF CONTENTS

1. INTRODUCTION..............................................................................................8
2. PROJECT OVERVIEW....................................................................................10
2.1. LITERATURE SURVEY............................................................................10
2.2. PROBLEM DESCRIPTION.......................................................................12
2.3. REQUIREMENTS GATHERING..............................................................12
2.4. REQUIREMENT ANALYSIS...................................................................12
2.4.1 FUNCTIONAL REQUIREMENTS..................................................12
2.4.2 NON-FUNCTIONAL REQUIREMENTS.......................................12
2.5. DATA SOURCE........................................................................................12
2.6. COST ESTIMATION................................................................................13
2.7. RISK ANALYSIS......................................................................................15

2.8. SRS............................................................................................17

3. ARCHITECTURE & DESIGN........................................................................21


3.1 SYSTEM ARCHITECTURE....................................................................21
3.2 INTERFACE PROTOTYPING (UI).........................................................22
3.3 DATA FLOW DESIGN.............................................................................23
3.4 USE CASE DIAGRAM.............................................................................24
3.5 SEQUENCE DIAGRAM...........................................................................25
3.6 CLASS DIAGRAM....................................................................................26
3.7 STATE / ACTIVITY DIAGRAM..............................................................27
3.8 COMPONENT & DEPLOYMENT DIAGRAM........................................28
4. IMPLEMENTATION......................................................................................29
4.1 RECOGNISING PROCESS..............................................................29
4.2 MODELS...........................................................................................30
4.3 DATABASE DESIGN......................................................................32
4.3.1 ER DIAGRAM..............................................................32
4.3.2 RELATIONAL MODEL..................................................33
4.4 USER INTERFACE..........................................................................34
4.5 MIDDLEWARE................................................................................35
5. VERIFICATION & VALIDATION.................................................................36
5.1 UNIT TESTING..........................................................................................36
5.2 INTEGRATION TESTING........................................................................38

5.3 USER TESTING....................................................................................42

5.4 SIZE - LOC..............................................................................................43
5.5 COST ANALYSIS..................................................................................44
5.6 DEFECT ANALYSIS...........................................................................45
5.7 MCCALL'S QUALITY FACTORS.....................................45
6. EXPERIMENT RESULTS & ANALYSIS.....................................................47
6.1 RESULTS...................................................................................................47
6.2 RESULT ANALYSIS...............................................................................50
6.3 CONCLUSION & FUTURE WORK.......................................................51
7. PLAGIARISM REPORT..................................................................................52
8. RESEARCH PAPER.......................................................................................52
9. REFERENCES.................................................................................................62

LIST OF FIGURES

Figure 1 Risk Analysis........................................................................................................................... 16


Figure 2 System Architecture.............................................................................................................. 21
Figure 3 Interface Prototyping............................................................................................................. 22
Figure 4 Dataflow Diagram.................................................................................................................. 23
Figure 5 Use Case Diagram.................................................................................................................. 24
Figure 6 Sequence Diagram................................................................................................................. 25
Figure 7 Class Diagram......................................................................................................................... 26
Figure 8 State/Activity Diagram........................................................................................................... 27
Figure 9 Deployment Diagram............................................................................................................ 28
Figure 10 Signal processing................................................................................................................. 29
Figure 11 Acoustic models: Template and the state representation of the word “cat”......................30
Figure 12 The alignment path with the best total score identifies the word sequence and
segmentation...................................................................................................................................... 30
Figure 13 Models for Speech Recognition...........................................................................................31
Figure 14 ER Diagram.......................................................................................................................... 32
Figure 15 Relational Model.................................................................................................................. 33
Figure 16 User Interface 1................................................................................... 34
Figure 17 User Interface 2................................................................................... 34
Figure 18 Spyder Main Screen............................................................................................................. 35
Figure 19 Block Diagram of Training Phase..........................................................................................36
Figure 20 Block Diagram of Testing Phase...........................................................................................36
Figure 21 McCall's Quality Factors...................................................................... 45
Figure 22 Time taken by Google and SR..............................................................................................49
Figure 23 Comparison between Google and SR...................................................................................50
Figure 24 : The Processing of Speech Signals.......................................................................................57
Figure 25: The structure of a neural network with feedback..............................................................59
Figure 26 : Technical model of a neuron is represented......................................................................60
Figure 27 : Structural diagram of two-layer neural network...............................................................60

LIST OF TABLES
Table 1 Literature Survey.........................................................................................................11
Table 2 Cost Estimation...........................................................................................................13
Table 3 Project Schedule..........................................................................................................15
Table 4 Conducted Test Cases.................................................................................................37
Table 5 User Testing Results...................................................................................................41
Table 6 Cost Categories...........................................................................................................43
Table 7 Cost Prediction Cases..................................................................................................44

LIST OF ABBREVIATIONS

1. SR – SPEECH RECOGNIZER
2. HMM – HIDDEN MARKOV MODEL
3. SRS – SOFTWARE REQUIREMENT SPECIFICATION
4. OS – OPERATING SYSTEM
5. ASR – AUTOMATIC SPEECH RECOGNITION
6. UI – USER INTERFACE
7. LDC – LINGUISTIC DATA CONSORTIUM
8. SRPI – SPEECH RECOGNITION FOR IMPAIRED
9. SLP – SPEECH LANGUAGE PATHOLOGIST
10. ER – ENTITY RELATIONSHIP
11. DFD – DATA FLOW DIAGRAM
12. LOC – LINES OF CODE

1. INTRODUCTION
Distinct thoughts formed in the mind of the speaker are conveyed by speech in the form
of words or sentences, using appropriate linguistic rules. Speech is the essential
method of communication among human beings. By classifying speech into voiced and unvoiced segments, a
rudimentary acoustic division of speech can be obtained. In progression, the individual sounds, called phonemes,
correspond roughly to the sounds of each letter of the alphabet that makes up
human speech. A large portion of the information in the digital world is accessible only to the
few people who can read or understand a particular language. Language
technologies can provide solutions in the form of natural interfaces so that digital content can
reach the masses and facilitate the exchange of information across people
speaking different languages [4]. These technologies play an essential role in a multilingual
region such as India, which has around 1,652 dialects/local languages. Speech-to-text
conversion takes input from a microphone as speech, which is then converted
into text form and displayed on the desktop. Speech processing is the study of speech
signals and the various techniques used to process them. It is used in
applications such as speech coding, speech synthesis, speech recognition and speaker
recognition. Among these, speech recognition
is the most significant. The primary purpose of speech recognition is to convert the
acoustic signal obtained from a microphone or a telephone into a set of words. To
extract and determine the semantic information conveyed by a speech wave, we need to use computers or
electronic circuits. This is done for several applications, such as security
devices, household appliances, mobile phones, ATMs and PCs.

1.1 TYPES OF SR
SR systems can be divided into a number of classes based on their ability to recognize
words and the word lists they have. A few SR classes are described below:
Isolated Speech - Isolated words usually involve a pause between the distinct utterances of a
word. This does not mean that the system accepts only a single word, but rather that it requires one
utterance at a time.
Connected Speech - Connected words and speech are similar to isolated speech, but the system
allows distinct utterances with minimal pause between them.
Continuous Speech - Continuous speech lets the person speak in a natural way; it can also be
termed computer dictation.
Spontaneous Speech - This is a kind of natural speech that is not rehearsed. An ASR
system with spontaneous speech ability should be able to handle a lot of natural features of
speech, such as words that run together.

2. PROJECT OVERVIEW

2.1 LITERATURE SURVEY


The following are a few papers which we used for the literature survey on the topic of
speech recognition.
Ji Ming et al. [15] discuss conventional speech enhancement methods. This
paper was published by IEEE Transactions in March 2017. It explains that
conventional speech enhancement methods, based on frame, multi-frame, or segment
estimation, require knowledge about the noise. Their results show that
directly matching long speech segments based on ZNCC could significantly increase
noise immunity for speech enhancement without requiring knowledge of the noise.

Ramji Srinivasan et al. [16] describe methods to separate
speech and to remove ambient noise from it. This paper was published by IEEE Transactions in
July 2013. It addresses single-channel speech separation,
assuming unknown, arbitrary temporal dynamics for the speech signals to be separated. A
data-driven approach is described, which matches each mixed speech segment against a
composite training segment to separate the underlying clean speech segments. The results
demonstrate the significance of matching the longest speech segments for speech
separation, in terms of improving performance over conventional frame-by-frame separation
algorithms on all measures.

Shweta Khara et al. [17] describe techniques for
extracting features from speech and for classifying stuttering in speech. This paper was
published at an IEEE Conference in 2018. It discusses the feature extraction and
classification techniques applied in automatic speech recognition, and presents a
comparative analysis of stuttering-detection techniques on the basis of accuracy, sensitivity, specificity
and dataset size. It concludes that Automatic Speech Recognition (ASR) systems
are gaining significance, and hence the use of those systems by patients and by Speech-
Language Pathologists (SLPs) is also increasing. The key issue in ASR is accuracy;
therefore, more robust and accurate systems that can detect the rate of stuttering severity
need to be developed.

Soumya Priyadarsini Panda et al. [18] give an overview of the
automated speech recognition system and the wide application of the technology in the
advancement of human-computer interactive systems. It was published at an IEEE
Conference in 2017. The author concludes that the performance of speech recognition systems
depends on accuracy and speed. Accuracy may vary with vocabulary size and
confusability, and depends on the accuracy of the language model.
Table 1 Literature Survey

1. Title: Speech Enhancement Based on Full-Sentence Correlation and Clean Speech Recognition
   Authors: Ji Ming & Danny Crookes | Published by: IEEE Transactions, March 2017 | Data sets: self-trained datasets

2. Title: CLOSE—A Data-Driven Approach to Speech Separation
   Authors: Ramji Srinivasan, Ji Ming & Danny Crookes | Published by: IEEE Transactions, July 2013 | Data sets: Wall Street Journal Phase I (WSJ0) database

3. Title: A Comparative Study of the Techniques for Feature Extraction and Classification in Stuttering
   Authors: Shweta Khara, Shailendra Singh & Dharam Vir | Published by: IEEE Conference, 2018 | Data sets: self-trained datasets

4. Title: Automated Speech Recognition System in Advancement of Human-Computer Interaction
   Author: Soumya Priyadarsini Panda | Published by: IEEE Conference, 2017 | Data sets: self-trained datasets

2.2 PROBLEM DESCRIPTION
Many SR systems can be found in the market, but these devices lack effective speech
recognition for people with speech ailments. These devices can recognize the speech of a
typical speaker with a high degree of performance, but cannot detect the speech of a person with
speech ailments such as stuttering, or of people who have difficulty speaking
English fluently.
2.3 REQUIREMENTS GATHERING
It is the process of collecting a series of requirements from different focus
groups. These become the basis of the formal definition of requirements. The requirements are
listed below:
 To build a system that can identify words of human spoken language.
 To build a system that converts the human spoken language into a machine readable
form.
 To build a system that helps people who are partially impaired in speech. This
includes people who have problem in speaking a certain language or have a natural
stuttering in speech.
 To build a system that detects and understands stuttered speech.
 To build the system using the Hidden Markov Model (HMM).
 To be able to perform speech recognition with greater efficiency and performance.

2.4 REQUIREMENT ANALYSIS


2.4.1 FUNCTIONAL REQUIREMENTS
 The system runs on Python, so Python should be preinstalled on the
device.
 The system should be able to convert the user's speech into text
accurately.
 The system should be compatible with the host device.
 The system should be secure.
 The system should provide the maximum number of true positive results.
 The system should produce the minimum number of false positive results.

2.4.2 NON-FUNCTIONAL REQUIREMENTS

 The system should be reliable and should be easy to use for the user.

 It should work in all the versions of Mac OS, Linux and Windows 7 and
above.
 It should fulfil all the safety requirements of the system.
 It should have high accuracy and performance.

2.5 DATA SOURCE


Datasets for normal speech were available from the Linguistic Data Consortium and UCI, but
in the case of stammering, no proper datasets were available.
So we had to take inputs in real time: we took 5 subjects and obtained 26
samples from them.
These samples were then compared with Google's speech recognition; time and accuracy
were measured and graphs were obtained.
2.6 COST ESTIMATION
 Cost of the system is an important part of deploying the system.
 We can define cost in the following categories:
Table 2 Cost Estimation

 We can use Lee and Stolfo's model for cost estimation: a risk analysis procedure to
select sensitive data/assets and create a cost matrix.
 The model divides the cost items into damage cost, operational cost and response cost, and
combines them to calculate the total cost.
 The model divides resources into three levels:

 First: involves using few resources to save money. These resources can easily
be obtained at the beginning of an event (e.g., the destination service).
 Second: involves the use of a moderate number of resources during the event.
These resources are computed at any point during an event and are
maintained throughout the event's duration.
 Third: uses resources of several events within a window.
Each level has a cost: Level 1 = 1 to 5, Level 2 = 10, Level 3 = 100.
 The total cost over all events can then be given by:

Total cost = Σ (consequential cost + operational cost)

where the consequential cost = damage cost + response cost, and the operational cost
is given by the level weights above.

2.7 RISK ANALYSIS
 There can be situations where accent creates trouble, since African and Asian (and
other regions') accents differ considerably.
Components of the risk matrix:
 Severity
 Probability
 Risk assessment

Figure 1 Risk Analysis

2.8 SRS
2.8.1 Introduction
Purpose
We have designed this document to present a description of Speech Recognition for
Partially Impaired People. This document explains the purpose, features and enhancements of the
project, along with the system interface and the conditions required for its operation. This
document is written for developers and users.
Product Scope
This software will be an upgrade to the current speech recognition systems, allowing
people with stammering issues to speak at their own convenience. It will be available
for many computer environments, such as Linux, Mac OS X and Windows. The main focus of the
software will be on ease of use for people who stammer. It can be used by non-
commercial users and by organizations for business purposes. The software also works for
typical speakers and is capable of accessing a few system and web controls; one can also
access YouTube links according to their needs.
2.9.2 Overall Description
Product Perspective
This product is an update to existing speech recognition systems. Our software will
contain new and productive features, such as making it usable for partially impaired people.
Product Function
The product will let the user control his or her speech input at their own convenience, allowing users to
speak within a duration they set themselves. This will meet the needs of partially impaired people
while requiring the user to carry a minimal number of devices.
User Classes and Characteristics
This product can be of use to:
 Individuals:
o A user can use it for non-commercial purposes, like accessing websites or
handling system controls.
 Organizations:
o Organizations can use this application for implementation purposes.
2.8.3 Operating Environment
Operating Environment for SRPI will be Windows 2003 or newer, MAC OS X, Linux.
2.8.4 Design and Implementation Constraints
 GUI is only in English

 Only the registered version has the stammering option.
 The option needs to be chosen before speaking.
Assumptions and Dependencies
For using SRPI, one needs a high-speed Internet connection for faster results, so it is assumed
that the user has internet connectivity. Also, we will try our best to provide fast speech-to-text
conversion, but it depends on the data connection speed.
2.8.5 External Interface Requirements
User Interfaces
 To begin, the user needs to open SRPI.
 On the main screen of SRPI, the user can find the following options:
o Normal People
o Disabled People
o Control Your System
 The user can change the default theme to one of the given options.
Hardware Interfaces
This software is open source. It requires only minimal hardware
for the system or device, such as:
 Intel Pentium III or AMD 800 MHz, or a later version
 RAM: 256 MB or above
 Disk space: 1.5 GB (minimum)
Software Interfaces
 Windows: 7 or above (Vista may be supported depending on the
hardware)
 Mac OS: OS X (Lion or above)
 Linux
Communications Interfaces
This is a network-controlled application and requires communication with a network server. It needs:
 A high-speed internet connection:
o For better and faster conversion
o For more accurate input

2.8.6 System Features
This software features all the important options that can be used in speech recognition. It
provides an idea for improving current AI speech recognition systems to help partially
vocally impaired people.
Options:
Description and Priority
This allows the user to make a choice according to his or her needs, and also helps the user switch
between options for different users on a single system.
Functional Requirements:
It gives the user easy control. The user must choose the correct option as stated to
get the correct output.
Description and Priority
This use case provides the option of choosing between Normal People and
Disabled People. This feature is only available for commercial users, and one must have
SRPI to use it.
2.8.7 Functional Requirements:
It provides two options, Normal People or Disabled People; the user can choose as per
his/her requirement. It also has many other options, like recording meetings, chat,
file sharing, etc.
2.8.8 Other Nonfunctional Requirements
Performance Requirements
There are no specific performance requirements for this application, except that the user requires a
high-speed internet connection. OS requirements are as follows:
Windows: 7 or above
Mac OS X: Lion or above
2.8.9 Safety Requirements
There are no specific safety requirements for now; still, a constant-speed internet connection
should be available for the most accurate results.
2.8.10 Security Requirements
The security requirements are the same as those stated in section 5.2.
Software Quality Attributes
a. Functionality, usability, reliability, performance and supportability.
b. Availability, serviceability and installability.

19
Business Rules
The application will be free for customers and organizations for the first three
years. After that, existing users will pay $1.66 per year and new users will pay $2.50 per
year.

3 ARCHITECTURE AND DESIGN
3.1 SYSTEM ARCHITECTURE
In this speech recognition system we have used the Hidden Markov Model (HMM)
algorithm to detect speech signals and convert those speech signals into text. We have
used the Hidden Markov Model because it provides results with higher accuracy compared to
other algorithms. It uses probability functions to predict the most likely words spoken by the user.
The following figure shows the proposed architecture (Figure 2). Here speech
samples are fed into the data signal interface, from which the required features are extracted.
These features go through a series of recognition steps (phonetic training, word class
training, transition probabilities and probability density functions) to convert the speech into text.
The database consists of all the datasets against which each of the phonemes is compared, and
the word with the highest probability is selected.
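To make the decoding step concrete, the sketch below implements the standard Viterbi algorithm for a small discrete HMM in Python; the transition and emission values are illustrative toy numbers, not the parameters trained in this project:

    import numpy as np

    def viterbi(obs, pi, A, B):
        # obs: observation indices; pi: initial probs; A: transitions; B: emissions.
        n_states, T = A.shape[0], len(obs)
        delta = np.zeros((T, n_states))           # best log-probability so far
        psi = np.zeros((T, n_states), dtype=int)  # back-pointers
        delta[0] = np.log(pi) + np.log(B[:, obs[0]])
        for t in range(1, T):
            scores = delta[t - 1][:, None] + np.log(A)  # scores[i, j]: i -> j
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
        path = [delta[-1].argmax()]
        for t in range(T - 1, 0, -1):
            path.append(psi[t][path[-1]])
        return path[::-1]

    # Toy 2-state model with 3 observation symbols (illustrative numbers only).
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(viterbi([0, 1, 2], pi, A, B))  # most likely hidden-state sequence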

Figure 2 System Architecture

3.2 INTERFACE PROTOTYPING
Interface prototyping is an iterative development process which results in a User
Interface (UI). UI prototypes have a large number of uses: they allow
us to check all the probable problems that may arise in the system by discussing it with the
stakeholders, they help us find the best solution for the system, and they form a basis for checking
the usability of the system.
To define any interface there are a few steps that must be followed. The diagram below (Figure
3) represents all those steps.
First we need to define our requirements. In our case the need is a system which
takes the speech signals from users and converts them into text. For our project we are
using an algorithm which can understand the speech of disabled people and convert it into
text.
The second and third steps are to build and evaluate a prototype. The user interface diagram of our
system is provided in the next section. Note that what we have described as a result of this
prototype is the resulting UI. Since our project is limited to the algorithm itself, we have
not developed any specific front end.

Figure 3 Interface Prototyping

3.3 DATA FLOW DESIGN
The figure below shows the data flow diagram (Figure 4). A data flow diagram represents
how the data of a process flows. It also shows the input and output of every entity
and its processing. A data flow diagram contains no specific control flow or loops; it only shows how
data flows through the architecture.
Our diagram shows how the user's data passes through the different entities and how it is
processed by the different language models. The processing of the speech signal takes place in
the server, where the received speech string is passed to syntactic analysis and then to
semantic analysis. Here our Hidden Markov Model algorithm is used for speech
identification. After the speech has been recognised, it is sent to the launch action, which
displays the text to the user.

Figure 4 Dataflow Diagram

3.4 USE CASE DIAGRAM
The following figure (Figure 5) depicts the use case diagram. A use case diagram is a
behaviour diagram, developed at an early stage of the development
process, that helps us model the functionality of a system. The functionality of a system is
displayed using actors and use cases. Use cases are the set of functionalities, services and
actions that a system needs to perform; they make the functionalities
and services of a system easy to understand. The actors are the roles defined in the system
which operate the functionalities within it.
In our system, the actor is the user himself. First the user starts
the system. After starting the system there are two options for the user: one for a normal
speaker and the other for a person with disabilities (stuttered speech). If the user selects the
option for a normal speaker, the microphone records the speech input of the user until there
is a pause; if the user selects the option for a disabled speaker, the microphone records
the speech of the user and waits until he or she processes and speaks the next word.
This continues until it gets the command from the user that he or she has completed the
speech. After an option is selected, the microphone starts recording the voice input. Once the
voice signal has been recorded, the analog signal is converted into digital form. These
digital signals are processed by the algorithm we are using, which converts the digital
signal into text. After the speech has been converted into text, the user stops the system. A
minimal sketch of this two-mode listening logic is given after the figure below.

Figure 5 Use Case Diagram
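As an illustration of the two listening modes, the sketch below uses the third-party SpeechRecognition package; the pause lengths and the stop phrase are hypothetical choices, not values fixed by this project:

    import speech_recognition as sr  # pip install SpeechRecognition

    def listen(mode="normal"):
        r = sr.Recognizer()
        # A longer pause_threshold makes the recognizer wait through the
        # longer gaps that occur in stuttered speech (values are illustrative).
        r.pause_threshold = 0.8 if mode == "normal" else 3.0
        phrases = []
        with sr.Microphone() as source:
            r.adjust_for_ambient_noise(source)
            while True:
                audio = r.listen(source)
                try:
                    text = r.recognize_google(audio)  # online recognizer
                except sr.UnknownValueError:
                    continue  # nothing intelligible; keep listening
                if mode == "disabled" and text.lower() == "i am done":
                    break  # hypothetical stop command for the disabled mode
                phrases.append(text)
                if mode == "normal":
                    break  # normal mode stops after the first long pause
        return " ".join(phrases)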

3.5 SEQUENCE DIAGRAM
Sequence diagrams are used to describe the communication between the different parts of
a system in sequential order, i.e. in the order in which the interactions take place. This diagram is
used to understand the different requirements of a new or an existing system.
Lifelines, depicted as vertical lines, are used in a sequence diagram to show the individual
participation of the actors.
In our system we have considered five participants: user, audio wave, microphone, main interface and
Hidden Markov Model. The diagram shows the sequence of steps that are followed in our
system. This is only one imagined scenario; many more possibilities exist.

Figure 6 Sequence Diagram

3.6 CLASS DIAGRAM
A class diagram is a static structure diagram that determines how the system
will be structured; it describes the system's classes and attributes and the relationships among
the different objects.
Below is the class diagram (Figure 7). Here we have different classes which have
different attributes and perform different functions, and each of them is related to the others.
For instance, the Main Activity class starts the speech recognition system, and eventually the
microphone is started. The view uses data from the microphone and from the SQLite
database and sends these data to prepare_stream to start the processing and then convert the
signal into text.

Figure 7 Class Diagram

3.7 ACTIVITYDIAGRAM
An activity diagram shows the activities happening throughout the process. In this speech
recognition system we have one main actor, the user.
In the diagram below (Figure 8), the user first starts the speech recognition
system. After starting, the system provides two options to the user: one option is
for the normal speaker and the other is for the disabled speaker. If the user selects the option for a
normal speaker, the microphone records the speech input of the user until there is a pause
in the speech; if the user selects the option for a disabled speaker, the microphone
records the speech of the user and waits until he or she processes and speaks the next word.
This continues until it gets the command from the user that the speech is complete.
After this the microphone converts the analog signal into digital form and
processes it through a series of language models. After the signals are processed, the
converted speech is displayed to the user.

Figure 8 State/Activity Diagram

3.8 DEPLOYMENT DIAGRAM
A deployment diagram is useful for defining the structure of the final system and the connection of
software and hardware. We are only working on a new algorithm that can give better
results than the existing algorithms used for speech recognition, so the
given diagram is only a representation of how the final system may look.
There are basically three important components in our project. The source of the system
can be a PC, laptop or smartphone. Inside the voice system there is a microphone
that collects all the voice samples, converts them to digital signals and feeds these converted
signals into the language model and the algorithm. Here the signals are compared with the
existing database present in the system, and the words with the highest probability are selected.
After this, the converted signal is sent to the output, where it is displayed to the user as text.
The following figure shows the deployment diagram:

Figure 9 Deployment Diagram

4 IMPLEMENTATION
4.1 RECOGNISING PROCESS
In order to recognise speech, the process is as follows: the waveform is taken and
broken into utterances, split apart at the silences. In
doing this we need to consider all possible combinations of words and match them against
the provided audio. Hence, we get the most suitable combination.

Figure 10 Signal processing

A few of the necessary concepts are considered below.

First is the feature concept: since we have a lot of parameters, optimization is necessary.
The speech is divided into frames; taking each frame
to be 10 ms long, we extract 39-40 numbers representing the speech. Such a set of numbers
is known as a feature vector.
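As a concrete sketch of this feature extraction (assuming the third-party librosa package; the actual front end used in this project may differ), 13 MFCCs plus their first and second derivatives give the classic 39 numbers per 10 ms frame:

    import numpy as np
    import librosa  # pip install librosa

    y, rate = librosa.load("sample.wav", sr=16000)   # hypothetical input file
    hop = int(0.010 * rate)                          # 10 ms frame step
    mfcc = librosa.feature.mfcc(y=y, sr=rate, n_mfcc=13,
                                n_fft=int(0.025 * rate), hop_length=hop)
    delta = librosa.feature.delta(mfcc)              # first derivative
    delta2 = librosa.feature.delta(mfcc, order=2)    # second derivative
    features = np.vstack([mfcc, delta, delta2])      # 39 numbers per frame
    print(features.shape)                            # (39, number_of_frames)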
Second is the model concept. A model describes, mathematically, a few objects
which collect the similar attributes of a spoken term. In the model concept, the following
questions are raised:
 Does the model describe reality well?
 Can any improvement be made to the internal problems of the model?
 Is the model adaptive to changing conditions?
Such a model is the Hidden Markov Model (HMM). Being a generic type of model, it
describes how a black-box channel works, and it describes a sequential process such as
speech. The HMM proves to be a very practical approach for decoding speech.

Figure 12 Identifying sequence of words and segmenting them using alignment path

Third is the matching process. Matching the feature vectors against the models takes
some time, so the search is sped up using several optimization tricks.
4.2 MODELS
We use three models to recognize speech and perform matching:
First is the acoustic model, which contains acoustic properties for each senone. A
context-independent model contains properties that have no relation to the context.

Figure 11 Acoustic models: Template and the state representation of the word “cat”

Second is the phonetic dictionary. It maps words to phones. This mapping is not very
powerful, and the dictionary is not the only technique for mapping words to
phones; more complex functions learnt with machine-learning algorithms can also be used.

Third is the language model, which we use to restrict the search for words.
N-gram models and finite-state language models are used: an N-gram model contains statistics of
word sequences, while a finite-state language model defines the possible sequences of speech by states.
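For illustration, a tiny bigram (N-gram with N = 2) model can be estimated from counts; the three-sentence corpus here is a hypothetical example, and no smoothing is applied:

    from collections import Counter

    corpus = [
        "my name is yash",
        "my college is srm university",
        "my surname is singh",
    ]

    bigrams = Counter()
    unigrams = Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))

    def bigram_prob(prev, word):
        # P(word | prev) by maximum likelihood (no smoothing in this sketch).
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    print(bigram_prob("my", "name"))   # 1/3: "my" is followed by "name" once in three
    print(bigram_prob("is", "singh"))  # 1/3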

The above entities help to recognize the given speech. The diagram given below
depicts the three models used in SR.

Figure 13 Models for Speech Recognition

4.3 DATABASE DESIGN
4.3.1 ER DIAGRAM
An entity-relationship diagram represents the relations between different entities. In our diagram
we have five main entities: User, Speech Recognition System, Microphone, Speech
Recognizer and Output. The user has attributes like user ID and name, with which he or she can access the
system. The Speech Recognition System consists of two options from which the user can select:
one option is for the normal speaker and the other option is for the disabled speaker. The microphone
takes the voice input, converts the analog voice input into a digital signal, and then sends
the digital input to the speech recognizer. The speech recognizer consists of the phonetic
dictionary, the language model and the acoustic model. These three models process the speech
signal and convert it into text. Finally, the speech recognizer sends the recognized text to the
output.

Figure 14 ER Diagram

4.3.2 RELATIONAL MODEL
A relational model determines the perspective of database management and how to
manage the data using a structured language. It is a declarative method which specifies the
data and the queries in tables. The relational model helps us determine the structure of our
database, how to store the data in the database, and how to retrieve the data from
the database.
Below is our relational model. It consists of the following tables: User, Speech
Recognition System, Speech Recognizer, Microphone and Output. Each of these tables
contains different columns and stores different types of data.

Figure 15 Relational Model

4.4 USER INTERFACE
Below is a representation of the Speech Recognition System UI. Since we have not
worked on the frontend of the system, this is a snapshot of our output window.
The UI consists of three options: 1. Normal People,
2. Disabled People, and
3. Control Your System.
Figure 16 User Interface 1

Figure 17 User Interface 2

4.5 MIDDLEWARE
Since we are working on the algorithm, we do not need any specific middleware, but we have
used one tool to ease our work. The Hidden Markov Model algorithm code is written
in the Python language, and we have used the Spyder tool to work with Python.
Spyder is an open-source platform for development using the Python
programming language. Spyder comes with Anaconda, and the best part about it is that most of the libraries are
already included; if a library is not included, one can use Spyder's user interface to activate
it easily. There is no need to use the command prompt.
Spyder also has better features for visual analysis of results, as we can
zoom, mark, etc. in graphs. The following is a screenshot of the Spyder software.

Figure 18 Spyder Main Screen

5 VERIFICATION AND VALIDATION
5.1 UNIT TESTING
Unit testing is a testing method in which individual units of source code are tested.
This test is usually done to examine the functionality of the smallest units of the system.

Figure 19 Block Diagram of Training Phase

But for speech recognition software there are no specific unit testing methods available.
Here we just take the similarity measure and analyse the score. So we have performed the test
by running all the functionalities of the system and checking whether each functionality
runs properly.

Figure 20 Block Diagram of Testing Phase

Table 4 Conducted Test Cases

Test
case Test case
number Input Expected output Actual output Result
"Hello I am
1 Himanshu" hello I am Himanshu hello I am Himanshu pass
today's date as to be today's date is
2 show date displayed displayed pass
Play my favorite
3 video playing your video playing your video pass
4 Reboot now Rebooting now Rebooting now pass
5 shutdown shutting down shutting down pass
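As a hypothetical illustration of how such cases could be automated, the sketch below wraps a simple command dispatcher (the function and phrasings are illustrative, not the project's actual code) in Python's unittest framework:

    import datetime
    import unittest

    def handle_command(text):
        # Hypothetical dispatcher mapping recognised text to a response string.
        command = text.lower().strip()
        if command == "show date":
            return "today's date is " + datetime.date.today().isoformat()
        if command == "reboot now":
            return "Rebooting now"
        if command == "shutdown":
            return "shutting down"
        return command  # plain dictation is echoed back as text

    class CommandTests(unittest.TestCase):
        def test_show_date(self):
            self.assertTrue(handle_command("show date").startswith("today's date is"))

        def test_reboot(self):
            self.assertEqual(handle_command("Reboot now"), "Rebooting now")

    if __name__ == "__main__":
        unittest.main()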

Below are the snapshots of the tests conducted:


1. This application has a menu which lets the user choose the type of user. It has two options:
- one for normal people, and
- the other for disabled people with stuttered speech.

2. Once the user selects the desired option, the application starts taking input from the user
and ends its recording once it encounters a long pause in the speech input.

37
3. It requires a proper internet connection on the device while in use. If it encounters
any internet connectivity issue, it displays an error message to the user.

4. Once the application completes its recording, it translates the user's voice input into a text
message and displays it to the user on the screen.

5. If you give the input "Show date", it will display today's date with the time.

6. If you give the input "Reboot now", it starts the rebooting process.

7. If you give the input "Play my favorite video", it will play your favorite video.

8. If you give the input "shutdown", it will shut down.

5.3 USER TESTING

Below is the table of results obtained after performing user testing.
Table 5 User Testing Results (recognition times in seconds)

No  Sample  Spoken Input                   Google  SR   Google Output                    SR Output                      Google Result  SR Result
1   A1      My Name is Yash Umaretiya      4       4.5  My Name is Yes Umarethiya        My name is Yash Umaretiya      Fail           Pass
2   A2      My Name is Yash Umaretiya      3.5     4.5  My Name is Yes Umaretiya         My Name is Yash Umaretiya      Fail           Pass
3   B1      My college is SRM University   2       3.5  My college is Yes RM University  My college is SRM University   Fail           Pass
4   B2      My college is SRM University   2       3    My college is SRM University     My college is SRM University   Pass           Pass
5   C1      My surname is Singh            2.5     3    My surname is Sing               My surname is Sing             Fail           Fail
6   C2      My surname is Singh            2       3.5  My surname is Singh              My surname is Singh            Pass           Pass
7   D1      Welcome to Chennai             4       2.5  Welcome to Chennai               Welcome to Chennai             Pass           Pass
8   D2      Welcome to Chennai             3.5     3    Welcome to Chennai               Welcome to Chennai             Pass           Pass
9   E1      This place is Kattankulathur   3       3.5  This place is Kite an thur       This place is Kattankulathur   Fail           Pass
10  E2      This place is Kattankulathur   6       3    (Invalid Input)                  This place is Kattankulathur   Fail           Pass
11  F1      India is very powerful nation  2       3.5  India is very powerful nation    India is very powerful nation  Pass           Pass
12  F2      India is very powerful nation  3       3    India is very powerful nation    India is very powerful nation  Pass           Pass
13  G1      This is Dimri Situation        2.5     5    This is Dimri Situation          This is the MRI Situation      Pass           Fail
14  G2      This is Dimri Situation        3       5    This is Dimri Situation          This is The M Situation        Pass           Fail
15  H1      My name is Abirami             4       2    My name is Abhirami              My name is Abirami             Fail           Pass
16  H2      My name is Abirami             4       2.5  My name is Abhirami              My name is Abirami             Fail           Pass
17  I1      The weather is cold            2       2.5  The weather is                   The weather is cold            Fail           Pass
18  I2      The weather is cold            3       3    The weather is cold              The weather is cold            Pass           Pass
19  J1      This is Sujaya Report          3       2.5  This is Sujaya Report            This is Sujaya Report          Pass           Pass
20  J2      This is Sujaya Report          3       3    This is Sujaya Report            This is Sujaya Report          Pass           Pass
21  K1      This is Abirami Report         2       2.5  This is Abhirami Report          This is Abirami Report         Fail           Pass
22  K2      This is Abirami Report         2       2    This is Abhirami Report          This is Abirami Report         Fail           Pass
23  L1      I stay in Bangalore            3       2.5  I stay in Bangalore              I stay in Bangalore            Pass           Pass
24  L2      I stay in Bangalore            3       3    I stay in Bangalore              I stay in Bangalore            Pass           Pass
25  M1      What is your Qualification     2       2    What is your Qualifications      What is your Qualification     Fail           Pass
26  M2      What is your Qualification     3       2.5  What is your Qualification       What is your Qualification     Pass           Pass

5.4 SIZE – LOC

Size or LOC measures the program size according to the number of lines of code. The Annexure I
section contains the code for the speech recognition.
For this system, datasets for normal speech were available from the Linguistic Data Consortium and
UCI, but in the case of stammering, no proper datasets were available. So we had to take
inputs in real time: we took 5 subjects and obtained 26 samples from them. These
samples were then compared with Google's speech recognition; time and accuracy were
measured and graphs were obtained.
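As an aside, a minimal sketch of how LOC can be counted for the project sources (the directory name is hypothetical; blank lines and comment-only lines are excluded here):

    from pathlib import Path

    def count_loc(root):
        # Count non-blank, non-comment lines across all Python sources.
        total = 0
        for path in Path(root).rglob("*.py"):
            for line in path.read_text(encoding="utf-8", errors="ignore").splitlines():
                stripped = line.strip()
                if stripped and not stripped.startswith("#"):
                    total += 1
        return total

    print(count_loc("speech_recognition_project"))  # hypothetical source folder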

5.5 COST ANALYSIS
Cost of the system is an important part of deploying it. We have already
estimated cost using the model mentioned in section 2.6, which explains the costs we have
to consider in order to build any system.
Here we are not building a physical architecture to convert speech to text; what we
are doing is creating an algorithm to do so. In this case we can analyse cost only
based on the results which we get after recognition.

Table 6 Cost Categories

The table above shows the categories of cost, while Table 7 shows the five cases which we can predict
based on our detection:

Table 7 Cost Prediction Cases

False Negative (FN): FN is incurred by a device which does not have Python installed, or on
which speech recognition does not work properly and mistakenly ignores speech.
FN cost = unrecognized speech.

True Positive (TP): TP occurs when a speech event is input correctly; its cost includes
detecting the speech and analysing it.

False Positive (FP): FP occurs when speech is incorrectly detected, for example because of
network issues.

True Negative (TN): TN cost is always zero, as it occurs when the SR correctly
recognises the speech.

Misclassified Hit: this cost occurs when the wrong type of speech is identified.

Note that in each of these prediction cases we can decide whether to speak or to wait.
If the response cost is more than the damage cost, then we try to avoid inputting
speech; in this case the total loss will be equal to the damage cost.
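That decision rule can be sketched as below (a toy illustration; the cost values are hypothetical, and the loss taken in the "respond" branch is an assumption not stated in the text):

    def total_loss(damage_cost, response_cost):
        # Rule from Section 5.5: when responding costs more than the damage,
        # avoid the response; the loss is then just the damage cost.
        if response_cost > damage_cost:
            return damage_cost
        # Otherwise respond; here we assume the loss is the response cost
        # (the damage is prevented) -- an assumption, not stated in the text.
        return response_cost

    print(total_loss(damage_cost=10, response_cost=25))  # 10: waiting is cheaper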

5.6 DEFECT ANALYSIS

Defect analysis means using defects to improve quality and achieve constant
improvement. Normally this kind of analysis puts the defects into separate categories and tries to
find the reason or cause behind the occurrence of each defect.
To analyse a defect we can create a cause-effect diagram. It represents the cause and all
the important or reasonable effects which led the system to that specific cause.
A few points describing why defect analysis is important:
 It can be useful for finding the root cause
 It can uplift team efforts
 It provides an easy format for finding the reasons for a cause
 It finds areas for gathering data
 It uses an orderly, easy-to-read format

The following diagram shows the main causes of a less efficient SR system. One cause can be limited
datasets: if we use limited datasets to train our algorithm, it may be unable to detect newer
speech, so it is always advisable to keep an ample dataset for research. Sometimes we might
be using the wrong kind of algorithm to detect or train our system; for example, we cannot use
plain regression directly on an IDS dataset, because it is classification data. A third reason can
be the lack of a proper physical device needed to capture the voice. The following shows the
cause-effect diagram, whose main branches are limited data sets, an unrelated algorithm, and
the architecture/device:

Cause-Effect Diagram

In this way, we can analyse defects at an early stage to avoid them in later critical stages.
5.7 MCCALL'S QUALITY FACTORS
The following diagram shows the quality factors defined by McCall.

Figure 21 McCall's Quality Factors

• Integrity = Auditability + Instrumentation + Security.
• Usability = Operability + Training.
• Both relationships seem appropriate for SR, and relationships like these
could be useful during evaluation, and maybe even in a benchmark.
• In the computer field, a benchmark is a set of executable instructions, which may be
used to compare the performance of two or more computer systems. A benchmark is
usually composed of computer programs, but it may also include scripts of narrative
instructions that direct a person or a machine to perform certain specific tasks during
the course of the comparison test
• Security. The availability of mechanisms that control or protect programs and data
• Instrumentation. The degree to which the program monitors its own operation and
identifies errors that do occur
• Auditability. The ease with which conformance to standards can be checked
• Operability. The ease of operation of a program
• Training. The degree to which the software assists in enabling new users to apply the
system.
That is how we can define the quality factors for a Speech Recognition System.

6 EXPERIMENTAL RESULTS AND ANALYSIS
6.1 RESULTS
A confusion matrix is generally used to find the accuracy of machine learning
algorithms.

In a confusion matrix:
 True Positive (TP): includes the speech successfully detected by the system.
 False Positive (FP): includes normal behaviour which was incorrectly detected as
abnormal.
 True Negative (TN): includes normal, non-intrusive behaviour which is also
identified as such.
 False Negative (FN): includes events which are wrongly classified as normal behaviour.
In our project, datasets for stammering were not available, so we took 26 real-time
samples from 5 subjects with stammering issues.
These were compared with Google speech recognition, and the results were obtained and
are presented in Table 5 (Section 5.3).
An F1-score can also be generated from the y_test and y_pred values, or calculated
manually using precision and recall.
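For instance, with scikit-learn the confusion matrix, accuracy and F1-score can be computed directly from the per-sample labels (here y_true and y_pred are hypothetical stand-ins for the 26 Pass/Fail outcomes):

    from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

    # Hypothetical labels: 1 = correctly recognised (Pass), 0 = not (Fail).
    y_true = [1] * 26                      # every sample had a correct reference
    y_pred = [1] * 23 + [0] * 3            # SR got 23 of the 26 samples right

    print(confusion_matrix(y_true, y_pred))
    print(accuracy_score(y_true, y_pred))  # 23/26, about 0.885
    print(f1_score(y_true, y_pred))        # harmonic mean of precision and recall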

Table 8 Results

(The 26-sample comparison originally reproduced here is identical to Table 5 in Section 5.3;
the same measurements are analysed below.)

In Table 5, 26 stammering samples were taken, and their times of detection and
conversion by the Google engine and by SR (our project) were measured.
Pass and Fail were allotted to the correct and incorrect recognitions respectively.
Following the results above, graphs were made to showcase the values:

Figure 22 Time taken by Google and SR

The above graph depicts the time taken by Google and by SR to give output. Blue
depicts Google and red depicts SR.

Figure 23 Comparison between Google and SR

The above graph depicts the cases in which Google and SR gave correct outputs.
6.2 RESULT ANALYSIS
For the results in section 6.1 we calculated the accuracy of the SR outputs.
For SR, the total number of outputs is 26, of which 23 are correct.
Accuracy = 23/26 ≈ 88.5%.

6.3 CONCLUSION AND FUTURE ENHANCEMENT

This report introduced SR technology and its various applications. After
this, we discussed several tools for converting our ideas into a practical approach. Once the
software was improved by trying it and examining the results, we looked at a few
inadequacies that remained. Completing the testing provided focal points that serve as
suggestions for future enhancement.
This project can be worked upon and more details can be worked out in order to
add extra features and a few modifications. At present the project does not use a large
vocabulary; work is being done to accumulate more tests and improve the efficiency of the
software.

8. RESEARCH PAPER

SPEECH RECOGNITION FOR PARTIALLY IMPAIRED


Himanshu Garg1, Sujaya Rajkhowa2 and M.S.Abirami3
1,2, Students, Department of Software Engineering, SRM-IST
3, Assistant Professor, Department of Software Engineering, SRMIST

Abstract— Speech recognition is one of the rapidly developing engineering
technologies. It has numerous applications in different areas and provides potential
benefits. About 20% of the people in the world suffer from various disabilities; a
considerable number of them are blind or unable to use their hands
effectively. Speech recognition systems in those particular cases provide
significant help to them, so that they can share information with people by operating a computer
through voice input. This project is designed and developed keeping that factor in mind,
and a small effort has been made to achieve this aim. It aims at helping users to
open different websites. At the initial level, the effort is made to provide help for the
basic operations discussed above, but the software can be further
updated and enhanced to cover more operations.
Keywords: Speech-to-Text conversion, Automatic Speech Recognition

1 INTRODUCTION

In modern civilized societies, speech is one of the common methods of communication between humans.
Different thoughts formed in the mind of the speaker are imparted by speech as
words, phrases and sentences by applying appropriate syntactic rules. Speech is the essential
method of communication among people, and also the most natural and productive form of
exchanging information among humans. By classifying speech into voiced, unvoiced and silence
(V/U/S), a rudimentary acoustic division of speech can be
obtained. In progression, the individual sounds, called phonemes, correspond roughly to the
sounds of each letter of the alphabet that makes up the structure of human
speech. The majority of the information in the digital world is accessible only to the few people who
can read or understand a particular language. Language technologies can provide solutions as
conventional interfaces so that the digital content can reach
the masses and facilitate the exchange of information across people speaking different
languages [4]. These technologies play an essential role in multilingual societies, for example India,
which has around 1,652 dialects/local languages. Speech-to-text conversion takes input from a microphone as
speech, which is then converted into text form and displayed on the desktop. Speech
processing is the study of speech signals, and the different techniques which are used to
process them. It is used in applications such as speech coding, speech
synthesis, speech recognition and speaker recognition technologies.
Among these, speech recognition is the most essential. The fundamental
motivation behind speech recognition is to convert the acoustic signal obtained from a microphone or a
phone into a set of words [13, 23]. To extract and determine the phonetic information conveyed by a
speech wave, we need to use computers or electronic circuits. This is done for several
applications, such as security devices, household appliances, telephones, ATMs and PCs.

1. Types of speech recognition

Speech recognition systems can be divided into a number of classes based on their capacity
to recognize words and the word lists they have. A few classes of speech recognition
are described below:
1.1 Isolated Speech - Isolated words ordinarily involve a pause between two utterances; it does not
imply that the system just acknowledges a single word, but rather that it requires one utterance at any given
moment.
1.2 Connected Speech - Connected words or connected speech is like isolated speech but permits
separate utterances with minimal pause between them.
1.3 Continuous Speech - Continuous speech enables the user to talk naturally; it is also called
computer dictation.
1.4 Spontaneous Speech - At a basic level, this can be thought of as speech that is natural-
sounding and not rehearsed. An ASR system with spontaneous speech capability should have the
capacity to deal with an assortment of natural speech features, such as words being run
together, "ums" and "ahs", and even slight stutters.

II. LITERATURE REVIEW

1. Yee-Ling Lu, Man-Wai and Wan-Chi Siu explain text-to-phoneme conversion using recurrent neural networks trained with the real-time recurrent learning (RTRL) algorithm [3].
2. Penagarikano, M. and Bordel, G. explain a technique to perform speech-to-text conversion; an experimental test carried out over a task-oriented Spanish corpus and its analytical results are also reported.

3. Sultana, S., Akhand, M. A. H., Das, P. K. and Hafizur Rahman, M. M. explore Speech-to-Text (STT) conversion using SAPI for the Bangla language. Although the achieved performance is promising for STT-related studies, they identified several elements that could improve the performance and give better accuracy, and they assert that the theme of this study will also be helpful for other languages for Speech-to-Text conversion and similar tasks [3].

4. Moulines, E., in his paper "Text-to-speech algorithms based on FFT synthesis", presents FFT synthesis algorithms for a French text-to-speech system based on diphone concatenation. FFT synthesis techniques are capable of producing high-quality prosodic adjustments of natural speech. Several different approaches are formulated to reduce the distortions due to diphone concatenation.
5. Decadt, Jacques, Daelemans, Walter and Wambacq describe a method to improve the readability of the textual output in a large-vocabulary continuous speech recognition system when out-of-vocabulary words occur. The basic idea is to replace uncertain words in the transcriptions with a phoneme recognition result that is post-processed using a phoneme-to-grapheme converter. This technique uses machine learning concepts.

III. METHODOLOGY AND TOOLS

1. Fundamentals of speech recognition
Speech recognition is essentially the science of talking to a computer and having it correctly recognize what was said [17]. To expand on this, we need to understand the following terms [4], [13].
Utterances - When the user says something, that is an utterance [13]; in other words, speaking a word or a combination of words that means something to the computer is called an utterance. Utterances are then sent to the speech engine to be processed.
Pronunciation - A speech recognition engine uses a pronunciation for each word, which represents what the engine thinks the word should sound like [4]. Words can have multiple pronunciations associated with them.
Grammar - A grammar is a particular set of rules that defines the words and phrases that will be recognized by the speech engine; more concisely, the grammar defines the domain within which the speech engine works [4]. A grammar can be as simple as a list of words, or flexible enough to support various degrees of variation.
Accuracy - The performance of a speech recognition system is measurable [4]; the capability of the recognizer can be measured by calculating its accuracy, that is, how reliably it identifies an utterance.
Vocabularies - Vocabularies are the lists of words that can be recognized by the speech recognition engine [4]. In general, smaller vocabularies are easier for a speech recognition engine to identify, while a large list of words is a more difficult task for the engine to recognize.
Training - Training can be used by users who have difficulty speaking or pronouncing certain words; speech recognition systems with training should be able to adapt.
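
To make the notions of grammar and vocabulary concrete, the sketch below restricts recognition to a small command vocabulary. It assumes the offline CMU Sphinx backend of the speech_recognition package is available (the pocketsphinx package must be installed); its keyword_entries argument takes (phrase, sensitivity) pairs, and the phrases chosen here are only examples.

import speech_recognition as sr

r = sr.Recognizer()
# A tiny vocabulary: each entry is (keyword phrase, sensitivity in [0, 1])
commands = [("show date", 0.8), ("shutdown system", 0.8), ("reboot now", 0.8)]

with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source, duration=1)
    audio = r.listen(source)

try:
    # Sphinx matches only against the given keyword list; a smaller
    # vocabulary is easier to recognize than open dictation
    text = r.recognize_sphinx(audio, keyword_entries=commands)
    print("Matched command:", text)
except sr.UnknownValueError:
    print("No keyword matched")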

2. Software Requirements and Tools


• Programming Language- Python with appropriate packages
• Operating System- Windows/Linux
• Functional Browser
• Internet

IV. DESIGN AND IMPLEMENTATION

Widely used algorithms for speech recognition:


• Hidden Markov models.
• Dynamic time warping based speech recognition.
• Neural networks.
• Deep feedforward and recurrent neural networks.
• End-to-end automatic speech recognition
In this paper we use a neural network approach.

1. NEURAL NETWORKS
Connectionism, or the study of artificial neural networks, was initially inspired by neurobiology, but it has since become a highly interdisciplinary field spanning computer science, electrical engineering, mathematics, physics, psychology, and linguistics as well. Some researchers are still studying the neurophysiology of the human brain, but much attention is now being focused on the general properties of neural computation, using simplified neural models. These properties include:
a. Trainability. Networks can be taught to form associations between any input and
output patterns. This can be used, for example, to teach the network to classify speech
patterns into phoneme categories.
b. Generalization. Networks do not simply memorize the training data; rather, they learn the underlying patterns, so they can generalize from the training data to new examples. This is essential in speech recognition, because acoustic patterns are never exactly the same.
c. Nonlinearity. Networks can compute nonlinear, non-parametric functions of their input, enabling them to perform arbitrarily complex transformations of data. This is useful because speech is a highly nonlinear process.

d. Robustness. Networks are tolerant of both physical damage and noisy data; in fact, noisy data can help the networks to form better generalizations. This is a valuable feature, since speech patterns are notoriously noisy.
e. Uniformity. Networks offer a uniform computational paradigm which can easily integrate constraints from different types of inputs. This makes it easy to use both basic and differential speech inputs, for example, or to combine acoustic and visual cues in a multimodal system.

f. Parallelism. Networks are highly parallel in nature, so they are well suited to implementation on massively parallel computers. This will ultimately permit very fast processing of speech or other data.
Neural networks consist of a set of nodes of a special kind that operate collectively; each node is a simple processing unit, and the behaviour of the whole depends on the interactions among the nodes working in parallel. Some researchers define neural networks as:
• Mathematical models that mimic properties of biological systems which process information in parallel, composed of relatively simple elements.
• A class of algorithms that are formulated as graphs; these algorithms provide solutions to many complex problems [4].
The main activity of neural networks is classification and coding, and their main properties are:
a) Resistance to noise;
b) Flexibility in dealing with distorted patterns;
c) High robustness to corrupted or partially degraded input patterns;
d) Combination of parallel processing with a large number of working units, with the data load distributed in parallel; their nonlinear operation, that is, their ability to form nonlinear relationships even in the presence of noise, makes them a good basis for estimation and classification (prediction);
e) High ability to adapt through training algorithms that adjust the internal weights, allowing continual refinement.

1.1 PROCEDURE

Figure 24: The Processing of Speech Signals

The method consists of iteratively selecting the score farthest from the mean. If this score exceeds a certain threshold, it is removed and the mean and standard deviation estimates are recalculated. When there are only a few utterances from which to estimate the mean and variance, this strategy leads to a significant improvement. Text-dependent and text-independent experiments have been carried out using a telephonic multi-session database. The paper shows the interrelationship between algorithmic research and system improvements, based on the experience gained from speaker trials on small problems during the system design process, and presents a model of speech recognition based on an artificial neural network.
We present an investigation of artificial neural networks for the speech recognition task, examining the effect of network size on the effectiveness of phoneme detection in words, and studying techniques for speech signal parameterization. Linear prediction analysis is used as an intermediate method in training the neural network for phoneme recognition. The proposed training method requires as input only the transcription of the words from the training set and does not require any manual segmentation of words. The work includes:
a) Development and study of methods for diagnosing and recognizing the adjusted signals;
b) Software implementation and pilot testing of the neural network training techniques on real signals.
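
A minimal sketch of the score-pruning step described above is given below; holding the scores in a NumPy array and expressing the threshold in standard deviations are assumptions of ours.

import numpy as np

def prune_scores(scores, threshold=1.5):
    # Iteratively remove the score farthest from the mean while it lies
    # beyond the threshold, re-estimating mean and std after each removal
    scores = np.asarray(scores, dtype=float)
    while len(scores) > 2:
        mean, std = scores.mean(), scores.std()
        idx = np.argmax(np.abs(scores - mean))          # most distant score
        if std == 0 or abs(scores[idx] - mean) / std <= threshold:
            break                                       # nothing exceeds the threshold
        scores = np.delete(scores, idx)
    return scores.mean(), scores.std()

print(prune_scores([0.91, 0.88, 0.93, 0.15, 0.90]))     # the outlier 0.15 is removed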
1.2 Recognition Algorithm
• Input the signal into the computer and select the word boundaries;
• Extract the parameters characterizing the signal spectrum;
• Use an artificial neural network to estimate the degree of closeness of the acoustic parameters;
• Compare with the reference patterns in the dictionary.
The voice signal is given as input to the neural network; after preprocessing, the sound data yields an array of segments of the signal. Each segment corresponds to a set of numbers that describes the amplitude spectrum of the signal. To prepare the estimates for the network, the inputs are arranged as rows, where each row is the set of numbers for one frame. Here I is the number of values in each set, and N is the number of sets (frames of the signal after slicing). The numbers of input and output neurons are known: each input neuron corresponds to one value of the set, and the output layer has just a single neuron, which corresponds to the desired value of the signal recognition. Table 3 shows the parameter definitions used in this study.
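
The array of frames described above can be produced along the following lines; the frame length, hop size, and window are illustrative assumptions rather than values taken from the paper.

import numpy as np

def frames_to_spectra(signal, frame_len=256, hop=128):
    # Slice the signal into overlapping frames and compute the
    # amplitude spectrum of each one: N rows of I numbers each
    starts = range(0, len(signal) - frame_len + 1, hop)
    window = np.hamming(frame_len)
    return np.array([np.abs(np.fft.rfft(signal[s:s + frame_len] * window))
                     for s in starts])

signal = np.random.randn(4000)   # stand-in for a recorded utterance
X = frames_to_spectra(signal)
print(X.shape)                   # (N frames, I spectral values per frame)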
1.3. Equations
To calculate the output of the neural network, the following successive steps must be carried out:
Step 1: Initialize the contexts of all the neurons in the hidden layer;
Step 2: Apply the first set of numbers to the neural network and calculate the output of the hidden layer:

yj = F( Σi wji · xi + cj ),  with  F(x) = 1 / (1 + e^(-x))

where F(x) is the non-linear (sigmoid) activation function and cj is the context of hidden neuron j. The remaining sets of numbers are applied in the same way, and separate networks are trained for the numbers from 0 to 9.
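
Because the network in Figure 25 has feedback, each new set of numbers is combined with the hidden-layer context from the previous step. A sketch of one such update follows; the dimensions and weight initialization are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

I, H = 129, 20                        # inputs per frame and hidden neurons (assumed)
W_in = 0.1 * np.random.randn(H, I)    # input-to-hidden weights
W_ctx = 0.1 * np.random.randn(H, H)   # context (feedback) weights
h = np.zeros(H)                       # Step 1: initialize all contexts

def step(x, h):
    # Step 2: apply one set of numbers and compute the hidden-layer output
    return sigmoid(W_in @ x + W_ctx @ h)

x = np.random.randn(I)                # one frame's spectral values
h = step(x, h)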

To recognize a single digit, a dedicated neural network must be built; hence ten such networks are required. A database of more than 250 recordings of the numbers from 0 to 9, with various pronunciation variants, was randomly divided into two equal parts: a training set and a test set. When training the network that recognizes one number, say the digit 5, the desired output of the network is one for the training samples containing the digit 5 and zero for the rest.
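
This one-network-per-digit scheme amounts to one-vs-rest training targets, which can be constructed as in the sketch below (the feature extraction and the training loop itself are omitted; the label array is a made-up example).

import numpy as np

labels = np.array([5, 3, 5, 0, 9, 5])            # spoken digit of each utterance
targets_for_net5 = (labels == 5).astype(float)   # 1 for digit 5, 0 for the rest
print(targets_for_net5)                          # [1. 0. 1. 0. 0. 1.]
# Network k is trained against (labels == k); at test time the recognized
# digit is the one whose network produces the largest output.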

Figure 25: The structure of a neural network with feedback

Neural network training is carried out through repeated presentation of the training set, with simultaneous tuning of the weights according to a particular procedure, until the overall pattern error reaches an acceptable level. The error of the network is calculated by the following formula:

E = (1/N) · Σk (dk − yk)²

where N is the number of training samples, dk is the desired output prepared for sample k, and yk is the actual output of the neural network.
The prototype of an artificial neuron is the biological nerve cell. A neuron consists of a cell body, or soma, and two types of external tree-like branches: the axon and the dendrites. The cell body contains the nucleus, which holds information about hereditary traits, and plasma with the molecular machinery for producing and transmitting the materials the neuron needs. A neuron receives signals from other neurons through its dendrites and transmits the signals generated by the cell body along the axon, which at its end branches into fibres terminating in synapses.
The mathematical model of a neuron is described by the relation:

S = Σi wi · xi,   y = F(S)

Block diagram of a neuron: x1, x2, ..., xn are the inputs of the neuron; w1, w2, ..., wn are a set of weights; F(S) is the activation function; y is the output signal. The neuron performs simple operations: a weighted summation followed by a nonlinear threshold transformation of the result. A key feature of the neural network approach is that a structure of simple, homogeneous elements makes it possible to address complex relationships between inputs and outputs. The structure of the connections defines the functional properties of the network as a whole.
The functional features of the neurons and the way they are combined into a network structure determine the features of the neural network. Multilayer feedforward neural networks, or layered perceptrons, are among the most adequate structures for recognition and control tasks. In such a structure the neurons are arranged in layers, each of which processes the vector of signals from the previous layer. The minimal implementation is a two-layer neural network, consisting of an input (distribution) layer, an intermediate (hidden) layer, and an output layer.

Figure 26: Technical model of a neuron

The model of a two-layer feedforward neural network has the following mathematical representation:

yi(x, θ) = Fi( Σj Wij · fj( Σk wjk · xk + wj0 ) + Wi0 )

where nφ is the dimension of the input vector x of the neural network; nh is the number of neurons in the hidden layer; θ is the vector of configurable parameters of the network, which includes the weights and neuron offsets (wjk, Wij); fj(x) is the activation function of the hidden-layer neurons; and Fi(x) is the activation function of the neurons in the output layer.
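
A direct transcription of this representation into code might read as follows; taking fj as the sigmoid and Fi as the identity is our choice, as are the dimensions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_phi, n_h, n_out = 129, 20, 1         # input size, hidden neurons, outputs (assumed)
w = 0.1 * np.random.randn(n_h, n_phi)  # hidden weights w_jk
b = np.zeros(n_h)                      # hidden offsets w_j0
W = 0.1 * np.random.randn(n_out, n_h)  # output weights W_ij
B = np.zeros(n_out)                    # output offsets W_i0

def forward(x):
    hidden = sigmoid(w @ x + b)        # fj applied to the weighted sums
    return W @ hidden + B              # Fi taken as the identity here

y = forward(np.random.randn(n_phi))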
The most essential feature of the neural network method is the possibility of parallel processing. When there is a large number of neural connections, this feature makes it possible to significantly accelerate the processing of signal data [6], opening the possibility of processing speech signals in real time. The neural network thus has characteristics that are natural to so-called artificial intelligence.

V. RESULTS AND DISCUSSION

When we give the sample input as a voice message through the microphone, we get the transcribed text. If the text contains the word "google" or "facebook", the system opens the web browser and displays the corresponding site.
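
This site-opening behaviour reduces to a simple dispatch over the transcribed text, sketched below; the URLs and the substring-matching rule are our illustration rather than the exact logic of the annexure code.

import webbrowser

def open_site(text):
    t = text.lower()
    if "google" in t:
        webbrowser.open("https://www.google.com")
    elif "facebook" in t:
        webbrowser.open("https://www.facebook.com")
    else:
        print("No known site mentioned")

open_site("please open Google for me")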
The model of speech recognition was based on artificial neural networks. This was investigated with a view to developing a learning neural network using a genetic algorithm. The approach was first implemented in a system for identifying numbers, and then carried over to a system for the recognition of voice commands. A system for automatic recognition of speech keywords, associated with the processing of telephone calls or with security applications, was developed. The accuracy of estimation based on the present data set was consistently better than in previous experience.

9. REFERENCES

[1] Sanjib Das, “Speech Recognition Technique: A Review”, International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 2, Issue 3, May-Jun 2012.
[2] Ms. Sneha K. Upadhyay, Mr. Vijay N. Chavda, “Intelligent system based on speech recognition with capability of self-learning”, International Journal For Technological Research In Engineering, ISSN (Online): 2347-4718, Volume 1, Issue 9, May 2014.
[3] Deepa V. Jose, Alfateh Mustafa, Sharan R, “A Novel Model for Speech to Text Conversion”, International Refereed Journal of Engineering and Science (IRJES), ISSN (Online) 2319-183X, Volume 3, Issue 1, January 2014.
[4] B. Raghavendhar Reddy, E. Mahender, “Speech to Text Conversion using Android Platform”, International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 3, Issue 1, January-February 2013.
[5] Kaveri Kamble, Ramesh Kagalkar, “A Review: Translation of Text to Speech Conversion for Hindi Language”, International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, Volume 3, Issue 11, November 2014.
[6] Santosh K. Gaikwad, Bharti W. Gawali, Pravin Yannawar, “A Review on Speech Recognition Technique”, International Journal of Computer Applications (0975-8887), Volume 10, No. 3, November 2010.
[7] Penagarikano, M.; Bordel, G., “Speech-to-text translation by a non-word lexical unit based system”, Signal Processing and Its Applications, 1999, ISSPA '99, Proceedings of the Fifth International Symposium on, vol. 1, pp. 111-114, 1999.
[8] Olabe, J. C.; Santos, A.; Martinez, R.; Munoz, E.; Martinez, M.; Quilis, A.; Bernstein, J., “Real time text-to-speech conversion system for Spanish”, Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '84, vol. 9, pp. 85-87, Mar 1984.
[9] Kavaler, R. et al., “A Dynamic Time Warp Integrated Circuit for a 1000-Word
Recognition System”, IEEE Journal of Solid-State Circuits, vol SC-22, NO 1, February 1987,
pp 3-14.
[10] Aggarwal, R. K. and Dave, M., “Acoustic modelling problem for automatic speech
recognition system: advances and refinements (Part II)”, International Journal of Speech
Technology (2011) 14:309–320.

[11] Ostendorf, M., Digalakis, V., & Kimball, O. A. (1996). “From HMM’s to segment
models: a unified view of stochastic modeling for speech recognition”. IEEE Transactions on
Speech and Audio Processing, 4(5), 360– 378.
[12] Yasuhisa Fujii, Y., Yamamoto, K., Nakagawa, S., “Automatic Speech Recognition Using Hidden Conditional Neural Fields”, ICASSP 2011, pp. 5036-5039.
[13] Mohamed, A. R., Dahl, G. E., and Hinton, G., “Acoustic Modelling using Deep Belief
Networks”, submitted to IEEE TRANS. On audio, speech, and language processing, 2010.
[14] Sorensen, J., and Allauzen, C., “Unary data structures for Language Models”,
INTERSPEECH 2011.
[15] Ji Ming & Danny Crookes, “Speech Enhancement Based on Full-Sentence
Correlation and Clean Speech Recognition”, in IEEE Transactions, March, 2017.
[16] Ramji Srinivasan, Ji Ming & Danny Crookes, “CLOSE—A Data-Driven Approach to
Speech Separation”, in IEEE Transactions, July, 2013.
[17] Shweta Khara, Shailendra Singh & Dharam Vir, “A Comparative Study of the Techniques for Feature Extraction and Classification in Stuttering”, IEEE Conferences, February 2018.
[18] Soumya Priyadarsini Panda, “Automated Speech Recognition System in Advancement of Human-Computer Interaction”, in IEEE Conferences, 2017.
[19] Kain, A., Hosom, J. P., Ferguson, S. H., Bush, B., “Creating a speech corpus with semi-
spontaneous, parallel conversational and clear speech”, Tech Report: CSLU-11- 003, August
2011.

ANNEXURES
ANNEXURE I: SOURCE CODE FOR SPEECH RECOGNITION
import speech_recognition as sr
import webbrowser
import os

while True:
    print("1.Normal People\n2.Disabled People\n3.Control Your System")
    ch = int(input('Enter your choice: '))

    if ch == 1:
        # Single-shot recognition for users without speech impairments
        r = sr.Recognizer()
        with sr.Microphone() as source:
            print("A moment of silence")
            r.adjust_for_ambient_noise(source, duration=1)  # calibrate for background noise
            print("Say something!")
            audio = r.listen(source)
        print("Trying to recognize audio")
        try:
            t = r.recognize_google(audio)
            print("You just said ", t)
        except sr.UnknownValueError:
            print("Google Speech Recognition could not understand audio")
        except sr.RequestError as e:
            print("Could not request results from Google Speech Recognition service; {0}".format(e))

    elif ch == 2:
        # Accumulate partial utterances for users who pause or stammer;
        # press Ctrl+C to finish and print the consolidated result
        try:
            final_string = ""
            r = sr.Recognizer()
            while True:
                with sr.Microphone() as source:
                    r.adjust_for_ambient_noise(source, duration=1)
                    print("Speak")
                    audio = r.listen(source)
                print("Wait")
                try:
                    t = r.recognize_google(audio)
                    final_string = final_string + " " + t
                except sr.UnknownValueError:
                    print("Google Speech Recognition could not understand audio")
                except sr.RequestError as e:
                    print("Could not request results from Google Speech Recognition service; {0}".format(e))
        except KeyboardInterrupt:
            print("Consolidated Result: ", final_string)
            print("Exited")

    elif ch == 3:
        # Map recognized voice commands to system actions and websites
        r = sr.Recognizer()
        with sr.Microphone() as source:
            print("A moment of silence")
            r.adjust_for_ambient_noise(source, duration=1)
            print("Say something!")
            audio = r.listen(source)
        print("Trying to recognize audio")
        try:
            t = r.recognize_google(audio)
            print("You just said ", t)
            t = t.strip().lower()
            if t == 'show date':
                print("Showing Date")
                os.system('date')
            elif t == 'shutdown system':
                print('Shutting Down')
                os.system('sudo halt')
            elif t == 'reboot now':
                print('Rebooting Now')
                os.system('sudo reboot')
            elif t == 'play my favourite video':
                print('Playing your video')
                webbrowser.open('https://www.youtube.com/watch?v=UhYRlI_bpJQ')
            elif t == 'which is my college':
                print('Your college is: SRM University')
                webbrowser.open('http://srmuniv.ac.in/')
            else:
                print("Command Not Recognised Yet!! Please Try Again")
        except sr.UnknownValueError:
            print("Audio Unknown (or not understood)")
        except sr.RequestError as e:
            print("Could not request results from Google Speech Recognition service; {0}".format(e))
    else:
        print("Invalid Option")

    cnt = input("Do you want to continue (Y/N)? : ")
    if cnt == 'n' or cnt == 'N':
        exit()
