S.A.R.
A
Minor Project Report
submitted
in partial fulfillment
for the award of the Degree of
Bachelor of Technology
in Department of Computer Engineering
(with specialization in Computer Science & Engineering)
Supervisor :
Ms. Swati Saxena

Submitted by :
Hitesh Khandelwal - 11E1IRCSM4XP010
Himanshi Gupta - 11E1IRCSF4XP009
Ipseema Ved - 11E1IRCSF4XP011
Ojasvita Sharma - 11E1IRCSM4XP019
Candidates' Declaration
We hereby declare that the Report of the U.G. Project Work entitled S.A.R.
which is being submitted to the International Institute of Management, Engineering
& Technology, in the partial fulfillment of the requirements for the award of the
Degree of Bachelor of Engineering in COMPUTER ENGINEERING in the
Department of Computer Engineering, is a bonafide report of the work
carried out by us. The material contained in this Report has not been submitted
to any University or Institution for the award of any degree.
ACKNOWLEDGEMENT
We take this opportunity to express our deepest and sincere gratitude to our supervisor Ms.
Swati Saxena for her insightful advice, motivating suggestions, invaluable guidance, help and
support in the successful completion of this project, and also for her constant encouragement
and advice throughout our Bachelor's program.
We express our deep gratitude to Mr. Rishikant Shukla & Mr. Pankaj Jain of Computer
Science Department for their regular support, co-operation, and co-ordination.
The in-time facilities provided by the department throughout the Bachelor's program are also
gratefully acknowledged.
We would like to convey our thanks to the teaching and non-teaching staff of the Department
of Computer Engineering for their invaluable help and support throughout the period of the
Bachelor's degree. We are also grateful to all our classmates for their help, encouragement
and invaluable suggestions.
CERTIFICATE OF APPROVAL
The undersigned certify that the final year project entitled S.A.R., submitted by Hitesh
Khandelwal, Ojasvita Sharma, Ipseema Ved and Himanshi Gupta to the Department of
Computer Engineering in partial fulfillment of the requirements for the degree of Bachelor of
Engineering in Computer Engineering, was carried out under special supervision and within
the time frame prescribed by the syllabus.
We found the students to be hardworking, skilled, bonafide and ready to undertake any
commercial and industrial work related to their field of study.
1. ______________
Ms. Swati Saxena
(Project Supervisor)

2. ______________
Mr. Rishikant Shukla
(External Examiner)

3. ______________
Mr. Pankaj Jain
(H.O.D., CSE)
Table of Contents

Candidates' Declaration
Acknowledgement
Certificate of Approval
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Abstract
Chapter 1: Introduction
  1.1 Background Introduction
    1.1.1 Book Reader
    1.1.2 Speech Recognizer
  1.3 Motivation
  1.4 Problem Definition
Chapter 2: Requirement Analysis
  2.1 Project Requirements
  2.2 Feasibility Study
Chapter 3
  3.1 Block Diagram
  3.2 Use Case Diagram
  3.3 Sequence Diagrams
Chapter 4
  4.7 Snapshots
Chapter 5
  5.2 Limitations
  5.3 Future Scope
List of Figures

Figure 1.1
Figure 1.2
Figure 3.1
Figure 3.2
Figure 3.3
Figure 3.4
Figure 3.5
Figure 3.6: Acoustic models: template and state representations for the word cat
Figure 3.7: The alignment path with the best total score identifies the word sequence and segmentation
Figure 3.8
Figure 3.9
Figure 3.10: Sequence Diagram
Figure 4.1: Sphinx Architecture
Figure 4.2: Book Reader
Figure 4.3: Browse option
Figure 4.4
Figure 4.5
List of Tables

Table 1: Five steps of speech recognition
List of Abbreviations

OS - Operating System
JSGF - Java Speech Grammar Format
JSAPI - Java Speech API
SAR - Synthesizer and Recognizer
TTS - Text to Speech
ABSTRACT
Speech recognition technology is one of the fastest-growing engineering technologies. It has a
number of applications in different areas and provides potential benefits. Nearly 20% of the
world's population lives with some form of disability; many such people are blind or unable to
use their hands effectively. Speech recognition systems provide significant help in those
particular cases, allowing users to share information with people by operating a computer
through voice input. This project was designed and developed with that factor in mind, and a
small effort is made here toward that aim. Our project identifies the words that a person speaks
into a microphone, recognizes the command given, and performs the operation as per the
requirement of the user. This can be used in hands-free computing, car-based systems and
health-care systems. The speech synthesizer converts normal language text into speech
irrespective of the file format on which the operation has to be performed; it is the artificial
production of human speech.
Chapter 1
Introduction
Language is man's most important means of communication and speech its primary medium.
Spoken interaction, both between human interlocutors and between humans and machines, is
inescapably embedded in the laws and conditions of communication, which comprise the
encoding and decoding of meaning as well as the mere transmission of messages over an
acoustical channel. Here we deal with this interaction between man and machine through
synthesis and recognition applications. Speech recognition involves capturing and digitizing
the sound waves, converting them to basic language units or phonemes, constructing words
from phonemes, and contextually analysing the words to ensure correct spelling for words that
sound alike. Speech recognition is the ability of a computer to recognize general, naturally
flowing utterances from a wide variety of users. It recognizes the caller's answers to move
along the flow of the call. Emphasis is given to the modelling of speech units and grammar on
the basis of the CMU Sphinx model. Speech recognition allows you to provide input to an
application with your voice. The applications and limitations of this subject illustrate the
impact of speech processing in the modern technical field. While there is still much room for
improvement, current speech recognition systems show remarkable performance. As we
develop this technology we attain further achievements; rather than asking what is still
deficient, we ask instead what should be done to make it efficient.
1.1 Background Introduction
1.1.1. Book Reader (Speech Synthesis) :
Human speech is the most natural form of communication between people. Therefore, it
would be very beneficial to use natural speech for the interaction between human and
computer. The artificial production of speech by a computer is called speech synthesis;
speech synthesis from a text is called Text-to-Speech (TTS) synthesis.
Many applications of speech synthesis exist; we mention only some examples. In
telecommunications services, speech synthesis can replace the operator's voice, and short
messages can simply be pronounced by the synthesizer. In public transport, synthesized
speech is used instead of human speech to inform passengers about arrivals/departures or
to provide other important information. Speech synthesis could also be used for language
education but, to the best of our knowledge, this is not yet done, due to the low quality of
existing systems.
The main goal of this part of the project is to propose and implement a TTS system based
on the MBROLA (Multi-Band Resynthesis Overlap-Add) project.
1.1.2. Speech Recognizer :
Speech recognition technology is one of the fastest-growing engineering technologies. It
has a number of applications in different areas and provides potential benefits. Nearly
20% of the world's population lives with some form of disability; many such people are
blind or unable to use their hands effectively. Speech recognition systems provide
significant help in those particular cases, allowing users to share information with people
by operating a computer through voice input.
Unlike older software, this software can operate on dictated continuous speech. A large
vocabulary with different words has been provided to increase the probability of correct
recognition. A constrained syntax is used to help recognize words by disambiguating
similar sounds.
Since this is a data-centric product, a gram file is required to store the entire vocabulary in
a form understandable by the system. The speech recognizer communicates with this
gram file in order to interpret the commands given by the user. The user gives an
instruction into the microphone, which is decoded by the decoder and converted into a
string. The gram file is then used: the strings in the instruction are compared with those
present in the gram file, and the corresponding operation is performed accordingly.
1.2.1 The Java Speech API :
The Java Speech API defines a standard, easy-to-use, cross-platform software interface to
state-of-the-art speech technology. Two core speech technologies are supported through
the Java Speech API: speech recognition and speech synthesis. Speech recognition
provides computers with the ability to listen to spoken language and to determine what
has been said. In other words, it processes audio input containing speech by converting it
to text.
The Java Speech API was developed through an open development process. With the
active involvement of leading speech technology companies, with input from application
developers and with months of public review and comment, the specification has
achieved a high degree of technical excellence. As a specification for a rapidly evolving
technology, Sun will support and enhance the Java Speech API to maintain its leading
capabilities.
The Java Speech API is an extension to the Java platform. Extensions are packages of
classes written in the Java programming language (and any associated native code) that
application developers can use to extend the functionality of the core part of the Java
platform.
1.2.2 The Java Speech Grammar Format (JSGF) :
The Java Speech Grammar Format (JSGF) is a platform- and vendor-independent way
of describing a rule grammar (also known as a command-and-control grammar or regular
grammar).
A rule grammar specifies the types of utterances a user might say. For example, a service
control grammar might include "Service" and "Action" commands.
A voice application can be based on a set of scenarios. Each scenario knows the context
and provides appropriate grammar rules for the context.
Grammar rules can be provided in a multi-lingual manner.
The grammar body defines rules, each written as a rule name followed by its definition.
A definition can include several alternatives separated by | characters.
For example:
public <greet> = (Good morning | Hello) (Bhiksha | Evandro | Paul | Philip);
The system returns the words Bhiksha, Evandro, Paul or Philip only if Good morning or
Hello was spoken first.
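For completeness, a rule such as the one above normally lives in a small grammar file with a JSGF header; the following sketch is illustrative (the file's grammar name is invented here):

```
#JSGF V1.0;
grammar greetings;

public <greet> = (Good morning | Hello) (Bhiksha | Evandro | Paul | Philip);
```

The `#JSGF V1.0;` header declares the format version, and the `grammar` declaration names the grammar so a recognizer can load and enable its public rules.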
1.2.3 MBROLA :
1.3 Motivation
The keyboard, although a popular medium, is not very convenient as it requires a certain
amount of skill for effective usage. A mouse, on the other hand, requires good hand-eye
co-ordination. It is also cumbersome for entering a non-trivial amount of text data and
hence requires an additional medium such as the keyboard. Physically challenged people
find computers difficult to use. Partially blind people find reading from a monitor
difficult.
The two components of this project, SAR (Synthesis and Recognition), together form a
speech interface. A speech synthesiser converts text into speech; thus it can read out the
textual contents from the screen. A speech recogniser has the ability to understand spoken
words and convert them into text.
1.4 Problem Definition
The aim of this project is to provide a better interface for the synthesis of files, so that
they can easily be read to users, and to let users operate the computer through voice
commands. This report explains the purpose and features of the system, the interfaces of
the system, what the system will do, the constraints under which it must operate, and how
the system will react to external stimuli.
A speech synthesizer is a speech engine that converts text to speech. The
javax.speech.synthesis package defines the Synthesizer interface to support Speech
Synthesis plus a set of supporting classes and interfaces. As a type of speech engine,
much of the functionality of a Synthesizer is inherited from the Engine interface in the
javax.speech package and from other classes and interfaces in that package.
It almost eliminates the use of keyboard, mouse or any other input providing interface.
There are two purposes served by this software:
i. Speech Synthesizer: This part of the application converts normal language text
into speech irrespective of the file format on which the operation has to be
performed. It is the artificial production of human speech. This application can be
used as a screen reader for people with visual impairment; it can also be used by
people with dyslexia and other reading difficulties, as well as by pre-literate
children.
ii. Speech Recognizer: This part identifies the words that the user speaks into a
microphone, recognizes the command given, and performs the operation as per the
requirement of the user.
This software is platform dependent and requires the Windows operating system for its
functioning. The synthesizer part of the software makes use of the MBROLA language.
The software needs a microphone to capture clearly the words spoken by the user for the
purpose of recognition. The application maintains a JSGF file, a grammar file which
stores the acoustic language for proper recognition and synthesis of normal human
language, assumed here to be US English. The software includes an XML file which is
loaded whenever the program runs; this is a configuration file which includes the
packages for recognition of the commands. The application can read text from any kind
of file, whether a Word file, a PDF file or a txt file. The recognizer can open or close any
application through voice recognition.
Chapter 2
Requirement Analysis
2.1 Project Requirements
2.1.1 Hardware Requirements :
2.1.2 Software Requirements :
2.1.3 Technologies to be used :
2.2 Feasibility Study
The main objective of this study is to determine whether SAR (Synthesizer and
Recognizer) is feasible or not. There are three main types of feasibility study to which the
developed system is subjected, as described below. The key considerations involved in
the feasibility are:
1. Technical feasibility
2. Economic feasibility
3. Operational feasibility
The developed system must first be evaluated from a technical viewpoint, and its impact
on the organization must be assessed. If compatible, a behavioural system can be
devised. Then it must be tested for feasibility. The three keys above are explained below:
Technical Feasibility
SAR satisfies technical feasibility because the service can be implemented as a
stand-alone application. It is compatible with the Microsoft Windows OS.
Economic feasibility
Our project entitled SAR (Synthesizer and Recognizer) is economically feasible because
it is developed using a very small amount of economic resources. It is free.
Operational feasibility
Operational feasibility should be assessed after the software is developed, to confirm that
it can cope with the defined objectives.
The application is user friendly with its GUI and handy to use.
The application is affordable because the only requirements are a normal computer and a
microphone.
Since this application is developed in Java, it runs on multiple platforms.
Chapter 3
3.1 Block Diagram
3.1.1 Book Reader (Speech Synthesizer) :
The TTS systems first convert the input text into its corresponding linguistic or phonetic
representations and then produce the sounds corresponding to those representations. With
the input being a plain text, the generated phonetic representations also need to be
augmented with information about the intonation and rhythm that the synthesized speech
should have. This task is done by a text analysis module in most speech synthesizers. The
transcription from the text analysis module is then given to a digital signal processing
(DSP) module that produces synthetic speech. Figure 3.1 shows the block diagram of a
TTS system.
Figure 3.2 shows the text-to-speech synthesis cycle in most TTS systems. In this cycle,
the text preprocessing transforms the input text into a regularized format that can be
processed by the rest of the system. This includes breaking the input text into sentences,
tokenizing them into words and expanding the numerals. The prosodic phrasing
component then divides the preprocessed text into meaningful chunks of information
based on language models and constructs. The pronunciation generation component is
responsible for generating the acoustic sequence needed to synthesize the input text by
finding the pronunciation of individual words in the input text. A duration value for each
segment of speech is determined by the segmental duration generation component. The
function of the intonation generation component is to generate a fundamental frequency
(F0) contour for the input text to be synthesized. The waveform generation component
takes as input the phonetic and prosodic information generated by the various
components described above, and generates the speech output.
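The text-preprocessing stage described above (sentence breaking, tokenization, numeral expansion) can be sketched as follows. The rules here are deliberately minimal and hypothetical: English only, single digits, simple punctuation; a real TTS front end handles far more cases.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of TTS text preprocessing: split into sentences,
// tokenize into words, and expand single-digit numerals to words.
public class TextPreprocess {
    static final String[] DIGITS = {"zero", "one", "two", "three", "four",
                                    "five", "six", "seven", "eight", "nine"};

    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        // split after sentence-final punctuation
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            for (String word : sentence.split("\\s+")) {
                word = word.replaceAll("[.!?,]$", ""); // strip trailing punctuation
                if (word.matches("\\d"))               // expand a lone digit
                    word = DIGITS[Integer.parseInt(word)];
                if (!word.isEmpty()) tokens.add(word.toLowerCase());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // yields the regularized token stream for later pipeline stages
        System.out.println(preprocess("Read chapter 3. Then stop!"));
    }
}
```

The output of such a stage would then feed the prosodic phrasing and pronunciation generation components described in the text.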
Speech recognition involves capturing the user's utterance, digitizing the utterance into a
digital signal, converting it into basic units of utterance (phonemes), and contextually
analyzing the words to ensure correct spelling for words that sound alike (such as write
and right). Figure 3.3 illustrates these processes.
Table 1: Five steps of speech recognition

Process Name - Description
1. User Input - the system captures the user's spoken utterance.
2. Digitization - the utterance is digitized into a digital signal.
3. Phonetic Breakdown - the signal is converted into basic units of utterance (phonemes).
4. Statistical Modeling - statistical models map phoneme sequences to likely words.
5. Matching - candidate words are matched against the grammar and dictionary.
Grammar here means the set of words or phrases used to constrain the range of input or
output in the voice application, and Dictionary means the mapping table between phonetic
representations and words; for example, thu and thee both map to the.
Raw speech - Speech is typically sampled at a high frequency, e.g., 16 kHz over a
microphone or 8 kHz over a telephone. This yields a sequence of amplitude values over
time.
Signal analysis - Raw speech should be initially transformed and compressed in order to
simplify subsequent processing. Many signal analysis techniques are available which can
extract useful features and compress the data by a factor of ten without losing any
important information.
Speech frames - The result of signal analysis is a sequence of speech frames, typically at
10 millisecond intervals, with about 16 coefficients per frame. These frames may be
augmented by their own first and/or second derivatives, providing explicit information
about speech dynamics; this typically leads to improved performance.
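The derivative (delta) augmentation just mentioned can be sketched as a central difference over neighbouring frames. The frame layout and clamped edges below are illustrative choices, not Sphinx code:

```java
// Sketch: augmenting speech frames with first-order (delta) coefficients.
// Each row is one frame of feature coefficients.
public class DeltaFeatures {
    // delta[t][k] = (frame[t+1][k] - frame[t-1][k]) / 2, edges clamped
    static double[][] deltas(double[][] frames) {
        int n = frames.length, d = frames[0].length;
        double[][] out = new double[n][d];
        for (int t = 0; t < n; t++) {
            int prev = Math.max(t - 1, 0);
            int next = Math.min(t + 1, n - 1);
            for (int k = 0; k < d; k++)
                out[t][k] = (frames[next][k] - frames[prev][k]) / 2.0;
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] frames = { {1, 0}, {2, 0}, {4, 0} };
        // central difference at t=1: (4 - 1) / 2 = 1.5
        System.out.println(deltas(frames)[1][0]);
    }
}
```

In practice the deltas are appended to each frame, roughly doubling the coefficient count per frame.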
Figure 3.6 - Acoustic models: template and state representations for the word cat.
Acoustic analysis and frame scores - Acoustic analysis is performed by applying each
acoustic model over each frame of speech, yielding a matrix of frame scores, as shown in
Figure 3.6. Scores are computed according to the type of acoustic model that is being
used. For template-based acoustic models, a score is typically the Euclidean distance
between a template's frame and an unknown frame. For state-based acoustic models, a
score represents an emission probability, i.e., the likelihood of the current state generating
the current frame, as determined by the state's parametric or non-parametric function.
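The template-based frame score just described can be sketched as follows; the two-coefficient frames are illustrative only:

```java
// Sketch: template-based frame score as the Euclidean distance between
// a template's frame and an unknown frame of feature coefficients.
public class FrameScore {
    static double euclidean(double[] template, double[] unknown) {
        double sum = 0.0;
        for (int k = 0; k < template.length; k++) {
            double diff = template[k] - unknown[k];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // sqrt(3^2 + 4^2) = 5.0
        System.out.println(euclidean(new double[]{0, 0}, new double[]{3, 4}));
    }
}
```

A smaller distance means a better match; filling this score in for every (model frame, input frame) pair yields the matrix of frame scores of Figure 3.6.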
Figure 3.7 - The alignment path with the best total score identifies the word sequence and
segmentation.
Time alignment can be performed efficiently by dynamic programming, a general
algorithm which uses only local path constraints, and which has linear time and space
requirements.
(This general algorithm has two main variants, known as Dynamic Time Warping (DTW)
and Viterbi search, which differ slightly in their local computations and in their
optimality criteria.)
In a state-based system, the optimal alignment path induces segmentation on the word
sequence, as it indicates which frames are associated with each state. This segmentation
can be used to generate labels for recursively training the acoustic models on
corresponding frames.
Word sequence - The end result of time alignment is a word sequence - the sentence
hypothesis for the utterance. Actually it is common to return several such sequences,
namely the ones with the highest scores, using a variation of time alignment called N-best
search. This allows a recognition system to make two passes through the unknown
utterance: the first pass can use simplified models in order to quickly generate an N-best
list, and the second pass can use more complex models in order to carefully rescore each
of the N hypotheses, and return the single best hypothesis.
3.2 Use Case Diagram :
System: The system analyzes the text provided by the user, performs linguistic
analysis, and generates a waveform which the user then hears through the
speakers.
Speech recognizer:
User: User is supposed to give commands to the system through a microphone.
System: System will recognize the commands and will perform task accordingly.
3.3 Sequence Diagrams :
Chapter 4
Here it checks whether the language allows a particular syllable to appear after another.
After that, there is a grammar check: the system tries to find out whether or not the
combination of words makes any sense.
4. Dialog Management
An attempt is made to correct the errors encountered. Then the meaning of the combined
words is extracted and the required task is performed.
5. Response Generation
After the task is performed, the response or the result of that task is generated. The
response is either in the form of speech or text. The words to use so as to maximize
user understanding are decided here. If the response is to be given in the form of
speech, then the text-to-speech conversion process is used.
However, the sound of a continuous sentence built this way is not natural. Current
synthesizers are thus mostly based on shorter units such as phonemes, diphones and
demi-syllables, or on combinations of these.
Several methods of concatenative synthesis exist; we mention here only the interesting
ones:
Micro-phonemic method: units of variable length derived from natural speech are
used.
Sinusoidal models: these assume that the speech signal can be represented as a sum of
sine waves with time-varying amplitudes and frequencies.
PSOLA methods: a very popular family of methods that allows prerecorded speech
samples to be smoothly concatenated and provides good control of pitch and duration;
used in synthesis systems such as ProVerbe, HADIFIX and MBROLA (the system
used in this work).
Isolated Word
Isolated word recognizers usually require each utterance to have quiet (lack of an audio
signal) on BOTH sides of the sample window. This doesn't mean that the system accepts
only single words, but it does require a single utterance at a time. Often, these systems
have "Listen/Not-Listen" states, where they require the speaker to wait between
utterances (usually doing processing during the pauses).
Connected Word
Connected word systems (or more correctly 'connected utterances') are similar to isolated
word systems, but allow separate utterances to be 'run together' with a minimal pause
between them.
Continuous Speech
Recognizers with continuous speech capabilities are some of the most difficult to create
because they must utilize special methods to determine utterance boundaries. Continuous
speech recognizers allow users to speak almost naturally, while the computer determines
the content. Basically, it's computer dictation.
Spontaneous Speech
At a basic level, it can be thought of as speech that is natural sounding and not rehearsed.
An ASR system with spontaneous speech ability should be able to handle a variety of
natural speech features such as words being run together, "ums" and "ahs", and even
slight stutters.
Voice Verification/Identification
Some ASR systems have the ability to identify specific users by characteristics of their
voices (voice biometrics). If the speaker claims to be of a certain identity and the voice is
used to verify this claim, this is called verification or authentication. On the other hand,
identification is the task of determining an unknown speaker's identity. In a sense speaker
verification is a 1:1 match where one speaker's voice is matched to one template (also
called a "voice print" or "voice model") whereas speaker identification is a 1: N match
where the voice is compared against N templates.
There are two types of voice verification/identification system, which are as follows:
Text-Dependent:
If the text must be the same for enrolment and verification this is called text dependent
recognition. In a text-dependent system, prompts can either be common across all
speakers (e.g.: a common pass phrase) or unique. In addition, the use of shared-secrets
(e.g.: passwords and PINs) or knowledge-based information can be employed in order to
create a multi-factor authentication scenario.
Text-Independent:
Text-independent systems are most often used for speaker identification as they require
very little if any cooperation by the speaker. In this case the text during enrolment and
test is different. In fact, the enrolment may happen without the user's knowledge, as in the
case for many forensic applications. As text-independent technologies do not compare
what was said at enrolment and verification, verification applications tend to also employ
speech recognition to determine what the user is saying at the point of authentication.
In text independent systems both acoustics and speech analysis techniques are used.
4.4.1 Dynamic Time Warping (DTW)
The Dynamic Time Warping algorithm is one of the oldest and most important
algorithms in speech recognition. The simplest way to recognize an isolated word sample
is to compare it against a number of stored word templates and determine the best match.
This goal is complicated by a number of factors. First, different samples of a given word
will have somewhat different durations. This problem can be eliminated by simply
normalizing the templates and the unknown speech so that they all have an equal
duration. However, another problem is that the rate of speech may not be constant
throughout the word; in other words, the optimal alignment between a template and the
speech sample may be nonlinear. Dynamic Time Warping (DTW) is an efficient method
for finding this optimal nonlinear alignment.
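A minimal sketch of the DTW idea follows, using one-dimensional feature sequences and an absolute-difference local distance. Both are simplifications for illustration; real recognizers align multi-coefficient frames with richer distance measures.

```java
// Sketch: DTW total alignment cost between two feature sequences,
// standard step pattern (diagonal, horizontal, vertical moves).
public class Dtw {
    static double align(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] cost = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++) {
                double d = Math.abs(a[i] - b[j]);  // local distance
                if (i == 0 && j == 0) { cost[i][j] = d; continue; }
                double best = Double.MAX_VALUE;
                if (i > 0) best = Math.min(best, cost[i - 1][j]);
                if (j > 0) best = Math.min(best, cost[i][j - 1]);
                if (i > 0 && j > 0) best = Math.min(best, cost[i - 1][j - 1]);
                cost[i][j] = d + best;             // cheapest path into (i, j)
            }
        return cost[n - 1][m - 1];
    }

    public static void main(String[] args) {
        // the same word spoken at different rates aligns with zero cost
        System.out.println(align(new double[]{1, 2, 3}, new double[]{1, 2, 2, 3}));
    }
}
```

To recognize a word, the unknown sample would be aligned against every stored template and the template with the lowest total cost declared the match.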
4.4.2 Hidden Markov Models (HMM)
The most flexible and successful approach to speech recognition so far has been Hidden
Markov Models (HMM). A Hidden Markov Model is a collection of states connected by
transitions. It begins with a designated initial state. In each discrete time step, a transition
is taken up to a new state, and then one output symbol is generated in that state. The
choice of transition and output symbol are both random, governed by probability
distributions.
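The generative view just described has a standard companion computation: the forward algorithm, which sums over all state paths to give the probability that the model produces a given observation sequence. The tiny two-state model below is invented purely for illustration:

```java
// Sketch: forward algorithm for a small discrete HMM.
// init[s]      = probability of starting in state s
// trans[p][s]  = probability of moving from state p to state s
// emit[s][o]   = probability of state s emitting output symbol o
public class HmmForward {
    static double forward(double[] init, double[][] trans,
                          double[][] emit, int[] obs) {
        int states = init.length;
        double[] alpha = new double[states];
        for (int s = 0; s < states; s++)
            alpha[s] = init[s] * emit[s][obs[0]];
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[states];
            for (int s = 0; s < states; s++) {
                double sum = 0.0;
                for (int p = 0; p < states; p++)
                    sum += alpha[p] * trans[p][s];  // all ways to reach s
                next[s] = sum * emit[s][obs[t]];
            }
            alpha = next;
        }
        double total = 0.0;
        for (double v : alpha) total += v;
        return total;
    }

    public static void main(String[] args) {
        double[] init = {1.0, 0.0};                    // always start in state 0
        double[][] trans = {{0.5, 0.5}, {0.0, 1.0}};
        double[][] emit = {{0.9, 0.1}, {0.2, 0.8}};    // two output symbols
        System.out.println(forward(init, trans, emit, new int[]{0, 1}));
    }
}
```

Replacing the sum over predecessor states with a max turns this into the Viterbi search mentioned earlier, which recovers the single best state path.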
4.4.3 Neural Networks
A neural network consists of many simple processing units (artificial neurons) each of
which is connected to many other units. Each unit has a numerical activation level
(analogous to the firing rate of real neurons). The only computation that an individual
unit can do is to compute a new activation level based on the activations of the units it is
connected to. The connections between units are weighted and the new activation is
usually calculated as a function of the sum of the weighted inputs from other units.
Some units in a network are usually designated as input units, which means that their
activations are set by the external environment. Other units are output units; their values
are set by the activation within the network and they are read as the result of a
computation.
Those units which are neither input nor output units are called hidden units.
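The per-unit computation described above can be sketched in a few lines; the logistic sigmoid used here is one common choice of activation function, not one mandated by the text:

```java
// Sketch: a single artificial unit computing its activation as a
// function of the weighted sum of its inputs plus a bias term.
public class Neuron {
    static double activate(double[] inputs, double[] weights, double bias) {
        double sum = bias;
        for (int i = 0; i < inputs.length; i++)
            sum += inputs[i] * weights[i];        // weighted input sum
        return 1.0 / (1.0 + Math.exp(-sum));      // logistic sigmoid squashing
    }

    public static void main(String[] args) {
        // zero net input gives an activation of exactly 0.5
        System.out.println(activate(new double[]{1, -1}, new double[]{2, 2}, 0));
    }
}
```

A network is then just many such units, with the outputs of hidden units feeding the inputs of others.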
Acoustic models characterize how sound changes over time. Each phoneme or speech
sound is modeled by a sequence of states and signal observation probability distributions
of sounds that you might hear (observe) in that state.
Sphinx4 is implemented using a 5-state phonetic model: each phone model has exactly
five states. At run-time, frames of the input audio are compared to the distributions in the
states to see which states might be likely producers of the observed audio.
Acoustic models that are matched to the conditions they will be used in perform best.
That is to say, English acoustic models work best for English, and telephone models work
best on the telephone. With SphinxTrain, we can train acoustic models for any language,
task, or channel condition.
An LM file (often with a .lm extension) is a Language model. The Language model
describes the likelihood, probability, or penalty taken when a sequence or collection of
words is seen. Sphinx4 uses N-gram models, and usually N is 3, so they are tri-gram
models, and these are sequences of three words. All the sequences of three words, two
words, and one word are combined together using back-off weights in order to assign
probabilities to sequences of words.
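The back-off combination described above can be sketched as follows. Note that real ARPA-format language models store log10 probabilities and back-off weights; the plain probabilities and the entries used here are invented for illustration only:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: back-off scoring for a tri-gram language model. Use the
// tri-gram probability when that tri-gram was seen in training;
// otherwise back off to the bi-gram (scaled by a back-off weight),
// and finally to the uni-gram.
public class BackoffLm {
    Map<String, Double> probs = new HashMap<>();    // seen n-gram probabilities
    Map<String, Double> backoff = new HashMap<>();  // back-off weights per history

    double score(String w1, String w2, String w3) {
        String tri = w1 + " " + w2 + " " + w3;
        if (probs.containsKey(tri)) return probs.get(tri);
        String bi = w2 + " " + w3;
        double bow = backoff.getOrDefault(w1 + " " + w2, 1.0);
        if (probs.containsKey(bi)) return bow * probs.get(bi);
        return bow * backoff.getOrDefault(w2, 1.0)
                   * probs.getOrDefault(w3, 1e-6);  // uni-gram floor
    }

    public static void main(String[] args) {
        BackoffLm lm = new BackoffLm();
        lm.probs.put("open the file", 0.4);   // seen tri-gram
        lm.probs.put("the door", 0.1);        // seen bi-gram only
        lm.backoff.put("open the", 0.5);
        System.out.println(lm.score("open", "the", "file"));
        System.out.println(lm.score("open", "the", "door"));
    }
}
```

The first query hits the stored tri-gram directly; the second backs off to the "the door" bi-gram scaled by the "open the" back-off weight.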
Finally, the decoder needs to know the pronunciations of words, and the Dictionary file
(often with a .dic extension) is a list of words, each with its sequence of phones.
4.5.3 Decoder
The decoder is the main component of the speech recognizer. It reads features from the
front end, couples these with data from the knowledge base and feedback from the
application, and
performs a search to determine the most likely sequences of words that could be
represented by a series of features. The term "search space" is used to describe the most
likely sequences of words, and is dynamically updated by the decoder during the
decoding process.
FreeTTS and CMU Sphinx-4 are open-source implementations of the Java Speech API.
They are expected to become reference implementations once the Sphinx-4 project is
complete.
FreeTTS is a speech synthesis system written entirely in the Java programming language.
It is based upon Flite 1.1, a small run-time speech synthesis engine developed at CMU.
Flite is derived from the Festival Speech Synthesis System from the University of
Edinburgh and the FestVox project from CMU. The goal of FreeTTS is to be quick and
small, because most Java applications run on the Internet; this requirement led to a
trade-off in its voice quality.
4.7 Snapshots
Chapter 5
Conclusion
Speech recognition will revolutionize the way people interact with smart devices and
will, ultimately, differentiate the upcoming technologies. Almost all the smart devices
coming to the market today are capable of recognizing speech, and many areas can
benefit from this technology. Speech recognition can be used for the intuitive operation
of such devices.
In this work, we propose and implement Book Reader, a TTS synthesizer that uses
MBROLA for speech synthesis. The main requirements are language independence and
understandability of the pronounced speech.
5.2 Limitations
Out-of-Vocabulary (OOV) Words
Systems have to maintain a huge vocabulary of words of different languages, sometimes
tuned to the user's phonetics as well. They are not capable of adjusting their vocabulary
to a change of user. Systems must have some method of detecting OOV words and
dealing with them in a sensible way.
Spontaneous Speech
Systems are unable to recognize speech properly when it contains disfluencies (filled
pauses, false starts, hesitations, ungrammatical constructions, etc.). Spontaneous speech
remains a problem.
Adaptability
Speech recognition software is not capable of adapting to various changing conditions,
which include a different microphone, background noise, a new speaker, a new task
domain, or even a new language. The efficiency of the software then degrades drastically.
Accent
Most systems are built for the common accent of a particular language, but the accents
of people vary over a wide range. This application supports the US accent only.
5.3 Future Scope
This work can be taken into more detail, and more work can be done on the project in
order to add modifications and additional features. The current software doesn't support
a large vocabulary; work will be done to accumulate a larger number of samples and
increase the efficiency of the software. The current version of the software supports only
the US accent, but more accents can be covered and effort will be made in this regard.