Speech Recognition Seminar Report

A
Seminar Report
On
SPEECH RECOGNITION
In partial fulfillment of requirements for the degree
Third Year Computer Engineering
By
GAIKWAD SURAJ VITTHAL

Exam Seat No. : T-80694222
Roll No. : 22
Under the guidance of
Prof. S. R. LAHANE
DEPARTMENT OF COMPUTER ENGINEERING
University Of Pune
Gokhale Education Society’s
R. H. Sapat College of Engineering,

Management Studies and Research,
Nashik - 422 005, (M.S.), INDIA
[2012 – 2013]
This is to certify that the seminar report entitled “SPEECH RECOGNITION”

is being submitted herewith by “GAIKWAD SURAJ VITTHAL, T-80694222” has
successfully completed his/her seminar work in partial fulfillment of
requirements for the degree of Third Year Computer Engineering of University Of
Pune.
Date:
Place: GES COEMSR, NASHIK

Gokhale Education Society’s
R. H. Sapat College of Engineering,

Management Studies and Research,
Nashik - 422 005, (M.S.), INDIA
Prof. S. R. LAHANE
Seminar Guide
Prof. N. V. Alone
Head of the Department
SPEECH RECOGNITION
i
ABSTRACT
Language is man's most important means of communication and speech its primary
medium.
Spoken interaction both between human interlocutors and between humans and
machines is
inescapably embedded in the laws and conditions of Communication, which comprise
the encoding
and decoding of meaning as well as the mere transmission of messages over an
acoustical channel.
Here we deal with this interaction between the man and machine through synthesis
and recognition
applications.
Speech recognition, involves capturing and digitizing the sound waves,
converting them to
basic language units or phonemes, constructing words from phonemes, and
contextually analyzing
the words to ensure correct spelling for words that sound alike. Speech
Recognition is the ability of
a computer to recognize general, naturally flowing utterances from a wide
variety of users. It
recognizes the caller's answers to move along the flow of the call.
Emphasis is given on the modeling of speech units and grammar on the
basis of Hidden
Markov Model& Neural Networks. Speech Recognition allows you to provide input to an
application
with your voice. The applications and limitations on this subject
enlighten the impact of speech
processing in our modern technical field.
While there is still much room for improvement, current speech
recognition systems have
remarkable performance. We are only humans, but as we develop this
technology and build
remarkable changes we attain certain achievements. Rather than asking what is still
deficient, we ask
instead what should be done to make it efficient.
SPEECH RECOGNITION
ii
TABLE OF CONTENTS
Chapter 1: Introduction
1.1 Introduction…………………………………………………………..……..1
1.2 Speech Recognition…………………………………...…………………….1
Chapter 2: Literature Survey
2.1 Speech Recognition Process……………………………………….………..3
2.2 Structure of Standard Speech Recognition System….……………………...4
2.3 Types of Speech Recognition System…………………………….……..….9
Chapter 3: System Analysis
3.1 Speech Recognition Algorithms……………………………………..…….11
3.1.1 Dynamic Time Warping………………….……….…………….….……11
3.1.2 Hidden Markov Model……………………………………………….…..11
3.1.3 Neural Network…………………………………………………………..12
Chapter 4: Discussion
4.1 Speech Recognition Softwares…………………………………………….14
4.2 Advantages & Disadvantages……………………………………………...18
4.2.1 Advantages.……………………………………………………………....18
4.2.2 Disadvantages……………………………………………………………19
4.3 Applications………………………………………………………………..20
SPEECH RECOGNITION
iii
Chapter 5: Conclusion & Future Scope
5.1 Conclusion………………………………………………………………....22
5.2 Future Scope………………………………………………...……….…….22
Acknowledgement…………………………………………………………………….24
Bibliography……………………………………………………………………….….25
SPEECH RECOGNITION
iv
LIST OF ABBREVIATIONS
HMM: Hidden Markov Model
SR: Speech Recognition
SRS: Speech Recognition System
OOV: Out of Vocabulary
DTW: Dynamic time warping
ASR: Automatic Speech Recognition
OS: Operating System
LVCSR: Large Vocabulary Continuous Speech Recognition
IRIS: Intelligent Rival Imitator of SIRI
SPEECH RECOGNITION
v
LIST OF FIGURES
Figure
No.
Title
Page
No.
1.1 Speech Recognition 2
2.1 Typical Speech Recognition System 4
2.2 Signal analysis converts raw speech to speech frames. 5
2.3 Acoustic models: template and state representations 6
2.4
The alignment path with the best total score identifies the word
sequence and segmentation
7
3.1 Simple HMM with two states & two output symbols 11
3.2 Unit activations for neural network 13
4.1 Julius SR Engine Interface 14
4.2 Google Now Interface 15
4.3 Dragon Naturally Speaking Interface 17
4.4
Windows Speech Recognition Interface
17
SPEECH RECOGNITION
1
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Have you ever talked to your computer? (And no, yelling at it when your Internet
connection
goes down or making polite chit-chat with it as you wait for all 25MB of that very
important file to
download doesn't count). I mean, have you really, really talked to your computer?
Where it actually
recognized what you said and then did something as a result? If you
have, then you've used a
technology known as speech recognition.
Speech recognition allows you to provide input to a system with your voice. Just
like clicking
with your mouse, typing on your keyboard, or pressing a key on the phone keypad
provides input to
an application, speech recognition allows you to provide input by talking. In the
desktop world, you
need a microphone to be able to do this.
1.2 SPEECH RECOGNITION
Speech recognition (or sometimes referred to as Automatic Speech
Recognition) is the
process by which a computer (or other type of machine) identifies spoken words.
Basically, it means
talking to a computer & having it correctly understand what you are
saying. By “understand” we
mean, the application to react appropriately or to convert the input
speech to another medium of
conversation which is further perceivable by another application that can
process it properly &
provide the user the required result.
The days when you had to keep staring at the computer screen and frantically hit
the key or
click the mouse for the computer to respond to your commands may soon be a things
of past. Today
we can stretch out and relax and tell your computer to do your bidding. This has
been made possible
by the ASR (Automatic Speech Recognition) technology.
Speech recognition is an alternative to traditional methods of interacting
with a computer,
such as textual input through a keyboard. An effective system can
replace, or reduce the reliability
on, standard keyboard and mouse input. This can especially assist the following:
 People who have little keyboard skills or experience, who are slow typists, or
do not have the
time or resources to develop keyboard skills.
SPEECH RECOGNITION
2
 Dyslexic people or others who have problems with character or word use and
manipulation in
a textual form.
 People with physical disabilities that affect either their data entry,
or ability to read (and
therefore check) what they have entered.
Figure 1.1 – Speech Recognition
SPEECH RECOGNITION
3
CHAPTER 2
LITERATURE SURVEY
2.1 SPEECH RECOGNITION PROCESS

In humans the speech or acoustic signals are received by the ears &
then transmitted to the
brain for understanding & extracting the meaning out of the speech & then to react
it appropriately.
Speech recognition enabled computer or devices too, work under the same
principle. They receive
the acoustic signal through microphone; these signals are in analog form & need to
be digitalized to
be understood by the system. The signals are then digitalized & sent to
the processing unit for
extracting the meaning out of the signals & to give the desired output to the user.
Any speech recognition system involves following five major steps:
1. Signal Processing
The sound is received through the microphone in the form of analog electrical
signals.
These signals consist of the voice of the user & the noise from the surroundings.
The
noise is then removed & the signals are converted into digital signal.
These digital
signals are converted into a sequence of feature vectors.
(Feature Vector - If you have a set of numbers representing certain
features of an
object you want to describe, it is useful for further processing to construct a
vector out
of these numbers by assigning each measured value to one component of the vector.)
2. Speech Recognition
This is the most important part of this process; here the actual
recognition is done.
The sequence of feature vectors is then decoded into a sequence of
words. This
decoding is done on the basis of algorithms such as Hidden Markov
Model, Neural
Network or Dynamic Time Wrapping. The program has big dictionary of
popular
words that exist in language. Each feature vector is matched against the
sound
&converted into appropriate character group. It checks and compares words
that are
similar in sound with the formed character groups. All these similar
words are then
collected.
SPEECH RECOGNITION
4
3. Semantic Interpretation
Here it checks if the language allows a particular syllable to appear
after another.
After that, there will be grammar check. It tries to find out whether
or not the
combination of words any sense.
4. Dialog Management
The errors encountered are tried to be corrected. Then the meaning of
the combined
words is extracted & the required task is performed.
5. Response Generation
After the task is performed, the response or the result of that task
is generated. The
response is either in the form of a speech or text. What words to use
so as to
maximize the user understanding, are decided here. If the response is to
be given in
the form of speech, then Text to Speech conversion process is used.
2.2 STRUCTURE OF STANDARD SPEECH RECOGNITION SYSTEM
Figure 2.1 – Typical Speech Recognition System

SPEECH RECOGNITION
5
The structure of a standard speech recognition system is illustrated in
Figure 2.1. The
elements are as follows:
 Raw speech - Speech is typically sampled at a high frequency, e.g.,
16 KHz over a
microphone or 8 KHz over a telephone. This yields a sequence of amplitude values
over time.
 Signal analysis - Raw speech should be initially transformed and
compressed, in order to
simplify subsequent processing. Many signal analysis techniques are
available which can
extract useful features and compress the data by a factor of ten without losing any
important
information.
Figure 2.2 - Signal analysis converts raw speech to speech frames.
 Speech frames - The result of signal analysis is a sequence of speech frames,

typically at 10
milliseconds intervals, with about 16 coefficients per frame. These frames maybe
augmented
by their own first and/or second derivatives, providing explicit
information about speech
dynamics; this typically leads to improved performance. The speech frames
are used for
acoustic analysis.
 Acoustic models - In order to analyze the speech frames for their acoustic
content, we need a
set of acoustic models. There are many kinds of acoustic models, varying
in their
representation, granularity, context dependence, and other properties.
During training, the
acoustic models are incrementally modified in order to optimize the
overall performance of
the system. During testing, the acoustic models are left unchanged.
SPEECH RECOGNITION
Figure 2.3 - Acoustic models: template and state representations for the word
“cat”.
 Acoustic analysis and frame scores - Acoustic analysis is performed by
applying each
acoustic model over each frame of speech, yielding a matrix of frame
scores, as shown in
Figure 2.3. Scores are computed according to the type of acoustic model
that is being used.
For template-based acoustic models, a score is typically the Euclidean
distance between a
template’s frame and an unknown frame. For state-based acoustic models, a score
represents
an emission probability, i.e., the likelihood of the current state generating the
current frame,
as determined by the state’s parametric or non-parametric function.
SPEECH RECOGNITION
7
Figure 2.4 - The alignment path with the best total score identifies
the word sequence and
segmentation.
 Time alignment - Frame scores are converted to a word sequence by identifying a
sequence
of acoustic models, representing a valid word sequence, which gives the best total
score along
an alignment path through the matrix. The process of searching for the best
alignment path is
called time alignment.
An alignment path must obey certain sequential constraints which reflect the fact
that speech
always goes forward, never backwards. These constraints are manifested
both within and
between words. Within a word, sequential constraints are implied by the sequence of
frames
(for template-based models), or by the sequence of states (for state-
based models) that
comprise the word, as dictated by the phonetic pronunciations in a
dictionary, for example.
Between words, sequential constraints are given by a grammar, indicating
what words may
follow what other words.
Time alignment can be performed efficiently by dynamic programming, a general
algorithm
which uses only local path constraints, and which has linear time and
space requirements.
SPEECH RECOGNITION
8
(This general algorithm has two main variants, known as Dynamic Time
Warping (DTW)
and Viterbi search, which differ slightly in their local computations and
in their optimality
criteria.)
In a state-based system, the optimal alignment path induces segmentation
on the word
sequence, as it indicates which frames are associated with each state. This
segmentation can
be used to generate labels for recursively training the acoustic models
on corresponding
frames.
 Word sequence - The end result of time alignment is a word sequence
- the sentence
hypothesis for the utterance. Actually it is common to return several such
sequences, namely
the ones with the highest scores, using a variation of time alignment
called N-best search.
This allows a recognition system to make two passes through the unknown utterance:
the first
pass can use simplified models in order to quickly generate an N-best
list, and the second
pass can use more complex models in order to carefully rescore each of
the N hypotheses,
and return the single best hypothesis.
SPEECH RECOGNITION
9
2.3 TYPES OF SPEECH RECOGNITION SYSTEMS
Speech recognition systems can be separated in several different classes
by describing what
types of utterances they have the ability to recognize. These classes are based on
the fact that one of
the difficulties of SR is the ability to determine when a speaker starts and
finishes an utterance. Most
packages can fit into more than one class, depending on which mode they're using.
 Isolated Word
Isolated word recognizers usually require each utterance to have quiet
(lack of an audio
signal) on BOTH sides of the sample window. It doesn't mean that it
accepts single words,
but does require a single utterance at a time. Often, these systems have
"Listen/Not−Listen"
states, where they require the speaker to wait between utterances
(usually doing processing
during the pauses).
 Connected Word
Connect word systems (or more correctly 'connected utterances') are
similar to Isolated
words, but allow separate utterances to be 'run−together' with a minimal pause
between them.
 Continuous Speech
Recognizers with continuous speech capabilities are some of the most
difficult to create
because they must utilize special methods to determine utterance
boundaries. Continuous
speech recognizers allow users to speak almost naturally, while the computer
determines the
content. Basically, it's computer dictation.
 Spontaneous Speech
At a basic level, it can be thought of as speech that is natural sounding and not
rehearsed. An
ASR system with spontaneous speech ability should be able to handle a
variety of natural
speech features such as words being run together, "ums" and "ahs", and even slight
stutters.
 Voice Verification/Identification
Some ASR systems have the ability to identify specific users by
characteristics of
their voices (voice biometrics). If the speaker claims to be of a certain identity
and the voice
is used to verify this claim, this is called verification or
authentication. On the other
hand, identification is the task of determining an unknown speaker's
identity. In a
SPEECH RECOGNITION
10
sense speaker verification is a 1:1 match where one speaker's voice is
matched to one
template (also called a "voice print" or "voice model") whereas speaker
identification is a 1:
N match where the voice is compared against N templates.
There are two types of voice verification/identification system, which are as
follows:
 Text-Dependent:
If the text must be the same for enrollment and verification this is
called text-
dependent recognition. In a text-dependent system, prompts can either be
common
across all speakers (e.g.: a common pass phrase) or unique. In addition,
the use of
shared-secrets (e.g.: passwords and PINs) or knowledge-based information
can be
employed in order to create a multi-factor authentication scenario.
 Text-Independent:
Text-independent systems are most often used for speaker identification as
they
require very little if any cooperation by the speaker. In this case the
text during
enrollment and test is different. In fact, the enrollment may happen without the
user's
knowledge, as in the case for many forensic applications. As text-
independent
technologies do not compare what was said at enrollment and verification,
verification
applications tend to also employ speech recognition to determine what the
user is
saying at the point of authentication.
In text independent systems both acoustics and speech analysis techniques are used.
SPEECH RECOGNITION
11
CHAPTER 3
SYSTEM ANALYSIS
3.1 SPEECH RECOGNITION ALGORITHMS

3.1.1 Dynamic Time Warping
Dynamic Time Warping algorithm is one of the oldest and most important
algorithms in
speech recognition. The simplest way to recognize an isolated word sample is to
compare it
against a number of stored word templates and determine the “best
match”. This goal
depends upon a number of factors. First, different samples of a given
word will have
somewhat different durations. This problem can be eliminated by simply
normalizing the
templates and the unknown speech so that they all have an equal duration. However,
another
problem is that the rate of speech may not be constant throughout the word; in
other words,
the optimal alignment between a template and the speech sample may be nonlinear.
Dynamic
Time Warping (DTW) is an efficient method for finding this optimal nonlinear
alignment
 Hidden Markov Model
The most flexible and successful approach to speech recognition so far
has been Hidden
Markov Models (HMM).A Hidden Markov Model is a collection of states
connected by
transitions. It begins with a designated initial state. In each discrete time step,
a transition is
taken up to a new state, and then one output symbol is generated in that state. The
choice of
transition and output symbol are both random, governed by probability
distributions.
Figure 3.1– Simple HMM with two states & two output symbols
SPEECH RECOGNITION
12
Formally, an HMM consists of the following elements:
{s} = A set of states.
{o
ì]
} = A set of transition probabilities, where o
ì]
is the probability of taking thetransition
from state i to state j.
{b
ì
(u)} = A set of emission probabilities, where b
ì
is the probability distributionover the
acoustic space describing the likelihood of emittingeach possible sounduwhile in
state i.
Since o
ì]
and b
ì
are both probabilities, they must satisfy the following properties:
o
ì]
≥ 0, b
ì
(u) ≥ 0, ∀ i, ], u
o
ì]
]
= 1, ∀i
b
ì
(u)
u
= 1, ∀i
 Neural Networks
A neural network consists of many simple processing units (artificial neurons) each
of which
is connected to many other units. Each unit has a numerical activation level
(analogous to the
firing rate of real neurons). The only computation that an individual unit can do
is to compute
a new activation level based on the activations of the units it is connected to.
The connections
between units are weighted and the new activation is usually calculated as
a function of the
sum of the weighted inputs from other units.
Some units in a network are usually designated as input units which

mean that their
activations are set by the external environment. Other units are output units,
their values are
set by the activation within the network and they are read as the
result of a computation.
Those units which are neither input nor output units are called hidden units.
SPEECH RECOGNITION
13
A given unit is typically updated in two stages: first we compute the
unit’s net input (or
internal activation), and then we compute its output activation as a function of
the net input.
In the standard case, the net input x
ì
for unit j is just the weighted sum of its inputs:
x
ì
= y
ì
w
]ì
ì
Here y
ì
is the output activation of an incoming unit, &w
]ì
is the weight from unit i to unit j.
Figure 3.2 – Unit activations for neural network.
SPEECH RECOGNITION
14
CHAPTER 4
DISCUSSION
4.1 SPEECH RECOGNITION SOFTWARES

There are ample of Speech Recognition Softwares available in the
market. These softwares
are available for various kinds of platforms including Smart phones, PCs, Tablets
etc& are designed
for different Operating Systems as well.
 Julius
Figure 4.1 – Julius SR Engine Interface

 Open source& Freeware speech recognition engine
 Developed by - Nagoya Institute of Technology
 Developed in C language.
 Operating systems – Unix, Windows
 Language available in – Japanese
SPEECH RECOGNITION
15
 High-performance, two-pass large vocabulary continuous speech recognition
(LVCSR)
decoder software for speech-related researchers and developers.
 Google Now
Figure 4.2 – Google Now Interface

 An intelligent personal assistant software
 Developed by - Google
 Operating System – Android 4.1& later.
 Language available in – English
 Google Now is implemented as an aspect of the Google Search application. It
recognizes
repeated actions that a user performs on the device & to display more
relevant
information to the user in the form of "cards".
 SIRI
 An intelligent personal assistant and knowledge navigator software.
 Developed by – Apple Inc.
 Operating Systems – iOS 5 & later.
SPEECH RECOGNITION
16
 Platform - iPhone (4S and later),iPod Touch (5th generation),iPad (3rd
generation and
later)
 Languages available in - English, French, German, Japanese, Chinese,
Korean, Italian,
Spanish
 The application uses a natural language user interface to answer
questions, make
recommendations, and perform actions by delegating requests to a set of Web
services.
 S Voice
 An intelligent personal assistant and knowledge navigator software.
 Developed by – Samsung
 Operating System – Android 4.0 & 4.1
 Platform – Samsung Galaxy S III, Samsung Galaxy Note II, Samsung Galaxy Note
10.1,
and Samsung Galaxy Stellar
 Languages available in - English, Arabic, French, Spanish, Korean, Italian, and
German
 The application uses a natural language user interface to answer
questions, make
recommendations, and perform actions by delegating requests to a set of Web
services.
 Iris (Intelligent Rival Imitator of SIRI)
 A personal assistant application for Android.
 Developed by –Dextra Software Solutions (Narayan Babu& team, Kochi, India)
 Operating System - Android
 Developed in 8 hours.
 The application uses natural language processing to answer questions based on
user voice
request.
 Iris can talk on topics ranging from Philosophy, Culture, History,
science to general
conversation.
SPEECH RECOGNITION
17
 Dragon NaturallySpeaking
Figure 4.3 – Dragon Naturally Speaking Interface

 A speech recognition software package
 Developed by - Nuance Communications
 Operating System – Windows
 The software has three primary areas of functionality: dictation, text-
to-speech and
command input.
 Windows Speech Recognition
Figure 4.4 – Windows Speech Recognition Interface

 A speech recognition application
 Developed by – Microsoft
 Operating System - Windows Vista, Windows 7 and Windows 8
 Languages available in - English (U.S. and British), Spanish, German,
French, Japanese
and Chinese
 Allows the user to control the computer by giving specific voice commands. The
program
can also be used for the dictation of text so that the user can enter text using
their voice
 Has a fairly high recognition accuracy and provides a set of commands
that assists in
dictation
SPEECH RECOGNITION
18
4.2 ADVANTAGES& DISADVANTAGES

4.2.1 Advantages
 Increases productivity
By speaking normally into the SRS program, you create documents at the
speed you can
compose them in your head. People without strong typing skills or those who don't
wish to be
slowed down by manual input can use voice recognition software to
dramatically reduce
document creation time.
 Can help with menial computer tasks, such as browsing and scrolling
People are becoming lazy day by day. They are also not interested in doing the
necessary
routine work even. Previously there where punch cards to provide input
to the system,
then there came the keyboard, track ball, touch screen, mouse, gesture
control, joysticks
etc; all the previously used input methods require motion of hand or
fingers. But, with
SRS user can provide input to the system through just his voice. He can complete
most of
his menial computer tasks easily.
 Can help people with disabilities
More recently students with learning or physical disabilities have been
able to use SRS.
Those with learning disabilities that affect their ability to write can now
complete exams via
voice recognition technology, and those with physical disabilities such as
upper body
paralysis can use SRS to communicate effectively with others.
 Cost effective
In a study of traditional transcription services versus voice recognition
software, Dr. Robert
G. Zick and Dr. Jon Olsen found that using SRS had a slightly lower accuracy rate
(98.5% v/s
99.7%), but was more cost effective overall.
 Diminishes spelling mistakes
Even the most experienced typists will occasionally have a spelling
blunder; the average
person is likely to make several mistakes in his or her composition. SRS always
provides the
SPEECH RECOGNITION
19
correct spelling of a word (assuming it translated it accurately in the
first place), thus
eliminating the need to spend time running spell checkers.
4.2.2 Disadvantages
 Inaccuracy & Slowness
Most people cannot type as fast as they speak. In theory, this should make voice
recognition
software faster than typing for entering text on a computer. However, this may not
always be
the case because of the proofreading and correction required after dictating a
document to the
computer. Although voice recognition software may interpret your spoken
words correctly
the majority of the time, you might still need to make corrections to
punctuation.
Additionally, the software may not recognize words such as brand names
or uncommon
surnames until you add them to the program's library of words. SR
systems are unable to
recognize the words which are phonetically similar. E.g. “there” & “their”.
• Vocal Strain
Using voice recognition software, you may find yourself speaking more loudly than
in normal
conversation. In 2000, Linda L. Grubbs of PC World magazine reported that this
habit could
lead to vocal cord injury. Although there is no definite scientific link
established between the
use of voice recognition software and damage to the voice, talking
loudly for extended
periods always carries the possibility of causing strain and hoarseness.
• Adaptability
Speech Recognition softwares are not capable of adapting to various changing
conditions
which include different microphone, background noise, new speaker, new
task domain,
new language even. The efficiency of the software degrades drastically.
• Out-of-Vocabulary (OOV) Words
Systems have to maintain a huge vocabulary of word of different language &
sometimes
according to the user phonetics also. They are not capable of adjust
their vocabulary
according to the change in users. Systems must have some method of
detecting OOV
words, and dealing with them in a sensible way.
SPEECH RECOGNITION
20
• Spontaneous Speech
Systems are unable to recognize the speech properly when it contains disfluencies
(filled
pauses, false starts, hesitations, ungrammatical constructions etc.).
Spontaneous speech
remains a problem.
• Prosody
Systems are unable to process Prosody (study of speech rhythms). Stress,
intonation, and
rhythm convey important information for word recognition and the user's intentions
(e.g.,
sarcasm, anger).
• Accent, dialect and mixed language
Mostly all the systems are made according to the common accent of the
particular
language. But the accent of people varies in a wide range. Dialect of
the people also
varies according to the regions. Systems are not capable of adjust according to all
of these
accent & dialect changes. People also sometimes use mixed language mode
for
conversation & mostly SR systems work on a single language model at a time.
5.3 APPLICATIONS
 Games and Edutainment
Speech recognition offers game and edutainment developers the potential to
bring their
applications to a new level of play. With games, for example,
traditional computer-based
characters could evolve into characters that the user can actually talk to.
 Data Entry
Applications that require users to keyboard paper-based data into the
computer (such as
database front-ends and spreadsheets) are good areas for a speech
recognition application.
Reading data directly to the computer is much easier for most users and
can significantly
speed up data entry.
While speech recognition technology cannot effectively be used to enter names, it
can enter
numbers or items selected from a small (less than 100 items) list. Some recognizers
can even
handle spelling fairly well. If an application has fields with mutually
exclusive data types
(for example, one field allows "male" or "female", another is for age, and a third
is for city),
SPEECH RECOGNITION
21
the speech recognition engine can process the command and automatically determine
which
field to fill in.
 Document Editing
This is a scenario in which one or both modes of speech recognition
could be used to
dramatically improve productivity. Dictation would allow users to dictate
entire documents
without typing. Command and control would allow users to modify
formatting or change
views without using the mouse or keyboard. For example, a word processor
might provide
commands like "bold", "italic", "change to Times New Roman font", "use
bullet list text
style," and "use 18 point type." A paint package might have "select eraser" or
"choose a wider
brush."
 Speaker Identification
Recognizing the patterns of speech of a various persons can be used to
identify them
separately. It can be used as a Biometric authentication system in which
the user
authenticates him/her self with the help of their speech. The various
characteristics of speech
which involves frequency, amplitude & other special features are captured &
compared with
the previously stored database.
 Automation at Call Centers

Receiving call from a huge number of customers, answering them or
diverting them to a
particular customer care representative according to the customers demand. It can
be used to
provide a faster response to the customer & provide better service.
 Medical Disabilities
This technology is a great boon for blind & handicapped as they can
utilize the speech
recognition technology for various works. Those who are unable to operate
the computer
through keyboard & mouse can operate it with just their voice.
 Fighter Aircrafts
Pilots in fighter aircrafts have to keep a check on various functions going on in
the aircraft.
They have to provide a faster response to the sudden changes in the aircraft
maneuver. They
can give commands with their voice commands. It requires building a
pilot voice template
before. The actions are confirmed through visual or aural feedback.
SPEECH RECOGNITION
22
CHAPTER 5
CONCLUSION & FUTURE SCOPE
5.1 CONCLUSION
 Speech recognition will revolutionize the way people interacted with
Smart devices & will,
ultimately, differentiate the upcoming technologies. Almost all the smart
devices coming
today in the market are capable of recognizing speech. Many areas can
benefit from this
technology. Speech Recognition can be used for intuitive operation of
computer-based
systems in daily life.
 This technology will spawn revolutionary changes in the modern world and become
a pivot
technology. Within five years, speech recognition technology will become so
pervasive in our
daily lives that service environments lacking this technology will be considered
inferior.
5.2 FUTURE SCOPE

 Achieving efficient speaker independent word recognition
All the SR systems will be speaker independent and will produce the
same kind out
output for a particular command irrespective of the user. SR systems
will be able to
process the voice commands of all the users with very high accuracy & efficiency.
 Ability to distinguish nuances of speech and meanings of words
SR systems would be able to distinguish between nuances phrases &
meaningful
commands, & would be able to process the proper command out of the nuances phrases
correctly.
 Stand-alone Speech Recognition Systems
Presently there is no SR stand-alone systems available, all the SR systems been
developed
are based on one or the other preexisting hardware and software
platforms. But in near
future Stand Alone SR systems might be available in the market.
SPEECH RECOGNITION
23
 Wearable Speech Recognition System
SR systems will be embedded in wearable devices or things such as wrist
watch,
necklace, bracelet etc. There will be no need of carrying bulky devices and the
technology
can be used on the go.
 Talk with all the devices.
All the devices including Smart phones, Computers, Television,
Refrigerator, Washing
Machines etc will be controlled with the voice commands of the user.
There will be no
need of having a Remote or pressing buttons on the device to interact with it.
SPEECH RECOGNITION
24
ACKNOWLEDGMENT
I would like to avail this opportunity to express deep gratitude to

my seminar
guide Prof. S. R. Lahane who took keen interest in the topic and
provided
excellent guidance and motivation for the completion of my seminar.
I would also like to thank Prof. N. V. Alone (Head of Department,

Computer
Engineering), Prof. Dr. P. C. Kulkarni (Principal, GES RHS COEMSR) and
all
the faculty members of the college for their help and support.
I would also like to thank my parents and friends, without their

continuous
motivation, help and support this would not have been possible.
Suraj Vitthal Gaikwad

T.E. Computer
Exam Seat Number: T80694222
SPEECH RECOGNITION
25
BIBLIOGRAPHY
[1] JOE TEBELSKIS {1995}, SPEECH RECOGNITION USING NEURAL NETWORKS,
School of Computer Science, Carnegie Mellon University
[2] KÅRE SJÖLANDER {2003}, An HMM-based system for automatic segmentation
and
alignment of speech, Umeå University, Department of Philosophy and Linguistics
[3] KLAUS RIES {1999}, HMM AND NEURAL NETWORK BASED SPEECH ACT
DETECTION, International Conference on Acoustics and Signal Processing (ICASSP’99)
[4] B. PLANNERER {2005}, AN INTRODUCTION TO SPEECH RECOGNITION
[5] KIMBERLEE A. KEMBLE, AN INTRODUCTION TO SPEECH RECOGNITION, Voice
Systems Middleware Education, IBM
[6] LAURA SCHINDLER {2005}, A SPEECH RECOGNITION AND SYNTHESIS TOOL,
Department of Mathematics and Computer Science, College of Arts and
Science, Stetson
University
[7] MIKAEL NILSSON, MARCUS EGNARSSON {2002}, SPEECH RECOGNITION USING
HMM, Blekinge Institute Of technology

Speech Recognition Seminar Report

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Speech Recognition Seminar Report

Uploaded by

Copyright:

Available Formats

A

In partial fulfillment of requirements for the degree

Third Year Computer Engineering

GAIKWAD SURAJ VITTHAL

Under the guidance of

DEPARTMENT OF COMPUTER ENGINEERING

R. H. Sapat College of Engineering,

This is to certify that the seminar report entitled “SPEECH RECOGNITION”

Place: GES COEMSR, NASHIK

R. H. Sapat College of Engineering,

Figure 1.1 – Speech Recognition

2.1 SPEECH RECOGNITION PROCESS

Figure 2.1 – Typical Speech Recognition System

Figure 2.2 - Signal analysis converts raw speech to speech frames.

 Speech frames - The result of signal analysis is a sequence of speech frames,

3.1 SPEECH RECOGNITION ALGORITHMS

Some units in a network are usually designated as input units which

Figure 3.2 – Unit activations for neural network.

4.1 SPEECH RECOGNITION SOFTWARES

Figure 4.1 – Julius SR Engine Interface

Figure 4.2 – Google Now Interface

Figure 4.3 – Dragon Naturally Speaking Interface

Figure 4.4 – Windows Speech Recognition Interface

4.2 ADVANTAGES& DISADVANTAGES

 Automation at Call Centers

5.2 FUTURE SCOPE

I would like to avail this opportunity to express deep gratitude to

I would also like to thank Prof. N. V. Alone (Head of Department,

I would also like to thank my parents and friends, without their

Suraj Vitthal Gaikwad

You might also like