
(7KS09) Seminar Report

On

Speech recognition

Submitted By

Komal Pralhad Gulhane

Seventh Semester
B.E.(Computer Science & Engineering)

Guided by

Prof. A. A. Chaudhari

PRMIT&R

Department of Computer Science & Engineering,


Prof. Ram Meghe Institute of Technology & Research,
Badnera - Amravati
2021-2022

CERTIFICATE

This is to certify that the seminar report entitled

Speech Recognition
is a bonafide work and it is submitted to the Sant Gadge Baba
Amravati University, Amravati
By

Komal Pralhad Gulhane

Seventh Semester B.E. (Computer Science & Engineering)

in partial fulfillment of the requirements for the degree of Bachelor of
Engineering in Computer Science & Engineering, during the academic year
2021-2022, under my guidance.

Prof. A. A. Chaudhari                      Dr. G. R. Bamnote
Guide                                      HOD
Department of Computer Sci. & Engg.        Department of Computer Sci. & Engg.
PRM Institute Of Technology & Research,    PRM Institute Of Technology & Research,
Badnera                                    Badnera

ACKNOWLEDGEMENT

I express my sincere gratitude to my guide Prof. A. A. Chaudhari, Department
of Computer Science & Engineering, for making available her valuable time and
support, and for guiding me through this seminar with timely inputs,
suggestions and encouragement.

I am extremely grateful to Dr. G. R. Bamnote, HOD, Department of
Computer Science & Engineering, for providing me with a platform for the
successful completion of my seminar work.

I would also like to thank all my professors for their constant guidance and
support throughout the course of the seminar.

Lastly, I would like to extend my heartfelt gratitude to the authors of the
references and literature referred to in the seminar.

Komal Gulhane

ABSTRACT

Language is man's most important means of communication, and speech is its
primary medium. Spoken interaction, both between human interlocutors and
between humans and machines, is inescapably embedded in the laws and
conditions of communication, which comprise the encoding and decoding of
meaning as well as the mere transmission of messages over an acoustical channel.
Here we deal with this interaction between man and machine through synthesis
and recognition applications.

Speech recognition involves capturing and digitizing the sound waves,
converting them to basic language units or phonemes, constructing words from
phonemes, and contextually analyzing the words to ensure correct spelling for
words that sound alike. Speech recognition is the ability of a computer to
recognize general, naturally flowing utterances from a wide variety of users.
In a telephone application, for example, it recognizes the caller's answers in
order to move along the flow of the call.

Emphasis is given to the modeling of speech units and grammar on the
basis of Hidden Markov Models and Neural Networks. Speech recognition allows
you to provide input to an application with your voice. The applications and
limitations of this subject illustrate the impact of speech processing on our
modern technical field.

While there is still much room for improvement, current speech recognition
systems have remarkable performance. We are only human, but as we develop
this technology and make remarkable changes, we attain certain achievements.
Rather than asking what is still deficient, we ask instead what should be done to
make it efficient.

TABLE OF CONTENTS

Chapter   Topic

1         Introduction

1.1       Speech Recognition

1.2       History of Speech Recognition

1.3       Working

2         Types of Speech Recognition

3         Advantages of Speech Recognition

3.1       Disadvantages of Speech Recognition

4         Applications

5         Conclusion & Future Scope

          References

1. INTRODUCTION

Speech recognition is the process of converting an acoustic signal, captured by a
microphone or a telephone, to a set of words. The recognized words can be an end
in themselves, as for applications such as command and control, data entry, and
document preparation.

They can also serve as the input to further linguistic processing in order to achieve
speech understanding. Speech recognition is also known as Automatic Speech
Recognition (ASR), computer speech recognition, or speech to text (STT). It allows
you to provide input to a system with your voice. Just as clicking with your mouse,
typing on your keyboard, or pressing a key on the phone keypad provides input to
an application, speech recognition allows you to provide input by talking. In the
desktop world, you need a microphone to be able to do this.

1.1 Speech Recognition

Speech recognition (sometimes referred to as Automatic Speech
Recognition) is the process by which a computer (or other type of machine)
identifies spoken words. Basically, it means talking to a computer and having it
correctly understand what you are saying. By “understand” we mean that the
application reacts appropriately, or converts the input speech to another
medium that a further application can process properly to give the user the
required result.

The days when you had to keep staring at the computer screen and
frantically hit keys or click the mouse for the computer to respond to your
commands may soon be a thing of the past. Today you can stretch out, relax and
tell your computer to do your bidding. This has been made possible by ASR
(Automatic Speech Recognition) technology.

Speech recognition is an alternative to traditional methods of interacting
with a computer, such as textual input through a keyboard. An effective system
can replace, or reduce the reliance on, standard keyboard and mouse input. This
can especially assist the following:

• People who have little keyboard skill or experience, who are slow typists,
or who do not have the time or resources to develop keyboard skills.

• Dyslexic people or others who have problems with character or word use
and manipulation in a textual form.

• People with physical disabilities that affect either their data entry or
their ability to read (and therefore check) what they have entered.

Figure 1.1 – Speech Recognition

1.2 History of Speech Recognition


The first speech recognition systems were focused on numbers, not words. In
1952, Bell Laboratories designed the “Audrey” system, which could recognize a
single voice speaking digits aloud. Ten years later, IBM
introduced “Shoebox”, which understood and responded to 16 words in English.
By the year 2001, speech recognition technology had achieved close to 80%
accuracy. For most of that decade there were few advancements until
Google arrived with the launch of Google Voice Search. Because it was an app,
this put speech recognition into the hands of millions of people.

In 2011 Apple launched Siri, which was similar to Google’s Voice Search. The
early part of this decade saw an explosion of other voice recognition apps. And
with Amazon’s Alexa and Google Home, we’ve seen consumers becoming more and
more comfortable talking to machines.

Today, some of the largest tech companies are competing for the speech
accuracy title. In 2016, IBM achieved a word error rate of 6.9 percent. Later
that year, Microsoft overtook IBM with a 5.9 percent claim. Shortly after that,
IBM improved its rate to 5.5 percent. However, it is Google that is claiming the
lowest rate, at 4.9 percent.

1.3 Working

The first component of speech recognition is, of course,
speech. Speech must be converted from physical sound to an electrical signal
with a microphone, and then to digital data with an analog-to-digital converter.
Once digitized, several models can be used to transcribe the audio to text.

• Installing SpeechRecognition: $ pip install SpeechRecognition

• Working with a microphone: install the PyAudio package ($ pip install PyAudio)

Figure 1.3 – Structure of Standard Speech Recognition System
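
The analog-to-digital step described in this section can be illustrated with a
minimal sketch. It assumes a pure sine tone stands in for the microphone signal,
and the function name, the 16 kHz sampling rate and the 16-bit depth are
illustrative choices, not values taken from any particular recognizer.

```python
import math


def digitize(duration_s=0.01, sample_rate=16000, freq_hz=440.0, bits=16):
    """Sample a sine 'sound wave' and quantize it to signed integers."""
    max_level = 2 ** (bits - 1) - 1            # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                    # sampling: measure at fixed time steps
        amplitude = math.sin(2 * math.pi * freq_hz * t)   # "analog" value in [-1, 1]
        samples.append(round(amplitude * max_level))      # quantization to an integer
    return samples


pcm = digitize()
print(len(pcm), min(pcm), max(pcm))
```

A higher sampling rate captures faster changes in the sound, and a larger bit
depth gives finer amplitude steps; both increase the amount of data the
recognizer has to process.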

2. TYPES OF SPEECH RECOGNITION SYSTEMS

Speech recognition systems can be separated into several different classes by
describing what types of utterances they have the ability to recognize. These
classes are based on the fact that one of the difficulties of speech recognition
(SR) is determining when a speaker starts and finishes an utterance. Most
packages can fit into more than one class, depending on which mode they are using.
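
One simple way to decide when a speaker starts and finishes an utterance is to
look for stretches of the signal whose energy rises above a quiet threshold.
The sketch below is only an illustration of that idea; the frame length, the
threshold and the toy signal are assumptions, and real systems use more robust
voice activity detection.

```python
def find_utterances(samples, frame_len=4, threshold=0.1):
    """Return (start, end) sample indices of segments louder than the threshold."""
    segments = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / len(frame)   # mean squared amplitude
        if energy > threshold and start is None:
            start = i                      # utterance begins in this frame
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # utterance ended before this quiet frame
            start = None
    if start is not None:                  # signal stopped while still "speaking"
        segments.append((start, len(samples)))
    return segments


# Toy signal: quiet, one "word", quiet, another "word", quiet.
quiet = [0.0] * 8
word = [0.8, -0.7, 0.9, -0.8] * 2
signal = quiet + word + quiet + word + quiet
print(find_utterances(signal))
```

With quiet on both sides of each word, as an isolated word recognizer expects,
the two segments are separated cleanly; when words run together the quiet gap
disappears and this simple approach no longer finds the boundary.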

• Isolated Word

Isolated word recognizers usually require each utterance to have quiet (lack
of an audio signal) on both sides of the sample window. This does not
mean that they accept only single words, but they do require a single utterance
at a time. Often, these systems have "Listen/Not-Listen" states, where they
require the speaker to wait between utterances (usually doing processing
during the pauses).

• Connected Word

Connected word systems (or more correctly 'connected utterances') are
similar to isolated word systems, but allow separate utterances to be
'run together' with a minimal pause between them.

• Continuous Speech

Recognizers with continuous speech capabilities are some of the most
difficult to create because they must utilize special methods to determine
utterance boundaries. Continuous speech recognizers allow users to speak
almost naturally, while the computer determines the content. Basically, it is
computer dictation.

• Spontaneous Speech

At a basic level, spontaneous speech can be thought of as speech that is
natural sounding and not rehearsed. An ASR system with spontaneous speech
ability should be able to handle a variety of natural speech features such as
words being run together, "ums" and "ahs", and even slight stutters.

• Voice Verification/Identification

Some ASR systems have the ability to identify specific users by
characteristics of their voices (voice biometrics). If the speaker claims to
be of a certain identity and the voice is used to verify this claim, this is
called verification or authentication. On the other hand, identification is
the task of determining an unknown speaker's identity. In a sense,
speaker verification is a 1:1 match where one speaker's voice is matched to
one template (also called a "voice print" or "voice model"), whereas
speaker identification is a 1:N match where the voice is compared against
N templates.

There are two types of voice verification/identification system, which are as
follows:

• Text-Dependent:
If the text must be the same for enrollment and verification, this is
called text-dependent recognition. In a text-dependent system,
prompts can either be common across all speakers (e.g. a common
pass phrase) or unique. In addition, the use of shared secrets (e.g.
passwords and PINs) or knowledge-based information can be
employed in order to create a multi-factor authentication scenario.
• Text-Independent:
Text-independent systems are most often used for speaker
identification, as they require very little if any cooperation by the
speaker. In this case the text during enrollment and test is different. In
fact, the enrollment may happen without the user's knowledge, as is
the case for many forensic applications. As text-independent
technologies do not compare what was said at enrollment and
verification, verification applications tend to also employ speech
recognition to determine what the user is saying at the point of
authentication.
In text-independent systems, both acoustic and speech analysis
techniques are used.
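
The difference between the 1:1 match of verification and the 1:N match of
identification can be sketched as follows. This is a toy illustration only: it
assumes each voice has already been reduced to a short feature vector, and the
speaker names, the vectors, the cosine similarity measure and the 0.95
threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two feature vectors (1.0 means identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def verify(sample, claimed_template, threshold=0.95):
    """1:1 match: does the sample fit the template of the claimed identity?"""
    return cosine_similarity(sample, claimed_template) >= threshold


def identify(sample, templates):
    """1:N match: return the enrolled name whose template fits the sample best."""
    return max(templates, key=lambda name: cosine_similarity(sample, templates[name]))


# Hypothetical enrolled "voice prints" and a new sample from an unknown speaker.
templates = {"alice": [0.9, 0.1, 0.3], "bob": [0.2, 0.8, 0.5]}
sample = [0.85, 0.15, 0.35]

print(verify(sample, templates["alice"]))   # claim "I am alice" is checked
print(verify(sample, templates["bob"]))     # claim "I am bob" is checked
print(identify(sample, templates))          # no claim: search all templates
```

Verification needs only one comparison, whereas identification grows with the
number of enrolled speakers, which is one reason the two tasks are treated
separately.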

3. ADVANTAGES & DISADVANTAGES

• Increases productivity

By speaking normally into speech recognition software (SRS), you create
documents at the speed you can compose them in your head. People without strong
typing skills, or those who do not wish to be slowed down by manual input, can
use voice recognition software to dramatically reduce document creation time.

• Can help with menial computer tasks, such as browsing and scrolling

Many people prefer not to spend effort on routine work, even when it is
necessary. Earlier there were punch cards to provide input to the system; then
came the keyboard, trackball, touch screen, mouse, gesture control, joystick
and so on. All of these input methods require motion of the hand or fingers.
With SRS, the user can provide input to the system through voice alone and
complete most menial computer tasks easily.

• Can help people with disabilities

More recently, students with learning or physical disabilities have
been able to use SRS. Those with learning disabilities that affect their
ability to write can now complete exams via voice recognition
technology, and those with physical disabilities such as upper body
paralysis can use SRS to communicate effectively with others.

• Cost effective

In a study of traditional transcription services versus voice recognition
software, Dr. Robert G. Zick and Dr. Jon Olsen found that using SRS had a
slightly lower accuracy rate (98.5% versus 99.7% for traditional
transcription), but was more cost effective overall.

• Diminishes spelling mistakes

Even the most experienced typists will occasionally make a spelling
blunder; the average person is likely to make several mistakes in his or
her composition. SRS always provides the correct spelling of a word
(assuming it recognized the word accurately in the first place), thus
reducing the need to spend time running spell checkers.
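
Accuracy comparisons such as the one in the cost-effectiveness study above, and
the word error rates quoted in Section 1.2, rest on counting word-level
mistakes. A common definition is WER = (substitutions + deletions +
insertions) / number of reference words. The sketch below computes it with an
edit-distance table; the function name, the whitespace tokenization and the
example sentences are assumptions made for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


# One substitution ("the" -> "a") and one deletion ("now") in five words.
wer = word_error_rate("please open the file now", "please open a file")
print(f"WER = {wer:.1%}, accuracy = {1 - wer:.1%}")
```

Under this definition, an accuracy of 98.5% corresponds roughly to 15 word
errors per 1,000 words, although a given study may count errors differently.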

3.1 Disadvantages

• Inaccuracy & Slowness

Most people cannot type as fast as they speak. In theory, this should make
voice recognition software faster than typing for entering text on a computer.
However, this may not always be the case because of the proofreading and
correction required after dictating a document to the computer. Although
voice recognition software may interpret your spoken words correctly the
majority of the time, you might still need to make corrections to
punctuation. Additionally, the software may not recognize words such as
brand names or uncommon surnames until you add them to the program's
library of words. SR systems can also confuse words that are
phonetically similar, e.g. “there” and “their”.

• Vocal Strain

When using voice recognition software, you may find yourself speaking more
loudly than in normal conversation. In 2000, Linda L. Grubbs of PC World
magazine reported that this habit could lead to vocal cord injury. Although
there is no definite scientific link established between the use of voice
recognition software and damage to the voice, talking loudly for
extended periods always carries the possibility of causing strain and
hoarseness.

• Adaptability

Speech recognition software often cannot adapt to changing conditions,
which include a different microphone, background noise, a new speaker,
a new task domain, or even a new language.
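
The homophone problem mentioned under Inaccuracy & Slowness is usually
approached by looking at the surrounding words. The sketch below picks the
spelling that most often appears next to its neighbours; the homophone table
and the bigram counts are invented toy data, and a real system would use a
language model trained on a large corpus.

```python
# Toy data, assumed for illustration only.
HOMOPHONES = {
    "there": {"there", "their"},
    "their": {"there", "their"},
}
BIGRAM_COUNTS = {
    ("over", "there"): 9,
    ("over", "their"): 1,
    ("there", "is"): 12,
    ("in", "their"): 7,
    ("in", "there"): 3,
    ("their", "house"): 8,
}


def pick_spelling(prev_word, heard, next_word):
    """Choose among same-sounding spellings using counts of neighbouring word pairs."""
    candidates = HOMOPHONES.get(heard, {heard})

    def score(candidate):
        return (BIGRAM_COUNTS.get((prev_word, candidate), 0)
                + BIGRAM_COUNTS.get((candidate, next_word), 0))

    # sorted() makes the result deterministic when two candidates tie
    return max(sorted(candidates), key=score)


print(pick_spelling("in", "there", "house"))
print(pick_spelling("over", "their", "is"))
```

The acoustic signal alone cannot separate the two spellings, so the quality of
the result depends entirely on how well the context statistics match the
speaker's vocabulary.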

4. APPLICATIONS

• Games and Edutainment

Speech recognition offers game and edutainment developers the potential
to bring their applications to a new level of play. With games, for
example, traditional computer-based characters could evolve into
characters that the user can actually talk to.

• Data Entry

Applications that require users to key paper-based data into the
computer (such as database front-ends and spreadsheets) are good areas for
a speech recognition application. Reading data directly to the computer is
much easier for most users and can significantly speed up data entry.

While speech recognition technology cannot effectively be used to enter
names, it can enter numbers or items selected from a small list (fewer than
100 items). Some recognizers can even handle spelling fairly well. If an
application has fields with mutually exclusive data types (for example,
one field allows "male" or "female", another is for age, and a third is for
city), the speech recognition engine can process the command and automatically
determine which field to fill in.

• Document Editing

This is a scenario in which one or both modes of speech recognition
(dictation and command and control) could be used to dramatically improve
productivity. Dictation would allow users to dictate entire documents
without typing. Command and control would allow users to modify formatting
or change views without using the mouse or keyboard. For example, a word
processor might provide commands like "bold", "italic", "change to Times
New Roman font", "use bullet list text style," and "use 18 point type." A
paint package might have "select eraser" or "choose a wider brush."

• Speaker Identification

Recognizing the speech patterns of different persons can be used to
identify them individually. It can be used as a biometric authentication
system in which users authenticate themselves with the help of their
speech. Characteristics of the speech such as frequency, amplitude and
other special features are captured and compared with a previously
stored database.

• Automation at Call Centers

Speech recognition can receive calls from a large number of customers and
answer them, or divert them to a particular customer care representative
according to the customer's demand. It can be used to provide a faster
response to the customer and better service.

• Medical Disabilities

This technology is a great boon for blind people and people with physical
disabilities, as they can use speech recognition for various tasks. Those
who are unable to operate the computer through keyboard and mouse can
operate it with just their voice.

• Fighter Aircraft

Pilots in fighter aircraft have to keep a check on various functions going
on in the aircraft, and have to respond quickly to sudden changes in the
aircraft's maneuvers. They can issue commands by voice. This requires
building a voice template for the pilot beforehand. The actions are
confirmed through visual or aural feedback.
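
The Data Entry item above notes that when form fields accept mutually exclusive
data types, the engine can decide which field a spoken value belongs to. A
minimal sketch of that routing is given below, assuming the recognizer has
already produced text tokens; the field names, the age range and the list of
known cities are hypothetical.

```python
# Hypothetical small vocabulary for the "city" field.
KNOWN_CITIES = {"amravati", "badnera", "nagpur"}


def route_to_field(token):
    """Return the form field a recognized token belongs to, or None if unknown."""
    word = token.strip().lower()
    if word in {"male", "female"}:
        return "gender"
    if word.isdecimal() and 0 < int(word) < 130:
        return "age"
    if word in KNOWN_CITIES:
        return "city"
    return None


def fill_form(tokens):
    """Place each recognized token into the field it fits; ignore the rest."""
    form = {}
    for token in tokens:
        field = route_to_field(token)
        if field is not None:
            form[field] = token.strip()
    return form


# The user can speak the values in any order.
print(fill_form(["Amravati", "female", "23"]))
```

Because each value fits only one field, the user does not have to say which
field is being filled, which is what makes small, typed vocabularies a good
match for speech input.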

Amazon's Alexa
Apple Siri
Google's Google Assistant

Microsoft Cortana

5. CONCLUSION & FUTURE SCOPE

5.1 Conclusion

Speech recognition will revolutionize the way people interact with smart
devices and will, ultimately, differentiate the upcoming technologies. Almost all
the smart devices coming onto the market today are capable of recognizing
speech. Many areas can benefit from this technology. Speech recognition can
be used for intuitive operation of computer-based systems in daily life.
This technology will spawn revolutionary changes in the modern world and
become a pivotal technology. Within five years, speech recognition technology
will become so pervasive in our daily lives that service environments lacking
this technology will be considered inferior.

5.2 Future Scope

• Achieving efficient speaker-independent word recognition

All SR systems will be speaker independent and will produce the
same kind of output for a particular command irrespective of the user. SR
systems will be able to process the voice commands of all users with
very high accuracy and efficiency.

• Ability to distinguish nuances of speech and meanings of words

SR systems would be able to distinguish between nuanced phrases and
meaningful commands, and would be able to extract the proper command
from a nuanced phrase correctly.

• Stand-alone Speech Recognition Systems

Presently there are no stand-alone SR systems available; all the SR systems
developed so far are based on one or another pre-existing hardware and
software platform. In the near future, however, stand-alone SR systems might
be available in the market.

• Wearable Speech Recognition System

SR systems will be embedded in wearable devices such as a wrist
watch, necklace or bracelet. There will be no need to carry bulky
devices, and the technology can be used on the go.

• Talk with all the devices

All devices, including smart phones, computers, televisions,
refrigerators and washing machines, will be controlled with the voice
commands of the user. There will be no need for a remote control or for
pressing buttons on the device to interact with it.

REFERENCES

• "Speaker Independent Connected Speech Recognition - Fifth Generation
Computer Corporation". Fifthgen.com. Archived from the original on 11
November 2013. Retrieved 15 June 2013.

• P. Nguyen (2010). "Automatic classification of speaker
characteristics". International Conference on Communications and
Electronics 2010. pp. 147–152. doi:10.1109/ICCE.2010.5670700.
ISBN 978-1-4244-7055-6. S2CID 13482115.

• "British English definition of voice recognition". Macmillan Publishers
Limited. Archived from the original on 16 September 2011. Retrieved 21
February 2012.

• "voice recognition, definition of". WebFinance
