
SPEECH RECOGNITION SYSTEMS

A Presentation By -
Ayush Rungta
(1409122011)
INTRODUCTION

The process of enabling a computer to identify and respond to the sounds produced in human speech is referred to as Speech Recognition.

You talk to your computer, phone, or device, and it uses what you said as input to trigger some action.
TYPES OF SPEECH RECOGNITION

There are two types of Speech Recognition Systems:

Speaker Dependent SRS
Speaker-dependent software is commonly used for dictation software.

Speaker Independent SRS
Speaker-independent software is more commonly found in telephone applications.
Speaker Dependent System

Speaker-dependent software works by learning the unique characteristics of a single person's voice, in a way similar to voice recognition. New users must first "train" the software by speaking to it, so the computer can analyze how the person talks.

This often means users have to read a few pages of text to the computer before they can use the speech recognition software.
Speaker Independent System

Speaker-independent software is designed to recognize anyone's voice, so no training is involved. This makes it the only real option for applications such as interactive voice response systems, where businesses can't ask callers to read pages of text before using the system.

The downside is that speaker-independent software is generally less accurate than speaker-dependent software.
RECOGNITION

Fig.- Recognition pipeline: Voice Input → Analog-to-Digital conversion → Speech Engine (Acoustic Model, Language Model, Decoder) → Display Feedback.
Digitization

The analog-to-digital converter (ADC) translates the analog wave into digital data that the computer can understand.

To do this, it samples, or digitizes, the sound by taking precise measurements of the wave at frequent intervals.

It uses filters to measure energy levels at various points on the frequency spectrum.
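The sampling step above can be sketched in a few lines of Python. This is a minimal illustration, not a real ADC: a pure tone stands in for the microphone signal, and the sample rate and tone frequency are assumed values chosen for the example.

```python
import math

def sample_wave(freq_hz, sample_rate_hz, duration_s):
    """Digitize a pure tone by measuring its amplitude at regular
    intervals, the way an ADC samples an analog signal."""
    n_samples = int(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate_hz)
            for t in range(n_samples)]

# 8 kHz is a common telephony sampling rate; 440 Hz is the tone A4.
samples = sample_wave(440, 8000, 0.01)
print(len(samples))  # 80 measurements for 10 ms of audio
```

A higher sample rate means more measurements per second and a more faithful digital copy of the wave.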
Noise Removal

The system filters the digitized sound to remove unwanted noise, and sometimes separates it into different frequency bands (frequency is the rate of vibration of the sound waves, heard by humans as differences in pitch).

It also normalizes the sound, or adjusts it to a constant volume level.

Two microphones can be used to remove noise: one facing the speaker and one facing away from the speaker.
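The normalization step can be sketched as scaling every sample so that the loudest one hits a fixed target level. This is a minimal sketch; the target peak of 0.9 is an assumed value for illustration.

```python
def normalize(samples, target_peak=0.9):
    """Scale a digitized signal so its loudest sample reaches a fixed
    level, giving the recognizer input at a constant volume."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]

quiet = [0.1, -0.05, 0.2, 0.0]
loud = normalize(quiet)
print(max(abs(s) for s in loud))  # peak is now at the target level
```

Real systems often normalize loudness over short windows rather than the whole recording, but the idea is the same.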
Speech Engine

It is the software program that recognizes speech. A speech engine takes a spoken utterance, compares it to the vocabulary, and matches the utterance to vocabulary words.

The Speech Engine consists of the following three components:

Acoustic Model
Language Model
Decoder
Acoustic Model

An Acoustic Model is a file that contains statistical representations of each of the distinct sounds that make up a word. Each of these statistical representations is assigned a label called a phoneme.

The English language has about 40 distinct sounds that are useful for speech recognition, and thus we have about 40 different phonemes.
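A toy version of "statistical representations of sounds" can be sketched as follows. Here each phoneme is reduced to a single hypothetical feature value standing in for the multi-dimensional statistics a real acoustic model stores; both the labels chosen and the numbers are assumptions for illustration only.

```python
# Toy acoustic model: each phoneme label maps to the mean of a
# (hypothetical) 1-D acoustic feature. Real models store rich
# statistics over many features per phoneme.
acoustic_model = {"AH": 0.2, "S": 0.8, "T": 0.5}

def nearest_phoneme(feature, model):
    """Label a sound frame with the phoneme whose stored
    representation is closest to the measured feature."""
    return min(model, key=lambda p: abs(model[p] - feature))

print(nearest_phoneme(0.75, acoustic_model))  # closest mean is "S"
```

The real matching uses probability distributions rather than a nearest-mean rule, but the principle, comparing incoming sound features against stored per-phoneme statistics, is the same.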
Language Model

A Statistical Language Model is a file used by a Speech Recognition Engine to recognize speech. It contains a large list of words and their probability of occurrence. It is used in dictation applications.

It tries to capture the properties of a language, and to predict the next word in a speech sequence.
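The next-word prediction idea can be sketched with a tiny bigram model: count which word follows which in a corpus, then pick the most frequent successor. The corpus here is a made-up command string, chosen only to keep the example small.

```python
from collections import Counter, defaultdict

# Toy training corpus of spoken commands (illustrative only).
corpus = "turn on the tv turn off the lamp turn on the lamp".split()

# Count word-pair (bigram) occurrences.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most probable word to follow `word`."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("turn"))  # "on": seen twice vs "off" once
```

Production language models use far longer histories and smoothing, but they rest on the same counting of word sequences.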
Decoder

The Decoder is the software program that takes the sounds spoken by a user and searches the Acoustic Model for the equivalent sounds.

When a match is made, the Decoder determines the phoneme corresponding to the sound. It keeps track of the matching phonemes until it reaches a pause in the user's speech.

It then searches the Language Model or Grammar file for the equivalent series of phonemes. If a match is made, it returns the text of the corresponding word or phrase to the calling program.
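The final lookup step, phoneme sequence to text, can be sketched as a dictionary search. The lexicon below is a two-entry stand-in with ARPAbet-style labels; the entries are illustrative assumptions, not a real pronunciation dictionary.

```python
# Toy pronunciation lexicon: phoneme sequences mapped to words.
lexicon = {
    ("HH", "EH", "L", "OW"): "hello",
    ("S", "T", "AA", "P"): "stop",
}

def decode(phonemes):
    """Return the text for a matched phoneme sequence, or None
    when the sequence is not in the vocabulary."""
    return lexicon.get(tuple(phonemes))

print(decode(["S", "T", "AA", "P"]))  # -> stop
```

A real decoder searches many competing phoneme hypotheses at once and scores them with the language model, rather than doing one exact lookup.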
Neural Networks

Artificial Neural Networks (ANNs) are systems consisting of interconnected computational nodes that work somewhat like human neurons.

Neural networks can be used to approximate functions or to classify data into classes, which in the speech recognition domain can be phonemes, sub-phoneme units, syllables, or words.

The ability to learn by adapting the strengths of inter-neuron connections (synapses) is a fundamental property of artificial neural networks.
Fig.- A diagram representing a simple neural network.
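The "learning by adapting connection strengths" idea can be sketched with a single artificial neuron trained by the classic perceptron rule. The data is a made-up two-feature toy set, standing in for real acoustic features.

```python
# Minimal sketch of one artificial neuron: training adjusts the
# connection strengths (weights) until the unit separates two classes.
def neuron(inputs, weights, bias):
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if s > 0 else 0

def train(samples, labels, epochs=20, lr=0.1):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = y - neuron(x, weights, bias)  # perceptron update rule
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias

# Toy data: class 1 when both feature values are large.
xs = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
ys = [1, 1, 0, 0]
w, b = train(xs, ys)
print([neuron(x, w, b) for x in xs])  # -> [1, 1, 0, 0]
```

Speech recognizers use much deeper networks over thousands of units, but every layer rests on this same weighted-sum-and-adjust principle.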
PROCESS OF SPEECH RECOGNITION

Fig.- Pipeline: Speaker Recognition → Speech Recognition → Parsing and Arbitration → output switches S1 … SK … SN.
Input From User

The user speaks a command to the system, e.g. "Switch on Channel 9".
Authentication

Speaker Recognition identifies who is speaking (e.g. Annie, David, or Cathy).
Understanding

Speech Recognition determines what the speaker is saying (e.g. the words On, Off, TV, Fridge, Door).
Inferring & Execution

Parsing and Arbitration infers what the speaker is talking about and executes the command. For "Switch, to, channel, nine", the words are resolved against the devices: Channel → TV; Dim → Lamp; On → TV, Lamp.
Framework of Voice Recognition

Fig.- The same framework extends to other inputs: Face Recognition and Gesture Recognition feed Parsing and Arbitration, covering the Authentication, Understanding, and Inferring and Execution stages.


APPLICATIONS

In-car systems

Healthcare

Military

Telephony

Education and Daily Life

People with Disabilities


THE END