
Speech Recognition

Tamer M. Nassef
Definition
Speech recognition is the process of converting
an acoustic signal, captured by a microphone or
a telephone, to a set of words.
The recognised words can be an end in
themselves, as for applications such as
command and control, data entry, and document
preparation.
They can also serve as the input to further
linguistic processing in order to achieve speech
understanding.
Speech Processing
Signal processing:
Convert the audio wave into a sequence of feature vectors
Speech recognition:
Decode the sequence of feature vectors into a sequence
of words
Semantic interpretation:
Determine the meaning of the recognized words
Dialog Management:
Correct errors and help get the task done
Response Generation
What words to use to maximize user understanding
Speech synthesis (Text to Speech):
Generate synthetic speech from a marked-up word string
Dialog Management
Goal: determine what to accomplish in response
to user utterances, e.g.:
Answer user question
Solicit further information
Confirm/Clarify user utterance
Notify invalid query
Notify invalid query and suggest alternative
Interface between user/language processing
components and system knowledge base
What you can do with Speech
Recognition
Transcription
dictation, information retrieval
Command and control
data entry, device control, navigation, call
routing
Information access
airline schedules, stock quotes, directory
assistance
Problem solving
travel planning, logistics
Transcription and Dictation
Transcription is transforming a stream of
human speech into computer-readable
form
Medical reports, court proceedings, notes
Indexing (e.g., broadcasts)
Dictation is the interactive composition of
text
Report, correspondence, etc.
Speech recognition and
understanding
Sphinx system
speaker-independent
continuous speech
large vocabulary
ATIS system
air travel information retrieval
context management
Speech Recognition and Call
Centres
Automate services, lower payroll
Shorten time on hold
Shorten agent and client call time
Reduce fraud
Improve customer service
Applications related to Speech
Recognition
Speech Recognition
Figure out what a person is saying.
Speaker Verification
Authenticate that a person is who she/he
claims to be.
Limited speech patterns
Speaker Identification
Assigns an identity to the voice of an
unknown person.
Arbitrary speech patterns
Many kinds of Speech Recognition
Systems
Speech recognition systems can be
characterised by many parameters.
An isolated-word (Discrete) speech
recognition system requires that the
speaker pauses briefly between words,
whereas a continuous speech recognition
system does not.
Spontaneous V Scripted
Spontaneous speech contains disfluencies, periods of pause and restart, and is much more difficult to recognise than speech read from a script.
Enrolment
Some systems require speaker enrolment: a user must provide samples of his or her speech before using the system, whereas other systems are said to be speaker-independent, in that no enrolment is necessary.
Large V small vocabularies
Some of the other parameters depend on the
specific task. Recognition is generally more
difficult when vocabularies are large with many
similar-sounding words.
When speech is produced in a sequence of
words, language models or artificial grammars
are used to restrict the combination of words.
The simplest language model can be specified
as a finite-state network, where the permissible
words following each word are given explicitly.
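As an illustration of such a finite-state network (the vocabulary and grammar below are invented for this sketch, not taken from any particular system), the permissible successors of each word can be written as a simple table:

```python
# Minimal sketch of a finite-state language model: for each word we list
# the words that may legally follow it. Vocabulary is illustrative only.
successors = {
    "<start>": ["show", "list"],
    "show":    ["flights", "fares"],
    "list":    ["flights"],
    "flights": ["to", "<end>"],
    "fares":   ["to"],
    "to":      ["boston", "denver"],
    "boston":  ["<end>"],
    "denver":  ["<end>"],
}

def is_permissible(sentence):
    """Accept a word sequence only if every transition is in the network."""
    words = ["<start>"] + sentence.split() + ["<end>"]
    return all(nxt in successors.get(cur, []) for cur, nxt in zip(words, words[1:]))

print(is_permissible("show flights to boston"))   # True
print(is_permissible("show boston to flights"))   # False
```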
Perplexity
One popular measure of the difficulty of
the task, combining the vocabulary size
and the language model, is perplexity.
Loosely defined, it is the geometric mean of the number of words that can follow a word after the language model has been applied (Zue, Cole, and Ward, 1995).
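In the standard formulation (not spelled out on the slide), perplexity is defined from the probability the language model assigns to a word sequence $w_1 \ldots w_N$:

$\mathrm{PP} = P(w_1 w_2 \ldots w_N)^{-1/N}$

so a model that allows exactly $k$ equally likely words at every point has perplexity $k$, matching the geometric-mean reading above.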
Finally, some external parameters can
affect speech recognition system
performance. These include the
characteristics of the environmental noise
and the type and the placement of the
microphone.
Properties of Recognizers
Summary
Speaker Independent vs. Speaker Dependent
Large Vocabulary (2K-200K words) vs.
Limited Vocabulary (2-200)
Continuous vs. Discrete
Speech Recognition vs. Speech Verification
Real Time vs. multiples of real time
Continued
Spontaneous Speech vs. Read Speech
Noisy Environment vs. Quiet Environment
High Resolution Microphone vs. Telephone vs.
Cellphone
Push-and-hold vs. push-to-talk vs. always-
listening
Adapt to speaker vs. non-adaptive
Low vs. High Latency
With online incremental results vs. final results
Dialog Management
Features That Distinguish
Products & Applications
Words, phrases, and grammar
Models of the speakers
Speech flow
Vocabulary: How many words
How you add new words
Grammars
Branching Factor (Perplexity)
Available languages
Systems are also defined by Users
Different Kinds of Users
One time vs. Frequent users
Homogeneity
Technically sophisticated
Different kinds of users call for different speaker models
Speaker Models
Speaker Dependent
Speaker Independent
Speaker Adaptive
Sample Market: Call Centers
Automate services, lower payroll
Shorten time on hold
Shorten agent and client call time
Reduce fraud
Improve customer service
A TIMELINE OF SPEECH
RECOGNITION
1870s Alexander Graham Bell invents the telephone while trying to develop a speech recognition system for deaf people.
1936 AT&T's Bell Labs produced the first electronic speech synthesizer, called the Voder (Dudley, Riesz and Watkins).
The machine was demonstrated at the 1939 World's Fair by experts who used a keyboard and foot pedals to play it and emit speech.
1969 John Pierce of Bell Labs said automatic speech recognition would not be a reality for several decades because it requires artificial intelligence.
Early 70s
Early 1970s The Hidden Markov Modeling (HMM) approach to speech recognition was invented by Lenny Baum of Princeton University and shared with several ARPA (Advanced Research Projects Agency) contractors, including IBM.
HMM is a complex mathematical pattern-
matching strategy that eventually was adopted
by all the leading speech recognition companies
including Dragon Systems, IBM, Philips, AT&T
and others.
70+
1971 DARPA (Defense Advanced Research Projects Agency) established the Speech Understanding Research (SUR) program to develop a computer system that could understand continuous speech.
Lawrence Roberts, who initiated the program, spent $3 million per year of government funds for 5 years. Major SUR project groups were established at CMU, SRI, MIT's Lincoln Laboratory, Systems Development Corporation (SDC), and Bolt, Beranek, and Newman (BBN). It was the largest speech recognition project ever.
1978 The popular toy "Speak and Spell" by Texas Instruments was introduced. Speak and Spell used a speech chip which led to huge strides in the development of more human-like digital synthesis sound.
80+
1982 Covox founded. Company brought digital sound (via The Voice Master, Sound Master and The Speech Thing) to the Commodore 64, Atari 400/800, and finally to the IBM PC in the mid 80s.
1982 Dragon Systems was founded by speech industry pioneers Drs. Jim and Janet Baker. Dragon Systems is well known for its long history of speech and language technology innovations and its large patent portfolio.
1984 SpeechWorks, the leading provider of over-the-telephone automated speech recognition (ASR) solutions, was founded.
90s
1993 Covox sells its products out to Creative Labs, Inc.
1995 Dragon released discrete word dictation-level speech
recognition software. It was the first time dictation speech
recognition technology was available to consumers. IBM and
Kurzweil followed a few months later.
1996 Charles Schwab is the first company to devote resources towards developing a speech recognition IVR system with Nuance. The program, Voice Broker, allows up to 360 simultaneous customers to call in and get quotes on stocks and options... it handles up to 50,000 requests each day. The system was found to be 95% accurate and set the stage for other companies such as Sears, Roebuck and Co., United Parcel Service of America Inc., and E*Trade Securities to follow in their footsteps.
1996 BellSouth launches the world's first voice portal, called Val
and later Info By Voice.
95+
1997 Dragon introduced "Naturally Speaking", the first
"continuous speech" dictation software available
(meaning you no longer need to pause between words
for the computer to understand what you're saying).
1998 Lernout & Hauspie bought Kurzweil. Microsoft
invested $45 million in Lernout & Hauspie to form a
partnership that will eventually allow Microsoft to use
their speech recognition technology in their systems.
1999 Microsoft acquired Entropic, giving Microsoft access to what was then known as the "most accurate speech recognition system".
2000
2000 Lernout & Hauspie acquired Dragon Systems for approximately $460 million.
2000 TellMe introduces the first world-wide voice portal.
2000 NetBytel launched the world's first voice enabler, which includes an on-line ordering application with real-time Internet integration for Office Depot.
2000s
2001 ScanSoft closes acquisition of Lernout & Hauspie speech and language assets.
2003 ScanSoft ships Dragon NaturallySpeaking 7 Medical, lowering healthcare costs through highly accurate speech recognition.
2003 ScanSoft closes deal to distribute and support IBM ViaVoice Desktop products.
Signal Variability
Speech recognition is a difficult problem, largely because
of the many sources of variability associated with the
signal.
The acoustic realisations of phonemes, the recognition system's smallest sound units of which words are composed, are highly dependent on the context in which they appear.
These phonetic variations are exemplified by the acoustic differences of the phoneme /t/ in two, true, and butter in English.
At word boundaries, contextual variations can be quite dramatic, making devo andare sound like devandare in Italian.
More
Acoustic variability can result from changes in
the environment as well as in the position and
characteristics of the transducer.
Within-speaker variability can result from
changes in the speaker's physical and emotional
state, speaking rate, or voice quality.
Differences in socio-linguistic background,
dialect, and vocal tract size and shape can
contribute to across-speaker variability.
What is a speech recognition
system?
Speech recognition is generally used as a
human computer interface for other software.
When it functions in this role, three primary tasks need to be performed (a minimal skeleton is sketched after the list):
Pre-processing, the conversion of spoken input
into a form the recogniser can process.
Recognition, the identification of what has been
said.
Communication, to send the recognised input to
the application that requested it.
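A minimal sketch of how these three tasks could fit together; the function names and bodies below are illustrative stubs, not a real recogniser:

```python
# Illustrative skeleton of the three tasks; the bodies are placeholders.
def preprocess(audio_samples):
    """Convert spoken input into feature frames the recogniser can process."""
    return [audio_samples[i:i + 160] for i in range(0, len(audio_samples), 160)]

def recognise(feature_frames):
    """Identify what has been said (stub: a real system would decode here)."""
    return "open curtains"

def communicate(recognised_text, application):
    """Send the recognised input to the application that requested it."""
    application(recognised_text)

samples = [0.0] * 16000   # one second of (silent) audio at an assumed 16 kHz
communicate(recognise(preprocess(samples)), application=print)
```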
How is pre-processing performed
To understand how the first of these functions is performed, we must examine:
Articulation, the production of the sound.
Acoustics, the stream of the speech itself.
Auditory perception, what characterises the ability to understand spoken input.
Articulation
The science of articulation is concerned with how
phonemes are produced. The focus of articulation is on
the vocal apparatus of the throat, mouth and nose where
the sounds are produced.
The phonemes themselves need to be classified; the system most often used in speech recognition is the ARPABET (Rabiner and Juang, 1993). The ARPABET was created in the 1970s by and for contractors working on speech processing for the Advanced Research Projects Agency of the U.S. Department of Defense.
ARPABET

Like most phoneme classifications, the ARPABET separates consonants from vowels.
Consonants are characterised by a total or partial blockage of the vocal tract.
Vowels are characterised by strong harmonic patterns and relatively free passage of air through the vocal tract.
Semi-vowels, such as the y in you, fall between consonants and vowels.
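For illustration, a toy pronunciation table using standard ARPABET symbols; the word list and the small vowel set shown here are only a sample:

```python
# A toy pronunciation table using ARPABET symbols; entries are illustrative.
arpabet = {
    "you":    ["Y", "UW"],            # semi-vowel + vowel
    "two":    ["T", "UW"],            # stop consonant + vowel
    "butter": ["B", "AH", "T", "ER"],
}
vowels = {"UW", "AH", "ER", "IY", "AE", "AO"}

for word, phones in arpabet.items():
    kinds = ["V" if p in vowels else "C" for p in phones]
    print(word, phones, kinds)
```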
Consonant Classification
Consonant classification uses the:
Point of articulation.
Manner of articulation.
Presence or absence of voicing.
Acoustics
Articulation provides valuable information
about how speech sounds are produced,
but a speech recognition system cannot
analyse movements of the mouth.
Instead, the data source for speech
recognition is the stream of speech itself.
This is an analogue signal, a sound
stream, and a continuous flow of sound
waves and silence.
Important Features (Acoustics)
Four important features of the acoustic analysis of speech are (Carter, 1984):
Frequency, the number of vibrations per second a sound produces.
Amplitude, the loudness of the sound.
Harmonic structure: added to the fundamental frequency of a sound are other frequencies that contribute to its quality or timbre.
Resonance.
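A small numpy sketch (synthetic 440 Hz tone, assumed 16 kHz sampling rate) showing how the first two of these features, frequency and amplitude, can be read off a waveform:

```python
import numpy as np

# Synthetic 440 Hz tone sampled at 16 kHz; values are illustrative.
fs = 16000
t = np.arange(fs) / fs
signal = 0.5 * np.sin(2 * np.pi * 440 * t)

# Frequency: locate the strongest peak in the magnitude spectrum.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
print("dominant frequency:", freqs[np.argmax(spectrum)], "Hz")   # ~440 Hz

# Amplitude: root-mean-square level of the waveform.
print("RMS amplitude:", np.sqrt(np.mean(signal ** 2)))           # ~0.35
```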
Auditory perception, hearing
speech.
"Phonemes tend to be abstractions that are implicitly
defined by the pronunciation of the words in the
language. In particular, the acoustic realisation of a
phoneme may heavily depend on the acoustic context in
which it occurs. This effect is usually called co-
articulation", (Ney, 1994).
The way a phoneme is pronounced can be affected by its position in a word, neighbouring phonemes and even the word's position in a sentence. This effect is called the co-articulation effect.
The variability in the speech signal caused by co-articulation and other sources makes speech analysis very difficult.
Human Hearing
The human ear can detect frequencies from 20Hz to
20,000Hz but it is most sensitive in the critical frequency
range, 1000Hz to 6000Hz, (Ghitza, 1994).
Recent research has uncovered the fact that humans do not process individual frequencies.
Instead, we hear groups of frequencies, such as formant patterns, as cohesive units, and we are capable of distinguishing them from surrounding sound patterns (Carrell and Opie, 1992).
This capability, called auditory object formation, or
auditory image formation, helps explain how humans can
discern the speech of individual people at cocktail parties
and separate a voice from noise over a poor telephone
channel, (Markowitz, 1995).
Pre-processing Speech
Like all sounds, speech is an analogue waveform. In order for a recognition system to perform actions on speech, it must be represented in a digital manner.
All noise patterns, silences and co-articulation effects must be captured.
This is accomplished by digital signal processing. The way the analogue speech is processed is one of the most complex elements of a Speech Recognition system.
Recognition Accuracy
To achieve high recognition accuracy, the speech representation process should (Markowitz, 1995):
Include all critical data.
Remove Redundancies.
Remove Noise and Distortion.
Avoid introducing new distortions.
Signal Representation

In statistically based automatic speech recognition, the speech waveform is sampled at a rate between 6.6 kHz and 20 kHz and processed to produce a new representation as a sequence of vectors containing values of what are generally called parameters.
The vectors typically comprise between 10 and 20 parameters, and are usually computed every 10 or 20 milliseconds.
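A minimal sketch of this step, assuming a 16 kHz waveform, 25 ms frames and a 10 ms hop; real systems typically compute MFCC or similar parameters, whereas this toy version simply keeps a few log-spectral values per frame:

```python
import numpy as np

def log_spectral_features(signal, fs=16000, frame_ms=25, hop_ms=10, n_coeffs=13):
    """Slice the waveform into overlapping frames and return one small
    log-spectral vector per frame (a stand-in for MFCC-style parameters)."""
    frame_len = int(fs * frame_ms / 1000)
    hop_len = int(fs * hop_ms / 1000)
    window = np.hamming(frame_len)
    vectors = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        log_power = np.log(power + 1e-10)
        # Keep a handful of coarse spectral values as the "parameters".
        vectors.append(log_power[:n_coeffs])
    return np.array(vectors)

# One second of noise at 16 kHz -> roughly 100 vectors of 13 values each.
features = log_spectral_features(np.random.randn(16000))
print(features.shape)   # (98, 13)
```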
Parameter Values
These parameter values are then used in
succeeding stages in the estimation of the
probability that the portion of waveform just
analysed corresponds to a particular phonetic
event that occurs in the phone-sized or whole-
word reference unit being hypothesised.
In practice, the representation and the
probability estimation interact strongly: what one
person sees as part of the representation
another may see as part of the probability
estimation process.
Emotional State
Representations aim to preserve the information
needed to determine the phonetic identity of a
portion of speech while being as impervious as
possible to factors such as speaker differences,
effects introduced by communications channels,
and paralinguistic factors such as the emotional
state of the speaker.

They also aim to be as compact as possible.
Representations used in current speech recognisers concentrate primarily on properties of the speech signal attributable to the shape of the vocal tract rather than to the excitation, whether generated by a vocal-tract constriction or by the larynx.
Representations are sensitive to whether the vocal folds are vibrating or not (the voiced/unvoiced distinction), but try to ignore effects due to variations in their frequency of vibration.
Future Improvements in Speech
Representation.
The vast majority of major commercial and
experimental systems use representations akin
to those described here.

However, in striving to develop better representations, wavelet transforms (Daubechies, 1990) are being explored, and neural network methods are being used to provide non-linear operations on log spectral representations.
Work continues on representations more closely reflecting auditory properties (Greenberg, 1988) and on representations reconstructing articulatory gestures from the speech signal (Schroeter & Sondhi, 1994).

The articulatory approach is attractive because it holds out the promise of a small set of smoothly varying parameters that could deal in a simple and principled way with the interactions that occur between neighbouring phonemes and with the effects of differences in speaking rate and of carefulness of enunciation.
The ultimate challenge is to match the superior
performance of human listeners over automatic
recognisers.
This superiority is especially marked when there is little
material to allow adaptation to the voice of the current
speaker, and when the acoustic conditions are difficult.
The fact that it persists even when nonsense words are
used shows that it exists at least partly at the
acoustic/phonetic level and cannot be explained purely
by superior language modelling in the brain.
It confirms that there is still much to be done in
developing better representations of the speech signal,
(Rabiner and Schafer, 1978; Hunt, 1993).
Signal Recognition Technologies
Signal recognition methodologies fall into four categories; most systems will apply one or more in the conversion process.
Template Matching
Template matching is the oldest and least effective method. It is a form of pattern recognition.
It was the dominant technology in the 1950s and 1960s.
Each word or phrase in an application is stored as a
template.
The user input is also arranged into templates at the
word level and the best match with a system template is
found.
Although Template matching is currently in decline as
the basic approach to recognition, it has been adapted
for use in word spotting applications. It also remains the
primary technology applied to speaker verification,
(Moore, 1982).
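A minimal dynamic time warping sketch of template matching, a common way to score an input against stored word templates; the feature sequences below are random placeholders rather than real speech:

```python
import numpy as np

def dtw_distance(template, utterance):
    """Align two feature sequences with dynamic time warping and return the
    total frame-to-frame distance of the best alignment."""
    n, m = len(template), len(utterance)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - utterance[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Pick the stored word template closest to the incoming utterance.
templates = {"yes": np.random.randn(40, 13), "no": np.random.randn(30, 13)}
utterance = np.random.randn(35, 13)
best = min(templates, key=lambda w: dtw_distance(templates[w], utterance))
print("best match:", best)
```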
Acoustic-Phonetic Recognition
Acoustic-phonetic recognition functions at the
phoneme level. It is an attractive approach to speech recognition, as it limits the number of representations that must be stored. In English there are about forty discernible phonemes no matter how large the vocabulary (Markowitz, 1995).
Acoustic phonetic recognition involves three
steps,
Feature Extraction.
Segmentation and Labelling.
Word-Level recognition.
Acoustic-phonetic recognition supplanted template matching in the early 1970s.
The successful ARPA SUR systems highlighted the potential benefits of this approach. Unfortunately, acoustic-phonetic recognition was at the time a poorly researched area and many of the expected advances failed to materialise.
The high degree of acoustic similarity among phonemes, combined with phoneme variability resulting from the co-articulation effect and other sources, creates uncertainty with regard to potential phoneme labels (Cole, 1986).
If these problems can be overcome, there is certainly an opportunity for this technology to play a part in future Speech Recognition systems.
Stochastic Processing
The term stochastic refers to the process of making a
sequence of non-deterministic selections from among a
set of alternatives.
They are non-deterministic because the choices during
the recognition process are governed by the
characteristics of the input and not specified in advance,
(Markowitz, 1995).
Like template matching, stochastic processing requires
the creation and storage of models of each of the items
that will be recognised.
It is based on a series of complex statistical or
probabilistic analyses. These statistics are stored in a
network-like structure called a Hidden Markov Model
(HMM), (Paul, 1990).
HMM
A Hidden Markov Model is made up of states and transitions. Each state of a HMM holds statistics for a segment of a word, which describe the values and variations found in the model of that word segment. The transitions allow for speech variations such as:
The prolonging of a word segment, which causes several recursive transitions in the recogniser.
The omission of a word segment, which causes a transition that skips a state.
Stochastic processing using Hidden Markov Models is
accurate, flexible, and capable of being fully automated,
(Rabiner and Juang, 1986).
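A small Viterbi sketch over a hypothetical three-state word model, with self-loops to prolong a segment and a skip transition to omit one, as described above; all probabilities and observations are invented for illustration:

```python
import numpy as np

# Three-state left-to-right word model: self-loops prolong a segment,
# the state-skipping entry allows a segment to be omitted.
trans = np.array([[0.6, 0.3, 0.1],    # from state 0: loop, advance, skip
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
# Probability of each observation symbol (0 or 1) in each state.
emit = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.7, 0.3]])
start = np.array([1.0, 0.0, 0.0])
obs = [0, 1, 1, 0]

# Viterbi recursion: best log-probability of any state path for the observations.
logv = np.log(start + 1e-12) + np.log(emit[:, obs[0]] + 1e-12)
for o in obs[1:]:
    step = logv[:, None] + np.log(trans + 1e-12)
    logv = step.max(axis=0) + np.log(emit[:, o] + 1e-12)
print("best path log-probability:", logv.max())
```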
Neural networks
"if speech recognition systems could learn speech
knowledge automatically and represent this knowledge
in a parallel distributed fashion for rapid evaluation
such a system would mimic the function of the human
brain, which consists of several billion simple, inaccurate
and slow processors that perform reliable speech
processing", (Waibel and Hampshire, 1989).

An artificial neural network is a computer program which attempts to emulate the biological functions of the human brain. Neural networks are excellent classification systems, and have been effective with noisy, patterned, variable data streams containing multiple, overlapping, interacting and incomplete cues (Markowitz, 1995).
Neural networks do not require the complete specification of a problem, learning instead through exposure to large amounts of example data. They comprise an input layer, one or more hidden layers, and one output layer. The way in which the nodes and layers of a network are organised is called the network's architecture.
The allure of neural networks for speech recognition lies
in their superior classification abilities.
Considerable effort has been directed towards
development of networks to do word, syllable and
phoneme classification.
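A toy forward pass through such a network, classifying a single 13-value feature frame into one of four hypothetical phoneme classes; the layer sizes and random weights are placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: 13 input features -> 32 hidden units -> 4 phoneme classes.
W1, b1 = rng.standard_normal((13, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 4)), np.zeros(4)

def classify_frame(frame):
    """Forward pass: hidden layer with a non-linearity, then a softmax output."""
    hidden = np.tanh(frame @ W1 + b1)
    logits = hidden @ W2 + b2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

frame = rng.standard_normal(13)            # one acoustic feature vector
print(classify_frame(frame))               # posterior over 4 classes
```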
Auditory Models
The aim of auditory models is to allow a Speech Recognition system to screen all noise from the signal and concentrate on the central speech pattern, in a similar way to the human brain.
Auditory modelling offers the promise of being
able to develop robust Speech Recognition
systems that are capable of working in difficult
environments.
Currently, it is purely an experimental
technology.
Performance of Speech
Recognitions systems
Performance of speech recognition systems is typically described in terms of word error rate, which counts three kinds of errors (a standard formula is given after the list):
Deletion: the loss of a word within the original speech. The system outputs "A E I U" while the input was "A E I O U".
Substitution: the replacement of an element of the input, such as a word, with another. The system outputs "song" while the input was "long".
Insertion: the system adds an element, such as a word, that was not in the input. The system outputs "A E I O U" while the input was "A E I U".
Speech Recognition as Assistive
Technology
Main use is as an alternative, hands-free data entry mechanism
Very effective
Much faster than switch access
Mainstream technology
Used in many applications where hands
are needed for other things e.g. mobile
phone while driving, in surgical theatres
Dictation is a big part of office
administration and commercial speech
recognition systems are targeted at this
market.
Some interesting facts
Switch access users who were at around 5 words per minute achieved around 80 words per minute with SR
This allowed them to do state exams
SR can be used for environmental control
systems around the home e.g.
Open Curtains
People with speech impairment (dysarthric speech) have shown improved articulation after using SR systems, especially discrete systems
Reasons why SR may fail some
people
Crowded room - Cannot have everyone
talking at once
Too many errors because all noises,
coughs, throat clearances etc are picked
up
Speech not good enough to use it
Not enough training
Cognitive overhead too much for some
people
Too demanding physically: hard work to talk for a long time
Cannot be bothered with initial enrolment
Drinking adversely affects the vocal cords
Smoking, shouting, dry mouth and illness all affect the vocal tract
Need to drink water
Room must not be too stuffy
Some links
The following are links to major speech recognition demos
Carnegie Mellon Speech
Demos
CMU Communicator
Call: 1-877-CMU-PLAN (268-7526), also 268-
5144, or x8-1084
the information is accurate; you can use it for
your own travel planning

CMU Universal Speech Interface (USI)


CMU Movie Line
Seems to be about apartments now
Call: (412) 268-1185
Telephone Demos
Nuance http://www.nuance.com
Banking: 1-650-847-7438
Travel Planning: 1-650-847-7427
Stock Quotes: 1-650-847-7423
SpeechWorks
http://www.speechworks.com/demos/demos.htm
Banking: 1-888-729-3366
Stock Trading: 1-800-786-2571
MIT Spoken Language Systems
Laboratory
http://www.sls.lcs.mit.edu/sls/whatwedo/applicati
ons.html
Travel Plans (Pegasus): 1-877-648-8255
Weather (Jupiter): 1-888-573-8255
IBM http://www-3.ibm.com/software/speech/
Mutual Funds, Name Dialing: 1-877-VIA-
VOICE
