Tamer M. Nassef
Definition
Speech recognition is the process of converting
an acoustic signal, captured by a microphone or
a telephone, to a set of words.
The recognised words can be an end in
themselves, as for applications such as
command and control, data entry, and document
preparation.
They can also serve as the input to further
linguistic processing in order to achieve speech
understanding.
Speech Processing
Signal processing:
Convert the audio wave into a sequence of feature vectors
Speech recognition:
Decode the sequence of feature vectors into a sequence
of words
Semantic interpretation:
Determine the meaning of the recognized words
Dialog Management:
Correct errors and help get the task done
Response generation:
Choose the words that maximize user understanding
Speech synthesis (Text to Speech):
Generate synthetic speech from a marked-up word string
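As an illustration of the first stage only (signal processing), here is a minimal sketch that turns a waveform into a sequence of feature vectors. The two features (log energy and zero-crossing rate), the frame sizes, and all function names are assumptions chosen for brevity; real recognizers typically use richer spectral features.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_features(frame):
    """Return a 2-element feature vector: [log energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small floor avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]

def extract_features(samples, frame_len=400, hop=160):
    """Convert an audio wave into a sequence of feature vectors (one per frame)."""
    return [frame_features(f) for f in frame_signal(samples, frame_len, hop)]

# Synthetic input (assumption): 0.1 s of a 440 Hz sine sampled at 16 kHz.
sample_rate = 16000
wave = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
features = extract_features(wave)
```

With a 16 kHz signal, the assumed defaults correspond to 25 ms frames taken every 10 ms, so `features` holds one short vector per frame for the recognition stage to decode.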
Dialog Management
Goal: determine what to accomplish in response
to user utterances, e.g.:
Answer user question
Solicit further information
Confirm/Clarify user utterance
Notify invalid query
Notify invalid query and suggest alternative
Interface between user/language processing
components and system knowledge base
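The decision described above can be sketched as a small function that maps an interpreted user query to one of the listed responses. This is a toy example: the slot names, the 0.6 confidence threshold, and the flight-schedule knowledge base are all hypothetical.

```python
def choose_dialog_act(query, confidence, knowledge_base,
                      required_slots=("origin", "destination")):
    """Pick a response type for an interpreted user utterance.

    query: dict of slots filled by language processing (hypothetical format).
    confidence: recognizer confidence in [0, 1].
    knowledge_base: dict mapping (origin, destination) to a list of departures.
    """
    # Solicit further information if a required slot is missing.
    missing = [slot for slot in required_slots if slot not in query]
    if missing:
        return ("solicit", missing)
    # Confirm/clarify when the recognizer is unsure (threshold is an assumption).
    if confidence < 0.6:
        return ("confirm", query)
    # Answer the user question if the knowledge base has an entry.
    key = (query["origin"], query["destination"])
    if key in knowledge_base:
        return ("answer", knowledge_base[key])
    # Otherwise notify an invalid query, suggesting alternatives when possible.
    alternatives = [k for k in knowledge_base if k[0] == query["origin"]]
    if alternatives:
        return ("notify_invalid_suggest", alternatives)
    return ("notify_invalid", None)

# Hypothetical knowledge base of flight departures.
flights = {
    ("Boston", "Denver"): ["08:15", "17:40"],
    ("Boston", "Austin"): ["09:00"],
}
act = choose_dialog_act({"origin": "Boston", "destination": "Denver"}, 0.9, flights)
```

The function sits between the language processing output (`query`, `confidence`) and the system knowledge base (`flights`), which is the interface role the slide describes.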
What you can do with Speech
Recognition
Transcription
dictation, information retrieval
Command and control
data entry, device control, navigation, call
routing
Information access
airline schedules, stock quotes, directory
assistance
Problem solving
travel planning, logistics
Transcription and Dictation
Transcription is transforming a stream of
human speech into computer-readable
form
Medical reports, court proceedings, notes
Indexing (e.g., broadcasts)
Dictation is the interactive composition of
text
Report, correspondence, etc.
Speech recognition and
understanding
Sphinx system
speaker-independent
continuous speech
large vocabulary
ATIS system
air travel information retrieval
context management
Speech Recognition and Call
Centres
Automate services, lower payroll
Shorten time on hold
Shorten agent and client call time
Reduce fraud
Improve customer service
Applications related to Speech
Recognition
Speech Recognition
Figure out what a person is saying.
Speaker Verification
Authenticate that a person is who she/he
claims to be.
Limited speech patterns
Speaker Identification
Assigns an identity to the voice of an
unknown person.
Arbitrary speech patterns
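The difference between verification and identification can be sketched with toy "voiceprint" vectors. Verification compares a sample against one claimed identity and answers yes or no; identification searches all enrolled speakers for the best match. The vectors, the cosine-similarity scoring, and the 0.8 threshold are assumptions for illustration, not a description of any particular system.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def verify(voiceprint, claimed_id, enrolled, threshold=0.8):
    """Speaker verification: does the sample match the claimed identity?"""
    return cosine(voiceprint, enrolled[claimed_id]) >= threshold

def identify(voiceprint, enrolled):
    """Speaker identification: which enrolled speaker is the closest match?"""
    return max(enrolled, key=lambda name: cosine(voiceprint, enrolled[name]))

# Hypothetical enrolled voiceprints and an incoming sample.
enrolled = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.25]

accepted_as_alice = verify(sample, "alice", enrolled)
accepted_as_bob = verify(sample, "bob", enrolled)
best_match = identify(sample, enrolled)
```

Verification is a one-to-one comparison, while identification is one-to-many, so its cost and error rate grow with the number of enrolled speakers.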
Many kinds of Speech Recognition
Systems
Speech recognition systems can be
characterised by many parameters.
An isolated-word (Discrete) speech
recognition system requires that the
speaker pauses briefly between words,
whereas a continuous speech recognition
system does not.
Spontaneous vs. Scripted
Spontaneous speech contains disfluencies,
such as pauses and restarts, and is much
more difficult to recognise than speech
read from a script.
Enrolment
Some systems require speaker enrolment:
a user must provide samples of his or her
speech before using them. Other systems
are said to be speaker-independent, in
that no enrolment is necessary.
Large vs. Small Vocabularies
Some of the other parameters depend on the
specific task. Recognition is generally more
difficult when vocabularies are large with many
similar-sounding words.
When speech is produced in a sequence of
words, language models or artificial grammars
are used to restrict the combination of words.
The simplest language model can be specified
as a finite-state network, where the permissible
words following each word are given explicitly.
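A finite-state network of this kind can be sketched as a table that lists, for each word, the words allowed to follow it. The vocabulary and the `<start>`/`<end>` markers below are hypothetical.

```python
# Each key maps to the set of words permitted to follow it (toy grammar).
NETWORK = {
    "<start>": {"call", "dial"},
    "call": {"home", "office"},
    "dial": {"home", "office", "nine"},
    "nine": {"nine", "one"},
    "one": {"one", "<end>"},
    "home": {"<end>"},
    "office": {"<end>"},
}

def is_permitted(words, network=NETWORK):
    """Return True if the word sequence is a complete path through the network."""
    state = "<start>"
    for word in list(words) + ["<end>"]:
        if word not in network.get(state, set()):
            return False
        state = word
    return True

ok = is_permitted(["call", "home"])
rejected = is_permitted(["call", "nine"])
```

A recognizer constrained this way only needs to consider the listed successors at each step, which is how the grammar restricts the combinations of words.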
Perplexity
One popular measure of the difficulty of
the task, combining the vocabulary size
and the language model, is perplexity.
It is loosely defined as the geometric mean
of the number of words that can follow a
word after the language model has been
applied (Zue, Cole, and Ward, 1995).
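As a worked example of the loose definition, the sketch below computes the geometric mean of the branching counts along a word sequence. It also shows the probabilistic form of perplexity, which gives the same value when every permitted next word is equally likely. The branching counts are made-up numbers and the function names are assumptions.

```python
import math

def geometric_mean_branching(branch_counts):
    """Geometric mean of the number of words allowed at each position."""
    total_log = sum(math.log(b) for b in branch_counts)
    return math.exp(total_log / len(branch_counts))

def perplexity(word_probabilities):
    """Perplexity from the model probability assigned to each word in a sequence."""
    total_log = sum(math.log(p) for p in word_probabilities)
    return math.exp(-total_log / len(word_probabilities))

# Hypothetical 3-word utterance: 2, then 8, then 4 words were permitted.
branches = [2, 8, 4]
by_branching = geometric_mean_branching(branches)

# If the permitted words are equally likely, each word has probability 1/b.
by_probability = perplexity([1 / b for b in branches])
```

Both values come out at about 4 here, since the cube root of 2 × 8 × 4 = 64 is 4. A task with perplexity 4 is, roughly, as hard as choosing among four equally likely words at every step.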
Finally, some external parameters can
affect speech recognition system
performance. These include the
characteristics of the environmental noise
and the type and the placement of the
microphone.
Properties of Recognizers
Summary
Speaker Independent vs. Speaker Dependent
Large Vocabulary (2K-200K words) vs.
Limited Vocabulary (2-200)
Continuous vs. Discrete
Speech Recognition vs. Speech Verification
Real Time vs. multiples of real time
Spontaneous Speech vs. Read Speech
Noisy Environment vs. Quiet Environment
High Resolution Microphone vs. Telephone vs.
Cellphone
Push-and-hold vs. push-to-talk vs. always-
listening
Adapt to speaker vs. non-adaptive
Low vs. High Latency
With online incremental results vs. final results
Dialog Management
Features That Distinguish
Products & Applications
Words, phrases, and grammar
Models of the speakers
Speech flow
Vocabulary: How many words
How you add new words
Grammars
Branching Factor (Perplexity)
Available languages
Systems are also defined by Users
Different Kinds of Users
One time vs. Frequent users
Homogeneity
Technically sophisticated
Different users have different speaker
models
Speaker Models
Speaker Dependent
Speaker Independent
Speaker Adaptive
Sample Market: Call Centers