SuperHuman Speech Recognition Jul 2 2008

IBM Research
Superhuman Speech Recognition:

Technology Challenges & Market Adoption
David Nahamoo
IBM Fellow
Speech CTO, IBM Research
July 2, 2008
© 2006 IBM Corporation

IBM Research
Overall Speech Market Opportunity

WW Voice-Driven Conversation Access Technology Forecast
• This chart represents all revenue for speech-related ecosystem activity.
• Revenue exceeded $1B for the first time in 2006.
• Note also that hosted services will represent ½ of speech-related revenue in 2011.
*Opus Research 02_2007
IBM Research
Speech Market Segments

Speech Segments           | Need-based Segmentation          | Market Usage
Speech Self Service       | Transaction/Problem Solving      | Contact Centers
Speech Analytics          | Intelligence                     | Contact Centers, Media, Government
Speech Biometrics         | Security                         | Contact Centers, Government
Speech Transcription      | Information Access and Provision | Media, Medical, Legal, Education, Government, Unified Messaging
Speech Translation        | Multilingual Communication       | Contact Centers, Tourism, Global Digital Communities, Media (XCast)
Speech Control            | Command & Control                | Embedded - Automotive, Mobile Devices, Appliances, Entertainment
Speech Search & Messaging | Information Search & Retrieval   | Mobile Internet, Yellow Pages, SMS, IM, email
• Improved accuracy
• Much larger vocabulary speech recognition system

IBM Research
New Opportunity Areas

• Contact Centers Analytics
  – Quality Assurance, Real Time Alerts, Compliance
• Media Transcription
  – Closed captioning
• Accessibility
  – Government, Lectures
• Content Analytics
  – Audio-indexing, cross-lingual information retrieval, multi-media mining
• Dictation
  – Medical, Legal, Insurance, Education
• Unified Communication
  – Voicemail, Conference calls, email and SMS on handheld devices

IBM Research
[Figure: target zone shown against the human baseline for conversations]

IBM Research
Performance Results (2004 DARPA EARS Evaluation)

(Last public evaluation of English Telephony Transcription)
[Figure: WER (axis from 14 to 20) versus xRT (axis ticks 0.1, 1, 10, 100) for the IBM system and systems V1-V4]
IBM: Best Speed-Accuracy Tradeoff
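The results here and on later slides are reported as word error rate (WER). As a reminder of what the number means, here is a minimal sketch assuming the usual definition: substitutions plus deletions plus insertions, divided by the number of reference words, found by word-level edit distance. The function name `word_error_rate` and the example strings are illustrative and are not taken from the evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative strings: two of four reference words are substituted.
wer = word_error_rate("i feel fine today", "i veal fine toady")
print(f"WER = {wer:.0%}")  # prints "WER = 50%"
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.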

IBM Research
MALACH
Multilingual Access to Large Spoken ArCHives
• Funded by NSF, 5-year project (started in Oct. 2001)
• Project Participants
  – IBM, Visual History Foundation, Johns Hopkins University, University of Maryland, Charles University and University of West Bohemia
• Objective
  – Improve access to large multilingual collections of spontaneous speech by advancing the state-of-the-art in technologies that work together to achieve this objective: Automatic Speech Recognition, Computer-Assisted Translation, Natural Language Processing and Information Retrieval

IBM Research
MALACH: A challenging speech corpus

Multimedia digital archive: 116,000 hours of interviews with over 52,000 survivors,
liberators, rescuers and witnesses of the Nazi Holocaust, recorded in 32
languages.
Goal: improved access to large multilingual spoken archives
Challenges:
Disfluencies
• A- a- a- a- band with on- our- on- our- arm
Emotional speech
• young man they ripped his teeth and beard out they beat him
Frequent interruptions:
• CHURCH TWO DAYS these were the people who were to go to march TO MARCH and your brother smuggled himself SMUGGLED IN IN IN IN
IBM Research
Effects of Customization (MALACH Data)
[Figure: word error rates (axis from 20 to 90) at Jan. '02, Oct. '02, Oct. '03, June '04 and Nov. '04. Annotations: “State-of-the-art ASR system trained on SWB data (8 kHz)”; “MALACH training data seen by AM and LM”; “fMPE, MPE, Consensus decoding”]

IBM Research
Improvement in Word Error Rate for IBM embedded ViaVoice
[Figure: WER across 3 car speeds and 4 grammars]

IBM Research
Progress in Word Error Rate – IBM WebSphere Voice Server

Grammar Tasks over Telephone, 2001 - 2006
[Figure: Word Error Rate % (axis from 0 to 6) for WVS]
45% relative improvement in WER in the last 2.5 years
20% relative improvement in speed in the last 1.5 years
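A relative improvement is measured against the starting value, not in absolute percentage points. The sketch below shows the arithmetic only. The 4.0% starting WER is a made-up number for illustration and is not read off the chart, and both function names are hypothetical.

```python
def relative_improvement(old: float, new: float) -> float:
    """Fractional reduction of `new` relative to `old` (0.45 means 45%)."""
    if old <= 0:
        raise ValueError("old value must be positive")
    return (old - new) / old


def apply_relative_improvement(old: float, fraction: float) -> float:
    """Value that results from improving `old` by a relative `fraction`."""
    return old * (1.0 - fraction)


# Hypothetical starting point: 4.0% WER (not a figure from the slide).
start_wer = 4.0
end_wer = apply_relative_improvement(start_wer, 0.45)
print(round(end_wer, 2))  # prints 2.2
```

So a 45% relative improvement on a 4.0% WER is a drop of 1.8 absolute points, not 45 points.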

IBM Research
Multi-Talker Speech Separation Task
Male and female speaker mixed at 0 dB:
• “Lay white at X 8 soon”
• “Bin green with F 7 now”
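Mixing two talkers “at 0 dB” means both have equal average power in the mixture, so neither can be treated as background. Here is a minimal sketch of building such a mixture, assuming the ratio is defined as 10·log10(P_target / P_interferer). The names `power` and `mix_at_ratio` are hypothetical, and synthetic sine waves stand in for the two voices.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_ratio(target, interferer, ratio_db):
    """Scale `interferer` so that 10*log10(P_target / P_interferer) == ratio_db,
    then add it to `target` sample by sample."""
    gain = math.sqrt(power(target) / (power(interferer) * 10 ** (ratio_db / 10)))
    scaled = [gain * x for x in interferer]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, scaled


# Synthetic stand-ins for two talkers (sine waves, not real speech).
rate = 8000
talker_a = [0.5 * math.sin(2 * math.pi * 120 * n / rate) for n in range(rate)]
talker_b = [0.1 * math.sin(2 * math.pi * 210 * n / rate) for n in range(rate)]

mixture, scaled_b = mix_at_ratio(talker_a, talker_b, ratio_db=0.0)
ratio = 10 * math.log10(power(talker_a) / power(scaled_b))
print(f"ratio after scaling: {ratio:.2f} dB")
```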

IBM Research
Two Talker Speech Separation Challenge Results
[Figure: examples of two-talker mixtures and the resulting recognition error]

IBM Research
Comparison of Human & Machine Speech Recognition

[Figure: error rate (axis ticks 1, 10, 100) by year, 1992 to 2000, for Voicemail, SWITCHBOARD and BROADCAST NEWS, with BROADCAST-HUMAN and SWITCHBOARD-HUMAN reference levels]
[Figure: Word Error Rate (axis from 0 to 70) for WSJ, Broadcast, Conv Tel, Vmail, SWB, Call center and Meeting; tasks labeled Clean Speech vs. Spontaneous Speech and Human-Machine vs. Human-Human]
IBM Research
IBM’s Superhuman Speech Recognition

Universal Recognizer
• Any accent
• Any topic
• Any noise conditions
• Broadcast, phone, in car, or live
• Multiple languages
• Conversational

IBM Research
Human Experiments
• Question:
  – Can post-processing of recognizer hypotheses by humans improve accuracy?
  – What is the relative contribution of linguistic vs. acoustic information in this post-processing operation?
• Experiment
  – Produce recognizer hypotheses in form of “sausages”
  – Allow human to correct output either with linguistic information alone or with short segments of acoustic information
• Results
  – Human performance still far from maximum possible, given information in “sausages”
  – Recognizer hypothesized linguistic context information not useful by itself
  – Acoustic information in limited span (1 sec. average) marginally useful
• What we learned
  – Hard to design
  – Expensive to conduct
  – Hard to decide if not valuable
[Figure: example “sausage” with competing word fragments such as “that could stem”, “it cuts down on”, “I comes stay I’m”, “they cut them”]

IBM Research
Acoustic Modeling Today

• Approach: Hidden Markov Models
  – Observation densities (GMM) for P( feature | class )
    • Mature mathematical framework, easy to combine with linguistic information
    • However, does not directly model what we want, i.e., P( words | acoustics )
• Training: Use transcribed speech data
  – Maximum Likelihood
  – Various discriminative criteria
• Handling Training/Test Mismatches:
  – Avoid mismatches by collecting “custom” data
  – Adaptation & adaptive training algorithms
• Significantly worse than humans for tasks with little or no linguistic information, e.g., digits/letters recognition
• Human performance extremely robust to acoustic variations
  – due to speaker, speaking style, microphone, channel, noise, accent, & dialect variations
• Steady progress over the years

Continued progress using current methodology very likely in the future
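The gap between P( feature | class ), which the HMM models, and P( words | acoustics ), which we actually want, is conventionally bridged with Bayes' rule. The formulation below is the standard textbook one; the symbols W (word sequence) and A (acoustic observations) are chosen here and do not appear on the slide.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

Here P(A | W) is supplied by the acoustic model and P(W) by the language model. P(A) can be dropped because it does not depend on W. This is why the framework combines easily with linguistic information, and also why the posterior is only modeled indirectly.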
IBM Research
Towards a Non-Parametric Approach to Acoustics
• General Idea: Back to pattern recognition basics!
  – Break test utterance into sequence of larger segments (phone, syllable, word, phrase)
  – Match segments to closest ones in training corpus using some metric (possibly using long distance models)
  – Helps to get it right if you’ve heard it before
• Why prefer this approach over HMMs?
  – HMMs compress training data by a factor of 1000; too many modeling assumptions
    • 1000hrs ~ 30Gb; State-of-the-art acoustic models ~ 30Mb
    • Relaxing assumptions has been key to all recent improvements in acoustic modeling
• How can we accomplish this?
  – Store & index training data for rapid access to training segments close to test segments
  – Develop a metric D( train_seq, test_seq ): obvious candidate is DTW with appropriate metric and warping rules
• Back to the Future?
  – Reminiscent of DTW & Segmental models from late 80’s – ME was missing
  – Limited by computational resources (storage/cpu/data) then & so HMMs won
• Implications:
  – Need 100x more data for handling larger units (hence 100x more computing resources)
  – Better performance with more data – likely to have “heard it before”
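The metric D( train_seq, test_seq ) and the “match to closest training segment” step can be sketched with textbook dynamic time warping (DTW). This is a minimal illustration under stated assumptions, not the system described in the talk:

- The local distance is absolute difference on 1-D values.
- The warping rules are the basic three-step pattern.
- A linear scan stands in for the indexed training store.
- The names `dtw_distance` and `nearest_segment` are hypothetical.

```python
def dtw_distance(train_seq, test_seq, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping cost between two sequences.

    Allowed steps: diagonal match, or repeating a frame of either sequence.
    """
    n, m = len(train_seq), len(test_seq)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning train_seq[:i] with test_seq[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(train_seq[i - 1], test_seq[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    return cost[n][m]


def nearest_segment(train_segments, test_seq):
    """Return the (label, sequence) training segment closest to test_seq."""
    return min(train_segments, key=lambda item: dtw_distance(item[1], test_seq))


# Toy 1-D "feature" sequences; a real system would use acoustic feature vectors.
train = [("rising", [1, 2, 3, 4]), ("falling", [4, 3, 2, 1])]
label, _ = nearest_segment(train, [1, 1, 2, 3, 3, 4])
print(label)  # prints "rising"
```

The test sequence is a time-stretched copy of the “rising” template, so its DTW cost is zero even though the lengths differ. That tolerance to timing variation is what makes DTW the obvious candidate metric.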
IBM Research
Utilizing Linguistic Information in ASR

• Today’s standard ASR does not explicitly use linguistic information
  – But recent work at JHU, SRI and IBM all show promise
  – Semantic structured LM improves ASR significantly for limited domains
    • Reduces WER by 25% across many tasks (Air Travel, Medical)
• A large amount of linguistic knowledge sources now available, but not used for ASR
  – Inside IBM
    • WWW text: Raw text: 50 million pages ~25 billion words, ~10% useful after cleanup
    • News text: 3-4 billion words, broadcast or newswires
    • Named-entity annotated text: 2 million words tagged
    • Ontologies
    • Linguistic knowledge used in rule-based MT system
  – External
    • WordNet, FrameNet, Cyc ontologies
    • PennTreeBank, Brown corpus (syntactic & semantic annotated)
    • Online dictionaries and thesauri
    • Google

IBM Research
Super Structured LM for LVCSR

[Figure: knowledge sources feeding the language model over the word sequence W1, ..., WN: World Knowledge, Dialogue State, Semantic Parser, Syntactic Parser, Document Type, Named Entity, Embedded Grammar, Word Class, Speaker (turn, gender, ID), and Coherence (semantic, syntactic, pragmatic)]
Example confusions shown in the figure:
• Wash your clothes with soap/soup.
• David and his/her father walked into the room.
• I ate a one/nine pound steak.

• Acoustic Confusability: LM should be optimized to distinguish between acoustically confusable sets, rather than based on N-gram counts
• Automatic LM adaptation at different levels: discourse, semantic structure, and phrase levels

IBM Research
Combination Decoders
• “ROVER” is used in all current systems
  – NIST tool that combines multiple system outputs through voting
• Individual systems currently designed in an ad-hoc manner
• Only 5 or so systems possible

“I feel shine today”
“I veal fine today”
“I feel fine toady”
“I feel fine today”

An army (“Million”) of simple decoders
• Each makes uncorrelated errors
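The benefit of voting over decoders with uncorrelated errors can be seen on the example hypotheses from this slide. The sketch below is a deliberately simplified stand-in for ROVER: it assumes all hypotheses have the same number of words, so a position-by-position majority vote suffices. The real tool first aligns the outputs and can also weigh confidence scores. The function name `vote` is hypothetical.

```python
from collections import Counter


def vote(hypotheses):
    """Pick the most frequent word at each position.

    Simplification: assumes all hypotheses have the same number of words,
    so position i in one lines up with position i in the others.
    """
    token_lists = [h.split() for h in hypotheses]
    if len({len(t) for t in token_lists}) != 1:
        raise ValueError("this sketch needs hypotheses of equal length")
    combined = []
    for column in zip(*token_lists):
        word, _count = Counter(column).most_common(1)[0]
        combined.append(word)
    return " ".join(combined)


outputs = ["I feel shine today", "I veal fine today", "I feel fine toady"]
print(vote(outputs))  # prints "I feel fine today"
```

Each decoder gets one word wrong, but never the same word as another, so every error is outvoted two to one.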
IBM Research
Million Feature Paradigm: Acoustic information for ASR
Information Sources: segmental analysis, broadband features, narrowband features, trajectory features, onset features
Noise Sources: discard transient noise; global adaptation for stationary noise

• Feature definition is key challenge
• Maximum entropy model used to compute word probabilities.
• Information sources combined in unified theoretical framework.
• Long-span segmental analysis inherently robust to both stationary and transient noise
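A maximum entropy model combines many feature values into word probabilities through a log-linear form: each word's score is a weighted sum of its features, and the scores are exponentiated and normalized. The sketch below shows only that combination step. The feature names echo labels on this slide, but the feature values, the weights, and the function name `maxent_probabilities` are invented for illustration.

```python
import math


def maxent_probabilities(features, weights):
    """P(word | features) proportional to exp(sum of weight * feature value).

    features: {word: {feature_name: value}}
    weights:  {feature_name: weight}
    """
    scores = {
        word: sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for word, feats in features.items()
    }
    # Subtract the max score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exp_scores.values())
    return {word: e / total for word, e in exp_scores.items()}


# Hypothetical feature values and weights, chosen only to illustrate the combination.
features = {
    "fine":  {"broadband_match": 0.9, "onset_match": 0.8},
    "shine": {"broadband_match": 0.7, "onset_match": 0.2},
}
weights = {"broadband_match": 1.5, "onset_match": 2.0}

probs = maxent_probabilities(features, weights)
print(max(probs, key=probs.get))  # prints "fine"
```

Adding another information source only requires another feature name and weight, which is what lets the framework scale to very large feature sets.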
IBM Research
Implications of the data-driven learning paradigm
• ASR systems give the best results when test data is similar to the training data
• Performance degrades as the test data diverges from the training data
  – Differences can occur both at the acoustic and linguistic levels, e.g.
    1. A system designed to transcribe standard telephone audio (8 kHz) cannot transcribe compressed telephony archives (6 kHz)
    2. A system designed for a given domain (e.g. broadcast news) will perform worse on a different domain (e.g. dictation)
• Hence the training and test sets have to be carefully chosen if the task at hand expects a variety of acoustic sources

IBM Research
Generalization Dilemma
[Figure: Performance versus Test Conditions (In-Domain to Out-of-Domain). Curves: Simple model; Complex model (brute force learning), which falls into “The Gutter of Data Addiction”; Model combination (“Can we at least get best of both worlds?”); and the target (“Want to get here”): the correct complex model (simple model on the right manifold)]
IBM Research
Summary
• Continue the current tried-and-true technical approach
• Continue the yearly milestones and evaluations
• Continue the focus on accuracy, robustness, & efficiency
• Increase the focus on quantum leap innovation
• Increase the focus on language modeling
• Plan for 2 orders of magnitude increase in
  – Access to annotated speech and text data
  – Computing resources
• Improve cross-fertilization among different projects
