Speech Recognition Technology in a Ubiquitous Computing Environment

Sadaoki Furui
Tokyo Institute of Technology, Department of Computer Science
2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552 Japan
Tel/Fax: +81-3-5734-3480
furui@cs.titech.ac.jp

[Figure: word error rate (log scale, 1% to 100%) versus year, 1988 to 2003. Curve labels: Resource Management, ATIS, WSJ, NAB, Read Speech, Spontaneous Speech, Varied Microphone, Noisy, Broadcast Speech, Switchboard, Conversational Speech, foreign; vocabulary sizes 1k, 5k and 20k. Courtesy NIST 1999 DARPA HUB-4 Report, Pallett et al.]

History of DARPA speech recognition benchmark tests

[Figure: applications plotted by speaking style (isolated words, connected speech, read speech, fluent speech, spontaneous speech) against vocabulary size (2, 20, 200, 2000, 20000 words, unrestricted), with year markers 1980, 1990 and 2000. Applications shown: word spotting; digit strings; voice commands; name dialing; form fill by voice; directory assistance; system driven dialogue; office dictation; 2-way dialogue; network agent & intelligent messaging; natural conversation transcription.]

Speech recognition technology

[Figure: block diagram. Speech input -> Acoustic analysis -> feature vectors x1 ... xT -> Global search -> Recognized word sequence. The search draws on P(x1 ... xT | w1 ... wk), supported by the phoneme inventory and pronunciation lexicon, and on P(w1 ... wk), supplied by the language model.]

Global search: maximize P(x1 ... xT | w1 ... wk) · P(w1 ... wk) over w1 ... wk

Mechanism of state-of-the-art speech recognizers
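The global search rule above can be illustrated with a minimal sketch. It assumes a tiny, hypothetical set of candidate word sequences with made-up acoustic and language model probabilities (none of these numbers come from the slides) and picks the candidate that maximizes P(x1 ... xT | w1 ... wk) · P(w1 ... wk), working in the log domain. A real recognizer searches a far larger hypothesis space instead of enumerating candidates.

```python
import math

def decode(acoustic_logp, lm_logp):
    """Return the word sequence W maximizing log P(X|W) + log P(W)."""
    return max(acoustic_logp, key=lambda w: acoustic_logp[w] + lm_logp[w])

# Hypothetical scores for two competing hypotheses (illustrative only).
# acoustic_logp[W] stands in for log P(x1..xT | w1..wk).
acoustic_logp = {
    ("recognize", "speech"): math.log(0.30),
    ("wreck", "a", "nice", "beach"): math.log(0.35),
}
# lm_logp[W] stands in for log P(w1..wk).
lm_logp = {
    ("recognize", "speech"): math.log(0.010),
    ("wreck", "a", "nice", "beach"): math.log(0.001),
}

best = decode(acoustic_logp, lm_logp)
print(" ".join(best))
```

In this toy case the acoustically stronger hypothesis loses because the language model assigns it a much lower prior, which is the interaction the product of the two terms is meant to capture.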

[Figure: the recognizer block diagram (Speech input -> Acoustic analysis -> Global search -> Recognized word sequence) annotated with representative techniques for each block.]

Acoustic analysis ・ LPC or mel cepstrum, time derivatives, auditory models ・ Cepstrum subtraction
Phoneme inventory / Pronunciation lexicon ・ Context-dependent, tied mixture sub-word HMMs, learning from speech data ・ SBR, MLLR
Language model ・ Bigram, trigram, FSN, CFG
Global search ・ Frame synchronous, beam search, stack search, fast match, A* search

State-of-the-art algorithms in speech recognition
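Of the language models named above, the bigram is the simplest to sketch. The following is a minimal illustration, assuming a three-sentence toy corpus and add-one smoothing; both are assumptions made for the example, not choices stated in the slides, and the function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams, with sentence boundary symbols."""
    history, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        vocab.update(tokens)
        history.update(tokens[:-1])                  # words used as context
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return history, bigrams, len(vocab)

def logprob(sentence, model):
    """Add-one smoothed log P(w1..wk) under the bigram model.

    The vocabulary size used for smoothing includes the boundary
    symbols, a simplification that is acceptable for a sketch.
    """
    history, bigrams, v = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (history[a] + v))
        for a, b in zip(tokens[:-1], tokens[1:])
    )

# Hypothetical training text (illustrative only).
model = train_bigram(["call home", "call the office", "dial home"])
seen = logprob("call home", model)
scrambled = logprob("home call", model)
print(seen > scrambled)
```

The word order seen in training scores higher than the scrambled one, which is how such a model steers the global search toward plausible word sequences.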

[Figure: MPU clock frequency (MHz, 0 to 4,000) versus year, 1995 to 2015.]

Change of MPU clock frequency

[Figure: capacity per chip (log scale, 10K to 1T) versus year, 1990 to 2015, for two series: DRAM (gates/chip) and MPU (transistors/chip). Data labels: 4M, 64M, 256M, 1G, 4G, 16G, 64G, 256G; 11M, 21M, 40M, 76M, 200M, 520M, 1.4G.]

Change of DRAM/MPU capacity per chip

[Figure: sales per year (0 to 18) versus year, 1940 to 2005, for three eras of computing: mainframe (one computer, many people), PC (one person, one computer), and ubiquitous computing (one person, many computers).]

The Major Trends in Computing
(http://www.ubiq.com/hypertext/weiser/NomadicInteractive/Sld003.htm)

MIT wearable computing people
(http://www.media.mit.edu/wearables/)

Feature                  Ubicomp   Wearables
Privacy                            X
Personalization                    X
Localized information    X
Localized control        X
Resource management      X

Features provided by Ubicomp vs. Wearables
(http://rhodes.www.media.mit.edu/people/rhodes/papers/wearhive.html)

[Figure: a wearable speech recognizer used across a ubiquitous computing environment.]

・ Office (Dictation, Meeting records)
・ Home (Electrical appliances, Games)
・ Trip (Translator)
・ Train station (Tickets)
・ Internet (Browsing, News on demand)
・ Car (Navigation)

Speech recognition in the ubiquitous/wearable computing environment

[Figure: five blocks labeled Recognizer and one block labeled Meeting manager.]

Meeting synopsizing system using collaborative speech recognizers

Difficulties in automatic speech recognition

・ Lack of systematic understanding of variability
    ・ Structural or functional variability
    ・ Parametric variability
・ Lack of complete structural representations of speech
・ Lack of data for understanding non-structural variability

[Figure: sources of variation that affect the input to a speech recognition system.]

Noise ・ Other speakers ・ Background noise ・ Reverberations
Channel ・ Distortion ・ Noise ・ Echoes ・ Dropouts
Speaker ・ Voice quality ・ Pitch ・ Gender ・ Dialect
Speaking style ・ Stress/Emotion ・ Speaking rate ・ Lombard effect
Task/Context ・ Man-machine dialogue ・ Dictation ・ Free conversation ・ Interview
Phonetic/Prosodic context
Microphone ・ Distortion ・ Electrical noise ・ Directional characteristics

Main causes of acoustic variation in speech

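[Figure: feedback loop. The classifier / recognizer (acoustic models, language models) produces recognition results; evaluation of the results yields a discrepancy; a parameter adaptation algorithm converts the discrepancy into parameter modification instructions for the classifier / recognizer.]

Framework of adaptive learning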
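The loop in this framework can be sketched in a few lines. The example below is only an illustration under strong assumptions: a hypothetical one-dimensional nearest-mean classifier stands in for the recognizer, the reference labels are assumed to be available for evaluation, and the update rule (move the correct class mean toward the misrecognized sample) is chosen for simplicity and is not taken from the slides.

```python
def recognize(x, means):
    """Classifier: return the label whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def adapt(means, samples, rate=0.5):
    """Run the recognize / evaluate / modify loop over labeled samples."""
    means = dict(means)                      # leave the caller's model intact
    for x, truth in samples:
        result = recognize(x, means)         # recognition result
        if result != truth:                  # evaluation finds a discrepancy
            # Parameter modification: shift the correct class mean toward x.
            means[truth] += rate * (x - means[truth])
    return means

# Hypothetical initial model and adaptation data (illustrative only).
initial = {"a": 0.0, "b": 10.0}
samples = [(6.0, "a"), (6.0, "a")]
adapted = adapt(initial, samples)

print(recognize(6.0, initial), recognize(6.0, adapted))
```

Before adaptation the sample at 6.0 is assigned to class "b"; after one corrective update the same input is recognized as "a", and the second pass leaves the model unchanged because no discrepancy remains.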

[Figure: a flexible decoder takes input speech with noise, feature sets 1 ... M, language models 1 ... N and context, together with a speaker model and a world model; a decision stage outputs the recognition results.]

Flexible speech recognition


[Figure: chain of channels. Message source M, with P(M) -> linguistic channel, with P(W|M), producing W -> acoustic channel, with P(X|W), producing X -> speech recognizer.]

Linguistic channel ・ Language ・ Vocabulary ・ Grammar ・ Semantics ・ Context ・ Habits
Acoustic channel ・ Speaker ・ Reverberation ・ Noise ・ Transmission characteristics ・ Microphone

A communication-theoretic view of speech generation & recognition
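One way to read this view is that the recognizer inverts the two channels. The following is a sketch of that reading, assuming that X depends on M only through W (the chain M -> W -> X suggested by the figure); the hat notation for the decoded quantities is introduced here for illustration.

```latex
\begin{align*}
\hat{M} &= \arg\max_{M} P(M \mid X)
         = \arg\max_{M} P(M) \sum_{W} P(W \mid M)\, P(X \mid W), \\
\hat{W} &= \arg\max_{W} P(X \mid W)\, P(W),
\qquad P(W) = \sum_{M} P(M)\, P(W \mid M).
\end{align*}
```

The second line is the usual word-level decoding rule, with the language model P(W) obtained by summing out the message source.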

[Figure: speech input feeds a bank of detectors (Detector 1 ... Detector N); their outputs go to an integrated search & confidence evaluation stage, which uses a partial language model and is followed by understanding & response.]

Detectors ・ Solved by discriminative training, consistent with the Neyman-Pearson lemma
Partial language modeling and detection-based search still need to be solved

An architecture of a detection-based speech understanding system
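The Neyman-Pearson lemma says that, for two simple hypotheses, thresholding the likelihood ratio gives the most powerful test at a given false-alarm rate. A single detector of that form can be sketched as follows. The example assumes hypothetical one-dimensional Gaussian models for the target and background classes and a fixed threshold; it does not show the discriminative training mentioned above, and none of the parameters come from the slides.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def detect(x, target, background, threshold=0.0):
    """Likelihood ratio test: return (decision, log-likelihood ratio)."""
    llr = gaussian_logpdf(x, *target) - gaussian_logpdf(x, *background)
    return llr > threshold, llr

# Hypothetical (mean, std) pairs for the two hypotheses (illustrative only).
target = (2.0, 1.0)
background = (0.0, 1.0)

hit, score = detect(1.8, target, background)
miss, _ = detect(0.1, target, background)
print(hit, round(score, 3), miss)
```

Raising the threshold trades missed detections for fewer false alarms, and the ratio itself can serve as a confidence score for a later search stage.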

[Figure: spontaneous speech, collected in a large-scale spontaneous speech corpus, passes through speech recognition to produce a transcription. Understanding, information extraction and summarization, drawing on world knowledge and on linguistic, para-linguistic and discourse information, then produce summarized text, keywords and synthesized voice.]

Overview of the Science and Technology Agency Priority Program “Spontaneous Speech: Corpus and Processing Technology”

[Figure: input and output modalities combined through synergy (fusion): text (keyboard); stylus; touch, handwriting; tactile; spoken language (speech recognition, TTS synthesis); audio; gesture; sign; visual I/O (display, lips/face recognition); gaze.]

Multimodal human-machine communication (HMC)

[Figure: multimedia information (broadcast news, etc.) is processed by image processing and speech recognition, and the results feed information retrieval.]

Information extraction and retrieval of spoken language content
(Spoken document retrieval, information indexing, story segmentation, topic tracking, topic detection, etc.)
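Spoken document retrieval can be illustrated by indexing recognizer output as if it were text. The sketch below assumes three hypothetical, error-free transcripts and a plain TF-IDF score; the document names, the texts and the scoring formula are all assumptions made for illustration, and a real system would also have to cope with recognition errors.

```python
import math
from collections import Counter

def build_index(transcripts):
    """Term counts per document plus a smoothed inverse document frequency."""
    docs = {doc_id: Counter(text.lower().split()) for doc_id, text in transcripts.items()}
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())             # one vote per document per term
    n = len(docs)
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    return docs, idf

def search(query, index):
    """Rank documents by summed TF-IDF of the query terms, best first."""
    docs, idf = index
    terms = query.lower().split()
    scores = {
        doc_id: sum(counts[t] * idf.get(t, 0.0) for t in terms)
        for doc_id, counts in docs.items()
    }
    return sorted((d for d in scores if scores[d] > 0), key=lambda d: -scores[d])

# Hypothetical recognizer transcripts of broadcast news items (illustrative only).
transcripts = {
    "news-1": "the prime minister announced a new budget today",
    "news-2": "heavy rain caused train delays in tokyo today",
    "news-3": "tokyo stock prices rose after the budget announcement",
}
index = build_index(transcripts)
ranked = search("train delays tokyo", index)
print(ranked)
```

The item that matches all three query terms is ranked first, the item that matches only "tokyo" second, and the item with no matching term is dropped.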

[Figure: speech understanding shown together with related areas: ubiquitous computing; Internet; mobile computing; image/motion processing; wearable computing; multimedia multimodal communication; human-computer interaction; dialog modeling; contents; information retrieval (access); information extraction.]

Emerging technology
