
AUTOMATIC SPEECH RECOGNITION AND ITS
APPLICATION TO INFORMATION EXTRACTION

Sadaoki Furui
Department of Computer Science
Tokyo Institute of Technology
2-12-1, Ookayama, Meguro-ku, Tokyo, 152-8552 Japan
furui@cs.titech.ac.jp
ABSTRACT

This paper describes recent progress and the author's perspectives of speech recognition technology. Applications of speech recognition technology can be classified into two main areas: dictation and human-computer dialogue systems. In the dictation domain, automatic broadcast news transcription is now actively investigated, especially under the DARPA project. The broadcast news dictation technology has recently been integrated with information extraction and retrieval technology, and many application systems, such as automatic voice document indexing and retrieval systems, are under development. In the human-computer interaction domain, a variety of experimental systems for information retrieval through spoken dialogue are being investigated. In spite of the remarkable recent progress, we are still behind our ultimate goal of understanding free conversational speech uttered by any speaker under any environment. This paper also describes the most important research issues that we should attack in order to advance toward our ultimate goal of fluent speech recognition.

1. INTRODUCTION

The field of automatic speech recognition has witnessed a number of significant advances in the past 5-10 years, spurred on by advances in signal processing, algorithms, computational architectures, and hardware. These advances include the widespread adoption of a statistical pattern recognition paradigm, a data-driven approach which makes use of a rich set of speech utterances from a large population of speakers, the use of stochastic acoustic and language modeling, and the use of dynamic programming-based search methods.

A series of (D)ARPA projects has been a major driving force of the recent progress in research on large-vocabulary, continuous-speech recognition. Specifically, dictation of read newspaper speech, such as that of North American business newspapers including the Wall Street Journal (WSJ), and conversational speech recognition using an Air Travel Information System (ATIS) task were actively investigated. More recent DARPA programs are the broadcast news dictation and natural conversational speech recognition using the Switchboard and Call Home tasks. Research on human-computer dialogue systems, the Communicator program, has also started [1]. Various other systems have been actively investigated in the US, Europe and Japan, stimulated by the DARPA projects. Most of them can be classified into either dictation systems or human-computer dialogue systems.

Figure 1 shows the mechanism of state-of-the-art speech recognizers [2]. Common features of these systems are the use of cepstral parameters and their regression coefficients as speech features, triphone HMMs as acoustic models, vocabularies of several thousand or several tens of thousands of entries, and stochastic language models such as bigrams and trigrams. Such methods have been applied not only to English but also to French, German, Italian, Spanish, Chinese and Japanese. Although there are several language-specific characteristics, similar recognition results have been obtained.

[Fig. 1 - Mechanism of state-of-the-art speech recognizers: acoustic analysis converts the speech input into feature vectors x1 ... xT; a global search maximizes P(x1 ... xT | w1 ... wk) · P(w1 ... wk) over word sequences w1 ... wk, using the phoneme inventory, pronunciation lexicon and language model; the output is the recognized word sequence.]
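In symbols, the global search shown in Fig. 1 implements the standard noisy-channel decision rule (a textbook formulation matching the figure, not any one system's notation):

```latex
\hat{w}_1 \cdots \hat{w}_k
  = \operatorname*{argmax}_{w_1 \cdots w_k}
    P(x_1 \cdots x_T \mid w_1 \cdots w_k)\; P(w_1 \cdots w_k)
```

where x1 ... xT is the sequence of acoustic feature vectors (cepstral parameters and their regression coefficients), the first factor is scored by the triphone HMMs through the phoneme inventory and pronunciation lexicon, and the second factor by the bigram/trigram language model.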
The remainder of this paper is organized as follows. Section 2 describes recent progress in broadcast news dictation and its application to information extraction, and Section 3 describes human-computer dialogue systems. In spite of the remarkable recent progress, we are still far behind our ultimate goal of understanding free conversational speech uttered by any speaker under any environment. Section 4 describes how to increase the robustness of speech recognition, and Section 5 describes perspectives of linguistic modeling for spontaneous speech recognition/understanding. Section 6 concludes the paper.

2. BROADCAST NEWS DICTATION AND INFORMATION EXTRACTION

2.1 DARPA Broadcast News Dictation Project

With the introduction of the broadcast news test bed to the DARPA project in 1995, the research effort took a profound step forward. Many of the deficiencies of the WSJ domain were resolved in the broadcast news domain [3]. Most importantly, the fact that broadcast news is a real-world domain of obvious value has led to rapid technology transfer of speech recognition into other research areas and applications. Since the variations in speaking style and accent, as well as in channel and environment conditions, are totally unconstrained, broadcast news is a superb stress test that requires new algorithms to work across widely varying conditions. Algorithms need to solve a specific problem without degrading any other condition. Another advantage of this domain is that news is easy to collect and the supply of data is boundless. The data is found speech; it is completely uncontrived.

2.2 Japanese Broadcast News Dictation System

We have been developing a large-vocabulary continuous-speech recognition (LVCSR) system for Japanese broadcast-news speech transcription [4][5]. This is a part of joint research with the NHK broadcasting company whose goal is the closed-captioning of TV programs. The broadcast-news manuscripts that were used for constructing the language models were taken from the period between July 1992 and May 1996, and comprised roughly 500k sentences and 22M words. To calculate word n-gram language models, we segmented the broadcast-news manuscripts into words by using a morphological analyzer, since Japanese sentences are written without spaces between words. A word-frequency list was derived for the news manuscripts, and the 20k most frequently used words were selected as vocabulary words. This 20k vocabulary covers about 98% of the words in the broadcast-news manuscripts. We calculated bigrams and trigrams and estimated unseen n-grams using Katz's back-off smoothing method.
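To illustrate the language-model construction described above, here is a minimal sketch of back-off bigram estimation (absolute discounting stands in for the Good-Turing discount of the actual Katz method; the corpus, vocabulary and discount value are placeholders, not the NHK data):

```python
from collections import Counter

def backoff_bigram_model(sentences, vocab, discount=0.5):
    """Back-off bigram estimation in the spirit of Katz smoothing.

    sentences : lists of words (Japanese text pre-segmented by a
                morphological analyzer, since it has no spaces)
    vocab     : e.g. the 20k most frequent words; the rest -> <unk>
    """
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + [w if w in vocab else "<unk>" for w in words] + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    def prob(w1, w2):
        if bigrams[(w1, w2)] > 0:
            # Seen bigram: discounted relative frequency.
            return (bigrams[(w1, w2)] - discount) / unigrams[w1]
        # Unseen bigram: redistribute the held-out mass over the
        # unigram counts of words never seen after w1 (back-off).
        held_out = discount * sum(1 for b in bigrams if b[0] == w1) / unigrams[w1]
        unseen = sum(unigrams[w] for w in unigrams if bigrams[(w1, w)] == 0)
        return held_out * unigrams[w2] / unseen if unseen else 0.0

    return prob
```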
Japanese text is written as a mixture of three kinds of characters: Chinese characters (Kanji) and two kinds of Japanese characters (Hiragana and Katakana). Most Kanji have multiple readings, and the correct reading can only be decided according to context. Conventional language models usually assign equal probability to all possible readings of each word. This causes recognition errors, because the assigned probability is sometimes very different from the true probability. We therefore constructed a language model that depends on the readings of words in order to take into account the frequency and context-dependency of the readings. Broadcast news speech also includes filled pauses at the beginning and in the middle of sentences, which cause recognition errors in our language models built from news manuscripts written prior to broadcasting. To cope with this problem, we introduced filled-pause modeling into the language model.

News speech data, from TV broadcasts in July 1996, were divided into two parts, a clean part and a noisy part, and were separately evaluated. The clean part consisted of utterances with no background noise, and the noisy part consisted of utterances with background noise. The noisy part included spontaneous speech such as reports by correspondents. We extracted 50 male utterances and 50 female utterances for each part, yielding four evaluation sets: male-clean (m/c), male-noisy (m/n), female-clean (f/c) and female-noisy (f/n). Each set included utterances by five or six speakers. All utterances were manually segmented into sentences. Table 1 shows the experimental results for the baseline language model (LM1) and the new language models. LM2 is the reading-dependent language model, and LM3 is a modification of LM2 by filled-pause modeling. For clean speech, LM2 reduced the word error rate by 4.7% relative to LM1, and LM3 reduced the word error rate by 10.9% relative to LM2 on average.

Table 1 - Experimental results of Japanese broadcast news dictation with various language models (word error rate [%])

Language model | m/c  | m/n  | f/c  | f/n
LM1            | 17.6 | 37.2 | 14.3 | 41.2
LM2            | 16.8 | 35.9 | 13.6 | 39.3
LM3            | 14.2 | 33.1 | 12.9 | 38.1

2.3 Information Extraction in the DARPA Project

News is filled with events, people, and organizations, and all manner of relations among them. The great richness of material and the naturally evolving content in broadcast news has leveraged its value into areas of research well beyond speech recognition. In the DARPA project, the Spoken Document Retrieval (SDR) track of TREC and the Topic Detection and Tracking (TDT) program are supported by the same materials and systems that have been developed in the broadcast news dictation arena [3]. BBN's Rough'n'Ready system extracts structural features of broadcast news. CMU's Informedia [6], MITRE's Broadcast Navigator, and SRI's Maestro have all exploited the multi-media features of news, producing a wide range of capabilities for browsing news archives interactively. These systems integrate various diverse speech and language technologies, including speech recognition, speaker change detection, speaker identification, name extraction, topic classification and information retrieval.

2.4 Information Extraction from Japanese Broadcast News

Summarizing transcribed news speech is useful for retrieving or indexing broadcast news. We investigated a method for extracting topic words from nouns in the speech recognition results on the basis of a significance measure [4][5]. The extracted topic words were compared with "true" topic words, which were given by three human subjects. The results are shown in Figure 2.

When the top five topic words were chosen (recall = 13%), 87% of them were correct on average.

[Fig. 2 - Topic word extraction results: percentage of correct topic words versus recall, for topic words extracted from recognized speech and from correct text transcriptions.]
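A minimal sketch of this kind of topic-word extraction (a generic tf-idf style score stands in for the significance measure actually used in [4][5]; all names are illustrative):

```python
import math
from collections import Counter

def extract_topic_words(story_nouns, doc_freq, num_docs, top_n=5):
    """Rank nouns from one recognized news story by significance.

    story_nouns : nouns found in the recognition result of one story
    doc_freq    : noun -> number of training documents containing it
    num_docs    : total number of training documents
    """
    tf = Counter(story_nouns)
    def significance(w):
        # Frequent in this story, rare in general -> high score.
        return tf[w] * math.log(num_docs / (1 + doc_freq.get(w, 0)))
    return sorted(tf, key=significance, reverse=True)[:top_n]

# Evaluation as in Fig. 2, against topic words chosen by human subjects:
#   recall    = |extracted & true| / |true|
#   precision = |extracted & true| / |extracted|
```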
3. HUMAN-COMPUTER DIALOGUE SYSTEMS

3.1 Typical Systems in the US and Europe

Recently a number of sites have been working on human-computer dialogue systems. The following are typical examples.

(a) The View4You system at the University of Karlsruhe

The University of Karlsruhe focuses its speech research on a content-addressable multimedia information retrieval system under a multi-lingual environment, where queries and multimedia documents may appear in multiple languages [7]. The system is called "View4You", and their research is conducted in cooperation with the Informedia project at CMU [6]. In the View4You system, German and Serbo-Croatian public newscasts are recorded daily. The newscasts are automatically segmented, and an index is created for each of the segments by means of automatic speech recognition. The user can query the system in natural language by keyboard or through a speech utterance. The system returns a list of segments sorted by relevance with respect to the user query. By selecting a segment, the user can watch the corresponding part of the news show on his/her computer screen. The system overview is shown in Fig. 3.

[Fig. 3 - System overview of the View4You system: a satellite receiver and MPEG coder supply video and audio; a segmenter and speech recognizer produce segment boundaries and text, which a video query server (drawing on a thesaurus and Internet newspaper text) uses to answer user queries.]

(b) The SCAN speech content based audio navigator at AT&T Labs

SCAN (Speech Content based Audio Navigator) is a spoken document retrieval system developed at AT&T Labs, integrating speaker-independent, large-vocabulary speech recognition with information retrieval to support query-based retrieval of information from speech archives [8]. Initial development focused on the application of SCAN to the broadcast news domain. An overview of the system architecture is provided in Fig. 4. The system consists of three components: (1) a speaker-independent large-vocabulary speech recognition engine which segments the speech archive and generates transcripts, (2) an information-retrieval engine which indexes the transcriptions and formulates hypotheses regarding document relevance to user-submitted queries, and (3) a graphical user interface which supports search and local contextual navigation based on the machine-generated transcripts and graphical representations of query-keyword distribution in the retrieved speech transcripts. The speech recognition component of SCAN includes an intonational phrase boundary detection module and a classification module. These subcomponents preprocess the speech data before passing the speech to the recognizer itself.
[Fig. 4 - Overview of the SCAN spoken document system architecture: intonational phrase boundary detection and classification preprocess the speech; the recognizer's transcripts feed the information-retrieval engine, which serves the user interface.]
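The indexing-plus-ranking step performed by such a retrieval engine can be sketched as follows (plain tf-idf scoring; an illustration of the idea, not AT&T's implementation):

```python
import math
from collections import Counter

class TranscriptIndex:
    """Index machine-generated transcripts and rank them against a
    query by tf-idf, a standard scheme standing in here for SCAN's
    information-retrieval engine [8]."""

    def __init__(self, transcripts):
        self.docs = [Counter(t.lower().split()) for t in transcripts]
        self.df = Counter(w for d in self.docs for w in d)  # document frequency
        self.n = len(self.docs)

    def search(self, query):
        scores = []
        for i, doc in enumerate(self.docs):
            s = sum(doc[w] * math.log(self.n / self.df[w])
                    for w in query.lower().split() if w in doc)
            scores.append((s, i))
        return [i for s, i in sorted(scores, reverse=True) if s > 0]

index = TranscriptIndex(["the president visited tokyo today",
                         "heavy rain hit the tokyo area",
                         "stock markets closed higher"])
print(index.search("tokyo president"))   # -> [0, 1]
```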
(c) The GALAXY conversational system at MIT

Galaxy is a client-server architecture developed at MIT for accessing on-line information using spoken dialogue [9]. It has served as the testbed for developing human language technology at MIT for several years. Recently, they have initiated a significant redesign of the GALAXY architecture to make it easier for researchers to develop their own applications, using either exclusively their own servers or intermixing them with servers developed by others. This redesign was done in part due to the fact that GALAXY has been designed as the first reference architecture for the new DARPA Communicator program. The resulting configuration of the GALAXY-II architecture is shown in Fig. 5. The boxes in this figure represent various human language technology servers as well as information and domain servers. The label in italics next to each box identifies the corresponding MIT system component. Interactions between servers are mediated by the hub and managed in the hub script. A particular dialogue session is initiated by a user either through interaction with a graphical interface at a Web site, through direct telephone dialup, or through a desktop agent.

[Fig. 5 - Architecture of GALAXY-II: a central hub mediates between servers for speech recognition (SUMMIT), frame construction (TINA), context tracking (Discourse), dialogue management, language generation (GENESIS), text-to-speech conversion (DECTALK & ENVOICE), audio/phone I/O, and application back-ends.]
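The hub-and-servers pattern can be illustrated with a small sketch (a hypothetical in-process API; the real GALAXY-II hub is driven by a hub script and communicates with networked servers):

```python
from typing import Callable, Dict

class Hub:
    """Toy hub: routes a frame from server to server under a script."""

    def __init__(self, script: Dict[str, str]):
        self.servers: Dict[str, Callable[[dict], dict]] = {}
        self.script = script          # frame state -> next server name

    def register(self, name: str, server: Callable[[dict], dict]) -> None:
        self.servers[name] = server

    def run(self, frame: dict) -> dict:
        # Each server transforms the frame; the script decides routing.
        while frame["state"] in self.script:
            frame = self.servers[self.script[frame["state"]]](frame)
        return frame

hub = Hub(script={"audio_in": "recognizer", "words": "parser",
                  "meaning": "dialogue"})
hub.register("recognizer", lambda f: {**f, "state": "words", "words": "show trains"})
hub.register("parser", lambda f: {**f, "state": "meaning", "frame": "train-query"})
hub.register("dialogue", lambda f: {**f, "state": "done", "reply": "Which city?"})
print(hub.run({"state": "audio_in"}))
```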

(d) The ARISE train travel information system at LIMSI

The ARISE (Automatic Railway Information Systems for Europe) project aims at developing prototype telephone information services for train travel in several European countries [10]. In collaboration with the Vecsys company and with the SNCF (the French Railways), LIMSI has developed a prototype telephone service providing timetables, simulated fares and reservations, and information on reductions and services for the main French intercity connections. A prototype French/English service for the high-speed trains between Paris and London is also under development. The system is based on the spoken language systems developed for the RailTel project [11] and the ESPRIT Mask project [12]. Compared to the RailTel system, the main advances in ARISE are in dialogue management, confidence measures, the inclusion of an optional spell mode for city/station names, and barge-in capability to allow more natural interaction between the user and the machine.

The speech recognizer uses n-gram backoff language models estimated on the transcriptions of spoken queries. Since the amount of language model training data is small, some grammatical classes, such as cities, days and months, are used to provide more robust estimates of the n-gram probabilities. A confidence score is associated with each hypothesized word, and if the score is below an empirically determined threshold, the hypothesized word is marked as uncertain. The uncertain words are ignored by the understanding component or used by the dialogue manager to start clarification subdialogues.
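The confidence-based rejection described above amounts to a simple filter (an illustrative sketch; the score range and threshold value are assumptions, not LIMSI's):

```python
def split_by_confidence(hypotheses, threshold):
    """Partition recognizer output into certain and uncertain words.

    hypotheses : list of (word, confidence) pairs from the recognizer
    threshold  : determined empirically on development data
    Uncertain words are ignored by the understanding component or
    trigger a clarification subdialogue.
    """
    certain, uncertain = [], []
    for word, score in hypotheses:
        (certain if score >= threshold else uncertain).append(word)
    return certain, uncertain

# e.g. split_by_confidence([("paris", 0.92), ("to", 0.88), ("lyon", 0.41)], 0.6)
# -> (["paris", "to"], ["lyon"])   # the manager might ask "Which city?"
```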

3.2 Designing a Multimodal Dialogue System for Information Retrieval

We have recently investigated a paradigm for designing multimodal dialogue systems [13]. An example task of the system was to retrieve particular information about different shops in the Tokyo metropolitan area, such as their names, addresses and phone numbers. The system accepted speech and screen touching as input, and presented retrieved information on a screen display or by synthesized speech, as shown in Fig. 6. The speech recognition part was modeled by an FSN (finite state network) consisting of keywords and fillers, both of which were implemented by the DAWG (directed acyclic word-graph) structure. The number of keywords was 306, consisting of district names and business names. The fillers accepted roughly 100,000 non-keywords/phrases occurring in spontaneous speech. A variety of dialogue strategies were designed and evaluated based on an objective cost function having a set of actions and states as parameters. Expected dialogue cost was calculated for each strategy, and the best strategy was selected according to the keyword recognition accuracy.

[Fig. 6 - Multimodal dialogue system structure for information retrieval: speech recognizer and screen-touch input feed the dialogue manager, which answers through the screen display and a speech synthesizer.]
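As a toy illustration of selecting a strategy by expected dialogue cost (all costs and the error model below are invented for the example, not taken from [13]):

```python
def expected_cost(num_slots, p, confirm):
    """Expected cost in turns of filling num_slots keyword slots when
    keyword recognition accuracy is p.  confirm=True adds one
    confirmation turn per slot but catches errors cheaply;
    confirm=False risks an expensive repair at the end."""
    ask = num_slots                                    # one turn per slot
    if confirm:
        return ask + num_slots + num_slots * (1 - p)   # confirm + re-ask
    return ask + (1 - p ** num_slots) * 8              # end-of-task repair

p = 0.85                                   # measured keyword accuracy
best = min([True, False], key=lambda c: expected_cost(3, p, c))
# At p = 0.85 skipping confirmation wins (6.09 vs 6.45 turns); at p = 0.5
# confirming wins (7.5 vs 10.0), so the best strategy depends on accuracy.
```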
4. ROBUST SPEECH RECOGNITION

4.1 Automatic adaptation

Ultimately, speech recognition systems should be capable of robust, speaker-independent or speaker-adaptive, continuous-speech recognition. Figure 7 shows the main causes of acoustic variation in speech [14]. It is crucial to establish methods that are robust against voice variation due to individuality, the physical and psychological condition of the speaker, telephone sets, microphones, network characteristics, additive background noise, speaking styles, and so on. Figure 8 shows the main methods for making speech recognition systems robust against voice variation. It is also important for the systems to impose few restrictions on tasks and vocabulary. To solve these problems, it is essential to develop automatic adaptation techniques.

[Fig. 7 - Main causes of acoustic variation in speech: noise (other speakers, background noise), channel effects (distortion, reverberation, echoes, dropouts, noise), speaker characteristics (voice quality, pitch, gender, dialect), speaking style (stress/emotion, speaking rate, Lombard effect), task/context (man-machine dialogue, dictation, free conversation, interview, phonetic/prosodic context), and microphone (distortion, electrical noise, directional characteristics).]

Extraction and normalization of (adaptation to) voice individuality is one of the most important issues [14]. A small percentage of people occasionally cause systems to produce exceptionally low recognition rates; this is an example of the "sheep and goats" phenomenon. Speaker adaptation (normalization) methods can usually be classified into supervised (text-dependent) and unsupervised (text-independent) methods. Unsupervised, on-line, instantaneous/incremental adaptation is ideal, since the system works as if it were a speaker-independent system, and it performs increasingly better as it is used. However, since we have to adapt many phonemes using a limited number of utterances containing only a limited number of phonemes, it is crucial to use reasonable modeling of speaker-to-speaker variability or constraints. Modeling of the mechanism of speech production is expected to provide a useful model of speaker-to-speaker variability.

[Fig. 8 - Main methods to cope with voice variation in speech recognition: microphone-level techniques (close-talking microphone, microphone array); analysis and feature extraction (auditory models such as EIH, SMC, PLP); feature-level normalization/adaptation (adaptive filtering, noise subtraction, comb filtering, cepstral mean normalization, delta-cepstra, RASTA); model-level normalization/adaptation (noise addition, HMM (de)composition (PMC), model transformation (MLLR), Bayesian adaptive learning); distance/similarity measures (frequency weighting, weighted cepstral distance, cepstrum projection); robust matching with reference templates/models (word spotting, utterance verification); and linguistic processing (language model adaptation).]
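For instance, cepstral mean normalization, one of the feature-level techniques listed in Fig. 8, takes only a few lines (a minimal sketch):

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the long-term cepstral mean from every frame.

    Stationary convolutional distortion (microphone, telephone channel)
    becomes an additive constant in the cepstral domain, so removing
    the per-utterance mean largely cancels it.
    cepstra : array of shape (num_frames, num_coefficients)
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Example: 300 frames of 12 cepstral coefficients with a channel bias.
frames = np.random.randn(300, 12) + 3.0    # +3.0 mimics a fixed channel
normalized = cepstral_mean_normalization(frames)
assert abs(normalized.mean()) < 1e-9       # the bias has been removed
```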
4.2 On-line speaker adaptation in broadcast news dictation

Since, in broadcast news, each speaker utters several sentences in succession, the recognition error rate can be reduced by adapting acoustic models incrementally within a segment that contains only one speaker. We applied on-line, unsupervised, instantaneous and incremental speaker adaptation combined with automatic detection of speaker changes [4]. The MLLR [15]-MAP [16] and VFS (vector-field smoothing) [17] methods were instantaneously and incrementally carried out for each utterance. The adaptation process is as follows. For the first input utterance, the speaker-independent model is used for both recognition and adaptation, and the first speaker-adapted model is created. For the second input utterance, the likelihood value of the utterance given the speaker-independent model and that given the speaker-adapted model are calculated and compared. If the former value is larger, the utterance is considered to be the beginning of a new speaker, and another speaker-adapted model is created. Otherwise, the existing speaker-adapted model is incrementally adapted. For the succeeding input utterances, speaker changes are detected in the same way by comparing the acoustic likelihood values of each utterance obtained from the speaker-independent model and the existing speaker-adapted models. If the speaker-independent model yields a larger likelihood than any of the speaker-adapted models, a speaker change is detected and a new speaker-adapted model is constructed. Experimental results show that the adaptation reduced the word error rate by 11.8% relative to the speaker-independent models.
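The detection-and-adaptation loop just described can be summarized as follows (a sketch of the procedure; recognize, likelihood and adapt are placeholders for the actual decoder, the HMM likelihood computation, and the MLLR [15]-MAP [16]/VFS [17] update):

```python
def online_speaker_adaptation(utterances, si_model, recognize, likelihood, adapt):
    """Unsupervised, instantaneous and incremental speaker adaptation
    with automatic speaker-change detection (sketch of the procedure
    described in the text; helper functions are placeholders)."""
    adapted_models = []
    for x in utterances:
        hyp = recognize(x)                 # transcribe the utterance
        li_si = likelihood(x, si_model)
        li_sa = [likelihood(x, m) for m in adapted_models]
        if not adapted_models or li_si > max(li_sa):
            # The speaker-independent model wins: assume a speaker
            # change and create a new speaker-adapted model.
            adapted_models.append(adapt(si_model, x, hyp))
        else:
            # Otherwise incrementally adapt the best-matching model.
            best = li_sa.index(max(li_sa))
            adapted_models[best] = adapt(adapted_models[best], x, hyp)
    return adapted_models
```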
5. PERSPECTIVES OF LANGUAGE MODELING

5.1 Language modeling for spontaneous speech recognition

One of the most important issues for speech recognition is how to create language models (rules) for spontaneous speech. When recognizing spontaneous speech in dialogues, it is necessary to deal with variations that are not encountered when recognizing speech that is read from texts. These variations include extraneous words, out-of-vocabulary words, ungrammatical sentences, disfluency, partial words, repairs, hesitations, and repetitions. It is crucial to develop robust and flexible parsing algorithms that match the characteristics of spontaneous speech. A paradigm shift from the present transcription-based approach to a detection-based approach will be important to solve such problems [2]. How to extract contextual information, predict users' responses, and focus on key words are very important issues.

Stochastic language modeling, such as bigrams and trigrams, has been a very powerful tool, so it would be very effective to extend its utility by incorporating semantic knowledge. It would also be useful to integrate unification grammars and context-free grammars for efficient word prediction. Style shifting is also an important problem in spontaneous speech recognition. In typical laboratory experiments, speakers read lists of words rather than trying to accomplish a real task. Users actually trying to accomplish a task, however, use a different linguistic style. Adaptation of linguistic models according to tasks, topics and speaking styles is a very important issue, since collecting a large linguistic database for every new task is difficult and costly.
5.2 Message-Driven Speech Recognition

State-of-the-art automatic speech recognition systems employ the criterion of maximizing P(W|X), where W is a word sequence and X is an acoustic observation sequence. This criterion is reasonable for dictating read speech. However, the ultimate goal of automatic speech recognition is to extract the underlying messages of the speaker from the speech signals. Hence we need to model the process of speech generation and recognition as shown in Fig. 9 [18], where M is the message (content) that a speaker intended to convey.

[Fig. 9 - A communication-theoretic view of speech generation and recognition: a message source P(M) passes through a linguistic channel P(W|M) (language, vocabulary, grammar, semantics, context, habits) and an acoustic channel P(X|W) (speaker, reverberation, noise, transmission characteristics, microphone) before reaching the speech recognizer.]

According to this model, the speech recognition process is represented as the maximization of the following a posteriori probability [4][5]:

    max_M P(M|X) = max_M Σ_W P(M|W) P(W|X)                    (1)

Using Bayes' rule, Eq. (1) can be expressed as

    max_M P(M|X) = max_M Σ_W P(X|W) P(W|M) P(M) / P(X)        (2)

For simplicity, we can approximate the equation as

    max_M P(M|X) = max_{M,W} P(X|W) P(W|M) P(M) / P(X)        (3)

P(X|W) is calculated using hidden Markov models in the same way as in usual recognition processes. We assume that P(M) has a uniform probability for all M. Therefore, we only need to consider further the term P(W|M). We assume that P(W|M) can be expressed as

    P(W|M) = λ P(W) + (1 − λ) P'(W|M)                          (4)

where λ, 0 ≤ λ ≤ 1, is a weighting factor. P(W), the first term of the right-hand side, represents the part of P(W|M) that is independent of M and can be given by a general statistical language model. P'(W|M), the second term of the right-hand side, represents the part of P(W|M) that depends on M. We consider that M is represented by a co-occurrence of words, based on the distributional hypothesis by Harris [19]. Since this approach formulates P'(W|M) without explicitly representing M, it can use information about the speaker's message M without being affected by the quantization problem of topic classes. This new formulation of speech recognition was applied to the Japanese broadcast news dictation, and it was found that word error rates for the clean set were slightly reduced by this method.
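Substituting Eq. (4) into Eq. (3), with P(M) uniform and P(X) constant for a given input, the decoding criterion becomes explicit (a restatement of the above, not an additional result):

```latex
\max_{M} P(M \mid X) \;\approx\; \max_{M,\,W}\;
  P(X \mid W)\,\bigl[\,\lambda\,P(W) + (1-\lambda)\,P'(W \mid M)\,\bigr]
```

That is, each hypothesis W is rescored by an interpolation of the general language model and the message-dependent co-occurrence model.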

6. CONCLUSIONS

Speech recognition technology has made remarkable progress in the past 5-10 years. Based on this progress, various application systems have been developed using dictation and spoken dialogue technology. One of the most important applications is information extraction and retrieval. Using speech recognition technology, broadcast news can be automatically indexed, producing a wide range of capabilities for browsing news archives interactively. Since speech is the most natural and efficient communication method between humans,
automatic speech recognition will continue to find applications, such as meeting/conference summarization, automatic closed captioning, and interpreting telephony. It is expected that speech recognizers will become the main input device of the "wearable" computers that are now actively investigated. In order to materialize these applications, we have to solve many problems. The most important issue is how to make speech recognition systems robust against acoustic and linguistic variation in speech. In this context, a paradigm shift from speech recognition to understanding, in which the underlying messages of the speaker, that is, the meaning/context that the speaker intended to convey, are extracted instead of transcribing all the spoken words, will be indispensable.

REFERENCES

[1] http://fofoca.mitre.org
[2] S. Furui: "Future directions in speech information processing", Proc. 16th ICA and 135th Meeting ASA, Seattle, pp. 1-4 (1998)
[3] F. Kubala: "Broadcast news is good news", DARPA Broadcast News Workshop, Virginia (1999)
[4] K. Ohtsuki, S. Furui, N. Sakurai, A. Iwasaki and Z.-P. Zhang: "Improvements in Japanese broadcast news transcription", DARPA Broadcast News Workshop, Virginia (1999)
[5] K. Ohtsuki, S. Furui, A. Iwasaki and N. Sakurai: "Message-driven speech recognition and topic-word extraction", Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Phoenix, pp. 625-628 (1999)
[6] M. Witbrock and A. G. Hauptmann: "Speech recognition and information retrieval: Experiments in retrieving spoken documents", Proc. DARPA Speech Recognition Workshop, Virginia, pp. 160-164 (1997). See also http://www.informedia.cs.cmu.edu/
[7] T. Kemp, P. Geutner, M. Schmidt, B. Tomaz, M. Weber, M. Westphal and A. Waibel: "The interactive systems labs View4You video indexing system", Proc. Int. Conf. Spoken Language Processing, Sydney, pp. 1639-1642 (1998)
[8] J. Choi, D. Hindle, J. Hirschberg, I. Magrin-Chagnolleau, C. Nakatani, F. Pereira, A. Singhal and S. Whittaker: "SCAN - speech content based audio navigator: a systems overview", Proc. Int. Conf. Spoken Language Processing, Sydney, pp. 2867-2870 (1998)
[9] S. Seneff, E. Hurley, R. Lau, C. Pao, P. Schmid and V. Zue: "GALAXY-II: a reference architecture for conversational system development", Proc. Int. Conf. Spoken Language Processing, Sydney, pp. 931-934 (1998)
[10] L. Lamel, S. Rosset, J. L. Gauvain and S. Bennacef: "The LIMSI ARISE system for train travel information", Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Phoenix, pp. 501-504 (1999)
[11] L. F. Lamel, S. K. Bennacef, S. Rosset, L. Devillers, S. Foukia, J. J. Gangolf and J. L. Gauvain: "The LIMSI RailTel system: Field trial of a telephone service for rail travel information", Speech Communication, 23, pp. 67-82 (1997)
[12] J. L. Gauvain, J. J. Gangolf and L. Lamel: "Speech recognition for an information kiosk", Proc. Int. Conf. Spoken Language Processing, Philadelphia, pp. 849-852 (1996)
[13] S. Furui and K. Yamaguchi: "Designing a multimodal dialogue system for information retrieval", Proc. Int. Conf. Spoken Language Processing, Sydney, pp. 1191-1194 (1998)
[14] S. Furui: "Recent advances in robust speech recognition", Proc. ESCA-NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 11-20 (1997)
[15] C. J. Leggetter and P. C. Woodland: "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models", Computer Speech and Language, 9, pp. 171-185 (1995)
[16] J.-L. Gauvain and C.-H. Lee: "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains", IEEE Trans. on Speech and Audio Processing, 2, 2, pp. 291-298 (1994)
[17] K. Ohkura, M. Sugiyama and S. Sagayama: "Speaker adaptation based on transfer vector field smoothing with continuous mixture density HMMs", Proc. Int. Conf. Spoken Language Processing, Banff, pp. 369-372 (1992)
[18] B.-H. Juang: "Automatic speech recognition: Problems, progress & prospects", IEEE Workshop on Neural Networks for Signal Processing (1996)
[19] Z. S. Harris: "Co-occurrence and transformation in linguistic structure", Language, 33, pp. 283-340 (1957)