
TSINGHUA SCIENCE AND TECHNOLOGY

ISSN 1007-0214 18/22 pp540-544
Volume 13, Number 4, August 2008

NICT/ATR Chinese-Japanese-English Speech-to-Speech Translation System

Tohru Shimizu**, Yutaka Ashikari, Eiichiro Sumita, ZHANG Jinsong, Satoshi Nakamura

Knowledge Creating Communication Research Center, National Institute of Information and Communications Technology;
ATR Spoken Language Translation Research Laboratories, 2-2-2 Keihanna Science City, Kyoto 619-0288, Japan

Abstract: This paper describes the latest version of the Chinese-Japanese-English handheld speech-to-
speech translation system developed by NICT/ATR, which is now ready to be deployed for travelers. With
the entire speech-to-speech translation function being implemented into one terminal, it realizes real-time,
location-free speech-to-speech translation. A new noise-suppression technique notably improves the speech
recognition performance. Corpus-based approaches of speech recognition, machine translation, and speech
synthesis enable coverage of a wide variety of topics and portability to other languages. Test results show
that the character accuracy of speech recognition is 82%-94% for Chinese speech, and that the bilingual
evaluation understudy (BLEU) score of machine translation is 0.55-0.74 for Chinese-Japanese and Chinese-English.

Key words: speech-to-speech translation; speech recognition; speech synthesis; machine translation;
large-scale corpus

Received: 2007-09-10; revised: 2008-03-18
** To whom correspondence should be addressed. E-mail: tohru.shimizu@nict.go.jp; Tel: 81-774-951301

Introduction

Language differences have been a serious barrier to globalization, international travel, and international business. Breaking the language barrier will bring larger markets for international business and international tourism, with mobile speech translators contributing to improved communication. Our goal is to realize automatic speech-to-speech translation for many language pairs. Research on speech recognition, speech synthesis, and machine translation started about 50 years ago, and the three fields developed independently for a long time until speech-to-speech translation research was proposed in the 1980s. At the beginning, the feasibility of speech-to-speech translation was the focus of research, because each component was difficult to build and integration seemed even more difficult.

Recent progress in corpus-based speech and language processing technologies has made it possible to realize multi-lingual speech-to-speech translation in real situations. The advantages of the corpus-based approach are that it achieves wide coverage, robustness, and portability to new languages and domains[1]. This paper describes recent progress of the NICT/ATR speech-to-speech translation system.

1 Speech-to-Speech Translation System

1.1 Automatic speech recognition (ASR)

Robust speech recognition in noisy environments is an important issue for speech-to-speech translation in the real world. Both a minimum mean square error (MMSE) estimator for the log Mel-spectral energy coefficients using a Gaussian mixture model (GMM)[2,3] and a particle filter[4] are introduced here to suppress interference and noise and to attenuate reverberation.
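The exact estimator of refs [2,3] is not reproduced here; the following is a minimal, hypothetical sketch of the underlying idea, an MMSE estimate of the clean log-Mel vector as a posterior-weighted combination of per-component predictions under a GMM, with a per-component correction term standing in for the full model-based compensation:

```python
import numpy as np

def mmse_enhance(y, weights, means_noisy, vars_noisy, bias_clean):
    """Sketch of GMM-based MMSE feature enhancement (illustrative only).

    y           : noisy log-Mel frame, shape (D,)
    weights     : GMM component priors, shape (K,)
    means_noisy : component means of the noisy-speech GMM, shape (K, D)
    vars_noisy  : diagonal covariances, shape (K, D)
    bias_clean  : hypothetical per-component correction toward clean speech, shape (K, D)
    """
    # log N(y; mu_k, Sigma_k) for each diagonal-covariance component
    log_lik = -0.5 * np.sum(
        (y - means_noisy) ** 2 / vars_noisy + np.log(2.0 * np.pi * vars_noisy), axis=1)
    log_post = log_lik + np.log(weights)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                      # posterior P(k | y)
    # MMSE estimate: E[x | y] = sum_k P(k | y) * E[x | y, k], with E[x | y, k] ~ y - bias_clean[k]
    return y - post @ bias_clean
```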
To obtain a compact and accurate model from corpora of limited size, the minimum description length successive state splitting (MDL-SSS)[5] algorithm and composite multi-class N-gram models[6] are used for acoustic and language modeling, respectively. The MDL-SSS algorithm automatically determines the appropriate number of parameters according to the size of the training data, based on the minimum description length (MDL) criterion. Japanese, English, and Chinese acoustic models were trained using data from 4200 Japanese speakers, 532 English speakers, and 1600 Chinese speakers. The models were also adapted to several accents, e.g., US (United States), AUS (Australia), and BRT (Britain) for English, and Putonghua, BJ (Beijing), SH (Shanghai), CT (Canton), and TW (Taiwan) for Chinese. A statistical language model was trained using a large-scale corpus (852×10³ Japanese sentences, 710×10³ English sentences, and 500×10³ Chinese sentences) drawn from travel-related topics. The model quickly adapts to different tasks and situations by separating the lexicon of location-specific proper nouns (e.g., place names, hotels, restaurants) from the lexicon of common nouns.
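The MDL criterion trades likelihood against model size. As a rough illustration only, and not the actual MDL-SSS splitting procedure of [5], a candidate state split could be accepted when it reduces a generic two-part description length:

```python
import math

# Generic MDL model-selection check (illustrative, not the MDL-SSS algorithm itself):
# description length = -log-likelihood + (number of free parameters / 2) * log(number of samples).

def description_length(neg_log_likelihood, num_params, num_samples):
    return neg_log_likelihood + 0.5 * num_params * math.log(num_samples)

def accept_split(nll_before, params_before, nll_after, params_after, num_samples):
    # Split the state only if the better fit outweighs the cost of the extra parameters.
    return description_length(nll_after, params_after, num_samples) < \
           description_length(nll_before, params_before, num_samples)
```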
Even when the acoustic and language models are well trained, environmental conditions such as speaker variability, mismatches between the training and testing channels, and interference from environmental noise may cause recognition errors. A generalized word posterior probability (GWPP) is used for post-processing of the speech recognition output to automatically reject erroneous (low-confidence) utterances[7-9].
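GWPP itself is computed from the recognizer's hypothesis space[7-9]; as a simplified, hypothetical stand-in, an utterance can be rejected whenever its least confident word falls below a threshold, assuming the decoder already supplies per-word posteriors:

```python
# Simplified confidence-based rejection in the spirit of GWPP post-processing;
# assumes the decoder already provides a posterior for each recognized word.

def reject_utterance(word_posteriors, threshold=0.5):
    """Return True if the recognized utterance should be discarded as unreliable."""
    if not word_posteriors:          # nothing was recognized
        return True
    return min(word_posteriors) < threshold

# Example: a hypothesis whose weakest word has posterior 0.31 would be rejected.
print(reject_utterance([0.93, 0.88, 0.31]))   # True
```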
1.2 Machine translation (MT)

Two translation modules, TATR (a phrase-based SMT module) and EM (a simple memory-based translation module), were automatically constructed from large-scale corpora. EM matches a given source sentence against the source-language parts of the translation examples; if a match is found, the corresponding target-language sentence is output. Otherwise, TATR is called. TATR, which is built within the framework of feature-based exponential models, uses a phrase translation probability from source to target, an inverse phrase translation probability, a lexical weighting probability from source to target, an inverse lexical weighting probability, and a phrase penalty.
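A schematic sketch of this control flow and of a feature-based exponential (log-linear) phrase score follows; the weights, feature names, and the smt_decode callable are illustrative placeholders, not the actual TATR implementation:

```python
import math

# Illustrative log-linear phrase-pair score with the five feature types named above;
# the weights are placeholders, not the tuned values used in the real system.
WEIGHTS = {"p_f2e": 1.0, "p_e2f": 1.0, "lex_f2e": 0.5, "lex_e2f": 0.5, "phrase_penalty": -0.3}

def phrase_score(features):
    """features: dict of log feature values (the phrase penalty is a constant 1.0)."""
    return math.exp(sum(WEIGHTS[name] * value for name, value in features.items()))

def translate(source, example_memory, smt_decode):
    # EM stage: exact match against the source side of the translation examples.
    if source in example_memory:
        return example_memory[source]
    # Otherwise fall back to the statistical decoder (hypothetical callable).
    return smt_decode(source)
```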
A three-step subword-based approach is used for word segmentation of Chinese[10]. The first step is a dictionary-based step, similar to the word segmentation provided by the Linguistic Data Consortium (LDC). The second step is a subword-based IOB (I: inside, O: outside, B: beginning) tagging step implemented with a conditional random fields (CRF) tagging model; subword-based IOB tagging achieves better segmentation than character-based IOB tagging. The third step is confidence-dependent disambiguation, which combines the results of the previous two steps. The subword-based segmentation was evaluated with two data sets, from the Sighan Bakeoff and from the NIST machine translation evaluation workshop. Our segmentation gave a higher F-score than the best published results for the second Sighan Bakeoff data sets[11]. The segmentation method was also evaluated in a translation scenario using the NIST translation evaluation 2005 data[12], where its bilingual evaluation understudy (BLEU) score was 1.1% higher than that obtained with the LDC word segmentation.
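As a small illustration of the tagging step's output format (not code from the paper), IOB tags over subword units can be converted back into a word segmentation as follows, assuming one tag per unit and treating O as a single-unit word:

```python
# Minimal IOB-to-segmentation decoder: a unit tagged B starts a new word,
# I continues the current word, and O is treated here as a single-unit word.

def iob_to_words(units, tags):
    words, current = [], ""
    for unit, tag in zip(units, tags):
        if tag in ("B", "O") or not current:
            if current:
                words.append(current)
            current = unit
        else:  # tag == "I": extend the current word
            current += unit
        if tag == "O":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

# Example with hypothetical subword units of a Chinese sentence:
print(iob_to_words(["北京", "大", "学", "在", "哪里"], ["B", "B", "I", "O", "B"]))
# -> ['北京', '大学', '在', '哪里']
```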
The language model plays an important role in statistical machine translation (SMT). The effectiveness of the language model is significant if the test data happen to have the same characteristics as the training data for the language models; however, this coincidence is rare in practice. To avoid this performance reduction, a topic adaptation technique is often used. For this purpose, a "topic" is defined as a cluster of bilingual sentence pairs[13]. The topic-dependent language models were tested using the IWSLT06 data, and our approach improved the BLEU score by between 1.1% and 1.4%[14].
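Purely as an illustration of the idea (the clustering and adaptation details of [13,14] are not reproduced here), a topic-adapted score might pick the most similar topic cluster and interpolate its language model with the general one; the object interface assumed below is hypothetical:

```python
import math

# Hypothetical topic-adapted language model: pick the topic cluster most similar
# to the input, then linearly interpolate that topic LM with the general LM.
# lm_general and each entry of topic_lms are assumed to expose logprob(sentence).

def topic_adapted_logprob(sentence, lm_general, topic_lms, similarity, lam=0.5):
    best_topic = max(topic_lms, key=lambda topic: similarity(sentence, topic))
    p_topic = math.exp(topic_lms[best_topic].logprob(sentence))
    p_general = math.exp(lm_general.logprob(sentence))
    return math.log(lam * p_topic + (1.0 - lam) * p_general)
```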
1.3 Speech synthesis (SS)

A speech synthesis engine, XIMERA, was developed using four large corpora (a 110-h corpus of a Japanese male, a 60-h corpus of a Japanese female, a 16-h corpus of an English male, and a 20-h corpus of a Chinese female). This corpus-based approach makes it possible to preserve the naturalness and personality of the speech without applying signal processing to the speech segments[15]. XIMERA's hidden Markov model (HMM)-based statistical prosody model is automatically trained, so it can generate highly natural F0 patterns[16]. In addition, the cost function for segment selection has been optimized based on perceptual experiments, thereby improving the naturalness of the selected segments[17].
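The paper does not show the cost function itself; the sketch below is the generic form such unit-selection costs usually take, a weighted sum of target and concatenation sub-costs, with placeholder weights rather than the perceptually tuned values of [17]:

```python
# Generic unit-selection cost: a weighted sum of target sub-costs (how well a
# candidate segment matches the desired prosody) and concatenation sub-costs
# (how smoothly adjacent segments join). The weights here are placeholders.

def selection_cost(candidates, target_cost, concat_cost, w_target=1.0, w_concat=1.0):
    """candidates: ordered list of segments chosen for an utterance."""
    cost = sum(w_target * target_cost(seg) for seg in candidates)
    cost += sum(w_concat * concat_cost(prev, nxt)
                for prev, nxt in zip(candidates, candidates[1:]))
    return cost
```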

1.4 Speech-to-speech translation terminal

The speech-to-speech translation system is designed for use with mobile terminals such as the one shown in Fig. 1. Speech-to-speech translation can be performed for any combination of Japanese, English, and Chinese. The device measures 150 mm (width) × 32 mm (thickness) × 95 mm (height). A uni-directional microphone is used for speech recognition in noisy environments.

Fig. 1 Handheld Chinese, Japanese, and English speech-to-speech translation system

The display has essentially two windows: one for the speech recognition results, and the other for the translation results. System messages are shown in each window.

1.5 Connections between internal and external speech-to-speech translation resources

The device can be connected to internal and external speech-to-speech translation resources (e.g., speech recognition, machine translation, and speech synthesis servers for other languages or language pairs) through a speech translation markup language (STML), and to web services using STML, as shown in Fig. 2.

Fig. 2 Connections between internal and external speech-to-speech translation resources using STML (Note: J stands for Japanese, E for English, and C for Chinese.)
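The STML message format itself is not given in the paper; purely to illustrate chaining external resources, a hypothetical client could call separate recognition, translation, and synthesis servers in sequence (the URLs and JSON fields below are invented for this sketch and are not the real STML interface):

```python
import json
from urllib.request import urlopen, Request

# Hypothetical illustration of chaining external speech-translation resources.
# The server URLs and payload fields are invented; the real system exchanges STML
# messages, whose schema is not shown in the paper.

def call(url, payload):
    req = Request(url, data=json.dumps(payload).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def speech_to_speech(audio_b64, src="ja", tgt="zh"):
    text = call("http://asr.example/recognize", {"lang": src, "audio": audio_b64})["text"]
    trans = call("http://mt.example/translate", {"src": src, "tgt": tgt, "text": text})["text"]
    return call("http://tts.example/synthesize", {"lang": tgt, "text": trans})["audio"]
```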
2 Evaluation of the Speech-to-Speech Translation System

2.1 Speech and language corpora

The system was evaluated against three kinds of speech and language corpora: the basic travel expression corpus (BTEC), machine-aided dialog (MAD), and field experiment data (FED)[18-21]. The test set includes different speaking styles: read speech, semi-spontaneous speech, and fully natural spontaneous speech. BTEC, in particular, includes parallel sentences in two languages that are commonly used in international travel. MAD and FED are dialogues collected using speech-to-speech translation; MAD was collected in an office, while FED was collected at Kansai International Airport and was spoken by actual travelers in the airport.

2.2 Speech recognition module

The speech recognition performances on the test sets are given in Tables 1 and 2 for real time factors of 1 and 5 on a 2.8-GHz CPU, where the real time factor is the ratio of the analysis time to the length of the utterance. The vocabulary size was about 35×10³ words in canonical form and 50×10³ words with pronunciation variations. Although the speech recognition performance for dialog speech is worse than that for read speech, the utterance correctness, excluding erroneous recognition output rejected using the GWPP[7,9], was greater than 83% in all cases.

2.3 Machine translation module

An evaluation of the system accuracy with sixteen reference translations is shown in Table 3. The accuracies of the translation outputs, ranked A (perfect), B (good), C (fair), or D (nonsense) by professional translators, are shown in Table 4. The results agree with the BLEU scores in Table 3.
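The paper's exact scoring setup is not reproduced here; as an illustration only, corpus-level BLEU with multiple references per segment can be computed with a standard toolkit such as sacrebleu (shown with two hypothetical reference sets rather than the sixteen used in the evaluation):

```python
import sacrebleu

# hyps: one system output per test sentence; refs: one list per reference set,
# each aligned with hyps. Sentences below are invented examples.
hyps = ["please show me the way to the station",
        "i would like to check in"]
refs = [["please tell me the way to the station", "i 'd like to check in"],
        ["could you show me how to get to the station", "i want to check in now"]]

bleu = sacrebleu.corpus_bleu(hyps, refs)
print(bleu.score)   # BLEU on a 0-100 scale
```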

Table 1 Evaluation of speech recognition module

| Characteristics                     | Number of speakers (J/E/C) | Number of utterances (J/E/C) | Perplexity (J/E/C) |
| BTEC (Read speech)                  | 20 / 20 / 20               | 510 / 510 / 510              | 18.9 / 24.7 / 34.2 |
| MAD (Dialog speech, Office, Clean)  | 12 / 12 / 10               | 502 / 502 / 502              | 23.2 / 28.9 / 62.5 |
| FED (Dialog speech, Airport, Noisy) | 6 / 18 / 23                | 155 / 156 / 145              | 36.2 / 21.8 / 52.6 |

Notes: J = Japanese, E = English, C = Chinese. Vocabulary size (×10³), given once for the three test sets: 66 (Japanese), 44 (English), 47 (Chinese).

Table 2 Speech recognition performance (real time factor = 5)

| Test set | Word accuracy (%), Japanese | Word accuracy (%), English | Character accuracy (%), Chinese | Utterance accuracy (%), Japanese, all | Utterance accuracy (%), Japanese, not rejected | Rejected utterances (%), Japanese |
| BTEC     | 94.9 (94.8) | 92.3 (88.4) | 94.2 (94.1) | 82.4 (82.4) | 87.1 (91.0) | 5.9 (13.1)  |
| MAD      | 92.9 (91.4) | 90.5 (85.8) | 83.4 (82.6) | 62.2 (60.2) | 83.9 (85.9) | 30.7 (37.6) |
| FED      | 91.0 (89.4) | 81.1 (77.5) | 89.6 (88.0) | 69.0 (65.8) | 91.4 (91.8) | 32.9 (36.8) |

Notes: The values in parentheses are the speech recognition performance at a real time factor of 1.

Table 3 Objective evaluation of machine translation module

| Language pair       | BLEU   |
| Japanese-to-English | 0.6998 |
| English-to-Japanese | 0.7496 |
| Japanese-to-Chinese | 0.6584 |
| Chinese-to-Japanese | 0.7400 |
| English-to-Chinese  | 0.5520 |
| Chinese-to-English  | 0.6581 |

Table 4 Human evaluations of the translation accuracy

| Language pair       | A (%) | A+B (%) | A+B+C (%) | D (%) |
| Japanese-to-English | 78.4  | 86.3    | 92.2      | 7.8   |
| English-to-Japanese | 74.3  | 85.7    | 93.9      | 6.1   |
| Japanese-to-Chinese | 68.0  | 78.0    | 88.8      | 11.2  |
| Chinese-to-Japanese | 68.6  | 80.4    | 89.0      | 11.0  |
| English-to-Chinese  | 52.5  | 67.1    | 79.4      | 20.6  |
| Chinese-to-English  | 68.0  | 77.3    | 86.3      | 13.7  |

3 Conclusions

This paper describes recent progress on the NICT/ATR speech-to-speech translation system. Corpus-based approaches for speech recognition, machine translation, and speech synthesis enable coverage of a wide variety of topics and portability to other languages. Since the entire speech-to-speech translation function is implemented in one terminal, it provides a real-time, location-free speech-to-speech translation service for many language pairs. Test results show that the speech recognition character accuracy is 82%-94% for Chinese, with BLEU scores for machine translation of 0.55-0.74 for Chinese-Japanese and Chinese-English.
References

[1] Nakamura S, Markov K, Nakaiwa H, et al. The ATR multilingual speech-to-speech translation system. IEEE Trans. on Audio, Speech, and Language Processing, 2006, 14(2): 365-376.
[2] Segura J C, de la Torre A, Benitez M C, et al. Model-based compensation of the additive noise for continuous speech recognition: Experiments using AURORA II database and tasks. In: Proceedings of Eurospeech. Aalborg, Denmark, 2001, 1: 221-224.
[3] Fujimoto M, Nakamura S. A non-stationary noise suppression method based on particle filtering and Polyak averaging. IEICE Transactions on Information and Systems, 2006, J89-ED(3): 922-930.
[4] Arulampalam M S, Maskell S, Gordon N, et al. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. on Signal Processing, 2002, 50(2): 174-188.
[5] Jitsuhiro T, Matsui T, Nakamura S. Automatic generation of non-uniform context-dependent HMM topologies based on the MDL criterion. In: Proceedings of Eighth European Conference on Speech Communication and Technology. Geneva, Switzerland, 2003: 2721-2724.
[6] Yamamoto H, Isogai S, Sagisaka Y. Multi-class composite N-gram language model. Speech Communication, 2003, 41: 369-379.
[7] Lo W K, Soong F K. Generalized posterior probability for minimum error verification of recognized sentences. In: Proceedings of International Conference on Acoustics, Speech, and Signal Processing. Philadelphia, USA, 2005, 1: 85-88.
[8] Soong F K, Lo W K, Nakamura S. Optimal acoustic and language model weight for minimizing word verification errors. In: Proceedings of ICSLP. Jeju, Korea, 2004, 1: 441-444.
[9] Takezawa T, Shimizu T. Performance improvement of dialog speech-to-speech translation by rejecting unreliable utterances. In: Proceedings of ICSLP. Pittsburgh, USA, 2006: 1169-1172.
[10] Zhang R, Kikui G, Sumita E. Subword-based tagging by conditional random fields for Chinese word segmentation. In: Companion Volume to the Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL). New York, USA, 2006: 193-196.
[11] Second International Chinese Word Segmentation Bakeoff. http://sighan.cs.uchicago.edu/bakeoff2005/.
[12] NIST 2005 Machine Translation Evaluation Official Results. http://www.nist.gov/speech/tests/mt/doc/mt05eval_official_results_release_20050801_v3.html.
[13] Zhang R, Yamamoto H, Paul M, et al. The NICT/ATR statistical machine translation system for the IWSLT 2006 evaluation. In: Proceedings of the International Workshop on Spoken Language Translation. Kyoto, Japan, 2006: 83-90.
[14] Yamamoto H, Sumita E. Online language model task adaptation for statistical machine translation (in Japanese). In: Proceedings of FIT2006. Fukuoka, Japan, 2006: 131-134.
[15] Kawai H, Toda T, Ni J, et al. XIMERA: A new TTS from ATR based on corpus-based technologies. In: Proceedings of the 5th ISCA Speech Synthesis Workshop. Pittsburgh, USA, 2004.
[16] Tokuda K, Yoshimura T, Masuko T, et al. Speech parameter generation algorithms for HMM-based speech synthesis. In: Proceedings of ICASSP. Istanbul, Turkey, 2000: 1215-1218.
[17] Toda T, Kawai H, Tsuzaki M. Optimizing sub-cost functions for segment selection based on perceptual evaluation in concatenative speech synthesis. In: Proceedings of ICASSP. Montreal, Canada, 2004: 657-660.
[18] Takezawa T, Kikui G. Collecting machine-translation-aided bilingual dialogues for corpus-based speech-to-speech translation. In: Proceedings of Eighth European Conference on Speech Communication and Technology. Geneva, Switzerland, 2003: 2757-2760.
[19] Kikui G, Sumita E, Takezawa T, et al. Creating corpora for speech-to-speech translation. In: Proceedings of Eighth European Conference on Speech Communication and Technology. Geneva, Switzerland, 2003: 381-384.
[20] Takezawa T, Kikui G. A comparative study on human communication behaviors and linguistic characteristics for speech-to-speech translation. In: Proceedings of LREC. Lisbon, Portugal, 2004: 1589-1592.
[21] Kikui G, Takezawa T, Mizushima M, et al. Monitor experiments of ATR speech-to-speech translation system. In: Proceedings of Autumn Meeting of the Acoustical Society of Japan. Tokyo, Japan, 2005: 19-20.
