ISSN 1007-0214 18/22 pp540-544
Volume 13, Number 4, August 2008
Tohru Shimizu, Yutaka Ashikari, Eiichiro Sumita, ZHANG Jinsong, Satoshi Nakamura
Knowledge Creating Communication Research Center, National Institute of Information and Communications Technology;
ATR Spoken Language Translation Research Laboratories, 2-2-2 Keihanna Science City, Kyoto 619-0288, Japan
Abstract: This paper describes the latest version of the Chinese-Japanese-English handheld speech-to-
speech translation system developed by NICT/ATR, which is now ready to be deployed for travelers. With
the entire speech-to-speech translation function being implemented into one terminal, it realizes real-time,
location-free speech-to-speech translation. A new noise-suppression technique notably improves the speech
recognition performance. Corpus-based approaches of speech recognition, machine translation, and speech
synthesis enable coverage of a wide variety of topics and portability to other languages. Test results show
that the character accuracy of speech recognition is 82%-94% for Chinese speech, and that the bilingual
evaluation understudy (BLEU) score of machine translation is 0.55-0.74 for Chinese-Japanese and Chinese-English.
Key words: speech-to-speech translation; speech recognition; speech synthesis; machine translation;
large-scale corpus
successive state splitting (MDL-SSS)[5] algorithm is used for acoustic modeling, and composite multi-class N-gram models[6] are used for language modeling. The MDL-SSS algorithm automatically determines the appropriate number of parameters according to the size of the training data, based on the minimum description length (MDL) criterion. Japanese, English, and Chinese acoustic models were trained using data from 4200 Japanese speakers, 532 English speakers, and 1600 Chinese speakers. The models were also adapted to several accents, e.g., US (United States), AUS (Australia), and BRT (Britain) for English, and Putonghua, BJ (Beijing), SH (Shanghai), CT (Canton), and TW (Taiwan) for Chinese. A statistical language model was trained using a large-scale corpus (852×10³ Japanese sentences, 710×10³ English sentences, and 500×10³ Chinese sentences) drawn from travel-related topics. The model quickly adapts to different tasks and situations by separating the lexicon of location-specific proper nouns (e.g., place names, hotels, restaurants) from the lexicon of common nouns.

Even when the acoustic and language models are well trained, environmental conditions such as speaker variability, mismatches between training and testing channels, and interference from environmental noise may cause recognition errors. A generalized word posterior probability (GWPP) is used for post-processing of the speech recognition output to automatically reject erroneous (low-confidence) utterances[7-9].

1.2 Machine translation (MT)

Two translation modules, TATR (a phrase-based SMT module) and EM (a simple memory-based translation module), were automatically constructed from large-scale corpora. The EM matches a given source sentence against the source-language parts of translation examples. If an exact match is achieved, the corresponding target-language sentence is output; otherwise, TATR is called. TATR, which is built within the framework of feature-based exponential models, uses a phrase translation probability from source to target, an inverse phrase translation probability, a lexical weighting probability from source to target, an inverse lexical weighting probability, and a phrase penalty.

A three-step subword-based approach is used for word segmentation of Chinese[10]. The first step is a dictionary-based step, similar to the word segmentation provided by the Linguistic Data Consortium (LDC). The second step is a subword-based IOB (I: inside, O: outside, B: beginning) tagging step implemented by a conditional random fields (CRF) tagging model. The subword-based IOB tagging achieves better segmentation than character-based IOB tagging. The third step is confidence-dependent disambiguation, which combines the results of the previous two steps. The subword-based segmentation was evaluated with two data sets, from the Sighan Bakeoff and the NIST machine translation evaluation workshop. Our segmentation gave a higher F-score than the best published results for the second Sighan Bakeoff data sets[11]. The segmentation method was also evaluated in a translation scenario using the NIST translation evaluation 2005 data[12], where its bilingual evaluation understudy (BLEU) score was 1.1% higher than that obtained using the LDC word segmentation.

The language model plays an important role in statistical machine translation (SMT). The effectiveness of the language model is significant if the test data happen to have the same characteristics as the training data for the language model. However, this coincidence is rare in practice. To avoid this performance reduction, a topic adaptation technique is often used. For this purpose, a "topic" is defined as a cluster of bilingual sentence pairs[13]. The topic-dependent language models were tested using IWSLT06 data. Our approach improved the BLEU score by between 1.1% and 1.4%[14].

1.3 Speech synthesis (SS)

A speech synthesis engine, XIMERA, was developed using four large corpora (a 110-h corpus of a Japanese male, a 60-h corpus of a Japanese female, a 16-h corpus of an English male, and a 20-h corpus of a Chinese female). This corpus-based approach makes it possible to preserve the naturalness and personality of the speech without applying signal processing to the speech segments[15]. XIMERA's hidden Markov model (HMM)-based statistical prosody model is automatically trained, so it can generate highly natural F0 patterns[16]. In addition, the cost function for segment selection has been optimized based on perceptual experiments, thereby improving the naturalness of the selected segments[17].
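Segment selection of this kind is, at its core, a Viterbi-style dynamic-programming search: one candidate segment is chosen per target unit so that the sum of per-unit target costs and pairwise concatenation (join) costs is minimized. The sketch below illustrates that general technique only; it is not XIMERA's implementation, and the function names, cost functions, and data are hypothetical.

```python
# Minimal sketch of unit selection for concatenative synthesis:
# choose one candidate segment per target unit minimizing the total
# target cost + concatenation cost via a Viterbi search.
# Illustrative only; costs and candidates are hypothetical.

def select_segments(candidates, target_cost, concat_cost):
    """candidates: list over target units; candidates[t] lists segment ids.
    target_cost(t, seg) and concat_cost(prev, seg) are user-supplied.
    Returns (best segment sequence, its total cost)."""
    # best[seg] = (cost of cheapest path ending in seg, backpointer)
    best = {seg: (target_cost(0, seg), None) for seg in candidates[0]}
    history = [best]
    for t in range(1, len(candidates)):
        current = {}
        for seg in candidates[t]:
            # cheapest predecessor, accounting for the join cost
            prev, cost = min(
                ((p, best[p][0] + concat_cost(p, seg)) for p in best),
                key=lambda pc: pc[1],
            )
            current[seg] = (cost + target_cost(t, seg), prev)
        history.append(current)
        best = current
    # backtrack from the cheapest final segment
    seg = min(best, key=lambda s: best[s][0])
    total = best[seg][0]
    path = [seg]
    for t in range(len(history) - 1, 0, -1):
        seg = history[t][seg][1]
        path.append(seg)
    return path[::-1], total
```

In a real engine the two cost functions would encode prosodic and spectral distances, with sub-cost weights tuned against perceptual experiments as described above.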
Tsinghua Science and Technology, August 2008, 13(4): 540-544

Table 3  Objective evaluation of the machine translation module

  Language pair         BLEU
  Japanese-to-English   0.6998
  English-to-Japanese   0.7496
  Japanese-to-Chinese   0.6584
  Chinese-to-Japanese   0.7400
  English-to-Chinese    0.5520
  Chinese-to-English    0.6581

Table 4  Human evaluations of the translation accuracy

                        Translation accuracy (%)
  Language pair         A      A+B    A+B+C   D
  Japanese-to-English   78.4   86.3   92.2    7.8
  English-to-Japanese   74.3   85.7   93.9    6.1
  Japanese-to-Chinese   68.0   78.0   88.8   11.2
  Chinese-to-Japanese   68.6   80.4   89.0   11.0
  English-to-Chinese    52.5   67.1   79.4   20.6
  Chinese-to-English    68.0   77.3   86.3   13.7

3 Conclusions

This paper describes recent progress on the NICT/ATR speech-to-speech translation system. Corpus-based approaches for speech recognition, machine translation, and speech synthesis enable coverage of a wide variety of topics and portability to other languages. Since the entire speech-to-speech translation function is implemented in one terminal, it provides a real-time, location-free speech-to-speech translation service for many language pairs. Test results show that speech recognition character accuracy is 82%-94% for Chinese, with a BLEU score for machine translation of 0.55-0.74 for Chinese-Japanese and Chinese-English.

References

[1] Nakamura S, Markov K, Nakaiwa H, et al. The ATR multilingual speech-to-speech translation system. IEEE Trans. on Audio, Speech, and Language Processing, 2006, 14(2): 365-376.
[2] Segura J C, de la Torre A, Benitez M C, et al. Model-based compensation of the additive noise for continuous speech recognition: Experiments using AURORA II database and tasks. In: Proceedings of Eurospeech. Aalborg, Denmark, 2001, 1: 221-224.
[3] Fujimoto M, Nakamura S. A non-stationary noise suppression method based on particle filtering and Polyak averaging. IEICE Transactions on Information and Systems, 2006, J89-D(3): 922-930.
[4] Arulampalam M S, Maskell S, Gordon N, et al. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. on Signal Processing, 2002, 50(2): 174-188.
[5] Jitsuhiro T, Matsui T, Nakamura S. Automatic generation of non-uniform context-dependent HMM topologies based on the MDL criterion. In: Proceedings of Eighth European Conference on Speech Communication and Technology. Geneva, Switzerland, 2003: 2721-2724.
[6] Yamamoto H, Isogai S, Sagisaka Y. Multi-class composite N-gram language model. Speech Communication, 2003, 41: 369-379.
[7] Lo W K, Soong F K. Generalized posterior probability for minimum error verification of recognized sentences. In: Proceedings of International Conference on Acoustics, Speech, and Signal Processing. Philadelphia, USA, 2005, 1: 85-88.
[8] Soong F K, Lo W K, Nakamura S. Optimal acoustic and language model weight for minimizing word verification errors. In: Proceedings of ICSLP. Jeju, Korea, 2004, 1: 441-444.
[9] Takezawa T, Shimizu T. Performance improvement of dialog speech-to-speech translation by rejecting unreliable utterances. In: Proceedings of ICSLP. Pittsburgh, USA, 2006: 1169-1172.
[10] Zhang R, Kikui G, Sumita E. Subword-based tagging by conditional random fields for Chinese word segmentation. In: Companion Volume to the Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL). New York, USA, 2006: 193-196.
[11] Second International Chinese Word Segmentation Bakeoff. http://sighan.cs.uchicago.edu/bakeoff2005/.
[12] NIST 2005 Machine Translation Evaluation Official Results. http://www.nist.gov/speech/tests/mt/doc/mt05eval_official_results_release_20050801_v3.html.
[13] Zhang R, Yamamoto H, Paul M, et al. The NICT/ATR statistical machine translation system for the IWSLT 2006 evaluation. In: Proceedings of the International Workshop on Spoken Language Translation. Kyoto, Japan, 2006: 83-90.
[14] Yamamoto H, Sumita E. Online language model task adaptation for statistical machine translation (in Japanese). In: Proceedings of FIT2006. Fukuoka, Japan, 2006: 131-134.
[15] Kawai H, Toda T, Ni J, et al. XIMERA: A new TTS from ATR based on corpus-based technologies. In: Proceedings of the 5th ISCA Speech Synthesis Workshop. Pittsburgh, USA, 2004.
[16] Tokuda K, Yoshimura T, Masuko T, et al. Speech parameter generation algorithms for HMM-based speech synthesis. In: Proceedings of ICASSP. Istanbul, Turkey, 2000: 1215-1218.
[17] Toda T, Kawai H, Tsuzaki M. Optimizing sub-cost functions for segment selection based on perceptual evaluation in concatenative speech synthesis. In: Proceedings of ICASSP. Montreal, Canada, 2004: 657-660.
[18] Takezawa T, Kikui G. Collecting machine-translation-aided bilingual dialogues for corpus-based speech-to-speech translation. In: Proceedings of Eighth European Conference on Speech Communication and Technology. Geneva, Switzerland, 2003: 2757-2760.
[19] Kikui G, Sumita E, Takezawa T, et al. Creating corpora for speech-to-speech translation. In: Proceedings of Eighth European Conference on Speech Communication and Technology. Geneva, Switzerland, 2003: 381-384.
[20] Takezawa T, Kikui G. A comparative study on human communication behaviors and linguistic characteristics for speech-to-speech translation. In: Proceedings of LREC. Lisbon, Portugal, 2004: 1589-1592.
[21] Kikui G, Takezawa T, Mizushima M, et al. Monitor experiments of ATR speech-to-speech translation system. In: Proceedings of Autumn Meeting of the Acoustical Society of Japan. Tokyo, Japan, 2005: 19-20.