1. INTRODUCTION

Under the auspices of DARPA's Global Autonomous Language Exploitation (GALE) project, a tremendous amount of work has been done by the speech research community toward improving speech recognition performance. The goal of the GALE program is to make foreign-language (Arabic and Chinese) speech and text accessible to English monolingual people, particularly in military settings. A core component of GALE is ASR, and research in this area spans multiple fields ranging from traditional speech recognition to speaker segmentation and clustering, sentence boundary detection, etc.

In this work, we focus primarily on Arabic broadcast news transcription, although many of the techniques expounded here have been successfully applied to our Mandarin and English BN systems as well. Research on Arabic ASR has been fertile over the past few years, as attested by the numerous papers published on this subject [1, 2, 3]. In the following, we describe the key design characteristics of our system:

• A cross-adaptation architecture between unvowelized and vowelized speaker-adaptive trained (SAT) acoustic models. The distinction between the two comes from the explicit modeling of short vowels, which are pronounced in Arabic but almost never transcribed. Both models use a pentaphone acoustic context and comprise 5K context-dependent states and 400K 40-dimensional Gaussians. The gains from cross-adaptation are estimated to be about 1% absolute.

• Large-scale discriminative training on 1800 hours of unsupervised data. The aforementioned models were trained with a combination of fMPE and MPE [4] on 135 hours of supervised data and 1800 hours of TDT4 BN-03 data. The gains from the unsupervised data are 1.3% absolute after discriminative training.

This work was partially supported by the Defense Advanced Research Projects Agency under contract No. HR0011-06-2-0001.

2. TRAINING AND TEST DATA

For acoustic model training, we used the following corpora:

• 85 hours of FBIS + TDT4 with transcripts provided by BBN

• 51 hours of GALE data (first and second quarter releases) provided by the Linguistic Data Consortium (LDC)

• 1800 hours of unsupervised data (i.e. without transcripts) which are part of TDT4 BN-03

The following resources were used for language modeling:

• Transcripts of the audio data

• Arabic Gigaword corpus

• Web transcripts for broadcast conversations collected by CMU/ISL (28M words from Al-Jazeera)

The resulting language model was a 4-gram LM with 56M n-grams trained with modified Kneser-Ney smoothing. Throughout this paper, we report results on the following test sets:

• RT-04: 3 shows of 25 minutes each (totaling 75 minutes of BN)

• BNAT-05: 12 shows of 30 minutes each (totaling 5.5 hours of BN) from 5 different sources (VOA, NTV, ALJ, DUB, LBC) provided by BBN

• BCAD-05: 6 shows of 30 minutes each (totaling 3 hours of BC) from ALJ provided by BBN

• EVAL-06: 37 shows of 5 minutes each (totaling 3 hours of BN and BC) from 9 different sources

Table 1 shows the OOV rates and the word error rates on RT-04 for an ML-trained vowelized system (without pronunciation probabilities) as a function of the vocabulary size. As can be seen, the benefit from increasing the vocabulary from 129K to 589K words is 1.5% absolute.
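The LM above is trained with modified Kneser-Ney smoothing. As a minimal sketch of the underlying idea, the following toy implementation computes interpolated Kneser-Ney probabilities for a bigram model with a single fixed discount (the "modified" variant used in practice employs separate, count-dependent discounts and higher orders; this simplification is mine, not the paper's):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Return P(w | u) under interpolated Kneser-Ney with one fixed discount."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history_count = Counter(tokens[:-1])   # c(u) summed over following words
    left_contexts = defaultdict(set)       # distinct histories preceding w
    followers = defaultdict(set)           # distinct words following u
    for u, w in bigrams:
        left_contexts[w].add(u)
        followers[u].add(w)
    n_bigram_types = len(bigrams)

    def prob(u, w):
        # Discounted bigram estimate, interpolated with the continuation
        # unigram P_cont(w) = (# distinct left contexts of w) / (# bigram types).
        p_cont = len(left_contexts[w]) / n_bigram_types
        lam = discount * len(followers[u]) / history_count[u]
        return max(bigrams[(u, w)] - discount, 0) / history_count[u] + lam * p_cont

    return prob
```

The discounted mass subtracted from seen bigrams is redistributed via the continuation unigram, which rewards words that appear after many distinct histories rather than words that are merely frequent.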
Nb. words   Nb. pronunciations   OOV    WER
129K        538K                 2.9%   19.8%
343K        1226K                1.2%   18.6%
589K        1967K                0.8%   18.3%

Table 1: OOV and word error rates on RT-04 as a function of vocabulary size.

System   Nb. leaves   Nb. Gaussians
SI       3.5K         150K
U-SA     5.0K         400K
V-SA     4.0K         400K

Table 3: System statistics.
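The OOV and WER figures reported here follow the standard definitions; a minimal sketch (word-level Levenshtein alignment for WER, token lookup against the vocabulary for OOV; the function names are mine):

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i-1][j-1] + (r[i-1] != h[j-1]),  # substitution/match
                          d[i-1][j] + 1,                      # deletion
                          d[i][j-1] + 1)                      # insertion
    return d[len(r)][len(h)] / len(r)

def oov_rate(vocab, test_tokens):
    """Fraction of running test tokens not covered by the vocabulary."""
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)
```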
Figure 1: System diagram.

3.3. Unsupervised training

The starting point for unsupervised training is 1800 hours of audio included in the TDT4 BN-03 dataset. First, we decode the entire data with an unvowelized speaker-adapted system following steps (1)-(4) from section 3 (left-hand side of Figure 1). Then, we select a subset of the data based on per-utterance average word posteriors. The word posteriors are obtained using consensus processing on the word lattices [7]. We then build the acoustic models by varying the amount of unsupervised data in addition to the 135 hours of supervised data (cf. section 2). In table 4, we
Posterior thresh.   Nb. of hours   WER
1.0 (baseline)      0              17.2%
0.95                751            15.9%
0.9                 1321           15.7%

Training Method   WER (RT-04)
Flat-Start        23.0%
Bootstrap         22.8%
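The selection step of section 3.3 can be sketched as follows, assuming each decoded utterance arrives as a list of (word, posterior) pairs produced by consensus processing; this flat data layout is a simplifying assumption of the sketch, not the actual lattice or confusion-network format:

```python
def select_utterances(decoded, threshold=0.9):
    """Keep utterances whose average word posterior meets the threshold;
    return their word sequences for use as unsupervised transcripts."""
    kept = []
    for words_with_posteriors in decoded:
        posteriors = [p for _, p in words_with_posteriors]
        if sum(posteriors) / len(posteriors) >= threshold:
            kept.append([w for w, _ in words_with_posteriors])
    return kept
```

Lowering the threshold trades transcript quality for quantity, which matches the trend in the posterior-threshold table: relaxing it from 0.95 to 0.9 grows the selected data from 751 to 1321 hours.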