ABSTRACT

Multilingual automatic speech recognition (ASR) models with phones as modeling units have improved greatly in low-resource and similar-language scenarios, benefiting from shared representations across languages. Meanwhile, subwords have demonstrated their effectiveness for monolingual end-to-end recognition systems. In this paper, we investigate the use of phone-based subwords, specifically Byte Pair Encoding (BPE), as modeling units for multilingual end-to-end speech recognition. To explore the possibilities of phone-based BPE (PBPE) for multilingual ASR, we first use three types of multilingual BPE training methods for similar low-resource languages in Central Asia. Then, by adding three high-resource European languages to the experiments, we analyze the degree of language sharing in similar-language and low-resource scenarios. Finally, we propose a method to adjust the bigram statistics in the BPE algorithm and show that the PBPE representation leads to accuracy improvements in multilingual scenarios. The experiments show that PBPE outperforms phone, character and character-based BPE as output representation units. In particular, the best PBPE model in the multilingual experiments achieves a 25% relative improvement on a low-resource language compared to a character-based BPE system.

Index Terms— Speech recognition, multilingual, subword, BPE, phone-based BPE

1. INTRODUCTION

Recent end-to-end (E2E) ASR has become predominant in both research and industry thanks to its modeling simplicity, compactness, and efficacy, which makes E2E neural models an attractive option for multilingual ASR. A major focus of multilingual ASR is improving performance on low-resource languages, which benefit from the shared representation of similar languages. Meanwhile, the degree of representation sharing can be greatly improved by an appropriate output representation. One standard method for multilingual output representation is character-based subwords: Byte-Pair Encoding (BPE) [1] is based on a compression algorithm that replaces frequent bigrams with new tokens, thereby compressing the input. BPE variants include using bytes instead of characters as input or estimating unigram probabilities for the BPE subwords [2]. For multilingual E2E ASR, which expects to benefit from shared phonetic representations, the use of such spelling-related modeling units can be problematic, especially for languages where there is no consistent connection between word spelling and pronunciation. For example, it is not reasonable to share linguistic and phonetic information between Chinese and Japanese, even though they have many identical characters in spelling. A natural way to multilingual unit modeling is from a phonetic perspective, such as the International Phonetic Alphabet (IPA), which can increase representation sharing across languages and lead to higher recognition accuracy [3].

One intuitive approach is to combine the advantages of BPE subwords and phones. [4] proposes Phonetically Induced Subwords (PhIS) to generate a grapheme subword vocabulary that retains the properties of a phoneme subword vocabulary and can be used in a probabilistic tokenization framework. [5] proposes pronunciation-assisted subword modeling (PASM), which uses statistics collected from common correspondences between subword units and phonetic units to guide the segmentation process. [6] investigates the use of phone-based subwords as modeling units for English end-to-end speech recognition. However, these studies are limited to the monolingual scenario.

Inspired by the above, after extensive investigations into multilingual E2E ASR models exploring different types of output representations, including character-level, phone-level, and character-based BPE, we propose phone-based BPE (PBPE) for multilingual E2E ASR. We explore different PBPE configurations. Speech recognition results on six languages show the effectiveness of PBPE over existing unit modeling methods. We first experiment on three Central Asian languages that belong to the same language family. Then, by adding three high-resource European languages, we extend the multilingual PBPE experiments to a more common scenario. Our contributions are as follows: (1) We compare the strengths and weaknesses of different output representations in trilingual use cases; (2) We investigate three types of PBPE training mechanisms for multilingual scenarios; (3) We propose a method to adjust the bigram statistics in the BPE algorithm, showing that the PBPE representation leads to accuracy improvements in multilingual scenarios; (4) We analyze different representations and demonstrate how they can be improved for multilingual ASR.

To the best of our knowledge, we are the first to apply phone-based BPE to multilingual E2E ASR and to analyze its effect on language sharing and ASR performance. We would like to emphasize that, beyond using a pronunciation dictionary for subword extraction and decoding, our method avoids the extra processing steps of conventional systems. Moreover, large collections of pronunciations are readily accessible [7][8][9], and pronunciations of out-of-collection words can be constructed with grapheme-to-phoneme methods [10][11][12]; thus our approach maintains the simplicity of end-to-end methods.

This work was supported by the National Key R&D Program of China (2020AAA0107902). Hao Huang is the corresponding author.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on July 02,2023 at 14:18:47 UTC from IEEE Xplore. Restrictions apply.

2. MULTILINGUAL PHONE-BASED SUBWORD

Fig. 1. Phone-level subword based multilingual ASR architecture.

The goal of training an E2E multilingual ASR model with phone-based subword units is to share phone representations across languages while exploiting the efficiency of BPE to get better performance. This may improve the alignment of the embedding spaces across languages that share subwords with the same pronunciation. Figure 1 shows the stages of the architecture. First, text is converted to a phoneme sequence by a grapheme-to-phoneme (G2P) converter. Then a PBPE model is used to construct a PBPE symbol set. The multilingual ASR model, which uses the PBPE symbol set as its output units, is trained on the paired speech and PBPE-based transcriptions. At inference time, given an input test utterance, the ASR model outputs a recognized sequence of PBPE units. Finally, a phoneme-to-grapheme (P2G) converter reconstructs the output PBPE sequence into a grapheme word sequence.

2.1. G2P conversion

A Grapheme-to-Phoneme (G2P) model translates text into phoneme sequences. Conventionally, a G2P component is built with rule-based models. We first build G2P on a modified, linguist-transcribed IPA lexicon in which each phoneme is a single character, to ensure proper phoneme subword generation. Because some words are not covered by the pronunciation lexicon in this approach, we also resort to a Transformer-based G2P model, which is used only for sequences not covered by the lexicon.

2.2. Multilingual BPE training

2.2.1. Word-frequency Merging

WM is the most common BPE building approach for multilingual ASR. Training transcripts in all languages are combined to merge the word-frequency statistics across languages, and a single ordered set of BPE units is then learned from the merged statistics. The advantages of this approach are that it maintains a consistent mechanism between BPE learning and application, produces a unique segmentation for a word regardless of its language, and allows the vocabulary size to be customized easily.

2.2.2. PBPE units Merging

PM learns a PBPE unit set for each language separately and then merges all the PBPE units, preserving their relative order. Since some PBPE units are common to several languages and are merged into one multilingual symbol vocabulary, the vocabulary size is hard to customize. The advantage of PM is that it ensures the high-frequency monolingual symbols are contained in the multilingual symbol set.

2.2.3. Sentences Sampling

SS learns BPE splits on a concatenation of sentences sampled randomly from the monolingual corpora. Similar to [14], sentences are sampled according to a multinomial distribution with probabilities {s_i}_{i=1...N}, where

    s_i = q_i^β / Σ_{j=1}^{N} q_j^β,  with  q_i = n_i / Σ_{k=1}^{N} n_k,

where β controls the sampling of languages with different frequencies. We consider β = 0.5. Sampling with this distribution gives more tokens to low-resource languages and reduces the bias towards high-resource languages. In particular, it prevents words of low-resource languages from being split at the single-phone level.
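The sampling probabilities above can be sketched in a few lines of Python; the corpus sizes below are made-up illustrations, not the statistics of the paper's datasets:

```python
def sampling_probs(sentence_counts, beta=0.5):
    """Multinomial sampling weights s_i as in Sec. 2.2.3:
    q_i = n_i / sum_k n_k, then s_i = q_i**beta / sum_j q_j**beta."""
    total = sum(sentence_counts)
    q = [n / total for n in sentence_counts]
    z = sum(qi ** beta for qi in q)
    return [qi ** beta / z for qi in q]

# Toy corpus sizes (sentences per language), high- to low-resource:
counts = [900_000, 90_000, 10_000]
s = sampling_probs(counts)
# With beta = 0.5, the low-resource language is sampled well above its
# raw 1% share, and the high-resource language well below its 90% share.
```

With beta = 1 the distribution reduces to the raw corpus proportions; lowering beta flattens it towards uniform, which is why beta = 0.5 upweights low-resource languages.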
Table 1. Datasets: the source, the number of IPA phone tokens, and the size of the train, development and test sets in hours.

Language  Corpora      #Phones  Train  Dev   Test
German    CommonVoice  40       696.9  26.9  27.0
French    CommonVoice  57       642.0  25.5  26.1
Italian   CommonVoice  33       213.1  24.8  20.0
Kazakh    KSC          34       318.4  7.1   7.0
Uzbek     CommonVoice  32       65.5   13.9  17.4
Uyghur    CommonVoice  33       52.5   4.5   4.82

Table 2. Word error rate (WER) results (%) for Uyghur, Uzbek and Kazakh in the monolingual, bilingual and trilingual experiments.

Model  Exp.  Output Rep.  Vocab Size  Ug    Uz    Kz
Mono   O0    Char         35/30/42    10.4  16.8  17.3
       O1    Phone        35/34/34    11.0  15.8  17.6
       O2    BPE          150         7.1   15.2  16.3
       O3    PBPE-WM      150         7.2   15.1  16.2
Bi     B0    Char         55          7.5   14.8  -
       B1    Phone        38          7.5   14.2  -
       B2    BPE          150         6.4   13.4  -
       B3    PBPE-WM      150         6.4   14.8  -
       B4    PBPE-PM      213         4.8   12.6  -
       B5    PBPE-SS      213         4.4   12.1  -
Tri    T0    Char         95          7.2   14.3  16.5
       T1    Phone        46          5.6   12.4  15.3
       T2    BPE          292         5.3   12.3  16.4
       T3    PBPE-WM      292         4.9   11.8  14.4
       T4    PBPE-PM      292         4.7   11.7  14.9
       T5    PBPE-SS      292         4.9   11.7  14.5

Table 3. WER results (%) for German, French, Kazakh, Italian, Uzbek and Uyghur in the multilingual experiments.

Model  Exp.  Output Rep.  Vocab Size   De    Fr    Kz    It    Uz    Ug
Mono   O4    BPE          600/300/150  12.4  15.1  12.8  16.4  46.3  56.3
       O5    PBPE         600/300/150  12.8  15.7  13.1  17.6  51.4  63.0
Multi  M0    BPE          1577         12.8  15.8  11.6  12.6  17.6  13.9
       M1    PBPE-WM      1577         14.4  19.0  12.5  15.0  17.9  13.0
       M2    PBPE-PM      1577         13.9  18.1  11.9  14.5  16.7  11.2
       M3    PBPE-SS      1577         15.9  19.9  15.7  17.4  24.7  21.9
       M4    M3+LP3       1577         13.3  17.4  11.5  13.7  16.2  10.7
       M5    M3+LP4       1577         13.7  17.7  11.8  14.1  16.7  10.4

3. EXPERIMENT

3.1. Experiment dataset and setup

The experiments are conducted on two datasets, CommonVoice [15] and the Kazakh Speech Corpus (KSC) [16]. We use Kazakh, Uyghur and Uzbek to train a trilingual ASR system. The three Central Asian languages have mutually different character sets but share a lot in pronunciation. The Uyghur writing system uses the Arabic alphabet, while Kazakh uses the Cyrillic alphabet and Uzbek uses the Latin alphabet. The three languages belong to the same language family, but Kazakh belongs to a different language branch.

We add German, French and Italian data to train the multilingual models. Among these languages, Kazakh comes from the KSC and the other five languages come from CommonVoice. Detailed dataset descriptions are given in Table 1.

For the G2P model, as mentioned in Section 2.2, we train Transformer-based G2P models to predict phoneme sequences for uncovered words in the three European languages in our experiments. Because there is a unique correspondence between graphemes and phonemes in Uyghur, Kazakh and Uzbek, G2P conversion for these languages was achieved with a rule-based method. All the monolingual phones were mapped to IPA symbols, and the phones from the six languages were merged to create the universal phone set for multilingual training.

For the ASR modeling, we largely adopt the Conformer [17] from the ESPnet toolkit [18] for training the hybrid attention + CTC [19] based model. The E2E architecture was based on a Conformer network consisting of 12 encoder and 6 decoder blocks. The interpolation weight for the CTC objective was 0.3 and 0.5 for the training and decoding stages, respectively. For the Transformer module, we set the self-attention layers to 4 heads with 256-dimensional hidden states, and the feed-forward network dimension to 2,048. We trained all the models using the Noam optimizer with an initial learning rate of 10 and 25k warm-up steps. We set the dropout rate and label smoothing to 0.1. For data augmentation, we used the standard 3-way speed perturbation with factors of 0.9, 1.0 and 1.1, and spectral augmentation [20]. We extract 80-dimensional FBank features from audio (resampled to 16 kHz) as inputs to the acoustic model. A beam size of 10 is used for decoding. We report results on an averaged model constructed from the last ten checkpoints.

3.2. Results on three Central Asian languages

We first verify the effectiveness of PBPE-based modeling on the Central Asian trilingual ASR task. All the Uyghur and Uzbek data in the training set were used. For data balance, we randomly sampled 4k Kazakh utterances (64.84 hours) from KSC. The mini-batch size is set to 100, with 200 training epochs. We use an IPA lexicon from language experts for G2P and P2G conversion.

The upper part of Table 2 presents the monolingual ASR results, where O0, O1 and O2 correspond to the monolingual models using character-level, phone-level and character-based BPE representations, respectively. The systems using subwords (both character- and phone-based) outperform the character-level and phone-level systems. Note that PBPE achieves performance comparable to character-based BPE.

The middle part of Table 2 presents the bilingual ASR results, where the Uyghur and Uzbek data were used due to the close similarity between the two languages. In B4, the vocabulary size |V| = 213 is obtained by merging the monolingual PBPE sets with |V| = 150 for each language, which means 87 symbols are shared. For a fair comparison, in PM and SS we keep the vocabulary size fixed to 213. In B4 and B5, the PBPE setups with PM and SS achieve better ASR results than the phone, character and character-based BPE unit modeling schemes.

The lower part of Table 2 shows the trilingual results, where the Kazakh data is added. T0, T1 and T2 correspond to the trilingual models using character-level, phone-level and character-based BPE representations, respectively. T3, T4 and T5 correspond to PBPE representations using WM, PM and SS. As before, |V| = 292 is obtained by merging the monolingual PBPE sets with |V| = 150 for each language, which means 158 symbols are shared. The three PBPE setups demonstrate their advantages on different languages. In summary, for both the bilingual and trilingual experiments, PBPE achieves better speech recognition results.

3.3. Multilingual Results

We use all five CommonVoice datasets and the full KSC data for multilingual validation. We mark Uyghur and Uzbek as low-resource languages, Kazakh and Italian as mid-resource languages, and German and French as high-resource languages. In the monolingual experiments, we set the vocabulary sizes of the high-, mid- and low-resource languages to 600, 300 and 150, respectively. In the multilingual experiment M2, we merge the monolingual PBPE units of all six languages as described in Section 2.2.2 and obtain a vocabulary size of 1577, which means 523
Table 4. Symbol sharing rate (%) in multilingual symbol sets
Exp. De Fr Kz It Uz Ug
M0 69.4 70.1 20.2 65.6 48.3 5.8
M1 65.8 62.5 40.9 34.4 30.8 31.1
M2 63.4 64.0 37.2 34.0 30.0 30.7
M3 59.3 59.9 46.2 38.0 38.0 37.0
M4 63.5 66.5 46.2 42.0 41.8 39.5
M5 60.4 63.3 47.6 39.7 40.5 39.6
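Table 4's metric is not formally defined in this excerpt; one plausible reading is the fraction of a language's units that also occur in the unit set of at least one other language. A minimal sketch under that assumption, with invented toy unit sets rather than real PBPE vocabularies:

```python
def sharing_rate(units_by_lang):
    """Per-language symbol sharing rate (%): the fraction of a language's
    units that also appear in the unit set of at least one other language.
    (Assumed definition; toy illustration only.)"""
    rates = {}
    for lang, units in units_by_lang.items():
        # Union of every other language's unit set.
        others = set().union(*(u for l, u in units_by_lang.items() if l != lang))
        shared = sum(1 for u in units if u in others)
        rates[lang] = 100.0 * shared / len(units)
    return rates

# Invented unit sets for three hypothetical languages:
toy = {
    "kz": {"a", "t", "at", "ziy"},
    "ug": {"a", "t", "qa", "qi"},
    "uz": {"a", "o", "ta", "qi"},
}
rates = sharing_rate(toy)
```

Under this reading, a low sharing rate (such as Uyghur's 5.8% in M0) indicates that most of a language's units are private to it, so little cross-lingual parameter sharing happens at the output layer.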
6. REFERENCES

[1] Rico Sennrich, Barry Haddow, and Alexandra Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 1715–1725.
[2] Mike Schuster and Kaisuke Nakajima, “Japanese and Korean voice search,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 5149–5152.
[3] Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, and Najim Dehak, “That sounds familiar: An analysis of phonetic representations transfer across languages,” in Proc. Interspeech 2020, 2020, pp. 3705–3709.
[4] Vasileios Papadourakis, Markus Müller, Jing Liu, Athanasios Mouchtaris, and Maurizio Omologo, “Phonetically induced subwords for end-to-end speech recognition,” in Proc. Interspeech 2021, 2021, pp. 1992–1996.
[5] Hainan Xu, Shuoyang Ding, and Shinji Watanabe, “Improving end-to-end speech recognition with pronunciation-assisted sub-word modeling,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 7110–7114.
[6] Weiran Wang, Guangsen Wang, Aadyot Bhatnagar, Yingbo Zhou, Caiming Xiong, and Richard Socher, “An investigation of phone-based subword units for end-to-end speech recognition,” in Proc. Interspeech 2020, 2020, pp. 1778–1782.
[7] Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe, “GlobalPhone: A multilingual text & speech database in 20 languages,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 8126–8130.
[8] Steven Moran and Daniel McCloy, Eds., PHOIBLE 2.0, Max Planck Institute for the Science of Human History, 2019.
[9] Alan W Black, “CMU Wilderness multilingual speech dataset,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5971–5975.
[10] Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays, “Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4225–4229.
[11] Sevinj Yolchuyeva, Géza Németh, and Bálint Gyires-Tóth, “Transformer based grapheme-to-phoneme conversion,” in Proc. Interspeech 2019, 2019, pp. 2095–2099.
[12] Shubham Toshniwal and Karen Livescu, “Jointly learning to align and convert graphemes to phonemes with neural attention models,” in 2016 IEEE Spoken Language Technology Workshop (SLT), 2016, pp. 76–82.
[13] Taku Kudo and John Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71.
[14] Alexis Conneau and Guillaume Lample, “Cross-lingual language model pretraining,” in Advances in Neural Information Processing Systems, 2019.
[15] Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber, “Common Voice: A massively-multilingual speech corpus,” in Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 4218–4222.
[16] Yerbolat Khassanov, Saida Mussakhojayeva, Almas Mirzakhmetov, Alen Adiyev, Mukhamet Nurpeiissov, and Huseyin Atakan Varol, “A crowdsourced open-source Kazakh speech corpus and initial speech recognition baseline,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, 2021, pp. 697–706.
[17] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, “Conformer: Convolution-augmented Transformer for speech recognition,” in Proc. Interspeech 2020, 2020, pp. 5036–5040.
[18] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech 2018, 2018, pp. 2207–2211.
[19] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, pp. 1240–1253, 2017.
[20] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech 2019, 2019, pp. 2613–2617.
[21] Neeraj Gaur, Brian Farris, Parisa Haghani, Isabel Leal, Pedro J. Moreno, Manasa Prasad, Bhuvana Ramabhadran, and Yun Zhu, “Mixture of informed experts for multilingual speech recognition,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6234–6238.
[22] Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, and Junwen Bai, “Scaling end-to-end models for large-scale multilingual ASR,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 1011–1018.
[23] Zirui Wang and Yulia Tsvetkov, “Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models,” in Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[24] Jennifer Drexler and James Glass, “Subword regularization and beam search decoding for end-to-end automatic speech recognition,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6266–6270.