

INVESTIGATION INTO PHONE-BASED SUBWORD UNITS FOR MULTILINGUAL END-TO-END SPEECH RECOGNITION

Saierdaer Yusuyin¹, Hao Huang¹,³, Junhua Liu², Cong Liu²


¹School of Information Science and Engineering, Xinjiang University, Urumqi, China
²iFlyTek, Hefei, China
³Xinjiang Key Laboratory of Multi-lingual Information Technology, Urumqi, China
sar dar@foxmail.com, huanghao@xju.edu.cn

This work was supported by the National Key R&D Program of China (2020AAA0107902). Hao Huang is the corresponding author.

ABSTRACT

Multilingual automatic speech recognition (ASR) models with phones as modeling units have improved greatly in low-resource and similar-language scenarios, benefiting from shared representations across languages. Meanwhile, subwords have demonstrated their effectiveness for monolingual end-to-end recognition systems. In this paper, we investigate the use of phone-based subwords, specifically Byte Pair Encoding (BPE), as modeling units for multilingual end-to-end speech recognition. To explore the possibilities of phone-based BPE (PBPE) for multilingual ASR, we first apply three types of multilingual BPE training methods to similar low-resource languages of Central Asia. Then, by adding three high-resource European languages to the experiments, we analyze the degree of language sharing in similar and low-resource scenarios. Finally, we propose a method to adjust the bigram statistics in the BPE algorithm and show that the PBPE representation leads to accuracy improvements in multilingual scenarios. The experiments show that PBPE outperforms phone, character, and character-based BPE output representation units. In particular, the best PBPE model in the multilingual experiments achieves a 25% relative improvement on a low-resource language compared to a character-based BPE system.

Index Terms— Speech recognition, multilingual, subword, BPE, phone-based BPE

1. INTRODUCTION

Recent end-to-end (E2E) ASR has become predominant in both research and industry thanks to its modeling simplicity, compactness, and strong performance, which makes E2E neural models an attractive option for multilingual ASR. A major focus of multilingual ASR is to improve performance on low-resource languages, which benefit from the shared representation of similar languages. Meanwhile, the degree of representation sharing can be greatly improved by an appropriate output representation. One standard method for multilingual output representation is character-based subwords: Byte-Pair Encoding (BPE) [1] is based on a compression algorithm that replaces frequent bigrams with new tokens, thereby compressing the input. BPE variants include using bytes instead of characters as input or estimating unigram probabilities for the BPE subwords [2]. For multilingual E2E ASR, which is expected to benefit from shared phonetic representations, such spelling-based modeling units can be problematic, especially for languages where there is no consistent connection between word spelling and pronunciation. For example, it is not reasonable to share linguistic and phonetic information between Chinese and Japanese, even though they have many identical characters in spelling. A natural way to approach multilingual unit modeling is from a phonetic perspective, for example via the International Phonetic Alphabet (IPA), which facilitates increased representation sharing across languages and can lead to higher recognition accuracy [3].

[Fig. 1. Phone-level subword based multilingual ASR architecture.]

One intuitive approach is to combine the advantages of BPE subwords and phones. [4] proposes Phonetically Induced Subwords (PhIS) to generate a grapheme subword vocabulary that retains the properties of a phoneme subword vocabulary and can be used in a probabilistic tokenization framework. [5] proposes pronunciation-assisted subword modeling (PASM), which uses statistics collected from common correspondences between subword units and phonetic units to guide the segmentation process. [6] investigates the use of phone-based subwords as modeling units for English end-to-end speech recognition. However, these studies are limited to the monolingual scenario.

Inspired by the above, after extensive investigation of multilingual E2E ASR models with different types of output representations, including character-level, phone-level, and character-based BPE, we propose phone-based BPE (PBPE) for multilingual E2E ASR. We explore different PBPE configurations; speech recognition results on six languages show the effectiveness of PBPE over existing unit modeling methods. We first experiment on three Central Asian languages that belong to the same language family. Then, by adding three high-resource European languages, we extend the multilingual PBPE experiments to a more general scenario. Our contributions are as follows: (1) We compare the strengths and weaknesses of different output representations in trilingual use cases; (2) We investigate three types of PBPE training mechanisms for multilingual scenarios; (3) We propose a method to adjust the bigram statistics in the BPE algorithm, showing that the PBPE representation leads to accuracy improvements in multilingual scenarios; (4) We analyze different representations and demonstrate how to improve them for multilingual ASR.

To the best of our knowledge, we are the first to apply phone-based BPE to multilingual E2E ASR and to analyze its effect on language sharing and ASR performance. We would like to emphasize that, in addition to using a pronunciation dictionary for subword extraction and decoding, our method avoids the extra processing steps of conventional systems. On the other hand, large collections of pronunciations are readily accessible [7][8][9], and pronunciations of out-of-collection words can be constructed with grapheme-to-phoneme methods [10][11][12]; thus our approach maintains the simplicity of end-to-end methods.

2. MULTILINGUAL PHONE-BASED SUBWORD

The goal of training an E2E multilingual ASR model with phone-based subword units is to share phone representations across languages while exploiting the efficiency of BPE for better performance. This may improve the alignment of the embedding spaces across languages that share subwords with the same pronunciation. Figure 1 shows the stages of the architecture. First, text is converted to a phoneme sequence by a grapheme-to-phoneme (G2P) converter. Then the PBPE model is used to construct a PBPE symbol set. The multilingual ASR model that uses the PBPE symbol set as its output units is trained on the paired speech and PBPE-based transcriptions. At inference time, given an input test utterance, the ASR model outputs a recognized sequence of PBPE units. Finally, a phoneme-to-grapheme (P2G) converter reconstructs the output PBPE sequence into a grapheme word sequence.
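As a minimal sketch of this four-stage pipeline (all names here, such as g2p, bpe_model, asr_model, and p2g, are hypothetical placeholders and not the authors' implementation):

```python
# Minimal sketch of the PBPE pipeline described above; every name is an
# illustrative placeholder, not the authors' actual code.

def make_training_targets(text, g2p, bpe_model):
    """Training side: text -> phoneme sequence -> PBPE symbol sequence."""
    phones = g2p(text)               # grapheme-to-phoneme conversion (Sec. 2.1)
    return bpe_model.encode(phones)  # segment phones into PBPE units (Sec. 2.2)

def recognize(utterance, asr_model, p2g):
    """Inference side: speech -> PBPE symbols -> phones -> words."""
    pbpe_units = asr_model.decode(utterance)  # E2E model emits PBPE symbols
    phones = "".join(pbpe_units)              # undo the BPE merges
    return p2g(phones)                        # phoneme-to-grapheme (Sec. 2.3)
```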
2.1. G2P conversion

A grapheme-to-phoneme (G2P) model translates text data into phonemes. Conventionally, a G2P component is built with rule-based models. We first build G2P on a modified, linguist-transcribed IPA lexicon in which each phoneme is a single character, to ensure proper phoneme subword generation. To handle words that are not covered by the pronunciation lexicon, we also resort to a Transformer-based G2P model; this model is used only for sequences that are not covered by the lexicon.

2.2. Phone-based BPE model training
Phone-based subwords are employed as the multilingual modeling units; they are generated by the BPE algorithm [1]. First, the phoneme vocabulary is initialized as described in Section 2.1. Then, all symbol pairs are counted iteratively, and each occurrence of the most frequent pair ('C', 'D') is replaced with a new symbol 'CD'. A new symbol representing a phoneme bigram is created after each merging operation, so frequent phoneme pairs are eventually combined into a single symbol. We can then adjust the total number of merging operations to determine the final symbol vocabulary size.
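As a concrete illustration, the merge procedure can be sketched as follows (a simplified version of the standard BPE algorithm of [1] applied to phoneme sequences; not the authors' code):

```python
from collections import Counter

def learn_pbpe(phone_corpus, num_merges):
    """Simplified sketch of BPE [1] on phoneme sequences. Each sentence
    is a list of single-character phone symbols (see Section 2.1); the
    most frequent adjacent pair is merged repeatedly."""
    corpus = [list(seq) for seq in phone_corpus]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in corpus:
            counts.update(zip(seq, seq[1:]))      # count symbol bigrams
        if not counts:
            break
        (a, b), _ = counts.most_common(1)[0]      # most frequent pair (C, D)
        merges.append((a, b))
        for seq in corpus:                        # replace (C, D) by CD
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]        # greedy left-to-right merge
                i += 1
    return merges
```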
We propose a length penalty scheme to adjust the bigram statistics used by the BPE algorithm. We define LP_b as the length-penalized number of occurrences for bigram b:

$$\mathrm{LP}_b(\alpha, l, c) = \begin{cases} c & l \le N \\ (1 - \alpha)\,c & l > N \end{cases}$$

where α is the length penalty factor (0 ≤ α ≤ 1), l is the length of the bigram, c is the bigram count, and N is the cutoff point determining where to apply this penalty. The length penalty aims to penalize phone-based subwords that are longer than N, encouraging the BPE algorithm to generate more short subwords and boosting symbol sharing across languages.
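A sketch of this penalty applied to the pair statistics (how exactly the penalty is wired into the merge loop is our assumption; the defaults mirror the α = 0.99 and N = 3 used in the experiments):

```python
def penalized_count(pair, count, alpha=0.99, cutoff=3):
    """Length-penalized occurrence count LP_b for a candidate bigram.
    Since phones are single characters (Section 2.1), the length l of
    the would-be merged symbol is just the total string length."""
    l = len(pair[0]) + len(pair[1])
    return count if l <= cutoff else (1.0 - alpha) * count

# Inside the merge loop above, the pair to merge would then be chosen by
# penalized rather than raw count, e.g.:
#   a, b = max(counts, key=lambda p: penalized_count(p, counts[p]))
```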
Phoneme subwords are created by simply using the phoneme sequences as input to the BPE vocabulary construction algorithm of [13]. We describe three types of phone-based BPE segmentation methods: Word-frequency Merging (WM), PBPE units Merging (PM), and Sentences Sampling (SS).

2.2.1. Word-frequency Merging

WM is the most common BPE building approach for multilingual ASR. The training transcripts of all languages are combined to merge the word-frequency statistics across languages, and a single ordered set of BPE units is then learned from the merged statistics. The advantages of this approach are that it maintains a consistent mechanism between BPE learning and application, produces a unique segmentation for a given word regardless of its language, and allows the vocabulary size to be customized easily.

2.2.2. PBPE units Merging

PM learns a PBPE unit set for each language separately and then merges all the PBPE units while preserving their relative order. Since some PBPE units are common to several languages and are merged into one multilingual symbol vocabulary, the vocabulary size is hard to customize. The advantage of PM is that it guarantees the high-frequency monolingual symbols are contained in the multilingual symbol set.
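One way to realize this order-preserving merge is to interleave the per-language unit lists rank by rank and deduplicate (the exact interleaving order is our assumption; the paper only states that relative order is kept):

```python
from itertools import zip_longest

def merge_pbpe_sets(per_language_units):
    """PM sketch: interleave the per-language PBPE unit lists rank by
    rank and keep each unit once, so shared units collapse into a single
    entry while every list's internal order is preserved."""
    merged, seen = [], set()
    for rank_group in zip_longest(*per_language_units):
        for unit in rank_group:
            if unit is not None and unit not in seen:
                seen.add(unit)
                merged.append(unit)
    return merged
```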
2.2.3. Sentences Sampling

SS learns the segmentation on a concatenation of sentences sampled randomly from the monolingual corpora. Similar to [14], sentences are sampled according to a multinomial distribution with probabilities {s_i}_{i=1...N}, where

$$s_i = \frac{q_i^{\beta}}{\sum_{j=1}^{N} q_j^{\beta}} \qquad \text{with} \qquad q_i = \frac{n_i}{\sum_{k=1}^{N} n_k},$$

where β controls the sampling of languages with different frequencies. We consider β = 0.5. Sampling with this distribution gives more tokens to low-resource languages and reduces the bias towards high-resource languages; in particular, it prevents words of low-resource languages from being split at the single-phone level.
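In code, the sampling distribution is a direct transcription of the formula above:

```python
def sampling_probs(sentence_counts, beta=0.5):
    """Multinomial probabilities s_i over N languages: q_i is the
    empirical language frequency, and beta < 1 flattens the
    distribution in favor of low-resource languages."""
    total = sum(sentence_counts)
    q = [n / total for n in sentence_counts]
    z = sum(qi ** beta for qi in q)
    return [qi ** beta / z for qi in q]

# For example, with counts skewed 900/90/10 the raw frequencies
# [0.9, 0.09, 0.01] become roughly [0.70, 0.22, 0.07] at beta = 0.5.
```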
2.3. P2G conversion

There are a few challenges in decoding with PBPEs. First, it is necessary to convert the decoded phone sequence into a word sequence. Second, unlike the character case, where the spelling uniquely determines a word, a phoneme sequence can be mapped to multiple possible words (homophones); e.g., the French words "on" and "ont" both have the same pronunciation, consisting of the phoneme "Õ". For the first challenge, even though the PBPE algorithm can handle the OOV problem by keeping all the single phones in the symbol set, the ASR model may still decode a phone sequence that is not covered by the lexicon. To tackle this problem, we use the IPA pronunciation dictionary to map the phone sequence to a word sequence; out-of-collection phone sequences are converted into words with a Transformer-based P2G model. For the homophone problem, we add special word-disambiguation symbols to the labels (the phoneme inventory), whereby we can always uniquely discriminate between all words. We go through the pronunciation lexicon and collect all phoneme sequences that cannot be uniquely mapped to words. For example, for all French homophone sequences, we add special symbols #1 to #N: "on #8 → Õ", "ont #10 → Õ". By adding different disambiguation symbols for different languages, we prevent confusion after combining all the monolingual data for multilingual training. These word-disambiguation symbols allow for decoding without an external language model and also allow us to use our simple decoder implementation. They might also improve the performance of models that have the power to discriminate between words.
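A sketch of how such disambiguation symbols could be generated from a lexicon (the #k numbering scheme is illustrative; the paper does not specify how the indices are assigned):

```python
from collections import defaultdict

def add_disambiguation(lexicon):
    """Append a distinct #k symbol to every pronunciation shared by
    several words, so each phone sequence maps back to one word.
    `lexicon` maps word -> phone string, e.g. {"on": "Õ", "ont": "Õ"}."""
    by_pron = defaultdict(list)
    for word, pron in lexicon.items():
        by_pron[pron].append(word)
    out = {}
    for pron, words in by_pron.items():
        if len(words) == 1:
            out[words[0]] = pron
        else:
            for k, word in enumerate(words, start=1):
                out[word] = f"{pron} #{k}"      # e.g. "Õ #1", "Õ #2"
    return out
```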
3. EXPERIMENT

3.1. Experiment dataset and setup

The experiments are conducted on two datasets, CommonVoice [15] and the Kazakh Speech Corpus (KSC) [16]. In our experiments, we use Kazakh, Uyghur and Uzbek to train a trilingual ASR system. The three Central Asian languages have mutually different character sets but share a lot in pronunciation: the Uyghur writing system uses the Arabic alphabet, while Kazakh uses the Cyrillic alphabet and Uzbek uses the Latin alphabet. These three languages belong to the same language family, but Kazakh belongs to a different language branch. We then add German, French and Italian data to train the multilingual models. Among these languages, Kazakh comes from the KSC and the other five languages come from CommonVoice. Detailed database descriptions are shown in Table 1.

Table 1. Datasets: the source, the number of IPA phone tokens, and the size of the train, development and test sets in hours.

Language  Corpora      #Phones  Train  Dev   Test
German    CommonVoice  40       696.9  26.9  27.0
French    CommonVoice  57       642.0  25.5  26.1
Italian   CommonVoice  33       213.1  24.8  20.0
Kazakh    KSC          34       318.4  7.1   7.0
Uzbek     CommonVoice  32       65.5   13.9  17.4
Uyghur    CommonVoice  33       52.5   4.5   4.82

For the G2P model, as mentioned in Section 2.1, we train Transformer-based G2P models to predict the uncovered phoneme sequences for the three European languages in our experiments. Because there is a unique correspondence between graphemes and phonemes in Uyghur, Kazakh and Uzbek, G2P conversion for these languages is achieved with the rule-based method. All the monolingual phones were mapped to IPA symbols, and we merged the phones of the six languages to create the universal phone set for multilingual training.

For the ASR modeling, we largely adopt the Conformer [17] from the ESPnet toolkit [18] to train the hybrid attention + CTC [19] based model. The E2E architecture is based on a Conformer network consisting of 12 encoder and 6 decoder blocks. The interpolation weight for the CTC objective is 0.3 for training and 0.5 for decoding. For the Transformer module, we set the self-attention layers to 4 heads with 256-dimensional hidden states, and the feed-forward network dimension to 2,048. We trained all models using the Noam optimizer with an initial learning rate of 10 and 25k warm-up steps. We set the dropout rate and label smoothing to 0.1. For data augmentation, we used the standard 3-way speed perturbation with factors of 0.9, 1.0 and 1.1, together with spectral augmentation [20]. We extract 80-dimensional FBank features from audio (resampled to 16 kHz) as inputs to the acoustic model. A beam size of 10 is used for decoding. We report results on an average model constructed from the last ten checkpoints.
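For reference, the hyper-parameters listed above can be collected in one place (a plain summary with key names of our choosing, not an actual ESPnet configuration file):

```python
# Summary of the training setup described in the text (illustrative;
# these key names are ours, not ESPnet's).
config = {
    "architecture": "Conformer, hybrid attention + CTC",
    "encoder_blocks": 12,
    "decoder_blocks": 6,
    "ctc_weight": {"train": 0.3, "decode": 0.5},
    "attention_heads": 4,
    "hidden_dim": 256,
    "feedforward_dim": 2048,
    "optimizer": "Noam",
    "initial_lr": 10,
    "warmup_steps": 25000,
    "dropout": 0.1,
    "label_smoothing": 0.1,
    "speed_perturb": [0.9, 1.0, 1.1],   # plus SpecAugment [20]
    "features": "80-dim FBank from 16 kHz audio",
    "beam_size": 10,
    "model_averaging": "last 10 checkpoints",
}
```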
3.2. Results on three Central Asian languages

We first verify the effectiveness of PBPE-based modeling on the Central Asian trilingual ASR task. All the Uyghur and Uzbek training data were used. For data balance, we randomly sampled 4k Kazakh utterances (64.84 hours) from the KSC. The mini-batch size is set to 100, for 200 training epochs. We use an IPA lexicon from language experts for the G2P and P2G conversion.

Table 2. Word error rate (WER) results (%) for Uyghur, Uzbek and Kazakh in the monolingual, bilingual and trilingual experiments.

Model  Exp.  Output Rep.  Vocab Size  Ug    Uz    Kz
Mono   O0    Char         35/30/42    10.4  16.8  17.3
       O1    Phone        35/34/34    11.0  15.8  17.6
       O2    BPE          150         7.1   15.2  16.3
       O3    PBPE-WM      150         7.2   15.1  16.2
Bi     B0    Char         55          7.5   14.8  -
       B1    Phone        38          7.5   14.2  -
       B2    BPE          150         6.4   13.4  -
       B3    PBPE-WM      150         6.4   14.8  -
       B4    PBPE-PM      213         4.8   12.6  -
       B5    PBPE-SS      213         4.4   12.1  -
Tri    T0    Char         95          7.2   14.3  16.5
       T1    Phone        46          5.6   12.4  15.3
       T2    BPE          292         5.3   12.3  16.4
       T3    PBPE-WM      292         4.9   11.8  14.4
       T4    PBPE-PM      292         4.7   11.7  14.9
       T5    PBPE-SS      292         4.9   11.7  14.5

The upper part of Table 2 presents the monolingual ASR results, where O0, O1 and O2 correspond to the monolingual models using character-level, phone-level and character-based BPE representations, respectively. The systems using subwords (both character- and phone-based) outperform the character-level and phone-level systems. Note that PBPE achieves performance comparable to character-based BPE.

The middle part of Table 2 presents the bilingual ASR results, where the Uyghur and Uzbek data were used due to the close similarity between the two languages. In B4, the vocabulary size |V| = 213 is obtained by merging the monolingual PBPE sets with |V| = 150 for each language, which means 87 symbols are shared. For a fair comparison, in PM and SS we keep the vocabulary size fixed to 213. In B4 and B5, the PBPE setups with PM and SS achieve better ASR results than the phone, character and character-based BPE unit modeling schemes.

The lower part of Table 2 shows the trilingual results, where the Kazakh data is added. T0, T1 and T2 correspond to the trilingual models using character-level, phone-level and character-based BPE representations, respectively; T3, T4 and T5 correspond to PBPE representations using WM, PM and SS. As before, |V| = 292 is obtained by merging the monolingual PBPE sets with |V| = 150 for each language, which means 158 symbols are shared. The three PBPE setups demonstrate their advantages on different languages. In summary, for both the bilingual and trilingual experiments, PBPE achieves better speech recognition results.

3.3. Multilingual Results

We use all five CommonVoice datasets and the full KSC data for multilingual validation. We regard Uyghur and Uzbek as low-resource languages, Kazakh and Italian as mid-resource languages, and German and French as high-resource languages. In the monolingual experiments, we set the vocabulary sizes of the high-, mid- and low-resource languages to 600, 300 and 150, respectively. In the multilingual experiment M2, we merge the monolingual PBPE units of all six languages as described in Section 2.2.2, obtaining a vocabulary size of 1577, which means 523 symbols are shared across the six languages. For ease of comparison, we use the same vocabulary size in the other multilingual experiments. Considering the amount of data, we set the number of training epochs to 20.

Table 3. WER results (%) for German, French, Kazakh, Italian, Uzbek and Uyghur in the multilingual experiments.

Model  Exp.  Output Rep.  Vocab Size   De    Fr    Kz    It    Uz    Ug
Mono   O4    BPE          600/300/150  12.4  15.1  12.8  16.4  46.3  56.3
       O5    PBPE         600/300/150  12.8  15.7  13.1  17.6  51.4  63.0
Multi  M0    BPE          1577         12.8  15.8  11.6  12.6  17.6  13.9
       M1    PBPE-WM      1577         14.4  19.0  12.5  15.0  17.9  13.0
       M2    PBPE-PM      1577         13.9  18.1  11.9  14.5  16.7  11.2
       M3    PBPE-SS      1577         15.9  19.9  15.7  17.4  24.7  21.9
       M4    M3+LP3       1577         13.3  17.4  11.5  13.7  16.2  10.7
       M5    M3+LP4       1577         13.7  17.7  11.8  14.1  16.7  10.4

In the multilingual BPE experiment M0 in Table 3, the low-resource and mid-resource languages obtain large WER reductions, while the high-resource languages lose performance. Multilingual models suffer performance losses on high-resource or unrelated languages; the reasons are discussed in other articles [21][22][23]. In the multilingual PBPE experiment M2, the PM method gives the leading performance on all six languages. The WM and SS methods do not obtain the WER reductions seen in the trilingual experiments. By analyzing M3, we find that sentence sampling increases the amount of low-resource language text used to train the PBPE segmentation model, so that longer subwords (even whole words) appear in the vocabulary set. Longer subwords shared across languages may increase confusion, because pronunciation differences start to widen. Using the length penalty discussed in Section 2.2, we can shrink this gap with a length penalty factor of 0.99. For the length penalty, we choose the cutoffs N = 3 and N = 4 on top of the SS-based PBPE experiment. In the multilingual PBPE experiments M4 and M5 in Table 3, compared to character-based BPE, PBPE achieves large WER reductions on all six languages. The penalty scheme discourages the BPE algorithm from generating longer phonetic symbols, encourages shorter phonetic symbols, and increases symbol sharing across languages.

Table 4. Symbol sharing rate (%) in multilingual symbol sets.

Exp.  De    Fr    Kz    It    Uz    Ug
M0    69.4  70.1  20.2  65.6  48.3  5.8
M1    65.8  62.5  40.9  34.4  30.8  31.1
M2    63.4  64.0  37.2  34.0  30.0  30.7
M3    59.3  59.9  46.2  38.0  38.0  37.0
M4    63.5  66.5  46.2  42.0  41.8  39.5
M5    60.4  63.3  47.6  39.7  40.5  39.6
4. ANALYSIS

4.1. Symbol sharing across languages

One motivation for using PBPE is to allow more symbols to be shared in the multilingual scenario. The symbol sharing rate of a multilingual model is measured on the symbol sets: for each language, it is defined as the ratio of the number of symbols existing in both that language's monolingual symbol set and the combined multilingual symbol set to the number of symbols in the combined multilingual symbol set.
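Under this definition, the rate can be computed per language as follows (a direct transcription of the definition; the argument names are ours):

```python
def symbol_sharing_rate(mono_symbols, multi_symbols):
    """Sharing rate (%) of one language against the combined set, as
    defined above; both arguments are Python sets of symbols."""
    return 100.0 * len(mono_symbols & multi_symbols) / len(multi_symbols)
```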
As shown in Table 4, in the multilingual experiment M0, 48.3% of the symbols are shared between Uzbek and the multilingual grapheme set; having the same writing system contributes to this high symbol sharing rate. Meanwhile, Uyghur has only a 5.8% symbol sharing rate because its writing system differs from those of the other five languages. In the multilingual experiment M2, benefiting from the unified IPA phones and the PBPE representation, Kazakh and Uyghur show large increases in symbol sharing rate. The effect of the increased symbol sharing rate is, as observed in Table 3, a corresponding WER reduction. In the multilingual experiment M3, it is important to note that the SS method increases the number of tokens associated with low-resource languages and alleviates the bias towards high-resource languages. However, the high symbol sharing rate does not bring a corresponding performance improvement, because of the longer subwords the SS method generates for low-resource languages. In the multilingual experiments M4 and M5, compared to M3, the length penalty mechanism increases the symbol sharing rate for both low-resource and high-resource languages, which leads to performance improvements.

[Fig. 2. Phone-based BPE units distribution of the six languages.]

4.2. Subword Composition

For further analysis, Figure 2 shows a Venn diagram of the distribution of PBPE units over the six languages in multilingual experiment M4, illustrating how the six languages intersect with each other. 150 PBPE units are shared by all six languages. Uzbek and Uyghur have the smallest numbers of language-unique PBPE units, and many of their PBPE units are also shared by other languages. Note also that Uzbek and Uyghur have the smallest amounts of training data, as can be seen in Table 1. These facts may explain the significant gains for Uzbek and Uyghur from multilingual training. As the low-resource languages in our experiments, Uzbek shares 355 phones with Kazakh, a similar language belonging to the same language family, and Uzbek and French share 386 phones even though they differ greatly in linguistic form. In both similar and non-similar scenarios, low-resource languages can benefit substantially from shared PBPE units, which leads to performance improvements.

5. CONCLUSION

We advocate the use of phone-based BPE (PBPE), i.e., phone-based subwords, in multilingual end-to-end ASR, and experiment with three types of phone-based BPE training methods in the multilingual scenario. In the trilingual experiments, we compare different output representations in a similar-language use case and find that subword-based systems lead in performance, with phone-based methods achieving considerable gains compared to character-based BPE. In the multilingual experiments, we verify phone-based BPE by adding three European languages that belong to a different language family; the experiments show that the WM method leads in performance for low-resource languages. We propose a penalty mechanism, and the resulting PBPE-based multilingual system outperforms the baseline multilingual system using character-based BPE. In our analysis, we find that PBPE yields a high language sharing rate in both similar and non-similar scenarios. In particular, the best PBPE model in the multilingual experiments achieves a 25% relative improvement on a low-resource language compared to the character-based BPE system. In summary, the phone-based BPE method provides an effective approach to multilingual speech recognition. Promising directions include exploring other types of subwords than BPE, and incorporating subword regularization [24], which has been shown to improve phone-based subword systems.

6. REFERENCES

[1] Rico Sennrich, Barry Haddow, and Alexandra Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 1715–1725.
[2] Mike Schuster and Kaisuke Nakajima, "Japanese and Korean voice search," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 5149–5152.
[3] Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, and Najim Dehak, "That sounds familiar: an analysis of phonetic representations transfer across languages," in Proceedings of Interspeech 2020, 2020, pp. 3705–3709.
[4] Vasileios Papadourakis, Markus Müller, Jing Liu, Athanasios Mouchtaris, and Maurizio Omologo, "Phonetically induced subwords for end-to-end speech recognition," in Proceedings of Interspeech 2021, 2021, pp. 1992–1996.
[5] Hainan Xu, Shuoyang Ding, and Shinji Watanabe, "Improving end-to-end speech recognition with pronunciation-assisted sub-word modeling," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 7110–7114.
[6] Weiran Wang, Guangsen Wang, Aadyot Bhatnagar, Yingbo Zhou, Caiming Xiong, and Richard Socher, "An investigation of phone-based subword units for end-to-end speech recognition," in Proceedings of Interspeech 2020, 2020, pp. 1778–1782.
[7] Tanja Schultz, Ngoc Thang Vu, and Tim Schlippe, "GlobalPhone: A multilingual text & speech database in 20 languages," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8126–8130.
[8] Steven Moran and Daniel McCloy, Eds., PHOIBLE 2.0, Max Planck Institute for the Science of Human History, 2019.
[9] Alan W Black, "CMU Wilderness multilingual speech dataset," in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5971–5975.
[10] Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays, "Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4225–4229.
[11] Sevinj Yolchuyeva, Géza Németh, and Bálint Gyires-Tóth, "Transformer based grapheme-to-phoneme conversion," in Proceedings of Interspeech 2019, 2019, pp. 2095–2099.
[12] Shubham Toshniwal and Karen Livescu, "Jointly learning to align and convert graphemes to phonemes with neural attention models," in 2016 IEEE Spoken Language Technology Workshop (SLT), 2016, pp. 76–82.
[13] Taku Kudo and John Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66–71.
[14] Alexis Conneau and Guillaume Lample, "Cross-lingual language model pretraining," in Advances in Neural Information Processing Systems, 2019.
[15] Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber, "Common Voice: A massively-multilingual speech corpus," in Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 4218–4222.
[16] Yerbolat Khassanov, Saida Mussakhojayeva, Almas Mirzakhmetov, Alen Adiyev, Mukhamet Nurpeiissov, and Huseyin Atakan Varol, "A crowdsourced open-source Kazakh speech corpus and initial speech recognition baseline," in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, 2021, pp. 697–706.
[17] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, "Conformer: Convolution-augmented Transformer for speech recognition," in Proceedings of Interspeech 2020, 2020, pp. 5036–5040.
[18] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-end speech processing toolkit," in Proceedings of Interspeech 2018, 2018, pp. 2207–2211.
[19] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, pp. 1240–1253, 2017.
[20] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proceedings of Interspeech 2019, 2019, pp. 2613–2617.
[21] Neeraj Gaur, Brian Farris, Parisa Haghani, Isabel Leal, Pedro J. Moreno, Manasa Prasad, Bhuvana Ramabhadran, and Yun Zhu, "Mixture of informed experts for multilingual speech recognition," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6234–6238.
[22] Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, and Junwen Bai, "Scaling end-to-end models for large-scale multilingual ASR," in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 1011–1018.
[23] Zirui Wang and Yulia Tsvetkov, "Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models," in Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[24] Jennifer Drexler and James Glass, "Subword regularization and beam search decoding for end-to-end automatic speech recognition," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6266–6270.
