Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC
Wolaytta. There are also a lot of other Ethiopian languages (more than 70) that are in a similar or worse condition than Wolaytta with respect to language and speech resources. We wanted, therefore, to investigate the use of existing language resources to develop ASR for other Ethiopian languages. As a proof of concept, we investigated the development of Wolaytta (target language) ASR using Oromo (source language) training speech.

In this work, we present the results of different experiments we have conducted to explore the benefit we gain from the MLASR approach for two languages from two different language families. First, we conducted a cross-lingual ASR experiment where we decoded Wolaytta test speech using an Oromo acoustic model (developed using Oromo training speech) together with Wolaytta language and lexical models. Second, we developed Wolaytta ASR systems using various sizes of Wolaytta training speech (ranging from 30 minutes to 29 hours), with and without the whole amount of Oromo training speech (22.8 hours). We have also conducted experiments to see if the source language (Oromo) can benefit from sharing the training speech of the target language (Wolaytta) to improve the performance of the ASRSs.

In the following Section 1.1., we give a brief description of the application of deep neural networks to the development of ASRSs. In Section 2., we describe the languages considered in this paper. The speech corpora we used for the research are described in Section 3. The development of the monolingual ASR systems using different sizes of Wolaytta training speech, which are our baseline systems, and the results achieved by the use of the MLASR approach for Wolaytta using Oromo training speech are presented in Section 4. Finally, in Section 5., we draw conclusions and point out future directions.

1.1. Deep Neural Networks in ASR
Over the last 10 years, DNN methods for ASR have been developed and have outperformed the traditional Gaussian Mixture Model (HMM-GMM) approach. The major factors behind their superior performance are the availability of GPUs and the introduction of different types of neural network architectures, such as Convolutional Neural Networks (CNN) and, more recently, Time Delay Neural Networks (TDNN) and Factored TDNN (TDNNf).
Since 2009, DNNs have been widely used in automatic speech recognition, where they have brought dramatic improvements in performance. Numerous studies have shown that hybrid HMM-DNN systems outperform the dominant HMM-GMM systems on the same data (Hinton et al., 2012). Currently, TDNNs, also called one-dimensional Convolutional Neural Networks, are an efficient and well-performing neural network architecture for ASR (Peddinti et al., 2015). A TDNN has the ability to learn long-term temporal contexts. Moreover, by using singular value decomposition (SVD), the number of parameters in TDNN models can be reduced, which makes them inexpensive compared to RNNs. The factored form of TDNNs (TDNNf) (Povey et al., 2018) has a similar structure to the TDNN, but is trained from a random start with one of the two factors of each matrix constrained to be semi-orthogonal. TDNNf gives substantial improvement over TDNN and has been shown to be effective in under-resourced scenarios. We have used these state-of-the-art neural network architectures in the development of DNN-based ASR systems for the Ethiopian languages.

2. Oromo and Wolaytta
More than 80 languages are spoken in Ethiopia. Ethiopian languages are divided into four major language families: Semitic, Cushitic, Omotic and Nilo-Saharan. The Semitic language family is one of the most widespread language families (with more than 20 languages) in the country. Of these, Amharic (spoken by 29.3% of the total Ethiopian population) and Tigrigna (spoken by 5.9% of the total Ethiopian population) are the most widely spoken languages. The Cushitic language family also has a long list of (about 22) languages spoken in Ethiopia. Amongst them, Oromo is the most widely spoken language in the country (spoken by 33.8% of the total Ethiopian population). The Omotic family has a large number of (more than 30) languages spoken in Ethiopia, one of which is Wolaytta (spoken by 2.2% of the total Ethiopian population) (CSAE, 2010).
The Cushitic and Omotic language families use Latin script for writing. In both languages, current writers differentiate geminated from non-geminated consonants. Similarly, long and short vowels are indicated in their writing systems.
Having newly developed speech corpora (Abate et al., 2020a) for Oromo (a Cushitic language) and Wolaytta (an Omotic language), we have selected these languages to explore the application of the MLASR development approach in the ANN framework.

2.1. Phonology
Although they belong to different language families, Oromo and Wolaytta share several phonetic properties, including the use of long and short vowels. These languages have five similar vowels, and each of the vowels in both languages has long and short variants. While having their own inventories of consonants, Oromo and Wolaytta share a number of them (see Table 1). Of course, each of the languages also has its own consonants. For instance, the phones ñ and x are used in Oromo but not in Wolaytta, while the phone Z is used in Wolaytta but not in Oromo.
Almost all the consonants of these languages occur in both single and geminated forms. The other common phonetic feature of these languages is the use of tones, which makes both of them tonal languages. However, in this study we did not differentiate between vowels of different tones, since the writing system does not show the tones of the vowels and the pronunciation dictionaries for our study have been generated automatically from the text.

Language  | Consonants (IPA)                                              | Vowels (IPA)
Oromo     | b d â f g h j k k' l m n ñ p p' r s S t t' tS tS' Ã v w x z P | a e i o u a: e: i: o: u:
Wolaytta  | b d â f g h j k k' l m n p p' r s S t t' tS tS' Ã w z Z P     | a e i o u a: e: i: o: u:

Table 1: Oromo and Wolaytta phones
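The inventory overlap in Table 1 can be checked mechanically. The following sketch (our own illustration) encodes the consonant rows of the table as Python sets, using the symbols as printed:

```python
# Consonant inventories as printed in Table 1 (symbols as in the table).
OROMO = set("b d â f g h j k k' l m n ñ p p' r s S t t' tS tS' Ã v w x z P".split())
WOLAYTTA = set("b d â f g h j k k' l m n p p' r s S t t' tS tS' Ã w z Z P".split())

shared = OROMO & WOLAYTTA         # consonants the two languages have in common
oromo_only = OROMO - WOLAYTTA     # consonants found only in the Oromo row
wolaytta_only = WOLAYTTA - OROMO  # consonants found only in the Wolaytta row
```

On these rows, the two languages share 25 consonants; ñ, v and x appear only in the Oromo row and Z only in the Wolaytta row.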
2.2. Morphology
Reflecting the morphological nature of their language families, Oromo and Wolaytta are not as simple as English and not as complex as the Semitic languages. In both Oromo and Wolaytta, nominals are inflected for number, gender, case and definiteness, and verbs are inflected for person, number, gender, tense, aspect and mood (Griefenow-Mewis, 2001). Unlike the Semitic languages, which allow prefixing, Oromo and Wolaytta are suffixing languages. In these languages, words can be generated from stems recursively by adding suffixes only.

3. The Speech Corpora
It is known that the Ethiopian languages, especially Oromo and Wolaytta, are under-resourced. As a result, all of the previous works conducted towards the development of ASRSs for these languages are based on limited amounts of speech data. It is only recently that work on the development of four standard medium-sized read speech corpora (Abate et al., 2020a) has been conducted for four Ethiopian languages, including Oromo and Wolaytta. For a country like Ethiopia, with more than 80 languages, unless a technological solution is used it looks hopeless to have equivalent speech corpora for all its languages.
In this work, we have used the existing speech corpus of Oromo (Abate et al., 2020a) to find a solution for the development of an ASRS for an under-resourced language, Wolaytta. We considered Oromo as the source and Wolaytta as the target language, considering the fact that there are more previous works for Oromo, such as (Gelana, 2016; Gutu, 2016), than what we have for Wolaytta. We hope that our findings can be extended to solve the problems of the other Ethiopian languages that fall under the four different language families.

4. Multilingual ASR for Wolaytta
4.1. Development of ASR Systems for Wolaytta
Although the aim of our current work is to explore the development of MLASR for Wolaytta as a target language using Oromo training speech (as a source language), we have developed different monolingual GMM- and DNN-based ASRSs for Wolaytta using different sizes of the Wolaytta speech corpus for comparison purposes. The description of the procedures we followed is presented in sub-section 4.1.1.

4.1.1. Acoustic, Lexical and Language Models
To build reference AMs that use different sizes of training speech, we have split the Wolaytta training speech into 11 clusters: with 30 minutes, 1, 2, 4, 6, 8, 10, 15, 20, 25 and 29 (all) hours of speech. We randomly selected a roughly equal number of utterances from each speaker for each of these clusters. Each of them has been used to train a different AM.
All the AMs have been built in a similar fashion using the Kaldi ASR toolkit (Povey et al., 2011). We have built context-dependent HMM-GMM based AMs using 39-dimensional mel-frequency cepstral coefficients (MFCCs), to each of which cepstral mean and variance normalization (CMVN) is applied. The AM uses a fully-continuous 3-state left-to-right HMM. We then applied Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) feature transformations for each of the models. Then Speaker Adaptive Training (SAT) has been done using an offline transform, feature-space Maximum Likelihood Linear Regression (fMLLR). We tuned the number of states and Gaussians for the different sizes of training data.
To train the DNN-based AMs, we have used the best HMM-GMM models to obtain alignments, together with the same training speech used to train the HMM-GMM models. However, we applied a three-fold data augmentation (Ko et al., 2015) prior to the extraction of 40-dimensional MFCCs without derivatives, 3-dimensional pitch features and 100-dimensional i-vectors for speaker adaptation. The neural network architecture we used is Factored Time Delay Neural Networks with additional convolutional layers (CNN-TDNNf), according to the standard Kaldi WSJ recipe. The neural network has 15 hidden layers (6 CNN followed by 9 TDNNf) and a rank reduction layer. The TDNNf layers have 1024 units and 128 bottleneck units, except for the TDNNf layer immediately following the CNN layers, which has 256 bottleneck units.
The lists of word entries for both the training and decoding lexicons have been extracted from the training speech transcriptions of both the source and target languages. Exploiting the nature of the writing systems, which indicate geminated and non-geminated consonants as well as long and short vowels, we have generated the pronunciations of these words automatically. However, since tones are not indicated in the written form of either language, we did not consider tones in the current pronunciation dictionaries.
For the development of the LMs, we have used the text used in (Abate et al., 2020a). We have developed trigram LMs using the SRILM toolkit (Stolcke, 2002). The LMs are smoothed with the unmodified Kneser-Ney smoothing technique (Chen and Goodman, 1996) and made open by including a special unknown word token. LM probabilities are computed for the lexicon of the training transcription.

4.1.2. Evaluation Results
We have evaluated all AMs trained with the different sizes of training speech using the same test set (1:45 hours of speech recorded from four speakers who read a total of 578 utterances), pronunciation dictionary and language model. The performance of the systems is given in Figure 1. These results are our reference points or baselines for the results achieved by using only the source language, and by combining it with different amounts of the target language's training speech.
As we can observe from Figure 1, the WER obviously reduces with additional training speech for almost all the AMs. The DNN-based systems outperform the HMM-GMM-based ones regardless of the size of the training speech, except for 30 minutes. The DNN-based AMs have brought relative WER reductions that range from 9.03% (with 1 hour) to 31.45% (with all the training speech). The best system, developed using all the available training speech, has achieved a WER of 23.23% with the DNN-based AM.
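The cluster construction described in 4.1.1. (randomly drawing a roughly equal number of utterances per speaker for each target size) might be sketched as follows. The data layout and function name are our own illustration, not the authors' scripts:

```python
import random

def select_cluster(utts_by_speaker, target_hours, seed=0):
    """Draw a roughly speaker-balanced random subset of utterances
    whose total duration approximates target_hours.

    utts_by_speaker: {speaker_id: [(utt_id, duration_seconds), ...]}
    """
    rng = random.Random(seed)
    total_sec = sum(dur for utts in utts_by_speaker.values() for _, dur in utts)
    fraction = min(1.0, target_hours * 3600.0 / total_sec)
    cluster = []
    for speaker, utts in sorted(utts_by_speaker.items()):
        # take the same share from every speaker, at least one utterance
        n = min(len(utts), max(1, round(len(utts) * fraction)))
        cluster.extend(rng.sample(utts, n))
    return cluster
```

Calling this once per target size (0.5, 1, 2, ... 29 hours) would yield the 11 clusters used to train the reference AMs.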
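Because the orthographies mark gemination by doubling a consonant letter and vowel length by doubling a vowel letter, a pronunciation can be generated from the spelling alone, as described above. A minimal sketch; the grapheme table here is a small illustrative subset, not the full inventory of either language:

```python
import re

# Illustrative grapheme-to-phoneme table (multi-letter graphemes included).
G2P = {
    'sh': 'S', 'ch': 'tS', 'ph': "p'",
    'a': 'a', 'e': 'e', 'i': 'i', 'o': 'o', 'u': 'u',
    'b': 'b', 'd': 'd', 'k': 'k', 'l': 'l', 'm': 'm',
    'n': 'n', 'r': 'r', 's': 's', 't': 't', 'w': 'w', 'z': 'z',
}

def pronounce(word):
    """Map a word to phones; a doubled grapheme yields one long/geminate phone (':')."""
    # try longer graphemes first so 'sh' wins over 's' + 'h'
    alternatives = '|'.join(sorted(G2P, key=len, reverse=True))
    phones = []
    for match in re.finditer(f'({alternatives})\\1?', word.lower()):
        phone = G2P[match.group(1)]
        doubled = len(match.group(0)) > len(match.group(1))
        phones.append(phone + ':' if doubled else phone)
    return phones
```

For example, a doubled 'm' comes out as the geminate m: and a doubled vowel as its long counterpart, matching the notation of Table 1.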
Figure 1: Wolaytta WERs with different sizes of Wolaytta training speech (systems: tri1, LDA+MLLT, fMLLR, CNN-TDNNf; y-axis: Wolaytta WER in %; x-axis: 30 minutes to 29.7 hours)

4.2. Use of Oromo Speech for Wolaytta ASR
First, we decoded the Wolaytta evaluation test speech using a DNN-based Oromo AM (trained on all the training speech of the Oromo corpus), the Wolaytta pronunciation dictionary and the Wolaytta language model, and achieved a WER of 48.34%. For this purpose we needed to map the Wolaytta phones that are not found in Oromo to the nearest possible Oromo phones (see Table 2).

Wolaytta  | Oromo         | Remark
7 (P)     | hh (P)        | Same IPA
zh (Z)    | z (z)         | Different IPA
zz (z:)   | z (z) z (z)   | Double-to-single mapping
ssh (S:)  | sh (S) sh (S) | Double-to-single mapping

Table 2: Mapping of Wolaytta phones to the nearest Oromo phones

Figure 2: Wolaytta WERs with different sizes of Wolaytta training speech added to the whole training speech of Oromo

The results in Figure 2 show that performance improvement can be obtained by adding training speech from the target language. As we add more and more training speech from the target language, the improvement in performance reduces. A relative WER reduction of 32.77% has been achieved as a result of adding only 30 minutes of training speech from the target language. That means the WER we could achieve by using only the source language's training speech has been reduced from 48.34% to 32.5% by adding only 30 minutes of training speech of the target language, randomly selected from all the speakers (76) of the target language.

Our results also show that, instead of using only a small amount of monolingual training speech in the development of an ASRS, especially in the DNN framework, the use of speech data from other related languages brings performance improvement. We have presented this improvement in Figure 3, which compares the WERs of the ASRSs developed using Wolaytta training speech only with those of the ASRSs developed using different sizes of training speech from Wolaytta combined with all (22.8 hours of) the Oromo training speech. As can be seen from the Figure, by adding only 30 minutes of Wolaytta training speech to all of the Oromo training speech, we have achieved a relative WER reduction of 33.55%, and of 5.52% when 25 hours of Wolaytta training speech is added.

We have decoded the Oromo test set using the acoustic models (Wolaytta-only AMs and MLASR AMs) discussed in the previous sections, the Oromo pronunciation dictionary and the Oromo language model developed by (Abate et al., 2020a). The results presented in Figure 4 show that we have achieved a WER of 49.25% using the DNN-based AM de-[…] The performance of MLASR systems on the Oromo test set brought slight WER reductions compared to the best WER obtained from a system that is developed using Oromo training speech only. The relative WER reductions we have obtained range from 1.27% (gained from the addition of 10 hours of Wolaytta speech) to 3.31% (gained from the addition of 25 hours of Wolaytta speech). We could observe that adding 30 minutes to 8 hours of Wolaytta training speech has negatively affected Oromo ASR.
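The mapping of Table 2 amounts to a small substitution table applied to the Wolaytta pronunciations before decoding with the Oromo AM. A sketch, with phone symbols as in the tables and the function name our own; the 7 → hh row of Table 2 changes only the grapheme, so the phone P passes through unchanged:

```python
# Table 2 as a substitution table: Wolaytta phones missing from the Oromo
# inventory are rewritten as the nearest Oromo phone(s).
WOL_TO_ORO = {
    'Z':  ['z'],       # zh -> z: nearest phone, different IPA
    'z:': ['z', 'z'],  # geminate zz -> two single z (double-to-single mapping)
    'S:': ['S', 'S'],  # geminate ssh -> two single sh (double-to-single mapping)
}

def map_pronunciation(phones):
    """Rewrite a Wolaytta pronunciation into phones the Oromo AM covers."""
    mapped = []
    for p in phones:
        mapped.extend(WOL_TO_ORO.get(p, [p]))  # shared phones pass through
    return mapped
```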
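The relative WER reductions quoted throughout follow the usual definition: the fraction of the baseline system's errors that the improved system removes. For example, for the Oromo-only system improved by 30 minutes of Wolaytta speech (48.34% → 32.5%):

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return (baseline_wer - new_wer) / baseline_wer * 100.0

# Oromo-only AM on the Wolaytta test set vs. adding 30 minutes of Wolaytta speech
reduction = relative_wer_reduction(48.34, 32.5)
```

Rounded to two decimals this reproduces the 32.77% reported above.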
[Figure: Oromo WER in % for the tri1, LDA+MLLT, fMLLR and CNN-TDNNf systems]

Besacier, L., Barnard, E., Karpov, A., and Schultz, T. (2014). Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56:85–100.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., and Veselý, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December. IEEE Catalog No.: CFP11SRW-USB.
Povey, D., Cheng, G., Wang, Y., Li, K., Xu, H., Yarmohammadi, M., and Khudanpur, S. (2018). Semi-orthogonal low-rank matrix factorization for deep neural networks. In Interspeech, pages 3743–3747.
Schultz, T. and Waibel, A. (1998). Multilingual and crosslingual speech recognition. In Proc. DARPA Workshop on Broadcast News Transcription and Understanding, pages 259–262.
Schultz, T. and Waibel, A. (2001). Language-independent and language-adaptive acoustic modeling for speech recognition. Speech Communication, 35(1-2):31–51, August.
Schultz, T., Vu, N. T., and Schlippe, T. (2013). GlobalPhone: A multilingual text and speech database in 20 languages. In ICASSP.
Schultz, T. (2002). GlobalPhone: A multilingual speech and text database developed at Karlsruhe University. In John H. L. Hansen et al., editors, INTERSPEECH. ISCA.
Stolcke, A. (2002). SRILM – an extensible language modeling toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP) 2002, pages 901–904.
Tachbelie, M. Y., Abate, S. T., and Schultz, T. (2020a). Analysis of GlobalPhone and Ethiopian languages speech corpora for multilingual ASR. In LREC 2020.
Tachbelie, M. Y., Abulimiti, A., Abate, S. T., and Schultz, T. (2020b). DNN-based speech recognition for GlobalPhone languages. In ICASSP 2020.
Vu, N. T., Imseng, D., Povey, D., Motlíček, P., Schultz, T., and Bourlard, H. (2014). Multilingual deep neural network based acoustic modeling for rapid language adaptation. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7639–7643.
Weng, F., Bratt, H., Neumeyer, L., and Stolcke, A. (1997). A study of multilingual speech recognition. In EUROSPEECH.