2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

October 11-14, 2020. Toronto, Canada

Sub-word Based End-to-End Speech Recognition for an Under-Resourced Language: Amharic
Nirayo Hailu Gebreegziabher
Faculty of Computer Science, Data and Knowledge Engineering Group
Otto-von-Guericke University of Magdeburg, Magdeburg, Germany
nirayo.hailugebreegziabher@ovgu.de

Andreas Nürnberger
Faculty of Computer Science, Data and Knowledge Engineering Group
Otto-von-Guericke University of Magdeburg, Magdeburg, Germany
andreas.nuernberger@ovgu.de

Abstract—In this work, we focus on end-to-end speech recognition for a less-resourced language, Amharic. The result can be integrated with other tasks such as spoken content retrieval. We explored three models, built from Convolutional Neural Networks, Recurrent Neural Networks, and Connectionist Temporal Classification, towards end-to-end speech recognition for a less-resourced language. Further, we studied the possibility of having an end-to-end system with 1-best output while keeping the network parameters and computational resources minimal. The paper gives attention to finding a suitable sub-lexical unit for the Amharic end-to-end speech recognition system that can also be used as an audio indexing unit. We present the first results comparing grapheme-, phoneme-, and syllable-based end-to-end speech recognition systems for our target language. The models are evaluated on approximately 52 hours of Amharic speech corpus containing read speech, audiobooks, and multi-genre radio programs. On the test set, we report a character error rate (CER) of 19.21% and a syllable error rate (SER) of 39.98% for a syllable-based end-to-end model without an integrated lexicon or language model.

Index Terms—Speech Recognition, Phoneme, Grapheme, End-to-End Model, Syllable Units.

I. Introduction

To build enhanced automatic speech recognition (ASR) systems, researchers have long relied on the hidden Markov model and Gaussian mixture model (HMM-GMM) paradigm for acoustic modeling. Recently, end-to-end architectures covering acoustic, pronunciation, and language models in one model have become increasingly popular and an alternative to the hybrid architecture [1], [2].

The hybrid model, a Deep Neural Network (DNN) integrated with the HMM, has shown significant gains over plain HMM models [3]. We investigated the hybrid model on the same corpus in our previous work [4]. Despite this progress, building a state-of-the-art ASR system remains a complicated and expertise-intensive task, especially for less-resourced and morphologically rich languages.

In ASR systems, phone-based acoustic models (AMs) and word-based lexical and language models are usually used. However, phone-based AMs are not efficient at modeling long-term temporal dependencies, and using words in lexical and language models leads to an out-of-vocabulary (OOV) problem and high language model (LM) perplexities, which is a significant challenge for morphologically rich languages. Amharic is a morphologically rich language with a high morpheme-per-word ratio. Hence, it is challenging to develop a word-based lexicon and LM. Developing ASR systems for morphologically rich languages requires selecting a suitable sub-lexical unit to represent words in the pronunciation dictionary.

To avoid the need for a GMM and to ease the complexity of the approach, researchers have proposed different techniques, such as introducing a flat start [5]. However, the complexity of the approach could not be removed, since even with a flat start the system requires iterative procedures such as generating forced alignments and decision trees. In another research direction, researchers have focused on end-to-end ASR, which aims to model the mapping between speech and labels directly, with no intermediate components. Graves et al. [6] introduced the Connectionist Temporal Classification (CTC) objective function to infer the alignment between speech and label sequences automatically.

The performance of end-to-end systems has mostly been evaluated on well-known corpora of high-resource languages such as English and Mandarin. In this work, we investigated the use of CTC, Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs) towards an end-to-end ASR system for the less-resourced and morphologically rich language Amharic. We compared grapheme-, phoneme-, and sub-word-based end-to-end ASR systems for Amharic. This paper presents the first results using end-to-end techniques for speech recognition on the less-resourced language Amharic, comparing sub-word units.

The remainder of the paper is organized as follows: Section II describes essential aspects of the Amharic language. Section III discusses end-to-end speech recognition systems. In Section IV, the model architectures are described. Section V explains the dataset used. Section VI presents the experiments on the corpus, and the results are discussed in Section VII. The last section, Section VIII, provides conclusions and highlights future work.

*Supported by DAAD and MoSHE

II. The Amharic Language

Amharic is the official language spoken in Ethiopia. There are over 25.6 million users, according to Ethnologue¹. Its orthography represents consonant-vowel sequences. The language has seven vowels and 33 primary characters, each having seven forms for the consonant-vowel combinations (33 x 7 = 231); with additional characters there are 274 distinct graphemes. The primary difficulties in solving Amharic speech recognition are the limited availability of corpora and the complex morphological nature of the language. There have been efforts to develop both morpheme-based [7] and syllable-based [4], [8] ASR systems to overcome language-specific issues.

¹https://www.ethnologue.com/language/amh

Fig. 1. A waveform and spectrogram for the Amharic word ትምህርት /tɨmɨhɨrt/.

A. Amharic Morphology

The morphological richness of a language increases the size of the lexicon in speech recognition. Since Amharic is morphologically rich, the same concern applies to the language. A single Amharic word can give rise to a large number of morphologically inflected words. For example, the root word ጅምር /dʒɨmɨr/, meaning "started", can provide inflected words such as ጀመረች /dʒɛmɛrɛtʃ "she started"/, አስጀመረች /ʔsɨdʒɛmɛrɛtʃ "she made it start"/, አስጀመሩ /ʔsɨdʒɛmɛruɲ "they made it start"/, ጀመርኩኝ /dʒɛmɛrɨkuɲ "I started"/, ጅማሬ /dʒɨmäre "beginning"/, ሲጀመር /sidʒɛmɛr "when it started"/, etc. Using morphemes as a sub-word unit for an Amharic speech recognition system is shown in [7]. However, the out-of-vocabulary morpheme problem persists. Besides, it is practically difficult to use an Amharic rule-based morphological analyzer like HornMorpho [9] for speech recognition purposes. Therefore, researchers usually use data-driven approaches for automatically segmenting words into morphemes. Such methods still require a large amount of carefully designed training data. Therefore, in this paper, we focus on syllables as a modeling unit.

B. Amharic Syllabification

Syllabification is the process of segmenting words into syllables. A syllable is a unit of sound composed of a central peak of sonority, usually a vowel (V), and the consonants (C) that cluster around this central peak. A syllable can be described by a template such as the consonant-vowel-consonant (CVC) sequence.

Amharic is a syllabic language in which every grapheme represents a consonant-vowel combination. However, not all syllables in Amharic satisfy the CV sequence represented by the graphemes. Amharic syllables can be (C)V(C)(C) [10], including possible consonant clusters and gemination. Amharic orthography shows neither the epenthetic vowel nor geminated consonants, which makes it challenging to perform syllabification simply by following the templates. For instance, the word ትምህርት /tmhrt/, meaning "education", does not show any vowel when transliterated to IPA. However, the acoustic evidence shows there is an epenthetic vowel /ɨ/ (see Fig. 1), giving /tɨmɨhɨrt/. Gemination is a typical characteristic of the Amharic language, which affects the semantics and the syllabification of a given word. For instance, the word ይመታል /jɨmɛtäll/, meaning "he hits", changes its semantics when geminated: ይመታል /jɨmmɛttäll/ means "he will be hit". The automatic handling of gemination is not well addressed for the Amharic language.

A novel syllabification algorithm for Amharic has been presented in [11]. Acoustic evidence, the maximum onset principle, and the sonority hierarchy have been used to develop a rule-based syllabification algorithm. All six Amharic syllable templates are also considered. The algorithm handles gemination and the unpredictable nature of the Amharic epenthetic vowel. For our experiments, the algorithm has been re-implemented in Python with minor improvements and, after testing, is used to prepare a syllable-based text corpus.
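To illustrate the idea, the following is a simplified Python sketch of CV-template syllabification over an IPA-like transliteration; it is not the algorithm of [11], and the vowel inventory and epenthesis rule are assumptions made only for demonstration:

```python
# Simplified CV-template syllabifier for illustration only; it is not the
# rule-based algorithm of [11]. Vowel set and epenthesis rule are assumptions.
VOWELS = set("aeiouɛäɨ")

def insert_epenthesis(phones):
    """Insert the epenthetic vowel /ɨ/ between two consonants,
    except before a word-final consonant (so 'tmhrt' -> 'tɨmɨhɨrt')."""
    out, n = [], len(phones)
    for i, p in enumerate(phones):
        out.append(p)
        nxt = phones[i + 1] if i + 1 < n else None
        if p not in VOWELS and nxt is not None and nxt not in VOWELS and i + 1 != n - 1:
            out.append("ɨ")
    return out

def syllabify(word):
    """Greedy (C)V(C)(C) syllabification: each syllable ends before the
    single consonant that forms the onset of the next vowel."""
    phones = insert_epenthesis(list(word))
    vowel_idx = [i for i, p in enumerate(phones) if p in VOWELS]
    syllables, start = [], 0
    for k, v in enumerate(vowel_idx):
        end = max(vowel_idx[k + 1] - 1, v + 1) if k + 1 < len(vowel_idx) else len(phones)
        syllables.append("".join(phones[start:end]))
        start = end
    return syllables

print(syllabify("tmhrt"))  # ['tɨ', 'mɨ', 'hɨrt'] under these simplified rules
```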
III. End-to-End Speech Recognition

The key idea behind end-to-end systems is to skip the need for precise alignments between the acoustic frames and the target state label sequence and to learn to map acoustics to word sequences directly [12]. Different DNN layer architectures are used in ASR, such as CNNs and different types of RNNs, including Long Short-Term Memory (LSTM) [12] and the Gated Recurrent Unit (GRU) [13]. LSTM/GRU layers are useful for modeling sequence-to-sequence natural language processing problems; they can use their internal memory to process arbitrary sequences of inputs, which makes them applicable to tasks such as ASR.

Two fundamental methodologies have been utilized, both of which exploit the capacity of LSTM/GRU cells to remember information over an extended period and act on it. The first methodology utilizes LSTM/GRU cells in the CTC framework to predict letters (graphemes) as opposed to phonetic output [1], [2]. The other methodology uses attention mechanisms, such as Listen, Attend and Spell (LAS), which provide a weighted sum of the hidden activations of a learned encoding of the input frames as an additional input to the RNN at each time frame [14], [15].

CNNs are used to encode the information from the Mel Frequency Cepstral Coefficients (MFCCs) or the spectrogram [1], [16]. The encoded data is then passed to another convolutional layer or an LSTM/GRU layer [1], which finally gives the softmax probabilities of each grapheme or phoneme that can be placed in the transcription. Alternatively, the CNN encodes the
information from the input features and passes it to an RNN-based decoder, which decodes the data and estimates the probability of a phoneme or a grapheme from the speech signal. The rectified linear unit (ReLU) is the most widely used nonlinearity in training ASR models. The joint training of the feature stage and the classifier stage is performed using the back-propagation algorithm. Once the prediction matrix is obtained, the decoding can be done using a CTC decoding algorithm [6].
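As a minimal illustration of the simplest such decoder (best-path decoding; the experiments in Section VI use beam search instead), the prediction matrix can be collapsed as follows:

```python
import numpy as np

def ctc_greedy_decode(prob_matrix, alphabet, blank_id=0):
    """Best-path CTC decoding: take the most probable symbol per frame,
    collapse consecutive repeats, and drop the blank symbol."""
    best_path = np.argmax(prob_matrix, axis=1)
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank_id:
            decoded.append(alphabet[label])
        prev = label
    return "".join(decoded)

# Toy example with a 3-symbol alphabet ('-' is the blank) and 4 frames.
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
print(ctc_greedy_decode(probs, ["-", "ት", "ም"]))  # -> "ትም"
```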
To our knowledge, there is no published work so far ap-
plying the end-to-end ASR approach to the publicly available
Amharic speech corpus. The major challenge in building a
speech recognition system for less-resourced languages like
Amharic is the unavailability of well-prepared datasets and
language-specific issues such as the phonetic nature and
orthography of the language. To reduce the need for more
datasets, researchers have studied the use of language-adaptive [17]
and language-universal techniques [20]. For this
research, we have used approximately 52 hours of Amharic
speech corpus, which is part of our previous work [4].
Fig. 2. CNN+GRU+CTC based ASR model architecture.

IV. Model Architectures

The model architectures used in this paper are based on [1] and [2]. However, we focused on keeping the number of parameters minimal while still obtaining a good result.

In the first model, one 1D CNN layer at the early stage and five unidirectional GRU layers, each followed by batch normalization (BN) [18], constitute the network (see Fig. 2). The prediction from the GRU layers is fed into a fully connected layer, which uses a softmax to calculate a probability distribution matrix. The network is trained to receive 13-dimensional MFCCs and generate Amharic graphemes, phonemes, or syllables [11]. The GRU is selected since experiments showed that GRU and LSTM reach similar accuracy for the same number of layers, while the number of trainable parameters is higher for the LSTM than for the GRU, given the same number of layers and hidden units.

TABLE I
Comparison of LSTM and GRU Model Parameters.

Number of Layers    Trainable Parameters
5 GRU               ≈1.8 M
5 LSTM              ≈2.2 M
7 GRU               ≈2.6 M
7 LSTM               ≈3.6 M

As shown in Table I, there are more trainable parameters for the LSTM even with only 250 hidden units per layer, which makes training computationally expensive. GRUs are faster to train and less likely to diverge, as also shown in [1]. In this work, we have used the standard GRU with a ReLU activation function (Equation (4)). In the standard GRU, the hidden state h_t is computed as given in Equations (1)-(5) [12]:

z_t = σ(x_t W_z + h_{t−1} U_z + b_z)            (1)
r_t = σ(x_t W_r + h_{t−1} U_r + b_r)            (2)
h̃_t = tanh(x_t W + (r_t ⊙ h_{t−1}) U + b)       (3)
h̃_t = relu(x_t W + (r_t ⊙ h_{t−1}) U + b)       (4)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t           (5)

where the matrices W_z, W_r, U_z, U_r and the vectors b_z, b_r, b are model parameters, σ is the sigmoid function, r is the reset gate, and z is the update gate. ⊙ denotes the Hadamard product (element-wise multiplication). W is the weight matrix connecting the inputs to the hidden layer, and U is the recurrent weight matrix connecting the previous hidden state to the current one. Intuitively, the reset gate determines how to combine the new input with the previous memory, and the update gate defines how much of the previous memory to keep.
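A minimal NumPy sketch of one such GRU step, following Equations (1)-(5) with the ReLU candidate of Equation (4); the shapes and the random initialization are illustrative assumptions only:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p, use_relu=True):
    """One GRU time step following Eqs. (1)-(5)."""
    z = sigmoid(x_t @ p["Wz"] + h_prev @ p["Uz"] + p["bz"])        # Eq. (1)
    r = sigmoid(x_t @ p["Wr"] + h_prev @ p["Ur"] + p["br"])        # Eq. (2)
    pre = x_t @ p["W"] + (r * h_prev) @ p["U"] + p["b"]
    h_cand = np.maximum(pre, 0.0) if use_relu else np.tanh(pre)    # Eq. (4) / (3)
    return (1.0 - z) * h_prev + z * h_cand                         # Eq. (5)

# Illustrative dimensions: 13-dim MFCC input, 250 hidden units as in Table I.
rng = np.random.default_rng(0)
in_dim, hid = 13, 250
p = {k: rng.normal(scale=0.1, size=(in_dim, hid)) for k in ("Wz", "Wr", "W")}
p.update({k: rng.normal(scale=0.1, size=(hid, hid)) for k in ("Uz", "Ur", "U")})
p.update({k: np.zeros(hid) for k in ("bz", "br", "b")})
h = gru_step(rng.normal(size=in_dim), np.zeros(hid), p)
print(h.shape)  # (250,)
```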
A single utterance x_i and its label y_i are sampled from a training set X = {(x_1, y_1), (x_2, y_2), …}. Each utterance x_i is a time series of length T_i, where every time slice is a vector of audio features x_i^(t), t = 0, …, T_i − 1. The GRU network tries to convert an input sequence x_i into a final transcription y_i. The outputs of the network are the Amharic graphemes or phonemes. At each output time step t, the network makes a prediction over characters, p(c_t | x), where c_t is either a character in the alphabet or the blank symbol; the blank is used to denote word or syllable boundaries. The model is trained using the CTC loss function given an input-output pair (x, y) and the current parameters of the network θ. The loss loss(x, y; θ) and its derivative with respect to the parameters of the network, ∇_θ loss(x, y; θ), are computed, and this derivative is then used to update the network parameters through the backpropagation-through-time algorithm.
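As a hedged sketch of this training step, assuming a TensorFlow/Keras setup (the paper does not state which framework was used), the loss and its gradient can be computed as follows:

```python
import tensorflow as tf

def ctc_train_step(model, optimizer, x, y, input_len, label_len):
    """One CTC training step: compute loss(x, y; θ) and apply ∇θ loss.
    x: (batch, time, features), y: (batch, max_label_len) integer labels,
    input_len / label_len: (batch, 1) sequence lengths."""
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)   # (batch, time, symbols) softmax output
        loss = tf.reduce_mean(
            tf.keras.backend.ctc_batch_cost(y, y_pred, input_len, label_len))
    grads = tape.gradient(loss, model.trainable_variables)       # ∇θ loss(x, y; θ)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```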
In the second model, we similarly kept the network parameters of the first model. However, we changed the filter size
to 256 and the hidden unit size to 512. Also, we included an attention layer after the last GRU layer, and the unidirectional GRU cells were replaced with bidirectional GRUs (BiGRU), expecting an improvement in the prediction. The attention mechanism combined with the BiGRU allows the model to focus on certain parts of the MFCC input sequence while predicting certain parts of the character sequence. This approach has been shown to ease learning and improve performance [19], and attention mechanisms are expected to improve performance in many tasks when integrated with RNN networks [15].

In our third model, we designed a deep 1D CNN architecture, which gets rid of the GRUs of the previous models. RNNs require more training time and computational resources than CNNs and attention mechanisms, and CNNs with a dropout regularization technique can outperform both. In this model, we used ten 1D CNN layers with a filter size of 200 and a kernel size of 11, without max-pooling. A dropout layer [20] is used after each CNN layer with varying dropout rates.
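A rough sketch of this third model, assuming a Keras implementation (the framework, the padding, and the exact dropout schedule are not specified in the paper and are assumptions here):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_deep_cnn(num_symbols, feat_dim=13, blocks=10):
    """Deep 1D CNN sketch: ten Conv1D layers (200 filters, kernel size 11,
    no max-pooling), each followed by dropout, and a time-distributed
    softmax over the output symbols plus the CTC blank."""
    inputs = layers.Input(shape=(None, feat_dim))
    x = inputs
    for i in range(blocks):
        x = layers.Conv1D(filters=200, kernel_size=11, padding="same",
                          activation="relu")(x)
        x = layers.Dropout(rate=min(0.1 + 0.02 * i, 0.3))(x)  # assumed schedule
    outputs = layers.TimeDistributed(
        layers.Dense(num_symbols + 1, activation="softmax"))(x)
    return tf.keras.Model(inputs, outputs)
```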
V. Data Set

There were only about 20 hours of read-speech data available [21] for Amharic, which makes it challenging to develop a good speech recognizer. Collecting a substantial speech corpus for developing a speech recognizer is a costly and labor-intensive task. In our recent study [4], we prepared approximately 90 hours of speech corpus, which includes read speech, audiobooks, and multi-genre radio shows from publicly available archives. The dataset is provided for extensive ASR-related research [4]. In this study, we demonstrate the possibility of building an end-to-end ASR system with a relatively small subset of this dataset.

The selected subset contains a total of 52 hours of speech data and is partitioned into training, development, and test sets. 56% of the training set is augmented to expand and balance the training data. The partition contains 16,976 sentences in the training set, 359 in the development set, and 2,780 sentences in the test set. Phoneme- and syllable-based labels for each audio file are prepared in addition to the grapheme-based labels.

VI. Experiments

The experiments were carried out on the 52 hours of Amharic data mentioned in Section V. The dataset contains approximately 20k utterances in the training set and 3k utterances in the development and test sets. In this experiment, we used a subset of the dataset published in [4]. The acoustic features extracted for the experiments consist of 13-dimensional MFCCs with their first- and second-order derivatives. A window size of 20 ms with an overlap of 10 ms has been used in the estimation of the MFCCs.
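This feature extraction step can be sketched as follows, assuming the librosa library and a 16 kHz sampling rate (neither is stated in the paper):

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    """13 MFCCs with a 20 ms window and 10 ms shift, plus first- and
    second-order derivatives, stacked into a (frames, 39) matrix."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.020 * sr)                  # 20 ms analysis window
    hop = int(0.010 * sr)                    # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)      # first-order derivatives
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T
```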
All three models mentioned in Section IV were selected after testing several model architectures and tuning hyper-parameters. In the first model, a 1D CNN layer with a filter size of 200, a kernel size of 11, and a stride of 2 is used at the early stage, followed by five stacked GRU layers with 250 units each. The prediction of the final GRU layer is fed to the fully connected layer to calculate and output the probability distribution matrix in the softmax layer. The output matrix of the network and the corresponding ground-truth transcription text are input to the CTC loss function, and a beam search with a beam width of 100 is used in the CTC decoding algorithm. In the second model, the attention layer is included after the last BiGRU layer. The third model is the deep 1D CNN already discussed in Section IV; it is by far the fastest to train. An Adam optimizer with a learning rate of 0.0001, a beta1 value of 0.9, a beta2 value of 0.999, a clipnorm value of 1, and a clipvalue of 0.5 is used in all training runs. The experiments were conducted on a workstation with a GeForce RTX 2080 GPU (8 GB VRAM), 32 GB RAM, and a 16-core processor.
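Under the same Keras assumption as above (the framework and details such as padding are not given in the paper), the first model and its optimizer configuration can be sketched as:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model_one(num_symbols, feat_dim=13):
    """CNN+GRU+CTC sketch: Conv1D front end (200 filters, kernel 11, stride 2),
    five unidirectional ReLU GRU layers with batch normalization, and a
    time-distributed softmax over the symbols plus the CTC blank."""
    inputs = layers.Input(shape=(None, feat_dim))
    x = layers.Conv1D(filters=200, kernel_size=11, strides=2,
                      padding="same", activation="relu")(inputs)
    for _ in range(5):
        x = layers.GRU(250, return_sequences=True, activation="relu")(x)
        x = layers.BatchNormalization()(x)
    outputs = layers.TimeDistributed(
        layers.Dense(num_symbols + 1, activation="softmax"))(x)
    return tf.keras.Model(inputs, outputs)

model = build_model_one(num_symbols=231)  # illustrative label inventory size
# The CTC loss is applied with a custom training step (see the sketch in Sec. IV);
# only the optimizer settings from the text are configured here.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999,
                                     clipnorm=1.0, clipvalue=0.5)
```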

VII. Results and Discussion

As a baseline, we trained the grapheme-based end-to-end ASR system for all three models, evaluating the mapping from the MFCC input to Amharic graphemes. The first model showed a character error rate (CER) of 28.21% on the development set and 32.45% on the test set, as shown in Table II. Similarly, we trained networks that predict phonemes to evaluate phoneme sequence prediction, and we also trained syllable-based models. In all the model trainings, we kept the configuration of the network identical. The lowest CER achieved on the test set is 19.21%, obtained after training for 180 epochs with the third model, the syllable-based deep 1D CNN. Phoneme-based models have also shown better performance than the grapheme-based models.

TABLE II
CER and SER results of the models (%).

                                          Dev Set            Test Set
Model                        Token        CER     SER        CER     SER
CNN+GRU+TD+CTC               Graphemes    28.21   -          32.45   -
                             Phonemes     21.02   -          24.19   -
                             Syllables    24.67   54.56      26.86   55.56
CNN+BiGRU+Attention+TD+CTC   Graphemes    28.34   -          33.34   -
                             Phonemes     22.04   -          25.94   -
                             Syllables    32.16   67.16      34.65   68.46
CNN+TD+CTC                   Graphemes    34.88   -          35.97   -
                             Phonemes     19.18   -          21.28   -
                             Syllables    17.25   38.87      19.21   39.98
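For reference, both error rates are edit-distance based: the Levenshtein distance between hypothesis and reference token sequences, normalized by the reference length. A minimal sketch (characters as tokens give CER, syllables give SER):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        row = [i]
        for j, h in enumerate(hyp, 1):
            row.append(min(prev_row[j] + 1,                # deletion
                           row[j - 1] + 1,                 # insertion
                           prev_row[j - 1] + (r != h)))    # substitution
        prev_row = row
    return prev_row[-1]

def error_rate(ref_tokens, hyp_tokens):
    """CER if tokens are characters, SER if tokens are syllables (in %)."""
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)

print(error_rate(list("ትምህርት"), list("ትምርት")))  # 20.0
```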
In all experiments, adding layers demands more computing power and training time to converge, with no improvement in the decoding result. We also increased the number of neurons in each layer, but that did not improve the result either; instead, it took more time to converge. The CNN model took much less time to train and converge than the other two models. The syllable error rate achieved with the deep 1D CNN model is much lower (39.98% SER) than with the other two models. The results indicate that 1D CNN models can perform better with fewer computational resources and less training time. Adding the attention layer to the CNN+GRU+CTC model did not improve the CER in any of the training runs. The deep 1D CNN model could provide an even better result if trained for more epochs, since it did not show overfitting in any of the trainings.

Compared with the grapheme- and phoneme-based models, the syllable-based models have shown better CER. This suggests that syllables are a good modeling unit for Amharic speech recognition. Besides, for tasks such as spoken content retrieval (SCR), syllables could serve as an indexing unit, which makes them a good choice of modeling unit. The result achieved could be feasible in real-world applications such as spoken content retrieval, since it has been shown that an error rate of up to 50% still provides reasonable SCR accuracy [22].

VIII. Conclusions

In this paper, we have presented an end-to-end neural architecture for Amharic speech recognition using CTC-based training and decoding. We have shown the possibility of building an end-to-end system for a resource-scarce language with a minimal dataset and computing resources. Further, we have trained grapheme-, phoneme-, and syllable-based models to find a suitable sub-word unit for Amharic end-to-end speech recognition. The results can be taken as input to other speech-related tasks such as spoken content retrieval and as a baseline system for further investigation. By using data augmentation, integration of a language model, and noisy text correction methods [23], the recognition performance could be improved further. Hyperparameter tuning and training for more epochs could also improve performance, especially for the deep 1D CNN model. Future work includes the evaluation of additional data, testing syllable-based models on CNN-based architectures, and integration with a spoken content retrieval system.

IX. Acknowledgements

The authors would like to thank the DAAD and MoSHE for funding this research work. We thank Deutsche Welle for their cooperation in granting permission to use Amharic text and audio from their online archive.

References

[1] D. Amodei et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in 33rd International Conference on Machine Learning, ICML 2016, vol. 1. International Machine Learning Society (IMLS), 2016, pp. 312–321.
[2] Y. Zhang, W. Chan, and N. Jaitly, "Very deep convolutional networks for end-to-end speech recognition," in ICASSP 2017, IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2017, pp. 4845–4849.
[3] G. Hinton, L. Deng, D. Yu, G. Dahl, A. R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[4] N. H. Gebreegziabher and A. Nürnberger, "An Amharic syllable-based speech corpus for continuous speech recognition," in Lecture Notes in Computer Science, vol. 11816 LNAI. Springer, 2019, pp. 177–187.
[5] "Flat-start single-stage discriminatively trained HMM-based models for ASR," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 1949–1961, 2018.
[6] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ACM International Conference Proceeding Series, vol. 148, 2006, pp. 369–376.
[7] M. Y. Tachbelie and L. Besacier, "Using different acoustic, lexical and language modeling units for ASR of an under-resourced language - Amharic," Speech Communication, vol. 56, no. 1, pp. 181–194, 2014.
[8] S. T. Abate and W. Menzel, "Syllable-based speech recognition for Amharic," in Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources. Association for Computational Linguistics, 2007, pp. 33–40.
[9] M. Gasser, "HornMorpho: A system for morphological processing of Amharic, Oromo, and Tigrinya," in Conference on Human Language Technology for Development, 2011, pp. 94–99.
[10] M. Seyoum, "The syllable structure and syllabification in Amharic," Master of Philosophy in General Linguistics thesis, Trondheim, Norway, 2001.
[11] N. Hailu and S. Hailemariam, "Modeling improved syllabification algorithm for Amharic," in Proceedings of the International Conference on Management of Emergent Digital EcoSystems, MEDES 2012. New York, NY, USA: ACM Press, 2012, pp. 16–21.
[12] G. Zweig, C. Yu, J. Droppo, and A. Stolcke, "Advances in all-neural speech recognition," in ICASSP 2017, IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2017, pp. 4805–4809.
[13] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2014, pp. 1724–1734.
[14] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in ICASSP 2016, IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2016, pp. 4960–4964.
[15] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in ICASSP 2016, IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2016, pp. 4945–4949.
[16] "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, 2014.
[17] "Transfer learning for speech recognition on a budget," 2017, pp. 168–177.
[18] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in 32nd International Conference on Machine Learning, ICML 2015, vol. 1. International Machine Learning Society (IMLS), 2015, pp. 448–456.
[19] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[21] S. T. Abate, W. Menzel, and B. Tafila, "An Amharic speech corpus for large vocabulary continuous speech recognition," in 9th European Conference on Speech Communication and Technology, 2005, pp. 1601–1604.
[22] M. Larson and G. J. F. Jones, "Spoken content retrieval: A survey of techniques and technologies," Foundations and Trends in Information Retrieval, vol. 5, no. 4-5, pp. 235–422, 2012.
[23] A. Mekonnen Gezmu, A. Nürnberger, and B. Ephrem Seyoum, "Portable spelling corrector for a less-resourced language: Amharic," Tech. Rep., 2018.
