
Investigating The Use Of Syllable Acoustic Units For Amharic Speech Recognition


Adey Edessa Dribssa, Martha Yifiru Tachbelie
School of Information Science,
Addis Ababa University
Addis Ababa, Ethiopia
Email: {adey.edessa, martha.yifiru}@aau.edu.et

Abstract— This study investigated the possibility of developing a large vocabulary continuous speech recognizer (LVCSR) for Amharic using the different syllable types V, CV, VC, CVC, VCC and CVCC found in the language as acoustic units. As longer length acoustic units, syllables are able to embed the spectral and temporal dependencies found in speech and are thus able to model it well. The recognizer was developed using the Hidden Markov Model as a modeling technique. The results of the experiments show that syllables are promising units for Amharic LVCSR provided that enough training data is available.

Keywords— Syllable Based ASR; Amharic LVCSR; Longer Length Acoustic Unit

I. INTRODUCTION
Automatic Speech Recognition (ASR) is the process of decoding the information conveyed by a speech signal into a string of words [4]. The resulting words can be used to perform various tasks such as controlling a machine, accessing a database, or producing a written form of the input speech.

The development of a Large Vocabulary Continuous Speech Recognition system for a language involves the selection of a set of suitable Lexical Units (LU) [5]. ASR systems may use words, phones, context dependent phones or tri-phones, phone like (PL) units and syllables to model the basic units of speech. Words are able to model contexts well. However, using word models in large vocabulary recognition requires a considerable amount of data to train the models adequately, and word models are thus impractical for LVCSR [4].

Sub-word acoustic units such as phones or context dependent phones are generally used in LVCSR as a solution. According to [1], phoneme substitution and reduction are well captured by tri-phones. However, co-articulation effects typically have a long time span, and the corresponding spectral and temporal dependencies are not easy to capture in tri-phones, which makes these sub-word units inefficient in modeling long term dependencies. These inherent restrictions raise the question of how the long-term spectral and temporal dependencies present in speech could be modeled in large vocabulary continuous speech recognition (LVCSR). One solution is the use of longer length sub-word units, particularly syllables, in LVCSR systems.

Various research works have been conducted in the area of speech recognition for Amharic using phones, tri-phones and CV syllables as acoustic units. These include a speech recognizer that recognizes a subset of consonant-vowel (CV) syllables using the HTK [15], an isolated word recognition system using phones, tri-phones and CV syllables [4], and a speaker independent LVCSR using phones and context dependent phones [17]. In addition, [9] investigated the use of both context dependent and context independent CV syllable acoustic models (AM) by comparing them with tied state tri-phone based recognizers, and [10] investigated the use of hybrid units in acoustic modeling and the use of syllable as well as hybrid AMs in morpheme-based speech recognition.

Most Amharic linguists agree that the syllable structure of Amharic is (C)V(C)(C), where C represents a consonant (optional where parenthesized) and V represents a vowel. That means the syllable types of Amharic are V, CV, CVC, VC, CVCC and VCC, including possible consonant clusters and gemination.

As stated above, previous research in ASR for Amharic concentrated on the use of phones (including tri-phones) and CV syllables for acoustic modeling. As a result, the full benefit of using Amharic syllables for acoustic modeling has not been investigated. Thus, this research investigates the possibility of designing a speech recognizer for Amharic by using the entire set of syllable structures V, CV, CVC, VC, CVCC and VCC as recognition units.

II. THE AMHARIC LANGUAGE
Amharic is the Semitic language with the second largest number of speakers, after Arabic. It is one of the major languages in Ethiopia, spoken by more than 17 million people, about one third of Ethiopia's population [2]. Amharic remains one of the most widely studied languages in Ethiopia. In addition to being the medium of instruction for primary level education, it is also taught as a subject in most elementary and secondary levels of education and is studied at B.A. and M.A. level. Currently, Amharic is the official working language
of the Federal Democratic Republic of Ethiopia and several of the states within the federal system.

A. Amharic Phonetics
The Amharic language has a total of 38 phonemes, which are divided into 31 consonants and 7 vowels [16][3]. The 31 consonants are categorized based on their place and manner of articulation. Based on their manner of articulation, the consonants are divided into stops, affricates, fricatives, nasals, liquids and glides. The phonemic transcription of the Amharic consonants b d f g h k l m n p r s t w y z corresponds to that of the English consonants. However, the ejectives or glottalized sounds (t’ k’ p’ s’ č’) are peculiar to Amharic and do not have correspondents in English [16]. Moreover, there are sounds which are the same or nearly the same as the English sounds but are represented by special phonetic symbols.

There are seven vowels in Amharic, traditionally called the “seven orders” [3]. There is no precise correspondence between the pronunciation of the Amharic and English vowels. The vowels in Amharic are never written in isolation but are combined with a consonant.

B. The Amharic Writing System
Amharic has its own syllabic writing system. It uses a style of script known as the Ge’ez alphabet or Fidel [2]. The writing system consists of 276 distinct symbols, 20 numerals and 8 punctuation marks. The written alphabet has 33 consonants, each having seven shapes when combined with a vowel. This makes 231 distinct symbols out of the 276. The remaining symbols include 18 labialized consonants, the consonant v, which appears in its seven orders, and the 20 labiovelars.

C. The Syllable Structure of Amharic
Consonants and vowels combine to make a syllable. A syllable is a vowel-like (or sonorant) sound together with some of the surrounding consonants that are most closely associated with it. The vowel at the core of a syllable is the nucleus. The optional initial consonant or set of consonants is called the onset. The coda is the optional consonant or sequence of consonants following the nucleus. The rhyme is the nucleus plus the coda [9].

[Fig. 1: tree diagram with the node labels SYLLABLE, CORE, ONSET, NUCLEUS and CODA]

Amharic is a syllabic language in which every grapheme (character) represents a consonant-vowel combination. However, not all syllables in Amharic follow the CV sequence represented by the graphemes. Several researchers have studied the syllable structure of the Amharic language and come up with different syllable templates. For example, [3] states that the syllable structure of Amharic is (C)V(C)(C), where C represents a consonant and V a vowel. That means the syllable types of Amharic are V, CV, CVC, VCC, VC and CVCC. In contrast, [7] states the syllable structure of Amharic as CV and CVC. This study considers all six syllable types identified by [3], as the aim is to investigate the use of longer length acoustic units for LVCSR.

III. EXPERIMENTAL SETUP
A. Speech and Text Corpus
The database used for this work is an Amharic read speech corpus [14]. It contains 20 hours of training speech collected from 100 speakers who read a total of 10,850 sentences (28,666 tokens). For development and final testing, a development set and an evaluation set with a vocabulary size of 5k, each containing 360 sentences and also prepared by [14], were used. To develop the language model, the text corpus ATC_120k [8] was used. It consists of 120,262 sentences (2,348,150 tokens or 211,120 types).

B. Lexical, Language and Acoustic Models
A syllable-based LVCSR for Amharic requires the development of a word based language model, a syllable based pronunciation dictionary and a set of syllable based acoustic models. This section presents the general development process followed to build the syllable based recognizer.

The lexical model (pronunciation dictionary) specifies the finite set of words that may be output by the speech recognizer and maps each of these words to its equivalent sub-word (usually phoneme or syllable) sequence. The first step in building the lexical model required for the experiment was to syllabify each word in the orthographically transcribed training and test sentences into its proper VC, CV, V, CVC, VCC and CVCC syllable form. To achieve this, automatic syllabification software designed following the Maximal Onset and Sonority Hierarchy principles was used [12]. All the sentences were given as input to this software, which segmented each word into a sequence of syllables. Following the syllabification of each word into its proper form, the training and decoding dictionaries were prepared using the HTK tool HDMan.

To train a set of syllable HMM models, it is crucial to design a topology with proper consideration of the size of the recognition unit and the amount of training speech data. For a phone based system, a good topology to use is a 3-state Bakis model with no skips [11]. In accordance with this concept, 3 emitting states per phone were chosen for each syllable model type. Thus, the syllable type V had 3 emitting states, resulting in a 5 state model including the two non-emitting states. By the same reasoning, the syllable types CV and VC, CVC and VCC, and CVCC resulted in prototype models of 8, 11 and 14 states respectively (including the non-emitting start and end states). The prototype models were initialized with a bootstrap method using the 10,850 parameterized training utterances. Once the initialized syllable model set was created, the transition probabilities and observation likelihood parameters were re-estimated using the Baum-Welch re-estimation algorithm. Multiple iterations of this algorithm were performed until well
trained syllable models resulted. Training of all the syllable models was done using the HTK toolkit.

Continuous speech recognition always requires some language information to restrict the recognition. Simple dictionary based recognition is rarely enough for achieving acceptable recognition accuracy. Thus, one of the required elements in the development of large vocabulary continuous speech recognition systems (LVCSRS) is the language model. The language model used in this work was a backed-off bigram language model. It was developed using the ATC_120k corpus for a 5k vocabulary size (the vocabulary size of the test data) using the SRILM toolkit [13]. The LM was smoothed with modified Kneser-Ney smoothing and made open using a special unknown word token.

IV. EXPERIMENTATION

A. Preliminary Syllable Based Recognizer
In the first set of experiments performed for the syllable recognizer, the syllabified pronunciation lexicon contained a total of 2667 unique syllables for complete coverage of the 20 hour training data. These 2667 syllables were used to construct the syllable HMM set used in this experiment.

The models were trained following the procedures explained in section III B. Training was done for different numbers of Gaussian mixtures, starting with a single mixture and incrementing by two until a performance degradation was seen (after 12 Gaussian mixtures performance degradation was seen). The models constructed were tested using the 360 sentences described in section III A. The overall result of the experiment is summarized in Table 3:

No. of Mix    WER
Single        23.81
Mix 2         22.34
Mix 4         21.39
Mix 6         21.25
Mix 8         21.5
Mix 10        21.25
Mix 12        21.15

TABLE 3. Performance of Preliminary Syllable Recognizer

The results obtained here indicate that the model trained with 12 Gaussian mixtures, with a WER of 21.15, performed better than the ones trained with other numbers of Gaussian mixtures.

B. Syllable Recognizer With The State Number Changed
The change made in this experiment was the number of emitting states used in the CV and VC syllable HMM models. Solomon [15] used five emitting states to model the CV syllable based speech recognizer he developed and achieved good performance. Taking this into consideration, the CV and VC syllable models among the set were changed to 7 state syllable models (5 emitting and two non-emitting states). The other models kept the same number of states used in experiment A. The overall result of this experiment is summarized in Table 4:

No. of Mix    WER
Single        24.76
Mix 2         23.5
Mix 4         22.39
Mix 6         22.18
Mix 8         22.28
Mix 10        21.56
Mix 12        21.4

TABLE 4. Performance of Syllable Recognizer using small number of states for VC and CV syllables

The best performance obtained in this experiment was the one with 12 Gaussian mixtures, with a WER of 21.4. In comparison to the first experiment there is a slight decrease in performance (increase in WER).

C. Syllable Recognizer Considering Frequently Occurring Syllables
The first and second experiments (A and B) were done using 2667 syllable models, of which 1796 had too few training instances (fewer than 15) to result in well trained, good performing models. The threshold of 15 training instances was chosen after trying different numbers of instances. Thus, in this experiment the 871 syllables with sufficient training instances were used as syllable models, and the remaining 1796 poorly trained syllables in the lexicon were replaced with their phonemic representation. This approach was chosen because the 871 syllables cover more than 80% of the training database and thus are the frequently occurring syllables. The performance obtained from experiment A and experiment B showed that the CV and VC syllables with 6 emitting states resulted in better performance; thus this experiment used 6 emitting states for those syllable types.

The syllable based pronunciation lexicon was modified so that, in each pronunciation entry, every syllable with insufficient training instances was replaced with its phonemic representation (which covered all the phones found in the Amharic language). In addition to the modification of the pronunciation lexicon, a 3 state Bakis HMM model had to be trained for each of the phones resulting from the segmentation of a syllable into its phonemic equivalent. For example, for the word entry “wetet” in the pronunciation lexicon, the syllabified form of the word is “we” “tet”, resulting in two syllables. If the syllable “tet” were the one with insufficient training instances, it would be reduced to its phonemic level “t” “e” “t” in the dictionary.

Previous Entry -> wetet we tet
Modified Entry -> wetet we t e t

The overall result of experiment C is summarized in Table 5:

No. of Mix    WER
Single        18.33
Mix 2         16.71
Mix 4         14.65
Mix 6         14.32
Mix 8         14.54
Mix 10        14.24
Mix 12        14.2

TABLE 5. Performance of Syllable Recognizer only considering frequently occurring syllables

The best performance obtained in this experiment was again the one with 12 Gaussian mixtures, with a WER of 14.2. This large reduction in the WER shows that the low performance obtained in experiments A and B is attributable to the poorly trained syllable models and that performance can be increased if sufficient data is used.

D. Syllable Based Recognizer Four
This experiment was done using the same syllable model system described in experiment A. The difference is that the performance of the system was evaluated using test data that contained only the 871 syllables with sufficient training instances. The results of this experiment are shown in Table 6.

No. of Mix    WER
Single        20.4
Mix 2         19.91
Mix 4         19.00
Mix 6         17.32
Mix 8         17.6
Mix 10        16.24
Mix 12        16.2

TABLE 6. Performance of Syllable Recognizer only considering frequently occurring syllables

The best performance obtained in this experiment was again the one with 12 Gaussian mixtures, with a WER of 16.2. This result also shows that the low performance obtained in experiments A and B is attributable to the poorly trained syllable models and that performance can be increased if sufficient data is used.

V. CONCLUSION AND RECOMMENDATION
This research investigated the possibility of designing an LVCSRS for Amharic using all the syllable types of the language as acoustic units. A set of consecutive experiments was conducted to improve the performance of the recognizer. The preliminary syllable recognizer developed first had poorly trained models as a result of data insufficiency. The poor performance obtained is partly attributed to this problem. Changing the number of states of the CV and VC syllables in experiment B resulted in a slight decrease in performance. In contrast, segmenting the poorly trained syllable models into their phonemic representation and keeping only syllables that have sufficient training examples resulted in a large performance gain.

The results obtained in all the experiments show that the use of all Amharic syllable types as acoustic units is promising in LVCSR, given that a sufficient amount of training data is used. Further research can be done in the future by applying clustering and tying mechanisms to rarely occurring syllable type models to gain performance improvement. There is also a need for a speech corpus with complete coverage of all the syllables found in Amharic.

REFERENCES
[1] A. Hämäläinen, J. Veth, L. Boves, “Longer length acoustic units for continuous speech recognition,” in Proceedings of the EUSIPCO-2005, 2005.
[2] A. Wimsatt, R. Wynn, “Amharic language and culture manual,” Texas State University, 2011.
[3] B. Yimam, “yäamarɨña säwasäw,” EMPDE, Addis Ababa, 2nd ed., 2007.
[4] K. Lee, Automatic Speech Recognition: The Development of the Sphinx System, 1990.
[5] K. Lopez, “Selection of lexical units for continuous speech recognition of Basque,” 2002.
[6] K. Tadesse, “Sub-word based Amharic speech recognizer: An experiment using Hidden Markov Model (HMM),” MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Ethiopia, 2002.
[7] M. Seyoum, “The syllable structure and syllabification in Amharic,” MSc Thesis, Department of Linguistics, Trondheim, Norway.
[8] M. Yifru, “Morphology-based language modeling for Amharic,” Ph.D. thesis, University of Hamburg, Germany.
[9] M. Yifiru, L. S. Teferra, Besacier, and S. Rossato, “Comparison of syllable and triphone based speech recognition for Amharic,” in Proceedings of the LTC 2011, pp. 207–211, 2011.
[10] M. Yifiru, L. Besacier, and S. Rossato, “Comparison of syllable and triphone based speech recognition for Amharic,” in Proceedings of the LTC 2011, pp. 207–211, 2011.
[11] P. Woodland, S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, The HTK Book, Microsoft Corporation, 2001.
[12] S. H/Mariam, N. Hailu, “Modeling improved syllabification for Amharic,” in Proceedings of the ICMEDEs, NY, USA, 2012.
[13] S. Teferra, M. Yifru, W. Menzel, “Amharic speech recognition: past, present and future,” in Proceedings of the ICES, Trondheim, Norway, 2007.
[14] S. Teferra, M. Yifru, W. Menzel, “An Amharic speech corpus for large vocabulary continuous speech recognition,” in Proceedings of the INTERSPEECH, Lisbon, Portugal, 2005.
[15] S. Teferra, W. Menzel, “Syllable based speech recognition for Amharic,” in Proceedings of the CASL, Prague, Czech Republic, pp. 33-40, 2007.
[16] W. Leslau, “Introductory Grammar of Amharic,” Harrassowitz Verlag, Wiesbaden, 2000.
[17] Z. Seifu, “HMM based large vocabulary, speaker independent, continuous Amharic speech recognizer,” MSc Thesis, School of Information Studies for Africa, Addis Ababa University, Ethiopia, 2003.
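
As an illustration of two steps described in sections III B and IV C, the following Python sketch computes the prototype state count for each syllable type (3 emitting states per phone plus two non-emitting states) and rewrites a syllable lexicon so that syllables with fewer than 15 training instances are replaced by their phones. It is a minimal sketch under stated assumptions, not the authors' tooling: the function names, the dictionary-based lexicon format, the example counts and the one-character-per-phone split are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only; names and data formats are assumptions,
# not part of the HTK workflow used in the paper.

STATES_PER_PHONE = 3   # emitting states per phone (section III B)
NON_EMITTING = 2       # non-emitting start and end states


def total_states(syllable_type):
    """Total HMM states for a syllable type such as 'V' or 'CVCC'."""
    return STATES_PER_PHONE * len(syllable_type) + NON_EMITTING


def split_into_phones(syllable):
    """Hypothetical helper: treats every character as one phone.

    This holds for the transliterated example 'tet' but would need a
    real phone inventory for multi-character symbols such as ejectives.
    """
    return list(syllable)


def back_off_lexicon(lexicon, syllable_counts, min_instances=15):
    """Replace under-trained syllables with their phone sequence.

    lexicon: dict mapping a word to its list of syllables.
    syllable_counts: dict mapping a syllable to its number of
        training instances (missing syllables count as zero).
    min_instances: syllables with fewer instances are split into phones.
    """
    new_lexicon = {}
    for word, syllables in lexicon.items():
        units = []
        for syllable in syllables:
            if syllable_counts.get(syllable, 0) >= min_instances:
                units.append(syllable)          # keep as a syllable unit
            else:
                units.extend(split_into_phones(syllable))
        new_lexicon[word] = units
    return new_lexicon


if __name__ == "__main__":
    for syllable_type in ("V", "CV", "VC", "CVC", "VCC", "CVCC"):
        print(syllable_type, total_states(syllable_type))

    # Made-up counts: 'tet' is treated as under-trained.
    lexicon = {"wetet": ["we", "tet"]}
    counts = {"we": 40, "tet": 3}
    print(back_off_lexicon(lexicon, counts))
```

With the made-up counts above, the entry for "wetet" changes from the units we, tet to we, t, e, t, mirroring the previous and modified entries shown in section IV C. Keeping the threshold as a parameter reflects the statement that the value 15 was chosen after trying different numbers of instances.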
