Abstract— This study investigated the possibility of developing a large vocabulary continuous speech recognizer (LVCSR) for Amharic using the different syllable types V, CV, VC, CVC, VCC and CVCC found in the language as acoustic units. Syllables, as longer length acoustic units, are able to embed the spectral and temporal dependencies found in speech and thus to model it well. The recognizer was developed using the Hidden Markov Model as a modeling technique. The results of the experiments show that syllables are promising units for Amharic LVCSR provided that enough training data is available.

Keywords— Syllable Based ASR; Amharic LVCSR; Longer Length Acoustic Unit

I. INTRODUCTION

Automatic Speech Recognition (ASR) is the process of decoding the information conveyed by a speech signal into a string of words [4]. The resulting words can be used to perform various tasks such as controlling a machine, accessing a database, or producing a written form of the input speech.

The development of a Large Vocabulary Continuous Speech Recognition system for a language involves the selection of a set of suitable Lexical Units (LU) [5]. ASR systems may use words, phones, context dependent phones or tri-phones, phone-like (PL) units, and syllables to model the basic units of speech. Words are able to model contexts well. However, using word models in large vocabulary recognition requires a considerable amount of data to train the models adequately, and they are thus impractical for LVCSR [4].

Sub-word acoustic units such as phones or context dependent phones are generally used in LVCSR as a solution. According to [1], phoneme substitution and reduction are well captured by tri-phones; however, co-articulation effects typically have a long time span, and the corresponding spectral and temporal dependencies are not easy to capture with tri-phones, making these sub-word units inefficient at modeling long-term dependencies. These inherent restrictions raise the question of how the long-term spectral and temporal dependencies present in speech could be modeled in large vocabulary continuous speech recognition (LVCSR). One solution is the use of longer length sub-word units, particularly syllables, in LVCSR systems.

Various research works have been conducted in the area of speech recognition for Amharic using phones, tri-phones, and CV syllables as acoustic units. Some of the works include a speech recognizer that recognizes a subset of consonant-vowel (CV) syllables using the HTK [15]; an isolated word recognition system using phones, tri-phones, and CV-syllables [4]; and a speaker independent LVCSR using phones and context dependent phones [17]. [9] investigated the use of both context dependent and context independent consonant-vowel (CV) syllable acoustic models (AM) by comparing them with tied-state tri-phone based recognizers, and [10] investigated the use of hybrid units in acoustic modeling and the use of syllable as well as hybrid AMs in morpheme-based speech recognition.

Most Amharic linguists agree that the syllable structure of Amharic is (C)V(C)(C), where C represents optional consonants and V represents a vowel. That means the syllable types of Amharic are V, CV, CVC, VC, CVCC and VCC, including possible consonant clusters and gemination.

As stated above, previous research in ASR for Amharic concentrated on the use of phones (including tri-phones) and CV syllables for acoustic modeling. As a result, the full benefit of using Amharic syllables for acoustic modeling has not been investigated. Thus, this research investigates the possibility of designing a speech recognizer for Amharic by using the entire set of syllable structures V, CV, CVC, VC, CVCC and VCC as recognition units.

II. THE AMHARIC LANGUAGE

Amharic is a Semitic language that has the second largest number of speakers, after Arabic. It is one of the major languages in Ethiopia, spoken by more than 17 million people, about one third of Ethiopia's population [2]. Amharic remains one of the most widely studied languages in Ethiopia. In addition to being the medium of instruction for primary level education, it is also taught as a subject in most elementary and secondary schools, as well as studied at B.A. and M.A. levels. Currently, Amharic is the official working language
of the Federal Democratic Republic of Ethiopia and of several of the states within the federal system.

A. Amharic Phonetics

The Amharic language has a total of 38 phonemes, which are divided into 31 consonants and 7 vowels [16][3]. The 31 consonants are categorized based on their place and manner of articulation. Based on their manner of articulation, the consonants are divided into stops, affricates, fricatives, nasals, liquids and glides. The Amharic consonants b d f g h k l m n p r s t w y z have phonemic transcriptions corresponding to those of the English consonants. However, the ejective or glottalized sounds (t’ k’ p’ s’ č’) are peculiar to Amharic and do not have correspondents in English [16]. Moreover, there are sounds which are the same or nearly the same as the English sounds but are represented by special phonetic symbols.

There are seven vowels in Amharic, traditionally called the “seven orders” [3]. There is no precise correspondence between the pronunciation of the Amharic and English vowels. The vowels in Amharic are never written in isolation but always combined with a consonant.

B. The Amharic Writing System

Amharic has its own syllabic writing system. It uses a style of script known as the Ge’ez alphabet or Fidel [2]. The writing system consists of 276 distinct symbols, 20 numerals and 8 punctuation marks. The written alphabet has 33 consonants, each consonant having seven shapes when combined with a vowel. This makes 231 distinct symbols out of the 276. The remaining symbols include 18 labialized consonants, the consonant v, which appears in its seven orders, and the 20 labiovelars.

C. The Syllable Structure of Amharic

Consonants and vowels combine to make a syllable. A syllable is a vowel-like (or sonorant) sound together with some of the surrounding consonants that are most closely associated with it. The vowel at the core of a syllable is the nucleus. The optional initial consonant or set of consonants is called the onset. The coda is the optional consonant or sequence of consonants following the nucleus. The rhyme is the nucleus plus the coda [9].

Fig. 1. The structure of a syllable: a syllable consists of an onset and a core, where the core consists of the nucleus and the coda.

Amharic is a syllabic language in which every grapheme (character) represents a consonant-vowel combination. However, not all syllables in Amharic follow the CV sequence represented by the graphemes. Several researchers have studied the syllable structure of the Amharic language and come up with different syllable templates. For example, [3] states that the syllable structure of Amharic is (C)V(C)(C), where C represents a consonant and V a vowel. That means the syllable types of Amharic are V, CV, CVC, VCC, VC and CVCC. [7] states the syllable structure of Amharic as CV and CVC. This study considers all six syllable types identified by [3], as the aim is to investigate the use of longer length acoustic units for LVCSR.

III. EXPERIMENTAL SETUP

A. Speech and Text Corpus

The database used for this work is an Amharic read speech corpus [14]. It contains 20 hours of training speech collected from 100 speakers who read a total of 10,850 sentences (28,666 tokens). For development and final testing, development and evaluation sets with a vocabulary size of 5k, each containing 360 sentences, also prepared by [14], were used. To develop the language model, the text corpus ATC_120k [8] was used. It consists of 120,262 sentences (2,348,150 tokens or 211,120 types).

B. Lexical, Language and Acoustic Models

A syllable-based LVCSR for Amharic requires the development of a word based language model, a syllable based pronunciation dictionary and a set of syllable based acoustic models. This section presents the general development process undergone to build the syllable based recognizer.

The lexical model (pronunciation dictionary) specifies the finite set of words that may be output by the speech recognizer and maps each of these words to its equivalent sub-word (usually phoneme or syllable) sequence. The first step in building the lexical model required for the experiment was to syllabify each word in the orthographically transcribed training and test sentences into its proper VC, CV, V, CVC, VCC or CVCC syllable form. To achieve this, automatic syllabification software, which implements linguistic syllabification notions following the Maximal Onset and Sonority Hierarchy principles, was used [12]. All the sentences were given as input to this automatic syllabification software, which segmented each word into a sequence of syllables. Following the syllabification of each word into its proper form, the training and decoding dictionaries were prepared using the HTK tool HDMan.

To train a set of syllable HMM models, designing a topology with proper consideration of the size of the recognition unit and the amount of training speech data is crucial. For a phone based system, a good topology to use is a 3-state Bakis model with no skips [11]. In accordance with this concept, 3 emitting states per phone were chosen for each syllable model type. Thus, the syllable type V had 3 emitting states, resulting in a 5-state model including the two non-emitting states. Applying the same analogy, the syllable types CV together with VC, CVC together with VCC, and CVCC resulted in 8-, 11- and 14-state (including the non-emitting start and end states) prototype models, respectively. The prototype models were initialized with a bootstrap method using the 10,850 parameterized speech utterances. Once the initialized syllable model set was created, the transition probabilities and observation likelihood parameters were re-estimated using the Baum-Welch re-estimation algorithm. Multiple iterations of this algorithm were performed until well
trained syllable models resulted. Training of all the syllable models was done using the HTK toolkit.

Continuous speech recognition always requires some language information to constrain the recognition. Simple dictionary based recognition is rarely enough to achieve acceptable recognition accuracy. Thus, one of the required elements in the development of large vocabulary continuous speech recognition systems (LVCSRs) is the language model. The language model used in this work was a backed-off bigram language model. It was developed using the ATC_120k corpus for a 5k vocabulary size (the vocabulary size of the test data) using the SRILM toolkit [13]. The LM was smoothed with the modified Kneser-Ney smoothing technique and made open using a special unknown word token.

A. Preliminary Syllable Based Recognizer

In the first set of experiments performed for the syllable recognizer, the syllabified pronunciation lexicon created contained a total of 2667 unique syllables for complete coverage of the 20-hour training data. These 2667 syllables were used to construct the syllable HMM set used in this experiment.

The models were trained following the procedures explained in section III B. The trainings were done for different numbers of Gaussian mixtures, starting with a single mixture and incrementing by two until a performance degradation was seen (after 12 Gaussian mixtures, performance degradation was seen). The models constructed were tested using the 360 sentences described in section III A. The overall result of the experiment is summarized in Table 3:

No. of Mix    WER (%)
Single        23.81
Mix 2         22.34
Mix 4         21.39
Mix 6         21.25
Mix 8         21.5
Mix 10        21.25

Taking this into consideration, the CV and VC syllable models among the set were changed to 7-state syllable models (5 emitting and two non-emitting states). The other models remained with the same number of states used in experiment A. The overall result of this experiment is summarized in Table 4:

No. of Mix    WER (%)
Single        24.76
Mix 2         23.5
Mix 4         22.39
Mix 6         22.18
Mix 8         22.28
Mix 12        21.4

TABLE 4. Performance of the syllable recognizer using a smaller number of states for the VC and CV syllables

The best performance obtained in this experiment was the one with 12 Gaussian mixtures, with a WER of 21.4. In comparison to the first experiment, there is a slight decrease in performance (increase in WER).

C. Syllable Recognizer Considering Frequently Occurring Syllables

The first and second experiments (A and B) were done using 2667 syllable models, out of which 1796 had insufficient training instances (fewer than 15 instances) to result in well trained, good performing models. The threshold of 15 training instances was chosen after trying different numbers of instances. Thus, in this experiment, 871 syllables with sufficient training instances were chosen to be used as syllable models, replacing the remaining 1796 poorly trained syllables in the lexicon with their phonemic representations. This approach was chosen because the 871 syllables cover more than 80% of the training database and thus are the frequently occurring syllables. The performance obtained from experiments A and B showed that the CV and VC syllables with 6 emitting states resulted in better performance; thus this experiment used 6 emitting states for those syllable types.
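The backoff scheme used in this experiment — keeping a syllable as a whole-unit model only when it has at least 15 training instances, and expanding every rarer syllable in the lexicon into its phoneme sequence — can be sketched as follows. This is a minimal illustration, not the authors' actual tooling; the function name, the `syllable_to_phones` mapping and the toy data are hypothetical:

```python
from collections import Counter

def build_backoff_lexicon(train_syllables, lexicon, syllable_to_phones,
                          min_instances=15):
    """Keep frequently occurring syllables as whole-unit models and
    back off rare syllables to their phonemic representation."""
    # Count syllable occurrences in the training transcriptions.
    counts = Counter(train_syllables)
    kept = {s for s, n in counts.items() if n >= min_instances}
    backoff_lexicon = {}
    for word, syllables in lexicon.items():
        units = []
        for s in syllables:
            if s in kept:
                units.append(s)                      # well-trained syllable model
            else:
                units.extend(syllable_to_phones[s])  # phone-level backoff
        backoff_lexicon[word] = units
    return backoff_lexicon, kept

# Hypothetical toy data: "ba" occurs often enough, "bet" does not.
pronunciations, kept = build_backoff_lexicon(
    ["ba"] * 20 + ["bet"] * 3,
    {"baba": ["ba", "ba"], "bet": ["bet"]},
    {"ba": ["b", "a"], "bet": ["b", "e", "t"]},
)
assert kept == {"ba"}
assert pronunciations == {"baba": ["ba", "ba"], "bet": ["b", "e", "t"]}
```

Applied to the real lexicon, such a selection is what leaves the 871 frequent whole-syllable models covering more than 80% of the training data, with the remaining 1796 lexicon entries realized as phone sequences.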