Text to Speech Synthesis for Ethiopian Semitic Languages: Issues and the Way Forward

Lemlem Hagos
IT PhD Program
Addis Ababa University
Addis Ababa, Ethiopia

Million Meshesha
School of Information Science
Addis Ababa University
Addis Ababa, Ethiopia

Abstract: Currently there is a tremendous increase in electronic text in various languages, including Ethiopian Semitic languages. Yet this text is inaccessible to visually impaired people and the illiterate. To come up with high quality text to speech synthesizers for local languages, it is imperative that research in natural language processing and synthetic speech generation be improved. Accordingly, we critically reviewed core issues in text to speech synthesis for Ethiopian Semitic languages and found that further research to improve its quality is mandatory. To optimize the linguistic resources required for high quality synthetic speech output, we propose designing a generic bilingual text to speech framework for Ethiopian Semitic languages.

Keywords: speech synthesis; Amharic; Tigrigna; bilingual TTS; Ethiopian Semitic languages

I. INTRODUCTION

Text-To-Speech (TTS) is the process of converting text to audible speech [1]. Towards this end, research has pursued the ultimate goal of speech synthesis, which is to mimic the human speech production system [2]. The state of the art shows that researchers now aim at more natural sounding and intelligible speech output [1].

The TTS problem integrates two sub-problems: a text analysis phase and a speech generation phase [1]. Text analysis is the process of determining the underlying structure of the sentence and the phonemic composition of each word. The text analysis phase involves text preprocessing (text normalization), word pronunciation (syllabification), phrasing intonation (identifying phrase pattern), segmental duration and computation of the intonation contour from phonological representations of the text [3]. Real world text contains non-standard words (NSWs) such as numbers, acronyms and abbreviations in addition to standard words, so a normalization process that expands each NSW into its pronounceable form is required [4].

Speech generation transforms the abstract linguistic representation of text into a speech waveform. The speech generation phase is responsible for the phonetic realization of each phoneme, that is, conversion of the phonemic representation in the text into its spoken counterpart or speech sound [1]. The speech synthesis part is also concerned with selection and concatenation of appropriate speech units [3].

There is a wide range of problems in TTS [4]. One of the challenges is pre-processing of non-standard words such as numerals, abbreviations and acronyms [2]. While standard words have a specific pronunciation that can be described phonetically by letter to phone (L2P) rules, pronunciation of non-standard words is not straightforward, as the written form is far different from the spoken form [4][1]. There are also problems related to correct prosody (rhythm and melody) and pronunciation analysis of written text [2]. Generating the full range of prosodic effects from text is challenging because text encodes the verbal message that the author generated, whereas prosody reflects the intention of the author. Thus, extraction of prosodic information embedded in the context requires learning about the prosodic features in context [1].

II. ETHIOPIAN SEMITIC LANGUAGE

Ethiopian Semitic is a subfamily of South Semitic, which in turn is a subfamily of West Semitic within the Semitic branch of the Afro-Asiatic superfamily. Ethiopian Semitic includes Geez, Tigrigna, Tigre, Amharic, Argoba, and others [5]. The languages of the family basically use the same writing system, with slight modifications mainly needed to fit the characteristics of each language. In this section, we compare and contrast Amharic and Tigrigna, the most widely spoken Semitic languages in Ethiopia, with specific emphasis on phonetic composition and syllable structure.

A. Phone-sets of Amharic and Tigrigna

Amongst Ethiopian Semitic languages, Amharic and Tigrigna share common characteristics derived from the same Geez root [5].

According to Girmay, Tigrigna has 29 consonantal phonemes and seven vowels [6]. This count excludes the labiovelars [gw], [kw] and ejective [kw], and the fricatives [X] and ejective [X]. Girmay argues that the labiovelars are derivable and that the fricatives are allophones of [k] and ejective [k], respectively [6]. According to other researchers, however, these labiovelars and fricatives, as well as the phoneme [V], are included in the consonant chart of Tigrigna, as a result of which the number of consonants would be 37 for Tigrigna [7][8].

The number of consonants in Amharic is likewise disputed. According to Baye, Amharic has 30 consonants [9]. Another argument is that Amharic has 21 underlying consonants and 6 palatal consonants, and that the palatals need not be shown in the consonant chart because they are derivable from the underlying consonants [10]. As in Tigrigna, Amharic has seven vowels. Analysis of Amharic and Tigrigna phonology also shows that the two languages share about 30 consonants in addition to the vowels, as indicated in Table I.

Even though there are controversies over the inclusion of the phoneme [V] and over the palatals, text to speech synthesis must consider all possible phonemes in the language. As shown in Table I, the pharyngeals and the velar fricatives are the differentiating phonemes that exist in Tigrigna but not in Amharic. Accordingly, the shared phonemes of these languages can be a basis for developing a generic bilingual text to speech synthesizer.

B. Standard and Non-Standard Words

Standard words are alphabetic strings that have a regular entry in a lexicon, while Non-Standard Words (NSWs) comprise numerical patterns and alphabetic strings that have no regular entry in a lexicon and whose pronunciation must be generated by more complicated natural language processing methods [4]. Written documents in Ethiopian Semitic languages such as Amharic and Tigrigna contain unrestricted text which includes both standard and non-standard words. Pronunciation of non-standard words is more challenging than that of standard words.

For instance, in Amharic the number 1980 can be read in a number of ways, such as "Iasra ztn semanja" or "shizetn meto semanja" when used as a year, or "Iand shizeten meto semanja" when serving as a decimal number specifying a quantity [11].

Similarly, in Tigrigna text there are various ways of reading numbers and acronyms. A number such as 12605 can be read in more than one way: when enumerated, when serving as a car plate number, and when serving as a numeric quantity. Acronyms such as "tImt" can have variant readings depending on what they represent, for instance the initials of proper names.

Thus, there is a need to implement a machine learning technique so that the possible variants of NSWs in Amharic as well as Tigrigna can be captured.

III. STATEMENT OF THE PROBLEM

The motivation behind this research is related to the availability of many local languages and the need to ease the sharing of information between those who can only hear and those who can read. This is related to literacy level and to disability, especially visual impairment. The 2011 Central Statistics Agency report indicates that 46.8% of the entire population is literate. This shows that the level of illiteracy exceeds 50% at national level [12].

Speech is the most effective means of communication among humans [1]. Towards effective human-machine communication, TTS research has been pursued for various languages in the world, such as English, European languages and Asian languages [2].

TABLE I

Places of articulation, in column order: Bilabial, Labio-dental, Alveolar, Palatal, Velar plain, Velar labialized, Laryngeal, Pharyngeal

Manner        Voicing     Consonants (listed in column order)
Plosives      Voiced      b  d  g  gw  I
              Voiceless   p  t  k  kw  I
              Ejective    p  t  k  kw
Fricatives    Voiced      v  z  z
              Voiceless   f  s  sh  x  xw  h
              Ejective    s  x  xw
Affricates    Voiced      dz
              Voiceless   c
              Ejective    c
Nasal                     m  n  n
Approximant               w  j
Liquids                   l
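
The idea that shared phonemes can anchor a generic bilingual synthesizer can be sketched with plain set operations. This is a minimal illustration, not the authors' design: the inventories below are small ASCII subsets rather than the full charts of Table I, the labels "x", "xw" and "H" are assumed stand-ins for the velar fricatives and a pharyngeal fricative that the text describes as Tigrigna-only, and the names AMHARIC, TIGRIGNA and plan_phone_set are hypothetical.

```python
# Illustrative subsets only; ASCII labels are stand-ins, not a full inventory.
SHARED_SAMPLE = {"b", "d", "g", "t", "k", "m", "n", "s", "z", "f", "l", "w", "j", "h"}

# Assumption: "x" and "xw" stand for velar fricatives and "H" for a pharyngeal
# fricative, the classes the text describes as present only in Tigrigna.
TIGRIGNA_ONLY_SAMPLE = {"x", "xw", "H"}

AMHARIC = set(SHARED_SAMPLE)
TIGRIGNA = SHARED_SAMPLE | TIGRIGNA_ONLY_SAMPLE


def plan_phone_set(lang_a, lang_b):
    """Split two phone inventories into shared and language-specific parts."""
    return {
        "shared": lang_a & lang_b,   # units both voices could reuse
        "only_a": lang_a - lang_b,   # units needed for the first language only
        "only_b": lang_b - lang_a,   # units needed for the second language only
    }


plan = plan_phone_set(AMHARIC, TIGRIGNA)
print("shared:", len(plan["shared"]), "Tigrigna-only:", sorted(plan["only_b"]))
```

Under these assumptions, a bilingual voice would record units for the shared set once and add only the language-specific set on top of it.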
Ethiopia being a multilingual and multinational country, its constitution decrees that each nation, nationality and people has the right to speak, write and develop its language [13]. Accordingly, many written documents and textbooks are being produced in the respective working languages of the regional states. In addition to textbooks, many journals, magazines, newspapers and novels are available in Amharic and Tigrigna. Furthermore, Ethiopian Semitic languages are under-resourced, so to optimize the utilization of linguistic resources there is a need for a generic bilingual text to speech synthesis framework for these languages. Thus, the purpose of this study is to design a novel bilingual text to speech synthesis framework for Ethiopian Semitic languages.

IV. REVIEW OF RELATED WORK

There are few research works based on concatenative, formant and statistical parametric methods for Ethiopian languages. Laine [14] developed the first text to speech system for Amharic using diphone based concatenative synthesis. The tools used were Pascal and MATLAB, and the evaluation was reported as good. Laine noted that prosodic information was not considered in his work.
Henock [15] made the next attempt, applying concatenative speech synthesis to Amharic. The tools and techniques include the TD-PSOLA technique for smoothing, PRAAT for spectrographic analysis, and Delphi and MATLAB for prototype development. Evaluation included ORT (Open Rhyme Test) and MOS (Mean Opinion Score), and the result was reported as promising.

Tesfay [16] attempted the first text to speech system for Tigrigna using a diphone based concatenative approach with MATLAB. The performance of the synthesizer measured using MOS was 3.05. Inclusion of an acronym converter in the text processing module and prosody control are among the issues that need further research [16].

Nadew [17] implemented formant based speech synthesis for Amharic vowels using MATLAB. The focus was on vowels since vowels play a big role in changing the pronunciation of a word in different contexts. The result indicated intelligibility of 88.85% for isolated vowels. Nadew recommended refinement of the work, including consideration of consonants and preparation of an appropriate speech corpus [17].

Bereket [18] modeled an HMM based speech synthesizer with the objective of developing an unlimited domain speech synthesizer for Amharic that can generate natural sounding and intelligible synthetic speech with fewer resource requirements. Out of a corpus of 11,670 sentences, 500 sentences were used to train the HTS and 20 sentences were used for testing. The reported performance was a MOS of 4.12 for intelligibility and 3.6 for naturalness. As future work, Bereket recommended inclusion of prosodic information to identify dialects and word meanings [18].

The experiment by Alula [19] shows the possibility of including non-standard words (NSWs) in Amharic TTS. Alula applied Festival diphone based unit concatenative synthesis and RELP coding (Residual Excited Linear Predictive Coding). The performance of the synthesizer was a MOS of 3.0 for intelligibility and 2.83 for naturalness. Alula recommended consideration of all types of NSWs and incorporation of a part of speech (POS) tagged corpus for prosody control as future work [19].

The few attempts, which mostly test the possibility of developing TTS synthesizers for Ethiopian languages, were done in a fragmented manner. Hence, there is a need for a generic TTS synthesizer that would optimize linguistic resources. Accordingly, the level of TTS quality for Ethiopian Semitic languages needs further improvement. To this end, the core issues that need to be addressed include homograph disambiguation, extraction of prosodic information from text, and incorporation of non-standard words in the development of a generic bilingual TTS.

This involves text corpus preparation, implementation, and performance evaluation.

A. Corpus Preparation

This includes acquisition, preprocessing and annotation of text data. Text data are acquired from Woyin newspaper and Reporter newspaper, for Tigrigna and Amharic respectively. These sources are among the widely used public media that publish on economics, politics, sport, social and technology issues, among others.

The collected text data is further preprocessed so that the text is composed of standard words from which punctuation marks are removed. For data cleaning and sentence level tokenization, the Python programming language is used.

B. Implementation

To explore the extent to which an already existing speech synthesis system such as Festival can be used for Ethiopian Semitic languages such as Amharic and Tigrigna, a Linux environment was set up. With selected words and phrases we tested the performance of the synthesizer with default parameters. As shown in Table II, ten words and phrases were selected that are composed of different numbers of syllables, the smallest being a one syllable word and the largest a ten syllable phrase.

To make the text input understandable to Festival, transliteration of Ethiopic orthography was necessary. Accordingly, we developed a transliteration algorithm and implemented it with Python 2.7.8. The transliterated version of the text was then used as input to the synthesizer. The speech output corresponding to the text input was saved as a .wav file for later evaluation by listeners. The preliminary experimentation shows a promising result, enabling the researchers to identify the points where the existing speech synthesis system fails to capture parts of Ethiopian Semitic languages.

C. Experimental Result and Discussion

The performance of the prototype developed during experimentation is evaluated by means of an intelligibility measure, which tells the extent to which the synthetic speech is comprehensible. Since the speech output is based on default parameters which make use of foreign pronunciation, a naturalness test was not conducted.

As shown in Table II, intelligibility of the speech output was evaluated by two bilingual speakers of the two languages, who were asked to write what they perceived while listening to the speech output. Then the average number of syllables correctly perceived was computed against the number of syllables in the text to get the percentage of accuracy. Their average response of 86.03% accuracy makes visible the difficulty the existing system has in pronouncing t, k, s, H among the unique phonemes in Amharic and Tigrigna. Moreover, the allophones of Amharic and Tigrigna are not recognized by the existing synthesis system. Thus, there is a need to create voices for the languages to improve the quality of the synthesized speech in terms of both intelligibility and naturalness.

This paper discusses TTS issues related to Ethiopian Semitic languages with specific emphasis on Amharic and Tigrigna. There are a number of challenging issues that affect the quality of synthetic speech for Ethiopian Semitic languages. These issues include geminates, epenthetic vowel insertion, homograph disambiguation, and normalization of non-standard words. Even though TTS research attempts for
TABLE II

Tigrigna words and phrases with         Gloss                           Number of   Correctly perceived   Average
respective phonetic representation                                      syllables   (average)             intelligibility
HI-luf                                  past                            2           1                     50%
mar                                     honey                           1           1                     100%
tIm-hIrt                                Education                       2           2                     100%
wI-lad we-li-da                         gives birth                     4           3.5                   87.5%
TI-rat jas-fe-llI-gal                   Effort needed                   6           4.5                   75%
wu-fu-jat tem-ha-ro                     committed students              6           5.5                   91.67%
hIz-bba-wi-na s-la-ma-wi                public and peaceful             7           7                     100%
me-na-II-saj te-te-ki-Iom               the young took over             8           7                     87.5%
ke-ffI-te-Na wI-Tet jI-me-ze-ge-bbal    Best to be achieved             11          9.5                   86.36%
kI-rIn mo-ge-sIn nI-sI-wu-Iat-na        Respect and grace for patriots  10          8.5                   85.0%
be-weq-tu je-te-fe-Se-mew dzeg-nI-net   Patriotism at that time         11          9                     81.81%
Total                                                                   68          58.5                  86.03%
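
The overall figure in the last row can be reproduced from the table itself. The sketch below assumes the overall score is the total of correctly perceived syllables divided by the total number of syllables, which agrees with the 86.03% reported in the text; the names ROWS and syllable_accuracy are hypothetical and are not taken from the authors' implementation.

```python
# (syllables in the text, average syllables correctly perceived by the
# two listeners), one pair per row of the intelligibility table.
ROWS = [
    (2, 1.0), (1, 1.0), (2, 2.0), (4, 3.5), (6, 4.5), (6, 5.5),
    (7, 7.0), (8, 7.0), (11, 9.5), (10, 8.5), (11, 9.0),
]


def syllable_accuracy(rows):
    """Percentage of syllables correctly perceived over all rows."""
    total_syllables = sum(n for n, _ in rows)
    total_correct = sum(c for _, c in rows)
    return 100.0 * total_correct / total_syllables


total_syllables = sum(n for n, _ in ROWS)
total_correct = sum(c for _, c in ROWS)
overall = round(syllable_accuracy(ROWS), 2)
print(total_syllables, total_correct, overall)
```

Weighting by syllables rather than averaging the per-row percentages means longer phrases contribute more to the overall score, which is the assumption made here.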