
A Comparison Between Allophone, Syllable, and Diphone Based TTS Systems for Kurdish Language
Wafa Barkhoda¹, Bahram ZahirAzami¹, Anvar Bahrampour², Om-Kolsoom Shahryari¹
¹Department of Computer and IT, University of Kurdistan, Sanandaj, Iran
²Department of Information Technology, Islamic Azad University, Sanandaj Branch, Sanandaj, Iran
{w.barkhoda, zahir, shahryarLkolsoom}@ieee.org, bahrampour58@gmail.com
Abstract- Nowadays, the concatenative method is used in most modern
TTS systems to produce artificial speech. The most important
challenge in this method is choosing an appropriate unit for
creating the database. This unit must guarantee smooth, high-quality
speech, and creating its database must take reasonable resources and
be inexpensive. The syllable, phoneme, allophone, and diphone are the
units usually used in such systems. In this paper, we implement three
synthesis systems for the Kurdish language, based respectively on the
syllable, allophone, and diphone, and compare the quality of the three
systems using subjective tests.
Keywords- Speech Synthesis; Concatenative Method; Kurdish TTS
System; Allophone; Syllable; Diphone.
1. INTRODUCTION
High quality speech synthesis from the electronic form of
text has been a focus of research activities during the last two
decades, and it has led to an increasing horizon of
applications. To mention a few, commercial telephone
response systems, natural language computer interfaces,
reading machines for blind people and other aids for the
handicapped, language learning systems, multimedia
applications, talking books and toys are among the many
examples [1].
Most of the existing commercial speech synthesis systems
can be classified as either formant synthesizers [2,3] or
concatenation synthesizers [4,5]. Formant synthesizers,
which are usually controlled by rules, have the advantage of
having small footprints at the expense of the quality and
naturalness of the synthesized speech [6]. On the other hand,
concatenative speech synthesis, using large speech databases,
has become popular due to its ability to produce high quality
natural speech output [7]. The large footprints of these
systems do not present a practical problem for applications
where the synthesis engine runs on a server with enough
computational power and sufficient storage [7].
Concatenative speech synthesis systems have grown in
popularity in recent years. As memory costs have dropped, it
has become possible to increase the size of the acoustic
inventory that can be used in such a system. The first
successful concatenative systems were diphone based [8],
with only one diphone unit representing each combination of
consecutive phones. An important issue for these systems
was how to select, offline, the single best unit of each
diphone for inclusion in the acoustic inventory [9,10]. More
recently, there has been interest in automating the process
of creating databases and in allowing multiple instances of
particular phones or groups of phones in the database, with
the selection decided at run time. A new but related problem
has emerged: that of dynamically choosing the most
adequate unit for any particular synthesized utterance [11].
978-1-4244-5950-6/09/$26.00 ©2009 IEEE
The development and application of text-to-speech
synthesis technology for various languages is growing
rapidly [12,13]. Designing a synthesizer for a language
depends largely on the structure of that language; in
addition, there can be variations (dialects) particular to
geographic regions. Designing a synthesizer therefore
requires significant investigation into the language structure,
or linguistics, of a given region.
For most languages, widespread research has been done on
text-to-speech systems, and for some of them commercial
systems are offered. CHATR [14,15] and AT&T NEXT-GEN
[16] are two examples offered for the English language.
Much effort has also been made in this field for other
languages such as French [17,18], Arabic [4,19,20],
Norwegian [21], Korean [22], Greek [23], and Persian
[24-27].
The area of Kurdish Text-to-Speech (TTS) is still in its
infancy; compared to other languages, little research has
been carried out on this language. To the best of our
knowledge, no serious academic research has yet been
performed on the various branches of Kurdish language
processing (recognition, synthesis, etc.) [28,29].
Kurdish is one of the Iranian languages, a sub-category of
the Indo-European family [30,31]. The Kurdish phonemic
inventory consists of 24 consonants, 4 semi-vowels, and 6
vowels; three further phonemes have entered Kurdish from
Arabic. The language has two scripts: the first is a modified
Arabic alphabet and the second a modified Latin alphabet
[32,33]. For example, "trifa", which means "moonlight" in
Kurdish, is written as "tirife" in the Latin script. While both
scripts are in use, both of them suffer from some problems
(e.g., in the Arabic script the phoneme /i/ is not written, and
both /w/ and /u/ are written with the same Arabic sign
[32,33]; the Latin script lacks one of the Arabic-origin
phonemes and has no standard written sign for foreign
phonemes [33]).
In concatenative systems, one of the most important
challenges is selecting an appropriate unit for concatenation.
Each unit has its own advantages and disadvantages and may
suit a particular system. In this paper, we develop three
concatenative TTS systems for the Kurdish language, based
on the syllable, allophone, and diphone, and compare these
systems in terms of intelligibility, naturalness, and overall
quality.
The rest of the paper is organized as follows: Section 2
introduces the allophone based TTS system. Sections 3 and 4
present the syllable and diphone based systems, respectively.
A comparison between these systems and the quality test
results are presented in Section 5, and conclusions are drawn
in Section 6.
2. ALLOPHONE BASED TTS SYSTEM
In this part, a text-to-speech system for the Kurdish
language is introduced, constructed with the concatenative
method of speech synthesis and using allophones (the several
pronunciations of a phoneme [34]) as the basic unit [28,29].
According to the input text, the proper allophones are chosen
from the database and concatenated to obtain the primary
output.
Differences between allophones in the Kurdish language
are normally very clear; therefore, we preferred to explicitly
use allophone units for the concatenative method. Some
allophones obey obvious rules; for example, if a word ends
with a voiced phoneme, that phoneme loses its voicing
feature and is called devoiced [32]. In most cases, however,
there is no clear and constant rule, so we used a neural
network for extracting allophones. Because of their learning
power, neural networks can learn from a database and
recognize allophones properly.
Fig. 1 shows the architecture of the proposed system. It is
composed of three major components: a pre-processing
module, a neural network module, and an allophone-to-sound
module. After the input raw text is converted to the standard
text, a sliding window of width four is used as the network
input. The network detects the second phoneme's allophone,
and that allophone's waveform is concatenated to the
preceding waveform.
Fig. 1. Architecture of the proposed Kurdish TTS system (raw text, text normalizer, standard converter, neural network, allophone-to-sound, output speech)
The pre-processing module includes a text normalizer and
a standard converter. The text normalizer is an application
that converts the input text (in Arabic or Latin script) to our
standard script; in this conversion we encountered some
problems [30-33].
In the standard script, 41 standard symbols were finally
identified. Notice that this is more than the number of
Kurdish phonemes, because we also include three standard
symbols for space, comma, and dot. Table 1 shows all the
standard letters used by the proposed system, and Table 2
shows the same sentence in the various scripts. The standard
converter also performs standard text normalization tasks
such as converting digits into their word equivalents,
spelling out known abbreviations, etc.
In the next stage, allophones are extracted from the
standard text using a neural network. Kurdish phonemes
have approximately 200 allophones, but some of them are so
similar that non-expert listeners cannot detect the differences
[32]. As a result, it is not necessary for our TTS system to
include all of them; for simplicity, only 66 important and
clearly distinct instances have been included (see Table 3).
The allophones are also not divided equally among the
phonemes (e.g., /p/ is represented by five allophones, but /r/
has only one [32]). However, the neural network
implementation is very flexible, as it is very simple to change
the number of allophones or phonemes.
Major Kurdish allophones (more than 80% of them)
depend only on two consecutive phonemes. The others
(about 20%) may depend on the current phoneme, one
preceding phoneme, and two succeeding phonemes [32].
Hence, we employed four sets of neurons in the input layer,
each having 41 neurons for detecting the 41 standard symbols
mentioned above. A sliding window of width four provides
the input phonemes for the network input layer, and each set
of the input layer is responsible for one of the phonemes in
the window. The aim is to recognize the allophone
corresponding to the second phoneme of the window. The
output layer has 66 neurons (corresponding to the 66 Kurdish
allophones used here), and the middle layer, which is
responsible for detecting the language rules, has 60 neurons
(these values were obtained empirically); see Fig. 2. The
neural network accuracy rate is 98%. In Table 4, the neural
network output and the desired output are compared.
Fig. 2. The neural network structure (4 letters of text input, 4 groups of 41 input units, 60 hidden units, 66 output units, one target allophone)
After allophone recognition, the corresponding allophone
waveforms should be concatenated. For each allophone, we
selected a suitable word and recorded it in a noiseless
environment. Separation of the allophones in the waveforms
was done manually using the WaveSurfer software. The
results of this system and its comparison with the other
systems are presented in Section 5.
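A minimal sketch of the sliding-window input encoding described above, using a stand-in symbol inventory (the real system uses the 41 Kurdish standard symbols and a trained 164-60-66 feed-forward network; the padding scheme here is an assumption):

```python
SYMBOLS = list("abcdefghij") + [" "]          # stand-in for the 41 standard symbols
SYM_INDEX = {s: i for i, s in enumerate(SYMBOLS)}

def one_hot(symbol):
    """One-hot vector over the symbol inventory (one group of input units)."""
    vec = [0] * len(SYMBOLS)
    vec[SYM_INDEX[symbol]] = 1
    return vec

def encode_windows(text, width=4):
    """Slide a width-4 window over the text; each window is flattened into
    one input vector (4 concatenated one-hot groups). The network's target
    is the allophone of the window's second symbol."""
    padded = " " + text + " " * (width - 2)   # pad so every symbol is 2nd once
    vectors = []
    for i in range(len(padded) - width + 1):
        window = padded[i:i + width]
        vec = []
        for s in window:
            vec.extend(one_hot(s))
        vectors.append((window[1], vec))       # (target symbol, input vector)
    return vectors

pairs = encode_windows("abca")   # one input vector per original symbol
```

With the real 41-symbol set, each input vector would have 4 x 41 = 164 entries, matching the four groups of 41 input neurons.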
3. SYLLABLE BASED TTS SYSTEM
The syllable is another unit used for developing
text-to-speech systems. Different languages have different
syllable patterns. In most languages there are many syllable
patterns and therefore a large number of syllables, so the
syllable is usually not used in all-purpose TTS systems. For
example, there are more than 15000 syllables in English [6],
and creating a database for this number of units is a very
difficult and time-consuming task.
In some languages, the number of syllable patterns is
limited, so the number of syllables is small and creating a
database for them is reasonable; in such languages this unit
can be used in all-purpose TTS systems. For example, Indian
languages have CV, CCV, VC, and CVC syllable patterns,
and the total number of syllables is about 10000. In [35],
syllable-like units are used, and the number of these units is
1242.
The syllable is also used in some Persian TTS systems
[26]. Persian has only the CV, CVC, and CVCC patterns for
its syllables, so the number of its syllables is limited to about
4000 [26].
Kurdish has three main groups of syllables: Asayi,
Lekdraw, and Natewaw [33]. Asayi is the most important
group and includes most Kurdish syllables. In the Lekdraw
group, two consonant phonemes occur at the onset of the
syllable and form a cluster, giving, for example, the pattern
CCV. Finally, the Natewaw group occurs only seldom. Each
group is further divided into three sub-groups: Suk, Pir, and
Giran [33]. Table 5 shows these groups with the
corresponding patterns and appropriate examples.
According to Table 5, Kurdish has 9 syllable patterns; but
the Lekdraw and Natewaw groups are seldom used, and in
practice the three patterns CV, CVC, and CVCC are the most
used in the Kurdish language. Accordingly, we consider only
the Asayi group in the implementation, so the number of
database syllables is less than 4000. In our system, we
consider these syllables and build our TTS system using
them.
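The pattern groups in Table 5 can be checked mechanically. The sketch below assumes a one-letter-per-phoneme representation and a simplified vowel set (stand-ins, not the actual Kurdish inventory):

```python
VOWELS = set("aeiouUY")  # hypothetical stand-in for the Kurdish vowels

def cv_pattern(syllable):
    """Map each phoneme to C or V, e.g. 'kurt' -> 'CVCC'."""
    return "".join("V" if p in VOWELS else "C" for p in syllable)

ASAYI = {"CV", "CVC", "CVCC"}          # Suk, Pir, Giran sub-groups
LEKDRAW = {"CCV", "CCVC", "CCVCC"}     # consonant cluster at the onset
NATEWAW = {"V", "VC", "VCC"}           # vowel at the onset

def syllable_group(syllable):
    """Return (group name, CV pattern) for one syllable, or (None, pattern)."""
    pat = cv_pattern(syllable)
    for name, patterns in (("Asayi", ASAYI), ("Lekdraw", LEKDRAW), ("Natewaw", NATEWAW)):
        if pat in patterns:
            return name, pat
    return None, pat
```

Restricting a database builder to `ASAYI` patterns mirrors the decision above to keep only the Asayi group.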
4. DIPHONE BASED TTS SYSTEM
Nowadays the diphone is the most popular unit in synthesis
systems. A diphone includes the transition from one phoneme
to the next, and so gives a more desirable quality than other
units. In some modern systems, a combination of this unit
and other methods, such as unit selection, is used.
Kurdish has 37 phonemes, so as an upper-bound estimate it
has 37x36 = 1332 diphones. However, not all of these
combinations are valid. For example, in Kurdish the two
phonemes /x/ and /g/, or /x/ and /h/, do not succeed each
other immediately, and vowels do not form clusters. So the
number of serviceable diphones in Kurdish is less than 1300.
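The pruning described above can be sketched as follows. The phoneme names, the vowel subset, and the exclusion pairs are placeholders for illustration, not the actual Kurdish inventory:

```python
from itertools import permutations

PHONEMES = [f"p{i}" for i in range(37)]    # placeholder names for 37 phonemes
VOWELS = set(PHONEMES[:10])                # hypothetical vowel subset
INVALID = {("p10", "p11"), ("p10", "p12")} # stand-ins for pairs like /x/-/g/, /x/-/h/

diphones = [
    (a, b)
    for a, b in permutations(PHONEMES, 2)  # 37 * 36 = 1332 ordered pairs
    if (a, b) not in INVALID
    and not (a in VOWELS and b in VOWELS)  # vowels never form a cluster
]
```

With the real constraint list, the surviving count would be the paper's "less than 1300" serviceable diphones.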
After choosing the unit, we should choose a suitable
instance of each unit. For this purpose, we chose a proper
word for each diphone and then extracted the corresponding
signal using the Cool Edit software. Quality testing results
are discussed in Section 5.
5. QUALITY TESTING RESULTS
For evaluating the proposed TTS systems and comparing
them, various tests were carried out. In the first test, a set of
seven sentences produced with each system was used as the
test material. The test sets were played to 17 volunteer
listeners, all of whom were Kurdish and had no hearing
problems. The listeners were asked to rate the systems'
naturalness and overall voice quality on a scale of 1 (bad) to
5 (good). The volunteers knew nothing about the sentences
before listening to them. The test results are shown in
Table 6.
To determine the systems' intelligibility, a second test was
carried out, in which the listeners were asked to write down
the text they understood. The word recognition rate (WR)
and syllable recognition rate (SR) were then computed as:
WR = (number of correct words) / (total number of words)
SR = (number of correct syllables) / (total number of syllables)
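A small sketch of these two rates, assuming the reference and transcribed texts are already aligned word by word and that each word's syllables are given directly (alignment and syllabification are outside the sketch):

```python
def recognition_rates(reference, transcribed):
    """reference/transcribed: aligned lists of (word, syllable-list) pairs.
    Returns (WR, SR) as defined above."""
    correct_words = sum(
        1 for (rw, _), (tw, _) in zip(reference, transcribed) if rw == tw
    )
    ref_syll = [s for _, syls in reference for s in syls]
    hyp_syll = [s for _, syls in transcribed for s in syls]
    correct_syll = sum(1 for r, h in zip(ref_syll, hyp_syll) if r == h)
    wr = correct_words / len(reference)
    sr = correct_syll / len(ref_syll)
    return wr, sr
```

Note that SR can exceed WR, as in Table 7, because a word counted wrong may still contain correctly recognized syllables.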
Table 7 shows the results for the various systems.
According to these results, the intelligibility of every system
(especially that of the diphone based system) is acceptable.
In the next stage, the Diagnostic Rhyme Test (DRT) was
used to compare the systems' quality. The DRT, introduced
by Fairbanks in 1958, uses a set of isolated words to test for
consonant intelligibility in the initial position [36,37]. The
test consists of 96 word pairs that differ by a single acoustic
feature in the initial consonant. The word pairs are chosen to
evaluate the six phonetic characteristics listed in Table 8.
The listener hears one word at a time and marks on the
answer sheet which of the two words he thinks it was.
Finally, the results are summarized by averaging the error
rates from the answer sheets.
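The DRT summarization (per-feature accuracy averaged over listeners, as reported later in Table 10) can be sketched as follows; the answer-sheet data layout is an assumption:

```python
FEATURES = ["voicing", "nasality", "sustension",
            "sibilation", "graveness", "compactness"]

def drt_scores(answer_sheets):
    """answer_sheets: one dict per listener mapping feature ->
    (correct marks, total pairs presented for that feature).
    Returns percent accuracy per feature plus the overall average."""
    scores = {}
    for feat in FEATURES:
        accs = [100.0 * sheet[feat][0] / sheet[feat][1] for sheet in answer_sheets]
        scores[feat] = sum(accs) / len(accs)
    scores["average"] = sum(scores[f] for f in FEATURES) / len(FEATURES)
    return scores
```

With 96 pairs split evenly, each feature contributes 16 pairs per listener.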
For this test, we chose 96 pairs of one-syllable words,
shown in Table 9. Most of these words are meaningful in
Kurdish; the few meaningless words are shown in bold.
Because these one-syllable test words correspond directly to
units stored in the syllable database, evaluating the syllable
based system with this test could show unfairly good results,
so we decided not to carry out this test for that system. The
test sets were played to 12 volunteer listeners, ten of whom
were Kurdish and two non-Kurdish. Table 10 shows the
results of this test.
The last test carried out was the Modified Rhyme Test
(MRT). The MRT, an extension of the DRT, tests both initial
and final consonant apprehension [36,37]. The test consists
of 50 sets of six one-syllable words, for a total of 300 words.
Each set of six words is played one at a time, and the listener
marks the word he thinks he heard on a multiple-choice
answer sheet. The first half of the words is used for
evaluating the initial consonants and the second half for the
final ones. Table 11 summarizes the test format [38]. The
results are summarized as in the DRT, but the initial and
final error rates are given separately [39].
The same group of 12 listeners as in the DRT test was used
in this test. The final results are shown in Table 12.
6. CONCLUSION AND FUTURE WORKS
Nowadays, most modern TTS systems use the
concatenative method to produce artificial speech. The most
important challenge in this method is choosing an
appropriate unit for the database: the unit must guarantee
smooth, high-quality speech, and creating its database must
be reasonable and inexpensive. The syllable, phoneme,
allophone, and diphone are the units usually considered
appropriate for all-purpose systems.
In this paper, we implemented three synthesis systems for
the Kurdish language, based on the syllable, allophone, and
diphone, and compared their quality using various tests. The
diphone-based TTS system proved to be the most natural
one, while the intelligibility of all three systems is
acceptable.
The unit selection method [40] can produce high-quality,
natural output speech. Developing a TTS system using unit
selection and combining it with other methods is our goal for
future work.
Table 1: List of the proposed system's standard letters
Arabic f r r.J. <.i"
j j
..< .J

r r 1'< I"
..::. ...,.. y
,
,j
Latin - - S
s j Z IT r d x - c
c t
P
b a -
Standard X G S s j z R R d x H C c t P b a A
Arabic t..S t..S
.
- o
A
JJ
j
J J W J J
.s .s c.:; w .....
Latin y 1 e i e h u 0 U w n m II I g k Q V f
Standard v I e i Y h U 0 U w n m L I g k Q V f
Table 2: The same sentence in various scripts
Arabic Format Yi"hr.J.l...,.;...,.; ...,..J.l...,..J.l
Latin Format Dillop dillop baran gull enusetewe unime nimeys cawanim to
Standard Format diLop diLop baran guL AenUsYtewe U nime nimeyS Cawanim to
Table 5: Kurdish syllable patterns
                     Suk        Pir         Giran
Asayi    Pattern     CV         CVC         CVCC
         Example     De, To     Waz, Llx    Kurt, Berd
Lekdraw  Pattern     CCV        CCVC        CCVCC
         Example     Bro, cya   Bjar, Bzut  Xuast, Bnesht
Natewaw  Pattern     V          VC          VCC
         Example     -I         -an         -and
Table 6: First test results
Naturalness Overall Quality
Allophone Based System 2.29 2.45
Syllable Based System 2.65 3.02
Diphone Based System 3.37 3.51
Table 7: Second test results
WR SR
Allophone Based System 79.4 83.2
Syllable Based System 82.6 87
Diphone Based System 93.8 97.2
Table 8: The DRT characteristics
Characteristic   Description               Example
Voicing          Voiced - Unvoiced         veal - feel
Nasality         Nasal - Oral              need - deed
Sustension       Sustained - Interrupted   sheet - cheat
Sibilation       Sibilated - Unsibilated   sing - thing
Graveness        Grave - Acute             weed - reed
Compactness      Compact - Diffuse         key - tea
Table 9: Kurdish minimal-pair words used in the DRT test
Voicing Nasality Sustention Sibilation Graveness Compactness
ban tan mez tez sirr Cirr coll koll bar dar turd kurd
gall kall merd terd ver ber soz toz perr terr fan han
bull pull nan dan firr pirr sam tam berd derd fall hall
din tin nez dez cern jir gir fall tall var yar
der ter mall tall sill em jar yar pek tek due kue
zall sall min tin van ban sall tall pirs tirs tam kam
zerd serd mil til til pil zin tin fam tam torr korr
gorr korr nas das fall pall sem tern birr dirr dall gall
viz tiz nem dem sen cen zam tam boll doll ferd herd
zuer suer maf taf var bar zil til fer ter kli
birr pirr nos sall call zem tern pem tern derd gerd
ders ters ner der sorr corr call kall ball dall vall yall
gom com nill dill cing saf taf ban dan tue kue
ZlIT SlIT nerd derd tis pis kern ferz terz dirr girr
girr cirr man tan fas pas kirr til til tall kall
vam fam mas vor bor sex tex pall tall tern kern
Table 10: The results of the DRT test
Voicing Nasality Sustension Sibilation Graveness Compactness Average
Allophone Based System 96.87 97.39 94.27 95.83 95.31 97.91 96.26
Syllable Based System --- --- --- --- --- --- ---
Diphone Based System 97.91 98.95 95.31 96.35 97.39 98.43 97.39
Table 11: Examples of the response sets in the MRT
A B C D E F
1 bar ban bax barr bas
2 korr koll kox kot kok
...
26 ban tan man yan wan san
27 dall hall gall mall fall yall
28 hoz toz soz moz qoz poz
...
Table 12: The final MRT results
Initial position Final Position Average
Allophone Based System 77.7 81.4 79.55
Syllable Based System --- --- ---
Diphone Based System 88.4 86.4 87.4
REFERENCES
[1] H. Al-Muhtaseb, M. Elshafei, and M. Al-Ghamdi, "Techniques for
High Quality Arabic Speech Synthesis," Information Sciences,
Elsevier, 2002.
[2] T. Styger and E. Keller, "Formant synthesis," in E. Keller (ed.),
Fundamentals of Speech Synthesis and Speech Recognition: Basic
Concepts, State of the Art, and Future Challenges, 109-128,
Chichester: John Wiley, 1994.
[3] D. H. Klatt, "Software for a Cascade/Parallel Formant Synthesizer,"
Journal of the Acoustical Society of America, vol. 67, 971-995,
1980.
[4] W. Hamza, Arabic Speech Synthesis Using Large Speech Database,
Ph.D. thesis, Cairo University, Electronics and Communications
Engineering Department, 2000.
[5] R. E. Donovan, Trainable Speech Synthesis, Ph.D. thesis, Cambridge
University, Engineering Department, 1996.
[6] S. Lemmetty, Review of Speech Synthesis Technology, M.Sc. thesis,
Helsinki University of Technology, Department of Electrical and
Communications Engineering, 1999.
[7] A. Youssef et al., "An Arabic TTS System Based on the IBM
Trainable Speech Synthesizer," Le traitement automatique de
l'arabe, JEP-TALN 2004, Fes, 2004.
[8] J. P. Olive, "Rule synthesis of speech from dyadic units," ICASSP,
pp. 568-570, 1977.
[9] A. Syrdal, "Development of a female voice for a concatenative text-
to-speech synthesis system," Current Topics in Acoust. Res., 1:169-
181, 1994.
[10] J. Olive, J. van Santen, B. Moebius, and C. Shih, Multilingual
Text-to-Speech Synthesis: The Bell Labs Approach, pp. 191-228,
Kluwer Academic Publishers, Norwell, Massachusetts, 1998.
[11] M. Beutnagel, A. Conkie, and A. K. Syrdal, "Diphone Synthesis
Using Unit Selection," Third ESCA/COCOSDA Workshop (ETRW)
on Speech Synthesis, ISCA, 1998.
[12] R. Sproat, J. Hu, and H. Chen, "EMU: An e-mail pre-processor for
text-to-speech," Proc. IEEE Workshop on Multimedia Signal
Processing, pp. 239-244, Dec. 1998.
[13] C. H. Wu and J. H. Chen, "Speech Activated Telephony Email
Reader (SATER) Based on Speaker Verification and Text-to-Speech
Conversion," IEEE Trans. Consumer Electronics, vol. 43, no. 3,
pp. 707-716, Aug. 1997.
[14] A. Black, CHATR, Version 0.8, a generic speech synthesis system,
system documentation, ATR Interpreting Telecommunications
Laboratories, Kyoto, Japan, March 1996.
[15] A. Hunt and A. Black, "Unit selection in a concatenative speech
synthesis system using a large speech database," ICASSP, 1:373-
376, 1996.
[16] M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A. Syrdal,
"The AT&T NEXT-GEN TTS System," Joint Meeting of ASA, EAA,
and DAGA, 1999.
[17] T. Dutoit, High Quality Text-To-Speech Synthesis of the French
Language, Ph.D. dissertation, Faculte Polytechnique de Mons,
1993.
[18] T. Dutoit et al., "The MBROLA project: towards a set of high
quality speech synthesizers free of use for non-commercial
purposes," ICSLP 96 Proceedings, Fourth International Conference,
IEEE, 1996.
[19] F. Chouireb, M. Guerti, M. NaIl, and Y. Dimeh, "Development of a
Prosodic Database for Standard Arabic," Arabian Journal for
Science and Engineering, 2007.
[20] A. Ramsay and H. Mansour, "Towards including prosody in a text-to-
speech system for modern standard Arabic," Computer Speech &
Language, Elsevier, 2008.
[21] I. Amdal and T. Svendsen, "A Speech Synthesis Corpus for
Norwegian," LREC'06, 2006.
[22] K. Yoon, "A prosodic phrasing model for a Korean text-to-speech
synthesis system," Computer Speech & Language, Elsevier, 2006.
[23] P. Zervas, I. Potamitis, N. Fakotakis, and G. Kokkinakis, "A Greek
TTS based on non-uniform unit concatenation and the utilization of
Festival architecture," First Balkan Conference on Informatics,
Thessaloniki, Greece, pp. 662-668, 2003.
[24] A. Farrohki, S. Ghaemmaghami, and M. Sheikhan, "Estimation of
Prosodic Information for Persian Text-to-Speech System Using a
Recurrent Neural Network," ISCA, Speech Prosody 2004,
International Conference, 2004.
[25] M. Namnabat and M. M. Homayunpoor, "Letter-to-Sound in Persian
Language Using Multi-Layer Perceptron Neural Network," Iranian
Electrical and Computer Engineering Journal, 2006 (in Persian).
[26] H. R. Abutalebi and M. Bijankhan, "Implementation of a Text-to-
Speech System for Farsi Language," Sixth International Conference
on Spoken Language Processing (ISCA), 2000.
[27] F. Hendessi, A. Ghayoori, and T. A. Gulliver, "A Speech Synthesizer
for Persian Text Using a Neural Network with a Smooth Ergodic
HMM," ACM Transactions on Asian Language Information
Processing (TALIP), 2005.
[28] F. Daneshfar, W. Barkhoda, and B. ZahirAzami, "Implementation of
a Text-to-Speech System for Kurdish Language," ICDT'09, Colmar,
France, July 2009.
[29] W. Barkhoda, F. Daneshfar, and B. ZahirAzami, "Design and
Implementation of a Kurdish TTS System Based on Allophones
Using Neural Network," ISCEE'08, Zanjan, Iran, 2008 (In Persian).
[30] W. M. Thackston, Sorani Kurdish: A Reference Grammar with
Selected Reading, Harvard: Iranian Studies at Harvard University,
2006.
[31] A. Rokhzadi, Kurdish Phonetics and Grammar, Tarfarnd press,
Tehran, Iran, 2000 (In Persian).
[32] M. Kaveh, Kurdish Linguistic and Grammar (Saqizi accent), Ehsan
Press, first edition, Tehran, ISBN 964-356-355-3, 2005 (In Persian).
[33] S. Baban, Phonology and Syllabication in Kurdish Language,
Kurdish Academy Press, first edition, Arbil, 2005 (In Kurdish).
[34] J. R. Deller et al., Discrete-Time Processing of Speech Signals, John
Wiley and Sons, 2000.
[35] M. N. Rao, S. Thomas, T. Nagarajan, and H. A. Murthy, "Text-to-
Speech Synthesis using syllable-like units," National Conference on
Communication, India, 2005.
[36] M. Goldstein, "Classification of Methods Used for Assessment of
Text-to-Speech Systems According to the Demands Placed on the
Listener," Speech Communication, vol. 16: 225-244, 1995.
[37] J. Logan, B. Greene, and D. Pisoni, "Segmental Intelligibility of
Synthetic Speech Produced by Rule," Journal of the Acoustical
Society of America, vol. 86 (2): 566-581, 1989.
[38] Y. Shiga, Y. Hara, and T. Nitta, "A Novel Segment-Concatenation
Algorithm for a Cepstrum-Based Synthesizer," Proceedings of
ICSLP 94, (4): 1783-1786, 1994.
[39] D. Pisoni and S. Hunnicutt, "Perceptual Evaluation of MITalk: The
MIT Unrestricted Text-to-Speech System," Proceedings of ICASSP
80, vol. 5: 572-575, 1980.
[40] H. Sak, A Corpus-Based Concatenative Speech Synthesis System for
Turkish, M.Sc. thesis, Bogazici University, 2004.

and compare these systems in intelligibility. in most cases there is not a clear and constant rule for all of them. for example if a word end with a voiced phoneme. Architecture ofthe proposed KurdishTTS system Fig. Ipl is presented by five allophones but Irl has only one allophone [32]). After converting the input raw text to the standard text. only 66 important and clear instances have been included. neural networks can learn from a database and can recognize allophones properly. (See Fig. and overall quality. the phoneme would lose the voicing feature and is called devoiced [32]. However. Each set of input layer is responsible for one of the phonemes in the window. Others (about 20%) may be dependent on the current. However. one preceding and two succeeding phonemes [32]. The rest of the paper is organized as follows: Section 2 introduces the allophone based TTS system.In concatenative systems. a Text-To-Speech system for Kurdish language. In this paper we develop three various concatenative TIS systems for Kurdish language based on syllable. Separation of allophones in waveforms was 978-1-4244-5950-6/09/$26.. Notice that this is more than the number of Kurdish phonemes. Also the allophones are not divided equally between all phonemes (e. Also. Kurdish phonemes have about approximately 200 allophones. for extracting allophones we used a neural network. and non-expert people can not detect the differences [32]. because we also include three standard symbols for space.g. in standard script. The network detects second phoneme's allophone. ~ Allophone to Sound Fig. Table 1 shows all the standard letters that are used by the proposed system. naturalness. the standard converter performs standard text normalization tasks such as converting digits into their word equivalents. each having 41 neurons for detection of the 41 mentioned standard symbols. 1. a neural network module and an allophone-to-sound module. 
and the allophone waveform is concatenated to the preceding waveform. Neural Network Allophone . The neural network structure The pre-processing module includes a text normalizer and a standard converter. For each allophone we selected a suitable word and recorded it in a noiseless environment. proper allophones from database have been chosen and concatenated to obtain the primary output. which is constructed based on concatenation method of speech synthesis and use allophones (several pronunciation of a phoneme [34]) as basic unit will be introduced [28. It is composed of three major components: a pre-processing module. A sliding window of width four provides input phonemes for the network input layer. and diphones. 41 standard symbols were spotted. Conclusions are drawn in Section 6. The text normalizer is an application that converts the input text (in Arabic or Latin script) to our standard script. 2. Major Kurdish allophones (more than 80% of them) are dependent only on two consecutive phonemes. Table 2 shows the same sentence in various scripts. allophones are extracted from the standard text. Fig. we employed four sets of neurons in the input layer. 1 shows the architecture of the proposed system. 2). in this conversion we encountered some problems [30-33].29]. see Table 3). IiI target output 66 output units 0000 . it is not necessary for our TTS system to include all of them (for simplicity. In the next stage. neural network output and desired output are compared. etc. the neural network implementation is very flexible as it is very simple to change the number of allophones or phonemes. 2. therefore. comma and dot. comparison between these systems and quality test results are presented in Section 5.. This task is done using a neural network. As a result. Each unit has its own advantages and disadvantages. ALLOPHONE BASED TIS SYSTEM In this part. Hence. 
and might be appropriate for a specific system.0000 000000000000000 I Text Normalizer II Standard Convert er II--- Standard Text Speech ~ o o ro ro I i / 1 1 ~ target letter n 60 hidden units 000 °° o 4 groups of 41 input units 4 Letters of text input L. and finally. spelling out some known abbreviations. The neural network accuracy rate is equal to 98%. In Table 4. a sliding window of width offour is used as the network input. The output layer has 66 neurons (corresponding to the 66 Kurdish allophones used here) for the recognition of the corresponding allophones and the middle layer is responsible for detecting language rules and it has 60 neurons (these values are obtained empirically). After allophone recognition. According to the input text. we preferred to explicitly use allophone units for the concatenative method. Because their learning power.. one of the most important challenges is to select an appropriate unit for concatenation. Section 3 and 4 presents syllable and diphone based systems respectively. allophone.00 ©2009 IEEE 558 . Differences between allophones in Kurdish language are normally very clear. Some of allophones obey obvious rules. but some of them are very similar. Preprocessing Raw Text Finally. As a result. corresponding waveform of allophones should be concatenated. The aim is to recognize the relevant allophone to the second phoneme of the window.

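The final synthesis step of Section 2, in which each chosen unit's waveform is appended to the preceding output, can be sketched as below. The database entries are invented ramps standing in for recorded allophone waveforms; the plain concatenation mirrors the direct joining described in the text, with no smoothing applied.

```python
import numpy as np

# Hypothetical database: allophone label -> recorded waveform (float samples).
database = {
    "b":  np.linspace(0.0, 0.5, 80),
    "a1": np.linspace(0.5, -0.5, 120),
    "n":  np.linspace(-0.5, 0.0, 90),
}

def synthesize(allophone_sequence, db):
    """Concatenate each unit's waveform onto the preceding output waveform."""
    out = np.zeros(0)
    for unit in allophone_sequence:
        out = np.concatenate([out, db[unit]])
    return out

# The output length is just the sum of the unit lengths: 80 + 120 + 90.
speech = synthesize(["b", "a1", "n"], database)
```

In a real system each dictionary value would be audio loaded from the recorded words, and the result would be written out through a sound device or file.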
3. SYLLABLE BASED TTS SYSTEM

The syllable is another unit which is used for developing text-to-speech systems. Various languages have different patterns for the syllable. In some languages there are many syllable patterns, and therefore the number of syllables is large; for example, there are more than 15,000 syllables in English [6]. Creating a database for this number of units is a very difficult and time-consuming task, so the syllable is usually not used in all-purpose TTS systems; however, in some modern systems a combination of this unit and other methods, such as unit selection, is used, and in [35] some syllable-like units are used.

In other languages the number of syllable patterns is limited, so the number of syllables is small. For example, the Indian languages have a small set of patterns, such as CV, CCV, and CVC, and the total number of syllables in such a language is about 10,000. The syllable is also used in some Persian TTS systems [26]; Persian has only the CV, CVC, and CVCC patterns for its syllables, and so its syllable count is limited to 4,000 [26].

Kurdish has 9 syllable patterns and three main groups of syllables, namely Asayi, Lekdraw, and Natewaw [33]; each group is divided into three subgroups, Suk, Pir, and Giran [33]. Table 5 shows these groups with their corresponding patterns and appropriate examples. Asayi is the most important group and includes most of the Kurdish syllables; three of its patterns, CV, CVC, and CVCC, are the most used in the Kurdish language. In the Lekdraw group, two consonant phonemes occur at the onset of the syllable and make a cluster, giving patterns such as CCV (see the examples in Table 5); note that in Kurdish, vowels do not form a cluster. The two groups Lekdraw and Natewaw are seldom used in practice. According to Table 5 and this fact, we can consider only the Asayi group in our implementation; as an upper-bound estimate, the number of database syllables is then less than 4,000. Hence, we extended our TTS system using these syllables. The results of this system, and its comparison with the other systems, are presented in Section 5.

4. DIPHONE BASED TTS SYSTEM

Nowadays the diphone is the most popular unit in synthesis systems. Diphones include the transition part from one unit to the next, and so they give a more desirable quality than other units; moreover, the number of diphones in a language is limited, and creating a database for them is reasonable, so this unit can be used in all-purpose TTS systems.

Kurdish has 37 phonemes; hence it has 37 x 36 = 1332 diphones. However, not all of these combinations are valid: for example, in Kurdish the two phonemes /x/ and /g/, or /x/ and /h/, do not succeed each other immediately. Therefore, the number of serviceable diphones in Kurdish is less than 1,300; in our system, the number of this unit is 1242.

After choosing the unit, we should choose a suitable instance of each unit. We chose a proper word for each diphone and then extracted its corresponding signal manually, using the COOL EDIT and Wavesurfer software. The quality testing results are discussed in Section 5.

5. QUALITY TESTING RESULTS

To evaluate our proposed TTS systems and compare them, various tests have been carried out. In the first test, a set of seven sentences produced with each system was used as the test material. The test sets were played to 12 volunteer listeners; ten of them were Kurds and two were non-Kurds, and the volunteers did not know anything about the sentences before listening to them. The listeners were asked to rate the systems' naturalness and overall voice quality on a scale of 1 (bad) to 5 (good). The obtained results are shown in Table 6; according to these results, the diphone based system is the most natural one.

To determine the systems' intelligibility, a second test has been carried out, in which the listeners were asked to write down the text they understood. The test sets were played to 17 volunteer listeners, all of whom were Kurds without any hearing problems. The word rate (WR) and syllable rate (SR) were then computed using the following equations:

WR = (number of correctly recognized words) / (total number of words)
SR = (number of correctly recognized syllables) / (total number of syllables)

Table 7 shows the results for the various systems. All systems' intelligibilities (especially that of the diphone based system) are acceptable.

Next, the Diagnostic Rhyme Test (DRT) has been used to compare the systems' quality. The DRT uses a set of isolated words to test for consonant intelligibility in the initial position [36,37]. The test consists of 96 word pairs that differ by a single acoustic feature in the initial consonant; the pairs are chosen to evaluate the six phonetic characteristics listed in Table 8. The listener hears one word at a time and marks on the answer sheet which one of the two words he thinks is correct. For this test we chose 96 pairs of one-syllable Kurdish words; most of these words are meaningful in Kurdish, and the few meaningless ones are shown in bold style in Table 9. Because the test words are single syllables, evaluating the syllable based system with them may show unfairly good results, and we decided not to carry out this test for the syllable based system. Finally, the results are summarized by averaging the error rates from the answer sheets; Table 10 shows the results of this test.

The last test which has been carried out is the Modified Rhyme Test (MRT). The MRT is a sort of extension to the rhyme test introduced by Fairbanks in 1958, and it tests for both initial and final consonant apprehension [36,37]. The test consists of 50 sets of six one-syllable words, which makes a total set of 300 words; the first half of the words is used for the evaluation of the initial consonants and the second half for the final ones. Each set of six words is played one word at a time, and the listener marks on a multiple choice answer sheet which word he thinks he hears.
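A minimal sketch of the WR and SR computation from the second test, with invented transcriptions; it counts position-wise matches against the reference, which is a simplification of scoring real handwritten answer sheets.

```python
def word_rate(reference_words, transcribed_words):
    """WR = correctly recognized words / total words (position-wise)."""
    correct = sum(r == t for r, t in zip(reference_words, transcribed_words))
    return correct / len(reference_words)

def syllable_rate(reference_sylls, transcribed_sylls):
    """SR = correctly recognized syllables / total syllables."""
    correct = sum(r == t for r, t in zip(reference_sylls, transcribed_sylls))
    return correct / len(reference_sylls)

# One listener's answer sheet for a short sentence (hypothetical data).
ref_words = ["dilop", "dilop", "baran", "gul", "to"]
hyp_words = ["dilop", "dilop", "baran", "kul", "to"]
wr = word_rate(ref_words, hyp_words)       # 4 of 5 words correct

ref_sylls = ["di", "lop", "di", "lop", "ba", "ran", "gul", "to"]
hyp_sylls = ["di", "lop", "di", "lop", "ba", "ran", "kul", "to"]
sr = syllable_rate(ref_sylls, hyp_sylls)   # 7 of 8 syllables correct
```

The per-listener rates would then be averaged over all 17 listeners to fill one row of Table 7.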

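The diphone counting argument of Section 4 can be illustrated as below. The phoneme labels and the two forbidden transitions are placeholders, not real Kurdish phonemes; only the counts (37 phonemes, 37 x 36 = 1332 ordered pairs, reduced by invalid successions) come from the text.

```python
# Hypothetical 37-symbol phoneme inventory; real Kurdish labels differ.
phonemes = [f"p{i}" for i in range(37)]

# Stand-ins for successions the paper says cannot occur in Kurdish,
# such as /x/ followed by /g/, or /x/ followed by /h/.
forbidden = {("p0", "p1"), ("p0", "p2")}

# All ordered pairs of distinct phonemes, then drop the invalid ones.
all_pairs = [(a, b) for a in phonemes for b in phonemes if a != b]
valid = [pair for pair in all_pairs if pair not in forbidden]

print(len(all_pairs))  # 37 * 36 = 1332
print(len(valid))      # 1330 after removing the two forbidden transitions
```

With the full list of impossible Kurdish successions, this filter would bring the inventory down to the roughly 1242 serviceable diphones the system stores.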
Table 11 summarizes the test format [38]. The same group of 12 listeners as in our DRT test has been used in this test. The results are summarized as in the DRT, but the initial and final error rates are given individually [39]. The final results are shown in Table 12.

6. CONCLUSION AND FUTURE WORKS

Nowadays, most modern TTS systems use the concatenative method to produce artificial speech. The most important challenge in this method is choosing an appropriate unit for the database: the unit must warranty smoothness and high quality speech, and creating its database must take reasonable resources and be inexpensive. Phoneme, allophone, syllable, and diphone are usually considered as appropriate units for all-purpose systems. In this paper, we implemented three synthesis systems for the Kurdish language, based respectively on the syllable, the allophone, and the diphone, and compared their quality using various tests. The diphone based TTS system showed to be the most natural one, while all systems' intelligibilities are acceptable. The unit selection method [40] can produce high quality and natural output speech; developing a TTS system using unit selection, and combining it with other methods, is our goal for future works.

[Table 1: List of the proposed system's standard letters (the 41 standard symbols with their Arabic and Latin script equivalents).]

Table 2: The same sentence in various scripts.
Latin format: Dillop dillop baran gull enusetewe unime nimeys cawanim to
Standard format: diLop diLop baran guL AenUsYtewe U nime nimeyS Cawanim to

Table 5: Kurdish syllable patterns.
Asayi: Suk CV (de, to); Pir CVC (waz); Giran CVCC (kurt, berd)
Lekdraw: Suk CCV (bro, cya); Pir CCVC (bjar, bzut); Giran CCVCC (xuast, bnesht)
Natewaw: Suk V (-i); Pir VC (-an); Giran VCC (-and)

[Table 6: First test results (naturalness and overall voice quality of the allophone, syllable, and diphone based systems, rated on the 1 to 5 scale).]

[Table 7: Second test results (WR and SR, in percent, for the allophone, syllable, and diphone based systems).]

Table 8: The DRT characteristics.
Voicing: Voiced vs. Unvoiced (veal / feel)
Nasality: Nasal vs. Oral (reed / deed)
Sustension: Sustained vs. Interrupted (sheet / cheat)
Sibilation: Sibilated vs. Unsibilated (sing / thing)
Graveness: Grave vs. Acute (weed / reed)
Compactness: Compact vs. Diffuse (key / tea)

[Table 9: Kurdish minimal pair words used in the DRT test (96 one-syllable pairs covering the six characteristics, for example ban / tan, mez / tez, gall / kall, merd / terd, birr / pirr; the few meaningless words are shown in bold in the original).]

[Table 10: The results of the DRT test (percent correct for Voicing, Nasality, Sustension, Sibilation, Graveness, and Compactness, with their average, for the allophone and diphone based systems; the syllable based system was not tested).]

Table 11: Examples of the response sets in the MRT.
Set 1: bar, ban, bas, bax, barr, ...
Set 2: korr, koll, kok, kox, kot, ...
Set 3: ban, tan, san, yan, wan, ...
Set 4: dall, hall, yall, mall, fall, ...
Set 5: hoz, toz, poz, moz, qoz, ...

[Table 12: The final MRT results (initial position and final position error rates for the allophone, syllable, and diphone based systems).]
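The rhyme-test bookkeeping described in Section 5 (averaging per-listener error rates, and, for the MRT, reporting the initial-consonant and final-consonant halves separately) can be sketched as follows, with invented answer sheets; the word lists are tiny stand-ins for the 300-word MRT material.

```python
def error_rate(answers, correct):
    """Fraction of items one listener got wrong on an answer sheet."""
    wrong = sum(a != c for a, c in zip(answers, correct))
    return wrong / len(correct)

def summarize(sheets, correct, split=None):
    """Average error rates over listeners; optionally split the test in half,
    as the MRT does for initial versus final consonant positions."""
    if split is None:
        rates = [error_rate(sheet, correct) for sheet in sheets]
        return sum(rates) / len(rates)
    initial_half = summarize([s[:split] for s in sheets], correct[:split])
    final_half = summarize([s[split:] for s in sheets], correct[split:])
    return initial_half, final_half

# Two hypothetical listeners, four items (first half initial, second final).
correct = ["bar", "korr", "mall", "toz"]
sheets = [["bar", "koll", "mall", "toz"],
          ["bar", "korr", "mall", "moz"]]
initial, final = summarize(sheets, correct, split=2)
# Each listener made one error in a different half, so both averages are 0.25.
```

The DRT summary is the `split=None` case, computed separately for each of the six characteristics of Table 8.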

Chen. Syrdal. and B. [37] 1. "Diphone Synthesis using Unit Selection. Zervas. Conkie. and B. F. ISCA . R. "Software for a Cascade/Parallel Formant Synthesizer. 1. Sheikhan. and A." ACM Transactions on Asian Language Information [4] W. [5] R. thesis. no. [38] Y. Olive. Logan. [15] A." Journal of the Acoustical Society ofAmerica. [6] S. Elsevier. M. Hara. dissertation. Sorani Kurdish: A Reference Grammar with Selected Reading. H. Beutnagel. 1996. International Conference. M. Fes. [33] S.Sc Thesis." The Third ESCAICOCOSDA Workshop (ETRW) on Speech Synthesis. M. 2002. State of the Art. [2] T.716. A. S. 1998. Thackston. 1. and M." ICASSP. Festival architecture. Chichester: John Wiley. ZahirAzami. Hendessi. Y. 1:373376. Rokhzadi. "Development of a female voice for a concatenative textto-speech synthesis system.00 ©2009 IEEE 562 . pages 191-228." Sixth International Conference on Spoken Language Processing (ISCA). 707. [23] P. N. "Unit selection in a concatenative speech synthesis system using a large speech database. IEEE. thesis. "A Greek TTS based on Non uniform unit concatenation and the utilization of 978-1-4244-5950-6/09/$26. pp.D. "Towards including prosody in a text-tospeech system for modern standard Arabic. Vol 67. PhD. M. 1999. S. Syrdal. first edition. [20] A." Proceedings of ICASSP 80. Homayunpoor. 1. 1994. Tehran. [3] D. H." Computer Speech & Language. 2005. Chen. [24] A. "Text-toSpeech Synthesis using syllable-like units. Proceedings. voL 5: 572-575. Rao. Massachusetts. Bogazici University. [31] A. Baban. JASA vol. Colmar. Daneshfar. Dimeh. Ghayoori. Bijankhan. and DAGA. A. Ramsay and H. 1997. Kurdish Phonetics and Grammar. NaIl. Lemmetty. [26] H. "Segmental Intelligibility of Synthetic Speech Produced by Rule. Y. [8] [9] [10] 1. [17] T. and C." ICASSP. Deller. and A." ISCEE'08. Version 0. Thessalonica. ISBN 964-356-355-3. PhD. 16: 225-244. Ph. Mansour. [34] R. 1. Pisoni. 2008. 2000 (In Persian). submitted at the Faculte Polytechnique de Mons. Dutoit. 
van Santen. Kokkinakis. "Speech Activated Telephony Email Reader (SATER) Based on Speaker Verification and Text-toSpeech Conversion. "Classification of Methods Used for Assessment of Text-to-Speech Systems According to the Demands Placed on the Listener. 2007. Conkie. March 1996. Harnza. Electronics and Communications Engineering Department. EAA. Stylianou.Sc. pp. Black. Nitta. and D. A. Elshafeil. July 2009. Elsevier. Potamitis. [40] H. Goldstein. 2000. 662-668. "Development of a Prosodic Database for Standard Arabic. Tehran. 2008 (In Persian). 2000. Review of Speech Synthesis Technology. Ehsan Press. 2005 (In Kurdish). Thomas. [27] F. Olive. "The AT&T NEXT-GEN TTS System. Keller.REFERENCES [1] H. and T." ISCA. Kluwer Academic Publishers. W. Abutalebi and M. Greene. Arbil." Joint Meeting ofASA." ICDT'09. et aI. 1:169181." IEEE Trans. Barkhoda.1998. Beutnagel. 86 (2): 566-581. Elsevier. "Letter-to-Sound in Persian Language Using Multy Layer Perceptron Neural Network. [32] M. CHATR. 971-995. Greece. Cairo University. Helsinki University of Technology. [11] M. Iran. A. Syrdal. [7] A. 2000. Shih. pages 568-570. and T. Fourth International Conference. 2006 (in Persian). [29] W. B. and M.. AI-Muhtasebl." Proceedings of ICSLP 94. [14] A. Fundamentals ofSpeech Synthesis and Speech Recognition: Basic Concepts. John Wiley and Sons. 2005. "Techniques for High Quality Arabic Speech Synthesis. Aug.1980. E. Sak. Donovan." ICSLP 96. Ghaemmaghami. M. 2005 (In Persian). Nagarajan. Moebius.. Kaveh. [16] M. Fakotakis. 2006. Res." Information sciences. AI-Ghamdi. "Design and Implementation of a Kurdish TTS System Based on Allophones Using Neural Network." Proc. Amdal and T. Processing (TALIP). (ed. K. G. Black. Youssef." Iranian Electrical and Computer Engineering Journal. and Future Challenges Formant synthesis." Computer Speech & Language. [25] M. Guerti. 1994. Farrohki. 2006.. Arabic Speech Synthesis Using Large Speech Database. and H. 239-244. 
Phonology and Syllabication in Kurdish Language. India. 1980. 1996. Zanjan. France. "Estimation of Prosodic Information for Persian Text-to-Speech System Using a Recurrent Neural Network. vol. a generic speech synthesis. Kurdish Academy Press. pp. JEP-TALN 2004." First Balkan Conference on Informatics. 1977. Irec'06. Iran. et ai. A Corpuse-Based Concatenative Speech Synthesis Systemfor Turkish. Trainable Speech Synthesis. 1. 2004. Sproat. Hunt and A. Namnabat and M. "Implementation of a Text-toSpeech System for Farsi Language. Discrete time processing of speech signals." Arabian Journal for Science and Engineering. Hunnicutt. A." Journal of the Acoustical Society of America. [22] K. Multilingual Textto-Speech Synthesis: The Bell Labs Approach. Dutoit. Wu and 1. 2006. and Y. ATR-Interpreting Telecommunications Laboratories. Norwell. N. 109-128. [30] W. Schroeter. [39] D." Speech Communication. IEEE Workshop on Multimedia Signal Proc. [36] M. A. "Emu: An e-mail pre-processor for text-to-speech. [12] R. 1994. 2004. [21] I. vol. Speech Prosody 2004. ZahirAzami. Barkhoda.). Kurdish Linguistic and Grammar (Saqizi accent)." National Conference on Communication. Cambridge University. H. Pisoni and S. "Rule synthesis of speech from diadic units. Yoon. "A Speech Synthesis Corpus for Norwegian". "Implementation of a Text-to-Speech System for Kurdish Language. first edition. Consumer Electronics. 2004. P. Murthy. Chouireb. Harvard: Iranian Studies at Harvard University. and H." Current Topics in Acoust. Svendsen. High Quality Text-To-Speech Synthesis of the French Language. 43. M. Department of Electrical and Communications Engineering. "The MBROLA project: towards a set of high quality speech synthesizers free of use of non commercial purposes. In Keller E. Hu." Le traitement automatique de l'arabe. I. [13] C. Klatt. Japan. [18] T. et ai. "An Arabic TTS System Based on the IBM Trainable Speech Synthesizer. 1996.8. 1995. Kyoto. [28] F. Thesis. 
"Perceptual Evaluation of MITalk: The MIT Unrestricted Text-to-Speech System. Dec. [35] M. 1989. System documentation. Shiga. 1999. "A Novel Segment-Concatenation Algorithm for a Cepstrum-Based Synthesizer. 1998. 3. 2003. (4): 1783-1786. Daneshfar. Gulliver. Styger and E. 1993. Tarfarnd press. "A prosodic phrasing model for a Korean text-to-speech synthesis system. T. "A Speech Synthesizer for Persian Text Using a Neural Network with a Smooth Ergodic HMM. Engineering Department. [19] F. B. M.