Professional Documents
Culture Documents
conversion, Syllabification of words, and Prosody form and structure but also in their typical uses and
(intonation, stress, phoneme durations, pitch, and functions. Literary Sinhala is generally considered the
accent) were carried out since they are vital to our ‘higher’ variety in that its structure is closer to the
TTS. classical literary idiom. It is used in all forms of non-
fictional writing, including news bulletins, and in
3. Segmental Features of Sinhala electronic media. News is read rather than spoken.
Different genres of fiction use a mixture of both:
literary Sinhala for narration and spoken Sinhala for
3.1. Sinhala phonemic inventory
dialog. Spoken Sinhala is used in all face-to-face
communication.
3.1.1. Background. The Sinhala language is a member
of the Indo-Aryan subfamily, which is a member of a 3.2.2. Phonemes in Sinhala. Spoken Sinhala contains
still larger family of languages known as Indo- 40 segmental phonemes; 14 vowels and 26 consonants,
European. Sinhala is the official language of Sri Lanka including a set of 4 pre-nasalized voiced stops peculiar
and the mother tongue of the majority of the people to Sinhala, as classified below in Table 1 and Table 2.
constituting about 74% of its population. Sinhala [4]
language is presented in two major varieties: the
Spoken and the Literary. They differ not only in their
474
Working Papers 2004-2007
475
Sinhala
Syllabification Procedure for Nishpanna focused preliminarily due to this reason. For this
Category: purpose, a carefully chosen list of related words1 was
a. Reach the first vowel of the given word and then, presented to the distinguished scholars of Sinhala and
1. If the phoneme next to this vowel is a Sanskrit for syllabification and recommendations. By
consonant followed by another vowel, mark carefully analyzing the gathered information and
the syllable boundary just after the first views, a set of rules have been identified on how to
vowel, i.e., word having a consonant-vowel syllabify the borrowed Sanskrit. Critical analysis of
structure of xVCV.. should be syllabified as gathered information and views, lead to a valid
(xV)(CV..), where x denotes either a single syllable structures followed by a set of rules to
consonant or zero consonant. syllabify the borrowed Sanskrit words. Syllabic
structures for those words can be represented as
2. If the phoneme next to this vowel is (C)(C)(C)V(C)(C)(C). It is also noteworthy to
consonant followed by another consonant mention that, syllabic structures for words belong to
and a vowel, mark the syllable boundary the category of Nishpanna i.e. (C)V(C) is actually a
just after the first consonant, i.e., word subset of these structures.
having a consonant-vowel structure of Languages are unique in syllable structures.
xVCCV.., should be syllabified as Syllabification of words belongs to the categories of
(xVC)(CV..), where x denotes either a Thathsama and Thadbhava does not completely
single consonant or zero consonant. adhere to syllabification rules imposed in the
language from which the word is originated.
3. If the phoneme next to this vowel is another Syllabification of such words will naturally be altered
vowel, mark the syllable boundary just after according to the ease of pronunciation and the
the first vowel, i.e., word having a existing syllable structures in Sinhala. This view was
consonant-vowel structure of xVV.., should expressed by 100 % of the distinguished scholars to
be syllabified as (xV)(V..), where x denotes whom the chosen list of word was presented. As a
either a single consonant or zero consonant. proof of this fact, it was evident that the
Only few words were found in Sinhala syllabification of all most all the words borrowed
having two consecutive vowels in a single from languages other than Sanskrit (Pali, Tamil and
word. eg. “giriullt\” (Name of city.), English) are also consistent with the above
“aa:v\ ”(Alphabet). Thus, the syllable determined set of syllabification rules and the
structure (V) mostly occurs at the beginning syllabic structures.
of the word, except for above listed words.
Syllabification procedure for Thathsama and
b. Having marked the first syllable boundary, Thadbhava category:
continue the same procedure for the a. Reach the first vowel of the given word and then,
rest phonemes as a new word, i.e., repeat the step (a)
to the rest of the word, until the whole word is 1. If the vowel is preceded by a consonant cluster
syllabified. of 3, followed by a vowel,
The accuracy was tested by working out with the
examples given in the literature. Since results were • If the consonant preceded by the last vowel
consistent, it was concluded that above set of rules is /r/ or /y/ then mark the syllable boundary
describes the accurate syllabification procedure for after the first consonant of the consonant
the words belonging to the category of Nishpanna. cluster, i.e., word having a consonant-vowel
structure of xVCC[/r/ or /y/]V.., should be
4.1.3. Syllabification of Words Belonging to the syllabified as (xVC)(C[/r/ or /y/]V..), where
Category of Thathsama and Thadbhava. In x denotes zero or any number of consonants.
Syllabification, words borrowed from Sanskrit play a
major role than the words borrowed from other • In the above rule, if the consonant preceded
languages. Reason behind is, distribution of large by the last vowel is phoneme other than /r/
number of Sanskrit originated words and the or /y/ then,
complexity of codas and onsets of those words.
Defining proper syllabic structures for words
borrowed or either derived from Sanskrit had been 1
See Appendix A: Word-List.
476
Working Papers 2004-2007
− If the first two consonants in the In a word such as /sampre:kß\n\/ (transmit) the /p/
consonant cluster are both stop can be interpreted as a coda with respect to the
consonants, mark the syllable boundary preceding vowel -/samp/re:k/ß\/n\/or as an onset
after the first consonant of the consonant with respect to the following vowel -
cluster. /sam/pre:k/ß\/n\/.
i.e., word having a consonant-vowel More examples for this kind of words includes,
structure of xV[C-Stop][C-Stop]CV.., then it /mats/y\/, /mat/sy\/(fish);
should be syllabified as (xVC)(CCV..), /sank/ya:/, /san/kya:/ (number) and
where x denotes zero or any number of
/lakß/y\/,/lak/ßy\/(point).
consonants.
Among the Sanskrit loan words in Sinhala
− At other situations mark the syllable
including word internal clusters ending in /r/ and
boundary after the second consonant of
preceding a vowel, can either be syllabified by
the consonant cluster.
reduplicating the first consonant sound of the cluster
i.e., word having a consonant-vowel
or by retaining the original word as follows.
structure of xVCCCV.., should be
/kr\/mak/r\/m\/y\/ /kr\/mak/kr\/m\/y\/ (gra
syllabified as (xVCC)(CV..), where x
denotes zero or any number of consonants. dually)
/ap/r\/ma:/n\/ /ap/pr\/ma:/n\/ (unlimited)
2. If this vowel is preceded by a consonant /ja/yag/ra:/hi:/ /ja/yag/gra:/hi:/ (victory)
cluster of more than 3 consonants, This explanation shows that, the determined
• If the consonant just before the last vowel rules and procedures comply even with the
is /r/ or /y/ then, mark the syllable boundary ambisyllabic words as well. It is important to note
before 2 consonants from the last vowel. that one of the syllabified forms of such word will be
i.e., word having a consonant-vowel provided in accordance with the determined rules.
structure of xVCCC…[/r/ or /y/]V.., then it
4.1.5. The syllabification algorithm. In this section,
should be syllabified as (xVCCC…)(C[/r/ or
the identified Sinhala syllabification linguistic rules
/y/]V..), where x denotes zero or any
are presented as an algorithm. The function
number of consonants. “syllabify” accepts an array of phonemes generated
• At other situations, mark the syllable by our Letter-To-Sound module for a particular
boundary just after the minimum sonorant word, along with a variable called current_position
consonant phoneme of the consonant cluster, which is used to determine the position of the given
i.e., word having a consonant-vowel array which is currently being processed by the
structure of xVCCC…CV.., then it should be algorithm.
syllabified just next to the minimum Initially the current_position variable will hold
sonorant consonant in the consonant cluster. the value 0. This function is called recursively until
b. Having marked the first syllable boundary, all phonemes in the array are processed. The function
continue the same procedure for the mark_syllable_boundary(postion) will mark the
rest phonemes as a new word, i.e., repeat the step (a) syllable boundaries of accepted array of phonemes.
to the rest of the word, until the whole word is The other functions used within the “syllabify”
syllabified. function are described below.
The words with consonant clusters of more than no_of_vowels(phonemes): accepts an array of
3 consonants are rarely found in Sinhala, and to phonemes and returns the number of
avoid the confusions of syllabification of words in vowels contained in that array.
those situations, the algorithm makes use of the is_a_vowel(phoneme): accepts a phoneme and
universal sonority hierarchy [2] in deciding the return true if the given phoneme is a vowel.
proper position to mark the syllable boundary. count_no_of_consonants_upto_next_vowel(phoneme
s, position): accepts an array of phonemes and a
4.1.4. Ambisyllabic words in Sinhala. Some starting position; and returns the count of consonants
ambisyllabic words are also found in Sinhala. This from the starting position of the given array until next
situation arises due to the fact that some words with vowel is found.
complex coda or onset can be syllabified in several is_a_stop(phoneme): returns true if the accepted
ways. phoneme is a stop consonant.
Example:
477
Sinhala
478
Working Papers 2004-2007
479
Sinhala
Some words are inaccurately phonetized even may pronounced as /karə/ or /kərə/ depending
after applying the above set of rules. Therefore, it on the context. Considering all these issues, it was
was decided to maintain a lexicon to obtain the decided to maintain a lexicon consist of such words
accurate phonetized form of a word. that are most frequently occurred in text with their
proper pronunciation form.
6. Results and conclusion Unlike in English, Sinhala does not make a very
clear distinction between stressed and unstressed
syllables. In Sinhala, the position of the stress is
The fundamental approach of our Sinhala TTS
fixed in relation to the word. Sinhala words nearly
system is diphone concatenation. The research on
always have the stress on the first syllable,
Sinhala Phonemic Inventory leads to the
irrespective of the number of syllables in the word.
identification and preparation of a list of diphones.
Also, the long vowels in the words are stressed. The
All together 1,401 diphones were identified, and
phrase stress in Sinhala cannot be easily predicted
some of the diphones were listed for the purpose of
since it depends on the context. Hence, it was
reading the foreign words appear in the text.
decided to place the stress on the first syllable and
The extensive study carried out on
long vowels of each word by the stress assignment
Syllabification of Sinhala words leads to the
algorithm. A research is underway to determine the
definition of set of rules which is capable of
prediction of phase stress of an utterance.
predicting the syllable boundaries of a given word
The available literature on Sinhala Intonation is
accurately. The algorithm was tested on 30,000
inadequate, and an extensive research on Sinhala
distinct words extracted from “UCSC Sinhala Corpus
prosody is underway. The research criterion includes
BETA” corpus, and compared the results with
the observation and analysis of selected previous
correctly syllabified words. Text obtained from the
news, broadcasted on radio with their scripts. Since,
category “News Paper > Feature Auricles > Other ”
our TTS engine is also prose type reader, it is
was chosen for testing the algorithm, due to the
expected that this kind of study will reveal the
heterogeneous nature of the text and better
necessary information to model Sinhala Prosodic
representation of corpus (i.e. approx. 2/3 of the entire
features. Also, the study will be useful in determining
corpus). A list of distinct words was obtained, and
the phrase breaks in an utterance too.
first 30,000 words with the highest frequency of
occurrence were chosen for testing the algorithm.
The algorithm achieved an accuracy of 99.95%. 7. References
Most of the errors were due to,
[1] P. Ladeforged, A Course In Phonetics, 3rd edn.,
1. Words developed by composing two or Harcourt Brace Jovanovich College Publishers, 301,
more words. (ie. Single word is formed by Commerce Street, Suite 3700, Fort Worth TX 76102,
combining 2 or more distinct words; like the 1993.
word “thereafter”). In this case,
syllabification should be carried out [2] C. Gussenhoven, and H. Jacobs, Understanding
separately for each word that the word is Phonology, Oxford University Press Inc, 198,
comprised of and then it should be Madison Avenue, New York, NY 10016, 1998.
concatenated to form a single word.
[3] A.W. Black and K.A. Lenzo, Building Synthetic
2. Directly termed other language words.
Voices, Carnegie Melon University, Edinburgh,
Letter-to-Sound rules were also tested with the 2000.
same set of words, and compared with the correct
phonetized representation of each word. The [4] W.S. Karunatillake, An introduction to spoken
algorithm achieved an accuracy of 98.21%. Sinhala, 3rd ed., M.D. Gunasena & Co. ltd., 217,
It was noticed that, here also the words Olcott Mawatha, Colombo 11, 2004.
developed by composing two or more words caused
most of the errors. Beside, the contribution for errors [5] J.B. Disanayaka, The structure of spoken Sinhala,
of the directly termed foreign words is also a National Institute of Education, Maharagama, 1991.
significant factor. Few errors were occurred due to
the words which have the same written form but with [6] R.M.W. Rajapaksa, Verb: A Prosodic analysis,
different pronunciation, i.e. the written form “කර”, Department of Phonetics and Linguistics, School of
480
Working Papers 2004-2007
Appendix A: Word-List
කූෙටෝපකම k u: t o: p a k k r \ m \
පෘතග්ජන prutagjan\
කමකමෙයන් kr\makkr\m\yen
ස්තීන් s t r i: n
විදXාඥයා v i d y a: j µ \ y a:
කෘමියා k r u m i y a:
පීතෲන් p i: t t r u: n
ෙසෞභාගX s a u b a: g y \
කාෙවXෝපෙද්ශය k a: v y o: p \ d e: ß \ y \
අවිදXාව a v i d y a: v \
පඥාව p r a j µ a: v \
ෙකොන්ස්තන්තිෙනෝපලය k o ˜ s t a n t i n o: p \ l \ y \
ස්වප්න svap n\
ශලXකර්ම ßaly\karm\
පාර්ලිෙම්න්තුව p a: r l i m e: n t u w \
ද්වන්ද්ව dvandv\
හෘදස්පන්දනය h\rd\spand\n\y\
සම්ෙපේක්ෂණ s a m p r e: k ß \ n \
ෙක්ෂේත ß e: ß t r \
පවෘත්ති pr\vurti
පවජXා p r \ v u r j ya:
පශබ්ධිය pr\ßrabdiy\
සංස්කෘත s a n s k r u t\
springs spri˜gs
scratched skræcÎ
straights s t r e: i t s
strength strents
postscript p o: s t s k r i p t
area e: r i a:
if no_of_vowels(phonemes) is 1 then
mark_syllable_boundary(at the end of phonemes)
else
if is_a_vowel(phonemes[current position]) is true then
no_of_consonants=count_no_of_consonants_upto_next_vowel
(phonemes, current position)
481
Sinhala
if no_of_consonants is 0 then
if is_a_vowel(phonemes[current_position+1]) is true then
mark_syllable_boundary (current_position+1)
syllabify(phonemes, current_position+1)
end if
else
if no_of_consonants is 1 then
mark_syllable_boundary(current_position+1)
syllabify(phonemes, current_position+2)
end if
end if
482
Working Papers 2004-2007
end if
end if
else
temp=current_postion
repeat
temp = temp + 1;
until ( is_a_vowel(phonemes[temp]) is true
syllabify(phonemes,temp)
end if
end if
483
Sinhala
ප, ඵ
p p
බ, භ
b b
ම
m m
ඹ
b~ mbz
ය
y y
ර
r r
ල, ළ
l I
ව
v v
ශ, ෂ
s` shz
ස
s s
හ, (◌ඃ)
h h
ෆ
f f
484