Grapheme to phoneme rules for text to speech synthesis in Malayalam

1. Introduction
Culture, literature, art, science, technology and everything else that human civilization
possesses exist because of language, which is considered the most valuable single possession of
the human race. The words that we think, speak or write are not really ours. They come through
us, but they are not ours. We have received them from our parents, teachers and others, who
received them from their predecessors, and so on ad infinitum. The transfer of knowledge,
information and skills from generation to generation is possible only because of language. Each
generation contributes incrementally to the accumulated knowledge of humankind.

Speech, writing and signs are the main media through which language exists. The patterns
formed in these media carry the language. Spoken, written and signed messages carry the same
information embodied in different media. Speech carries information as variation in acoustic
energy with respect to time (sound units or phonemes) and writing as patterns in a plane (visual
symbols or alphabets). The sign language used by the hearing impaired combines hand shapes,
orientation and movement of the hands, arms or body, and facial expressions to convey
messages. Language developed initially through the medium of speech, primarily for
interaction. Writing developed later through cave markings, inscriptions and alphabetic
writing, perhaps mostly out of the need to transmit knowledge.

The conversion of messages from one form to another is essential for the assimilation and
transfer of knowledge. When people become literate in a language, they learn to convert a
written message (text) to a spoken message (speech) and vice versa. They learn to read (text to
speech conversion) and write (speech to text conversion). Text to speech and speech to text
conversion in their typical forms (using alphabets and phonemes) are not possible for the
visually impaired or the hearing impaired. Visually impaired persons use Braille to communicate
with those who know it, and the hearing impaired use sign language instead of speech sounds to
communicate among themselves. To communicate with the majority population, who do not know
Braille or sign language, and to be part of mainstream society, they need some means to convert
their messages to the typical forms of speech and writing.

Artificial techniques for converting text to speech and speech to text allow them to
communicate and be integrated with the world around them. These techniques come under the
discipline of speech processing, that is, the digital processing of speech signals for
applications including generating or synthesizing speech and converting a given speech signal to
the corresponding text. These applications help the differently abled to access advancements in
human culture, civilization, science and technology. Being integrated with society at large also
enables them to contribute to it in various ways. Society benefits as well, the greatest
testimony being Prof. Stephen Hawking, whose enormous and sustained contribution to humanity
over many decades was facilitated by the computer based communication systems developed for
him.

2. Text-to-speech synthesis (TTS)


Text-to-speech synthesis (TTS) systems generate speech signals corresponding to a given text;
that is, a TTS system reads a given text aloud. TTS systems find a wide range of applications in
the current digitized world, ranging from support systems for the vision and speech impaired to
laboratory tools used by linguists for language learning and analysis. Visually impaired people
use TTS to have a text in digital form read aloud. With a scanner and an optical character
recognition system, along with a TTS system, they can get access to any written information.
They can use computers for all sorts of applications with a screen reader that reads aloud any
text where the mouse is placed. TTS thus enables visually impaired people to access the digital
world. People with reading disorders and severe speech impairment can use TTS as an aid to
communicate with the world around them. It is also helpful to those who are learning a new
language. Other applications of TTS include talking books and toys, games and animations,
reading out messages, emails, notifications, weather reports, travel directions etc., and call
centre automation.

Artificial speech generation systems have a long history, starting from the mechanical
synthesizers of the 18th and 19th centuries [1] to the electrical synthesizers of the early 20th
century [2]. Full text to speech synthesis systems (generation of speech for any given input)
were developed in the latter half of the 20th century [3]. A text to speech synthesis system
works in two phases. The first is natural language processing, where the input text is
transcribed into a phonetic representation; the second is the generation of speech waveforms,
where the acoustic output is produced from this representation. The input text may be data from
a word processor, a mobile text message, an email message or scanned text from a newspaper or
book. The character string is processed and analyzed to obtain the corresponding phonetic
representation, which is usually a string of phonemes. The natural language processing (NLP)
module is developed based on linguistic and phonological knowledge of the language. The speech
generation module takes in the string of phonemes and generates the corresponding speech. The
speech is generated after applying the necessary signal processing to the acoustic parameters to
manipulate prosody, so as to result in natural sounding speech.

[1] In 1779 the Russian professor Christian Kratzenstein constructed acoustic resonators similar
to the human vocal tract to produce five long vowels.
[2] VODER (Voice Operating Demonstrator), introduced by Homer Dudley at the New York World's
Fair, 1939.
[3] A full text to speech synthesis system for English was developed at the Electrotechnical
Laboratory, Japan, in 1968 by Noriko Umeda. In 1979 Allen, Hunnicutt and Klatt demonstrated the
MITalk laboratory TTS system developed at MIT.

3. Natural language processing module


The natural language processing module consists of the text analysis part and the grapheme to
phoneme convertor. In the text analysis module, the sentences in the given text are first
segmented into words and numbers. Text normalization is then done to obtain the correct
pronunciation of abbreviations, acronyms and numbers. Text normalization comprises
• expansion of abbreviations
  eg. Expanding Dr. to Doctor, Prof. to Professor etc.
• interpretation of numbers
  eg. Interpreting numbers as per context as numbers, years etc. 1993 has to be expanded as
  nineteen ninety three if it is a year and as one thousand nine hundred ninety three
  otherwise. Similarly, the correct expansion of decimal numbers such as 14.3 or .03 has to be
  taken care of.
• recognition of symbols
  eg. Expansion of symbols like %, $, & etc.
• proper expansion of acronyms
  eg. HIV or IBM are spelled out by pronouncing the letters, while NATO or AIDS are
  pronounced as words.

The normalized text is then converted into the corresponding phonemes by the grapheme to
phoneme convertor. The Natural language processing module is language dependent and hence
has to be developed for each language with the help of linguists and language experts.
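The normalization steps listed in this section can be illustrated with a minimal Python sketch for English text. It is not the workshop's implementation: the lookup tables, the cue words used to guess that a number is a year, and all function names are assumptions made for illustration. Decimal numbers and punctuation attached to tokens are deliberately left out.

```python
# Hypothetical lookup tables; a real system would use much larger ones.
ABBREVIATIONS = {"Dr.": "Doctor", "Prof.": "Professor"}
SYMBOLS = {"%": "percent", "$": "dollar", "&": "and"}
SPELLED_ACRONYMS = {"HIV", "IBM"}      # pronounced letter by letter
YEAR_CUES = {"in", "since", "year"}    # assumed context cue words

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()


def two_digits(n):
    """Words for 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def cardinal(n):
    """Words for 0..9999, read as a quantity."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def year(n):
    """Words for a year read in two halves, e.g. 1993."""
    high, low = divmod(n, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def normalize(text):
    """Expand abbreviations, symbols, spelled acronyms and whole numbers."""
    out, previous = [], ""
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif token in SYMBOLS:
            out.append(SYMBOLS[token])
        elif token in SPELLED_ACRONYMS:
            out.append(" ".join(token))
        elif token.isdecimal() and len(token) <= 4:
            n = int(token)
            # Assumed heuristic: a number from 1100 to 1999 after a cue word is a year.
            is_year = 1100 <= n <= 1999 and previous.lower() in YEAR_CUES
            out.append(year(n) if is_year else cardinal(n))
        else:
            out.append(token)  # includes word-like acronyms such as NATO
        previous = token
    return " ".join(out)
```

For example, `normalize("Dr. Rao joined IBM in 1993")` expands the abbreviation, spells out the acronym and reads the number as a year, while the same number without a cue word is read as a quantity.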

4. Grapheme to phoneme convertor


The grapheme to phoneme (G2P) conversion system converts words from a normalized text into
their corresponding phonemes. It can be described as a function mapping the spelling form of
words to a string of phonemes; in other words, it is the process of converting a sequence of
letters into a sequence of phonemes. Some languages, like English, have no obvious grapheme to
phoneme correspondence, whereas languages like Spanish or the Indian languages, being phonetic
in nature, show regular correspondence between graphemes and phonemes. In such cases a simple
set of rules can be derived for the conversion from orthographic symbols to phonetic symbols.
G2P conversion is crucial in TTS, since the generation of natural sounding speech depends on
the phoneme strings produced by this system, which should correspond to the pronunciation of
the words in natural speech. Even in phonetic languages, where there seems to be a one to one
correspondence between graphemes and phonemes, the phones corresponding to the different
phonemes vary in different contexts. If this variation is not accounted for during speech
generation, the synthesized speech will sound artificial and sometimes not even be
intelligible. Rules can be framed if the variation is regular, and a dictionary look up can be
provided otherwise. Hence, for each language, speech data has to be analyzed and rules or a
dictionary look up prepared to build a G2P convertor.
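The combination of rules for regular variation and a dictionary look up for irregular words can be sketched as follows. The example is loosely modelled on Latin American Spanish and heavily simplified (digraphs such as "ch" and accented letters are ignored); the tables, the phoneme symbols and the function name `g2p` are assumptions for illustration only.

```python
# Dictionary look up for irregular words (single illustrative entry).
EXCEPTIONS = {"mexico": ["m", "e", "x", "i", "k", "o"]}

# Default letter-to-phoneme map; letters not listed map to themselves.
# An empty string means the letter is silent.
DEFAULTS = {"v": "b", "z": "s", "h": ""}


def g2p(word):
    """Map a spelled word to a list of phoneme symbols."""
    word = word.lower()
    if word in EXCEPTIONS:
        # Irregular word: the dictionary overrides the rules.
        return list(EXCEPTIONS[word])
    phonemes = []
    for i, letter in enumerate(word):
        following = word[i + 1] if i + 1 < len(word) else ""
        if letter == "c":
            # Context rule: "c" is /s/ before a front vowel, /k/ elsewhere.
            phonemes.append("s" if following in ("e", "i") else "k")
        elif letter == "x":
            # One grapheme mapping to two phonemes.
            phonemes.extend(["k", "s"])
        else:
            phoneme = DEFAULTS.get(letter, letter)
            if phoneme:
                phonemes.append(phoneme)
    return phonemes
```

Checking the dictionary before the rules keeps the rule set small: each irregular word costs one entry instead of one special-case rule.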

5. Grapheme to phoneme conversion for Malayalam TTS


The set of orthographic symbols and the set of contrastive units in Malayalam correspond very
closely, but not exactly. The phones produced for the graphemes vary in different contexts due
to the various phonological processes prevalent in the natural speech of the language.
Malayalam speech undergoes phonological processes such as voicing of stops, lenition of stops
and spreading of nasality.

A striking feature of Malayalam is the close blend of Dravidian and Sanskrit elements in the
language. Malayalam has borrowed heavily from Sanskrit in vocabulary, syntax and
phonology. This blending has influenced the phonological processes of the language. Most of the
phonological processes in the language are regular and hence rules can be framed for them; the
few exceptions can be listed.

Since the grapheme to phoneme correspondence is fairly regular in Malayalam, only a small
set of rules is required, covering the cases where the phonemes corresponding to the graphemes
change due to the phonological processes. The set of G2P rules framed during the two day
workshop of language experts, linguists, signal processing experts and computer engineers held
at the University of Calicut, organized by SCERT, Thiruvananthapuram, from April 22nd to 23rd,
2017, is listed below.

5.1 G2P rules for Malayalam


In Malayalam, there is one to one mapping between graphemes and phonemes, with very few
exceptions. In most of the cases, the phones produced by the graphemes will not vary with
the context of occurrence of the grapheme in the text. The G2P rules are framed for those
cases where the corresponding phoneme changes depending on the context of occurrence.

G2P rules: Table I

Sl. No. 1   Grapheme: ഉ   Phoneme: /u/
  Description: Rounded only if ഉ is in the initial syllable or in the final syllable
  (eg. /ammu/, /umma/ - rounded). In word middle it is rounded only if the vowel in the
  preceding syllable is also /u/ (eg. /uDuppy/ - rounded; /karuNa/ - unrounded).
  G2P conversion rule: If not word initial or word final and if the preceding vowel (vowel in
  the preceding phoneme) is not /u/ then replace /u/ with the raised and retracted form of shwa
  (this is to be stored as a separate version of /u/ for synthesis).

Sl. No. 2   Grapheme: അ   Phoneme: /a/
  Description: When അ is not in the word final syllable it will get an /e/ colour, i) if the
  succeeding consonant is a palatal or an alveolar, or ii) if the preceding consonant is a
  voiced stop (/ga/, /ja/, /da/, /Da/, /ba/) or /ya/, /ra/, /Ra/, /la/
  (eg: /aTayum/, /balaM/, /jalaja/).
  G2P conversion rule: If /a/ is not in the word final syllable and if the succeeding consonant
  is a palatal or an alveolar, or if the preceding consonant is a voiced stop, then replace /a/
  with the sound /e/.

Sl. No. 3   Grapheme: ഇ   Phoneme: /i/
  Description: The phoneme corresponding to ഇ will be a form of shwa in word middle and its
  duration will be very short.
  G2P conversion rule: If not word initial or word final then replace /i/ with a raised and
  fronted form of shwa (this is to be stored as a separate version of /i/ with very small
  duration for synthesis).

Sl. No. 4   Grapheme: ക   Phoneme: /kk/
  Description: Geminated /k/ will become palatalized if the preceding phoneme is /i/, /e/ or
  /y/.
  G2P conversion rule: If the preceding phoneme is /i/, /e/ or /y/ then replace /kk/ with its
  palatalized version (/k'k'/) (this is to be stored as a separate version of /kk/).

Sl. No. 5   Grapheme: ന   Phoneme: alveolar /n/
  Description: In word beginning dental /n/ (exception only in loan words), otherwise alveolar
  /n/. When combined with dentals, dental /n/.
  Exceptions: i) In compound words like perunnal, varumnal, malanad, somanadhan, karinizhal,
  vananira. ii) /ennal/ - when the meaning is 'but', it is dental, whereas when the meaning is
  'by myself', it is alveolar.
  G2P conversion rule: If not in word initial position and if the preceding consonant or the
  succeeding consonant is dental then replace with dental /n/.

Sl. No. 6   Grapheme: ഹ   Phoneme: /h/
  Description: When in a consonant cluster with a nasal, /h/ is not pronounced; instead the
  consonant is geminated (eg: /brahmam/, /chihnam/).
  G2P conversion rule: If in a consonant cluster and if the succeeding phoneme is nasal then
  replace /h/ with that nasal.

Sl. No. 7   Grapheme: യ   Phoneme: /y/
  Description: When preceded by a consonant and followed by /a/, /y/ changes into an opened up
  /e/ (eg: /vyasanam/).
  G2P conversion rule: If the preceding phoneme is a consonant and it is succeeded by /a/ then
  replace /y/ with a special form of /e/ (this is to be stored as a separate version of /e/).

Sl. No. 8   Grapheme: വ   Phoneme: /w/
  Description: Labiodental when preceded by a vowel (/waram/). Bilabial when preceded by a
  consonant (/swarnam/). Yet another form when preceded by anuswaram (/swayamvaram/).
  G2P conversion rule: If the preceding phoneme is a consonant then replace /w/ with a special
  form, which is bilabial. If the preceding phoneme is anuswaram then replace /w/ with a special
  form of /w/ (this is to be stored as a separate version of /w/). Otherwise labiodental.

Sl. No. 9   Grapheme: ത   Phoneme: /t/
  Description: In the middle of a word, if succeeded by m, it changes to /lpm/ or /tmp/.
  G2P conversion rule: If not word initial or word final and if the succeeding phoneme is not
  /m/ then replace /t/ with /l/; else if the succeeding phoneme is /m/ then replace /t/ with
  /lpm/ or /tmp/ (these have to be stored separately).

Sl. No. 10   Grapheme: ം (anuswaram)
  Description: If the anuswaram comes in the middle of a word and the succeeding phoneme is
  velar, it will become nasalized (eg: /bhaNgi/, /saNgiidam/).
  G2P conversion rule: If not in the word initial position or in the word final position and if
  the succeeding phoneme is velar then replace it with the corresponding nasal.

Sl. No. 11   Grapheme: ഃ (visargam)
  Description: Geminate the succeeding consonant.
  G2P conversion rule: When a visargam occurs, geminate the succeeding consonant.
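As an illustration of how context rules of the kind in Table I might be applied, the following Python sketch implements rules 4, 6 and 10 over a list of romanized phoneme symbols. The symbol inventory is an assumption made for this example ("M" for anuswaram, "ng" for the velar nasal, "k'k'" for the palatalized geminate), as is the function name; it is not the workshop's implementation and covers only three of the eleven rules.

```python
# Hypothetical romanized symbol sets used only for this sketch.
VELARS = {"k", "kh", "g", "gh", "kk"}
NASALS = {"m", "n", "ng"}
PALATALIZING = {"i", "e", "y"}


def apply_table1_rules(phonemes):
    """Apply Table I rules 4, 6 and 10 to a list of phoneme symbols.

    Contexts are read from the input list, so one rule's output
    never becomes the context of another rule.
    """
    out = list(phonemes)
    last = len(phonemes) - 1
    for i, p in enumerate(phonemes):
        prev = phonemes[i - 1] if i > 0 else None
        following = phonemes[i + 1] if i < last else None
        if p == "kk" and prev in PALATALIZING:
            out[i] = "k'k'"        # rule 4: palatalized geminate /k/
        elif p == "h" and following in NASALS:
            out[i] = following     # rule 6: /h/ replaced by the following nasal
        elif p == "M" and 0 < i < last and following in VELARS:
            out[i] = "ng"          # rule 10: word-medial anuswaram before a velar
    return out
```

For instance, the symbol sequence for /brahmam/, `["b", "r", "a", "h", "m", "a", "m"]`, comes out with the /h/ replaced by /m/, while a word-final anuswaram is left unchanged.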

G2P rules: Table II

Sl. No. 1   Phonological process: Voicing of stops
  Description: When preceded by nasal sonorants, stops (/k/, /c/, /t/, /p/) appear to be voiced
  (/g/, /j/, /d/, /b/), and /ţ/ becomes /ഡ/.
  G2P conversion rule: If the preceding phoneme is a nasal sonorant, or if the preceding phoneme
  is a nonnasal sonorant and the succeeding phoneme is a vowel, then [replace the stop with its
  voiced counterpart].

Sl. No. 2   Phonological process: Stop insertion
  Description: A stop is inserted between stops and nasals.
  G2P conversion rule: For /n/, if the preceding phoneme is /p/ then replace /p/ with a special
  form /pt/ (eg. swapnam). For /m/, if the preceding phoneme is [...]

Sl. No. 3   Phonological process: Nasalization
  Description: Post nasal stops are converted to nasals (eg. in /nandi/, /da/ is replaced with
  /na/; ie it is sounded as [/nanni/]).
  G2P conversion rule: If word initial or word final, if the succeeding phoneme is a consonant,
  then replace it with the corresponding nasal sound.
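The voicing rule in Table II can be sketched in Python under one reading of its conditions: a voiceless stop is voiced either after a nasal, or between a nonnasal sonorant and a vowel. The grouping of the conditions, the romanized symbol sets (including "T" and "D" for the retroflex stops) and the function name are all assumptions for illustration, and interactions with the other Table II rules are ignored.

```python
# Hypothetical romanized symbol sets used only for this sketch.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ee", "o", "oo"}
NASALS = {"m", "n", "N", "ng", "ny"}
NON_NASAL_SONORANTS = VOWELS | {"y", "r", "R", "l", "L", "zh", "w"}
VOICED = {"k": "g", "c": "j", "t": "d", "p": "b", "T": "D"}


def voice_stops(phonemes):
    """Voice stops after nasals, or between a nonnasal sonorant and a vowel."""
    out = list(phonemes)
    for i, p in enumerate(phonemes):
        if p not in VOICED or i == 0:
            continue  # not a voiceless stop, or word initial
        prev = phonemes[i - 1]
        following = phonemes[i + 1] if i + 1 < len(phonemes) else None
        after_nasal = prev in NASALS
        between_sonorants = prev in NON_NASAL_SONORANTS and following in VOWELS
        if after_nasal or between_sonorants:
            out[i] = VOICED[p]
    return out
```

Word-initial stops are never voiced in this sketch, and a stop that follows another stop (as in a geminate written as two symbols) is also left unchanged because neither condition holds.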

6. Conclusion
G2P rules play a crucial role in TTS, since, for the generation of natural sounding speech,
the phones synthesized for a given text should replicate the way a native speaker of the
language produces speech sounds. G2P rules are also important for other speech processing
applications, including speech recognition, spelling correction and speech to speech machine
translation.

Malayalam being a phonetic language, there is a one to one correspondence between graphemes
and phonemes. Hence the mapping from graphemes to phonemes is easy. Rules can be framed for
those cases where the phones generated for the graphemes vary with the context of occurrence.
The phonological processes being regular, only 14 rules are required to generate speech as
spoken by a native speaker of the language. Many languages require a large number of G2P
rules. For certain languages it is very difficult to frame rules, and a rule based G2P
convertor cannot be formulated; such languages use dictionary based G2P convertors.

The G2P rules listed in this paper are for the standard speaking or reading style. For
different dialects the G2P rules may differ. Rules can be framed for each of the various
dialects (for example, palatalization of geminate /k/ is absent in the northern Kerala
dialect). TTS systems are usually built to generate standard speech. Hence G2P rules for
standard speech are a basic minimum requirement to build an assistive system or reading
software for the language. At the same time, it would be very interesting to have a reading
system that generates speech in various speaking styles and dialects. Linguists, language
experts and native speakers of the language have to come together in a combined, integrated
effort to facilitate the development of good quality TTS in various styles.
