1. Introduction
Culture, literature, art forms, science, technology and everything else that human civilization
possesses exist because of language, which is considered the single most valuable possession of
the human race. The words that we think, speak or write are not really ours. They come through us,
but they are not ours. We received them from our parents, teachers and others, who received them
from their predecessors, and so on ad infinitum. The transmission of knowledge, information and
skills from generation to generation is possible only because of language. Each generation
adds incrementally to the accumulated knowledge of humankind.
Speech, writing and signs are the main media through which language exists. The patterns
formed in these media carry the language. Spoken, written and signed messages carry the same
information embodied in different media. Speech carries information as variation in acoustic
energy over time (sound units, or phonemes) and writing as patterns in a plane (visual
symbols, or alphabets). The sign language used by the hearing impaired combines hand shapes,
orientation and movement of the hands, arms or body, and facial expressions to convey
messages. Language developed initially through the medium of speech, primarily for
interaction. Writing developed later, through cave markings, inscriptions and alphabetic
writing, perhaps mostly to meet the need for transmitting knowledge.
The conversion of messages from one form to another is essential for the assimilation
and transmission of knowledge. When people become literate in a language, they learn to convert
a written message (text) into a spoken message (speech) and vice versa. They learn to read (text to
speech conversion) and to write (speech to text conversion). Text to speech conversion and
speech to text conversion in their typical forms (using alphabets and phonemes) are not possible for
the visually impaired or the hearing impaired. Visually impaired persons use Braille for
communicating with those who know it, and the hearing impaired use sign language
instead of speech sounds to communicate among themselves. To communicate with the
majority population, who do not know Braille or sign language, and to be part of mainstream
society, they need some means of converting their messages to the typical forms of speech and
writing.
Artificial techniques for converting text to speech and speech to text allow them to
communicate and to be integrated with the world around them. These techniques come under the
discipline of speech processing, that is, the digital processing of speech signals for applications
including generating or synthesizing speech and converting a given speech signal to the corresponding
text. These applications help the differently-abled to access advancements in human culture,
civilization, science and technology. Being integrated with society also enables them to
contribute to it in various ways. Society benefits as well, the greatest
testimony being Prof. Stephen Hawking, whose enormous contribution to humanity
over many decades has been facilitated by the computer based communication
systems developed for him.
Artificial speech generation systems have a long history, from the mechanical
synthesizers of the 18th and 19th centuries [1] to the electrical synthesizers of the early 20th century [2]. Full text
to speech synthesis systems (which generate speech for any given input) were developed in the
latter half of the 20th century [3]. A text to speech synthesis system works in two phases. The first
is natural language processing, where the input text is transcribed into a phonetic representation.
The second is the generation of speech waveforms, where the acoustic output is produced
from this representation. The input text may be data from a word processor, a mobile text message, an email
message or scanned text from a newspaper or book. The character string is processed and
analyzed to obtain the corresponding phonetic representation, which is usually a string of
phonemes. The natural language processing (NLP) module is developed based on linguistic and
phonological knowledge of the language. The speech generation module takes in the string of
phonemes and generates the corresponding speech. The speech is generated after applying the
necessary signal processing to the acoustic parameters to manipulate prosody, so that the
result is natural sounding speech.
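As an illustration, the two phases can be sketched in Python as two functions chained together. This is only a minimal sketch and not the system described here: the names nlp_phase, generation_phase, synthesize and Unit, the toy table TOY_G2P and the default duration are all assumptions, and the second phase merely attaches a duration to each phoneme as a stand-in for real waveform generation and prosody control.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical toy table; a real NLP module encodes linguistic and
# phonological knowledge of the language.
TOY_G2P = {"a": "/a/", "m": "/m/", "u": "/u/"}


@dataclass
class Unit:
    phoneme: str
    duration_ms: int  # stand-in for the acoustic parameters of a real system


def nlp_phase(text: str) -> List[str]:
    """Phase 1: transcribe the input text into a string of phonemes."""
    return [TOY_G2P[ch] for ch in text.lower() if ch in TOY_G2P]


def generation_phase(phonemes: List[str], rate: float = 1.0) -> List[Unit]:
    """Phase 2 (stand-in): give each phoneme a duration; `rate` mimics
    a prosody control such as speaking rate."""
    base_ms = 80  # assumed default duration, for illustration only
    return [Unit(p, round(base_ms / rate)) for p in phonemes]


def synthesize(text: str, rate: float = 1.0) -> List[Unit]:
    """Chain the two phases: text -> phonemes -> timed units."""
    return generation_phase(nlp_phase(text), rate)
```

Keeping the phases as separate functions mirrors the division described above: only the first phase needs to change from language to language.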
The normalized text is then converted into the corresponding phonemes by the grapheme to
phoneme (G2P) converter. The natural language processing module is language dependent and hence
has to be developed for each language with the help of linguists and language experts.
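The regular part of this conversion can be sketched as a table lookup over the letters of a word. The sketch below is an assumption-laden illustration, not the converter built for any actual system: it covers only a handful of Malayalam letters, the function name baseline_g2p and the romanized phoneme labels are hypothetical (the labels imitate the notation of the examples in this paper), and it handles only the inherent vowel /a/, vowel signs and the virama.

```python
# Toy subset of Malayalam letters, written as Unicode escapes.
VOWELS = {"\u0d05": "a", "\u0d07": "i", "\u0d09": "u"}   # independent vowels a, i, u
VOWEL_SIGNS = {"\u0d3f": "i", "\u0d41": "u"}             # dependent vowel signs i, u
CONSONANTS = {
    "\u0d15": "k",  # ka
    "\u0d23": "N",  # retroflex na
    "\u0d2e": "m",  # ma
    "\u0d30": "r",  # ra
}
VIRAMA = "\u0d4d"  # suppresses the inherent vowel of a consonant letter


def baseline_g2p(word):
    """Map a word to phoneme labels using only the regular correspondences."""
    phonemes = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in VOWELS:
            phonemes.append(VOWELS[ch])
        elif ch in CONSONANTS:
            phonemes.append(CONSONANTS[ch])
            if nxt == VIRAMA:
                i += 1                          # bare consonant: skip the virama
            elif nxt in VOWEL_SIGNS:
                phonemes.append(VOWEL_SIGNS[nxt])
                i += 1                          # vowel sign replaces inherent /a/
            else:
                phonemes.append("a")            # inherent vowel
        else:
            raise ValueError(f"letter not in the toy table: {ch!r}")
        i += 1
    return phonemes
```

For example, the letter sequences for /karuNa/ and /ammu/ come out as one label per phoneme, which is the form that context dependent rules can then operate on.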
A striking feature of Malayalam is the close blend of Dravidian and Sanskrit elements in the
language. Malayalam has borrowed heavily from Sanskrit in vocabulary, syntax and
phonology. This blending has influenced the phonological processes in the language. Most of the
phonological processes in the language are regular, and hence rules can be framed for them;
the few exceptions can be listed.
Since the grapheme to phoneme correspondence is fairly regular in Malayalam, only a small
set of rules is required, covering the cases where the phonemes corresponding to the graphemes change due to
phonological processes. The set of G2P rules framed during the two day workshop of
language experts, linguists, signal processing experts and computer engineers held at the
University of Calicut, organized by SCERT, Thiruvananthapuram, from April 22 to 23, 2017,
is listed below.
Each entry gives the Sl. No., grapheme and phoneme, followed by the description and the rule.

1. ഉ /u/
Description: Rounded only if ഉ is in the initial syllable or in the final syllable (e.g. /ammu/, /umma/ - rounded). In word middle it is rounded only if the vowel in the preceding syllable is also /u/ (e.g. /uDuppy/ - rounded; /karuNa/ - unrounded).
Rule: If not word initial or word final and if the preceding vowel (vowel in the preceding phoneme) is not /u/, then replace /u/ with the raised and retracted form of shwa (this is to be stored as a separate version of /u/ for synthesis).

2. അ /a/
Description: When അ is not in the word final syllable it will get an /e/ colour, i) if the succeeding consonant is a palatal or an alveolar, or ii) if the preceding consonant is a voiced stop (/ga/, /ja/, /da/, /Da/, /ba/) or /ya/, /ra/, /Ra/, /la/ (e.g. /aTayum/, /balaM/, /jalaja/).
Rule: If /a/ is not in the word final syllable and if the succeeding consonant is a palatal or an alveolar, or if the preceding consonant is a voiced stop, then replace /a/ with the sound /e/.

3. ഇ /i/
Description: The phoneme corresponding to ഇ will be a form of shwa in word middle and the duration will be very short.
Rule: If not word initial or word final, then replace /i/ with a raised and fronted form of shwa (this is to be stored as a separate version of /i/ with very small duration for synthesis).

4. ക /kk/
Description: Geminated /k/ will become palatalized if the preceding phoneme is /i/, /e/ or /y/.
Rule: If the preceding phoneme is /i/, /e/ or /y/, then replace /kk/ with its palatalized version (/k'k'/) (this is to be stored as a separate version of /kk/).

5. ന alveolar /n/
Description: In word beginning dental /n/ (exception only in loan words), otherwise alveolar /n/. When combined with dentals, then dental /n/.
Exceptions: i) In compound words like perunnal, varumnal, malanad, somanadhan, karinizhal, vananira. ii) /ennal/ - when the meaning is 'but', it is dental, whereas when the meaning is 'by myself', it is dental.
Rule: If not in word initial position and if the preceding consonant or succeeding consonant is dental, then replace with dental /n/.

6. ഹ /h/
Description: When in a consonant cluster with a nasal, /h/ is not pronounced; instead the consonant is geminated (e.g. /brahmam/, /chihnam/).
Rule: If in a consonant cluster and if the succeeding phoneme is nasal, then replace /h/ with that nasal.

7. യ /y/
Description: When preceded by a consonant and followed by /a/, /y/ changes into an opened up /e/ (e.g. /vyasanam/).
Rule: If the preceding phoneme is a consonant and /y/ is succeeded by /a/, then replace /y/ with a special form of /e/ (this is to be stored as a separate version of /e/).

8. വ /w/
Description: Labiodental when preceded by a vowel (/waram/). Bilabial when preceded by a consonant (/swarnam/). Yet another form when preceded by anuswaram (/swayamvaram/).
Rule: If the preceding phoneme is a consonant, then replace /w/ with a special form, which is bilabial. If the preceding phoneme is anuswaram, then replace /w/ with a special form of /w/. Otherwise labiodental.

10. ം (anuswaram)
Description: If the anuswaram comes in the middle of a word and the succeeding phoneme is velar, it will become nasalized (e.g. /bhaNgi/, /saNgiidam/).
Rule: If not in the word initial position or in the word final position and if the succeeding phoneme is velar, then replace it with the corresponding nasal.

11. ഃ (visargam)
Description: Geminate the succeeding consonant.
Rule: When a visargam occurs, geminate the succeeding consonant.
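To make the form of such rules concrete, rules 1, 4 and 6 can be sketched as Python functions that rewrite a list of phoneme labels. This is one possible reading of the rules under stated assumptions, not the workshop's implementation: each list element is taken to be one phoneme in the romanized notation of the examples, "initial" and "final" in rule 1 are interpreted as the first and last syllable (counted by vowels), the variant labels U_UNROUNDED and KK_PALATAL are hypothetical names for the separately stored versions, and the word /irikkuka/ is an example chosen here rather than one taken from the list.

```python
VOWELS = {"a", "e", "i", "o", "u"}
NASALS = {"m", "n", "N"}
# Hypothetical labels for the separately stored variants.
U_UNROUNDED = "u~"
KK_PALATAL = "k'"


def unround_u(ph):
    """Rule 1: /u/ outside the first and last syllable is unrounded
    unless the vowel of the preceding syllable is also /u/."""
    out = list(ph)
    vowel_pos = [i for i, p in enumerate(ph) if p in VOWELS]
    for k, i in enumerate(vowel_pos):
        medial = 0 < k < len(vowel_pos) - 1
        if ph[i] == "u" and medial and ph[vowel_pos[k - 1]] != "u":
            out[i] = U_UNROUNDED
    return out


def palatalize_kk(ph):
    """Rule 4: geminate /k/ after /i/, /e/ or /y/ becomes palatalized."""
    out = list(ph)
    for i in range(1, len(ph) - 1):
        if ph[i] == "k" and ph[i + 1] == "k" and ph[i - 1] in {"i", "e", "y"}:
            out[i] = out[i + 1] = KK_PALATAL
    return out


def assimilate_h(ph):
    """Rule 6: /h/ directly before a nasal is replaced by that nasal."""
    out = list(ph)
    for i in range(len(ph) - 1):
        if ph[i] == "h" and ph[i + 1] in NASALS:
            out[i] = ph[i + 1]
    return out


def apply_rules(ph, rules=(unround_u, palatalize_kk, assimilate_h)):
    """Apply each rule in turn to the phoneme list."""
    for rule in rules:
        ph = rule(ph)
    return ph
```

Each function reads the original list and writes to a copy, so a rule never reacts to its own output within one pass. The remaining rules would follow the same pattern, with the listed exceptions handled by a lookup before the rules are applied.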
6. Conclusion
G2P rules play a crucial role in TTS: to generate natural sounding speech,
the phones synthesized for a given text should replicate the way a native
speaker of the language produces speech sounds. G2P rules are also important for other
speech processing applications, including speech recognition, spelling correction and
speech to speech machine translation.
The G2P rules listed in this paper are for the standard speaking style, or reading style. For
other dialects the G2P rules may differ. Rules can be framed for each of the various
dialects (for example, palatalization of geminate /k/ is absent in the northern Kerala
dialect). TTS systems are usually built to generate standard speech. Hence G2P rules for standard
speech are a basic minimum requirement for building an assistive system or reading software
for the language. At the same time, it would be very interesting to have a reading system that
generates speech in various speaking styles and dialects. Linguists, language experts
and native speakers of the language have to come together in an integrated
effort to facilitate the development of good quality TTS in various styles.
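One way such dialect variation could be accommodated is to keep a separate rule list for each speaking style. The sketch below is hypothetical: the table RULES_BY_STYLE, the style names and the function apply_style are assumptions, the two rule functions are simplified readings of rules 4 and 6, and the only dialect difference encoded is the one stated above (no palatalization of geminate /k/ in the northern Kerala dialect). That the other rule carries over unchanged is assumed for illustration.

```python
def palatalize_kk(ph):
    """Geminate /k/ after /i/, /e/ or /y/ becomes palatalized (label k')."""
    out = list(ph)
    for i in range(1, len(ph) - 1):
        if ph[i] == "k" and ph[i + 1] == "k" and ph[i - 1] in {"i", "e", "y"}:
            out[i] = out[i + 1] = "k'"
    return out


def assimilate_h(ph):
    """/h/ directly before a nasal is replaced by that nasal."""
    out = list(ph)
    for i in range(len(ph) - 1):
        if ph[i] == "h" and ph[i + 1] in {"m", "n", "N"}:
            out[i] = ph[i + 1]
    return out


# Hypothetical registry: one ordered rule list per speaking style.
RULES_BY_STYLE = {
    "standard": [palatalize_kk, assimilate_h],
    # Assumption: identical to standard except for the missing palatalization.
    "northern_kerala": [assimilate_h],
}


def apply_style(ph, style="standard"):
    """Apply the rule list registered for the given style."""
    for rule in RULES_BY_STYLE[style]:
        ph = rule(ph)
    return ph
```

With this arrangement a new style is added by registering another list, without touching the rules themselves.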