INTRODUCTION
1.1 Introduction
generate natural language text (paragraphs, sections and even entire documents)
from non-linguistic input (Reiter E, Dale R 2000). For example, an NLG system
working in the healthcare domain could start with a patient record and physiological
data (such as heart rate and blood pressure) collected over a shift period to auto-
generate a shift handover report from an outgoing nurse to an incoming nurse (James
NLG has come a long way from (in the 60s and 70s) being a research field with
http://www.gartner.com/smarterwithgartner/gartner-predicts-our-digital-future/, that
NLG software will auto-generate 20% of business content by 2018, just the next
and third sector) and users’ demand for information shooting up all the time, NLG
State-of-the-art systems divide the complex task of NLG into subtasks that either
make text design decisions (decisions related to subject matter expertise as well
as linguistic decisions) or execute the decisions made earlier to produce the surface text. The
second set of subtasks that execute decisions are collectively grouped into a subtask
called surface realization. The research work presented in this thesis relates to the
(https://en.wikipedia.org/wiki/Dravidian_languages).
The subtask of surface realization is challenging, despite the fact that all the text
design decisions would have been made prior to surface realization, because a
surface realizer is responsible for ensuring the grammatical (e.g. formation of well-
would have evolved over millennia, its grammar and orthography would also have
evolved into a complex body of knowledge. The real challenge in building a surface
realizer is to acquire all this complex body of knowledge first, represent this
knowledge for computational modelling and finally develop algorithms that exploit
For English and other similar European languages with well-developed grammatical
resources, building surface realizers has been undertaken with significant success in
the last three decades. Several off-the-shelf English surface realizers such as
surface realizers for Indian languages are not commonplace. In the current socio-
societal transformation, language technologies have a big role to play. This is being
Hyderabad and CIIL Mysore) have been set up to develop Indian language
technology. While machine translation has always been the primary focus of Indian
language technology, it is time that other language technologies such as NLG should
times, there is a great potential to deliver finance, health and transport related
It is worth noting that across this diverse set of applications (from finance to
transport) the surface realizer is the only NLG module that could be designed and
across multiple domains. This means that realizers are reusable, and as such a thorough
scientific investigation into a realizer for a new language, such as Telugu, will
1.2 Motivation
In the standard Reiter and Dale architecture (Reiter E, Dale R 2000) of NLG a
surface realizer is the most language dependent module and also the one that could
knowledge there have been no previous research studies that explored Telugu
realizers systematically. On the other hand in the recent past there has been a
resurgence of interest in general purpose realizers in the research field of NLG. This
resurgence is largely caused by the SimpleNLG (Albert Gatt and Ehud Reiter, 2009)
realizer for English. This realizer is unique in comparison to earlier general purpose
using commonly used grammatical concepts without the burden of any linguistic
theory. SimpleNLG is the most widely used realizer for English both in academia
neutral, realizers for a wide range of languages have been reported in the literature
The use of software applications is increasing day by day in all parts of India,
especially in the rural areas of the states of Telangana and Andhra Pradesh where
people do not understand English. There is an evident need for a general purpose surface
realizer that generates sentences in their native language and thereby helps them understand the
applications better. The current work is motivated by the need for a
general purpose Telugu realizer and the availability of the popular SimpleNLG
design.
This thesis reports a realization engine for Telugu that automates the task of building
1.3 Objectives
In this thesis the design and development of a surface realization engine for Telugu
is investigated. The following objectives have been chosen to achieve the aim:
associated features.
b. To identify the right sources of the required Telugu grammatical
sources.
algorithmically.
3. The objectives for the noun and pronoun morphological engine are:
a. To design and develop the plural formation mechanism for Telugu nouns
and pronouns.
b. To design and develop the oblique stem formation mechanism for Telugu
b. To design and develop a mechanism for the final verb formation with the
b. To design and develop a mechanism for applying agreement rules, and to
1.4 Methodology
This thesis presents the research work carried out to design and develop a Telugu
realization engine adapting the SimpleNLG (Albert Gatt and Ehud Reiter, 2009)
framework for Telugu. The input to the Telugu realization engine is an XML file
1. The input specification for the Telugu realization engine is an XML file
sentence level. Therefore in the current thesis the agreement of the verb
1.5 Organization of the Thesis
The thesis is organized into seven chapters. The following are the major
1. This thesis presents a systematic and thorough investigation into the first
2. Telugu syntactic structures are very simple by nature and Telugu sentence
developed.
The description of the contributions in the thesis is distributed across all the
Chapter 1 introduces surface realization, the topic of the thesis. It
Chapter 2 deals with a review of the literature on foundational concepts and past work in
past work in a hierarchical manner starting from Natural language processing (NLP),
morphology in the context of languages all over the world and in India.
Chapter 3 deals with the surface realizer for Telugu, which is a Java-based application
that accepts a lexicalized XML file as input and generates well-formed Telugu
engine. It also provides a useful standard surface realization input mechanism for
Chapter 4 deals with the construction of the morphological engine for plural
formation, the inflection based on number, person and gender, and agglutination with
suffixes. The verbs are classified into six conjugations based on the
morphophonemic changes, of which five classes contain regular verbs and the last
formation, sentence level issues like subject verb agreement, and the word order of
the constituents of the sentence. The chapter also provides a detailed explanation of
the process followed and the modules in the implementation of the software for the
Chapter 7 concludes the thesis with a critical review of
the current work and the possibilities for future work that extends the
current research.
CHAPTER-2
REVIEW OF LITERATURE
2.1 Introduction
This chapter presents a literature review on the topics of this thesis which are natural
the context of Telugu language. In addition, this chapter also presents background
language processing, this chapter starts with a brief introduction of this area and how
introduction of natural language generation (NLG), which is the main topic of this
thesis, is presented. In this introduction, the task of NLG is first defined and then the
most commonly used subtasks and the different architectures for NLG are described.
[Figure: Computer Science; Artificial Intelligence; Computational Linguistics/Natural Language Processing]
The topic of this thesis is a subfield of Artificial Intelligence (AI) (Winograd, 1972;
processing has two subfields namely Natural Language Understanding (NLU) and
systems take strings of words (sentences) as their input and produce structured
representations capturing the meaning of those strings as their output. The nature of
this output depends heavily on the task at hand. For instance, a natural language
natural language which relate to the kind of data held by the database. In this case
the meaning of the input (the output of the system) might be expressed in terms of
structured SQL queries which can be directly submitted to the database. The now
2.3 Natural Language Generation
representation of input, NLG systems use knowledge about language and the
messages and other kinds of text. Many applications have been developed over the
years which automatically generate text from non-linguistic data including but not
soccer reports (e.g., Theune et al., 2001; Chen & Mooney, 2008);
weather and financial reports (Goldberg et al., 1994; Reiter et al., 2005;
Turner et al., 2008; Ramos-Soto et al., 2015; Wanner et al., 2015; Plachouras
et al., 2016);
Harris, 2008; Portet et al., 2009; Gatt et al., 2009; Banaee et al., 2013);
NLG is both a fascinating area of research and an emerging technology with many
interaction. These include questions such as how linguistic and domain knowledge
should be represented and reasoned with, what it means for a text to be well written,
and how information is best communicated between machine and human. From a
document creation, removing much of the drudgery associated with such tasks. It is
an actively researched topic in the research laboratories around the world, and is also
who do not have the background or time required to understand the raw data. In the
interfaces and will allow much richer interaction with machines than is possible
today.
In one sense, language generation is the oldest subfield of language processing. When
computers were able to understand only the most unnatural of command languages,
they were spitting out natural texts. For example, the oldest and most famous C
force this text holds is produced not by the program itself but by the author of that
program. This canned text approach to generation is easy to implement but is unable
Most of us have received a form letter with our name carefully inserted in just the right
places along with eloquent appeals for one thing or another. This sort of program is
easy to implement as well, but few readers are likely to be fooled into thinking that such a
letter is handwritten. This approach, called template filling, is more flexible
than canned text and has been used in a variety of applications, but it is still limited. For
Neither the canned text nor the template approach to NLG really captures the rich
sentence has a structure (syntax of phrases and clauses) and consists of well-formed
capture the rich linguistic structures of natural languages for the purpose of language
production.
John Bateman and Michael Zock created an exhaustive list of NLG systems which is
observed that NLG has been applied to a wide spectrum of applications, from
medicine to meteorology and from finance to engineering. Table 2.1 shows a list of
In the nineties, the field of NLG further consolidated the idea of NLG as a complex
task and moved towards the consensus architecture for Natural Language Generation
http://homepages.abdn.ac.uk/e.reiter/pages/papers/nlgw94.pdf). Much of the applied
NLG work since then has been strongly influenced by this consensus architecture.
The NLG problem of converting input data into output text was addressed by
splitting it up into a number of sub problems. The following six are frequently found
construction,
the text,
3. Sentence aggregation: Deciding which information to present in individual
sentences,
domain objects,
sentences.
The above-mentioned NLG subtasks are complex and address the following two
issues:
sentences.
In the generation process the NLG system needs to decide which information should
be included in the text under construction and which should not. In general the data
contains more information than required. It is the task of NLG systems to decide the
required content for the given application. Content determination involves choice.
2.3.2.2 Text Structuring
needs to decide the order of presentation to the user. This task is referred to as Text
Not every message in the text plan needs to be expressed as a separate
sentence. Some messages can be combined to form a sentence. Text generated by
combining some messages becomes more fluid and readable (Dalianis, 1999; Cheng
2.3.2.4 Lexicalization
process depends on the number of alternatives the NLG system can offer. Lexical
Referring Expression Generation is defined (Reiter and Dale 1997) as the task of
close similarity to lexicalisation, but Reiter and Dale (2000) point out that the
domain entity from other domain entities. Referring Expression Generation is among
the tasks within the field of automated text generation (Mellish et al., 2006;
important complication in this task is that the output needs to have many linguistic
components which may not be present in the input. Thus this task is the projection of
non-isomorphic structures (cf. Ballesteros et al., 2015). Some of the approaches that
a) Human-crafted templates
b) Statistical approaches
When the application is small and the variation in the output is minimal the outputs
This template has two variables which can be filled with the names of a player and
the number of runs scored by the player. It can generate sentences like:
virAt kohli vaMxa parugulu koVttAdu. (Virat Kohli hit hundred runs)
The advantage of templates is that they avoid the generation of
ungrammatical structures and allow for full control over the quality of the output.
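As a minimal sketch of template filling, the Python snippet below fills a two-slot template of the kind described above. The template string, the slot names and the function name fill_template are assumptions made for illustration; the slot order simply mirrors the example sentence, and the words are in the same romanized notation as the text.

```python
# Hypothetical two-slot template; slot names are illustrative only.
TEMPLATE = "{player} {runs} parugulu koVttAdu."


def fill_template(player, runs):
    """Insert the player name and the run count into the fixed template."""
    return TEMPLATE.format(player=player, runs=runs)


print(fill_template("virAt kohli", "vaMxa"))
# prints: virAt kohli vaMxa parugulu koVttAdu.
```

Because the verb form is part of the fixed string, the template cannot adapt it when a different subject would call for a different agreement form, which is one reason such templates do not scale to varied output.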
The template based methods have started including syntactic information and
sophisticated rules for filling the gaps (Theune et al., 2001) making it difficult to
distinguish template based methods from other methods (van Deemter et al., 2005).
The disadvantage of template based systems is that they are labour intensive and do not
Some applications acquire probabilistic grammars from large corpora reducing the
manual labour required while increasing coverage. Two approaches have been taken
from which a stochastic re-ranker selects the optimal candidate. The system relies on
corpus based statistical knowledge in the form of n-grams. There are also other
Ratnaparkhi, 2000; Cahill et al., 2007) and models trained on user ratings of
decision level, by training models that find the set of generation parameters
maximising an objective function, e.g. producing a target linguistic style (Paiva and
Evans, 2005; Mairesse and Walker, 2010), generating the most likely context-free
derivations given a corpus (Belz, 2008), or maximising the expected reward using
reinforcement learning (Rieser and Lemon, 2009). While such methods do not suffer
from the computational cost of an over generation phase, they still require a
handcrafted generator to define the generation decision space within which statistics
The topic of the current thesis which constructs a surface realization engine for
alternative. Most of these systems are grammar-based, that is, they make their
choices on the basis of a grammar of the language under consideration. The human-
crafted grammar-based surface Realizer while generating the surface forms must
b) The generated surface forms are grammatical with respect to the language in
question.
In order to satisfy the above requirements, the Surface Realizer may select content
words, insert function words, perform morphological inflections, and take care of the
order of the surface forms, using the grammar of the language under consideration
as the basis. The difficulty with grammar-based systems is how to make choices
The grammar for hand-crafted systems can be manually written, as in many off-the-
shelf realizers such as FUF/SURGE (Elhadad & Robin, 1996), MUMBLE (Meteer et al.,
1987), KPML (Bateman, 1997), Nigel (Mann & Matthiessen, 1983), and RealPro
detailed input as in KPML (Bateman, 1997) which is based on Systemic-Functional
Grammar (SFG; Halliday & Matthiessen, 2004). The level of detail that is required
makes these realizers difficult to use as simple plug-and-play or off-the-shelf modules
(e.g., Kasper, 1989). The difficulty with systems like KPML has motivated the
APIs, but leave choice-making up to the developer (Gatt et al., 2009; Vaudry &
Lapalme, 2013; Marcel Bollmann, 2011; Rodrigo de Oliveira & Sripada, 2014).
2.3.2.6.3.1 SimpleNLG
SimpleNLG (Gatt and Reiter, 2009). SimpleNLG is a simple Java API designed to
function as a realization engine, which is the last stage in natural language generation.
It has been used successfully in a number of projects, both academic and
commercial. SimpleNLG automates some of the mundane tasks that all natural
b) Morphology, which includes handling inflected forms that are modifying a word
and gender.
planning or micro planning so that the mundane tasks of realization need not bother
them. It is also used by anyone who wants to write programs to generate English
sentences.
The input specification mechanism for the current thesis is modelled on the
SimpleNLG XML Realizer framework. The input for the current thesis is an XML
file which is similar, but not identical, to the SimpleNLG XML input design. A simple
The SimpleNLG XML Realizer walks through the Input Specification in a top-down
during the traversal. The nodes in the Input Specification provide the information as
1. The XML realizer framework uses the code generation tools xjc (for Java)
and xsd (for C#) to generate wrapper classes for the relevant elements in the
schema. A client application can invoke SimpleNLG to get the realized text
through the wrapper classes, which act as Data Transfer Objects. Wrapper
classes have the same names as the real SimpleNLG classes, with the prefix
“Xml”. Wrapper classes need to be generated only once, and only if changes
simplenlg.xmlrealiser.wrapper.XmlDocumentElement object that represents
Mann and Matthiessen (1985) carried out extensive research in text generation and
created the Nigel software within the framework of systemic linguistics. Along with the
specification of the functions and structures of English, Nigel also has a semantic stratum
generation tool box. It was first developed for French language during 1993-94. The
English and Spanish versions were also developed later. The main characteristics of
and customizability.
2) Its ability to produce an extensive set of different text structures due to its
Paths Generation” which is a hybrid generator that takes the help of statistical
Lavoie and Rambow (1997) developed a new off-the-shelf realizer, REALPRO,
which is derived from existing systems with a new design and completely new
2) It can run as a stand-alone server and has both C++ and Java APIs.
for multilingual natural language generation. KPML provides the following features
to generation processes:
1) A set of standardized linguistic resources useful for text generation which are
always improving.
engine. The grammars which come as input to this linearization engine are written in
target language into programs that accept feature graphs as input and generate word
lattices. The word lattices are passed as input to the statistical extraction module of
Irene Langkilde (2000) proposed a Forest-Based Statistical Sentence Generator in
as trees or forests.
Susan W. McRoy, Songsak Channarukul, and Syed S. Ali in 2000 proposed YAG
the interactive context. It does not require the extensive knowledge of the grammar
of the target language or all possible output strings ahead of time. YAG provides
Gatt and Reiter (2009) created SimpleNLG, a Java API and realization engine for
English, with the aim of providing simple and robust interfaces to generate syntactic
structures and linearize them. This realization engine is the main source of
inspiration for the work reported in this thesis. Therefore details of SimpleNLG are
describes the characteristics of German language and the changes made to the
SimpleNLG framework to meet the requirements of the relatively free word order
German language.
Pierre-luc Vaudry and Guy Lapalme 2013 report Adapting SimpleNLG for bilingual
SimpleNLG-EnFr is the name given to the bilingual realisation engine. This paper
describes the general characteristics of the software and the adaptations made for the
French Language.
Rodrigo de Oliveira and Somayajulu Sripada (2014) report Adapting SimpleNLG for
Brazilian Portuguese realisation, which describes the ongoing implementation and the
Portuguese.
SimpleNLG. The paper gives some details about the grammar and the lexicon
employed by the system and reports some results about a first evaluation based on a
Damani in 2007 created HinD, a Hindi generation software from Interlingua (UNL).
syntax planning.
Uma Maheshwar Rao, G. and Christopher Mala (2011) presented Telugu Word
Synthesizer which is a generic engine that can be used for any language by plugging
in a specific language database. The generator synthesizes all and only the well-
formed word forms. The generator engine is independent of language and works
Vishal Goyal and Gurpreet Singh (2011) developed a Hindi to Punjabi machine
translation system. The key activities involved during translation process are pre-
processing, translation engine and post processing. Lookup algorithms and pattern
matching algorithms formed the basis for solving these issues. The system accuracy
has been evaluated using intelligibility test, accuracy test and BLEU score. The
Johann Wolfgang von Goethe (1749-1832), a German poet, novelist, and philosopher,
was the first person to coin the term morphology. He coined it in the early nineteenth
century in a biological context. The term morphology originated from the Greek
where morph- means ‘shape or form’ and morphology is the study of morphs or
forms. In linguistics morphology is the branch which deals with words, their internal
A number of theoretical models have been developed over the years. Each model has
a specific set of claims about the nature of morphology and specific focuses in terms
of data that is covered by the theory. A lot of research is done on various aspects of
(Hockett, 1954) and (Stump, 2001). A description of their theories is provided in the
following sections.
2.4.1.1 Two Models of Grammatical Description
models of grammatical description” was proposed by Hockett (1954), namely the Item
and Arrangement model (IA) and the Item and Process model (IP). He also mentioned
the Word and Paradigm approach (WP), which is the oldest of the three
approaches.
they result in six classes of conjugation types out of which five classes are regular
verbs and one irregular class of verbs. The classes I, II and III of regular verbs have
ten subclasses. In a word such as ammu-wA-nu (I will sell), the morphemes are
ammu-,-wA, and -nu where ammu- is the root, -wA is the tense-mode suffix, and –
based morphology a word form is assumed to be a result of applying rules that alter a
stem to produce a new one. Inflectional rules, derivational rules and compounding
In Telugu language, inflected word forms are derived by a set of sandhi rules
verbs falls into two types regular, and irregular. The regular verb pilus-wA-nu (I will
produce pilus-. Some of the irregular verb stem variants are derived by lexical
approach. This theory instead of stating rules to combine morphemes to form words
paradigms.
classified into twelve paradigmatic classes of regular verbs and ten irregular verbs.
Each class is described by giving a verb paradigm for that class. The word ‘pilucu’
(to call) represents one class of the twelve regular paradigmatic classes and all the
words that have the same inflectional paradigm like ‘kalcu’ (to burn) come under
this paradigmatic class. The irregular verbs do not have a paradigmatic class; instead,
each verb is treated separately. The word ‘icc’ (to give) is one among the ten
2.4.1.2 A Two Dimensional Taxonomy of Morphology
situated relative to one another. He proposed the lexical/inferential axis and the
listed, and are therefore subject to the same principles of lexical insertion as ordinary
lexical morphemes. In a lexical theory, the Telugu verb form “pAduwAnu” arises
through the insertion of the lexically listed morphemes “pAdu”, “wA”, and “nu” into
infer inflectionally complex word forms from more basic stems or from other word
forms; inflectional morphemes are not listed in the lexicon, but are the mark of a
particular step in the inference of a complex word form. Such inferences may be
form “pAdu”.
person’ and ‘singular’ through the lexical insertion of the agreement suffix “vu” or
by means of a rule inferring “pAduwAvu”, from the perfective stem “pAdu”. Thus,
that determines the lexical insertion of its affixes (if the theory is lexical) or
determines the rules by which it is inferred from a stem or related word form (if the
perfective indicative active}⟩ licenses either the lexical insertion of the morphemes
pAd, u, wA and vu, or the stem based chain of inferences pAd → pAdu → pAduwA
properties with which a lexeme L may be associated, and for each such property set
σ, the morphology of the language defines the word form realizing the pairing (L, σ).
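The inferential view can be illustrated with a small Python sketch in which a word form is computed from a root and a property set by a chain of rules, rather than assembled from listed morphemes. This is only a toy under stated assumptions: the function name infer_word_form is hypothetical, the rule inventory contains just the formatives mentioned in the text ("u", "wA", "nu", "vu"), the person labels attached to the agreement suffixes are assumed for illustration, and no sandhi rules are modelled.

```python
# Toy agreement table: only the two suffixes mentioned in the text.
# The person/number labels are assumptions made for this sketch.
AGREEMENT_SUFFIX = {
    ("first", "singular"): "nu",
    ("second", "singular"): "vu",
}


def infer_word_form(root, person, number):
    """Return the chain of inferred forms from the root to the full word form."""
    stem = root + "u"                                   # pAd -> pAdu
    tensed = stem + "wA"                                # pAdu -> pAduwA
    word = tensed + AGREEMENT_SUFFIX[(person, number)]  # pAduwA -> pAduwAvu
    return [root, stem, tensed, word]


print(" -> ".join(infer_word_form("pAd", "second", "singular")))
# prints: pAd -> pAdu -> pAduwA -> pAduwAvu
```

Here the pairing (L, σ) corresponds to the arguments of the function: the lexeme is given by its root and σ by the person and number values.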
grammars and stochastic models and algorithms. Finite state automata and
discussed in the previous section (Section 2.4.1.1.1). In the current thesis the
morphological generation of Telugu words uses finite automata to check for patterns
2.4.2.1 Finite State Morphology
and implemented. In a nutshell, FSM provides the tools for turning rules into
laid out (Johnson 1972) and, independently in the early 1990s, (Kaplan & Kay
1994). They show that rewrite rules are equivalent in power to Finite State
transducers, which are a variant of Finite State automata that linguists are more
familiar with. Instead of accepting or rejecting a single string, as in the case of Finite
State automata, a Finite State transducer accepts or rejects two strings whose letters
are pair-matched, while still retaining the Markovian property of Finite State
transitions. As a result, Finite State transducers are simple, well understood and easy
rewrite rules can in principle be automatically compiled into a single Finite State
transducer, thus capturing the mapping from the underlying form to the surface form
between lexical and surface levels, rather than as rules applied in serial order. Ever
since gaining prominence in the 1980s, Two-Level Morphology has become a staple
in computational linguistics. But it is not the easiest tool to use. The two-level
represent serial rules as parallel constraints. This can be a highly unintuitive and
In the current thesis, finite state automata are used extensively in the morphology
engine. They are used in pattern matching to categorize a given
word into one of a number of classes specified by the grammar rules in
(Krishnamurti 1985). All the grammar rules implemented by the Telugu morphology
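To make the idea of pattern matching with a finite state automaton concrete, the Python sketch below hand-codes a three-state automaton that accepts words ending in "cu" and uses it to assign a class label. The ending pattern, the class labels and the function names are placeholders chosen for illustration; they are not the classes defined in (Krishnamurti 1985) and not the thesis implementation.

```python
# States record how much of the ending "cu" has just been read.
START, SEEN_C, SEEN_CU = 0, 1, 2


def step(state, ch):
    """Transition function of the automaton."""
    if ch == "c":
        return SEEN_C
    if ch == "u" and state == SEEN_C:
        return SEEN_CU
    return START


def ends_in_cu(word):
    """Run the automaton over the word; accept if it halts in SEEN_CU."""
    state = START
    for ch in word:
        state = step(state, ch)
    return state == SEEN_CU


def classify(word):
    # Hypothetical class labels used only in this sketch.
    return "cu-final" if ends_in_cu(word) else "other"


for verb in ["pilucu", "kalcu", "ammu"]:
    print(verb, classify(verb))
```

A realistic engine would need one such pattern per class and a defined order in which the patterns are tried; this sketch shows only the recognition step.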
Guido Minnen, John Carroll and Derren Pearce (2000) developed a robust applied
from a given specification of a lemma, a part of speech label and an inflectional type.
The generator was built using data from several large corpora and machine readable
into applications.
algorithm uses the frequency of occurrences of word forms in a raw corpus. They
feature-structures which are not obvious. Frequency of word forms for each
equivalence class is collected from such data for known paradigms. In this algorithm,
when the morphological analyser cannot recognize an inflectional form, the
possible stem and paradigm are guessed using the corpus frequencies. The method
assumes that the morphological package makes use of paradigms. This package was
able to guess the stem-paradigm pair for an unknown word. This method only depends
on the frequencies of the word forms in raw corpora and does not require any
linguistic rules or tagger. The performance of this system depends on the size of
the corpora.
Madhavi Ganapathiraju and Lori Levin (2006) presented a rule based morphological
generator for Telugu nouns and verbs. The implementation was a Perl program
suffix list and concatenation algorithm for Telugu morphological generator which
extraction of the suffix list and efficient algorithm for concatenating the lemma and
the morphological features. The preliminary results obtained from this system are
significant.
generator and analyser using a trie-based data structure. The generator and analyser
can handle up to a maximum of 3700 root words and around 88K inflected words.
using APERTIUM tool kit. This attempt involves a practical adoption of lttoolbox
for the modern standard written Tamil in order to develop an improvised open source
Morphological Analyzer and generator. The tool uses the computational algorithm
Finite State Transducers (FST) for one-pass analysis and generation, and the
semi-supervised approach to the problem of paradigm induction from inflectional
tables. The system extracts generalizations from inflectional tables representing and
unseen words.
Girija V R and T Anuradha (2017) report the design of a morphological analyser for
Malayalam modelled on finite state techniques that can be used for text analysis
where the model recognizes and strips the morphemes in a string of text.
2.6 Summary
This chapter presents a literature review of all the subtasks in Natural Language
CHAPTER-3
TELUGU REALIZATION ENGINE OVERVIEW
3.1 Introduction
Telugu is a Dravidian language with nearly 100 million first language speakers. It is
a morphologically rich language (MRL) with a simple syntax where the sentence
constituents can be ordered freely without impacting the primary meaning of the
sentence. In this thesis we describe a surface realization engine for Telugu. Surface
realization is the final subtask of an NLG pipeline (Reiter and Dale, 2000) that is
responsible for mechanically applying all the linguistic choices made by upstream
architecture for Natural Language Generation Systems became very important. In the
early days of work in NLG a distinction between ‘strategy’ and ‘tactics’ was made,
where the strategy is concerned with determining ‘what to say’ and tactics are
concerned with deciding ‘how to say it’. The result of this distinction is a particular
modularization where NLG systems had two specific tasks referred to as text
construct NLG systems with an additional module in between the two modules
in Fig 3.1). The use of embedded graphics, formatting mark-ups, and hypertext links
motivated the use of the term document planner in place of text planner. Finally, the
linguistic realizer is more generally termed the surface realizer to acknowledge the
fact that the surface forms are not always linguistic in nature.
this information is to be structured for presentation. There are two tasks performed in
structuring the information for presentation. The input for the document planner is a
is the user model and d is the discourse history. The Document Planner takes the
a) Construction of messages from the information source;
communicative goal;
3.1.1.2 Microplanning
A Document Plan specifies the final structure and content of the text to be generated
in very broad terms. The purpose of the Microplanner is to refine the document plan to
produce a more fully specified text specification from the many possible output texts that are
compatible with the document plan. The Microplanner performs the following
subtasks:
grammar. It takes as input an abstract specification of the text. There are three kinds
Syntactic Realization: Syntactic realization uses grammatical knowledge to choose
inflections based on grammatical features, add function words if required, and decide
the order of the components. For example, in Telugu the object of the sentence
usually precedes the verb and the syntactic realizer has to take care of the order.
forms of the words depending on the grammatical features. For example the plural
The abstract specification of the text that comes as input to the surface realizer is the
text specification which is constructed from the document plan by the microplanner.
The text specification describes what text is to be generated and how the text is to be
formatted. Thus there are two distinct aspects to the processing of text specifications:
one is concerned with mapping logical constructs (specific formatting constructs in the
text) to appropriate document formatting, and the other is concerned with the
objects such as phrases) so that the final surface forms are realized.
The process of generating the surface forms from the phrase specification is the area
which the phrase specification abstracts away from the actual surface forms. The
Our Telugu realization engine is designed following the SimpleNLG (Gatt and
Reiter, 2009) approach which recently has been used to build surface realizers for
German (Marcel Bollmann, 2011), Filipino (Ethel Ong et al., 2011), French (Vaudry
and Lapalme, 2013) and Brazilian Portuguese (Rodrigo de Oliveira and Sripada,
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<document>
  <sentence type=" " predicatetype="verbal" respect="no">
    <nounphrase role="subject">
      <head pos="pronoun" gender="human" number="plural" person="third" casemarker=" " stem="basic">vAdu</head>
    </nounphrase>
    <nounphrase role="complement">
      <modifier pos="adjective" type="descriptive" suffix="aEna">aMxamu</modifier>
      <head pos="noun" gender="nonmasculine" number="singular" person="third" casemarker="lo" stem="basic">wota</head>
    </nounphrase>
    <verbphrase type=" ">
      <modifier pos="adverb" suffix="gA">neVmmaxi</modifier>
      <head pos="verb" tensemode="presentparticiple">naducu</head>
    </verbphrase>
  </sentence>
</document>
Figure 3.2. Example XML Input Specification
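To make the structure of this input specification concrete, the following Python sketch reads a specification of the kind shown in Figure 3.2 into plain dictionaries. It is only an illustration and not the engine's own Input Reader: the function name read_spec and the dictionary layout are assumptions made for this example, while the tag and attribute names are taken from the figure.

```python
import xml.etree.ElementTree as ET

# The specification of Figure 3.2 (XML declaration omitted for brevity).
SPEC = """<document>
  <sentence type=" " predicatetype="verbal" respect="no">
    <nounphrase role="subject">
      <head pos="pronoun" gender="human" number="plural" person="third" casemarker=" " stem="basic">vAdu</head>
    </nounphrase>
    <nounphrase role="complement">
      <modifier pos="adjective" type="descriptive" suffix="aEna">aMxamu</modifier>
      <head pos="noun" gender="nonmasculine" number="singular" person="third" casemarker="lo" stem="basic">wota</head>
    </nounphrase>
    <verbphrase type=" ">
      <modifier pos="adverb" suffix="gA">neVmmaxi</modifier>
      <head pos="verb" tensemode="presentparticiple">naducu</head>
    </verbphrase>
  </sentence>
</document>"""


def read_spec(xml_text):
    """Return (sentence_features, phrases) for the first sentence in the spec.

    Hypothetical helper: each phrase becomes a dict holding its tag, its role
    (None for the verb phrase) and a list of head/modifier dicts.
    """
    sentence = ET.fromstring(xml_text).find("sentence")
    phrases = []
    for phrase in sentence:
        words = [
            {"kind": child.tag, "root": (child.text or "").strip(), **child.attrib}
            for child in phrase
        ]
        phrases.append({"tag": phrase.tag, "role": phrase.get("role"), "words": words})
    return dict(sentence.attrib), phrases


features, phrases = read_spec(SPEC)
```

Here features holds the sentence-level attributes (type, predicatetype, respect) and each entry of phrases carries the lexical and grammatical features of its head and modifiers.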
Several realizers are available for English and other European languages (Gatt and
Reiter, 2009; Vaudry and Lapalme, 2013; Marcel Bollmann, 2011; Elhadad and
Robin, 1996). Some general purpose realizers (as opposed to realizers built as part of
an MT system) have started appearing for Indian languages as well. Smriti Singh et
al. (2007) report a Hindi realizer that includes functionality for choosing post-
the realization engine reported in the current chapter which assumes that choices of
constituents, root words and grammatical features are all preselected before the
realization engine is called. Our review of the literature found no realization
engines for Telugu. However, a rich body of work exists for Telugu language
processing in the context of machine translation (MT). In this context, earlier work
reported Telugu morphological processors that perform both analysis and generation
(Badri et al., 2009; Rao and Mala, 2011; Ganapathiraju and Levin, 2006) but none of
the authors reported a surface realization engine for Telugu language. Here in this
Realizers are classified based on the source of grammatical knowledge. There are
linguistic theory (Elhadad and Robin, 1996). There have also been realizers that use
and Oxygen (Habash, 2000). While linguistic theory based grammars are attractive,
1985). Besides, non-linguists (most application developers) may find working with
such theory-heavy realizers difficult because of the steep initial learning curve.
necessary to adopt grammar engineering strategies that have low initial costs. The
knowledge corresponding to only the most frequently used phrases and clauses and
therefore involve low cost grammar engineering. The main features of a realization
2. A light syntax module that offers functionality to build frequently used phrases
and clauses without any commitment to a linguistic theory. The large uptake of
the SimpleNLG realizer in both academia and industry shows that
3. Using ‘canned’ text elements to be directly dropped into the generation process
4. A rich set of lexical and grammatical features that guide the morphological and
The current work follows the SimpleNLG framework. However, because of the
known differences between Telugu and English, the SimpleNLG codebase could not be
reused for building the Telugu realizer. Instead, our Telugu realizer was built from
scratch, adapting several features of the SimpleNLG framework to the context of
Telugu. There are significant variations between spoken and written usage of Telugu.
There are also significant dialectal variations; the most prominent ones correspond to
the four regions of the state of Andhra Pradesh, India – Northern, Southern, Eastern
from other Indian languages such as Urdu and Hindi. As a result, a design choice for
Telugu realization engine is to decide the specific variety of Telugu whose grammar
and vocabulary need to be represented in the system. In our work, we use the
have decided to include only a small lexicon in our realization engine. This is
because host NLG systems that use our engine could use their own application
specific lexicons. Moreover, modern Telugu has been absorbing large amounts of
As stated in section 3.2, a critical step in building a realization engine for a new
presented in our chosen grammar reference (Krishnamurti and Gwynn 1985). From a
3. Telugu, like many other Indian languages, is not governed by a phrase structure
grammar but instead fits better into the Paninian Grammar Formalism (Bharati et al.,
1995), which uses dependency grammar. This means that dependency trees represent
the structure of phrases and sentences. At the sentence level the verb phrase is the
head and all the other constituents have a dependency link to the head. At the
Several grammatical and semantic features are used to define agreement rules.
Well-formed Telugu sentences are the result of applying agreement rules at the
Based on the above observations we found that the SimpleNLG framework with its
features mentioned in section 3.2 is a good fit for guiding the design of our Telugu
realization engine. Thus our realization engine is designed with a wide coverage
morphology module and a light-weight syntax module where features play a major
Having decided on the SimpleNLG framework for representing and operationalizing the
grammatical knowledge, we made the following design decisions while building
our Telugu realizer (we believe that these decisions might drive the design of realizers
2. Define the tag names and the feature names used in the input XML file
that using English terminology for specifying input to our Telugu
grammar.
placed at the end) using the same order in which they are specified in the
3.3.1 WX-Notation
representing Indian languages in the ASCII character set. This scheme is widely used
lower-case letters are used for unaspirated consonants and short vowels while
upper-case letters are used for aspirated consonants and long vowels. The
voiceless and voiced retroflex consonants are mapped to ‘t, T, d and D’. The
dentals are mapped to ‘w, W, x and X’. Hence the name of the scheme “WX”,
There is a wide range of approaches to specifying the input to linguistic realization. The
inputs, starting with the most abstract representation, are discussed in the following
sections. The following example will be given as input in the Input Specification
rAmudu kowini karrawo kottAdu. (Ramu beat the monkey with a stick.)
example sentence in section 3.3.2. This representation does not say anything about
the content of the individual noun phrases in the sentence. This representation only
indicates that an event of “kottu” (beat) happened and identifies three participants in
3.3.2.2 Meaning Specification
The representation of Figure 3.4 does not specify that the object of action “kottu”
(beat) was “kowi” (monkey). This and other information omitted by the Skeletal
The Microplanner not only selects the required elements in the knowledge base for
inclusion in the text to be generated, but also takes certain decisions about the
structure into which the information will be placed. The result of this is a
The structure presented in the previous section is still abstract because many
realizers expect the selection of base lexemes to be used to express the semantic
content to be carried out by the previous stage. Once these decisions are made, the
usefulness of the content of the index and features is exhausted. These features are
Figure 3.5 Lexicalized Case Frame
The base lexemes used in this representation, still need to go through the
In certain cases, it may be appropriate for the processes carried out before the
structure. For example, some additional information is added to Figure 3.5 about
specifies that the second argument is to be in focus then the realizer produces the
following sentence:
The role of a realizer in such a case becomes very simple only to encode the
representation.
Figure 3.6 Abstract Syntactic Structures
predetermined and stored directly as text strings. For example, the closing salutation
of a letter like “aMxariki suBAkAMkRalu” can be stored as a text string. Figure 3.7
. A text specification, together with its children (for example, SPhraseSpecs) can be
expressed in XML, based on a predefined XML schema that mirrors the relevant
The patient as a result of the procedure had an adverse contrast media reaction, had
The input to the Telugu surface realization engine is a tree structure specified in
Figure 3.2. The root node is the sentence and the nodes at the next level are the
constituent phrases that have a role feature representing the grammatical functions
such as subject, verb and complement performed by the phrase. Each of the lower
level nodes could in turn have their own head and modifier children. Each node can
also take attributes that represent grammatical or lexical features such as number
and tense.
For example the subject node in Figure 3.2 can be understood as follows:
<nounphrase role="subject">
  <head pos="pronoun" gender="human" number="plural" person="third" casemarker=" " stem="basic">vAdu</head>
</nounphrase>
This node represents the noun phrase that plays the role of subject in the sentence.
The subject node has only one child, the head, and its case is nominative. The
lexical features of the head “vAdu” are part-of-speech (pos) which is pronoun,
person which is third person, number which is plural, gender which is human and
The sentence construction for Telugu involves the following three steps:
3. Apply sentence level agreement by applying agreement rules selected
Our system architecture, shown in Figure 3.9, comprises a morphology engine, a
phrase builder and a sentence builder corresponding to these three steps. The rest of
the section presents how the example sentence of section 3.1.1.3 is generated from
The Input Reader is the module which acts as an interface between the sentence
builder and the input. Currently the input reader accepts only our XML input
specification but in the future we would like to extend it to accept other input
specifications such as SSF (Bharati et al., 2007). This module ensures that the rest of
3.6 Sentence Builder
The Sentence Builder is the main module of the current system which has a
centralized control over all the other modules. It performs four subtasks:
subject, object, complement, and verb which are defined as features of the
respective phrases in the input. It then calls the appropriate element builder for
each of these to create element objects which store all the information extracted
2. These element objects are then passed to appropriate phrase builder to receive
back a string which is the phrase that is being constructed according to the
3. After receiving all the phrases from the appropriate phrase builders the Sentence
language the verb agrees with the argument in the nominative case. Therefore the
predicate inflects based on the gender, person and number of the noun in the
nominative case. There are three features at the sentence level, namely type,
predicate-type, and respect. The feature type refers to the type of the sentence.
The current work handles only simple sentences, so this feature is not set to any
value. The feature predicate-type can take one of three values, namely
verbal, nominative, and abstract. The feature respect can take the values yes or no.
4. Finally, the sentence builder orders the phrases in the same order they are
In the case of the example in Figure 3.2 the sentence builder finds three grammatical
functions: one finite verb, one locative complement, and one nominative subject. In
the example input in section 3.1.1.3 the value of the feature predicate-type is
“verbal” and the value of respect is “no”. The Sentence Builder retrieves the appropriate rule
from an externally stored agreement rule base. In the example input, where
predicate-type is set to verbal, the number of the subject is plural and
the gender is human, the Sentence Builder retrieves the appropriate suffix “nnAru”.
This suffix is then agglutinated to the verb “naduswu”, which is returned by the
morphology engine, to generate the final verb form “naduswunnAru” with the
After the construction of the sentence the Sentence Builder passes it to the Output
The element builder of each grammatical function checks for lower level functions
like head and modifier and calls the appropriate element builder for the head and
modifier which converts the lexicalized input into element objects with the
grammatical constituents as their instance variables and returns the element objects
back to the Sentence Builder. Our realizer creates four types of element objects
The subject in the example sentence of section 3.1.1.3 is “vAdu” for which a
created for the complement “wota” and its modifier “aMxamu” which is an
AdjectiveElement. Finally a VAElement is created for the verb “naducu” and the
well-formed phrases or word groups. In Telugu the main and auxiliary verbs occur
engine. Telugu sentences are mainly made up of four types of phrases - Noun Phrase,
Verb Phrase, Adjective Phrase, and Adverb Phrase. Noun phrases and verb phrases
are the main constituents in a sentence while the Adjective Phrase and the Adverb
Phrase only play the role of a modifier in a noun or verb phrase. There is one feature
at the Noun Phrase level “role” which specifies the role of the Noun Phrase in the
sentence. The phrase builder passes the elements constructed by the element builder
to the morphology engine and gets back the respective phrases with appropriately
inflected words. In the example input in section 3.1.1.3, there are three constituent
phrases, viz, two noun phrases for subject and complement and a verb phrase. One of
the noun phrases also contains an adjective phrase which is an optional modifying
element of noun heads in head-modifier noun phrases. The adjective phrase may be a
single element or sometimes composed of more than one element. The verb phrase
verb. The phrase builder passes five objects i.e., two SOCElement objects, one
the morphology engine and gets back five inflected words which finally become
three phrases, viz, two noun phrases “vAlYlYu”, “aMxamEna wotalo”, and one verb
3.9 Morphology Engine
The morphology engine is the most important module in the Telugu realization
engine. It is responsible for the inflection and agglutination of the words and phrases.
The morphology engine behaves differently for different words based on their part of
speech (pos). The morphology engine takes the element object as the input, and
returns to the phrase builder the inflected or agglutinated word forms based on the
rules of the language. In the current work the morphology engine is a rule-based engine
with a lexicon to account for exceptions to the rules. The rules used by the
externally.
3.9.1 Noun
Noun is the head of the noun phrase. Telugu nouns are divided into three classes
namely (i) proper nouns and common nouns, (ii) pronouns, and (iii) special types of
nouns (e.g. numerals) (Krishnamurti and Gwynn, 1985). All nouns except a few
special-type nouns have gender, number, and person. Noun morphology mainly involves
plural formation and case inflection. All the plural formation rules from
sections 6.11 to 6.13 of our grammar reference have been implemented in our
engine. The head of the complement in the example of section 3.1.1.3 is the noun
“wotalo”. The word “wota” along with its feature values can be written as follows:
The formation of this word is very simple because the word “wota” in its singular
form and the case marker “lo” get agglutinated through a sandhi (a morpho-
‘wota’+lo----- wotalo
3.9.2 Pronoun
Pronouns vary according to gender, number, and person. There are three persons in
Telugu, namely first, second, and third. The gender of nouns and pronouns in
Telugu depends on the number. The relation between number and gender is shown
in Table 3.1.
Number Gender
Singular masculine, non-masculine
Plural human, nonhuman
Table 3.1: Relationship between Number and Gender
Plural formation of pronouns is not rule based. Therefore they are stored externally
in the lexicon. The first person pronoun “nenu” has two plural forms “memu” which
is the exclusive plural form and “manamu” which is the inclusive plural form. In the
generation of the plural of the first person a feature called “exclusive” has to be
specified with the value “yes”, or “no”. Along with gender, number, and person there
is one more feature which is stem. The stem can be either basic or oblique. The
formation of the pronoun “vAlYlYu” in the example of section 3.1.1.3 which is the
head of the subject along with its feature values can be written as follows:
In this case the stem is basic. The gender of the pronoun is human because the
number is plural, as shown in Table 3.1. The word “vAlYlYu” is retrieved from the
lexicon as the plural of the word “vAdu” with the given feature values.
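The pronoun handling described in this section can be sketched as a lexicon lookup guarded by the number and gender constraint of Table 3.1. This is a minimal illustration under stated assumptions: the function name pronoun_form, the in-memory dictionary and its key layout are hypothetical, the lexicon contains only the forms quoted in the text, and the stem feature (basic or oblique) is not modelled.

```python
# Table 3.1: gender values that are legal for each number.
GENDERS_BY_NUMBER = {
    "singular": {"masculine", "nonmasculine"},
    "plural": {"human", "nonhuman"},
}

# Hypothetical in-memory stand-in for the external pronoun lexicon.
# Key: (singular root, value of the "exclusive" feature); None means the
# feature does not apply to that pronoun.
PRONOUN_PLURALS = {
    ("vAdu", None): "vAlYlYu",
    ("nenu", "yes"): "memu",    # exclusive first person plural
    ("nenu", "no"): "manamu",   # inclusive first person plural
}


def pronoun_form(root, number, gender, exclusive=None):
    """Return the basic-stem form of a pronoun for the given features."""
    if gender not in GENDERS_BY_NUMBER[number]:
        raise ValueError(f"gender {gender!r} is not valid with number {number!r}")
    if number == "singular":
        return root
    # Plurals are not rule based, so they are looked up rather than derived.
    return PRONOUN_PLURALS[(root, exclusive)]
```

For the example sentence, pronoun_form("vAdu", "plural", "human") returns the subject head “vAlYlYu”, while an inconsistent request such as plural number with masculine gender is rejected before any lookup takes place.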
3.9.3 Adjective
Adjectives occur most often immediately before the noun they qualify. The basic
adjectives or the adjectival roots which occur only as adjectives are indeclinable (e.g.
oka (one), ara (half)). Adjectives can also be derived from other parts of speech like
3.1.1.3 is a derived adjective formed by adding the adjectival suffix “aEna” to the
noun “aMxamu”. The formation of the word “aMxamEna” in the example of section
The current work does not take the type of an adjective into consideration; this will
be included in a future version. The formation of this word is again through a sandhi
formation as follows:
aMxamu+aEna-------- aMxamEna
Here the sandhi formation eliminates the “u” in the first word; “a” in the second
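The vowel elision in this example can be sketched in a few lines of Python. The rule below is generalized from the single example given here (drop the final “u” of the noun and the initial “a” of the suffix); it is an assumption made for illustration, not the full set of sandhi rules implemented in the engine, and the function name is hypothetical.

```python
def join_adjectival_suffix(noun, suffix):
    """Attach an adjectival suffix, eliding the two vowels that meet at the join.

    Assumption: only the pattern of the example (final "u" + initial "a")
    is handled; other sandhi patterns are outside this sketch.
    """
    if noun.endswith("u"):
        noun = noun[:-1]        # "aMxamu" -> "aMxam"
    if suffix.startswith("a"):
        suffix = suffix[1:]     # "aEna" -> "Ena"
    return noun + suffix


adjective = join_adjectival_suffix("aMxamu", "aEna")
```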
3.9.4 Verb
Telugu verbs inflect to encode gender, number and person suffixes of the subject
along with tense mode suffixes. As already mentioned gender, number and person
agreement is applied at the sentence level. At the word level, verb is the most
being agglutinated with the tense-aspect-mode suffix (TAM). Telugu verbs are
classified into six classes (Krishnamurti, 1961). Our engine implements all these
classes and the phonetic alternations applicable to each of these classes are stored
externally in a file. The verb phrase in the example of Figure 3.2 has one verb, “naducu”,
along with its feature values. The formation of the verb “naduswu” can be written as
follows:
The word “naducu” belongs to class IIa, for which the phonetic alternation is to
substitute “cu” with “s”, and therefore the word gets inflected as follows:
naducu----------------nadus
The tense mode suffix for present participle is “wu”, and the word becomes
“naduswu”. The gender and number of the subject also play a role in the formation
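The two stages of verb formation, word-level inflection in the morphology engine followed by sentence-level agreement in the Sentence Builder, can be sketched as follows. The three tables are hypothetical in-memory stand-ins for the externally stored rule files and contain only the single entries quoted in the text (class IIa, the present participle suffix and the “nnAru” agreement suffix); the function names are likewise assumptions.

```python
# Assumption: only the entries mentioned in the text are listed.
STEM_ALTERNATIONS = {"IIa": ("cu", "s")}            # verb class -> (old ending, new ending)
TAM_SUFFIXES = {"presentparticiple": "wu"}          # tense-mode -> suffix
# (predicate type, person, number, gender) of the nominative subject -> suffix
AGREEMENT_SUFFIXES = {("verbal", "third", "plural", "human"): "nnAru"}


def verb_stem(root, verb_class, tensemode):
    """Word-level step: apply the class alternation, then add the tense-mode suffix."""
    old, new = STEM_ALTERNATIONS[verb_class]
    if not root.endswith(old):
        raise ValueError(f"{root!r} does not end in {old!r} as class {verb_class} requires")
    return root[: -len(old)] + new + TAM_SUFFIXES[tensemode]


def finite_verb(root, verb_class, tensemode, predicate_type, person, number, gender):
    """Sentence-level step: agglutinate the agreement suffix selected by the subject."""
    suffix = AGREEMENT_SUFFIXES[(predicate_type, person, number, gender)]
    return verb_stem(root, verb_class, tensemode) + suffix


participle = verb_stem("naducu", "IIa", "presentparticiple")
final_form = finite_verb("naducu", "IIa", "presentparticiple",
                         "verbal", "third", "plural", "human")
```

Keeping the agreement table separate from the stem tables mirrors the division of labour described in this chapter, where agreement is applied at the sentence level rather than inside the morphology engine.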
3.9.5 Adverb
All adverbs fall into three semantic domains, those denoting time, place and manner
(Krishnamurti and Gwynn 1985). The adverb “neVmmaxigA” in the example (1) is a
manner adverb as it tells about the way they are walking “neVmmaxigA
“neVmmaxigA” in the example (1) along with its feature values can be written as
follows:
“neVmmaxi”, adverb,“gA”-------------neVmmaxigA
3.10 Output Generator
Output Generator is the module which actually generates text in Telugu font. The
Appendix) and gives as output a sentence in Telugu based on the Unicode Characters
for Telugu. The output generated for the example input of Figure 3.2 in Telugu
వాళ్ళు అందమైన తోటలో నెమ్మదిగా నడుస్తున్నారు. (They are walking slowly in the
beautiful garden).
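The conversion performed by the Output Generator can be illustrated with a small WX-to-Unicode converter. This is a sketch under stated assumptions and not the engine's own mapping table: only a partial set of WX symbols is covered, the symbol-to-codepoint pairs follow the Unicode Telugu block, and the function name wx_to_telugu is hypothetical. A consonant directly followed by another consonant receives a virama, and a vowel following a consonant is written as a dependent vowel sign.

```python
VIRAMA = "\u0c4d"  # suppresses the inherent vowel of a consonant

# WX symbol -> (independent vowel, dependent vowel sign); partial table.
VOWELS = {
    "a": ("\u0c05", ""),        "A": ("\u0c06", "\u0c3e"),
    "i": ("\u0c07", "\u0c3f"),  "I": ("\u0c08", "\u0c40"),
    "u": ("\u0c09", "\u0c41"),  "U": ("\u0c0a", "\u0c42"),
    "eV": ("\u0c0e", "\u0c46"), "e": ("\u0c0f", "\u0c47"),
    "E": ("\u0c10", "\u0c48"),  "oV": ("\u0c12", "\u0c4a"),
    "o": ("\u0c13", "\u0c4b"),  "O": ("\u0c14", "\u0c4c"),
}
MARKS = {"M": "\u0c02", "H": "\u0c03"}  # anusvara, visarga
CONSONANTS = {
    "k": "\u0c15", "K": "\u0c16", "g": "\u0c17", "G": "\u0c18",
    "c": "\u0c1a", "C": "\u0c1b", "j": "\u0c1c", "J": "\u0c1d",
    "t": "\u0c1f", "T": "\u0c20", "d": "\u0c21", "D": "\u0c22", "N": "\u0c23",
    "w": "\u0c24", "W": "\u0c25", "x": "\u0c26", "X": "\u0c27", "n": "\u0c28",
    "p": "\u0c2a", "P": "\u0c2b", "b": "\u0c2c", "B": "\u0c2d", "m": "\u0c2e",
    "y": "\u0c2f", "r": "\u0c30", "l": "\u0c32", "lY": "\u0c33", "v": "\u0c35",
    "S": "\u0c36", "R": "\u0c37", "s": "\u0c38", "h": "\u0c39",
}
SYMBOLS = set(VOWELS) | set(MARKS) | set(CONSONANTS)


def _next_symbol(text, i):
    """Return the longest WX symbol starting at position i, or None."""
    for size in (2, 1):
        candidate = text[i:i + size]
        if len(candidate) == size and candidate in SYMBOLS:
            return candidate
    return None


def wx_to_telugu(text):
    out = []
    pending = False  # True while the last consonant has not yet received a vowel
    i = 0
    while i < len(text):
        symbol = _next_symbol(text, i)
        if symbol is None:                      # space, punctuation, unknown character
            if pending:
                out.append(VIRAMA)
                pending = False
            out.append(text[i])
            i += 1
            continue
        if symbol in CONSONANTS:
            if pending:                         # consonant cluster
                out.append(VIRAMA)
            out.append(CONSONANTS[symbol])
            pending = True
        elif symbol in VOWELS:
            independent, sign = VOWELS[symbol]
            out.append(sign if pending else independent)
            pending = False
        else:
            out.append(MARKS[symbol])
            pending = False
        i += len(symbol)
    if pending:
        out.append(VIRAMA)
    return "".join(out)


sentence = wx_to_telugu("vAlYlYu aMxamEna wotalo neVmmaxigA naduswunnAru.")
```

Trying two-character symbols before one-character symbols keeps “eV”, “oV” and “lY” from being split into separate letters.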
3.11 Evaluation
to test the robustness of the realization engine as the input to the realizer changes we
initially ran the engine in a batch mode to generate all possible sentence variations
given an input similar to the one shown in Figure 3.2. In the batch mode the engine
uses the same input root words in a single run of the engine, but uses different
combinations of values for the grammatical features such as tense, aspect, mode,
number and gender in each new run. Although the batch run was originally intended
for software quality testing before conducting evaluation studies, these tests showed
that certain grammatical feature combinations might make the realization engine
produce unacceptable output. This is an expected outcome because our engine in the
current state performs very limited consistency checks on the input. The purpose of
our evaluation is to measure our realizer’s coverage of the Telugu language. One
objective measure could be to measure the proportion of sentences from a specific
text source (such as a Telugu newspaper) that our realizer could generate. As a first
step towards such an objective evaluation, we first evaluate our realizer using
example sentences from our grammar reference. Although not ideal this evaluation
helps us to measure our progress and prepares us for the objective evaluation. The
individual chapters and sections in the book by Krishnamurti and Gwynn (1985)
follow a standard structure where every new concept of grammar is introduced with
the help of a list of example sentences that illustrate the usage of that particular
concept. We used these sentences for our evaluation. Please note that we collect
sentences from all chapters. This means our realizer is required to generate for
example verb forms used in example sentences from other chapters in addition to
those from the chapter on verbs. A total of 738 sentences were collected from
chapters 6 to 26, the main chapters which cover Telugu grammar. Because the
coverage of the current system is limited, we did not expect the system to generate all
of these 738 sentences. Of these, 419 (57%) were found to be within
the scope of our current realizer. Many of these sentences are simple and short. For
each of the 419 selected sentences our realizer was run to generate the 419 output
sentences. The output sentences matched the original sentences from the book
completely. This means at this stage we can quantify the coverage of our realizer as
57% (419/738) against our own grammar source. A more objective measure of
Having built the functionality for the main sentence construction tasks, we are now
in a good position to widen the coverage. The majority of the remaining 319 sentences
(= 738 - 419) involve verb forms such as participles and compound verbs, and medium
to complex sentence types. As stated above, we intend to use this evaluation to drive
our development. This means every time we extend the coverage of the realizer we
will rerun the evaluation to quantify the extended coverage of our realizer. The idea
is not to achieve 100% coverage. Our strategy has always been to select each new
sentence or phrase type to be included in the realizer based on its utility to express
3.12 Summary
This chapter describes a surface realizer for Telugu which was designed by adapting
the SimpleNLG framework for free word order languages. This chapter mainly
focused on the architecture of the Telugu realization engine and the input
different parts of speech, the phrase formation and the sentence formation are only
introduced in this chapter and are discussed in detail in later chapters.
CHAPTER 4
4.1 Introduction
Telugu is a free word order language in which various grammatical categories (case,
for Telugu nouns and pronouns modelled on finite state techniques. The
morphological generator generates the required word form for nouns and pronouns
from an input specification consisting of the lemma and its associated features. The
engine that automates the task of building grammatically correct Telugu sentences.
generating the appropriate verb form by passing the appropriate features (person,
number and gender) required for the formation of the appropriate verb form.
Natural Language Processing (NLP) applications which are growing in number these
days can be categorized into two broad areas namely Natural Language
role and Natural Language Generation (NLG) where Morphological Generators play
and gives as output its root along with its grammatical features. A Morphological
Generator takes a root along with its grammatical features as input and generates the
required word form. Morphological Generators (MG) have a very important role to
2015, Gatt and Reiter, 2009) and Machine Translation (Kristina Toutanova, Hisami
Suzuki, Achim Ruopp, 2008) of free word order languages like Telugu. It is always
advantageous for free word order languages like Telugu to have the Morphological
Generator as a component that is separate from the rest of the NLG system
(Minnen et al., 2000). The current work is a separate module of a surface
realization engine for Telugu (Dokkara, Penumathsa, and Sripada, 2015), a Java
pipeline (Reiter and Dale, 2000). The sentence realization engine for Telugu is
designed following the SimpleNLG (Gatt and Reiter, 2009) approach which is a very
Telugu is a morphologically rich free word order language spoken by people from
the south Indian states of Andhra Pradesh and Telangana. In this paper we describe a
morphology engine which automatically generates the different forms of nouns and
pronouns in Telugu. The current work is modelled on the morphology engine for
Telugu nouns are divided into three classes namely (i) proper nouns and common
nouns, (ii) pronouns, and (iii) special types of nouns (e.g. numerals) (Krishnamurti
and Gwynn, 1985). All nouns except a few special-type nouns have gender, number,
and person. Noun morphology involves mainly plural formation, oblique stem
formation and case inflection. The current work discusses in detail the first two
classes of nouns.
models of grammatical description” was proposed by (Hockett, 1954) which are Item
and Arrangement model (IA) and Item and Process model (IP). Item and
units and morphology is an agglutination of such units to form words. Item and
Process (IP) is considered as a derivational process where new word forms can be
existing model the Word and Paradigm (WP) which is a word based morphological
approach which states generalizations that hold between the different forms of
inflectional paradigms and used in languages like Latin, Greek, and Sanskrit. A two
are computationally equivalent and can be implemented using finite state techniques.
2003, Beesley and Karttunen, 2003, Karttunen and Beesley, 2005, Roark and Sproat,
Among the morphological tools for Indian languages, Goyal and Lehal (2011) report
where all word forms are stored in a relational database. A number of morphological
tools for Tamil are reported by Antony and Soman (2012), ranging from
corpus-based through suffix stripping to finite state techniques. For the Telugu language,
Rao et al. (2011) describe a word-and-paradigm-based morphological analyser and
morphological generator for Telugu. Ganapathiraju et al. (2006) describe a rule-based
grammatical constituents and associated features in the form of an XML file. The
XML file given as input provides the required grammatical information not only at
the sentence level but also at the word level which acts as the input to the
The head of the complement in the example of section 4.2 has one noun “iMtiki”.
The word “illu” given in the XML specification of Figure 4.1 as the head of the noun
phrase which plays the role of a complement in the sentence along with its feature
First the oblique stem of the word “illu” is formed as the word needs to get
agglutinated with the case marker. The formation of the oblique stem is a two-step
process. In the first step the class to which the root word belongs is identified. In the
current work the identification of the class is modelled on finite state techniques. The
root word “illu” belongs to class III. A pictorial representation of the finite automata
Figure 4.2: Finite Automata to identify class III stems for Oblique Stem Formation
In the second step the oblique stem of “illu” which is “iMti” is formed by replacing
“llu” by “Mti”.
After the formation of the oblique stem the case marker gets agglutinated to the
“iMti” + “ki”----- “iMtiki”.
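The two-step oblique stem formation for class III, followed by agglutination of the case marker, can be sketched with a small suffix automaton. This is an illustration under stated assumptions rather than the automaton of Figure 4.2: the automaton here is a trie-shaped DFA built from the three class III endings and reads the stem from its last character backwards, and all function names are hypothetical. Because class III is a closed set of stems, a fuller implementation would also check membership in that set; this sketch tests the ending only.

```python
CLASS_III_ENDINGS = ("nnu", "llu", "lYlYu")


def build_suffix_automaton(endings):
    """Build a trie-shaped DFA that reads a stem from its last character backwards."""
    transitions = [{}]   # state number -> {character: next state}
    accepting = {}       # accepting state -> length of the matched ending
    for ending in endings:
        state = 0
        for ch in reversed(ending):
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting[state] = len(ending)
    return transitions, accepting


TRANSITIONS, ACCEPTING = build_suffix_automaton(CLASS_III_ENDINGS)


def class_iii_ending_length(stem):
    """Step 1: return the length of the class III ending, or 0 if there is none."""
    state = 0
    for ch in reversed(stem):
        if state in ACCEPTING:
            return ACCEPTING[state]
        state = TRANSITIONS[state].get(ch)
        if state is None:
            return 0
    return ACCEPTING.get(state, 0)


def oblique_stem(stem):
    """Step 2: replace the class III ending by "Mti"."""
    length = class_iii_ending_length(stem)
    if length == 0:
        raise ValueError(f"{stem!r} does not have a class III ending")
    return stem[:-length] + "Mti"


def inflect(stem, case_marker):
    """Agglutinate the case marker to the oblique stem."""
    return oblique_stem(stem) + case_marker


word = inflect("illu", "ki")
```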
The formation of the pronoun “vAlYlYu” in the example of section 4.2 which is the
head of the subject along with its feature values can be written as follows:
The formation of plurals for pronouns does not have any rules and therefore they are
stored in a lexicon. The word “vAlYlYu” is retrieved from the lexicon stored in an
external file “pronounplural.xml” as the plural for the word “vAdu” and the feature
values.
The steps taken by the noun or pronoun root to get the required inflection form are as
follows:
Common nouns can be divided into count and non-count nouns. Non-count nouns
(mass nouns, indivisible objects and abstract nouns) cannot be distinguished for
number; they are either singular or plural. Some count nouns do not conform to any
rules of plural formation. The singular and plural forms of the non-count nouns and
the count nouns which do not conform to any rules of plural formation are stored
externally in a lexicon as an XML file named “plural.xml”. Some mass nouns that
Word Meaning in English
uppu Salt
nUne Oil
inumu Iron
veVMdi Silver
biyyaM Rice
janaM People
Table 4.1: Example Mass Nouns in Singular
Some mass nouns that exist only in the plural form are given in Table 4.2.
Indivisible objects cannot have both singular and plural forms. Some indivisible
Some example abstract nouns which are non-count nouns are shown in Table 4.4.
Among the count nouns some nouns do not have any rules for the formation of
plurals. Table 4.5 is a list of count nouns which do not conform to any rules for the
formation of plurals:
Singular Word    Plural Word
rAyi             rAlYlYu
poyyi            poyyilu
peMdli           peMdliMdlu
vari             vadlu
gAru             gArlu
sAri             sArlu
kumArudu         kumArulu
eVxxu            eVdlu
veVyyi           velu
cenu             celu
penu             pelu
kAdi             kAMdlu
jIwagAdu         jIwagAlYlYu
alludu           allulYlYu
manamarAlu       manamarAlYlYu
ceVlleVlu        ceVlleVlYlYu
kUwuru           kUwulYlYu
koVdavali        koVdavalYlYu
rAwri            rAwrilYlYu
Table 4.5: Count Nouns with no rules for Plural Formation
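The lexicon-first strategy for plural formation can be sketched as follows. The dictionary is a hypothetical in-memory stand-in for the external lexicon file and holds only four rows of Table 4.5. The only rule branch shown is class VI of this chapter (stems ending in “Ayi” take the regular suffix “lu”); the other classes are deliberately left out, and the word “abbAyi” used below is an illustrative example that does not come from the grammar reference.

```python
# Hypothetical excerpt of the irregular-plural lexicon (rows from Table 4.5).
IRREGULAR_PLURALS = {
    "rAyi": "rAlYlYu",
    "gAru": "gArlu",
    "sAri": "sArlu",
    "eVxxu": "eVdlu",
}


def nominative_plural(stem):
    """Look the stem up in the lexicon first, then fall back to a rule."""
    if stem in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[stem]
    if stem.endswith("Ayi"):
        # Class VI: regular plural, formed by adding "lu".
        return stem + "lu"
    raise NotImplementedError("the remaining plural classes are not part of this sketch")


irregular = nominative_plural("rAyi")
regular = nominative_plural("abbAyi")
```

Consulting the lexicon before the rules matters for stems such as “rAyi”, which ends in “Ayi” but takes the listed plural “rAlYlYu” rather than a rule-generated form.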
Pronouns do not conform to any rules regarding the formation of plurals. All the
pronouns and their plurals listed in Table 4.6 are also stored externally in a lexicon
The regular way of forming the nominative plural is by adding the plural suffix “lu”
Example:
plural suffix “lu” become “lYlYu”. The morphophonemic changes occur based on
the class to which the singular stem belongs. The formation of the plural is a two-
step process. First, the class to which the stem belongs is identified. In the current
work this class identification is implemented as a
finite automaton, illustrated in Figure 4.3 for class VII. Second, the sandhi
The stems can be categorized into different classes for plural formation of nouns
I) Stem-final “i/u” preceded by “t”, “Mt” or “Md” is lost before the plural
suffix “lu”.
Example:
II) In all stems ending in “di”, “du”, “lu” and “ru”, and in stems of more than two
syllables ending in “li” and “ri”, the final syllable becomes lYlY before lYlYu
Example:
Exception 1:
Example:
Exception 2:
Example:
Example:
IV) Stem-final “llu”, “nnu” following a short vowel becomes “Md” or
lY before lYu
Example:
Exception 1:
Some basic stems ending in “nnu” form the plural by adding lu.
Example:
V) Stem final aM,AM is replaced by A and stem final ending in eVo is replaced
Example:
VI) Stems ending in Ayi form the plural in the regular way by adding lu.
Example:
VII) Stem final ending in “yi”,”yyi” is replaced by “wu” before “lu”. The vowel
Example:
The class identification for stems belonging to class VII can be done through the
Figure 4.3: Finite Automata to identify class VII stems for Plural Formation
VIII) Stems that do not conform to the above classes and when the stem ends in “i”:
syllables and the vowel in the middle syllable is not “i” then the final “i”
Example:
2) If the stem consists of two or more syllables and the vowel in the middle
Example:
Exception:
Example:
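The two-step plural formation described above (class identification followed by suffixation) can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the implementation of the current work: the class name PluralSketch, its method names, and the sample stems "koti" and "baMdi" are hypothetical, only Class I, Class VI and the regular rule are covered, and the irregular lexicon holds just three entries copied from Table 4.5.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: covers only Class I, Class VI, the regular rule,
// and a three-entry irregular lexicon copied from Table 4.5.
final class PluralSketch {
    // Irregular plurals are stored in a lexicon and consulted before any rule.
    private static final Map<String, String> IRREGULAR = Map.of(
            "rAyi", "rAlYlYu",
            "gAru", "gArlu",
            "cenu", "celu");

    // Class I: stem-final "i"/"u" preceded by "t", "Mt" or "Md"
    // ("Mt" already ends in "t", so "t" covers both).
    private static final Pattern CLASS_I = Pattern.compile(".*(?:t|Md)[iu]");
    // Class VI: stems ending in "Ayi".
    private static final Pattern CLASS_VI = Pattern.compile(".*Ayi");

    // Step 1: identify the class of the singular stem.
    static String classify(String stem) {
        if (IRREGULAR.containsKey(stem)) return "irregular";
        if (CLASS_I.matcher(stem).matches()) return "I";
        if (CLASS_VI.matcher(stem).matches()) return "VI";
        return "regular";
    }

    // Step 2: apply the change for that class and add the plural suffix.
    static String pluralize(String stem) {
        switch (classify(stem)) {
            case "irregular":
                return IRREGULAR.get(stem);
            case "I":
                // The final vowel is lost before "lu".
                return stem.substring(0, stem.length() - 1) + "lu";
            default:
                // Class VI and the regular case both simply add "lu".
                return stem + "lu";
        }
    }
}
```

Under these assumptions, PluralSketch.pluralize("abbAyi") adds "lu" directly, while "rAyi" is resolved from the lexicon before the Class VI pattern, which it would otherwise match, is ever tried.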
Proper nouns are not generally used in the plural but when they are used the rules are
Each noun in Telugu has an oblique stem along with the basic stem in both singular
and plural forms. The oblique stem is used to indicate possession or adjectival
in English.
The oblique stems of the personal pronouns and a few demonstrative pronouns in the
singular form like “axi”, “ixi”, “exi” do not conform to any rules for oblique stem
The oblique stems in the singular for common nouns and demonstrative pronouns
are formed based on some morphophonemic rules. Common nouns in Telugu are
divided into six classes based on the manner in which the oblique stem is formed
“ru”, “nu”, “lu” and a few non-human nouns ending in “ru”, “lu” preceded by
a long vowel fall into this category. These form the oblique stem by deleting
the final “u” and adding “i” to the basic stem. Some example nouns and
III) Six stems ending in nnu, llu, lYlYu replace them by Mti in forming the
oblique stem. All the six stems belonging to Class III are listed in Table 4.10.
IV) Five stems of two syllables ending in “yi” and two stems ending in “rru”
replace the final syllable by “wi” in the formation of the oblique stem. All the
V) All nouns ending in “M” have two oblique stems, one in the genitive with no
modification and the other before the accusative and dative case which is
VI) Basic stems ending in “e”,”a” or a long vowel, those ending in “i” or “u”
preceded by double consonants except “ll” or “nn” and all nouns not covered
by classes I-V have both their basic stem and oblique stem identical. Some
Telugu has two words for the English word “we”: the exclusive “memu”, which
does not include the person addressed, and the inclusive “manamu”, which
includes the person addressed. The personal pronouns and a few demonstrative
pronouns like “avi”, “ivi”, “evi” in the plural form do not conform to any rules
for oblique stem formation. They need to be memorized and are listed in
Table 4.14.
The plural oblique stem of the common noun and demonstrative pronoun is formed
“a” added to the plural stem. In Sandhi the final “u” of the plural stem is lost
before “a”.
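The plural oblique rule stated above (add “a” to the plural stem, with the final “u” lost in sandhi) is simple enough to sketch directly. The Java fragment below is a hedged illustration only: the class and method names are hypothetical, the singular lookup contains just the one Class III stem confirmed elsewhere in the text (“illu” with oblique “iMti”), and any other singular stem falls back to the Class VI behaviour of an identical oblique stem, which is not correct for classes I, II, IV and V.

```java
import java.util.Map;

// Hypothetical sketch of oblique stem formation; only the plural rule and
// one Class III entry are modelled.
final class ObliqueSketch {
    // Class III is a closed list of six stems (Table 4.10); only "illu"
    // is reproduced here because it is the one cited in the running text.
    private static final Map<String, String> CLASS_III = Map.of("illu", "iMti");

    // Singular oblique: lexicon lookup, else unchanged (Class VI behaviour).
    static String singularOblique(String stem) {
        return CLASS_III.getOrDefault(stem, stem);
    }

    // Plural oblique: the final "u" of the plural stem is lost before "a".
    static String pluralOblique(String pluralStem) {
        if (!pluralStem.endsWith("u")) {
            throw new IllegalArgumentException("expected a plural stem ending in u: " + pluralStem);
        }
        return pluralStem.substring(0, pluralStem.length() - 1) + "a";
    }
}
```

For instance, the plural stem "abbAyilu" would give the plural oblique stem "abbAyila" under this sketch.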
4.4.2.3 Oblique Stems of Proper Nouns
The oblique stem of proper nouns both singular and plural is formed in the same way
as those of common nouns. Table 4.15 lists some example proper nouns and their
oblique stems.
semantic roles establish some grammatical relations between the nouns which they
follow and the verbs of the sentence. In Telugu postpositions are added to the
Postpositions in Telugu can be classified into two types namely Type-1 and Type-2.
Postpositions belonging to Type-1 only occur bound to oblique stems. They never
Most commonly used postpositions of this type are listed in Table 4.16.
Postposition Meaning
ni/nu Accusative
ki/ku Dative
kosaM for the sake of
wo with, along with
nuMci/niMci From
a/na/ni in/on/at
kaMteV than, compared
guMdA/xwArA Through
Table 4.16: List of Type-1 Postpositions
Among the Type-1 postpositions dative and accusative can be grouped into a
different from the other Type-1 postpositions.
The accusative and dative postpositions “nu” and “ku” take the forms “ni” and
“ki” respectively if the preceding syllable ends in “i” or “I”, except after
single-syllable personal pronouns, as in “nI-ku” (for you) and “mI-ku” (for you,
plural).
Example:
The use of accusative suffix for nouns denoting inanimate objects is optional. The
Example:
In the above sentence “kaXa” (story) is an inanimate noun which would otherwise
take the form “kaXanu”, but since the accusative suffix is optional it appears in
the nominative form “kaXa”. In the singular of nouns ending in “aM”, “AM”, and
“eVM” the dative suffix “ki” and the accusative suffix “ni” are added to variant
forms of the oblique stems. The stems ending in “aM” or “AM” (shown in section
4.4.2.1 class V) have “Ani” as the variant and stems ending in “eVM” have “eni”
as the variant of the oblique stem.
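The choice between the “nu/ku” and “ni/ki” forms, together with the variant oblique stems for nouns in “aM”, “AM” and “eVM”, can be sketched as a single function. This is an illustrative sketch and not the engine's code: the class name, the method signature, the boolean flag for single-syllable personal pronouns, and the sample noun "puswakaM" are all assumptions introduced here.

```java
// Hypothetical sketch of attaching the accusative or dative postposition
// to an oblique stem, following the rules stated in the text.
final class CaseMarkerSketch {
    static String attach(String obliqueStem, boolean dative, boolean monosyllabicPronoun) {
        String iForm = dative ? "ki" : "ni";
        String uForm = dative ? "ku" : "nu";

        // Stems in "aM"/"AM" use the variant in "Ani" and take "ki"/"ni".
        if (obliqueStem.endsWith("aM") || obliqueStem.endsWith("AM")) {
            return obliqueStem.substring(0, obliqueStem.length() - 2) + "Ani" + iForm;
        }
        // Stems in "eVM" use the variant in "eni" and take "ki"/"ni".
        if (obliqueStem.endsWith("eVM")) {
            return obliqueStem.substring(0, obliqueStem.length() - 3) + "eni" + iForm;
        }
        // After final "i"/"I" the suffix is "ki"/"ni", except after
        // single-syllable personal pronouns such as "nI" and "mI".
        boolean endsInI = obliqueStem.endsWith("i") || obliqueStem.endsWith("I");
        if (endsInI && !monosyllabicPronoun) {
            return obliqueStem + iForm;
        }
        return obliqueStem + uForm;
    }
}
```

With these assumptions, attach("iMti", true, false) produces the form "iMtiki" cited in the evaluation section, and attach("nI", true, true) keeps the "ku" form as in “nI-ku”.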
Example:
Postpositions belonging to Type-2 are separate words generally denoting place and
that postpositions of Type-1 can be added to them for example “lo” which is a Type-
2 postposition can be added to “nuMci” and “ki” to form “lonuMci” and “loki”.
Postposition     Meaning
lo               in
lopala           inside
mIxa             on
kiMxa            under
bEta             outside
xaggara          near
veVnaka          behind
muMxu            in front of, before
lA,lAgu,lAgA     like
prakAraM         according to
warvAwa          after
varaku,xAkA      up to, until
eVxuta           opposite
maXyana          between
pakkana          by the side of
pAtu             for (of time)
vEpu             in the direction of, towards
Table 4.17: List of Type-2 Postpositions
Example sentences of Type-2 postpositions are as follows:
In Telugu the gender of nouns and pronouns depends on the number. There are two
genders, masculine and non-masculine, when the number is singular. All nouns
denoting male persons belong to the masculine gender and all others belong to
the non-masculine gender. There is no feminine gender, and all nouns denoting
female persons are treated as non-masculine when the number is singular. There
are two genders, human and non-human, when the number is plural. All nouns
denoting male and female persons belong to the human gender and all others
belong to the non-human gender. The relationship between gender and number is
shown in Table 3.1.
As a result the two demonstrative pronouns “axi” (that thing, that lady) and “ixi” (this
thing, this lady) are non-masculine when the number is singular but when the
number is plural they fall into two genders human “vAlYlYu”(those people) and
the current work both “axi” and “ixi” are treated as referring to things and not female
persons because using these words to refer to female persons happens only in casual
talk.
Nouns generally do not have any marker of gender but some words and suffixes are
used to differentiate between male and female sexes. The different nouns that use
a) Some masculine nouns end with “rAlu” to indicate the female sex.
Example:
snehiwudu(male friend)
snehiwurAlu(female friend)
b) Some descriptive words use the suffixes “amma”, “kawwe” to denote female
Example:
musalayya (old man)
c) The words “moVga” (male) and “Ada” (female) are used to distinguish sex in
Example:
moVgapilla
Adapilla
d) Various words are used to distinguish male and female animals and birds.
Example:
kodipuMju (cock)
kodipeVtta (hen)
e) Among pronouns and numerals certain forms are used to distinguish male
Example:
vAdu (he)
AmeV (she)
There are three grammatical persons namely First Person, Second Person, and Third
Person in Telugu. They are used to distinguish between the speaker, the addressee,
and others. The personal pronouns of Telugu language are defined by the
grammatical person.
Verbs in Telugu take a form dependent on the person and number of the subject.
Table 4.18 is a list of verb endings depending on the number and person of the noun
or pronoun.
Number    Person    Verb Ending
Example sentences of different grammatical persons with their verb endings are as
follows:
4.7 Evaluation
with respect to the Telugu Noun database downloaded from the web resource Telugu
nouns were downloaded to perform the evaluation. The nouns were tested for plural
Evaluation was not done for pronouns as both plural formation and oblique stem
formation for pronouns are shown in Table 4.6 and Table 4.7 in section 4.4.1
Of the 524 downloaded nouns, duplicates and items such as “kri . pU” (Before
Christ), which are not suitable for pluralization or oblique form generation,
were removed from the list. A total of 480 nouns were
The evaluation was performed by giving the nouns as input to the surface realization
engine, because this speeds up the evaluation: more nouns can be tested at
the same time. The evaluation results for the plural formation of the nouns are given
in Table 4.19.
The evaluation results in Table 4.19 show that no nouns are categorized under Class
VI and Class VII. Class VI consists of stems that end with “Ayi”. The downloaded
database, which consists mainly of proper nouns, does not contain any nouns ending
with “Ayi”, which generally occurs at the end of common nouns like “abbAyi” (boy),
“ammAyi” (girl) etc. Class VII consists of only three stems, “ceVyyi” (hand), “goVyyi”
(pit), and “nuyyi” (water well), which end with “yyi”. All the three stems are
The evaluation results for oblique stem formation of the nouns are given in Table
4.21.
The evaluation results in Table 4.20 show that none of the nouns are categorized
under Class II, Class III, and Class IV. Class II consists of non-human common
nouns ending in “du”, “di”, “ru”, “ri”, “lu”, “li” which get replaced by “ti”. The
downloaded dataset, which consists mainly of proper nouns, does not have any noun
belonging to this class. Some example nouns belonging to this class are listed in
Table 4.9 of section 4.4.2.1. Class III consists of only six stems ending in “nnu” and
“llu”, which are thoroughly tested (shown in Table 4.10, section 4.4.2.1). The head of
the complement in the example sentence (1), “iMtiki”, shown in its root form “illu” in
Figure 1, also belongs to Class III. Class IV consists of only seven stems ending in
“yi” and “rru”, which are also tested thoroughly (shown in Table 4.11, section 4.4.2.1).
Case marker generation primarily involves generating the oblique form and then
joining the case marker to that form. The results in Table 4.20 show that our engine
generates the correct oblique forms. Our evaluation of case marker agglutination
shows that for all the oblique forms our engine produces the correct surface noun
forms.
The results show that all the 480 nouns were identified as belonging to different
classes, but in both plural formation and oblique stem formation a few classes do
not have any nouns in them. The list consists mainly of proper nouns with very few
common nouns, so the nouns are not evenly distributed across the
classes.
4.8 Summary
The morphology engine for nouns and pronouns described in this chapter along with
verb, adverb and adjective morphology engines are separate modules in a surface
realization engine for generating well-formed sentences in Telugu. The noun and
pronoun morphology engine plays a very important role in the surface realization
CHAPTER 5
MORPHOLOGY OF VERBS
5.1 Introduction
Telugu is a Dravidian language spoken by people from the south Indian states of
language with nearly 90 million first language speakers. In this paper we describe a
Telugu. Morphological Analyser (MA) and Morphological Generator (MG) are two
machine translation systems (Rao et al. 2006) and surface realization engines
processes it into its root along with its grammatical information whereas a
important in Natural Language Generation (NLG) of free word order languages like
separate component that is separate from the rest of the NLG system (Guido Minnen
et al. 2000). The current work is a separate module of a surface realization engine for
Natural Language Generation (NLG) pipeline (Reiter and Dale, 2000). The sentence
realization engine for Telugu is designed following the SimpleNLG (Gatt and Reiter,
2009) approach which is a very popular surface realization engine for English.
engine for English (Guido Minnen et al. 2000). Because Telugu is a morphologically
rich language, the morphology of Telugu verbs and nouns is comparatively more
complex. For example the morphology of Telugu verbs involves defining
involves defining morphology for seven different classes of nouns with respect to the
one for verbs and one for nouns and pronouns. In the implementation instead of
using tools like Flex or JFlex we programmed our morphological engine in Java
using the regular expression package. The process of verb morphology depends on
the way in which the verbs are classified. Linguistic classification of verbs in
morphophonemic changes the verb stems undergo when inflected with tense-mode
suffixes. The model of the analysis decides the number of types into which the verbs
can be classified.
In the current work the verb morphological generator does not have an explicit
lexicon or word list but has a computational model based on finite state techniques to
classify all the verbs into a few regular classes and a very small list of words for the
irregular class. The suffixes to be added to the verbs are maintained in separate
XML files and concatenated to the variants of the verb roots to form the final
inflected form.
Morphology has been well studied both by theoretical (Hockett, 1954, Stump 2001)
and computational linguists (Beesley and Karttunen, 2003, Roark and Sproat, 2007).
models:
morpheme is treated as the minimal meaningful unit of a language and words are
Item and Process model which is lexeme based morphology in which a word form
is assumed to be a result of applying rules that alter a stem to produce a new one. In
this model, inflectional rules, derivational rules, and compounding rules are applied
Word and Paradigm model which is a word based morphological approach which
paradigms.
above have been shown to offer no significant computational advantage over the finite
state approaches that have been widely applied to building MAs and MGs for a
diverse range of languages (Guido Minnen et al. 2000, Karttunen, 2003, Beesley and
Karttunen, 2003, Karttunen and Beesley, 2005, Roark and Sproat, 2007). Therefore,
in the current work we apply finite state techniques to Telugu morphology. Amongst
Indian languages, Antony and Soman (2012) reported the highest number of
morphology tools for Tamil. According to their survey a wide range of approaches
exists, from corpus based through suffix stripping to finite state. A database approach
is described by Goyal and Lehal (2011), who store all the word forms in a
relational database. For Telugu, Rao et al. (2011) describe a word and
Telugu. A rule based (item and process based) morphological generator for Telugu is
5.2 Inputs to the Morphology Engine
The verb morphology engine in the current work is part of a surface realization
Telugu sentences. The input for the surface realization engine is an XML file which
has all the grammatical information required both at the sentence level and word
level. Figure 5.1 shows an example XML specification corresponding to the Telugu
The current chapter describes a computational model based on finite state techniques
and XML files. The computational model is a Java application which uses the
“java.util.regex” package. The input to the computational model is the verb lemma,
the tense mode of the verb, the PNG (person, number and gender) of the subject and the
case marker of the subject. The “Pattern” class of the “regex” package has a method
“matches” which creates a finite state automaton for a given regular expression to
identify the class to which a given verb lemma belongs. The computational model
also computes the final constituent of the stem in the inflected verb and finally
The input for the verb in the example XML specification of Figure 5.1 is as follows:
The first attribute is “pos” which stands for part of speech and the second attribute is
“tensemode”.
Verbs in Indian languages inflect for tense, aspect, modality (mode), and PNG
(person, number and gender) endings. The verbs co-occur with tense, aspect, and
modality in most of the languages whereas aspect and modality are packed into a
single verbal inflection word in Telugu and referred to as “tensemode” in the current
work. There are a total of 18 verb forms, including both finite and non-finite forms,
which are of importance in Telugu. Our morphological engine can
generate all the verb forms, but only the finite forms are used by the surface
realization engine and therefore we discuss the finite forms in detail.
The imperative verbs are used to express a command or a request. The meaning of
the imperative verb takes the form of a command in the singular and a request in the
plural. The imperative forms of the verb are only used when the first person in the
singular addresses the second person either in the singular or in the plural. Therefore,
the imperative forms carry two suffixes. In the case of negative imperative the
second person suffix is added to the verb root + “ak” (negative imperative suffix).
a) The basic verb stems when the imperative suffixes and the negative tense are
same.
b) The rules of stem final vowel loss and harmony (i.e. change of medial “u” to
Example:
c) Stems ending in “s” preceded by a long vowel change “s” to “y” in the
imperative mood. These stems optionally add the suffix “i” instead of “u” in
the singular. When the “i” suffix is added the stem vowel is optionally
Example:
Exception:
Example:
d) In the case of basic stems having two syllables ending “c” or “s” the final
Example:
e) When the stem variant ends in a long vowel the beginning of the imperative
suffixes is dropped.
Example:
f) One irregular verb in the imperative is “pax-a” (go). The last “a” here is
Many verbs cannot occur in this mood due to semantic restrictions. A few verbs like
“kAlu” (to burn), “kUlu” (to fall), “cAvu” (to die), “pagulu” (to break) etc., occur in
this mood. Some example sentences using abusive verb forms are as follows:
The obligative is formed by adding the finite or perfective form of a defective verb
“vAl” to the infinitive of a main verb. The finite form of this verb in the future
habitual tense is “vAli” (must). Some example sentences using Obligative verb
mIru mA Uru rAvAli (You should come to our town)
singular. The obligative verb does not agree with the subject in person, gender, or
number. It occurs always in the third person non-masculine singular or without any
personal suffix.
Example:
The future habitual tense in Telugu can express an action or a state that will take
place in the future or an action or state that is habitual. The sentence “nenu annaM
wiMtAnu” can mean either ‘I will eat food’ or ‘I eat food’. Principles for the
b) The verb stems like “ammu” (sell), “adugu” (ask) occur unchanged before
Example:
c) In the case of the basic stems ending in “s” or a long vowel the tense suffix is
added directly.
Example:
d) In the case of basic stems ending in “n” the tense suffix changes to “tA/tun”.
Example:
win + tA → wiMtA
e) Single syllable stems ending in “tt” (koVttu) (beat), “pp” (ceVppu) (tell)
“wuM”.
f) Stems ending in “c”, “cc”, “Mc” change those elements to “s” before the
tense suffix.
Example:
In Telugu the past tense corresponds to two past tenses in English for example
“vaccAnu” in Telugu represents both ‘I came’ and ‘I have come’. The following are
a) The tense suffix “e/iM”, and the personal suffix are added to the verb stem to
Example:
b) The stem final “u” before the tense suffix “e/iM” is dropped as a result of
sandhi formation.
Example:
woVdugu + e → woVdigA
c) A non-initial “u” in the stem becomes “i” when the past tense suffix is added.
Example:
piluc + e → pilicA
d) Verb stems that end in a short vowel + n generally have “nA” as
the past tense suffix, but the 3rd person singular female has “na” as the past
tense suffix.
Example:
win + nA → winnA
e) The past tense suffix for the verb stem “pad” (fall) is “dA” but in the case of
Example:
pad + dA → paddA
f) The verb stem ending in “s” becomes “S” in some cases when the past tense
suffix follows.
Example:
kalus + e → kaliSA
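Principles (b) and (c) above, which change the stem before the past tense suffix is attached, can be sketched as follows. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, the vowel inventory is limited to the WX letters that appear in the examples, and the realization of the tense suffix itself (for example “e” surfacing as “A”) and the personal suffix are deliberately left out.

```java
// Hypothetical sketch of the stem changes before the past tense suffix:
// (b) drop a stem-final "u", (c) turn every non-initial "u" into "i".
final class PastStemSketch {
    // Assumed vowel letters in WX notation; "V" after e/o only marks shortness.
    private static final String VOWELS = "aAiIuUeEoO";

    static String pastStem(String stem) {
        // Rule (b): the stem-final "u" is dropped in sandhi.
        String s = stem.endsWith("u") ? stem.substring(0, stem.length() - 1) : stem;

        // Locate the first vowel so that it is never altered.
        int first = -1;
        for (int i = 0; i < s.length(); i++) {
            if (VOWELS.indexOf(s.charAt(i)) >= 0) {
                first = i;
                break;
            }
        }
        if (first < 0) {
            return s; // no vowel found, nothing to harmonize
        }

        // Rule (c): any later short "u" becomes "i" (long "U" is untouched).
        String head = s.substring(0, first + 1);
        String tail = s.substring(first + 1).replace('u', 'i');
        return head + tail;
    }
}
```

Applied to the examples in the text, this sketch turns "woVdugu" into "woVdig" and "piluc" into "pilic", which are the stem shapes visible in “woVdigA” and “pilicA”.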
An imperative verb that includes the speaker is called a hortative verb. In Telugu
the hortative verb is formed by adding to the verb stem the hortative suffix “xA”
followed by the first person plural “mu/M”. The hortative form also conveys a future
a) The hortative tense form is obtained by adding the verb stem in the habitual
Example:
b) In the case of the future habitual tense forms ending in “c” and “s” they
Example:
In Telugu the negative is expressed by forming a verb paradigm rather than by
using a separate word of negation as in most languages. The negative verbs are
in the future habitual tense and negate the affirmative verb occurring in that tense.
Some example sentences using the negative finite verb forms are as follows:
The negative suffix in Telugu is “a”. It occurs after the verb root and before the
personal suffix in the verb. The personal suffixes in the negative tense are the same
as in the other tenses, except for the third person singular non-masculine and third
person plural non-human suffixes “xi” and “yi”, which become “xu” and “vu” in the
negative tense.
a) The negative tense is formed by adding the negative tense suffix “a” to the
Example:
b) The medial “u” of the basic stems having two or more syllables of the form
Example:
c) A number of basic stems ending in “c”, “s” replace these consonants by “v”,”
Example:
The durative verb is not a regular finite verb as the other finite forms discussed
earlier. The durative verb is a compound verb as at least two verb roots are involved
Telugu does not distinguish present, past and perfect continuous tenses as
English does. The distinction is shown by an adverb of time or only by the context
of discourse. In the absence of time-specifying clues the durative verb carries the
present continuous meaning. The durative verb is formed by adding to the basic verb
stem the durative suffix “w/t” followed by “un” in its finite form.
The principles for the formation of the durative finite verb are:
a) In the case of verb stems ending in a short vowel followed by n the durative
suffix is “t”. The durative verb is “basic stem + t + finite form of un”
Example:
b) In all the other cases the durative suffix is “w”. The durative verb is “basic
Example:
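The choice of durative suffix in principles (a) and (b) depends only on the shape of the stem ending, so it can be expressed as one regular expression test. The sketch below is illustrative only: the class and method names are hypothetical, and the set of short vowels ("a", "i", "u", "eV", "oV") is an assumption based on the WX spellings used in this chapter. The sandhi between the suffix and the stem, and the finite form of “un”, are not modelled.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: select the durative suffix from the stem shape.
final class DurativeSketch {
    // A stem ending in a short vowel (assumed: a, i, u, eV, oV) followed by "n".
    private static final Pattern SHORT_VOWEL_N = Pattern.compile(".*(?:[aiu]|[eo]V)n");

    static String durativeSuffix(String stem) {
        // Principle (a): short vowel + n takes "t"; principle (b): otherwise "w".
        return SHORT_VOWEL_N.matcher(stem).matches() ? "t" : "w";
    }
}
```

Under these assumptions "win" selects “t”, while "piluc" and "nAn" (long vowel before the final n) select “w”.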
In the current work the verb root “piluc” of the example in Figure 5.1 becomes
“piliciMxi” after going through a few steps. The steps the verb root undergoes to get
subject.
current work the morphophonemic group A is divided into three groups, namely
A123, A4, and A5, because the phonetic alternations of certain verb classes are
different for these subgroups of group A. Group C is also divided into two
groups, namely C1-8 and C9, for the same reason as A. Each of the tense modes in
Telugu belongs to one morphophonemic group. Table 5.2 shows the list of tense
modes and the morphophonemic group to which each belongs. In the example of
Figure 5.1 the tense mode for the verb is specified as “pasttense”. The
morphophonemic group for past tense is identified as group B by looking at
Table 5.2. In the current work Table
Telugu verbs are divided into six classes (Krishnamurti 1961) of which classes I, II,
III, IV, and V are conjugations of weak (regular) verbs and Class VI consists of
Class I consists of four subclasses which are as follows:
a) Verb bases with three syllables of the form (C1)V1C2V2C3V3 (C stands for
optional) in which “u” occurs as V2 and V3, and C2 is not “c” or “s”.
Example:
Example:
c) Monosyllabic bases of the form (C1)V1C2 where “n” or “l” occur as the final
consonant.
Example:
d) Disyllabic bases of (C1)V1C2V2C3 type where the final consonant is “l” and
Example:
Example:
b) Monosyllabic bases of the (C1)V1C2 type in which the final consonant is “c”
or “s”.
Example:
In the implementation of the morphology engine the Class II verbs are further
divided into sub classes. The subclass ‘a’ is further divided into ClassIIa1 and
ClassIIa2 where ClassIIa1 has the final consonant as “c” and ClassIIa2 has the final
consonant as “s”. The subclass ‘b’ is also further divided into two subclasses namely
Class III consists of three sub classes which are defined as follows:
a) A few monosyllabic bases of the form (C1)V1C2 with final “c” belong to this
sub class.
Example:
Example:
Example:
a) Monosyllabic bases of the type (C1)V1C2C3- which end in final “tt” or in final
Example:
b) Two monosyllabic bases of the same type, one in final “nn” another in
Example:
In the current work the ClassIVa sub class is further subdivided into ClassIVa1, and
ClassIVa2.
Class V consists of seven monosyllabic bases of type (C1)V1C2 in final “n”.
The seven bases are an- (to say), kan-1 (to see), kan-2 (to bring forth), kon-
Class VI consists of irregular bases. The irregular bases that belong to this class are
icc- (to give), cacc- (to die), weVcc- (to bring), vacc- (to come), av- (to become), po
(to go), cUc- (to see), leVc- (to rise), le (to be), pax- (to go, depart).
The verb “piluc” in the example of Figure 5.1 is of the form (C1)V1C2V2C3, a
disyllabic base where the final consonant is “c” and the second vowel is “u”. It
belongs to Class IIa. Figure 5.2 is the diagrammatic representation of the finite
automaton created by the computational model for Class IIa. State 0 is the start
state of the finite automaton and state 4 is the final state. The first consonant
C1 is optional and loops back to the same state. In the example of Figure 5.1 the
first consonant is “p”, which the finite automaton takes as input, remaining in
state 0. The finite automaton then takes V1, which is “i”, as input and goes to
state 1; at state 1 it takes C2, which is “l”, as input and goes to state 2; at
state 2 it takes V2, which is “u”, as input and goes to state 3; and finally at
state 3 it takes C3, which is “c”, as input and goes to the final state 4.
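The walk through the Class IIa automaton can be made concrete with a hand-written state machine. The sketch below is not the engine's code, which relies on “java.util.regex” to build the automaton; the class name and the explicit state numbering are assumptions that mirror the description of Figure 5.2, and the vowel set is taken from the character class used in Figure 5.3. Following the description, the optional first consonant is modelled as a loop on state 0, so unlike the regular expression this sketch would also tolerate more than one initial consonant.

```java
// Hypothetical explicit automaton for Class IIa bases of the form
// (C1)V1C2V2C3 with V2 = "u" and C3 = "c" (IIa1) or "s" (IIa2).
final class ClassIIaAutomaton {
    // Vowel letters as listed in the character class of Figure 5.3.
    private static final String VOWELS = "aAiIuUeEoOM";

    private static boolean isVowel(char c) {
        return VOWELS.indexOf(c) >= 0;
    }

    static boolean accepts(String verb) {
        int state = 0; // start state
        for (char c : verb.toCharArray()) {
            switch (state) {
                case 0: // optional C1 loops on state 0; V1 moves to state 1
                    state = isVowel(c) ? 1 : 0;
                    break;
                case 1: // C2 must be a consonant
                    if (isVowel(c)) return false;
                    state = 2;
                    break;
                case 2: // V2 must be "u"
                    if (c != 'u') return false;
                    state = 3;
                    break;
                case 3: // C3 must be "c" or "s"
                    if (c != 'c' && c != 's') return false;
                    state = 4;
                    break;
                default: // no transitions leave the final state
                    return false;
            }
        }
        return state == 4; // accept only if the input ends in the final state
    }
}
```

Tracing "piluc" through this sketch visits states 0, 0, 1, 2, 3, 4 in that order, matching the walk described above.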
The extraction of phonetic alternations is done based on the verb class and the
morphophonemic group of the specified tense mode. Table 5.3 clearly shows the
phonetic alterations each verb class goes through in the process of generating the
final inflected form of the verb. In the case of the verb “piluc” in the example of
Figure 5.1 it is clearly shown in Table 5.3 at class IIa1 under group B (to which the
1) The required deletions and replacements are performed on the verb root
2) The required alterations to be added are extracted from the XML file
“palterations.xml”.
The first part is the Java programming logic which, along with the identification of
the verb class, performs the required deletion to form the variant of the verb which is
the final constituent of the stem in the inflected verb. The fragment of the Java code
Class  Basic alternant   Morphophonemic Groups
       / example word    A123      A4        A5        B         C1-8      C9
Ia     (C)VCuCu          uCu       uCu       uCu       iC        aC        uC
       adugu             ad-ugu    ad-ugu    ad-ugu    ad-ig     ad-ag     ad-ug
Ib     (C)VC(C)u         u         u         u         -         -         -
       pAdu              pAd-u     pAd-u     pAd-u     pAd       pAd       pAd
Ic     (C)Vn/l           -         -         -         -         -         -
       nAn-              nAn       nAn       nAn       nAn       nAn       nAn
Id     (C)VCul           ul        il        ul        il        al        ul
       kaxul-            kax-ul    kax-il    kax-ul    kax-il    kax-al    kax-ul
IIa1   (C)VCuc           us        is        ux        ic        av        -
       piluc             pil-us    pil-is    pil-ux    pil-ic    pil-av    pil-uc
IIa2   (C)VCus           us        is        ux        is        av        -
       wadus             wad-us    wad-is    wad-ux    wad-is    wad-av    wad-us
IIb1   (C)Vs             s         s         x         s         (V)yy     (V)yy
       wIs               wI-s      wI-s      wI-x      wI-s      wi-yy     wi-yy
IIb2   (C)Vc             s         s         x         c         y         y
       vAc               vA-s      vA-s      vA-x      vA-c      vA-y      vA-y
IIIa   (C)Vc             s         s         x         c         c         c
       kAc               kA-s      kA-s      kA-x      kA-c      kA-c      kA-c
IIIb   (C)VCuc           us        is        ux        c         c         c
       kAluc             kAl-us    kAl-is    kAl-ux    kAlu-c    kAlu-c    kAlu-c
IIIc   .*iMc             is        is        ix        iMc       iMc       iMc
       wittiMc           witt-is   witt-is   witt-ix   witt-iMc  witt-iMc  witt-iMc
IVa1   (C)Vtt            du        di        da        tt        tt        tt
       koVtt             koV-du    koV-di    koV-da    koV-tt    koV-tt    koV-tt
IVa2   (C)Vpp            bu        bi        ba        pp        pp        pp
       ceVpp             ceV-bu    ceV-bi    ceV-ba    ceV-pp    ceV-pp    ceV-pp
IVb1   (C)Vnn            M         M         M         nn        nn        nn
       wann              wa-M      wa-M      wa-M      wa-nn     wa-nn     wa-nn
IVb2   (C)VlYlY          lY        lY        lY        lYlY      lYlY      lYlY
       veVlYlY           veV-lY    veV-lY    veV-lY    veV-lYlY  veV-lYlY  veV-lYlY
V      (C)Vn             n         n         n         n         n         n
       vin               vi-n      vi-n      vi-n      vi-n      vi-n      vi-n
Table 5.3: Phonetic Alternations for Verb Classes
The fragment of code presented in Figure 5.3 deletes the last two letters in the
piluc → pil
In step 2 the alternation “ic” is extracted from the XML file “palterations.xml”.
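// requires: import java.util.regex.Pattern;
// Class IIa1: optional consonant, vowel, consonant, then "u" and "c"
if (Pattern.matches("[^aAiIIuUeEoOM]?[aAiIIuUeEoOM][^aAiIIuUeEoOM][u][c]", verb)) {
    vclass = "classIIa1";
    // delete the last two letters to obtain the stem variant (piluc -> pil)
    verb = verb.substring(0, verb.length() - 2);
}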
Figure 5.3: Fragment of Java Code for Class IIa1
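The two steps described above (deletion on the verb root, then lookup of the alternation to add) can be sketched together in a self-contained form. This is a hedged illustration: the in-memory map stands in for the file “palterations.xml”, whose actual structure is not shown in this chapter; the class name, method names and group keys are hypothetical; only the Class IIa1 and IIa2 rows of Table 5.3 are included; and for group C9, where the table shows the base unchanged, the original ending is stored as the alternation.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: classify a verb root, delete its final two letters,
// and append the alternation for the requested morphophonemic group.
final class AlternationSketch {
    private static final String V = "aAiIuUeEoOM"; // vowel letters, as in Figure 5.3
    private static final Pattern IIA1 =
            Pattern.compile("[^" + V + "]?[" + V + "][^" + V + "]uc");
    private static final Pattern IIA2 =
            Pattern.compile("[^" + V + "]?[" + V + "][^" + V + "]us");

    // Stand-in for palterations.xml: rows IIa1 and IIa2 of Table 5.3.
    private static final Map<String, Map<String, String>> ALTERNATIONS = Map.of(
            "classIIa1", Map.of("A123", "us", "A4", "is", "A5", "ux",
                                "B", "ic", "C1-8", "av", "C9", "uc"),
            "classIIa2", Map.of("A123", "us", "A4", "is", "A5", "ux",
                                "B", "is", "C1-8", "av", "C9", "us"));

    static String classify(String verb) {
        if (IIA1.matcher(verb).matches()) return "classIIa1";
        if (IIA2.matcher(verb).matches()) return "classIIa2";
        return null; // other classes are outside this sketch
    }

    static String stemVariant(String verb, String group) {
        String vclass = classify(verb);
        if (vclass == null) {
            throw new IllegalArgumentException("unsupported verb: " + verb);
        }
        String alternation = ALTERNATIONS.get(vclass).get(group);
        if (alternation == null) {
            throw new IllegalArgumentException("unknown group: " + group);
        }
        // Step 1: delete the last two letters; step 2: add the alternation.
        return verb.substring(0, verb.length() - 2) + alternation;
    }
}
```

For the running example, stemVariant("piluc", "B") deletes "uc" and appends "ic", giving the stem "pilic" to which the tense mode and personal suffixes are later attached.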
Tense mode suffixes are those suffixes which are agglutinated to the verb based on
Morphophonemic Criteria
Group A123
Group A4
Group A5
Group C1-8
Group C9
In the case of the verb “piluc” in the example of Figure 5.1 the tense mode being
“pasttense” which belongs to Group B and the subject being “nonmasculine” the
Telugu verbs inflect to encode gender, number, and person suffixes of the subject. In
the current work the morphology engine gets the information about attributes of the
subject and uses that information to agglutinate the gender, number, person suffixes
and tense mode suffix to the verb. The eight personal suffixes of the finite verb for
masculine, singular, 3rd person which means the personal suffix is “xi” from the
The final inflected verb is formed by concatenation of all the strings formed from
Final verb = verb + phonetic alternation + tense mode suffix + personal suffix
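As a worked instance of this concatenation, the running example can be assembled from its four parts. In the sketch below the class and method names are hypothetical. The stem remainder "pil", the alternation "ic" and the personal suffix "xi" come from the text; the tense mode suffix "iM" is an inference from the past tense suffix “e/iM” and the stated result “piliciMxi”, not a value quoted from the suffix tables.

```java
// Hypothetical sketch of the final concatenation step.
final class FinalVerbSketch {
    static String inflect(String stemRemainder, String alternation,
                          String tenseModeSuffix, String personalSuffix) {
        // The inflected verb is the plain concatenation of the four strings.
        return String.join("", stemRemainder, alternation, tenseModeSuffix, personalSuffix);
    }

    public static void main(String[] args) {
        // pil + ic + iM + xi
        System.out.println(inflect("pil", "ic", "iM", "xi"));
    }
}
```

Running the sketch prints piliciMxi, the inflected form given for the example of Figure 5.1.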
5.6 Evaluation
with respect to the Telugu verb database downloaded from the Telugu Wiktionary at
database by giving them as input to the surface realizer rather than the morphology
engine separately because we wanted to test them with various alternatives of the
A total of 503 verbs were downloaded and the evaluation was performed. The verbs
were tested for habitual future, durative and past tense. The most important part of
the evaluation was categorizing the verbs into different verb classes based on our
reference grammar book (Bh Krishnamurti 1985). The results of the evaluation are
The results show that 418 verbs (about 83% of the 503) were identified as belonging
to different classes, and the engine generated the different verb forms for them
without any errors. The results also show that 85 verbs were not recognized as
belonging to any verb class according to the current work. Among these 85 are
words like “pilupu” which are not considered verbs according to our grammar
reference. Some of the words end with “agu”, which means “to become”, but our
grammar reference considers only “avu” as the verb meaning “to become”; we did
not treat these two as the same, otherwise the number of failed verbs would have
been reduced by 20. We intend to use these evaluation results to drive the
development of the morphology engine to
Class    No. of verbs identified in each class
Ia       33
Ib       118
Ic       2
Id       1
IIa1     10
IIa2     0
IIb1     24
IIb2     0
IIIa     4
IIIb     6
IIIc     141
IVa1     26
IVa2     3
IVb1     4
IVb2     0
V        38
VI       8
Table 5.11: Evaluation Results for Verb Formation
5.7 Summary
This chapter describes a morphology engine for Telugu verbs based on finite state
techniques. The different verb forms and the different verb classes are described in
the chapter. The process of the formation of inflected verb is also described in detail
in the chapter. The verb morphological engine plays a very important role in the
CHAPTER 6
SENTENCE FORMATION
6.1 Introduction
The grammatical operations of a language like Telugu are based on the
category of the words rather than the structure of its constituents. The category of the
words can be pronoun, noun, verb, adjective, adverb etc. The words are grouped to
form more syntactically relevant categories called phrases or word groups. The
Telugu is the major purpose of the current thesis. In the current work, sentence formation is the responsibility of the SentenceBuilder module. The SentenceBuilder module has centralized control over all the other
called to create element objects to store all the information received from the
3) The element objects are passed to the appropriate phrase builders to generate
phrases according to the requirements specified in the input. The phrases are
complete sentence.
5) The word order of the sentence constituents in which they were given in the
6) The sentence is sent to the output generator to produce the sentences in
Grammatical Functions are syntactic roles played by words or phrases in the context
represented through a sentence level feature called role. The grammatical function is used to identify the noun phrase that plays the role of subject, which makes verb agreement straightforward. A further advantage of using grammatical functions in the current work is that they facilitate the free ordering of sentence constituents that Telugu requires. The SentenceBuilder module identifies the grammatical functions by looking for the sentence level feature called role.
The ElementBuilder module in the current work serves only to improve the structure of the application and to facilitate communication among its modules. Each grammatical function has a specific element builder. There are four element
builders is to convert the lexicalized input provided in the input XML file into
element objects. Element objects are objects of grammatical functions with their
grammatical features as instance variables. There are four element objects namely
element object and return it to the SentenceBuilder. The element objects are passed
word groups or phrases. The order in which the phrases are grouped in a sentence is
relatively free in Telugu when compared to languages like English. In the current
work the module that takes care of the formation of a phrase is the Phrase Builder
module to which the Sentence Builder module sends the required objects to be
phrase, adjective phrase, verb phrase, and adverb phrase are the four main types of phrases in Telugu. Telugu phrases generally take the head-modifier form, and the phrases in the current work are therefore head-modifier phrases as well. The noun phrase and the verb phrase play a very important role in the
and the adverb phrase play the secondary role of being a modifier of the phrase head.
In the following sections the construction of each type of phrase is described in detail, along with the morphology of adjectives and adverbs.
A noun phrase is composed of one or more nouns or pronouns, or of a noun modified by one or more adjectives. Every noun phrase has an identifiable head, which is a noun or a pronoun. In the current work the noun phrase is kept very simple: it has a head and an optional modifier, which is an adjective. The construction of
sIwa aMxamEna pAta gattigA pAdiMxi. (Sita sang a beautiful song loudly)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<document> <sentence type=" " predicatetype="verbal" respect="no">
<nounphrase role="subject">
<head pos="noun" gender="nonmasculine" number="singular" person="third"
casemarker=" " stem="basic"> sIwa</head>
</nounphrase>
<nounphrase role="complement">
<modifier pos="adjective" type="descriptive" suffix="aEna"> aMxamu</modifier>
<head pos="noun" gender="nonmasculine" number="singular" person="third"
casemarker="" stem="basic"> pAta</head>
</nounphrase>
<verbphrase type=" ">
<modifier pos="adverb" suffix="gA"> gatti</modifier>
<head pos="verb" tensemode="pasttense">pAdu</head>
</verbphrase>
</sentence>
</document>
Figure 6.1. Example Input XML Specification
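To make the structure of this input concrete, the following Python sketch reads a specification equivalent to Figure 6.1 and locates the subject noun phrase through its role feature. It is only an illustration under stated assumptions: the realizer described in this thesis is not written in Python, the helper name read_phrases and the dictionary layout are hypothetical, and the XML declaration is omitted so that the text can be parsed directly from a string.

```python
import xml.etree.ElementTree as ET

# Input specification equivalent to Figure 6.1 (attribute values as in the figure).
SPEC = """<document><sentence type=" " predicatetype="verbal" respect="no">
  <nounphrase role="subject">
    <head pos="noun" gender="nonmasculine" number="singular" person="third"
          casemarker=" " stem="basic"> sIwa</head>
  </nounphrase>
  <nounphrase role="complement">
    <modifier pos="adjective" type="descriptive" suffix="aEna"> aMxamu</modifier>
    <head pos="noun" gender="nonmasculine" number="singular" person="third"
          casemarker="" stem="basic"> pAta</head>
  </nounphrase>
  <verbphrase type=" ">
    <modifier pos="adverb" suffix="gA"> gatti</modifier>
    <head pos="verb" tensemode="pasttense">pAdu</head>
  </verbphrase>
</sentence></document>"""


def read_phrases(xml_text):
    """Return one dict per phrase, in input order (hypothetical helper)."""
    sentence = ET.fromstring(xml_text).find("sentence")
    phrases = []
    for phrase in sentence:
        head = phrase.find("head")
        modifier = phrase.find("modifier")
        phrases.append({
            "kind": phrase.tag,
            "role": phrase.get("role"),  # None for the verb phrase
            "head": head.text.strip(),
            "head_features": dict(head.attrib),
            "modifier": modifier.text.strip() if modifier is not None else None,
        })
    return phrases


phrases = read_phrases(SPEC)

# The subject is found purely through the role feature, not through its position.
subject = next(p for p in phrases if p["role"] == "subject")
print(subject["head"], subject["head_features"]["person"])

for p in phrases:
    print(p["kind"], p["role"], p["modifier"], p["head"])
```

Because the phrases are kept in the order in which they appear in the input, a reader of this sketch can also see how the input order could be carried through to the output.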
In the above sentence there are two noun phrases which are passed to the noun
phrase builder as two SOCElements (see section) by the sentence builder module.
NounPhraseBuilder to get back the noun phrase “sIwa” which has only the head and
no modifier. The second SOCElement contains a noun and another element which is
the AdjectiveElement. The noun is passed to the morphology engine and the
In the above sentence the noun phrase “sIwa” plays the role of a subject and the noun
phrase “aMxamEna pAta” plays the role of a complement. Telugu language has the
nominative-accusative pattern with the subject predicate agreement and not the
In the current work there can be one noun phrase which plays the role of a subject
and one or more noun phrases which play the role of complements. The noun phrase
in the subject role can be in the nominative (unmarked) or the dative (marked with
“ki” or “ku”).
The above sentence is an example where the noun phrase in the subject role is
modifier noun phrase. The adjective phrase can be a single word or a group of words that acts as the modifier of a noun. In the current work the adjective phrase is restricted to a single word, a restriction that is expected to be relaxed in later versions. In the example XML file of Figure 6.1 the AdjectivePhraseBuilder passes
the AdjectiveElement to the morphology engine to get back the adjective phrase
back to the noun phrase which finishes the construction of the noun phrase
“aMxamEna pAta”.
Telugu adjectives are generally indeclinable and most often occur before the noun
Class I. The first class of adjectives are called basic adjectives. The adjective roots
that occur only as adjectives belong to this class. These roots always appear before
Class II. The second class of adjectives are called derived adjectives. These
Class III. The third class of adjectives are called positional adjectives. These
words can either be used as nouns or as adjectives depending on the position in the
sentence.
Class IV. The fourth class of adjectives are called bound adjectives. They occur in
There are very few words that can be used as adjectives only and not as anything
else. Some examples of sentences that use basic adjectives are as follows:
6.4.2.1.2 Derived Adjectives
Adjectives derived from other parts of speech like nouns, verbs, adverbs are called
follows:
Almost all the nouns in the nominative singular can function as adjectives when
followed by a noun. Here the noun which does not take the oblique form acts as an
adjective based on its position and therefore it is called a positional adjective. All
the cardinal numbers used adjectively belong to this class of adjectives. Some
All words of colour, taste, and density belong to this class of adjectives.
6.4.2.1.5 Pronominalized Adjectives
pronominal suffix that agrees with the subject phrase in number and gender.
Pronominal adjectives like nA, mA, mana, nI, mI when used as predicate take the
appropriate pronominal suffix in the same way as other adjectives. A few example
Pronominalized adjectives can also occur in the subject position because any noun
can be pronominalized.
Adjectives play the role of a modifier in a noun phrase. All four categories of adjectives discussed above can be used as the modifier in a noun phrase. In the current work the input XML file specifies the adjective word and its associated “suffix” feature as the modifier of the noun phrase. Figure 6.2 shows an example noun phrase with an adjective as its modifier.
<nounphrase role="complement">
<modifier pos="adjective" type="descriptive" suffix="ti">pulupu</modifier>
<head pos="noun" gender="nonmasculine" number="plural" person="third"
casemarker="" stem="basic"> paMdu</head>
</nounphrase>
Figure 6.2. Adjective as a Modifier in a Noun Phrase
The noun phrase specified by the input in Figure 6.2 is “pullati palYlYu” (sour
fruits). The modifier “pullati” is a derived adjective formed by adding the suffix
“ti” to the noun “pulupu”. The formation of the modifier in the Figure 6.2 can be
The current work does not yet take the type attribute into consideration; support for it is planned as future work. The formation of the adjective “pullati” is as follows:
pulupu → pul
pul → pulla
pulla + ti → pullati
After the final sandhi the adjective “pullati” is formed from the noun “pulupu” as the
Simple verbs in their finite forms are inflected for tense followed by person, number,
and gender endings or states. In order to indicate aspectual, modal and voice
distinctions in the actions or states denoted by the verbs, various auxiliaries are
forms of verbs are derived by affixing “A”, “wA”, and “wunnA”, to the root/stem
In the current work the responsibility of the generation of the verb phrase is with the
VerbPhraseBuilder. The verb phrase is also a head-modifier verb phrase with a verb
as the head and an adverb as the modifier of the phrase. In the example XML file of
Figure 6.1 there is only one verb phrase. The VerbPhraseBuilder gets a VAElement
which has a verb along with its features and an AdverbElement. The verb is passed
on to the morphology engine to get back the head of the verb phrase “pAdiMxi”. The
6.4.4 Adverb Phrase
The word adverb is derived from Latin, where ad means “attached to”, which indicates that an adverb is a modifier of a verb. In the current work the construction of the
Adverbs generally occur as modifiers of the verb in a sentence. All the adverbs fall
into three semantic domains denoting time, place and manner. Adverbs belonging to
time and place semantic domain are morphologically nouns since they form oblique
stems and inflect with case suffixes. Some words like “muMxu”(before),
predicate phrase.
adverbial nouns that occur uninflected are illustrated in the following sentences:
In the above sentences the italicised words “repu”, “rAwrulu”, “iMkA” occur
uninflected in the sentence. Some adverbial nouns of time include bound particles or
suffixes like “e”, “ke”, “lo”. Some sentences that include these suffixes are as
follows:
awanu iMtinuMci veMtane vaccAdu. (He came immediately from the house)
In the above sentences the italicized words veVMtan + e, ixivara + ke add the
suffixes “e” and “ke”. Some adverbs are derived from nouns by the addition of “gA”
the infinitive of “av” (to become). Some example sentences to illustrate the use of
demonstrative adjectives like “A”, “I” act as adverbs in the sentence. Some example
Some adverbs of place like “akkada”, “ikkada” are used in uninflected form. Some
Some nouns of place or direction become adverbs by the addition of “gA”. Some
awanu nAku xUraMgA kurcunnAdu (He sat far away from me)
One way of forming manner adverb is by adding “gA” to adjectives and nouns
(which do not specify about time or place). The following is a list of adjectives and
Adjective/Noun    Adverb
peVxxa            peVxxagA
bAgu              bAgugA
ceVdda            ceVddagA
cinna             cinnagA
meVwwa            meVwwagA
Table 6.2: Example Manner Adverbs
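The pattern in Table 6.2 amounts to attaching the suffix “gA” to the adjective or noun. The Python sketch below reproduces the rows of the table by plain concatenation of WX strings. It is a minimal illustration that assumes no sandhi change is needed at the join, which holds for the listed examples but is not claimed for every Telugu stem; the function name manner_adverb is hypothetical.

```python
def manner_adverb(stem, suffix="gA"):
    """Form a manner adverb by concatenating stem and suffix (WX notation).

    Assumption: the stem needs no sandhi adjustment before the suffix.
    """
    return stem + suffix


# Rows of Table 6.2, plus "gatti" -> "gattigA" from the sentence of Figure 6.1.
EXAMPLES = {
    "peVxxa": "peVxxagA",
    "bAgu": "bAgugA",
    "ceVdda": "ceVddagA",
    "cinna": "cinnagA",
    "meVwwa": "meVwwagA",
    "gatti": "gattigA",
}

for stem, adverb in EXAMPLES.items():
    print(stem, "+ gA ->", manner_adverb(stem))
```

Stems that do undergo sandhi would need an additional rule step before the concatenation, which this sketch deliberately leaves out.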
Words like “alA”, “ilA”, “eVlA” are shorter forms of “alAgA”, “ilAgA”, “eVlAgA”
The suffixes “gA” and “lAgA” convert nominal predicates to adverbs when used
The suffix “gA” when added to nouns referring to physical or psychological states
converts them to manner adverbs. The subject in such sentences occurs in the dative
The words “niMdA”, and “ArA” are added to nouns to form adverbs. Some example
6.4.4.2 Adverb Morphological Process
Adverbs play the role of a modifier in the verb phrase. All the categories of adverbs discussed above can be used as the modifier in a verb phrase. In the current work the adverb and its suffix are specified as the modifier of a verb phrase in the XML file. An example verb phrase with an adverb as its modifier is given in Figure 6.3.
The verb phrase specified in Figure 6.3 is “cinnagA pAduwu” (singing slowly). The modifier “cinnagA” is a manner adverb formed by adding the suffix “gA” to the noun “cinna”.
The formation of the modifier in Figure 6.3 can be written along with its features as follows:
follows:
cinna + gA → cinnagA.
morphological phenomenon in which the words get sensitive to each other. Predicate
agreement explains the morphological changes that occur in the predicate appearing
in a sentence with respect to the presence of some other word (Subject or object) in
the sentence. The predicates in Telugu can be divided into three different categories
namely verbal, nominative and abstract. In the case of the verbal predicate in Telugu
the finite verb exhibits agreement in number, gender, and person with its subject
nominal, which is always in the nominative. The use of accusative case for inanimate
objects is optional. In the case of inanimate objects the accusative case can be
The first sentence above is the one in which the accusative suffix is added to the
complement of the sentence. The second sentence is the one in which the accusative
suffix is not added to the complement because the use of accusative case for
above sentences mean the same but the second sentence is the one which is regularly
used.
In the current thesis the predicate type can only be verbal and the other predicate
types like abstract and nominative will be included in the future versions of the
work. Some example input XML files and the surface forms generated are discussed
current work the responsibility for agreement between the nominative subject and the verb lies with the SentenceBuilder, a module that has centralized control over all the other processing modules. The following examples illustrate the agreement of person, number, and gender (PNG) between the subject and the verb.
6.5.1 Agreement with the First Person
In Telugu language “nenu” is the pronoun for the first person. Here we look at an
example XML file which illustrates the agreement between the first person and the
verb.
The sentence has a subject and a verb. The subject “nenu” along with its features can
be written as follows:
The verb “vaswAnu” along with its features can be written as follows:
The future tense mode suffix “wA” is agglutinated to the word and it becomes:
vas + wA → vaswA
The sentence builder then agglutinates the PNG suffix for the first person “nu” to the
vaswA + nu → vaswAnu.
In case the number feature of the XML file is “plural” and not singular then the
or
“exclusive” which is only used for pronouns in the first person and when the number
is plural.
The subject “manamu” along with its features can be written as follows:
The verb will be inflected as usual except for the PNG suffix. The PNG suffix “mu”
vaswA + mu → vaswAmu.
The pronoun used to represent the second person in Telugu language is “nuvvu”. An
example XML file which illustrates the agreement between the second person and
The XML file in Fig 6.5 generates the following sentence:
The subject “nuvvu” along with its features can be written as follows:
The verb “vaswAvu” along with its features can be written as follows:
The future tense mode suffix “wA” is agglutinated to the word and it becomes:
vas + wA → vaswA
The sentence builder then agglutinates the PNG suffix for the second person “vu” to
vaswA + vu → vaswAvu.
In case the number feature of the XML file is “plural” and not singular then the
The subject “mIru” along with its features can be written as follows:
The verb will be inflected as usual except for the PNG suffix. The PNG suffix “ru”
vaswA + ru → vaswAru.
The gender of the person does not show any distinction in the case of second person
An example XML file to illustrate the agreement between the third person and the
The subject “vAdu” along with its features can be written as follows:
The verb “vaswAdu” along with its features can be written as follows:
The future tense mode suffix “wA” is agglutinated to the word and it becomes:
vas + wA → vaswA
The sentence builder then agglutinates the PNG suffix for the third person
vaswA + du → vaswAdu
In case the number feature of the XML file is “plural” and not singular then the
The subject “vAlYlYu” along with its features can be written as follows:
As shown above, when the number is plural the gender of the third person is either human, which denotes human beings, or non-human, which denotes animals and non-living things.
The verb will be inflected as usual except for the PNG suffix. The PNG suffix “ru”
for the third person, plural and human is added to the verb as follows:
vaswA + ru → vaswAru.
6.5.4 Agreement with the Third Person Non-Masculine
persons are treated as non-masculine gender in the singular but in the plural those
nouns are treated as human along with the nouns denoting male persons. Here we
have an example XML file which illustrates the agreement between the non-
The sentence has a subject and a verb. The subject “axi” along with its features can
be written as follows:
The verb “vaswAxi” along with its features can be written as follows:
The future tense mode suffix “wA” is agglutinated to the word and it becomes:
vas + wA → vaswA
The sentence builder then agglutinates the PNG suffix for the third person non-
vaswA + xi → vaswAxi.
In case the number feature of the XML file is “plural” and not singular then the
or
The sentence that will be generated among the above two is decided by the gender
In case the gender is human the subject “vAlYlYu” is generated as described in the
previous section.
In case the gender is non-human the subject “avi” along with its features can be
written as follows:
The verb will be inflected as usual except for the PNG suffix. The PNG suffix “ru”
for the third person, plural and human is added to the verb as follows:
vaswA + ru → vaswAru.
The PNG suffix “yi” for the third person, plural and non-human is added to the verb
as follows:
vaswA + yi → vaswAyi
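The agreement examples of Section 6.5 can be summarized as a lookup from the subject's person, number and gender to a PNG suffix, which is then attached after the tense mode suffix. The Python sketch below collects the suffixes quoted in the examples above and rebuilds the future tense forms of “vas”. It is an illustration only: the function name agree and the gender labels "masculine", "human" and "nonhuman" are assumptions made for this sketch ("nonmasculine" is the value used in Figure 6.1), and the table covers only the combinations discussed in this section.

```python
# PNG suffixes as quoted in the worked examples of Section 6.5 (WX notation).
# Gender is only distinguished in the third person, so it is None elsewhere.
PNG_SUFFIX = {
    ("first", "singular", None): "nu",
    ("first", "plural", None): "mu",
    ("second", "singular", None): "vu",
    ("second", "plural", None): "ru",
    ("third", "singular", "masculine"): "du",
    ("third", "singular", "nonmasculine"): "xi",
    ("third", "plural", "human"): "ru",
    ("third", "plural", "nonhuman"): "yi",
}


def agree(stem, tense_suffix, person, number, gender=None):
    """Attach the tense mode suffix, then the PNG suffix of the subject."""
    key_gender = gender if person == "third" else None
    return stem + tense_suffix + PNG_SUFFIX[(person, number, key_gender)]


# Future tense forms of "vas" with the tense mode suffix "wA".
print(agree("vas", "wA", "first", "singular"))              # subject nenu
print(agree("vas", "wA", "second", "plural"))               # subject mIru
print(agree("vas", "wA", "third", "singular", "masculine")) # subject vAdu
print(agree("vas", "wA", "third", "plural", "nonhuman"))    # subject avi
```

Keeping the suffixes in a single table makes it easy to see that the second person plural and the third person plural human share the suffix “ru”, as the examples show.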
6.6 Word order
Languages such as English, which have a relatively fixed word order, are called positional languages. Unlike English, Telugu is a free word order language, like most other South Asian languages (Dravidian and Indian). The order of grammatical functions such as subjects, complements and objects is largely free. Internal changes in the sentences or position swap between various phrases will not affect grammatical
In Telugu the position or order of occurrence of a noun group does not contain the
information about the karaka or theta roles which specifies the number and type of
contained in the post positions or the surface case endings of nouns (Akshara Bharati
et al., 1995). Therefore the relative free order of the words does not affect the
In the current work the word order is likewise free. The word order of the output is the same as the word order in which the input is given. The SentenceBuilder module makes sure that the order in which the words in a sentence are sent to the output module is
6.7 Output Generator
The output generator is the module that produces the final output as Telugu sentences in the Unicode character set. A sample output in Telugu language for the
[Figure 6.8: Unicode code chart for the Telugu block, U+0C00 to U+0C7F]
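The contents of the Telugu block can also be inspected programmatically. The Python sketch below uses the standard unicodedata module to list the assigned code points between U+0C00 and U+0C7F, and spells out the name “sIwa” from Figure 6.1 in Telugu script. It is an illustration only and not the output generator of the thesis; the mapping of the WX string “sIwa” to the three code points shown is our own transliteration and is stated here as an assumption.

```python
import unicodedata

TELUGU_BLOCK = range(0x0C00, 0x0C80)  # U+0C00..U+0C7F


def assigned_telugu_code_points():
    """Return (code point, character name) pairs for assigned code points."""
    result = []
    for cp in TELUGU_BLOCK:
        name = unicodedata.name(chr(cp), None)  # None if unassigned
        if name is not None:
            result.append((cp, name))
    return result


# Print the first few assigned code points of the block.
for cp, name in assigned_telugu_code_points()[:5]:
    print(f"U+{cp:04X} {name}")

# Assumed transliteration of WX "sIwa": letter SA, vowel sign II, letter TA.
sample = "\u0C38\u0C40\u0C24"
for ch in sample:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The exact number of assigned code points reported depends on the Unicode version bundled with the Python interpreter, so the sketch does not hard-code it.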
Figure 6.8 is the Unicode block for Telugu as of Unicode version 10.0 which
contains characters for Telugu, Gondi and Lambadi languages of Andhra Pradesh
and Telangana. In its original incarnation the code points U+0C01…U+0C4D were a
direct copy of the Telugu characters A1-ED from the 1988 ISCII standard. The grey
6.8 Summary
This chapter presents the details of the sentence formation mechanism used in the current thesis. This chapter gives a detailed description of the different modules used
CHAPTER 7
CONCLUSION
7.1 Introduction
realization engine (surface realizer or realizer for short) for Telugu, an Indian
Although surface realization engines for European languages such as English and
French have been available since the 90s, to the best of our knowledge there is no
general-purpose realization engine for any Indian language. In this context, the
framework that has recently been applied to a wide spectrum of languages including
German (Marcel Bollmann, 2011), Filipino (Ethel Ong et al., 2011), French (Vaudry
and Lapalme, 2013), and Brazilian Portuguese (Rodrigo de Oliveira and Sripada,
2014).
A major effort in building a realization engine for a new language lies in acquiring the required linguistic knowledge. Finding suitable knowledge sources can be difficult, and even once they are found, acquiring all the required knowledge from them can be quite challenging. The SimpleNLG
framework provided the right guidance both for identifying the correct knowledge sources and for acquiring the required knowledge.
framework required for it should offer a central role to morphology. The SimpleNLG
The evaluation studies carried out during the research work show that the developed
based on the principle that realizers should offer complete morphological coverage
while supporting only the most frequently used syntactic forms (which is one of the
reasons why the framework is called Simple). It should be noted that SimpleNLG
framework was originally developed for realizing English, which is known to
the SimpleNLG framework for the morphologically rich Telugu, it has been accepted
theoretical level (the finite state morphological framework) but not necessarily in the
current research work provides wide coverage theoretical basis, our evaluation
studies showed that the realizer (the software) built using the framework needs to
broaden the coverage further. This is particularly true with coverage of noun and
verb forms, the open class words. Our approach while building the software has been
to focus on all the major types of nouns and verbs as described in our knowledge
source.
realization engine software used the WX notation to deal with Telugu orthography
that is not the only notation used by Telugu language technology software. Again,
this is a limitation of the software developed in the research work, but it should be
emphasized that the realization engine framework developed in the current research
An ideal evaluation of the current work should aim to show that the realization
framework and the linguistic knowledge represented by the framework ensure auto-
studies in the current work focus on showing that the realization engine software
surface forms. It is worth emphasizing that further evaluation studies are required to
quantify the quality of the software output which in turn brings greater clarity on the
The most important task for the future is to apply the developed framework to other Indian languages so that it can be claimed to be suitable for the realization of Indian languages. This may involve further refinements to the framework that reflect the differences among the Indian languages. A deeper
As argued in the previous section, the current version of the software does not provide complete coverage of Telugu grammar. Building on the current version, extensions can be made to cover the grammatical features that are not yet covered.
generate more than one grammatically valid form for a given input. This over-generation feature should apply at all levels, including words, phrases and sentences. In terms of software development this is a significant extension, but one that makes the
Another important direction for future work is to actually use the realization engine
as part of an NLG application. (Several small-scale efforts have been carried out
during the current research work to apply the realization engine none of which are
evaluation of the engine. In addition, this will also help in specifying the relative
importance of different modules and where the extensions are really required for a
Another important item of future work is to rebuild the morphology modules using finite state tools such as JFlex. Although this will not change the quality of the output, it would significantly improve the portability of the framework to other Indian languages. Because the current version already uses Java’s regular expression library, the grammatical knowledge required to write the JFlex input files already exists, and it should not be hard to incorporate a JFlex-based lexer for the
REFERENCES
4. Albert Gatt and Ehud Reiter 2009 “SimpleNLG: A realization engine for
practical applications”, Proceedings of ENLG 2009, pp 90-93.
7. Ballesteros, M., Bohnet, B., Mille, S., & Wanner, L. 2015. “Data-driven
sentence generation with non-isomorphic trees”. In Proc. NAACL-HLT’15,
pp. 387–397.
8. Banaee, H., Ahmed, M. U., & Loutfi, A. 2013. “Towards NLG for
Physiological Data Monitoring with Body Area Networks”. In Proc.
ENLG’13, pp. 193–197.
11. Beesley, Kenneth R, Lauri Karttunen. 2003 “Finite State Morphology”. Palo
Alto, CA: CSLI Publications.
13. Benoit Lavoie and Owen Rambow, 1997 “A Fast and Portable Realizer for
Text Generation Systems” Proceedings of the Fifth Conference on Applied
Natural Language Processing (ANLP97), Washington pp. 265–268.
14. Brown, C.P 1991. “The Grammar of the Telugu Language”. New Delhi:
Laurier Books Ltd.
15. Cahill, A., Forst, M., & Rohrer, C. 2007. Stochastic realisation ranking for a
free word order language. In Proc. ENLG’07, pp. 17–24.
16. Carenini, G., & Moore, J. D. 2006. Generating and evaluating evaluative
arguments. Artificial Intelligence, Vol 170 Issue 11, pp. 925–952.
18. Cheng, H., & Mellish, C. 2000. Capturing the interaction between
aggregation and text planning in two generation systems. In Proc. INLG ’00,
Vol. 14, pp. 186–193.
23. Ethel Ong, Stephanie Abella, Lawrence Santos, and Dennis Tiu 2011 “A
Simple Surface Realizer for Filipino” 25th Pacific Asia Conference on
Language, Information and Computation, pp. 51–59, 2011.
26. Gatt, A., Portet, F., Reiter, E., Hunter, J. R., Mahamood, S., Moncur, W., &
Sripada, S. 2009. From data to text in the neonatal intensive care Unit: Using
NLG technology for decision support and information management. AI
Communications, Vol 22 Issue 3, pp.153–186.
27. Girija V. R. and T. Anuradha 2017 Application of Finite State Methods in
Malayalam Text Analysis International Journal of Computer Applications
(0975 – 8887) Volume 168 Issue.12, June 2017, pp. 43-47.
28. Goldberg, E., Driedger, N., & Kittredge, R. I. 1994. Using Natural Language
Processing to Produce Weather Forecasts. IEEE Expert, 2, pp. 45–53.
36. James Hunter, Yvonne Freer, Albert Gatt, Ehud Reiter, Somayajulu Sripada,
Cindy Sykes, and Dave Westwater 2011 “BT-Nurse Computer Generation of
Natural Language Shift Summaries from Complex Heterogeneous Medical
Data”. Journal of the American Medical Informatics Association Sep-Oct Vol
18 Issue 5 pp. 621-624.
37. John Henry Clippinger, Jr. 1977 “Meaning and Discourse - A Computer
Model of Psychoanalytic Speech and Cognition”. The Johns Hopkins Univ.
Press, Baltimore, ISBN 0-8018-1943-1.
41. Knight, K., Hatzivassiloglou, V 1995 “NITROGEN: Two-Level, Many-
Paths Generation”. Proceedings of the ACL-95 conference. Cambridge, MA
pp. 252-260.
49. Lauri Karttunen and Kenneth R. Beesley 2005 “Twenty-Five Years of Finite
State Morphology”. Inquiries into Words, Constraints and Contexts.
51. Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2014. “Semi-supervised
learning of morphological paradigms and lexicons” in Proceedings of the
14th Conference of the European Chapter of the Association for
Computational Linguistics, Gothenburg, Sweden, April 26-30 2014 pp 569–
578.
52. Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm
classification in supervised learning of morphology. In Human Language
Technologies: The 2015 Annual Conference of the North American Chapter
of the ACL, Denver, Colorado, May 31 – June 5, 2015 pp 1024–1029.
53. Mann, W. C., & Matthiessen, C. M. 1983. Nigel: A systemic grammar for
text generation. Tech. rep., ISI, University of Southern California, Marina del
Rey, CA (Technical Report RR) pp.83-105.
54. Marcel Bollmann, 2011 “Adapting SimpleNLG to German” Proceedings of
the 13th European Workshop on Natural Language Generation (ENLG),
Nancy, France, September pp 133–138.
56. Mellish, C., Scott, D., Cahill, L., Paiva, D. S., Evans, R., & Reape, M.
(2006). A Reference Architecture for Natural Language Generation Systems.
Natural Language Engineering, Vol 12 Issue 01, pp.1–34.
57. Meteer, M. W., McDonald, D. D., Anderson, S., Forster, D., Gay, L.,
Huettner, A., & Sibun, P. 1987. “Mumble-86: Design and Implementation”.
Tech. rep., University of Massachusetts at Amherst, Amherst, MA (Technical
Report COINS 87-87).
58. Minnen G, Carroll J, Pearce D. 2000 “Robust, applied morphological
generation”. Mitzpe Ramon, Israel: Proceedings of the 1st International
Natural Language Generation Conference. pp. 201-208.
59. Molina, M., Stent, A., & Parodi, E. 2011. “Generating Automated News to
Explain the Meaning of Sensor Data”. In Gama, J., Bradley, E., & Hollmén,
J. (Eds.), Proc. IDA 2011, Springer, Berlin and Heidelberg, pp. 282–293.
65. Pierre-Luc Vaudry and Guy Lapalme 2013 “Adapting SimpleNLG for
bilingual English-French realisation” Proceedings of the 14th European
Workshop on Natural Language Generation, Sofia, Bulgaria, August 8-9 pp
183–187.
66. PJ Antony, KP Soman 2012:”Computational morphology and natural
language parsing for Indian languages”: a literature survey. International
Journal of Computer Science and Engineering Technology. International
Journal of Scientific & Engineering Research Volume 3, Issue 3, March-2012
ISSN 2229-5518, pp 1-11.
67. Plachouras, V., Smiley, C., Bretz, H., Taylor, O., Leidner, J. L., Song, D., &
Schilder, F. 2016. Interacting with financial data using natural language. In
Proc. SIGIR’16, pp. 1121–1124.
68. Portet, F., Reiter, E., Gatt, A., Hunter, J. R., Sripada, S., Freer, Y., & Sykes,
C. 2009. Automatic generation of textual summaries from neonatal intensive
care data. Artificial Intelligence, Vol 173 Issue 7-8, pp. 789–816.
70. Ramos-Soto, A., Bugarin, A. J., Barro, S., & Taboada, J. 2015. Linguistic
Descriptions for Automatic Generation of Textual Short-Term Weather
Forecasts on Real Prediction Data. IEEE Transactions on Fuzzy Systems,
Vol 23 Issue 1, pp. 44–57.
74. Reiter, E., Robertson, R., & Osman, L. M. 2003. Lessons from a failure:
Generating tailored smoking cessation letters. Artificial Intelligence, Vol 144
Issue 1-2, pp. 41–58.
75. Reiter, E., Sripada, S., Hunter, J. R., Yu, J., & Davy, I. 2005. Choosing
words in computer-generated weather forecasts. Artificial Intelligence, Vol
167 Issue 1-2, pp. 137–169.
76. Rieser, V., & Lemon, O. 2009. Natural Language Generation as Planning
Under Uncertainty for Spoken Dialogue Systems. In Eacl 2009, pp. 683–
691.
Natural Language Generation Conference, Philadelphia, Pennsylvania, 19-21
June 2014 pp 93–94.
79. Siddharthan, A., Green, M., van Deemter, K., Mellish, C., & van der Wal, R.
2013. Blogging birds: Generating narratives about reintroduced species to
promote public engagement. In Proc. INLG’13, pp. 120–124.
80. Siddharthan, A., Nenkova, A., & McKeown, K. R. 2011. Information Status
Distinctions and Referring Expressions: An Empirical Study of References to
People in News Summaries. Computational Linguistics, Vol 37 Issue 4, pp.
811–842.
82. Sri Badri Narayanan R, Saravanan S, Soman KP. 2009 “Data Driven Suffix
List and Concatenation Algorithm for Telugu Morphological Generator”.
International Journal of Engineering Science and Technology (IJEST) ISSN :
0975-5462 Vol. 3 Issue 8 pp. 6712-6717.
83. Stock, O., Zancanaro, M., Busetta, P., Callaway, C., Krüger, A., Kruppa, M.,
Kuflik, T., Not, E., & Rocchi, C. 2007. “Adaptive, intelligent presentation of
information for the museum visitor in PEACH”. User Modeling and User-
Adapted Interaction, Vol 17 Issue 3, pp. 257–304.
84. Theune, M., Klabbers, E., de Pijper, J.-R., Krahmer, E., & Odijk, J. 2001.
From data to speech: a general approach. Natural Language Engineering, Vol
7 Issue 1, pp. 47–86.
85. Thompson H 1977 “Strategy and Tactics: A model for Language Production”
Papers from the Thirteenth Regional Meetings, Chicago Linguistic Society
pp. 651-668.
86. Turner, R., Sripada, S., Reiter, E., & Davy, I. 2008. Selecting the Content of
Textual Descriptions of Geographically Located Events in SpatioTemporal
Weather Data. In Applications and Innovations in Intelligent Systems XV,
pp. 75–88.
87. Uma Maheshwar Rao, G. and Christopher Mala 2011 “TELUGU WORD
SYNTHESIZER” International Telugu Internet Conference Proceedings,
Milpitas, California, USA 28th-30th September pp 1-8.
88. Van Deemter, K., Krahmer, E., & Theune, M. 2005. Real versus
template-based natural language generation: A false opposition?
Computational Linguistics, Vol 31 Issue 1, pp. 15–24.
89. Vaudry, P.-L., & Lapalme, G. 2013. Adapting SimpleNLG for bilingual
French English realisation. In Proc. ENLG’13, pp. 183–187.
90. Vishal Goyal, Gurpreet Singh Lehal 2011 “Hindi to Punjabi Machine
Translation System”. Proceedings of the ACL-HLT 2011 System
Demonstrations, Portland, Oregon, USA, 21 June 2011 pp 1–6.
91. Walker, M. A., Rambow, O., & Rogati, M. 2002. Training a sentence planner
for spoken dialogue using boosting. Computer Speech and Language, Vol 16
Issue 3-4, pp. 409–433.
93. Wanner, L., Bosch, H., Bouayad-Agha, N., & Casamayor, G. 2015. Getting
the environmental information across: from the Web to the user. Expert
Systems, Vol 32 Issue 3, pp. 405–432.
94. Weizenbaum, Joseph 1966. "ELIZA—a computer program for the study of
natural language communication between man and
machine". Communications of the ACM. Vol 9: pp. 36–45.
APPENDIX
PAPERS EMANATED FROM THE THESIS
List