Thesis "A Simple Surface Realization Engine For Telugu"

-------------------------CHAPTER-1
INTRODUCTION
1.1 Introduction
Natural Language Generation (NLG) is a computational task aiming to automatically
generate natural language text (paragraphs, sections and even entire documents)
from non-linguistic input (Reiter E, Dale R 2000). For example, an NLG system
working in the healthcare domain could start with a patient record and physiological
data (such as heart rate and blood pressure) collected over a shift period to auto-
generate a shift handover report from an outgoing nurse to an incoming nurse (James
Hunter et al., 2011).
NLG has come a long way from being, in the 1960s and 1970s, a research field with
limited objectives (e.g. to demonstrate language generation capabilities as part of an
application) to a major technology backed by solid science and with huge commercial
potential. Gartner predicts
(http://www.gartner.com/smarterwithgartner/gartner-predicts-our-digital-future/) that
NLG software will auto-generate 20% of business content by 2018, just the next
year. With mountains of electronic data available to organizations (public, private
and third sector) and users’ demand for information shooting up all the time, NLG
holds tremendous potential for delivering an unprecedented transformation in our
society by auto-generating human comprehensible information in a language of
choice (English, Hindi or Telugu).
In state-of-the-art systems the complex task of NLG is divided into subtasks that either
make text design decisions (decisions related to subject matter expertise as well as
linguistic decisions) or execute the decisions made earlier to produce the surface text. The
second set of subtasks that execute decisions are collectively grouped into a subtask
called surface realization. The research work presented in this thesis relates to the
surface realization subtask focusing particularly on Telugu, a south Indian
morphologically rich language belonging to the Dravidian family of languages
(https://en.wikipedia.org/wiki/Dravidian_languages).
The subtask of surface realization is challenging, despite the fact that all the text
design decisions would have been made prior to surface realization, because a
surface realizer is responsible for ensuring the grammatical (e.g. formation of well-
formed words, phrases and sentences) as well as orthographic validity (e.g.
punctuation) of the auto-generated text. Because a natural language (say Telugu)
would have evolved over millennia, its grammar and orthography would also have
evolved into a complex body of knowledge. The real challenge in building a surface
realizer is to acquire all this complex body of knowledge first, represent this
knowledge for computational modelling and finally develop algorithms that exploit
this knowledge for constructing sentences algorithmically.
For English and other similar European languages with well-developed grammatical
resources, building surface realizers has been undertaken with significant success in
the last three decades. Several off-the-shelf English surface realizers such as
FUF/SURGE, RealPro and SimpleNLG are available. However, general purpose
surface realizers for Indian languages are not commonplace. In the current
socio-economic context of India, when technological advancements have been driving
societal transformation, language technologies have a big role to play. This is being
acknowledged by the government of India, and several organizations (e.g. IIIT
Hyderabad and CIIL Mysore) have been set up to develop Indian language
technology. While machine translation has always been the primary focus of Indian
language technology, it is time that other language technologies such as NLG should
be encouraged in modern India. With the proliferation of smart phones in recent
times, there is a great potential to deliver finance, health and transport related
information personalized to individuals on their mobile phones in any Indian
language using NLG.
It is worth noting that across this diverse set of applications (from finance to
transport) the surface realizer is the only NLG module that can be designed and
built to be domain-independent. The decision-making subtasks of NLG mentioned earlier
are domain-specific and therefore involve significant adaptation work to be portable
across multiple domains. Realizers, in contrast, are reusable, and as such a thorough
scientific investigation into a realizer for a new language, such as Telugu, will
deliver return on investment (ROI) with the passage of time.
1.2 Motivation
In the standard Reiter and Dale architecture (Reiter E, Dale R 2000) of NLG a
surface realizer is the most language dependent module and also the one that could
be reused across several applications (domain independent). To the best of our
knowledge there have been no previous research studies that explored Telugu
realizers systematically. On the other hand in the recent past there has been a
resurgence of interest in general purpose realizers in the research field of NLG. This
resurgence is largely caused by the SimpleNLG (Albert Gatt and Ehud Reiter, 2009)
realizer for English. This realizer is unique in comparison to earlier general purpose
realizers for English such as PENMAN and FUF/SURGE; SimpleNLG is designed
using commonly used grammatical concepts without the overhead of any linguistic
theory. SimpleNLG is the most widely used realizer for English both in academia
and in the industry.
Inspired by the success of the linguistic-theory-neutral SimpleNLG approach,
realizers for a wide range of languages have been reported in the literature;
French, German, Brazilian Portuguese and Filipino are a few examples.
The use of software applications is increasing day by day in all parts of India,
especially in the rural areas of the states of Telangana and Andhra Pradesh, where
many people do not understand English. There is an evident need for a general purpose
surface realizer that generates sentences in their native language and so helps them
understand the applications better. The current work is motivated by the need for a
general purpose Telugu realizer and the availability of the popular SimpleNLG
design.
This thesis reports a realization engine for Telugu that automates the task of building
grammatically well-formed Telugu sentences from an input specification consisting
of lexicalized grammatical constituents and associated features. Our realization
engine adapts the design approach of the SimpleNLG family of surface realizers.
1.3 Objectives
In this thesis the design and development of a surface realization engine for Telugu
is investigated. The following objectives have been chosen to achieve the aim:
1. The objectives for the Telugu language are:
a. To identify the most commonly used Telugu sublanguage and its
associated features.
b. To identify the right sources of the required Telugu grammatical
(linguistic) knowledge and to acquire the required knowledge from these
sources.
2. The objectives for the realization engine are:
a. To design and develop a framework for modelling the acquired
knowledge for the purpose of Telugu sentence construction
algorithmically.
b. To design, develop and evaluate a surface realization engine for Telugu
using the knowledge acquired and using the developed framework.
3. The objectives for the noun and pronoun morphological engine are:
a. To design and develop the plural formation mechanism for Telugu nouns
and pronouns.
b. To design and develop the oblique stem formation mechanism for Telugu
nouns and pronouns.
c. To design and develop the case-marker agglutination mechanism for
Telugu nouns and pronouns.
4. The objectives for the verb morphological engine are:
a. To design and develop a mechanism for identification of verb class.
b. To design and develop a mechanism for the final verb formation with the
agglutination of the required constituents.
5. The objectives for the sentence formation are:
a. To design and develop a mechanism for identifying the input as
predefined grammatical constituents and constructing appropriate element
builders (software component in the implementation) for them.
b. To design and develop a mechanism for applying agreement rules, and to
facilitate free word order of the constituents of Telugu sentence.
1.4 Methodology
This thesis presents the research work carried out to design and develop a Telugu
realization engine adapting the SimpleNLG (Albert Gatt and Ehud Reiter, 2009)
framework for Telugu. The input to the Telugu realization engine is an XML file
specifying lexicalized sentence constituents and their associated features.
The methodology followed in the thesis is as follows:
1. The input specification for the Telugu realization engine is an XML file
modelled on the SimpleNLG approach.
2. Primary meaning of Telugu sentences is mainly expressed using
inflected forms of content words and case markers or postpositions.
Therefore each part-of-speech has a separate morphology engine to
generate the inflected forms of words.
3. Telugu is not governed by a phrase structure grammar; it fits better
into a dependency grammar. Therefore head-modifier dependency
structures are used at the phrase level.
4. Sentence constituents in Telugu can be ordered freely without impacting
the primary meaning of the sentence. Because any order is acceptable, the
constituents of a declarative sentence are realized in a predefined default
sequence (Subject + Object + Verb).
5. Agreement among sentence constituents in Telugu is done at the
sentence level. Therefore in the current thesis the agreement of the verb
with the subject is not performed in the morphology engine, instead it is
performed at the sentence level.
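Points 4 and 5 above can be illustrated with a minimal sketch. It is not the thesis implementation: the class, enum and method names are hypothetical, the agreement suffix table is a deliberately simplified assumption covering only three person-number-gender values, and the verb stem is assumed to arrive already marked for tense from the verb morphology engine. All forms are in WX notation.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: all names here are hypothetical.
final class SentenceLevelSketch {

    // Declaration order encodes the default Subject + Object + Verb sequence.
    enum Role { SUBJECT, OBJECT, VERB }

    // Person-number-gender values covered by this toy example.
    enum Agreement { FIRST_SINGULAR, THIRD_SINGULAR_MASCULINE, THIRD_PLURAL_HUMAN }

    // Simplified past-tense agreement suffixes (assumed for illustration).
    private static final Map<Agreement, String> SUFFIX = new EnumMap<>(Agreement.class);
    static {
        SUFFIX.put(Agreement.FIRST_SINGULAR, "nu");
        SUFFIX.put(Agreement.THIRD_SINGULAR_MASCULINE, "du");
        SUFFIX.put(Agreement.THIRD_PLURAL_HUMAN, "ru");
    }

    static final class Constituent {
        final Role role;
        final String form; // produced earlier by the word-level morphology engines

        Constituent(Role role, String form) {
            this.role = role;
            this.form = form;
        }
    }

    /** Orders constituents as S-O-V and attaches the agreement suffix to the verb. */
    static String realise(List<Constituent> input, Agreement subjectFeatures) {
        List<String> words = new ArrayList<>();
        for (Role role : Role.values()) {
            for (Constituent c : input) {
                if (c.role != role) {
                    continue;
                }
                // Agreement is applied here, at sentence level, not in the morphology engine.
                words.add(role == Role.VERB ? c.form + SUFFIX.get(subjectFeatures) : c.form);
            }
        }
        return String.join(" ", words) + ".";
    }

    public static void main(String[] args) {
        // Constituents may arrive in any order; the realizer imposes S-O-V.
        List<Constituent> input = List.of(
            new Constituent(Role.VERB, "koVttA"),
            new Constituent(Role.OBJECT, "vaMxa parugulu"),
            new Constituent(Role.SUBJECT, "virAt kohli"));
        System.out.println(realise(input, Agreement.THIRD_SINGULAR_MASCULINE));
    }
}
```

Keeping the suffix choice in `realise` rather than in the verb morphology step mirrors the design decision in point 5: the verb engine does not need to know anything about the subject.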
1.5 Organization of the Thesis
The thesis is organized into seven chapters. The following are the major
contributions in the thesis.
1. This thesis presents a systematic and thorough investigation into the first
Telugu surface realizer.
2. Telugu syntactic structures are very simple by nature and Telugu sentence
construction is dominated by morphological generation. Based on this
understanding Telugu sentence construction as a computational task has been
developed.
3. The SimpleNLG approach has been adapted for Telugu sentence construction.
4. The SimpleNLG approach, like any other approach to surface realization, requires
grammar engineering. We found that the grammar available in Krishnamurti's book
serves the purpose, and that there is a good match between Krishnamurti's Telugu
grammar (linguistic knowledge) and SimpleNLG (computational approach).
These contributions are described across the seven chapters of the thesis.
A brief introduction to each of the chapters is as follows:
Chapter 1 introduces surface realization, the topic of the thesis. It
provides an overview of the need for surface realization in India in the present
scenario. It also provides a very brief introduction to the Telugu language.
Chapter 2 deals with review of literature on foundational concepts and past work in
surface realization and computational morphology. It provides information about the
past work in a hierarchical manner starting from Natural language processing (NLP),
Natural Language Generation (NLG) to Surface Realization and computational
morphology in the context of languages all over the world and in India.
Chapter 3 deals with the surface realizer for Telugu, a Java-based application
that accepts a lexicalized XML file as input and generates well-formed Telugu
sentences. It provides the software architecture for the construction of the surface
realization engine. It also provides a standard surface realization input mechanism for
Telugu which can be extended to other Indian languages.
Chapter 4 deals with the construction of the morphological engine for nouns and
pronouns, covering plural formation, inflection based on number, person and gender,
and agglutination with case markers.
Chapter 5 deals with the detailed study of the computational process of
morphophonemic changes the verb undergoes when inflected with tense-mode
suffixes. The verbs are classified into six conjugations based on the
morphophonemic changes, of which five classes contain regular verbs and the last
one contains irregular verbs.
Chapter 6 deals with adjective morphology, adverb morphology, aspects of phrase
formation, sentence level issues like subject verb agreement, and the word order of
the constituents of the sentence. The chapter also provides a detailed explanation of
the process followed and the modules in the implementation of the software for the
Telugu realization engine.
Chapter 7 discusses the conclusion of the thesis which includes a critical review of
the current work and the possibilities of the future work for the extension of the
current research.
-------------------------CHAPTER-2
REVIEW OF LITERATURE
2.1 Introduction
This chapter presents a literature review on the topics of this thesis, namely natural
language generation (NLG), realization engines and morphology, particularly within
the context of the Telugu language. In addition, this chapter presents background
knowledge supporting the main topics presented in this thesis.
Because the current research is in the area of computational linguistics/natural
language processing, this chapter starts with a brief introduction of this area and how
it relates to artificial intelligence and Computer Science. Following this, a brief
introduction of natural language generation (NLG), which is the main topic of this
thesis, is presented. In this introduction, the task of NLG is first defined and then the
most commonly used subtasks and the different architectures for NLG are described.
Then, a survey of NLG literature focusing on major milestones is described.
Following this, NLG of Indian languages is discussed focusing mainly on
computational morphology and sentence formation in Telugu language.
[Figure 2.1: nested hierarchy. Computer Science > Artificial Intelligence >
Computational Linguistics/Natural Language Processing > Natural Language
Understanding and Natural Language Generation]
Figure 2.1 Hierarchy showing NLG in the context of broader fields
The topic of this thesis is a subfield of Artificial Intelligence (AI) (Winograd, 1972;
Appelt, 1985). AI is a field of computing science that studies computational techniques
for automating mental processes such as language comprehension, language production
and general problem solving. A subfield of AI is Natural Language
Processing, which focusses on language understanding and generation by computers,
as shown in Figure 2.1.
2.2 Natural Language Processing
Natural Language Processing is a subfield of Artificial Intelligence concerned with
programming the computer to process natural language text. Natural language
processing has two subfields namely Natural Language Understanding (NLU) and
Natural Language Generation (NLG). Natural Language Understanding (NLU)
systems take strings of words (sentences) as their input and produce structured
representations capturing the meaning of those strings as their output. The nature of
this output depends heavily on the task at hand. For instance, a natural language
understanding system serving as an interface to a database might accept queries in
natural language which relate to the kind of data held by the database. In this case
the meaning of the input (the output of the system) might be expressed in terms of
structured SQL queries which can be directly submitted to the database. The now
discontinued English Query by Microsoft
(https://technet.microsoft.com/en-us/library/ms143754(v=sql.90).aspx) is a good
example of a natural language interface to relational databases. Natural Language
Generation (NLG), which is the topic of this thesis, is discussed in detail in the
next section.
2.3 Natural Language Generation
Natural Language Generation (NLG) is a subfield of artificial
intelligence/computational linguistics/Natural Language Processing (NLP) that
focuses on developing computational techniques that can automatically produce
understandable text in human languages. Starting from some non-linguistic
representation of input, NLG systems use knowledge about language and the
application domain to automatically produce documents, reports, explanations, help
messages and other kinds of text. Many applications have been developed over the
years which automatically generate text from non-linguistic data including but not
limited to the following:
• soccer reports (e.g., Theune et al., 2001; Chen & Mooney, 2008);
• virtual ‘newspapers’ from sensor data (Molina et al., 2011);
• textual descriptions of the day-to-day lives of birds based on satellite data
(Siddharthan et al., 2013);
• weather and financial reports (Goldberg et al., 1994; Reiter et al., 2005;
Turner et al., 2008; Ramos-Soto et al., 2015; Wanner et al., 2015; Plachouras
et al., 2016);
• summaries of patient information in clinical contexts (Huske-Kraus, 2003;
Harris, 2008; Portet et al., 2009; Gatt et al., 2009; Banaee et al., 2013);
• interactive information about cultural artefacts, for example in a museum
context (e.g., O’Donnell, 2001; Stock et al., 2007); and
• text intended to persuade (Carenini & Moore, 2006) or motivate behaviour
modification (Reiter et al., 2003).
NLG is both a fascinating area of research and an emerging technology with many
real-world applications. As a research area, NLG brings a unique perspective on
fundamental issues in artificial intelligence, cognitive science, and human computer
interaction. These include questions such as how linguistic and domain knowledge
should be represented and reasoned with, what it means for a text to be well written,
and how information is best communicated between machine and human. From a
practical perspective, NLG technology is capable of partially automating routine
document creation, removing much of the drudgery associated with such tasks. It is
an actively researched topic in the research laboratories around the world, and is also
deployed in real applications, to present and explain complex information to people
who do not have the background or time required to understand the raw data. In the
longer term, NLG is also likely to play an important role in human-computer
interfaces and will allow much richer interaction with machines than is possible
today.
In one sense language generation is the oldest subfield of language processing: when
computers were able to understand only the most unnatural of command languages,
they were already producing natural texts. For example, the oldest and most famous C
program, the ‘hello, world’ program, is a generation program. It produces useful
literate English in context. Unfortunately, whatever subtle or sublime communicative
force this text holds is produced not by the program itself but by the author of that
program. This canned text approach to generation is easy to implement but is unable
to adapt to new situations without the intervention of a programmer.
Language generation is also the most pervasive subfield of language processing.
Most of us have received a form letter with our name carefully inserted in just the right
places along with eloquent appeals for one thing or another. This sort of program is
easy to implement as well, but few readers are likely to be fooled into thinking that such a
letter is handwritten English. This approach, called template filling, is more flexible
than canned text and has been used in a variety of applications, but it is still limited. For
example, Weizenbaum's (1966) use of templates in ELIZA worked well in some
situations but produced nonsense in others.
Both the canned text and template approaches to NLG do not really capture the rich
linguistic information that native language users associate with utterances – a
sentence in a natural language is not a sequence of disconnected words. Instead, a
sentence has a structure (syntax of phrases and clauses) and consists of well-formed
words (morphology) all of which communicate a specific message to its recipients.
As a field of research, NLG aims to develop computational models of NLG that
capture the rich linguistic structures of natural languages for the purpose of language
production.
2.3.1 History of NLG
John Bateman and Michael Zock created an exhaustive list of NLG systems which is
available in a wiki form at http://www.nlg-wiki.org/. From this Wiki, it can be
observed that NLG has been applied to a wide spectrum of applications, from
medicine to meteorology and from finance to engineering. Table 2.1 lists
major NLG research before the nineties.
In the nineties, the field of NLG further consolidated the idea of NLG as a complex
task and moved towards the consensus architecture for Natural Language Generation
Systems consisting of three major components namely document planning,
microplanning, and surface realization, described in detail in (Reiter, 1994;
http://homepages.abdn.ac.uk/e.reiter/pages/papers/nlgw94.pdf). Much of the applied
NLG work since then has been strongly influenced by this consensus architecture.
Year | Authors | Paper
1972 | Richard J Hanson, Robert F Simmons and J Slocum | Generating English Discourse from Semantic Networks
1977 | John H Clippinger | Meaning and Discourse - A Computer Model of Psychoanalytic Speech and Cognition
1975 | Goldman N | Conceptual Generation
1977 | Meehan J R | Tale-Spin an Interactive Program that writes Stories
1977 | Thompson H | Strategy and Tactics: A model for Language Production
1979 | Philip R Cohen and C Raymond Perrault | Elements of a Plan-Based Theory of Speech Acts
1979 | James A Moore and William C Mann | A Snapshot of KDS A knowledge Delivery System
1987 | Meteer M., D. McDonald, S. Anderson, D. Forster, L. Gay, A. Huettner, and P. Sibun | “Mumble-86”: Design and Implementation, Technical Report 87–87
1978 | Walther V Hahn, Wolfgang Hoeppner, Antony Jameson, Wolfgang Wahlster | The Anatomy of the Natural Language Dialogue System HAM-RPM
1985 | Appelt D | Planning English Sentences
Table 2.1 NLG Research before 1990
2.3.2 NLG Tasks
The NLG problem of converting input data into output text is usually addressed by
splitting it up into a number of subproblems. The following six are frequently found
in many NLG systems (Reiter & Dale, 1997, 2000):
1. Content determination: Deciding which information to include in the text under
construction,
2. Text structuring: Determining in which order information will be presented in
the text,
3. Sentence aggregation: Deciding which information to present in individual
sentences,
4. Lexicalisation: Finding the right words and phrases to express information,
5. Referring expression generation: Selecting the words and phrases to identify
domain objects,
6. Linguistic realisation: Combining all words and phrases into well-formed
sentences.
The above-mentioned NLG subtasks are complex and address the following two
issues:
a) Firstly, given a communication goal and the context of communication, the
research agenda is to develop computational techniques for
computing the required content (information) that should be verbalized in the
text. Logically, the organization of content (information) into a coherent
narrative has also been closely associated with content determination.
b) Secondly, the focus is on developing computational techniques for
expressing the content (or information) in well-formed words, phrases and
sentences.
2.3.2.1 Content Determination
In the generation process the NLG system needs to decide which information should
be included in the text under construction and which should not. In general the data
contains more information than required. It is the task of the NLG system to decide the
required content for the given application. Content determination therefore involves
choice, and it needs to be closely tied to the domain of application (cf.
Mellish et al., 2006).
2.3.2.2 Text Structuring
After the determination of what information is to be conveyed the NLG system
needs to decide the order of presentation to the user. This task is referred to as Text
Structuring. The ordering of the text in most applications is based on importance,
and grouping of information based on relatedness (Portet et al., 2009).
2.3.2.3 Sentence Aggregation
Not every message in the text plan needs to be expressed as a separate
sentence. Some messages can be combined to form a single sentence. Text generated by
combining messages in this way becomes more fluid and readable (Dalianis, 1999; Cheng
& Mellish, 2000).
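As a rough illustration of aggregation, the sketch below merges messages that share a subject into one sentence joined by "and". The `Message` class, the grouping rule and the English example data are all assumptions made for illustration; real aggregation strategies in the cited literature are considerably richer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a single, very simple aggregation rule.
final class AggregationSketch {

    static final class Message {
        final String subject;
        final String predicate;

        Message(String subject, String predicate) {
            this.subject = subject;
            this.predicate = predicate;
        }
    }

    /** Groups messages by subject (first-seen order) and conjoins their predicates. */
    static List<String> aggregate(List<Message> messages) {
        Map<String, List<String>> bySubject = new LinkedHashMap<>();
        for (Message m : messages) {
            bySubject.computeIfAbsent(m.subject, k -> new ArrayList<>()).add(m.predicate);
        }
        List<String> sentences = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : bySubject.entrySet()) {
            sentences.add(entry.getKey() + " " + String.join(" and ", entry.getValue()) + ".");
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<Message> plan = List.of(
            new Message("The patient", "has a stable heart rate"),
            new Message("The patient", "is breathing normally"),
            new Message("The nurse", "changed the dressing"));
        aggregate(plan).forEach(System.out::println);
    }
}
```

Three messages become two sentences here. Note that grouping by subject can reorder messages that were not adjacent in the text plan, which a real system would have to weigh against the ordering chosen during text structuring.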
2.3.2.4 Lexicalization
Lexicalization is a very important decision regarding which words or phrases to use
to express the messages effectively. The complexity involved in the lexicalization
process depends on the number of alternatives the NLG system can offer. Lexical
choice in applications may be informed by contextual constraints, stylistic
constraints, or attitude and affective stance towards the event in question
(Fleischman & Hovy, 2002).
2.3.2.5 Referring Expression Generation
Referring Expression Generation is defined by Reiter and Dale (1997) as the task of
selecting words or phrases to identify domain entities. This definition suggests a
close similarity to lexicalisation, but Reiter and Dale (2000) point out that the
essential difference is that referring expression generation is a discrimination task,
where the system needs to communicate sufficient information to distinguish one
domain entity from other domain entities. Referring Expression Generation is among
the tasks within the field of automated text generation (Mellish et al., 2006;
Siddharthan et al., 2011).
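The discrimination view of referring expression generation can be sketched as a greedy attribute selection: attributes of the target are considered in a fixed preference order, and an attribute is kept only if it rules out at least one remaining distractor. This is a simplified illustration loosely in the style of incremental selection approaches from the literature, not a method described in this thesis; the attribute names and entities are invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: greedy selection of distinguishing attributes.
final class ReferringExpressionSketch {

    /**
     * Returns the attribute-value pairs chosen to describe the target.
     * If the attributes run out before all distractors are excluded,
     * the returned description is not fully distinguishing.
     */
    static Map<String, String> distinguish(Map<String, String> target,
                                           List<Map<String, String>> distractors,
                                           List<String> preferenceOrder) {
        Map<String, String> description = new LinkedHashMap<>();
        List<Map<String, String>> remaining = new ArrayList<>(distractors);
        for (String attribute : preferenceOrder) {
            if (remaining.isEmpty()) {
                break; // target is already uniquely identified
            }
            String value = target.get(attribute);
            if (value == null) {
                continue;
            }
            int before = remaining.size();
            // Drop every distractor that does not share this value with the target.
            remaining.removeIf(d -> !value.equals(d.get(attribute)));
            if (remaining.size() < before) {
                description.put(attribute, value); // the attribute has discriminatory power
            }
        }
        return description;
    }

    public static void main(String[] args) {
        Map<String, String> target = Map.of("type", "dog", "colour", "black", "size", "small");
        List<Map<String, String>> distractors = List.of(
            Map.of("type", "dog", "colour", "white", "size", "small"),
            Map.of("type", "cat", "colour", "black", "size", "small"));
        System.out.println(distinguish(target, distractors, List.of("type", "colour", "size")));
    }
}
```

In the example, "size" is never selected because it does not separate the target from either distractor, so the description stays at type and colour ("the black dog").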
2.3.2.6 Surface Realization
This task, referred to as Surface Realization or Linguistic Realization, involves
ordering the constituents of a sentence and generating the right morphological forms. An
important complication in this task is that the output needs to have many linguistic
components which may not be present in the input. Thus this task involves projection
between non-isomorphic structures (cf. Ballesteros et al., 2015). Some of the approaches
that have been proposed for linguistic realization are:
a) Human-crafted templates
b) Statistical approaches
c) Human-crafted grammar-based systems
2.3.2.6.1 Human-crafted templates
When the application is small and the variation in the output is minimal the outputs
can be specified in the form of templates. For example:
$AtagAdu $parugulu parugulu koVttAdu
This template has two variables which can be filled with the names of a player and
the number of runs scored by the player. It can generate sentences like:
virAt kohli vaMxa parugulu koVttAdu. (Virat Kohli hit hundred runs)
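The template above can be filled mechanically. The following is a minimal sketch, assuming that slots are written as `$name` and that values are supplied in a map keyed by slot name; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: slot filling for a "$name"-style template.
final class TemplateSketch {

    static final String TEMPLATE = "$AtagAdu $parugulu parugulu koVttAdu";

    private static final Pattern SLOT = Pattern.compile("\\$(\\w+)");

    /** Replaces every $slot in the template with its value from the map. */
    static String fill(String template, Map<String, String> values) {
        Matcher matcher = SLOT.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for slot " + matcher.group(1));
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sentence = fill(TEMPLATE, Map.of("AtagAdu", "virAt kohli", "parugulu", "vaMxa"));
        System.out.println(sentence + ".");
    }
}
```

The fixed part of the template, including the verb form koVttAdu, never changes. If the subject slot were filled with a plural or feminine subject, the verb would no longer agree with it, which illustrates why templates struggle once linguistic variation is required.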
The advantage with the use of templates is that they avoid the generation of
ungrammatical structures and allow for full control over the quality of the output.
The template based methods have started including syntactic information and
sophisticated rules for filling the gaps (Theune et al., 2001) making it difficult to
distinguish template based methods from other methods (van Deemter et al., 2005).
The disadvantage of template based systems is that they are labour intensive and do not
scale to applications which require considerable linguistic variation.
2.3.2.6.2 Statistical Approaches
Some applications acquire probabilistic grammars from large corpora reducing the
manual labour required while increasing coverage. Two approaches have been taken
to include statistical information in the realization process. The first approach, used
in the HALOGEN/NITROGEN systems (Langkilde-Geary, 2000; Langkilde-Geary & Knight,
2002), has two levels: a hand-crafted grammar generates alternative realizations
represented as a forest, from which a stochastic re-ranker selects the optimal
candidate. The system relies on
corpus based statistical knowledge in the form of n-grams. There are also other
sophisticated models to perform re-ranking (e.g., Bangalore & Rambow, 2000;
Ratnaparkhi, 2000; Cahill et al., 2007) and models trained on user ratings of
utterance quality (Walker et al., 2002).
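The overgenerate-and-rank idea can be sketched with a toy bigram scorer. In the sketch below a hand-listed set of candidates stands in for the forest produced by a grammar, and the score is a crude sum of log(1 + bigram count) over a tiny corpus. Both the scoring function and the data are assumptions chosen for brevity; they are not the statistical model used by HALOGEN/NITROGEN.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: re-ranking candidate realizations with bigram counts.
final class NgramRerankSketch {

    private final Map<String, Integer> bigramCounts = new HashMap<>();

    NgramRerankSketch(List<String> corpus) {
        for (String sentence : corpus) {
            String[] words = tokens(sentence);
            for (int i = 0; i + 1 < words.length; i++) {
                bigramCounts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
            }
        }
    }

    // Adds sentence boundary markers so that good starts and ends are rewarded.
    private static String[] tokens(String sentence) {
        return ("<s> " + sentence.trim() + " </s>").split("\\s+");
    }

    /** Higher is better; unseen bigrams contribute log(1) = 0. */
    double score(String candidate) {
        String[] words = tokens(candidate);
        double total = 0.0;
        for (int i = 0; i + 1 < words.length; i++) {
            int count = bigramCounts.getOrDefault(words[i] + " " + words[i + 1], 0);
            total += Math.log(1.0 + count);
        }
        return total;
    }

    /** Returns the highest-scoring candidate, or null for an empty list. */
    String best(List<String> candidates) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String candidate : candidates) {
            double s = score(candidate);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NgramRerankSketch ranker = new NgramRerankSketch(
            List.of("the dog chased the cat", "the cat saw the dog"));
        System.out.println(ranker.best(List.of(
            "dog the chased cat the",
            "chased the dog the cat",
            "the dog chased the cat")));
    }
}
```

Because the candidates here are permutations of the same words, their scores are directly comparable; candidates of different lengths would need a normalized score.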
A second line of research has focused on introducing statistics at the generation
decision level, by training models that find the set of generation parameters
maximising an objective function, e.g. producing a target linguistic style (Paiva and
Evans, 2005; Mairesse and Walker, 2010), generating the most likely context-free
derivations given a corpus (Belz, 2008), or maximising the expected reward using
reinforcement learning (Rieser and Lemon, 2009). While such methods do not suffer
from the computational cost of an overgeneration phase, they still require a
handcrafted generator to define the generation decision space within which statistics
can be used to find an optimal solution.
2.3.2.6.3 Human-crafted Grammar-based systems
The surface realization engine for Telugu constructed in the current thesis
is modelled on human-crafted grammar-based systems, so a brief
introduction to such systems is provided in this section. As
template based systems cannot scale to applications which require considerable
linguistic variation, general purpose domain-independent systems are used as an
alternative. Most of these systems are grammar-based, that is, they make their
choices on the basis of a grammar of the language under consideration. While
generating the surface forms, a human-crafted grammar-based surface realizer must
satisfy the following requirements:
a) The semantics of the Input Specification are to be preserved.
b) The generated surface forms are grammatical with respect to the language in
question.
In order to satisfy the above requirements, the surface realizer may select content
words, insert function words, perform morphological inflections, and take care of the
order of the surface forms, using the grammar of the language under consideration
as the basis. The difficulty with grammar-based systems is how to make choices
among the related options for generation of the surface forms.
The grammar for hand-crafted systems can be manually written, as in many
off-the-shelf realizers such as FUF/SURGE (Elhadad & Robin, 1996), MUMBLE (Meteer et al.,
1987), KPML (Bateman, 1997), Nigel (Mann & Matthiessen, 1983), and RealPro
(Lavoie & Rambow, 1997). Hand-crafted grammar-based realizers require very
detailed input, as in KPML (Bateman, 1997), which is based on Systemic-Functional
Grammar (SFG; Halliday & Matthiessen, 2004). The level of detail required
makes these realizers difficult to use as simple plug-and-play or off-the-shelf modules
(e.g., Kasper, 1989). The difficulty with systems like KPML has motivated the
development of simple realization engines which provide syntax and morphology
APIs, but leave choice-making up to the developer (Gatt et al., 2009; Vaudry &
Lapalme, 2013; Bollmann, 2011; de Oliveira & Sripada, 2014).
2.3.2.6.3.1 SimpleNLG
The current thesis is modelled on the hand-crafted grammar-based surface realizer
SimpleNLG (Gatt and Reiter, 2009). SimpleNLG is a simple Java API designed to
function as a realization engine, which is the last stage in natural language generation.
It has been used successfully in a number of academic and
commercial projects. SimpleNLG automates some of the mundane tasks that all natural
language generation systems need to perform, such as:
a) Orthography, which includes pouring (breaking lines between words to fit a given
width), formatting lists, absorbing punctuation, and inserting appropriate white
space in sentences and paragraphs.
b) Morphology, which includes handling inflected forms, that is, modifying a word
or a lexeme to reflect grammatical information such as number, person, tense,
and gender.
c) Simple grammar, which includes providing appropriate syntactic structure,
creating well-formed word groups, and enforcing noun-verb agreement.
SimpleNLG is used by researchers who have their own implementation of document
planning or microplanning, so that they need not be concerned with the mundane tasks
of realization. It can also be used by anyone who wants to write programs that generate
English sentences.
2.3.2.6.3.2 SimpleNLG XML Realizer Framework
The input specification mechanism for the current thesis is modelled on the
SimpleNLG XML Realizer framework. The input for the current thesis is an XML
file which is similar, but not identical, to the SimpleNLG XML input design. A brief
description of the SimpleNLG XML Realizer framework is provided here.
The SimpleNLG XML Realizer walks through the Input Specification in a top-down,
left-to-right fashion to produce appropriate output from the nodes encountered
during the traversal. The nodes in the Input Specification provide both the linguistic
units and the information about what the realizer needs to do with them.
The XML realizer framework works as follows:
1. The XML realizer framework uses code generation tools (xjc for Java
and xsd for C#) to generate wrapper classes for the relevant elements in the
schema. A client application can invoke SimpleNLG to get the realized text
through the wrapper classes, which act as Data Transfer Objects (DTOs). Wrapper
classes are contained in the package simplenlg.xmlrealiser.wrapper. These
classes have the same names as the real SimpleNLG classes, with the prefix
“Xml”. Wrapper classes need to be generated only once, and again only if changes
are made to the XML schema.
2. The simplenlg.xmlrealiser.UnWrapper class uses the Java DTOs that are
created by un-marshalling the XML specification of a DocumentElement
conforming to the schema. A simplenlg.xmlrealiser.wrapper.XmlDocumentElement
object that represents the XML is created using the javax.xml.bind.Unmarshaller.
The UnWrapper recursively processes the Data Transfer Objects to produce a
simplenlg.framework.DocumentElement, which is then passed to the realiser
and realized in the usual way.
2.3.3 A Short History of Surface Realization
Mann and Matthiessen (1985) carried out extensive research in text generation and
created the Nigel software in the framework of systemic linguistics. Along with the
specification of the functions and structures of English, Nigel also has a semantic
stratum that specifies the situations in which the grammatical features are to be used.
Coch pioneered the development of AlethGen, an automatic multi-paragraph
text-generation toolbox. It was first developed for French during 1993-94; English
and Spanish versions were developed later. The main characteristics of
AlethGen are as follows:
1) The high quality of the multi-paragraphs generated, in terms of flow of text,
and customizability.
2) Its ability to produce an extensive set of different text structures due to its
data-driven planning approach.
Knight and Hatzivassiloglou (1995) proposed NITROGEN in “Two-Level, Many-Paths
Generation”. It is a hybrid generator that uses statistical methods to fill the gaps
in its symbolic knowledge.
Elhadad and Robin (1996) proposed SURGE (Systemic Unification Realization
Grammar of English), a reusable, comprehensive syntactic realization component.
Lavoie and Rambow (1997) developed REALPRO, a new off-the-shelf realizer
which is derived from existing systems but has a new design and a completely new
implementation. REALPRO has the following characteristics:
1) It is fast and portable across platforms, as it is implemented in C++.
2) It can run as a stand-alone server and has both C++ and Java APIs.
Bateman (1997) proposed KPML, a development environment that provides enabling
technology for multilingual natural language generation. KPML offers the following
features for generation:
1) A set of standardized linguistic resources useful for text generation, which are
continually being improved.
2) A tactical generation engine for using such linguistic resources.
3) A number of highly focussed debugging aids to further support efficient
maintenance and development of such linguistic resources.
4) A number of customization tools.
5) Specialized techniques to support multilingual works such as contrastive
language development and automatic merging of independently developed
resources for distinct languages.
Nizar Habash (2000) proposed Oxygen, a language-independent linearization
engine. The grammars given as input to this linearization engine are written in
oxyL, a powerful and flexible language that is as expressive as conventional
programming languages. The linearization engine compiles the grammars of the
target language into programs that accept feature graphs as input and generate word
lattices. The word lattices are passed as input to the statistical extraction module of
the generation system Nitrogen.
Irene Langkilde (2000) proposed a forest-based statistical sentence generator in
which a phrase is chosen by statistically ranking a set of alternative phrases packed
as trees or forests.
Susan W. McRoy, Songsak Channarukul, and Syed S. Ali (2000) proposed YAG
(Yet Another Generator), a template-based generator for real-time systems. This
generator works well in interactive applications, providing natural language output
suited to the interactive context. It does not require extensive knowledge of the
grammar of the target language, nor all possible output strings ahead of time. YAG
provides support for unspecified inputs, robustness, speed, expressiveness, and
coverage to applications and application designers.
Gatt and Reiter (2009) created SimpleNLG, a Java API that serves as a realization
engine for English, with the aim of providing simple and robust interfaces to generate
syntactic structures and linearize them. This realization engine is the main source of
inspiration for the work reported in this thesis. SimpleNLG is therefore
described in greater detail in Chapter 3.
Marcel Bollmann (2011), in “Adapting SimpleNLG for German”, reports a surface
realization engine for German, a Java framework modelled on SimpleNLG. The paper
describes the characteristics of the German language and the changes made to the
SimpleNLG framework to meet the requirements of German, a language with
relatively free word order.
Pierre-Luc Vaudry and Guy Lapalme (2013), in “Adapting SimpleNLG for bilingual
English-French realisation”, report a bilingual surface realizer for English and
French named SimpleNLG-EnFr. The paper
describes the general characteristics of the software and the adaptations made for the
French language.
Rodrigo de Oliveira and Somayajulu Sripada (2014), in “Adapting SimpleNLG for
Brazilian Portuguese realisation”, report the ongoing implementation and the
current coverage of SimpleNLG-BP, an adaptation of SimpleNLG-EnFr for Brazilian
Portuguese.
Alessandro Mazzei, Cristina Battaglino, and Cristina Bosco (2016) report
SimpleNLG-IT, a surface realization engine for Italian modelled on the principles of
SimpleNLG. The paper gives some details about the grammar and the lexicon
employed by the system and reports the results of a first evaluation based on a
dependency treebank for Italian.
2.3.4 Surface Realization in Indian Context
Smriti Singh, Mrugank Dalal, Vishal Vachhani, Pushpak Bhattacharyya, and Om P.
Damani (2007) created HinD, a system that generates Hindi from an interlingua, UNL.
UNL is used as an interlingua for knowledge representation in the context of
machine translation. The generation process consists of three main stages:
1) Morphological generation of lexical words, function word insertion, and
syntax planning.
2) Case marker insertion after the subject and the object.
3) Finally, all the words are arranged to form a valid sentence.
Uma Maheshwar Rao, G. and Christopher Mala (2011) presented the Telugu Word
Synthesizer, which is built on a generic engine that can be used for any language by
plugging in a language-specific database. The generator synthesizes all and only the
well-formed word forms. The generator engine is independent of language, works
effectively, and is based on the word-and-paradigm method.
Vishal Goyal and Gurpreet Singh (2011) developed a Hindi to Punjabi machine
translation system. The key stages of the translation process are pre-processing,
the translation engine, and post-processing. Lookup algorithms and pattern
matching algorithms formed the basis of these stages. The system
has been evaluated using an intelligibility test, an accuracy test, and the BLEU score.
The hybrid system is found to perform better than the constituent systems.
2.4 Morphological Theories
Johann Wolfgang von Goethe (1749-1832), a German poet, novelist, and philosopher,
was the first person to use the term morphology. He coined it in the early nineteenth
century in a biological context. The term derives from Greek,
where morph- means ‘shape or form’; morphology is thus the study of
forms. In linguistics, morphology is the branch which deals with words, their internal
structure, and the way in which they are formed.
2.4.1 Linguistic Theories of Morphology
A number of theoretical models have been developed over the years. Each model makes
a specific set of claims about the nature of morphology and focuses on specific
kinds of data. Theoretical morphology has been studied extensively, and
classifications of morphological theories have been proposed by
Hockett (1954) and Stump (2001). Their classifications are described in the
following sections.
2.4.1.1 Two Models of Grammatical Description
Linguistic theories of morphology differ in whether they treat the
morpheme as the basic building block of morphological analysis or generation. In “Two
models of grammatical description”, Hockett (1954) described two models: the Item
and Arrangement model (IA) and the Item and Process model (IP). He also mentioned
the Word and Paradigm approach (WP), which is the oldest of the three
approaches.
2.4.1.1.1 Item and Arrangement Model
Morpheme-based morphology is an approach in which word forms are analyzed as
arrangements of morphemes. A morpheme is treated as the minimal meaningful unit
of a language. This way of analyzing word forms is called ‘item-and-arrangement’;
it treats words as concatenations of morphemes.
In Telugu, when verbs are classified using the item-and-arrangement approach,
the result is six conjugation classes, of which five are classes of regular
verbs and one is a class of irregular verbs. Classes I, II and III of the regular verbs
have ten subclasses. In a word such as ammu-wA-nu (I will sell), the morphemes are
ammu-, -wA, and -nu, where ammu- is the root, -wA is the tense-mode suffix, and
-nu is the personal suffix.
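The item-and-arrangement view can be illustrated with a minimal Python sketch, assuming for illustration only that a word form is nothing more than the concatenation of a root and an ordered list of suffix morphemes. The function name `arrange` and the labelled-tuple layout are hypothetical and are not taken from the thesis implementation.

```python
from typing import List, Tuple

def arrange(morphemes: List[Tuple[str, str]]) -> str:
    """Concatenate (label, form) morphemes in the order given."""
    return "".join(form for _label, form in morphemes)

# ammu- (root) + -wA (tense-mode suffix) + -nu (personal suffix)
word = arrange([("root", "ammu"), ("tense-mode", "wA"), ("person", "nu")])
print(word)  # ammuwAnu
```

Pure concatenation cannot express stem changes, which is the limitation the item-and-process model addresses.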
2.4.1.1.2 Item and Process Model
Lexeme-based morphology takes an approach called item-and-process. In lexeme-based
morphology a word form is assumed to be the result of applying rules that alter a
stem to produce a new one. Inflectional rules, derivational rules and compounding
rules are applied to stems to obtain the required word form.
In Telugu, inflected word forms are derived by a set of sandhi rules
operating on stems and suffixes. Under the item-and-process approach,
verbs fall into two types, regular and irregular. The regular verb form pilus-wA-nu
(I will call) is the result of a phonological substitution that applies when the
morphemes piluc- and -wA combine, producing pilus-. Some irregular verb stem
variants are derived by lexical substitution rules and not by phonological
substitution. For example, the morpheme vacc-
becomes rA- when followed by a suffix beginning with the vowel ‘a’.
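The two kinds of stem-altering rule mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `select_stem` and `inflect` are hypothetical, the suffix "a" stands in for any suffix beginning with that vowel, and generalising the piluc-/pilus- alternation to every stem ending in c is an assumption of the sketch, not a claim about Telugu grammar.

```python
from typing import List

def select_stem(stem: str, suffix: str) -> str:
    """Return the stem variant to use before the given suffix."""
    # Lexical substitution (irregular): vacc- becomes rA- before a suffix
    # beginning with the vowel 'a', as stated in the text.
    if stem == "vacc" and suffix.startswith("a"):
        return "rA"
    # Phonological substitution (regular): stem-final c becomes s before -wA,
    # e.g. piluc- + -wA gives pilus-. Applying this to all c-final stems is
    # an assumption made for this sketch.
    if stem.endswith("c") and suffix.startswith("wA"):
        return stem[:-1] + "s"
    return stem

def inflect(stem: str, suffixes: List[str]) -> str:
    """Apply the stem rules, then attach the suffixes in order."""
    first = suffixes[0] if suffixes else ""
    return select_stem(stem, first) + "".join(suffixes)

print(inflect("piluc", ["wA", "nu"]))  # piluswAnu
print(select_stem("vacc", "a"))        # rA
```

The irregular rule is checked first so that a lexical exception always overrides the general phonological rule.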
2.4.1.1.3 Word and Paradigm Approach
The approach used by word-based morphology is called the word-and-paradigm
approach. Instead of stating rules that combine morphemes to form words, this theory
states generalizations that hold between the different forms of inflectional
paradigms.
In Telugu, when the word-and-paradigm approach is applied, the verbs are
classified into twelve paradigmatic classes of regular verbs and ten irregular verbs.
Each class is described by giving a verb paradigm for that class. The word ‘pilucu’
(to call) represents one of the twelve regular paradigmatic classes, and all the
words that have the same inflectional paradigm, such as ‘kalcu’ (to burn), come under
this paradigmatic class. The irregular verbs do not have a paradigmatic class; instead
each verb is treated separately. The word ‘icc’ (to give) is one of the ten
irregular verbs in this classification.
2.4.1.2 A Two Dimensional Taxonomy of Morphology
A two-dimensional taxonomy of morphological theories was proposed by Stump
(2001). He distinguished two axes along which theories of inflectional morphology
may be situated relative to one another: the lexical/inferential axis and the
incremental/realizational axis, which are orthogonal to each other.
In a lexical theory of inflectional morphology, inflectional morphemes are lexically
listed, and are therefore subject to the same principles of lexical insertion as ordinary
lexical morphemes. In a lexical theory, the Telugu verb form “pAduwAnu” arises
through the insertion of the lexically listed morphemes “pAdu”, “wA”, and “nu” into
a particular constituent structure. Inferential theories, by contrast, rely on rules to
infer inflectionally complex word forms from more basic stems or from other word
forms; inflectional morphemes are not listed in the lexicon, but are the mark of a
particular step in the inference of a complex word form. Such inferences may be
stem-based or word-based: for example, “pAduwAnu” might be deduced from more
basic stems through a chain of inferences, pAd- → pAdu → pAduwA →
pAduwAnu; alternatively, “pAduwAnu” might be inferred from the contrasting word
form “pAdu”.
In an incremental theory, each inflectional morpheme is associated with a particular
morpho-syntactic content—in the lexicon, if the theory is lexical, and in a rule of
inference, if the theory is inferential—and each complex combination of morphemes
acquires its morpho-syntactic properties cumulatively, through the combination of
the morpho-syntactic properties of the individual inflectional morphemes of which it
is composed. In a theory of this sort, “pAduwAvu” acquires the properties ‘second
person’ and ‘singular’ through the lexical insertion of the agreement suffix “vu” or
by means of a rule inferring “pAduwAvu”, from the perfective stem “pAdu”. Thus,
in an incremental approach to inflection, a word form’s morpho-syntactic content is
supplied in steps. In a realizational theory, a word form’s association with a
particular set of morpho-syntactic properties logically precedes the expression of
those properties by particular inflectional markings: it is precisely this association
that determines the lexical insertion of its affixes (if the theory is lexical) or
determines the rules by which it is inferred from a stem or related word form (if the
theory is inferential). In such a theory, the association ⟨pAdu, {2 sg present
perfective indicative active}⟩ licenses either the lexical insertion of the morphemes
pAd, u, wA and vu, or the stem-based chain of inferences pAd → pAdu → pAduwA
→ pAduwAvu, or the word-based inference of pAduwAvu from pAdu. Thus, in a
realizational approach to inflection, a language’s grammar specifies the sets of
properties with which a lexeme L may be associated, and for each such property set
σ, the morphology of the language defines the word form realizing the pairing (L, σ).
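The realizational idea, in which the pairing (L, σ) is fixed first and the markings are then chosen to express it, can be sketched in Python as below. This is a minimal illustration only: the `AGREEMENT` table and the `realize` function are hypothetical, only the two agreement suffixes quoted in the text (-nu and -vu) are covered, and the tense-mode suffix -wA is assumed to be constant.

```python
from typing import FrozenSet

# Hypothetical agreement exponents keyed by (person, number); only the two
# forms quoted in the text are covered.
AGREEMENT = {("1", "sg"): "nu", ("2", "sg"): "vu"}

def realize(lexeme: str, props: FrozenSet[str]) -> str:
    """Realize the pairing (lexeme, props) as a word form.

    The property set is given in full before any affix is chosen; the
    affixes merely express it.
    """
    person = next(p for p in ("1", "2", "3") if p in props)
    number = next(n for n in ("sg", "pl") if n in props)
    # Assumption of this sketch: the tense-mode suffix is always -wA.
    return lexeme + "wA" + AGREEMENT[(person, number)]

print(realize("pAdu", frozenset({"2", "sg"})))  # pAduwAvu
print(realize("pAdu", frozenset({"1", "sg"})))  # pAduwAnu
```

An incremental treatment would instead build up the property set as each suffix is attached, rather than receiving it whole.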
2.4.2 Computational Morphology
Computational approaches to morphology are concerned with formal devices such as
grammars and stochastic models and algorithms. Finite state automata and
transducers can be used as formal devices to implement morphological grammars or
statistical part-of-speech taggers (Roark et al., 2007). Computational
morphological generators subscribe more to the Item and Arrangement model
discussed in the previous section (Section 2.4.1.1.1). In the current thesis the
morphological generation of Telugu words uses finite automata to check for patterns
modelled on the rules of the grammar.
2.4.2.1 Finite State Morphology
Ordered rewrite rules on their own are not particularly conducive to computational
applications, as they lack a formal framework in which they can be readily formulated
and implemented. In a nutshell, Finite State Morphology (FSM) provides the tools for
turning rules into practical morphological analysers and generators. The theoretical
groundwork was laid by Johnson (1972) and, independently in the early 1990s, by
Kaplan & Kay (1994). They show that rewrite rules are equivalent in power to Finite State
transducers, which are a variant of Finite State automata that linguists are more
familiar with. Instead of accepting or rejecting a single string, as in the case of Finite
State automata, a Finite State transducer accepts or rejects two strings whose letters
are pair-matched, while still retaining the Markovian property of Finite State
transitions. As a result, Finite State transducers are simple, well understood and easy
to implement computationally. Moreover, it is also found that an ordered cascade of
rewrite rules can in principle be automatically compiled into a single Finite State
transducer, thus capturing the mapping from the underlying form to the surface form
in terms of paired strings.
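To make the mapping from an underlying form to a surface form concrete, the following Python sketch hand-codes a two-state transducer for a single rewrite rule, c → s before w, taken from the piluc-/pilus- example in Section 2.4.1.1.2. It is a minimal sketch, not the thesis implementation: the function name `transduce` is hypothetical, and applying the rule to every c before w is an assumption made for illustration.

```python
def transduce(lexical: str) -> str:
    """Map a lexical string to a surface string with a two-state transducer.

    State 0: nothing pending. State 1: a 'c' has been read but not yet
    written, because its surface form depends on the next symbol.
    """
    state, out = 0, []
    for ch in lexical:
        if state == 0:
            if ch == "c":
                state = 1            # withhold output until the context is known
            else:
                out.append(ch)       # identity pair x:x
        else:
            if ch == "w":
                out.append("sw")     # pair c:s in the context _ w
                state = 0
            elif ch == "c":
                out.append("c")      # emit the earlier c, keep this one pending
            else:
                out.append("c" + ch) # no rule applies, emit c unchanged
                state = 0
    if state == 1:
        out.append("c")              # flush a pending c at end of input
    return "".join(out)

print(transduce("pilucwAnu"))  # piluswAnu
```

The machine keeps only a fixed amount of state regardless of input length, which is what makes such rules cheap to implement and to compose.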
Computational morphology saw the development of Two-Level Morphology
(Koskenniemi 1983), where contextual constraints are expressed in parallel directly
between lexical and surface levels, rather than as rules applied in serial order. Ever
since gaining prominence in the 1980s, Two-Level Morphology has become a staple
in computational linguistics. But it is not the easiest tool to use. The two-level
commitment forces one to directly manipulate input-output letter strings, and
represent serial rules as parallel constraints. This can be a highly unintuitive and
labour-intensive process, even for experienced programmers.
In the current thesis, finite state automata are used extensively in the morphology
engine. They are used in pattern matching to categorize a given
word into one of the classes specified by the grammar rules in
(Krishnamurti 1985). All the grammar rules implemented by the Telugu morphology
engine are specified in the form of finite automata.
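The kind of pattern-based categorization described here can be sketched with regular expressions, which describe the same class of languages as finite automata. The class names and patterns below are hypothetical placeholders chosen for illustration; they are not the verb classes of the reference grammar, and `classify` is not the thesis's own function.

```python
import re

# Hypothetical class patterns, NOT the classes of the reference grammar.
# More specific patterns are listed first so that they take priority.
CLASS_PATTERNS = [
    ("ends-in-cu", re.compile(r".+cu")),
    ("ends-in-u", re.compile(r".+u")),
]

def classify(stem: str) -> str:
    """Return the name of the first class whose pattern accepts the stem."""
    for name, pattern in CLASS_PATTERNS:
        if pattern.fullmatch(stem):
            return name
    return "unclassified"

print(classify("pilucu"))  # ends-in-cu
print(classify("ammu"))    # ends-in-u
print(classify("icc"))     # unclassified
```

Ordering the patterns from most to least specific is one simple way to resolve stems that several patterns would accept.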
2.5 A Short History of Morphological Analyzers and Generators
Guido Minnen, John Carroll and Darren Pearce (2000) developed a robust applied
morphological generator for English. The generator produces a word form
from a given specification of a lemma, a part-of-speech label and an inflectional type.
It was built using data from several large corpora and machine-readable
dictionaries. The generator is packaged as a UNIX filter, making it easy to integrate
into applications.
Akshara Bharati et al. (2001) developed an algorithm for unsupervised learning of
morphological analysis and generation for inflectionally rich languages. This
algorithm uses the frequency of occurrence of word forms in a raw corpus. They
introduce the concept of an “observable paradigm” by forming equivalence classes of
feature structures which are not obvious. The frequency of word forms for each
equivalence class is collected from such data for known paradigms. When
the morphological analyser cannot recognize an inflected form, the
possible stem and paradigm are guessed using the corpus frequencies. The method
assumes that the morphological package makes use of paradigms. The package was
able to guess the stem-paradigm pair for an unknown word. The method depends only
on the frequencies of the word forms in raw corpora and does not require any
linguistic rules or tagger. The performance of the system depends on the size of
the corpora.
Madhavi Ganapathiraju and Lori Levin (2006) presented a rule-based morphological
generator for Telugu nouns and verbs. The implementation was a Perl program
modelled on the grammar rules in the grammar books of C P Brown and Krishnamurti.
Sri Badri Narayanan R, Saravanan S, and Soman KP (2009) presented a data-driven
suffix list and concatenation algorithm for a Telugu morphological generator. It
does not require any orthographic or morphotactic rules; instead it uses automated
extraction of the suffix list and an efficient algorithm for concatenating the lemma
and the morphological features. The preliminary results obtained from this system are
significant.
Dr. Ramakanth Kumar P, Shambhavi. B. R, Srividya K, Jyothi B J, Spoorti
Kundargi, and Varsha Shastri G (2011) developed a paradigm-based morphological
generator and analyser using a trie-based data structure. The generator and analyser
can handle up to 3700 root words and around 88K inflected words.
Parameswari.K (2011) developed a Tamil morphological analyzer and generator
using the APERTIUM toolkit. This attempt involves a practical adoption of lttoolbox
for modern standard written Tamil in order to develop an improved open-source
morphological analyzer and generator. The tool uses
Finite State Transducers (FST) for one-pass analysis and generation, and the
database is developed in the morphological model called word and paradigm.
Malin Ahlberg, Markus Forsberg, and Mans Hulden
(2014) present, in “Semi-supervised learning of morphological paradigms and
lexicons”, a semi-supervised approach to the problem of paradigm induction from
inflectional tables. The system extracts generalizations from inflectional tables and
represents the resulting paradigms in an abstract form.
Malin Ahlberg, Markus Forsberg, and Mans Hulden
(2015) present “Paradigm classification in supervised learning of morphology”, which
combines a non-probabilistic strategy of inflection table generalization with a
discriminative classifier to permit the reconstruction of complete inflection tables of
unseen words.
Girija V R and T Anuradha (2017) report the design of a morphological analyser for
Malayalam modelled on finite state techniques that can be used for text analysis
where the model recognizes and strips the morphemes in a string of text.
2.6 Summary
This chapter presents a literature review of the subtasks in Natural Language
Generation, and of surface realization in particular, which is the topic of this thesis.
Research in surface realization for Indian languages is still very young in comparison
to that for European languages. Morphological generation is very complex in Indian
languages; therefore a review of morphological generation, both in general and in the
Indian context, is also presented in this chapter.
-----------------------CHAPTER-3
TELUGU REALIZATION ENGINE OVERVIEW
3.1 Introduction
Telugu is a Dravidian language with nearly 100 million first language speakers. It is
a morphologically rich language (MRL) with a simple syntax where the sentence
constituents can be ordered freely without impacting the primary meaning of the
sentence. In this thesis we describe a surface realization engine for Telugu. Surface
realization is the final subtask of an NLG pipeline (Reiter and Dale, 2000) that is
responsible for mechanically applying all the linguistic choices made by upstream
subtasks (such as microplanning) to generate a grammatically valid surface form.
3.1.1 Architecture of Natural Language Generation Systems
As experience in building NLG systems accumulated, settling on an appropriate
architecture for Natural Language Generation systems became very important. In the
early days of work in NLG a distinction between ‘strategy’ and ‘tactics’ was made,
where strategy is concerned with determining ‘what to say’ and tactics are
concerned with deciding ‘how to say it’. The result of this distinction is a particular
modularization in which NLG systems had two specific tasks, referred to as text
planning and linguistic realization.
Later, as the understanding of NLG systems increased, it became common to
construct NLG systems with an additional module in between the two modules
discussed earlier. The intermediate module is referred to as the microplanner (shown
in Fig 3.1). The use of embedded graphics, formatting mark-ups, and hypertext links
motivated the use of the term document planner in place of text planner. Finally, the
linguistic realizer is more generally termed the surface realizer to acknowledge the
fact that the surface forms are not always linguistic in nature.
Figure 3.1 NLG System Architecture
3.1.1.1 Document Planning
A Document Planner decides what information to communicate and determines how
this information is to be structured for presentation. There are two tasks performed in
Document Planning namely content determination which takes care of what
information is to be communicated and document structuring which takes care of
structuring the information for presentation. The input for the document planner is a
four-tuple (k,c,u,d) where k is the knowledge source, c is the communicative goal, u
is the user model and d is the discourse history. The Document Planner takes the
four-tuple input and performs the following activities:
a) Construction of messages from the information source;
b) Decision making as to which message is required to satisfy the
communicative goal;
c) Document structuring to present the messages satisfying the communicative
goal in a proper manner.
3.1.1.2 Microplanning
A Document Plan specifies the final structure and content of the text to be generated
in very broad terms. The purpose of the Microplanner is to refine the document plan to
produce a more fully specified text specification, selected from the many possible
output texts that are compatible with the document plan. The Microplanner performs
the following subtasks:
a) Expressive Lexicalization: It decides the lexical items to be used to realize
the conceptual elements specified by the Document Plan.
b) Linguistic Aggregation: It determines how messages should be composed
to generate specifications for linguistic units.
c) Referring Expression Generation: It determines how the entities in the
messages are to be referred to.
3.1.1.3 Surface Realization
The surface realization component of a Natural Language Generation System
produces a well-formed sentence as constrained by the contents of a lexicon and
grammar. It takes as input an abstract specification of the text. There are three kinds
of processing involved in surface realization:
Syntactic Realization: Syntactic realization uses grammatical knowledge to choose
inflections based on grammatical features, add function words if required, and decide
the order of the components. For example, in Telugu the object of the sentence
usually precedes the verb and the syntactic realizer has to take care of the order.
Morphological Realization: Morphological realization computes the inflected
forms of the words depending on the grammatical features. For example the plural
form of “ceVyyi” (hand) is “cewulu” (hands).
Orthographic Realization: Orthographic realization deals with punctuation,
formatting and casing.
The abstract specification of the text that comes as input to the surface realizer is the
text specification, which is constructed from the document plan by the microplanner.
The text specification describes what text is to be generated and how the text is to be
formatted. Thus there are two distinct aspects to the processing of text specifications:
one is concerned with mapping logical constructs (specific formatting constructs in
the text) to appropriate document formatting, and the other is concerned with the
application of grammatical knowledge to the phrase specifications (grammatical
objects such as phrases) so that the final surface forms are realized.
The process of generating the surface forms from the phrase specifications is the area
with which this thesis is concerned. The generation process depends on the degree to
which the phrase specification abstracts away from the actual surface forms. The
phrase specifications are referred to as input specifications in the rest of the thesis.
Our Telugu realization engine is designed following the SimpleNLG (Gatt and
Reiter, 2009) approach which recently has been used to build surface realizers for
German (Marcel Bollmann, 2011), Filipino (Ethel Ong et al., 2011), French (Vaudry
and Lapalme, 2013) and Brazilian Portuguese (Rodrigo de Oliveira and Sripada,
2014). Figure 3.2 shows an example input specification in XML corresponding to
the following Telugu sentence.
vAlYlYu aMxamEna wotalo neVmmaxigA naduswunnAru. (They are walking
slowly in a beautiful garden.)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<document>
  <sentence type=" " predicatetype="verbal" respect="no">
    <nounphrase role="subject">
      <head pos="pronoun" gender="human" number="plural" person="third"
            casemarker=" " stem="basic">vAdu</head>
    </nounphrase>
    <nounphrase role="complement">
      <modifier pos="adjective" type="descriptive" suffix="aEna">aMxamu</modifier>
      <head pos="noun" gender="nonmasculine" number="singular" person="third"
            casemarker="lo" stem="basic">wota</head>
    </nounphrase>
    <verbphrase type=" ">
      <modifier pos="adverb" suffix="gA">neVmmaxi</modifier>
      <head pos="verb" tensemode="presentparticiple">naducu</head>
    </verbphrase>
  </sentence>
</document>
Figure 3.2. Example XML Input Specification
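As an illustration of how an input specification of this shape can be read, the following Python sketch parses a shortened version of the specification with the standard library and visits its nodes in top-down, left-to-right order. It is only a sketch under stated assumptions: the fragment is abbreviated from Figure 3.2 and written with plain ASCII quotes, and the `walk` helper is hypothetical and is not the traversal code of the realization engine.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment of the Figure 3.2 specification.
SPEC = """
<sentence predicatetype="verbal">
  <nounphrase role="subject">
    <head pos="pronoun" number="plural" person="third">vAdu</head>
  </nounphrase>
  <verbphrase>
    <modifier pos="adverb" suffix="gA">neVmmaxi</modifier>
    <head pos="verb" tensemode="presentparticiple">naducu</head>
  </verbphrase>
</sentence>
"""

def walk(element, depth=0):
    """Yield (depth, tag, attributes, text) in top-down, left-to-right order."""
    yield depth, element.tag, dict(element.attrib), (element.text or "").strip()
    for child in element:
        yield from walk(child, depth + 1)

root = ET.fromstring(SPEC)
nodes = list(walk(root))
for depth, tag, attrs, text in nodes:
    print("  " * depth + tag, attrs, text)
```

Each visited node carries both the root word (the element text) and the grammatical features (the attributes) that a realizer would need to inflect it.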
Several realizers are available for English and other European languages (Gatt and
Reiter, 2009; Vaudry and Lapalme, 2013; Marcel Bollmann, 2011; Elhadad and
Robin, 1996). Some general purpose realizers (as opposed to realizers built as part of
an MT system) have started appearing for Indian languages as well. Smriti Singh et
al. (2007) report a Hindi realizer that includes functionality for choosing post-position
markers based on semantic information in the input. This is in contrast to
the realization engine reported in the current chapter, which assumes that choices of
constituents, root words and grammatical features are all made before the
realization engine is called. Our review of the literature found no realization
engines for Telugu. However, a rich body of work exists for Telugu language
processing in the context of machine translation (MT). In this context, earlier work
reported Telugu morphological processors that perform both analysis and generation
(Badri et al., 2009; Rao and Mala, 2011; Ganapathiraju and Levin, 2006), but none of
the authors reported a surface realization engine for Telugu. This
thesis therefore presents a surface realization engine for Telugu.
3.2 The Simple NLG Framework
A realization engine is an automaton that generates well-formed sentences according
to a grammar. Therefore, while building a realizer the grammatical knowledge
(syntactic and morphological) of the target language is an important resource.
Realizers are classified based on the source of grammatical knowledge. There are
realizers such as FUF/SURGE that employ grammatical knowledge grounded in a
linguistic theory (Elhadad and Robin, 1996). There have also been realizers that use
statistical language models such as Nitrogen (Knight and Hatzivassiloglou, 1995)
and Oxygen (Habash, 2000). While linguistic theory based grammars are attractive,
authoring these grammars can be a significant endeavor (Mann and Matthiessen,
1985). Besides, non-linguists (most application developers) may find working with
such theory heavy realizers difficult because of the initial steep learning curve.
Similarly building wide coverage statistical models of language too is labour
intensive requiring collection and analysis of large quantities of corpora. It is this
initial cost of building grammatical resources (formal or statistical) that becomes a
significant barrier in building realization engines for new languages. Therefore, it is
necessary to adopt grammar engineering strategies that have low initial costs. The
surface realizers belonging to the SimpleNLG family incorporate grammatical
knowledge corresponding to only the most frequently used phrases and clauses and
therefore involve low cost grammar engineering. The main features of a realization
engine following the SimpleNLG framework are:
1. A wide coverage morphology module independent of the syntax module.
2. A light syntax module that offers functionality to build frequently used phrases
and clauses without any commitment to a linguistic theory. The large uptake of
the SimpleNLG realizer in both academia and industry suggests that
the lightweight approach to syntax is not a limitation.
3. Support for ‘canned’ text elements that are dropped directly into the generation
process, achieving wider syntax coverage without actually extending the syntactic
knowledge in the realizer.
4. A rich set of lexical and grammatical features that guide the morphological and
syntactic operations locally in the morphology and syntax modules
respectively. In addition, features enforce agreement amongst sentence
constituents more globally at the sentence level.
3.3 Telugu Realization Engine
The current work follows the SimpleNLG framework. However, because of the
known differences between Telugu and English, the SimpleNLG codebase could not be
reused for building the Telugu realizer. Instead our Telugu realizer was built from
scratch, adapting several features of the SimpleNLG framework to the context of
Telugu. There are significant variations between spoken and written usage of Telugu.
There are also significant dialectal variations; the most prominent ones correspond to
the four regions of the state of Andhra Pradesh, India – Northern, Southern, Eastern
and Central (Brown, 1991). In addition, Telugu has absorbed (and Telugised) vocabulary
from other Indian languages such as Urdu and Hindi. As a result, a design choice for
a Telugu realization engine is to decide the specific variety of Telugu whose grammar
and vocabulary need to be represented in the system. In our work, we use the
grammar of modern Telugu described by Krishnamurti and Gwynn (1985). We
have decided to include only a small lexicon in our realization engine. This is
because host NLG systems that use our engine could use their own application
specific lexicons. Moreover, modern Telugu has been absorbing large amounts of
English vocabulary, particularly in the fields of science and technology, whose
morphology is unknown. Thus specialized lexicons could be required to model the
morphological behaviour of such vocabulary. In the rest of this section we present
the design of our Telugu realizer.
As stated in section 3.2, a critical step in building a realization engine for a new
language is to review its grammatical knowledge to understand the linguistic means
offered by the language to express meaning. We reviewed Telugu grammar as
presented in our chosen grammar reference (Krishnamurti and Gwynn 1985). From a
realizer design perspective the following observations proved useful:
1. Primary meaning in Telugu sentences is mainly expressed using inflected
forms of content words and case markers or postpositions rather than by the position
of words/phrases in the sentence. This means morpho-phonology plays a bigger role
in sentence creation than syntax.
2. Because sentence constituents in Telugu can be ordered freely without
impacting the primary meaning of a sentence, sophisticated grammar knowledge
is not required to order sentence level constituents. It is possible, for instance, to
order constituents of a declarative sentence using a standard predefined
sequence (e.g. Subject + Object + Verb).
3. Telugu, like many other Indian languages, is not governed by a phrase structure
grammar; instead it fits better into the Paninian Grammar Formalism (Bharati et al.,
1995), which uses dependency grammar. This means that dependency trees represent
the structure of phrases and sentences. At the sentence level the verb phrase is the
head and all the other constituents have a dependency link to the head. At the
phrase level too, head-modifier dependency structures are a better fit.
4. Agreement amongst sentence constituents can get quite complicated in Telugu.
Several grammatical and semantic features are used to define agreement rules.
Well-formed Telugu sentences are the result of applying agreement rules at the
sentence level on sentence constituents constructed at the lower level processes.
Based on the above observations we found that the SimpleNLG framework with its
features mentioned in section 3.2 is a good fit for guiding the design of our Telugu
realization engine. Thus our realization engine is designed with a wide coverage
morphology module and a light-weight syntax module where features play a major
role in performing sentence construction operations.
Having decided on the SimpleNLG framework for representing and operationalizing the
grammatical knowledge, we made the following design decisions while building
our Telugu realizer (we believe that these decisions might drive the design of realizers
for other Indian languages as well):
1. Use WX notation (shown in Appendix) for representing Indian language
orthography (described in detail in section 3.3.1).
2. Define the tag names and the feature names used in the input XML file
(shown in Figure 3.2) by adapting them from SimpleNLG and (Krishnamurti and
Gwynn, 1985). It is hoped
that using English terminology for specifying input to our Telugu
realizer simplifies the creation of input by application developers, who usually
know English well and possess at least a basic knowledge of English
grammar.
3. In order to offer flexibility to application developers, our realization
engine orders sentence level constituents (except the verb, which is always
placed at the end) in the same order in which they are specified in the
input XML file. This allows application developers to control ordering
based on discourse level requirements such as focus.
4. The grammar terminology used in our engine does not directly
correspond to the Karaka relations (Bharati et al., 1995) from the
Paninian framework because we use the grammar terminology specified
by Krishnamurti and Gwynn (1985), which is a lot closer to the
terminology used in SimpleNLG. We are currently investigating
opportunities to align our design more closely with the Paninian framework.
We expect such an approach to help us while extending our framework to
generate other Indian languages as well.
3.3.1 WX-Notation
WX notation (shown in Appendix) is a very popular transliteration scheme for
representing Indian languages in the ASCII character set. This scheme is widely used
in Natural Language Processing in India. In WX notation
lowercase letters are used for unaspirated consonants and short vowels while
uppercase letters are used for aspirated consonants and long vowels. The
retroflex voiceless and voiced consonants are mapped to ‘t, T, d and D’. The
dentals are mapped to ‘w, W, x and X’, hence the name of the scheme, “WX”,
which refers to this idiosyncratic mapping.
3.3.2 The Input Specification Scheme
There is a wide range of approaches to specifying input to linguistic realization. The
differences between the approaches range from fundamentally different views of what
the process of realization involves to simple notational differences. A few
input representations, starting with the most abstract, are discussed in the following
sections. The following example sentence is used as the input in each of
them.
rAmudu kowini karrawo kottAdu. (Ramu beat the monkey with a stick)
3.3.2.1 Skeletal Proposition
The Skeletal Proposition in Figure 3.3 is a very abstract representation of the
example sentence in section 3.3.2. This representation does not say anything about
the content of the individual noun phrases in the sentence. This representation only
indicates that an event of “kottu” (beat) happened and identifies three participants in
this event as c1, p1, and m respectively.
Figure 3.3 Skeletal Proposition
3.3.2.2 Meaning Specification
The representation of Figure 3.3 does not specify that the object of the action “kottu”
(beat) was “kowi” (monkey). This and other information omitted by the Skeletal
Proposition representation is identified in the knowledge base by the Microplanner.
Figure 3.4 Meaning Specification
The Microplanner not only selects the required elements in the knowledge base for
inclusion in the text to be generated, but also makes certain decisions about the
structure into which the information will be placed. The result of this is a
representation called the Meaning Specification.
3.3.2.3 Lexicalized Case Frames
The structure presented in the previous section is still abstract because many
realizers expect the selection of the base lexemes used to express the semantic
content to be carried out by the previous stage. Once these decisions are made the
usefulness of the index and feature content is exhausted. These features are
omitted in the Lexicalized Case Frame representation shown in Figure 3.5.
Figure 3.5 Lexicalized Case Frame
The base lexemes used in this representation still need to go through
morphological processing to become words.
3.3.2.4 Abstract Syntactic Structures
In certain cases, it may be appropriate for the processes carried out before the
linguistic realizer is invoked to make certain decisions about the grammatical
structure. For example, some additional information is added to Figure 3.5 about
which argument of the representation should be placed in focus. If the representation
specifies that the second argument is to be in focus then the realizer produces the
following sentence:
kowini rAmudu karrawo kottAdu.
The role of a realizer in such a case becomes very simple: it only has to encode the
grammatical knowledge of the language in question and apply it to an input
specification called the Abstract Syntactic Structure. Figure 3.6 shows such a
representation.
Figure 3.6 Abstract Syntactic Structures
3.3.2.5 Canned Text and Templates
Sometimes certain constituents of a sentence which are sufficiently invariant can be
predetermined and stored directly as text strings. For example, the closing salutation
of a letter such as “aMxariki suBAkAMkRalu” can be stored as a text string. Figure 3.7
shows an input with Canned Text.
Figure 3.7 Template for Canned Text
3.3.2.6 SimpleNLG XML Specification
A text specification, together with its children (for example, SPhraseSpecs), can be
expressed in XML, based on a predefined XML schema that mirrors the relevant
parts of the internal structure of a SimpleNLG specification. Figure 3.8 is an XML
input specification for the following example sentence.
The patient as a result of the procedure had an adverse contrast media reaction, had
a decreased platelet count and went into cardiogenic shock.
<?xml version="1.0" encoding="utf-8"?>

<Document xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" cat="PARAGRAPH"
xsi:schemaLocation="http://code.google.com/p/simplenlg/schemas/version1"
xmlns="http://code.google.com/p/simplenlg/schemas/version1">
<child xsi:type="SPhraseSpec">
<preMod xsi:type="PPPhraseSpec">
<head cat="PREPOSITION">as a result of</head>
<compl xsi:type="NPPhraseSpec">
<head cat="NOUN">procedure</head>
<spec cat="DETERMINER">the</spec>
</compl>
</preMod>
<subj xsi:type="NPPhraseSpec">
<head cat="NOUN">patient</head>
<spec cat="DETERMINER">the</spec>
</subj>
<vp xsi:type="CoordinatedPhraseElement" conj="and">
<coord xsi:type="VPPhraseSpec" TENSE="PAST">
<head cat="VERB">have</head>
<compl xsi:type="NPPhraseSpec">
<head cat="NOUN">adverse contrast media reaction</head>
<spec cat="DETERMINER">a</spec>
</compl>
</coord>
<coord xsi:type="VPPhraseSpec" TENSE="PAST">
<head cat="VERB">have</head>
<compl xsi:type="NPPhraseSpec">
<head cat="NOUN">decreased platelet count</head>
<spec cat="DETERMINER">a</spec>
</compl>
</coord>
<coord xsi:type="VPPhraseSpec" TENSE="PAST">
<head cat="VERB">go</head>
<postMod xsi:type="PPPhraseSpec">
<head cat="PREPOSITION">into</head>
<compl xsi:type="NPPhraseSpec">
<head cat="NOUN">cardiogenic shock</head>
</compl>
</postMod>
</coord>
</vp>
</child>
</Document>
Figure 3.8 SimpleNLG XML Specifications
The input to the Telugu surface realization engine is a tree structure specified in
XML, modelled on the SimpleNLG XML specification; an example is shown in
Figure 3.2. The root node is the sentence and the nodes at the next level are the
constituent phrases, each of which has a role feature representing the grammatical
function (such as subject, verb or complement) performed by the phrase. Each of the lower
level nodes could in turn have its own head and modifier children. Each node
can also take attributes which represent grammatical or lexical features such as number
and tense.
For example the subject node in Figure 3.2 can be understood as follows:
<nounphrase role="subject">
<head pos="pronoun" gender="human" number="plural" person="third" casemarker=""
stem="basic">vAdu</head>
</nounphrase>
This node represents the noun phrase that plays the role of subject in the sentence.
The subject node has only one child, the head, and the phrase is in the nominative
case. The lexical features of the head “vAdu” are part-of-speech (pos) which is pronoun,
person which is third person, number which is plural, gender which is human,
stem which is basic and case marker which is null.
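To make the structure of such a node concrete, the following is a minimal sketch of how the subject node could be read. It is written in Python purely for illustration (the engine itself is a Java application), and the function name `read_noun_phrase` and the layout of the returned dictionary are assumptions of this sketch, not part of our Input Reader.

```python
import xml.etree.ElementTree as ET

# Subject node of the running example, written as well-formed XML.
SUBJECT_XML = """
<nounphrase role="subject">
  <head pos="pronoun" gender="human" number="plural" person="third"
        casemarker="" stem="basic">vAdu</head>
</nounphrase>
"""

def read_noun_phrase(xml_text):
    """Return the role, head lemma and head features of a nounphrase node.

    Hypothetical helper: the dictionary layout is an assumption of this sketch.
    """
    node = ET.fromstring(xml_text)
    head = node.find("head")
    return {
        "role": node.get("role"),          # grammatical function of the phrase
        "lemma": head.text.strip(),        # root word in WX notation
        "features": dict(head.attrib),     # lexical features of the head
    }

subject = read_noun_phrase(SUBJECT_XML)
```

Here `subject["features"]` holds the six attributes of the head, with the empty string standing for the null case marker.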
3.4 System Architecture
The sentence construction for Telugu involves the following three steps:
1. Construct word forms by applying morpho-phonological rules selected
based on features associated with a word (word level morphology)
2. Combine word forms to construct phrases using ‘sandhi’ (a morpho-
phonological fusion operation) if required (phrase building)
3. Apply sentence level agreement by applying agreement rules selected
based on relevant features. Order the sentence constituents following a
standard predefined sequence. (Sentence building)
Our system architecture, shown in Figure 3.9, comprises a morphology engine, a
phrase builder and a sentence builder corresponding to these three steps. The rest of
the section presents how the example sentence of section 3.1.1.3 is generated from
the input specification in Figure 3.2.
Figure 3.9. System Architecture
3.5 Input Reader
The Input Reader is the module which acts as an interface between the sentence
builder and the input. Currently the input reader accepts only our XML input
specification but in the future we would like to extend it to accept other input
specifications such as SSF (Bharati et al., 2007). This module ensures that the rest of
the engine receives input in the required form.
3.6 Sentence Builder
The Sentence Builder is the main module of the current system which has a
centralized control over all the other modules. It performs four subtasks:
1. Sentence Builder first checks for predefined grammatical functions such as
subject, object, complement, and verb which are defined as features of the
respective phrases in the input. It then calls the appropriate element builder for
each of these to create element objects which store all the information extracted
from the XML node.
2. These element objects are then passed to appropriate phrase builder to receive
back a string which is the phrase that is being constructed according to the
requirements of the input.
3. After receiving all the phrases from the appropriate phrase builders the Sentence
Builder applies the agreement rules. Since Telugu is a nominative-accusative
language, the verb agrees with the argument in the nominative case. Therefore the
predicate inflects based on the gender, person and number of the noun in the
nominative case. There are three features at the sentence level namely type,
predicate-type, and respect. The feature type refers to the type of the sentence.
The current work handles only simple sentences therefore it is not set to any
value. The feature predicate-type can have any one of the three values namely
verbal, nominative, and abstract. The feature respect can have values yes or no.
The agreement also depends on the features predicate-type, and respect.
4. Finally, the sentence builder orders the phrases in the same order they are
specified in the input.
In the case of the example in Figure 3.2 the sentence builder finds three grammatical
functions: one finite verb, one locative complement, and one nominative subject. In
the example input in section 3.1.1.3 the value of the feature predicate-type is
“verbal” and that of respect is “no”. The Sentence Builder retrieves the appropriate rule
from an externally stored agreement rule base. In the example input in section
3.1.1.3, where predicate-type is set to verbal, the number of the subject is plural and
the gender is human, the Sentence Builder retrieves the appropriate suffix “nnAru”.
This suffix is then agglutinated to the verb “naduswu”, which is returned by the
morphology engine, to generate the final verb form “naduswunnAru” with the
required agreement with the subject.
“naduswu” + “nnAru” → “naduswunnAru”
After the construction of the sentence the Sentence Builder passes it to the Output
Generator which prints the output.
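The agreement and ordering subtasks described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the engine's code: the rule base contains only the single rule quoted in the text, and the key layout (predicate-type, respect, number, gender) and the function names `apply_agreement` and `order_phrases` are hypothetical.

```python
# Partial agreement rule base. Only the rule quoted in the text is included;
# the key layout (predicate-type, respect, number, gender) is an assumption.
AGREEMENT_RULES = {
    ("verbal", "no", "plural", "human"): "nnAru",
}

def apply_agreement(verb_form, predicate_type, respect, number, gender):
    """Attach the agreement suffix selected by the sentence-level features."""
    suffix = AGREEMENT_RULES[(predicate_type, respect, number, gender)]
    return verb_form + suffix

def order_phrases(phrases):
    """Keep the input order of the constituents but place the verb last.

    `phrases` is a list of (role, text) pairs in the order given in the input.
    """
    non_verbs = [text for role, text in phrases if role != "verb"]
    verbs = [text for role, text in phrases if role == "verb"]
    return " ".join(non_verbs + verbs) + "."

verb = apply_agreement("naduswu", "verbal", "no", "plural", "human")
sentence = order_phrases([
    ("subject", "vAlYlYu"),
    ("complement", "aMxamEna wotalo"),
    ("verb", "neVmmaxigA " + verb),
])
```

Keeping the rules in a table keyed by feature values mirrors the idea of an externally stored rule base: adding a rule does not require changing the lookup code.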
3.7 Element Builder
The element builder of each grammatical function checks for lower level functions
such as head and modifier and calls the appropriate element builder for each of them.
These builders convert the lexicalized input into element objects, with the
grammatical constituents as their instance variables, and return the element objects
to the Sentence Builder. Our realizer creates four types of element objects,
namely SOCElement, VAElement, AdjectiveElement, and AdverbElement. The
SOCElement represents the grammatical functions subject, object and complement.
The subject in the example sentence of section 3.1.1.3 is “vAdu”, for which a
SOCElement is created with the specified features. Similarly a SOCElement is
created for the complement “wota” and its modifier “aMxamu”, which is an
AdjectiveElement. Finally a VAElement is created for the verb “naducu” and the
modifier “neVmmaxi”, which is an AdverbElement.
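As a rough illustration of the four element types, the sketch below models them as Python data classes. Only the class names come from the text; every field name, and the choice to attach the modifier to its head element, is an assumption made for this example and may differ from the Java classes in our engine.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AdjectiveElement:
    lemma: str
    suffix: str = ""            # adjectival suffix, e.g. "aEna" (assumed field)

@dataclass
class AdverbElement:
    lemma: str
    suffix: str = ""            # adverbial suffix, e.g. "gA" (assumed field)

@dataclass
class SOCElement:
    """Element for a subject, object or complement."""
    role: str
    lemma: str
    pos: str
    gender: str
    number: str
    person: str
    casemarker: str = ""
    stem: str = "basic"
    modifier: Optional[AdjectiveElement] = None

@dataclass
class VAElement:
    """Element for the verb of the sentence."""
    lemma: str
    tense_mode: str
    modifier: Optional[AdverbElement] = None

# Elements for the running example.
subject = SOCElement("subject", "vAdu", "pronoun", "human", "plural", "third")
complement = SOCElement("complement", "wota", "noun", "nonmasculine", "singular",
                        "third", casemarker="lo",
                        modifier=AdjectiveElement("aMxamu", "aEna"))
verb = VAElement("naducu", "present participle",
                 modifier=AdverbElement("neVmmaxi", "gA"))
```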
3.8 Phrase Builder
Telugu sentences express most of the primary meaning in terms of morphologically
well-formed phrases or word groups. In Telugu the main and auxiliary verbs occur
together as a single word. Therefore their generation is done by the morphology
engine. Telugu sentences are mainly made up of four types of phrases - Noun Phrase,
Verb Phrase, Adjective Phrase, and Adverb Phrase. Noun phrases and verb phrases
are the main constituents in a sentence while the Adjective Phrase and the Adverb
Phrase only play the role of a modifier in a noun or verb phrase. There is one feature
at the Noun Phrase level “role” which specifies the role of the Noun Phrase in the
sentence. The phrase builder passes the elements constructed by the element builder
to the morphology engine and gets back the respective phrases with appropriately
inflected words. In the example input in section 3.1.1.3, there are three constituent
phrases, viz, two noun phrases for subject and complement and a verb phrase. One of
the noun phrases also contains an adjective phrase which is an optional modifying
element of noun heads in head-modifier noun phrases. The adjective phrase may be a
single element or sometimes composed of more than one element. The verb phrase
also contains an adverb phrase which is generally considered as a modifier of the
verb. The phrase builder passes five objects i.e., two SOCElement objects, one
AdjectiveElement object, one VAElement object, and one AdverbElement object to
the morphology engine and gets back five inflected words which finally become
three phrases, viz, two noun phrases “vAlYlYu”, “aMxamEna wotalo”, and one verb
phrase “neVmmaxigA naduswu”.
3.9 Morphology Engine
The morphology engine is the most important module in the Telugu realization
engine. It is responsible for the inflection and agglutination of the words and phrases.
The morphology engine behaves differently for different words based on their part of
speech (pos). The morphology engine takes the element object as the input, and
returns to the phrase builder the inflected or agglutinated word forms based on the
rules of the language. In the current work the morphology engine is a rule-based engine
with a lexicon to account for exceptions to the rules. The rules used by the
morphology engine are stored in external files so that they can be changed without
modifying the engine.
3.9.1 Noun
Noun is the head of the noun phrase. Telugu nouns are divided into three classes
namely (i) proper nouns and common nouns, (ii) pronouns, and (iii) special types of
nouns (e.g. numerals) (Krishnamurti and Gwynn, 1985). All nouns except a few
special type nouns have gender, number, and person. Noun morphology involves
mainly plural formation and case inflection. All the plural formation rules from
sections 6.11 to 6.13 of our grammar reference have been implemented in our
engine. The head of the complement in the example of section 3.1.1.3 is the noun
“wotalo”. The word “wota” along with its feature values can be written as follows:
“wota”, noun, nonmasculine, singular, third, basic, “lo” --- wotalo
The formation of this word is very simple because the word “wota” in its singular
form and the case marker “lo” get agglutinated through a sandhi (a morpho-phonological
fusion operation) formation as follows:
‘wota’ + lo ----- wotalo
3.9.2 Pronoun
Pronouns vary according to gender, number, and person. There are three persons in
Telugu namely first, second, and third. The gender of nouns and pronouns in
Telugu depends on the number. The relation between the number and gender is shown
in Table 3.1.
Number      Gender
Singular    masculine, non-masculine
Plural      human, nonhuman
Table 3.1: Relationship between Number and Gender
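The dependency in Table 3.1 can be expressed as a simple consistency check on the input features. The following Python sketch is illustrative only; the function name `gender_is_consistent` is hypothetical, and the feature value spellings ("nonmasculine", "nonhuman") follow the feature lists used in the examples of this chapter.

```python
# Gender values permitted for each number value, following Table 3.1.
ALLOWED_GENDERS = {
    "singular": {"masculine", "nonmasculine"},
    "plural": {"human", "nonhuman"},
}

def gender_is_consistent(number, gender):
    """Return True when the gender value is permitted for the given number."""
    return gender in ALLOWED_GENDERS.get(number, set())

ok = gender_is_consistent("plural", "human")            # subject of the example
not_ok = gender_is_consistent("plural", "masculine")    # inconsistent input
```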
Plural formation of pronouns is not rule based. Therefore the plural forms are stored
externally in the lexicon. The first person pronoun “nenu” has two plural forms: “memu”,
which is the exclusive plural form, and “manamu”, which is the inclusive plural form. To
generate the plural of the first person, a feature called “exclusive” has to be
specified with the value “yes” or “no”. Along with gender, number, and person there
is one more feature, the stem, which can be either basic or oblique. The
formation of the pronoun “vAlYlYu” in the example of section 3.1.1.3, which is the
head of the subject, along with its feature values can be written as follows:
“vAdu”, pronoun, human, plural, third, basic, “” --- vAlYlYu
In this case the stem is basic. The gender of the pronoun is human because the
number is plural, as shown in Table 3.1. The word “vAlYlYu” is retrieved from the
lexicon as the plural of the word “vAdu” given these feature values.
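The lexicon lookup for pronoun plurals can be sketched as below. The three entries are the forms quoted in the text; the dictionary layout, the use of `None` for pronouns that do not take the “exclusive” feature, and the function name `pronoun_form` are assumptions of this Python sketch rather than the format of our lexicon files.

```python
# Partial pronoun lexicon in WX notation, keyed by (lemma, exclusive).
PRONOUN_PLURALS = {
    ("vAdu", None): "vAlYlYu",
    ("nenu", "yes"): "memu",      # exclusive first person plural
    ("nenu", "no"): "manamu",     # inclusive first person plural
}

def pronoun_form(lemma, number, exclusive=None):
    """Return the basic-stem form of a pronoun for the given number."""
    if number == "singular":
        return lemma
    # Plurals are not rule based, so they are looked up rather than computed.
    return PRONOUN_PLURALS[(lemma, exclusive)]

subject_head = pronoun_form("vAdu", "plural")
```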
3.9.3 Adjective
Adjectives occur most often immediately before the noun they qualify. The basic
adjectives or the adjectival roots which occur only as adjectives are indeclinable (e.g.
oka (one), ara (half)). Adjectives can also be derived from other parts of speech like
verbs, adverbs, or nouns. The adjective “aMxamEna” in the example of section
3.1.1.3 is a derived adjective formed by adding the adjectival suffix “aEna” to the
noun “aMxamu”. The formation of the word “aMxamEna” in the example of section
3.1.1.3 along with its feature values can be written as follows:
“aMxamu”, adjective, descriptive,“aEna”--aMxamEna
The current work does not take into consideration the type of an adjective; this will
be included in a future version. The formation of this word is again through a sandhi
formation as follows:
aMxamu + aEna -------- aMxamEna
Here the sandhi formation eliminates the “u” at the end of the first word and the “a”
at the start of the second word, and the word “aMxamEna” is formed.
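The two sandhi formations seen so far (“wota” + “lo” and “aMxamu” + “aEna”) can be reproduced with the small Python sketch below. It covers only these two cases: the rule that a stem-final “u” and a suffix-initial “a” are both dropped is generalised from the single example above and is an assumption of the sketch, as is the function name `join_with_sandhi`.

```python
def join_with_sandhi(stem, suffix):
    """Join a stem and a suffix, covering only the cases shown in the text.

    Assumption of this sketch: when the stem ends in "u" and the suffix
    begins with "a", both letters are dropped; otherwise the two parts are
    simply concatenated.
    """
    if stem.endswith("u") and suffix.startswith("a"):
        return stem[:-1] + suffix[1:]
    return stem + suffix

adjective = join_with_sandhi("aMxamu", "aEna")
noun = join_with_sandhi("wota", "lo")
```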
3.9.4 Verb
Telugu verbs inflect to encode gender, number and person suffixes of the subject
along with tense mode suffixes. As already mentioned, gender, number and person
agreement is applied at the sentence level. At the word level, the verb is the most
difficult word to handle in Telugu because of the phonetic alternations applied to it
before it is agglutinated with the tense-aspect-mode (TAM) suffix. Telugu verbs are
classified into six classes (Krishnamurti, 1961). Our engine implements all these
classes, and the phonetic alternations applicable to each of these classes are stored
externally in a file. The verb phrase in the example of Figure 3.2 has one verb, “naducu”,
along with its feature values. The formation of the verb “naduswu” can be written as
follows:
“naducu”, verb, present participle ------ naduswu
The word “naducu” belongs to class IIa, for which the phonetic alternation is to
substitute “cu” with “s”, and therefore the word gets inflected as follows:
naducu ---------------- nadus
The tense mode suffix for the present participle is “wu”, and the word becomes
“naduswu”. The gender and number of the subject also play a role in the formation
of the verb, which is discussed in section 3.6.
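The word-level verb formation just described can be sketched as two table lookups followed by concatenation. The Python sketch below is illustrative: the tables contain only the entries quoted in the text, and the table names and the function name `inflect_verb` are hypothetical.

```python
# Partial tables; only the entries quoted in the text are filled in.
VERB_CLASS = {"naducu": "IIa"}
ALTERNATIONS = {"IIa": ("cu", "s")}            # (ending to replace, replacement)
TENSE_MODE_SUFFIX = {"present participle": "wu"}

def inflect_verb(lemma, tense_mode):
    """Apply the class-specific alternation, then add the tense mode suffix."""
    old, new = ALTERNATIONS[VERB_CLASS[lemma]]
    if not lemma.endswith(old):
        raise ValueError("lemma does not end in the expected segment: " + old)
    stem = lemma[: -len(old)] + new            # naducu -> nadus
    return stem + TENSE_MODE_SUFFIX[tense_mode]

verb_form = inflect_verb("naducu", "present participle")
```

The result is the form that the Sentence Builder later extends with the agreement suffix.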
3.9.5 Adverb
All adverbs fall into three semantic domains, those denoting time, place and manner
(Krishnamurti and Gwynn 1985). The adverb “neVmmaxigA” in the example (1) is a
manner adverb as it describes the way they are walking: “neVmmaxigA
naduswunnAru” (walking slowly). In Telugu manner adverbs are generally formed
by adding “gA” to adjectives and nouns. The formation of the adverb
“neVmmaxigA” in the example (1) along with its feature values can be written as
follows:
“neVmmaxi”, adverb,“gA”-------------neVmmaxigA
The formation of the above word is a simple sandhi formation.
3.10 Output Generator
The Output Generator is the module which actually generates text in Telugu script. It
receives the constructed sentence in WX notation (shown in
Appendix) and outputs the sentence in Telugu using the Unicode characters
for Telugu. The output generated for the example input of Figure 3.2
is as follows:
వాళ్ళు అందమైన తోటలో నెమ్మదిగా నడుస్తున్నారు. (They are walking slowly in
the beautiful garden).
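The conversion from WX notation to Telugu Unicode can be sketched as follows. This Python sketch is not the Output Generator itself: it uses a partial mapping table (a subset of the WX consonants and vowels, including everything needed for the running example), and the names `wx_to_telugu`, `CONSONANTS`, `VOWELS`, `MARKS` and `TWO_CHAR` are assumptions of the example. A consonant followed by a vowel takes the dependent vowel sign, and a consonant followed by another consonant takes the virama to form a cluster.

```python
VIRAMA = "\u0c4d"  # suppresses the inherent vowel "a" of a consonant

# WX consonant -> Telugu consonant letter (partial table).
CONSONANTS = {
    "k": "\u0c15", "K": "\u0c16", "g": "\u0c17", "G": "\u0c18",
    "c": "\u0c1a", "C": "\u0c1b", "j": "\u0c1c", "J": "\u0c1d",
    "t": "\u0c1f", "T": "\u0c20", "d": "\u0c21", "D": "\u0c22", "N": "\u0c23",
    "w": "\u0c24", "W": "\u0c25", "x": "\u0c26", "X": "\u0c27", "n": "\u0c28",
    "p": "\u0c2a", "P": "\u0c2b", "b": "\u0c2c", "B": "\u0c2d", "m": "\u0c2e",
    "y": "\u0c2f", "r": "\u0c30", "l": "\u0c32", "lY": "\u0c33", "v": "\u0c35",
    "S": "\u0c36", "R": "\u0c37", "s": "\u0c38", "h": "\u0c39",
}

# WX vowel -> (independent letter, dependent vowel sign).
VOWELS = {
    "a": ("\u0c05", ""), "A": ("\u0c06", "\u0c3e"),
    "i": ("\u0c07", "\u0c3f"), "I": ("\u0c08", "\u0c40"),
    "u": ("\u0c09", "\u0c41"), "U": ("\u0c0a", "\u0c42"),
    "eV": ("\u0c0e", "\u0c46"), "e": ("\u0c0f", "\u0c47"), "E": ("\u0c10", "\u0c48"),
    "oV": ("\u0c12", "\u0c4a"), "o": ("\u0c13", "\u0c4b"), "O": ("\u0c14", "\u0c4c"),
}

MARKS = {"M": "\u0c02", "H": "\u0c03"}   # anusvara, visarga
TWO_CHAR = {"eV", "oV", "lY"}            # WX tokens written with two letters

def wx_to_telugu(wx):
    """Convert a WX string to Telugu script using the partial tables above."""
    out = []
    i = 0
    after_consonant = False   # True while a consonant is waiting for its vowel
    while i < len(wx):
        # Longest match first so that "eV", "oV" and "lY" are read as one token.
        token = wx[i:i + 2] if wx[i:i + 2] in TWO_CHAR else wx[i]
        i += len(token)
        if token in CONSONANTS:
            if after_consonant:
                out.append(VIRAMA)            # consonant cluster
            out.append(CONSONANTS[token])
            after_consonant = True
        elif token in VOWELS:
            independent, sign = VOWELS[token]
            out.append(sign if after_consonant else independent)
            after_consonant = False
        else:
            if after_consonant:
                out.append(VIRAMA)            # bare consonant before a break
            out.append(MARKS.get(token, token))  # spaces etc. pass through
            after_consonant = False
    if after_consonant:
        out.append(VIRAMA)
    return "".join(out)

telugu = wx_to_telugu("vAlYlYu aMxamEna wotalo neVmmaxigA naduswunnAru.")
```

A full converter would need the complete WX table from the Appendix; characters missing from the partial tables are passed through unchanged.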
3.11 Evaluation
The current work addresses the problem of generating syntactically and
morphologically well-formed sentences in Telugu from an input specification
consisting of lexicalized grammatical constituents and associated features. In order
to test the robustness of the realization engine as the input to the realizer changes, we
initially ran the engine in a batch mode to generate all possible sentence variations
given an input similar to the one shown in Figure 3.2. In the batch mode the engine
uses the same input root words in every run, but uses a different
combination of values for the grammatical features such as tense, aspect, mode,
number and gender in each new run. Although the batch run was originally intended
for software quality testing before conducting evaluation studies, these tests showed
that certain grammatical feature combinations might make the realization engine
produce unacceptable output. This is an expected outcome because our engine in the
current state performs very limited consistency checks on the input. The purpose of
our evaluation is to measure our realizer’s coverage of the Telugu language. One
objective measure could be to measure the proportion of sentences from a specific
text source (such as a Telugu newspaper) that our realizer could generate. As a first
step towards such an objective evaluation, we first evaluate our realizer using
example sentences from our grammar reference. Although not ideal this evaluation
helps us to measure our progress and prepares us for the objective evaluation. The
individual chapters and sections in the book by Krishnamurti and Gwynn (1985)
follow a standard structure where every new concept of grammar is introduced with
the help of a list of example sentences that illustrate the usage of that particular
concept. We used these sentences for our evaluation. Note that we collected
sentences from all chapters. This means our realizer is required to generate, for
example, verb forms used in example sentences from other chapters in addition to
those from the chapter on verbs. A total of 738 sentences were collected from
chapter 6 to chapter 26, the main chapters which cover Telugu grammar. Because the
coverage of the current system is limited, we do not expect the system to generate all
these 738 sentences. Among these, 419/738 (57%) sentences were found to be within
the scope of our current realizer. Many of these sentences are simple and short. Our
realizer was run on each of the 419 selected sentences to generate 419 output
sentences. The output sentences matched the original sentences from the book
completely. This means at this stage we can quantify the coverage of our realizer as
57% (419/738) against our own grammar source. A more objective measure of
coverage will be estimated in the future.
Total No. of   Sentences in the   Sentences not in the   Sentences   Coverage in
Sentences      scope of our       scope of our           generated   percentage
               realizer           realizer
738            419                319                    419         57%
Table 3.2 Evaluation Results of Sentence Generation
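The figures in Table 3.2 follow from simple arithmetic, reproduced in the short Python sketch below. The variable names are chosen for this illustration only.

```python
total_sentences = 738       # example sentences collected from chapters 6 to 26
in_scope = 419              # sentences within the scope of the current realizer
generated_correctly = 419   # outputs that matched the book sentences exactly

out_of_scope = total_sentences - in_scope
coverage_percent = 100 * generated_correctly / total_sentences

# Rounded to the nearest whole percent, as reported in the table.
coverage_rounded = round(coverage_percent)
```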
Having built the functionality for the main sentence construction tasks, we are now
in a good position to widen the coverage. The majority of the remaining 319 sentences
(= 738 - 419) involve verb forms such as participles and compound verbs, and medium
to complex sentence types. As stated above, we intend to use this evaluation to drive
our development. This means every time we extend the coverage of the realizer we
will rerun the evaluation to quantify the extended coverage of our realizer. The idea
is not to achieve 100% coverage. Our strategy has always been to select each new
sentence or phrase type to be included in the realizer based on its utility to express
meanings in some of the popular NLG application domains such as medicine,
weather, sports and finance.
3.12 Summary
This chapter described a surface realizer for Telugu which was designed by adapting
the SimpleNLG framework for free word order languages. The chapter mainly
focused on the architecture of the Telugu realization engine and the input
specification mechanism employed. Other aspects, such as the morphology of the
different parts of speech, phrase formation and sentence formation, are only
introduced in this chapter and are discussed in detail in the following chapters.
-----------------------CHAPTER-4
MORPHOLOGY OF NOUNS AND PRONOUNS
4.1 Introduction
Telugu is a free word order language in which various grammatical categories (case,
gender, number, person etc.) are morphologically encoded making it a
morphologically rich language. In this chapter we present a morphological generator
for Telugu nouns and pronouns modelled on finite state techniques. The
morphological generator generates the required word form for nouns and pronouns
from an input specification consisting of the lemma and its associated features. The
module discussed in this chapter is an independent module of a surface realization
engine that automates the task of building grammatically correct Telugu sentences.
The morphological generator also supports the verb morphological generator in
generating the appropriate verb form by passing it the features (person,
number and gender) required for the formation of that verb form.
Natural Language Processing (NLP) applications, which are growing in number these
days, can be categorized into two broad areas, namely Natural Language
Understanding (NLU), where Morphological Analysers (MA) play a very important
role, and Natural Language Generation (NLG), where Morphological Generators (MG)
play a very important role. A Morphological Analyser takes a word as input, processes it
and gives as output its root along with its grammatical features. A Morphological
Generator takes a root along with its grammatical features as input and generates the
required word form. Morphological Generators have a very important role to
play in applications like Surface Realization (Dokkara, Penumathsa, and Sripada,
2015; Gatt and Reiter, 2009) and Machine Translation (Toutanova, Suzuki and
Ruopp, 2008) of free word order languages like Telugu. It is always
advantageous for free word order languages like Telugu to have the Morphological
Generator as a component that is separate from the rest of the NLG system
(Minnen et al., 2000). The current work is a separate module of a surface
realization engine for Telugu (Dokkara, Penumathsa, and Sripada, 2015), a Java
application which performs the final subtask of a Natural Language Generation (NLG)
pipeline (Reiter and Dale, 2000). The sentence realization engine for Telugu is
designed following the approach of SimpleNLG (Gatt and Reiter, 2009), which is a very
popular surface realization engine for English.
Telugu is a morphologically rich free word order language spoken by people from
the south Indian states of Andhra Pradesh and Telangana. In this chapter we describe
a morphology engine which automatically generates the different forms of nouns and
pronouns in Telugu. The current work is modelled on the morphology engine for
English (Guido Minnen et al. 2000).
Telugu nouns are divided into three classes namely (i) proper nouns and common
nouns, (ii) pronouns, and (iii) special types of nouns (e.g. numerals) (Krishnamurti
and Gwynn, 1985). All nouns except a few special-type nouns have gender, number,
and person. Noun morphology mainly involves plural formation, oblique stem
formation and case inflection. The current work discusses in detail the first two
classes of nouns.
Linguistic theories of morphology differ on whether the morpheme should be treated
as the basic building block of morphological analysis or generation. In “Two models
of grammatical description”, Hockett (1954) proposed the Item and Arrangement (IA)
model and the Item and Process (IP) model. Item and Arrangement is a concatenative
approach in which morphemes are the lexical units and words are formed by the
agglutination of such units. Item and Process is a derivational approach in which
new word forms are produced by processes acting on morphemes and words. Hockett
also mentioned an already existing model, Word and Paradigm (WP), a word based
morphological approach which states generalizations that hold between the different
forms of inflectional paradigms and which is used for languages like Latin, Greek,
and Sanskrit. A two dimensional taxonomy of morphological theories was proposed by
Stump (2001). He distinguished two orthogonal axes along which theories of
inflectional morphology may be situated relative to one another: the
lexical/inferential axis and the incremental/realizational axis.
From a computational perspective lexical-incremental and inferential-realizational
theories are computationally equivalent and can be implemented using finite state
techniques. Finite state techniques have been used to build morphological analysers
(MA) and morphological generators (MG) for a diverse range of languages (Karttunen,
2003, Beesley and Karttunen, 2003, Karttunen and Beesley, 2005, Roark and Sproat,
2007). Therefore, in this chapter we apply finite state techniques to the Telugu
morphology of nouns and pronouns.
Among the morphological tools for Indian languages, Goyal and Lehal (2011) report
a Hindi to Punjabi machine translation system modelled on a database approach in
which all word forms are stored in a relational database. A number of morphological
tools for Tamil are reported by PJ Antony and Soman (2012), ranging from corpus
based through suffix stripping to finite state techniques. For Telugu, Rao et al.
(2011) describe a word and paradigm based morphological analyser and generator,
Sribadrinarayan et al. (2009) describe an item and arrangement based morphological
generator, and Ganapatiraju et al. (2006) describe a rule based (item and process
based) morphological generator.
4.2 Inputs to the Morphology Engine
The noun and pronoun morphology engine is an independent module of a surface
realization engine which produces grammatically correct Telugu sentences. The
input specification for the surface realization engine consists of lexicalized
grammatical constituents and associated features in the form of an XML file. The
XML file given as input provides the required grammatical information not only at
the sentence level but also at the word level which acts as the input to the
morphology engine. An example XML specification corresponding to the Telugu
sentence in WX-notation given below is shown in Figure 4.1.
vAlYlYu iMtiki vaccAru. (They came home.)
<?xml version="1.0" encoding="UTF8" standalone="no"?>

<document>
<sentence type=" " predicatetype="verbal" respect="no">
<nounphrase>
<head pos="pronoun" gender="human" number="plural" person="third"
casemarker=" " stem="basic">vAdu</head>
</nounphrase>
<nounphrase>
<head pos="noun" gender="nonmasculine" number="singular" person="third"
casemarker="ki" stem="oblique"> illu</head>
</nounphrase>
<verbphrase>
<head pos="verb" tensemode="pasttense"> vaccu</head>
</verbphrase>
</sentence>
</document>
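As a rough illustration of how the word-level features of such a specification could be read, the following Java sketch parses a single head element with the standard JAXP DOM parser and collects its lemma and features. The class name HeadFeatureReader and the method describe are hypothetical and are not taken from the engine; only the attribute names (pos, gender, number, person, stem, casemarker) come from the specification above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of the engine described here.
final class HeadFeatureReader {

    /** Returns "lemma|pos|gender|number|person|stem|casemarker" for one head element. */
    static String describe(String headXml) {
        try {
            Element head = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(headXml)))
                    .getDocumentElement();
            return String.join("|",
                    head.getTextContent().trim(),            // lemma, e.g. vAdu
                    head.getAttribute("pos"),
                    head.getAttribute("gender"),
                    head.getAttribute("number"),
                    head.getAttribute("person"),
                    head.getAttribute("stem"),
                    head.getAttribute("casemarker").trim()); // a blank value means no case marker
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed head element", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(
                "<head pos=\"pronoun\" gender=\"human\" number=\"plural\" person=\"third\" "
                + "casemarker=\" \" stem=\"basic\">vAdu</head>"));
    }
}
```

The flat string is used only to keep the sketch short; a real module would more likely pass a small feature object to the morphology engine.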
4.3 Noun Morphological Process
The complement in the example of section 4.2 is headed by the noun “iMtiki”. The
word “illu”, given in the XML specification of Figure 4.1 as the head of the noun
phrase which plays the role of a complement in the sentence, can be written along
with its feature values as follows:
“illu”, noun, nonmasculine, singular, third, oblique, “ki” → “iMtiki”
First the oblique stem of the word “illu” is formed, because the word has to be
agglutinated with the case marker. The formation of the oblique stem is a two-step
process. In the first step the class to which the root word belongs is identified. In
the current work the identification of the class is modelled on finite state
techniques. The root word “illu” belongs to class III. A pictorial representation of
the finite automaton for identification of class III stems is shown in Figure 4.2.
Figure 4.2: Finite Automaton to identify class III stems for Oblique Stem Formation
In the second step the oblique stem of “illu”, which is “iMti”, is formed by
replacing “llu” by “Mti”.
After the formation of the oblique stem the case marker is agglutinated to the
oblique stem to form the final word as follows:
“iMti” + “ki” → “iMtiki”
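The two steps and the final agglutination can be sketched in Java with the regular expression package. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, the regular expression merely stands in for the automaton of Figure 4.2, and the six stems are those listed for class III later in this chapter.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; class and method names are hypothetical.
final class ClassThreeObliqueSketch {

    // Step 1: recognise the six class III stems (WX notation).
    private static final Pattern CLASS_III =
            Pattern.compile("^(illu|villu|pannu|kannu|cannu|olYlYu)$");

    // Step 2: replace the stem-final nnu, llu or lYlYu by Mti.
    static String obliqueStem(String root) {
        if (!CLASS_III.matcher(root).matches()) {
            return root; // not class III: left unchanged in this sketch
        }
        return root.replaceFirst("(nnu|llu|lYlYu)$", "Mti");
    }

    // Step 3: agglutinate the case marker to the oblique stem.
    static String withCaseMarker(String root, String caseMarker) {
        return obliqueStem(root) + caseMarker;
    }

    public static void main(String[] args) {
        System.out.println(withCaseMarker("illu", "ki"));
    }
}
```

Listing the stems explicitly keeps the pattern from matching other nouns that merely end in “nnu” or “llu”.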
The formation of the pronoun “vAlYlYu” in the example of section 4.2, which is the
head of the subject, can be written along with its feature values as follows:
“vAdu”, pronoun, human, plural, third, basic, “” → “vAlYlYu”
The formation of plurals for pronouns does not follow any rules, and the plurals are
therefore stored in a lexicon. The word “vAlYlYu” is retrieved from the lexicon,
stored in an external file “pronounplural.xml”, as the plural of the word “vAdu”
with the given feature values.
4.4 Noun Morphology Engine
The steps applied to a noun or pronoun root to obtain the required inflected form
are as follows:
a) Formation of the plural of the root if required.
b) Formation of the oblique stem if required.
c) Agglutination of the case-marker to the oblique stem if required.
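The ordering of the three steps can be sketched as a single Java method. This is only a sketch: the name inflect, its parameters and the idea of passing the plural and oblique rules in as functions are assumptions made for illustration, not the engine's actual interface.

```java
import java.util.function.UnaryOperator;

// Illustrative sketch only; names and signature are hypothetical.
final class NounInflectionSteps {

    static String inflect(String root, boolean plural, boolean oblique, String caseMarker,
                          UnaryOperator<String> pluralRule,
                          UnaryOperator<String> obliqueRule) {
        String marker = caseMarker == null ? "" : caseMarker.trim();
        String form = plural ? pluralRule.apply(root) : root;   // step (a)
        if (oblique || !marker.isEmpty()) {
            form = obliqueRule.apply(form);                      // step (b)
        }
        return form + marker;                                    // step (c)
    }

    public static void main(String[] args) {
        // Toy rules that cover just the two words used below.
        UnaryOperator<String> regularPlural = s -> s + "lu";
        UnaryOperator<String> toyOblique = s -> s.equals("illu") ? "iMti" : s;
        System.out.println(inflect("illu", false, true, "ki", regularPlural, toyOblique));
        System.out.println(inflect("Avu", true, false, " ", regularPlural, toyOblique));
    }
}
```

A blank case marker is treated as "no case marker", mirroring the casemarker=" " value in the input specification.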
4.4.1 Plural Formation
Common nouns can be divided into count and non-count nouns. Non-count nouns
(mass nouns, indivisible objects and abstract nouns) are not distinguished for
number: each is either singular or plural. Some count nouns do not conform to any
rules of plural formation. The singular and plural forms of the non-count nouns and
of the count nouns which do not conform to any rules of plural formation are stored
externally in a lexicon as an XML file named “plural.xml”. Some mass nouns that
exist only in the singular form are given in Table 4.1.
Word Meaning in English
uppu Salt
nUne Oil
inumu Iron
veVMdi Silver
biyyaM Rice
janaM People
Table 4.1: Example Mass Nouns in Singular
Some mass nouns that exist only in the plural form are given in Table 4.2.
Word Meaning in English
vadlu Paddy
peVsalu Green gram
kaMxulu Red gram
nIlYlYu Water
pAlu Milk
Table 4.2: Example Mass Nouns in Plural
Indivisible objects cannot have both singular and plural forms. Some indivisible
objects are shown in Table 4.3.
Word Meaning in English
AkAsaM Sky
samuxraM Sea
Table 4.3: Example Indivisible Objects
Some example abstract nouns which are non-count nouns are shown in Table 4.4.
Word Meaning in English
welupu Whiteness
welivi Intelligence
balaM Strength
saMwoRaM Happiness
nixra Sleep
Table 4.4: Example Abstract Nouns
Among the count nouns some nouns do not follow any rules for the formation of
plurals. Table 4.5 is a list of count nouns which do not conform to any rules for the
formation of plurals:
Singular Word    Plural Word
rAyi             rAlYlYu
poyyi            poyyilu
peMdli           peMdliMdlu
vari             vadlu
gAru             gArlu
sAri             sArlu
kumArudu         kumArulu
eVxxu            eVdlu
veVyyi           velu
cenu             celu
penu             pelu
kAdi             kAMdlu
jIwagAdu         jIwagAlYlYu
alludu           allulYlYu
manamarAlu       manamarAlYlYu
ceVlleVlu        ceVlleVlYlYu
kUwuru           kUwulYlYu
koVdavali        koVdavalYlYu
rAwri            rAwrilYlYu
Table 4.5: Count Nouns with no rules for Plural Formation
Pronouns do not conform to any rules regarding the formation of plurals. All the
pronouns and their plurals listed in Table 4.6 are also stored externally in a lexicon
as an XML file named “pronounplural.xml”.
Person    Singular    Plural
First     nenu        memu, manamu
Second    nuvvu       mIru
Third     vAdu        vAlYlYu
Third     axi         avi
Third     vAru        vAlYlYu
Third     ixi         ivi
Third     awanu       vAlYlYu
Third     Ayana       vAlYlYu
Third     AmeV        vAlYlYu
Third     imeV        vIlYlYu
Third     Avida       vAlYlYu
Third     vIdu        vIlYlYu
Third     iwanu       vIlYlYu
Third     Iyana       vIlYlYu
Third     vIru        vIlYlYu
Third     Ivida       vIlYlYu
Third     exi         evi
Third     wAnu        wAmu
Table 4.6: Example Pronouns and their Plurals
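Because pronoun plurals are looked up and not derived, the lookup itself is simple. The sketch below assumes an in-memory map holding a few entries copied from Table 4.6; the format of “pronounplural.xml” is not described here, so loading the file is left out, and the class and method names are hypothetical. For “nenu” only the exclusive plural “memu” is stored, which is a simplification.

```java
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only; the real lexicon is read from an external XML file.
final class PronounPluralLexiconSketch {

    private static final Map<String, String> PLURALS = Map.of(
            "nenu", "memu",
            "nuvvu", "mIru",
            "vAdu", "vAlYlYu",
            "AmeV", "vAlYlYu",
            "vIdu", "vIlYlYu",
            "wAnu", "wAmu");

    /** Returns the stored plural, or an empty Optional if the word is not in the lexicon. */
    static Optional<String> pluralOf(String singular) {
        return Optional.ofNullable(PLURALS.get(singular));
    }

    public static void main(String[] args) {
        System.out.println(pluralOf("vAdu").orElse("(not in lexicon)"));
        System.out.println(pluralOf("Avu").orElse("(not in lexicon)"));
    }
}
```

Returning an Optional lets the caller fall back to rule-based plural formation when a word is not a listed pronoun.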
The regular way of forming the nominative plural is by adding the plural suffix “lu”
to the basic stem.
Example:
Avu (cow) → Avulu (cows)
A number of morphophonemic changes may occur, because of which the plural suffix
“lu” sometimes becomes “lYlYu”. The morphophonemic changes depend on the class to
which the singular stem belongs. The formation of the plural is a two-step process.
First the class to which the stem belongs is identified. In the current work the
identification of the class is implemented as a finite automaton, illustrated in
Figure 4.3 for class VII. Second the sandhi (morphophonemic changes) occurs and the
final plural form is generated.
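The two-step process can be sketched in Java for the stems ending in “yyi” (class VII in the list that follows). The hand-written automaton below is an assumption about what such a recogniser might look like and is not a transcription of the thesis figure. The vowel lengthening relies on the WX convention that “eV” and “oV” are the short counterparts of “e” and “o” and that capital letters mark long vowels; all names are hypothetical.

```java
// Illustrative sketch only; states, names and vowel handling are assumptions.
final class YyiSuffixAutomaton {

    // Step 1: deterministic automaton accepting strings that end in "yyi".
    // State k means that the last k characters read are the first k characters of "yyi".
    static boolean accepts(String stem) {
        int state = 0;
        for (char c : stem.toCharArray()) {
            switch (state) {
                case 0:  state = (c == 'y') ? 1 : 0; break;
                case 1:  state = (c == 'y') ? 2 : 0; break;
                case 2:  state = (c == 'i') ? 3 : (c == 'y') ? 2 : 0; break;
                default: state = (c == 'y') ? 1 : 0; break; // leaving state 3
            }
        }
        return state == 3;
    }

    // WX notation: eV -> e, oV -> o, a/i/u -> A/I/U.
    static String lengthenFinalVowel(String s) {
        if (s.endsWith("eV") || s.endsWith("oV")) {
            return s.substring(0, s.length() - 1);
        }
        if (s.endsWith("a") || s.endsWith("i") || s.endsWith("u")) {
            char last = s.charAt(s.length() - 1);
            return s.substring(0, s.length() - 1) + Character.toUpperCase(last);
        }
        return s;
    }

    // Step 2: sandhi, "yyi" is replaced by "wu" after a long vowel, then "lu" is added.
    static String plural(String stem) {
        if (!accepts(stem)) {
            throw new IllegalArgumentException("stem does not end in yyi: " + stem);
        }
        String base = stem.substring(0, stem.length() - 3);
        return lengthenFinalVowel(base) + "wu" + "lu";
    }

    public static void main(String[] args) {
        System.out.println(plural("ceVyyi"));
    }
}
```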
The stems can be categorized into different classes for plural formation of nouns
(Krishnamurti and Gwynn, 1985) as follows:
I) Stem-final “i” or “u” preceded by “t”, “Mt” or “Md” is lost before the plural
suffix “lu”.
Example:
koti (one crore) → kotlu (crores)
II) In all stems ending in “di”, “du”, “lu” and “ru”, and in stems of more than two
syllables ending in “li” and “ri”, the final syllable and the plural suffix together
become “lYlYu”.
Example:
badi (school) → balYlYu (schools)
Exception 1:
Masculine nouns of Sanskrit origin ending in “du” replace “du” by
“lu” to form the plural.
Example:
snehiwudu (friend) → snehiwulu (friends)
Exception 2:
Loanwords from foreign languages ending in “ru” form the plural
by adding “lu” to the basic stem.
Example:
nOkaru (servant) → nOkarlu (servants)
III) Stem-final “tti”, “ttu”, “ddi”, “ddu” becomes “t”, “d” before “lu”.
Example:
ceVttu (tree) → ceVtlu (trees)
IV) Stem-final “llu” or “nnu” following a short vowel becomes “Md” before “lu” or
“lY” before “lYu”.
Example:
illu (house) → ilYlYu (houses)
Exception 1:
Some basic stems ending in “nnu” form the plural by adding “lu”.
Example:
pannu (tax) → pannulu (taxes)
V) Stem-final “aM” or “AM” is replaced by “A”, and stem-final “eVM” is replaced by
“e”, before the plural suffix “lu”.
Example:
puswakaM (book) → puswakAlu (books)
VI) Stems ending in “Ayi” form the plural in the regular way by adding “lu”.
Example:
abbAyi (boy) → abbAyilu (boys)
VII) Stem-final “yi” or “yyi” is replaced by “wu” before “lu”. The vowel
before “wu” is a long vowel.
Example:
ceVyyi (hand) → cewulu (hands)
The class identification for stems belonging to class VII can be done through the
finite automaton in Figure 4.3.
Figure 4.3: Finite Automaton to identify class VII stems for Plural Formation
VIII) Stems that end in “i” and do not conform to the above classes:
1) If the stem consists of two syllables, or it consists of more than two
syllables and the vowel in the middle syllable is not “i”, then the final “i”
changes to “u” before “lu”.
Example:
bAvi (water well) → bAvulu (water wells)
2) If the stem consists of more than two syllables and the vowel in the middle
syllable is “i”, then all the non-initial “i”s become “u”.
Example:
maniRi (human being) → manuRulu (human beings)
Exception:
In nouns of Sanskrit origin the “i” in the middle syllable
does not change.
Example:
pariXi (boundary) → pariXulu (boundaries)
Proper nouns are not generally used in the plural, but when they are used the rules
are similar to those of the common nouns.
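Some of the classes above translate directly into ordered rewrite rules over the end of the stem. The following Java sketch covers only classes I, III, V (for “aM” and “AM”) and VI. It is an assumption about how such rules could be written with the regular expression package, with hypothetical names, and it returns an empty result for every stem it does not cover, so that the irregular lexicon, the remaining classes and the regular rule can be applied elsewhere.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; covers classes I, III, V (aM/AM) and VI.
final class PluralRuleSketch {

    // {class label, stem-final pattern, replacement that already includes the suffix "lu"}
    // Class III is tested before class I because "ttu" also ends in "t" + vowel.
    private static final String[][] RULES = {
        {"III", "(t|d)\\1[iu]$", "$1lu"},   // ceVttu -> ceVtlu
        {"I",   "(t|Md)[iu]$",   "$1lu"},   // koti -> kotlu
        {"V",   "[aA]M$",        "Alu"},    // puswakaM -> puswakAlu
        {"VI",  "Ayi$",          "Ayilu"},  // abbAyi -> abbAyilu
    };

    /** Returns the plural if one of the covered classes applies, otherwise empty. */
    static Optional<String> plural(String stem) {
        for (String[] rule : RULES) {
            Matcher m = Pattern.compile(rule[1]).matcher(stem);
            if (m.find()) {
                return Optional.of(m.replaceFirst(rule[2]));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (String stem : new String[] {"koti", "ceVttu", "puswakaM", "abbAyi", "Avu"}) {
            System.out.println(stem + " -> " + plural(stem).orElse("(not covered)"));
        }
    }
}
```

The order of the rules matters, which is one reason the class of a stem is identified before any sandhi is applied.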
4.4.2 Oblique Stem Formation
Each noun in Telugu has an oblique stem along with the basic stem in both singular
and plural forms. The oblique stem is used to indicate possession or adjectival
relationship. It corresponds in meaning to the possessive forms ’s (singular) and
s’ (plural) in English.
4.4.2.1 Oblique Stem in Singular
The oblique stems of the personal pronouns and of a few demonstrative pronouns in
the singular form like “axi”, “ixi”, “exi” do not conform to any rules for oblique
stem formation. They need to be memorized and are listed in Table 4.7.
Singular Nominative    Singular Oblique
nenu                   nA
nuvvu                  nI
axi                    xAni
ixi                    xIni
exi                    xeni
Table 4.7: Pronouns and their Oblique Stems in Singular
The oblique stems in the singular for common nouns and demonstrative pronouns
are formed based on some morphophonemic rules. Common nouns in Telugu are
divided into six classes based on the manner in which the oblique stem is formed
(Krishnamurti and Gwynn, 1985).
The six classes are as follows:
I) All nouns denoting human beings and demonstrative pronouns ending in “du”,
“ru”, “nu”, “lu”, and a few non-human nouns ending in “ru”, “lu” preceded by
a long vowel, fall into this category. These form the oblique stem by deleting
the final “u” and adding “i” to the basic stem. Some example nouns and
pronouns belonging to Class I are listed in Table 4.8.
Singular Nominative    Singular Oblique
mogudu (husband)       mogudi (of husband)
vAdu (he)              vAdi (his)
kUwuru (daughter)      kUwuri (of daughter)
ceVlleVlu (sister)     ceVlleVli (of sister)
awanu (he)             awani (his)
vAru (they)            vAri (theirs)
kAlu (leg)             kAli (of leg)
Uru (village)          Uri (of village)
Table 4.8: Examples for Class I Oblique Stems
II) Non-human nouns of two or more syllables ending in “du”, “di”, “ru”, “ri”,
“lu”, “li” replace the final syllable by “ti” in forming the oblique
stem. Some nouns belonging to Class II are listed in Table 4.9.
Singular Nominative    Singular Oblique
gUdu (house)           gUti (of a house)
eru (canal)            eti (of a canal)
wAbelu (tortoise)      wAbeti (of a tortoise)
nAgali (plough)        nAgati (of a plough)
kAvadi (balance)       kAvati (of a balance)
Table 4.9: Examples for Class II Oblique Stems
III) Six stems ending in “nnu”, “llu”, “lYlYu” replace these endings by “Mti” in
forming the oblique stem. All the six stems belonging to Class III are listed in
Table 4.10.
Singular Nominative    Singular Oblique
illu (house)           iMti (of a house)
villu (bow)            viMti (of a bow)
pannu (tooth)          paMti (of a tooth)
kannu (eye)            kaMti (of an eye)
cannu (breast)         caMti (of breast)
olYlYu (body)          oMti (of body)
Table 4.10: Examples for Class III Oblique Stems
IV) Five stems of two syllables ending in “yi” and two stems ending in “rru”
replace the final syllable by “wi” in the formation of the oblique stem. All the
seven stems belonging to Class IV are listed in Table 4.11.
Singular Nominative    Singular Oblique
ceyi/ceVyyi (hand)     cewi (of hand)
neyi/neVyyi (ghee)     newi (of ghee)
nUyi/nuyyi (well)      nUwi (of well)
goyi/goVyyi (pit)      gowi (of pit)
rAyi (stone)           rAwi (of stone)
goVrru (harrow)        goVrwi (of harrow)
Table 4.11: Examples for Class IV Oblique Stems
V) All nouns ending in “M” have two oblique stems: one, used in the genitive, has
no modification, and the other, used before the accusative and dative case markers,
is formed by replacing the final “aM” or “AM” with “Ani”. Some example stems
belonging to Class V are listed in Table 4.12.
Singular Nominative    Singular Oblique    Singular Accusative/Dative
kalaM (pen)            kalaM (pen)         kalAni-
puswakaM (book)        puswakaM (book)     puswakAni-
Table 4.12: Examples for Class V Oblique Stems
VI) Basic stems ending in “e”, “a” or a long vowel, those ending in “i” or “u”
preceded by double consonants except “ll” or “nn”, and all nouns not covered
by classes I-V have identical basic and oblique stems. Some
example stems belonging to Class VI are listed in Table 4.13.
Singular Nominative     Singular Oblique
anna (elder brother)    anna (of an elder brother)
peVtteV (box)           peVtteV (of a box)
poti (contest)          poti (of a contest)
ceVttu (tree)           ceVttu (of a tree)
Table 4.13: Examples for Class VI Oblique Stems
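The class-based choice of the singular oblique stem can be sketched as a small classifier followed by one rewrite per class. The sketch below makes several assumptions that are not taken from the engine: the names are hypothetical, a boolean flag says whether the noun denotes a human being, classes III and IV are held as closed lists copied from Tables 4.10 and 4.11 (only the “yyi” spellings and the six rows shown), and class II is left out because telling it apart from the non-human class I nouns (compare “Uru” and “eru”) needs lexical information.

```java
import java.util.Map;

// Illustrative sketch only; class II is deliberately not covered.
final class SingularObliqueSketch {

    enum ObliqueClass { I, III, IV, V, VI }

    private static final Map<String, String> CLASS_III = Map.of(
            "illu", "iMti", "villu", "viMti", "pannu", "paMti",
            "kannu", "kaMti", "cannu", "caMti", "olYlYu", "oMti");

    private static final Map<String, String> CLASS_IV = Map.of(
            "ceVyyi", "cewi", "neVyyi", "newi", "nuyyi", "nUwi",
            "goVyyi", "gowi", "rAyi", "rAwi", "goVrru", "goVrwi");

    static ObliqueClass classify(String stem, boolean human) {
        if (CLASS_III.containsKey(stem)) return ObliqueClass.III;
        if (CLASS_IV.containsKey(stem)) return ObliqueClass.IV;
        if (stem.endsWith("M")) return ObliqueClass.V;
        if (human && stem.matches(".*(du|ru|nu|lu)")) return ObliqueClass.I;
        return ObliqueClass.VI;
    }

    /** Returns the genitive oblique stem; for class V this equals the basic stem. */
    static String obliqueStem(String stem, boolean human) {
        switch (classify(stem, human)) {
            case III: return CLASS_III.get(stem);
            case IV:  return CLASS_IV.get(stem);
            case I:   return stem.substring(0, stem.length() - 1) + "i"; // final u -> i
            default:  return stem;                                       // classes V and VI
        }
    }

    public static void main(String[] args) {
        System.out.println(obliqueStem("mogudu", true));
        System.out.println(obliqueStem("rAyi", false));
        System.out.println(obliqueStem("anna", true));
    }
}
```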
4.4.2.2 Oblique stem in plural
Telugu language has two words for the English word “we”, one exclusive “memu”
which does not include the person who is addressed and one inclusive “manamu”
which includes the person who is addressed. The personal pronouns and a few
demonstrative pronouns like “avi”, “ivi”, “evi” in the plural form do not conform to
any rules for oblique stem formation. They need to be memorized and are listed in
Table 4.14.
Plural Nominative    Plural Oblique
memu/manamu          mA/mana
mIru                 mI
avi                  vAti
ivi                  vIti
evi                  veti
Table 4.14: Example Pronouns and their Oblique Stems in the Plural
The plural oblique stem of the common noun and demonstrative pronoun is formed
by uniformly changing “lu” or “lYlYu” to “la” or “lYlYa”. The oblique suffix is
“a” added to the plural stem. In Sandhi the final “u” of the plural stem is lost
before “a”.
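Since the plural oblique rule is uniform, it reduces to replacing the final “u” of the plural form by “a”. The following Java sketch shows this under the assumption that the input is already a plural ending in “lu” or “lYlYu”; the class and method names are hypothetical.

```java
// Illustrative sketch only; names are hypothetical.
final class PluralObliqueSketch {

    /** lu -> la and lYlYu -> lYlYa: the final u is lost before the oblique suffix a. */
    static String pluralOblique(String pluralForm) {
        if (!pluralForm.endsWith("lu") && !pluralForm.endsWith("lYu")) {
            throw new IllegalArgumentException("expected a plural in lu or lYlYu: " + pluralForm);
        }
        return pluralForm.substring(0, pluralForm.length() - 1) + "a";
    }

    public static void main(String[] args) {
        System.out.println(pluralOblique("Avulu"));
        System.out.println(pluralOblique("ilYlYu"));
    }
}
```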
4.4.2.3 Oblique Stems of Proper Nouns
The oblique stem of proper nouns both singular and plural is formed in the same way
as those of common nouns. Table 4.15 lists some example proper nouns and their
oblique stems.
Proper Noun                     Oblique Stem
rAmudu (Rama)                   rAmudi (Rama’s)
subbArAvu (Subba Rao)           subbArAvu (Subba Rao’s)
AMXrulu (people from Andhra)    AMXrula (of the people from Andhra)
Table 4.15: Example Proper Noun Oblique Stems
4.4.3 Case Marker Agglutination
In Indian Languages postpositions (case markers) serve the purpose of prepositions
in English. Postpositions which express spatial or temporal relations or mark various
semantic roles establish some grammatical relations between the nouns which they
follow and the verbs of the sentence. In Telugu postpositions are added to the
oblique stems in both singular and plural forms.
Postpositions in Telugu can be classified into two types namely Type-1 and Type-2.
Postpositions belonging to Type-1 only occur bound to oblique stems. They never
occur as separate words in a sentence or in combination with other postpositions.
Most commonly used postpositions of this type are listed in Table 4.16.
Postposition Meaning
ni/nu Accusative
ki/ku Dative
kosaM for the sake of
wo with, along with
nuMci/niMci From
a/na/ni in/on/at
kaMteV than, compared
guMdA/xwArA Through
Table 4.16: List of Type-1 Postpositions
Among the Type-1 postpositions the dative and accusative can be grouped into a
subclass, because the morphophonemic changes they exhibit are similar to each other
and different from those of the other Type-1 postpositions.
The accusative and dative postpositions, “nu” and “ku” respectively, take the forms
“ni” and “ki” if the preceding syllable ends in “i” or “I”, except after
single-syllable personal pronouns, as in “nI-ku” (for you) and “mI-ku” (for you in
plural).
Example:
rAmudu (nominative) + ki → rAmudi (oblique) + ki → rAmudiki
The use of the accusative suffix for nouns denoting inanimate objects is optional;
such nouns may take the same form in the accusative as in the nominative.
Example:
amma mAku kaXa ceppiMxi (Mother told us a story)
In the above sentence “kaXa” (story) is an inanimate noun which could have taken
the form “kaXanu”, but as the accusative suffix is optional it takes the nominative
form “kaXa”.
In the singular of nouns ending in “aM”, “AM”, and “eVM” the dative
suffix “ki” and the accusative suffix “ni” are added to variant forms of the oblique
stems. The stems ending in “aM” or “AM” (shown in section 4.4.2.1 class V) have
“Ani” as the variant and stems ending in “eVM” have “eni” as the variant of the
oblique stem.
Example:
gurraM + ki → gurrAni-ki (to the horse)
palYlYeVM + ki → palYlYeni-ki (to the plate)
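The behaviour of the dative and accusative markers described above can be sketched as follows. This is a minimal Java sketch with hypothetical names; it assumes the caller supplies the oblique stem and a flag for the single-syllable personal pronouns, and it always attaches the marker, leaving the optional omission of the accusative for inanimate nouns to the caller.

```java
// Illustrative sketch only; names and the personalPronoun flag are assumptions.
final class DativeAccusativeSketch {

    /** Variant oblique stem used before the dative and accusative markers. */
    static String caseStem(String obliqueStem) {
        if (obliqueStem.endsWith("eVM")) {
            return obliqueStem.substring(0, obliqueStem.length() - 3) + "eni";
        }
        if (obliqueStem.endsWith("aM") || obliqueStem.endsWith("AM")) {
            return obliqueStem.substring(0, obliqueStem.length() - 2) + "Ani";
        }
        return obliqueStem;
    }

    /** consonant is 'k' for the dative (ki/ku) and 'n' for the accusative (ni/nu). */
    static String attach(String obliqueStem, char consonant, boolean personalPronoun) {
        String stem = caseStem(obliqueStem);
        boolean afterI = stem.endsWith("i") || stem.endsWith("I");
        String vowel = (afterI && !personalPronoun) ? "i" : "u";
        return stem + consonant + vowel;
    }

    public static void main(String[] args) {
        System.out.println(attach("rAmudi", 'k', false));
        System.out.println(attach("gurraM", 'k', false));
        System.out.println(attach("nI", 'k', true));
    }
}
```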
Example sentences of other Type-1 postpositions are as follows:
awanu maxrAsu nuMci vaccAdu (He came from Madras)
rAXa jyowikosaM vacciMxi (Radha came for Jyothi)
Postpositions belonging to Type-2 are separate words generally denoting place and
time. Although they sometimes occur as postpositions, they also occur as
independent words, mostly as adverbial nouns. A feature of Type-2 postpositions is
that postpositions of Type-1 can be added to them; for example the Type-1
postpositions “nuMci” and “ki” can be added to the Type-2 postposition “lo” to form
“lonuMci” and “loki”. The most common postpositions of this type are listed in
Table 4.17.
Postposition Meaning
lo In
lopala Inside
mIxa On
kiMxa Under
bEta Outside
xaggara Near
veVnaka Behind
muMxu in front of, before
lA,lAgu,lAgA Like
prakAraM according to
warvAwa After
varaku,xAkA up to, until
eVxuta Opposite
maXyana Between
pakkana by the side of
pAtu for (of time)
vEpu in the direction of, towards
Table 4.17: List of Type-2 Postpositions
Example sentences of Type-2 postpositions are as follows:
mA illu narasApuramlo uMxi (Our house is in Narsapuram)
rAju lopala unnAdu (Raju is inside)
puswakaM ballamIxa uMxi (The book is on the table)
4.5 Genders in Telugu
In Telugu the gender of nouns and pronouns depends on the number. There are two
genders, masculine and non-masculine, when the number is singular. All nouns
denoting male persons belong to the masculine gender and all the others belong to
the non-masculine gender. There is no feminine gender, and all nouns denoting female
persons are treated as non-masculine when the number is singular. There are two
genders, human and non-human, when the number is plural. All nouns denoting male
and female persons belong to the human gender and all others belong to the
non-human gender. The relationship between gender and number is shown in Table 3.1
(Dokkara, Penumathsa, and Sripada, 2015).
As a result the two demonstrative pronouns “axi” (that thing, that lady) and “ixi”
(this thing, this lady) are non-masculine when the number is singular, but when the
number is plural they fall into two genders: human, “vAlYlYu” (those people) and
“vIlYlYu” (these people), and non-human, “avi” (those things) and “ivi” (these
things). In the current work both “axi” and “ixi” are treated as referring to things
and not to female persons, because these words are used to refer to female persons
only in casual talk.
Nouns generally do not have any marker of gender but some words and suffixes are
used to differentiate between male and female sexes. The different nouns that use
suffixes to differentiate between male and female sexes are as follows:
a) Some masculine nouns take the ending “rAlu” to indicate the female sex.
Example:
snehiwudu(male friend)
snehiwurAlu(female friend)
b) Some descriptive words use the suffixes “amma”, “kawwe” to denote female
persons and “ayya”, “kAdu” to denote male persons.
Example:
paMwulu (school master)
paMwulamma (school mistress)
musalayya (old man)
musalamma (old woman)
AtakAdu (male player)
Atakawwe (female player)
c) The words “moVga” (male) and “Ada” (female) are used to distinguish sex in
both human beings and animals.
Example:
moVgapilla
Adapilla
d) Various words are used to distinguish male and female animals and birds.
Example:
kodipuMju (cock)
kodipeVtta (hen)
e) Among pronouns and numerals certain forms are used to distinguish male
and female persons.
Example:
vAdu (he)
AmeV (she)
okkadu (one man)
okawe (one woman)
4.6 Grammatical Persons in Telugu
There are three grammatical persons namely First Person, Second Person, and Third
Person in Telugu. They are used to distinguish between the speaker, the addressee,
and others. The personal pronouns of Telugu language are defined by the
grammatical person.
Verbs in Telugu take a form dependent on the person and number of the subject.
Table 4.18 is a list of verb endings depending on the number and person of the noun
or pronoun.
Number      Person           Verb Ending
Singular    First Person     -nu
Singular    Second Person    -vu
Singular    Third Person     -du/-xi
Plural      First Person     -mu
Plural      Second Person    -ru
Plural      Third Person     -ru/-yi
Table 4.18: Verb endings based on Number and Person
Example sentences of different grammatical persons with their verb endings are as
follows:
nenu annaM winnA-nu (First Person singular)
nuvvu annaM winnA-vu (Second Person singular)
vAdu annaM winnA-du (Third Person Masculine singular)
AmeV annaM winna-xi (Third Person Non-Masculine singular)
memu annaM winnA-mu (First Person plural)
mIru annaM winnA-ru (Second Person plural)
vAlYlYu annaM winnA-ru (Third Person Human plural)
avi annaM winnA-yi (Third Person Non-Human plural)
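The dependence of the verb ending on the person, number and gender passed on by the noun and pronoun morphology engine can be sketched as a lookup. The enum and method names below are hypothetical; the endings themselves are those of Table 4.18 and the example sentences, with the gender deciding between the two third person alternatives.

```java
// Illustrative sketch only; enum and method names are hypothetical.
final class VerbAgreementSketch {

    enum Person { FIRST, SECOND, THIRD }
    enum Count { SINGULAR, PLURAL }
    enum Gender { MASCULINE, NON_MASCULINE, HUMAN, NON_HUMAN }

    static String ending(Person person, Count count, Gender gender) {
        if (count == Count.SINGULAR) {
            switch (person) {
                case FIRST:  return "nu";
                case SECOND: return "vu";
                default:     return gender == Gender.MASCULINE ? "du" : "xi";
            }
        }
        switch (person) {
            case FIRST:  return "mu";
            case SECOND: return "ru";
            default:     return gender == Gender.NON_HUMAN ? "yi" : "ru";
        }
    }

    public static void main(String[] args) {
        // Subject vAdu: third person, singular, masculine.
        System.out.println("winnA-" + ending(Person.THIRD, Count.SINGULAR, Gender.MASCULINE));
        // Subject avi: third person, plural, non-human.
        System.out.println("winnA-" + ending(Person.THIRD, Count.PLURAL, Gender.NON_HUMAN));
    }
}
```

Gender is consulted only in the third person, which matches the singular (masculine, non-masculine) and plural (human, non-human) distinctions of section 4.5.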
4.7 Evaluation
An evaluation of the performance of the morphology engine for nouns and pronouns is
reported here. The evaluation is performed with respect to the Telugu noun list
downloaded from the web resource Telugu Wiktionary at
https://en.wiktionary.org/wiki/Category:Telugu_nouns. A total of 524 nouns were
downloaded to perform the evaluation. The nouns were tested for plural generation,
oblique stem generation and case marker agglutination.
Evaluation was not done for pronouns, as the plural forms and oblique stems of
pronouns are listed in Table 4.6 (section 4.4.1) and Table 4.7 (section 4.4.2.1)
respectively.
Among the 524 downloaded nouns, repeated entries and entries such as “kri . pU”
(Before Christ), which are not suitable for pluralization and oblique form
generation, were removed from the list. A total of 480 nouns were identified as
suitable for the generation of plurals and oblique forms.
The evaluation was performed by giving the nouns as input to the surface realization
engine, because this speeds up the evaluation by allowing a larger number of nouns
to be tested at the same time. The evaluation results for the plural formation of the
nouns are given in Table 4.19.
The evaluation results in Table 4.19 show that no nouns are categorized under Class
VI and Class VII. Class VI consists of stems that end with “Ayi”. The downloaded
list, which consists mainly of proper nouns, does not contain any nouns ending
with “Ayi”, which generally occurs at the end of common nouns like “abbAyi” (boy),
“ammAyi” (girl) etc. Class VII consists of only three stems, “ceVyyi” (hand),
“goVyyi” (pit), and “nuyyi” (water well), which end with “yyi”. All the three stems
are thoroughly tested and one example is shown in section 4.4.1.
Class      No. of nouns identified in each class
I          6
II         63
III        1
IV         2
V          30
VI         0
VII        0
VIII-a     28
VIII-b     1
Regular    349
Table 4.19: Evaluation Results of Plural Formation
The evaluation results for oblique stem formation of the nouns are given in Table
4.20.
Class    No. of nouns identified in each class
I        53
II       0
III      0
IV       0
V        35
VI       392
Table 4.20: Evaluation Results of Oblique Stem Formation
The evaluation results in Table 4.20 show that none of the nouns are categorized
under Class II, Class III, and Class IV. Class II consists of non-human common
nouns ending in “du”, “di”, “ru”, “ri”, “lu”, “li” which get replaced by “ti”. The
downloaded dataset, which consists mainly of proper nouns, does not have any noun
belonging to this class. Some example nouns belonging to this class are listed in
Table 4.9 of section 4.4.2.1. Class III consists of only six stems ending in “nnu”
and “llu”, which are thoroughly tested (shown in Table 4.10 of section 4.4.2.1). The
head of the complement in the example sentence of section 4.2, “iMtiki”, shown in
its root form “illu” in Figure 4.1, also belongs to Class III. Class IV consists of
only seven stems ending in “yi” and “rru”, which are also tested thoroughly (shown
in Table 4.11 of section 4.4.2.1).
Case marker generation consists of generating the oblique form and then joining the
case marker to that form. For the nouns classified in Table 4.20 our engine
generates the correct oblique forms, and our evaluation of case marker agglutination
shows that for all the oblique forms our engine produces the correct surface noun
forms.
The results show that all the 480 nouns were identified as belonging to different
classes, but in both the plural formation and the oblique stem formation a few
classes do not have any nouns in them. The list consists mainly of proper nouns and
very few common nouns, because of which the nouns are not evenly distributed over
all the classes.
4.8 Summary
The morphology engine for nouns and pronouns described in this chapter is, like the
verb, adverb and adjective morphology engines, a separate module of a surface
realization engine for generating well-formed sentences in Telugu. It generates the
inflected noun and pronoun forms and passes on the person, number and gender
features required by the verb morphology engine.
-----------------------CHAPTER-5
MORPHOLOGY OF VERBS
5.1 Introduction
Telugu is a Dravidian language spoken by people from the south Indian states of
Andhra Pradesh and Telangana. It is a morphologically rich free word order
language with nearly 90 million first language speakers. In this chapter we describe
a morphology engine which automatically generates the different forms of verbs in
Telugu. Morphological Analysers (MA) and Morphological Generators (MG) are two
very important parts of Natural Language Processing (NLP) applications like
machine translation systems (Rao et al. 2006) and surface realization engines
(Dokkara et al. 2015). A Morphological Analyser analyses a given word and
processes it into its root along with its grammatical information, whereas a
Morphological Generator, given a root along with its grammatical information,
generates the corresponding word. Morphological Generators play a very
important role in Natural Language Generation (NLG) of free word order languages
like Telugu. In practice it is advantageous to keep the Morphological Generator as a
component separate from the rest of the NLG system (Guido Minnen et al. 2000). The
current work is a separate module of a surface realization engine for
Telugu (Dokkara et al. 2015), a Java application which performs the final subtask of
a Natural Language Generation (NLG) pipeline (Reiter and Dale, 2000). The sentence
realization engine for Telugu is designed following the approach of SimpleNLG (Gatt
and Reiter, 2009), a very popular surface realization engine for English.
The morphological engine described in this chapter is modelled on the morphological
engine for English (Guido Minnen et al. 2000). Because Telugu is a morphologically
rich language the morphology of Telugu verbs and nouns is comparatively more
complex. For example the morphology of Telugu verbs involves defining
morphology for six different classes of verbs. Similarly morphology of nouns
involves defining morphology for seven different classes of nouns with respect to the
grammatical feature number. We therefore have two separate morphological engines,
one for verbs and one for nouns and pronouns. In the implementation, instead of
using tools like Flex or JFlex, we programmed our morphological engine in Java
using the regular expression package. The process of verb morphology depends on
the way in which the verbs are classified. The linguistic classification of verbs in
Telugu into a small number of conjugation types is based on the
morphophonemic changes the verb stems undergo when inflected with tense-mode
suffixes. The model of analysis decides the number of types into which the verbs
can be classified.
In the current work the verb morphological generator does not have an explicit
lexicon or word list but has a computational model based on finite state techniques to
classify all the verbs into a few regular classes and a very small list of words for the
irregular class. The suffixes to be added to the verbs are maintained in separate
XML files and concatenated to the variants of the verb roots to form the final
inflected form.
Morphology has been well studied both by theoretical (Hockett, 1954, Stump 2001)
and computational linguists (Beesley and Kartunnen, 2003, Roark and Sproat, 2007).
From a theoretical perspective, structure of words is explained by the following three
models:
Item and Arrangement model which is a morpheme based morphological approach
in which word forms are analysed as arrangements of morphemes. In this model a
morpheme is treated as the minimal meaningful unit of a language and words are
treated as concatenation of morphemes.
Item and Process model which is lexeme based morphology in which a word form
is assumed to be a result of applying rules that alter a stem to produce a new one. In
this model, inflectional rules, derivational rules, and compounding rules are applied
to a stem to obtain the required word form.
Word and Paradigm model which is a word based morphological approach which
states generalizations that hold between the different forms of inflectional
paradigms.
From a computational perspective though, the three theoretical models described
above have been shown to offer no significant computational advantage over the finite
state approaches that have been widely applied to building MAs and MGs for a
diverse range of languages (Guido Minnen et al. 2000, Karttunen, 2003, Beesley and
Karttunen, 2003, Karttunen and Beesley, 2005, Roark and Sproat, 2007). Therefore,
in the current work we apply finite state techniques to Telugu morphology. Amongst
Indian languages, PJ Antony and Soman (2012) reported the highest number of
morphology tools for Tamil. According to their survey a wide range of approaches exists,
from corpus based through suffix stripping to finite state. A database approach
is described by Goyal and Lehal (2011), who store all the word forms in a
relational database. For Telugu, Rao et al. (2011) describe a word and
paradigm based morphological analyser and generator. An item and arrangement
based morphological generator for Telugu is described in (Sribadrinarayan et al 2009).
A rule based (item and process based) morphological generator for Telugu is
described in (Ganapatiraju et al 2006).
5.2 Inputs to the Morphology Engine
The verb morphology engine in the current work is part of a surface realization
engine which is responsible for automatic generation of grammatically well-formed
Telugu sentences. The input for the surface realization engine is an XML file which
has all the grammatical information required both at the sentence level and word
level. Figure 5.1 shows an example XML specification corresponding to the Telugu
sentence given below in WX-notation.
sIwa rAmudini piliciMxi. (Sita called Rama.)
<?xml version="1.0" encoding="UTF8" standalone="no"?>

<document>
<sentence type=" " predicatetype="verbal" respect="no">
casemarker=" " stem="basic">sIwa</head>
</nounphrase>
<head pos="noun" gender="masculine" number="singular" person="third"
casemarker="ni" stem="basic"> rAmudu</head>
</nounphrase>
<head pos="verb" tensemode="pasttense"> piluc</head>
</verbphrase>
</sentence>
</document>
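The way the verb entry of such a specification could be read is sketched below. This is only an illustration under stated assumptions: the class name VerbInputReader and the method readVerb are hypothetical, the sketch uses the standard Java DOM parser rather than whatever parser the engine actually uses, and it assumes a well-formed input that contains a head element with pos="verb".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not taken from the thesis code) that reads the verb entry. */
final class VerbInputReader {

    /** Returns {lemma, tensemode} for the first head element with pos="verb". */
    static String[] readVerb(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList heads = doc.getElementsByTagName("head");
            for (int i = 0; i < heads.getLength(); i++) {
                Element head = (Element) heads.item(i);
                if ("verb".equals(head.getAttribute("pos"))) {
                    // Trim the blank that precedes the lemma in the specification.
                    return new String[] {head.getTextContent().trim(),
                                         head.getAttribute("tensemode")};
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input specification", e);
        }
        throw new IllegalArgumentException("no verb head found");
    }
}
```

For the verb entry of the example sentence, the sketch would return the lemma "piluc" together with the tense mode "pasttense".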
5.3 Morphological Generation Process
The current chapter describes a computational model based on finite state techniques
and XML files. The computational model is a Java application which uses the
“java.util.regex” package. The input to the computational model is the verb lemma,
the tense mode of the verb, the PNG (person, number and gender) of the subject and the
case marker of the subject. The “Pattern” class of the “regex” package has a method
“matches” which creates a finite state automaton for a given regular expression to
identify the class to which a given verb lemma belongs. The computational model
also computes the final constituent of the stem in the inflected verb and finally
concatenates it with the required suffixes extracted from the XML files.
5.4 Verb Forms
The input for the verb in the example XML specification of Figure 5.1 is as follows:
<head pos="verb" tensemode="pasttense"> piluc</head>
The first attribute is “pos” which stands for part of speech and the second attribute is
“tensemode”.
Verbs in Indian languages inflect for tense, aspect, modality (mode), and PNG
(person, number and gender) endings. The verbs co-occur with tense, aspect, and
modality in most of the languages whereas aspect and modality are packed into a
single verbal inflection word in Telugu and referred to as “tensemode” in the current
work. There are a total of 18 verb forms including both finite and non-finite forms
which are of importance in Telugu. Our morphological engine has the capacity to
generate all the verb forms, but only the finite forms are used by the surface
realization engine and therefore we discuss the finite forms in detail.
5.4.1 The Imperative
The imperative verbs are used to express a command or a request. The meaning of
the imperative verb takes the form of a command in the singular and a request in the
plural. The imperative forms of the verb are only used when the first person in the
singular addresses the second person either in the singular or in the plural. Therefore,
the imperative forms carry two suffixes. In the case of negative imperative the
second person suffix is added to the verb root + “ak” (negative imperative suffix).
The imperative suffixes are as shown in Table 5.1.
Form          II Person Singular        II Person Plural
Affirmative   u (in some cases “i”)     aMdi
Negative      ak-u                      ak-aMdi
Table 5.1: Imperative Suffixes
Principles for the formation of the imperative verbs are as follows:
a) The verb stem variants used before the imperative suffixes and in the negative
tense are the same.
b) The rules of stem final vowel loss and harmony (i.e. change of medial “u” to
“a” when followed by “a”) apply to imperative verbs.
Example:
pAdu (sing) + aMdi → pAdaMdi (request to sing)
c) Stems ending in “s” preceded by a long vowel change “s” to “y” in the
imperative mood. These stems optionally add the suffix “i” instead of “u” in
the singular. When the “i” suffix is added the stem vowel is optionally
shortened and “y” becomes “yy”.
Example:
ces (do) + u → ceVyyi (command to do)
ces (do) + aMdi → ceVyyaMdi (request to do)
Exception:
When the stem vowel is “A” it is not shortened.
Example:
rAs (write) + u → rAyi (command to write)
d) In the case of basic stems having two syllables and ending in “c” or “s” the final
consonant is replaced by “v” before the imperative suffix.
Example:
piluc (call) + u → piluvu (command to call)
kalus (meet) + aMdi → kalavaMdi (request to meet)
e) When the stem variant ends in a long vowel the beginning of the imperative
suffixes is dropped.
Example:
rA (come) + u → rA (command to come)
f) One irregular verb in the imperative is “pax-a” (go). The last “a” here is
treated as the imperative suffix.
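Principles (b) and (d) above can be sketched in Java as follows. The sketch is deliberately partial and is not the thesis implementation: the class name ImperativeSketch is hypothetical, principles (c), (e) and (f) are not covered, and the test for "two syllables ending in c or s" is approximated by a stem that ends in "u" followed by "c" or "s".

```java
/** Illustrative sketch of imperative principles (b) and (d) only. */
final class ImperativeSketch {

    static String imperative(String stem, boolean plural) {
        String suffix = plural ? "aMdi" : "u";
        String s = stem;
        if (s.matches(".*u[cs]")) {
            // (d) final "c" or "s" is replaced by "v" before the imperative suffix.
            s = s.substring(0, s.length() - 1) + "v";
        } else if (s.endsWith("u")) {
            // (b) stem final vowel loss before the vowel-initial suffix.
            s = s.substring(0, s.length() - 1);
        }
        if (suffix.startsWith("a") && s.matches(".*u[^aAiIuUeEoO]")) {
            // (b) harmony: medial "u" becomes "a" when followed by "a".
            s = s.substring(0, s.length() - 2) + "a" + s.charAt(s.length() - 1);
        }
        return s + suffix;
    }
}
```

Under these assumptions the sketch reproduces the three examples given for principles (b) and (d): pAdaMdi, piluvu and kalavaMdi.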
5.4.2 The Abusive
Many verbs cannot occur in this mood due to semantic restrictions. A few verbs like
“kAlu” (to burn), “kUlu” (to fall), “cAvu” (to die), “pagulu” (to break) etc., occur in
this mood. Some example sentences using abusive verb forms are as follows:
nI illu kUla (May your house fall)
nI kadupu kAla (May your womb (children) burn)
nI mokaM pagala (May your face break)
5.4.3 The Obligative
The obligative is formed by adding the finite or perfective form of a defective verb
“vAl” to the infinitive of a main verb. The finite form of this verb in the future
habitual tense is “vAli” (must). Some example sentences using Obligative verb
forms are as follows:
nenu iMtiki veVlYlYAli (I need to go home)
mIru mA Uru rAvAli (You should come to our town)
The perfective participle of “vAl” is “vAlsi”, which is inflected only in the non-masculine
singular. The obligative verb does not agree with the subject in person, gender, or
number. It always occurs in the third person non-masculine singular or without any
personal suffix.
Example:
mIru gudiki rAvAlsiMxi (You must have come to the temple)
5.4.4 The Future Habitual
The future habitual tense in Telugu can express an action or a state that will take
place in the future or an action or state that is habitual. The sentence “nenu annaM
wiMtAnu” can mean either ‘I will eat food’ or ‘I eat food’. Principles for the
formation of Future Habitual Tense:
a) The basic tense suffix for the future habitual tense is “wA/wun”.
b) Verb stems like “ammu” (sell) and “adugu” (ask) occur unchanged before
the tense suffix.
Example:
ammu (sell) + wA → ammuwA (will sell)
c) In the case of basic stems ending in “s” or a long vowel the tense suffix is
added directly.
Example:
kalus (meet) + wA → kaluswA (will meet)
d) In the case of basic stems ending in “n” the tense suffix changes to “tA/tun”.
Example:
win + tA → wiMtA
e) Single syllable stems ending in “tt” (koVttu, beat) and “pp” (ceVppu, tell)
change these to “da” (koVdawA, will beat) and “bu” (ceVbuwA, will tell)
respectively before the tense suffix “wA”, and to “du” (koVduwuMtA,
beats) and “bu” (ceVbuwuMtA, tells) respectively before the tense suffix
“wuM”.
f) Stems ending in “c”, “cc” or “Mc” change those elements to “s” before the
tense suffix.
Example:
piluc (call) + wA → piluswA (will call)
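A minimal Java sketch of principles (b), (c), (d) and (f) is given below. It is an illustration rather than the engine's code: the class name FutureHabitualSketch is hypothetical, principle (e) is left out, the change of stem final “n” to “M” is read off the example in (d), and the personal suffix is not attached.

```java
/** Illustrative sketch of the future habitual stem plus tense suffix. */
final class FutureHabitualSketch {

    static String futureHabitual(String stem) {
        if (stem.matches(".*(Mc|cc|c)")) {
            // (f) final "c", "cc" or "Mc" becomes "s" before the tense suffix.
            return stem.replaceFirst("(Mc|cc|c)$", "s") + "wA";
        }
        if (stem.endsWith("n")) {
            // (d) stems in "n" take "tA"; the "n" surfaces as "M" as in win + tA.
            return stem.substring(0, stem.length() - 1) + "M" + "tA";
        }
        // (b), (c) all other stems covered here take "wA" unchanged.
        return stem + "wA";
    }
}
```

With these rules the four examples of this section (ammuwA, kaluswA, wiMtA and piluswA) are obtained from their basic stems.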
5.4.5 The Past
In Telugu the past tense corresponds to two past tenses in English. For example,
“vaccAnu” in Telugu represents both ‘I came’ and ‘I have come’. The following are
the principles for the formation of the past tense:
a) The tense suffix “e/iM” and the personal suffix are added to the verb stem to
form the past tense.
Example:
piluc + iM → piliciM
b) The stem final “u” before the tense suffix “e/iM” is dropped as a result of
sandhi formation.
Example:
woVdugu + e → woVdigA
c) A non-initial “u” in the stem becomes “i” when the past tense suffix is added.
Example:
piluc + e → pilicA
d) Verb stems that end in a short vowel + n generally have “nA” as
the past tense suffix, but the third person feminine singular has “na” as the past
tense suffix.
Example:
win + nA → winnA
e) The past tense suffix for the verb stem “pad” (fall) is “dA”, but in the case of the
third person feminine singular it is “da”.
Example:
pad + dA → paddA
f) The verb stem ending in “s” becomes “S” in some cases when the past tense
suffix follows.
Example:
kalus + e → kaliSA
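The stem changes of principles (b) and (c) can be sketched as below. This is a simplified illustration, not the engine's code: the class name PastStemSketch is hypothetical, "non-initial" is interpreted as "after the first vowel of the stem", and principles (d) to (f) as well as the choice between the suffixes “e” and “iM” are not modelled.

```java
/** Illustrative sketch of the past tense stem changes (b) and (c). */
final class PastStemSketch {

    private static final String VOWELS = "aAiIuUeEoO";

    static String pastStem(String stem) {
        // (b) drop the stem final "u".
        String s = stem.endsWith("u") ? stem.substring(0, stem.length() - 1) : stem;
        // Locate the first vowel so that an initial-syllable "u" is left untouched.
        int first = -1;
        for (int i = 0; i < s.length(); i++) {
            if (VOWELS.indexOf(s.charAt(i)) >= 0) {
                first = i;
                break;
            }
        }
        // (c) every later "u" becomes "i".
        StringBuilder out = new StringBuilder(s);
        for (int i = first + 1; i < out.length(); i++) {
            if (out.charAt(i) == 'u') {
                out.setCharAt(i, 'i');
            }
        }
        return out.toString();
    }
}
```

Appending the tense suffix “iM” to the result for “piluc” gives “piliciM”, the form shown in principle (a).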
5.4.6 The Hortative
An imperative verb that includes the speaker is called the hortative verb. In Telugu
the hortative verb is formed by adding to the verb stem the hortative suffix “xA”
followed by the first person plural suffix “mu/M”. The hortative form also conveys a future
meaning involving both the addresser and the addressed.
Principles for forming the hortative verb form are as follows:
a) The hortative form is obtained by adding the hortative suffix, followed by the
first person plural suffix, to the verb stem of the habitual future.
Example:
ammu (sell) + xA-M → ammuxAM ((we) will sell)
b) Future habitual tense forms ending in “c” and “s” change these
to “d” in the hortative.
Example:
piluc (call) + xA-M → pil-ux + xA-M → piluxxAM ((we) will call) (Table 5.3 class IIa1)
5.4.7 The Negative Finite
In Telugu negation is expressed by a separate verb paradigm rather than
by a separate word of negation as in most languages. The negative verbs are
in the future habitual tense and negate the affirmative verb occurring in that tense.
Some example sentences using the negative finite verb forms are as follows:
nenu annaM winanu (I will not eat food)
vAdu iMtiki rAdu (He will not come home)
The negative suffix in Telugu is “a”. It occurs after the verb root and before the
personal suffix in the verb. The personal suffixes in the negative tense are the same as in the
other tenses, except that the third person singular non-masculine suffix “xi” and the third
person plural non-human suffix “yi” become “xu” and “vu” respectively.
Principles for the formation of the negative tense:
a) The negative tense is formed by adding the negative tense suffix “a” to the
basic stem followed by the personal suffix.
Example:
win (eat) + a → win-a (will not eat)
b) The medial “u” of basic stems having two or more syllables of the form
(C)VCuC(u) changes to “a” when followed by the negative suffix.
Example:
woVdugu (wear) + a → woVdag-a (will not wear)
c) A number of basic stems ending in “c” or “s” replace these consonants by “v” or
“y” in the negative tense, as in the case of the imperative.
Example:
ces (do) + a → ceVyya (will not do)
piluc (call) + a → piluva (will not call)
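Principle (a), together with the change of the personal suffixes “xi” and “yi” in the negative tense, can be sketched as follows. The class name NegativeSketch is hypothetical, and the stem changes of principles (b) and (c) are assumed to have been applied already, so the sketch only shows the order of the constituents.

```java
/** Illustrative sketch: basic stem + negative suffix "a" + personal suffix. */
final class NegativeSketch {

    static String negative(String stem, String personalSuffix) {
        String p = personalSuffix;
        if (p.equals("xi")) {
            p = "xu"; // third person singular non-masculine
        } else if (p.equals("yi")) {
            p = "vu"; // third person plural non-human
        }
        return stem + "a" + p;
    }
}
```

For the stem “win” and the first person singular suffix “nu” this yields “winanu”, the form used in the example sentence of this section.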
5.4.8 The Durative
The durative verb is not a regular finite verb like the other finite forms discussed
earlier. The durative verb is a compound verb, as at least two verb roots are involved
in its construction (the main verb and “un”).
Telugu does not distinguish present, past and perfect continuous tenses as
English does. The distinction is shown by the use of an adverb of time or only by the context of
discourse. In the absence of time specifying clues the durative verb carries the
present continuous meaning. The durative verb is formed by adding to the basic verb
stem the durative suffix “w/t” followed by “un” in its finite form.
The principles for the formation of the durative finite verb are:
a) In the case of verb stems ending in a short vowel followed by “n” the durative
suffix is “t”. The durative verb is “basic stem + t + finite form of un”.
Example:
vin + t + un → vin-t-un (hearing)
b) In all the other cases the durative suffix is “w”. The durative verb is “basic
stem + w + finite form of un”.
Example:
cus + w + un → cus-w-un (seeing)
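The choice between the two durative suffixes can be sketched with a single regular expression. The class name DurativeSketch is hypothetical, and the set of short vowels (a, i, u, eV, oV in WX-notation) is an assumption of this sketch; the finite form of “un” is represented by the bare root.

```java
import java.util.regex.Pattern;

/** Illustrative sketch of the durative suffix selection. */
final class DurativeSketch {

    // Assumed short vowels in WX-notation: a, i, u, eV, oV, followed by final "n".
    private static final Pattern SHORT_VOWEL_N = Pattern.compile(".*(?:[aiu]|[eo]V)n");

    static String durativeSuffix(String stem) {
        return SHORT_VOWEL_N.matcher(stem).matches() ? "t" : "w";
    }

    /** Basic stem + durative suffix + the root "un" (not yet inflected). */
    static String durativeBase(String stem) {
        return stem + durativeSuffix(stem) + "un";
    }
}
```

A stem such as “nAn”, whose final “n” follows a long vowel, falls under principle (b) and takes “w” in this sketch.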
5.5 Verb Morphology Engine
In the current work the verb root “piluc” of the example in Figure 5.1 becomes
“piliciMxi” after going through a few steps. The steps the verb root undergoes to get
the required inflection are as follows:
a) Identification of the morphophonemic group based on the tense mode.
b) Identification of the verb inflectional class of the given verb.
c) Extraction of the phonetic alternations based on the morphophonemic group
and the verb inflection class.
d) Extraction of the tense mode suffix.
e) Extraction of the personal suffix based on person, gender, number of the
subject.
f) Formation of the final inflected verb by concatenating the extracted
constituents to the verb root.
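Steps (a) to (f) can be sketched end to end for the worked example. The sketch below is restricted to Class IIa1 verbs in the past tense; the class name VerbPipelineSketch is hypothetical, the small maps are hard-coded stand-ins for the XML lookups of the engine, and the personal suffix of step (e) is supplied by the caller.

```java
import java.util.Map;

/** Illustrative sketch of steps (a) to (f) for a Class IIa1 verb in the past tense. */
final class VerbPipelineSketch {

    // Hard-coded stand-ins for the XML files used by the engine.
    private static final Map<String, String> GROUP_OF_TENSE_MODE = Map.of("pasttense", "B");
    private static final Map<String, String> ALTERNATION_IIA1 = Map.of("B", "ic");
    private static final Map<String, String> TENSE_SUFFIX = Map.of("pasttense", "iM");

    static String inflect(String root, String tenseMode, String personalSuffix) {
        String group = GROUP_OF_TENSE_MODE.get(tenseMode);              // step (a)
        if (group == null || !root.matches(".*[^aAiIuUeEoO]uc")) {      // step (b)
            throw new IllegalArgumentException("outside the scope of this sketch");
        }
        String stem = root.substring(0, root.length() - 2)
                + ALTERNATION_IIA1.get(group);                          // step (c)
        String tenseSuffix = TENSE_SUFFIX.get(tenseMode);               // step (d)
        return stem + tenseSuffix + personalSuffix;                     // steps (e), (f)
    }
}
```

Called with the root “piluc”, the tense mode “pasttense” and the personal suffix “xi”, the sketch produces “piliciMxi”.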
5.5.1 Identification of Morphophonemic group
There are three morphophonemic groups, namely A, B and C, in Telugu. In the
current work the morphophonemic group A is divided into three groups, namely
A123, A4, and A5, because the phonetic alternations of certain verb classes are
different for these subgroups of group A. Group C is also divided into two
groups, namely C1-8 and C9, for the same reason. Each of the tense modes in
Telugu belongs to one morphophonemic group. Table 5.2 shows the list of tense
modes and the morphophonemic groups to which they belong. In the example of Figure 5.1 the
tense mode for the verb is specified as “pasttense”. The morphophonemic group for
past tense is identified as group B by looking at Table 5.2. In the current work Table
5.2 is an XML file named “tensemodeidentification.xml” and is used for identifying
the morphophonemic group.
Tense mode                          Morphophonemic Group
Present Participle                  A123
Durative                            A123
Habitual Future                     A123
Conditional                         A4
Hortative                           A5
Past Participle                     B
Past Tense                          B
Past Verbal Adjective               B
Concessive                          B
Future Habitual Verbal Adjective    B
Conditional                         B
Infinitive                          C1-8
Abusive                             C1-8
Negative Tense                      C1-8
Negative Participle                 C1-8
Negative Verbal Adjective           C1-8
Obligative                          C1-8
Negative Imperative                 C1-8
Imperative Plural                   C1-8
Imperative Singular                 C9
Table 5.2. Tense Modes and Morphophonemic Groups
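The lookup performed on Table 5.2 can be sketched as a simple map. In the engine the table is held in “tensemodeidentification.xml”; the in-memory map below is only a stand-in, the class name TenseModeGroups is hypothetical, and every key other than “pasttense” is an assumed spelling of the tense mode name.

```java
import java.util.Map;

/** Illustrative sketch of the tense mode to morphophonemic group lookup. */
final class TenseModeGroups {

    // Only "pasttense" is attested in the input specification; the other keys are assumed.
    private static final Map<String, String> GROUPS = Map.of(
            "durative", "A123",
            "habitualfuture", "A123",
            "hortative", "A5",
            "pasttense", "B",
            "negativetense", "C1-8",
            "imperativesingular", "C9");

    static String groupOf(String tenseMode) {
        String group = GROUPS.get(tenseMode);
        if (group == null) {
            throw new IllegalArgumentException("unknown tense mode: " + tenseMode);
        }
        return group;
    }
}
```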
5.5.2 Identification of verb inflection class
Telugu verbs are divided into six classes (Krishnamurti 1961), of which classes I, II,
III, IV, and V are conjugations of weak (regular) verbs and Class VI consists of
strong (irregular) verbs.
Class I consists of four subclasses which are as follows:
a) Verb bases with three syllables of the form (C1)V1C2V2C3V3 (C stands for
consonant, V stands for vowel, and the occurrence of consonant inside ( ) is
optional) in which “u” occurs as V2 and V3, and C2 is not “c” or “s”.
Example:
woVdugu (to wear), kuduru (to be settled).
b) Disyllabic bases of the form (C1)V1C2V2 or (C1)V1C2C3V2.
Example:
padu (to fall), ekku (to climb)
c) Monosyllabic bases of the form (C1)V1C2 where “n” or “l” occur as the final
consonant.
Example:
nAn (to become wet), cAl (to be sufficient)
d) Disyllabic bases of (C1)V1C2V2C3 type where the final consonant is “l” and
the second vowel is “u”.
Example:
kadul (to move)
Class II consists of two subclasses which are as follows:
a) Disyllabic bases of the (C1)V1C2V2C3 type in which the final consonant is
“c” or “s” and the second vowel is “u”.
Example:
piluc- (to call), wadus- (to get wet)
b) Monosyllabic bases of the (C1)V1C2 type in which the final consonant is “c”
or “s”.
Example:
wis- (to take out), rac- (to smear)
In the implementation of the morphology engine the Class II verbs are further
divided into sub classes. The subclass ‘a’ is further divided into ClassIIa1 and
ClassIIa2 where ClassIIa1 has the final consonant as “c” and ClassIIa2 has the final
consonant as “s”. The subclass ‘b’ is also further divided into two subclasses namely
ClassIIb1, and ClassIIb2.
Class III consists of three sub classes which are defined as follows:
a) A few monosyllabic bases of the form (C1)V1C2 with final “c” belong to this
sub class.
Example:
cAc- (to stretch out)
b) A few stems in final “uc” or “c” belong to this sub class.
Example:
kAluc- (to set fire)
c) A few stems with final “iMc” belong to this sub class.
Example:
wittiMc- (to cause to scold)
Class IV consists of two sub classes which are defined as follows:
a) Monosyllabic bases of the type (C1)V1C2C3- which end in final “tt” or in final
“pp” belong to this sub class.
Example:
kott- (to beat), ceVpp- (to tell or speak).
b) Two monosyllabic bases of the same type, one in final “nn” another in
“lYlY” belong to this sub class.
Example:
wann- (to kick), veVlYlY (to go)
In the current work the ClassIVa sub class is further subdivided into ClassIVa1, and
ClassIVa2.
Each monosyllabic base in ClassIVb is treated as a separate class and ClassIVb
becomes ClassIVb1 and ClassIVb2.
Class V consists of seven monosyllabic bases of type (C1)V1C2 in final “n”.
These bases include an- (to say), kan-1 (to see), kan-2 (to bring forth), kon-
(to buy), win- (to eat), vin- (to hear).
Class VI consists of irregular bases. The irregular bases that belong to this class are
icc- (to give), cacc- (to die), weVcc- (to bring), vacc- (to come), av- (to become), po
(to go), cUc- (to see), leVc- (to rise), le (to be), pax- (to go, depart).
The verb “piluc” in the example of Figure 5.1 is of the form (C1)V1C2V2C3, a
disyllabic base where the final consonant is “c” and the second vowel is “u”. It
belongs to Class IIa. Figure 5.2 is the diagrammatic representation of the finite
automaton created by the computational model for Class IIa. State 0 is the start
state of the finite automaton and state 4 is the final state. The first
consonant C1 is optional and loops back to the same state. In the example of Figure 5.1 the
first consonant is “p”, which the finite automaton takes as input while remaining in
state 0. The finite automaton then takes V1, which is “i”, as input and goes to state 1; at
state 1 it takes C2, which is “l”, as input and goes to state 2; at state 2 it takes V2, which
is “u”, as input and goes to state 3; and finally at state 3 it takes C3, which is “c”, as
input and goes to the final state 4.
Figure 5.2: Finite Automata for Class IIa
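The automaton walked through above can also be written out explicitly. The following sketch is a hand-coded version under simplifying assumptions, whereas the engine itself relies on “Pattern.matches”: the class name ClassIIaAutomaton is hypothetical, every WX symbol is treated as a single character, and any letter that is not one of the listed vowels counts as a consonant.

```java
/** Illustrative hand-coded automaton for Class IIa, states 0 to 4. */
final class ClassIIaAutomaton {

    private static final String VOWELS = "aAiIuUeEoO";

    private static boolean isVowel(char c) {
        return VOWELS.indexOf(c) >= 0;
    }

    private static boolean isConsonant(char c) {
        return Character.isLetter(c) && !isVowel(c);
    }

    static boolean accepts(String lemma) {
        int state = 0; // start state
        for (char c : lemma.toCharArray()) {
            switch (state) {
                case 0: // optional C1 loops on state 0, V1 moves to state 1
                    if (isVowel(c)) {
                        state = 1;
                    } else if (!isConsonant(c)) {
                        return false;
                    }
                    break;
                case 1: // C2
                    if (!isConsonant(c)) {
                        return false;
                    }
                    state = 2;
                    break;
                case 2: // V2 must be "u"
                    if (c != 'u') {
                        return false;
                    }
                    state = 3;
                    break;
                case 3: // C3 must be "c" or "s"
                    if (c != 'c' && c != 's') {
                        return false;
                    }
                    state = 4;
                    break;
                default: // no transitions leave the final state
                    return false;
            }
        }
        return state == 4;
    }
}
```

The sketch accepts “piluc” and “wadus” and rejects a Class I base such as “padu”, which stops in state 3.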
5.5.3 Extraction of Phonetic Alternations
The extraction of phonetic alternations is done based on the verb class and the
morphophonemic group of the specified tense mode. Table 5.3 shows the
phonetic alternations each verb class goes through in the process of generating the
final inflected form of the verb. For the verb “piluc” in the example of
Figure 5.1, Table 5.3 gives the value “pil-ic” at class IIa1 under group B (to which the
tense mode “pasttense” belongs).
In the current work Table 5.3 is implemented in two steps.
1) The required deletions and replacements are performed on the verb root
through the programming logic.
2) The required alternations to be added are extracted from the XML file
“palterations.xml”.
The first step is the Java programming logic which, along with the identification of
the verb class, performs the required deletion to form the variant of the verb which is
the final constituent of the stem in the inflected verb. The fragment of the Java code
which does the required process is presented in Figure 5.3.
                                  Morphophonemic Groups
Class  Basic alternant   Group     Group     Group     Group     Group     Group
       and Example word  A123      A4        A5        B         C1-8      C9
Ia     (C)VCuCu          uCu       uCu       uCu       iC        aC        uC
       adugu             ad-ugu    ad-ugu    ad-ugu    ad-ig     ad-ag     ad-ug
Ib     (C)VC(C)u         u         u         u         -         -         -
       pAdu              pAd-u     pAd-u     pAd-u     pAd       pAd       pAd
Ic     (C)Vn/l           -         -         -         -         -         -
       nAn-              nAn       nAn       nAn       nAn       nAn       nAn
Id     (C)VCul           ul        il        ul        il        al        ul
       kaxul-            kax-ul    kax-il    kax-ul    kax-il    kax-al    kax-ul
IIa1   (C)VCuc           us        is        ux        ic        av        -
       piluc             pil-us    pil-is    pil-ux    pil-ic    pil-av    pil-uc
IIa2   (C)VCus           us        is        ux        is        av        -
       wadus             wad-us    wad-is    wad-ux    wad-is    wad-av    wad-us
IIb1   (C)Vs             s         s         x         s         (V)yy     (V)yy
       wIs               wI-s      wI-s      wI-x      wI-s      wi-yy     wi-yy
IIb2   (C)Vc             s         s         x         c         y         y
       vAc               vA-s      vA-s      vA-x      vA-c      vA-y      vA-y
IIIa   (C)Vc             s         s         x         c         c         c
       kAc               kA-s      kA-s      kA-x      kA-c      kA-c      kA-c
IIIb   (C)VCuc           us        is        ux        c         c         c
       kAluc             kAl-us    kAl-is    kAl-ux    kAlu-c    kAlu-c    kAlu-c
IIIc   .*iMc             is        is        ix        iMc       iMc       iMc
       wittiMc           witt-is   witt-is   witt-ix   witt-iMc  witt-iMc  witt-iMc
IVa1   (C)Vtt            du        di        da        tt        tt        tt
       koVtt             koV-du    koV-di    koV-da    koV-tt    koV-tt    koV-tt
IVa2   (C)Vpp            bu        bi        ba        pp        pp        pp
       ceVpp             ceV-bu    ceV-bi    ceV-ba    ceV-pp    ceV-pp    ceV-pp
IVb1   (C)Vnn            M         M         M         nn        nn        nn
       wann              wa-M      wa-M      wa-M      wa-nn     wa-nn     wa-nn
IVb2   (C)VlYlY          lY        lY        lY        lYlY      lYlY      lYlY
       veVlYlY           veV-lY    veV-lY    veV-lY    veV-lYlY  veV-lYlY  veV-lYlY
V      (C)Vn             n         n         n         n         n         n
       vin               vi-n      vi-n      vi-n      vi-n      vi-n      vi-n
Table 5.3: Phonetic Alternations for Verb Classes
The fragment of code presented in Figure 5.3 deletes the last two letters in the
example word “piluc” as follows:
piluc → pil
In step 2 the alternation “ic” is extracted from the XML file “palterations.xml”.
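// Requires: import java.util.regex.Pattern;
// Class IIa1 has the form (C)VCuc: optional consonant, vowel, consonant, "u", "c".
if (Pattern.matches("[^aAiIIuUeEoOM]?[aAiIIuUeEoOM][^aAiIIuUeEoOM][u][c]", verb)) {
    vclass = "classIIa1";
    // Delete the final "uc"; the alternation is appended in step 2.
    verb = verb.substring(0, verb.length() - 2);
}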
Figure 5.3: Fragment of Java Code for Class IIa1
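The two steps can be combined in a small self-contained sketch for Class IIa1. The alternants are taken from the IIa1 row of Table 5.3; the map is a stand-in for the lookup in “palterations.xml”, whose actual structure is not shown here, and the class name ClassIIa1Alternation, the simplified character classes and the omission of group C9 are assumptions of this sketch.

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Illustrative sketch: deletion (step 1) and alternation lookup (step 2) for Class IIa1. */
final class ClassIIa1Alternation {

    // (C)VCuc with simplified consonant and vowel classes.
    private static final Pattern CLASS_IIA1 =
            Pattern.compile("[^aAiIuUeEoOM]?[aAiIuUeEoO][^aAiIuUeEoOM]uc");

    // Row IIa1 of Table 5.3, standing in for the XML lookup.
    private static final Map<String, String> ALTERNANTS = Map.of(
            "A123", "us", "A4", "is", "A5", "ux", "B", "ic", "C1-8", "av");

    static String stemVariant(String lemma, String group) {
        if (!CLASS_IIA1.matcher(lemma).matches()) {
            throw new IllegalArgumentException("not a Class IIa1 verb: " + lemma);
        }
        String alternant = ALTERNANTS.get(group);
        if (alternant == null) {
            throw new IllegalArgumentException("group not covered: " + group);
        }
        String root = lemma.substring(0, lemma.length() - 2); // step 1: piluc -> pil
        return root + alternant;                              // step 2: pil + ic
    }
}
```

For “piluc” in group B the sketch returns “pilic”, matching the entry “pil-ic” of Table 5.3.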
5.5.4 Extraction of Tense mode Suffix
Tense mode suffixes are those suffixes which are agglutinated to the verb based on
the tense mode.
Morphophonemic Criteria
The tense-mode suffixes in Telugu language are based on the morphophonemic
groups namely A, B, and C.
Group A Suffixes beginning with consonants (w, t, x, d)
Group A123
Grammatical Name                    Suffix Alternant
Durative Participle                 wu/tu
Durative                            w/t
Habitual Future                     wA/tA/wun/tun
Table 5.4: Suffix Alternant Table for Group A123
Group A4
Grammatical Name                    Suffix Alternant
Conditional                         we/te
Table 5.5: Suffix Alternant Table for group A4
Group A5
Grammatical Name                    Suffix Alternant
Hortative                           xA
Table 5.6: Suffix Alternant Table for group A5
Group B Suffixes which begin with a front vowel (i, eV, e)
Grammatical Name                    Suffix Alternant
Past Participle                     i
Past Tense                          e/iM/nA/dA
Past Verbal Adjective               ina/na
Concessive                          inA/nA
Future Habitual Verbal Adjective    e
Conditional                         iwe
Table 5.7: Suffix Alternant Table for Group B
Group C Suffixes which begin with a back vowel (a, A, u)
Group C1-8
Grammatical Name                    Suffix Alternant
Infinitive                          a/an/-
Abusive                             a/nu
Negative Tense                      a/-
Negative Participle                 aka/ka, akunda/kunda
Negative Verbal Adjective           ani/ni
Obligative                          Ali
Negative Imperative                 aku/ku
Imperative Plural                   aMdi/ndi
Table 5.8: Suffix Alternant Table for Group C1-8
Group C9
Grammatical Name                    Suffix Alternant
Imperative Singular                 u/i/-
Table 5.9: Suffix Alternant Table for Group C9
In the case of the verb “piluc” in the example of Figure 5.1 the tense mode being
“pasttense” which belongs to Group B and the subject being “nonmasculine” the
tense mode suffix is “iM”.
5.5.5 Extraction of Personal Suffix
Telugu verbs inflect to encode the gender, number, and person of the subject. In
the current work the morphology engine gets the information about the attributes of the
subject and uses that information to agglutinate the personal suffix
and the tense mode suffix to the verb. The eight personal suffixes of the finite verb for
different persons and numbers can be listed as follows:
Person        Singular                 Plural
1st person    -nu                      -mu
2nd person    -vu                      -ru
3rd person    -du (masculine)          -ru (human)
3rd person    -xi (non-masculine)      -yi (non-human)
Table 5.10: Personal Suffixes
In the case of the verb “piluc” in the example of Figure 5.1 the subject “sIwa” is non-
masculine, singular, 3rd person which means the personal suffix is “xi” from the
above Table 5.10.
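The lookup in Table 5.10 can be sketched as a small function. The names PersonalSuffixes, Person and the boolean parameters are hypothetical; in particular, using a single flag for "masculine" in the singular and "human" in the plural is a simplification made for this sketch.

```java
/** Illustrative sketch of the personal suffix lookup of Table 5.10. */
final class PersonalSuffixes {

    enum Person { FIRST, SECOND, THIRD }

    /**
     * @param masculineOrHuman only used for the third person: masculine in the
     *                         singular, human in the plural
     */
    static String suffix(Person person, boolean plural, boolean masculineOrHuman) {
        switch (person) {
            case FIRST:
                return plural ? "mu" : "nu";
            case SECOND:
                return plural ? "ru" : "vu";
            default:
                if (plural) {
                    return masculineOrHuman ? "ru" : "yi";
                }
                return masculineOrHuman ? "du" : "xi";
        }
    }
}
```

For the subject “sIwa” (third person, singular, non-masculine) the sketch returns “xi”.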
5.5.6 Formation of the Final Inflected Verb
The final inflected verb is formed by concatenation of all the strings formed from
section 5.5.3 to section 5.5.5.
Final verb → verb + phonetic alternation + tense mode suffix + personal suffix
In the case of the verb “piluc” in the example of Figure 5.1:
Final verb → pil + ic + iM + xi, which is piliciMxi.
5.6 Evaluation
We report an evaluation of the accuracy of our Telugu Verb Morphology engine
with respect to the Telugu verb database downloaded from the Telugu Wiktionary at
https://en.wiktionary.org/wiki/Category:Telugu_verbs. We have run the verbs in the
database by giving them as input to the surface realizer rather than the morphology
engine separately because we wanted to test them with various alternatives of the
subject with respect to person, number and gender.
A total of 503 verbs were downloaded and the evaluation was performed. The verbs
were tested for habitual future, durative and past tense. The most important part of
the evaluation was categorizing the verbs into different verb classes based on our
reference grammar book (Bh Krishnamurti 1985). The results of the evaluation are
given in Table 5.11.
The results show that 418 verbs were identified as belonging to different classes
and that the different verb forms were generated for them without any errors. The results
also show that 85 verbs were not recognized as belonging to any verb class
according to the current work. Among the 85 words which were not identified as
belonging to the verb classes are words like “pilupu” which are not considered
verbs according to our grammar reference. Some of the words end with “agu”, which
means “to become”, but our grammar reference considers only “avu” as the verb
meaning “to become”, and we did not consider these two to be the same;
otherwise the number of failed verbs would have been reduced by 20. We intend to
use these evaluation results to drive the development of the morphology engine and to
extend its coverage.
Class     No. of verbs identified in each class
Ia        33
Ib        118
Ic        2
Id        1
II a1     10
II a2     0
II b1     24
II b2     0
III a     4
III b     6
III c     141
IV a1     26
IV a2     3
IV b1     4
IV b2     0
V         38
VI        8
Table 5.11: Evaluation Results for Verb Formation
5.7 Summary
This chapter describes a morphology engine for Telugu verbs based on finite state
techniques. The different verb forms and the different verb classes are described in
the chapter. The process of the formation of inflected verb is also described in detail
in the chapter. The verb morphological engine plays a very important role in the
surface realization engine.
-----------------------CHAPTER-6
SENTENCE FORMATION
6.1 Introduction
The grammatical operations of a language like Telugu are based on the
category of the words rather than the structure of its constituents. The category of a
word can be pronoun, noun, verb, adjective, adverb etc. The words are grouped to
form more syntactically relevant categories called phrases or word groups. The
phrases or word groups form sentences in the language. Sentence formation in
Telugu is the major purpose of the current thesis. The responsibility for sentence
formation in the current work lies with the SentenceBuilder module. The
SentenceBuilder module has centralized control over all the other
modules in the surface realization engine. The SentenceBuilder performs the
following tasks in the process of sentence generation.
1) It identifies the constituents of the input with predefined grammatical functions
such as subject, object, complement and verb.
2) It calls the appropriate element builders for the different grammatical functions
to create element objects that store all the information received from the
input XML file.
3) It passes the element objects to the appropriate phrase builders to generate
phrases according to the requirements specified in the input. The phrases are
received back from the phrase builders in the form of strings.
4) It applies agreement rules to the grammatical functions to form the
complete sentence.
5) It preserves the word order in which the sentence constituents were given in the
XML input file.
6) It sends the sentence to the output generator to produce the sentence in
Telugu from WX-notation, a transliteration scheme for representing
Indian languages in ASCII.
6.2 Identification of Grammatical Functions
Grammatical Functions are syntactic roles played by words or phrases in the context
of a particular sentence. In the current work we have four grammatical functions
namely subject, object, complement and verb. The grammatical function is
represented through a sentence level feature called role. The grammatical function is
used to identify the noun phrase which plays the role of subject so that verb
agreement becomes easy. One more advantage of using grammatical functions
in the current work is that it facilitates the free word order of sentence constituents
which is required in Telugu. The grammatical functions are identified by
the SentenceBuilder module by looking for the sentence level feature called role.
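The use of the role feature can be sketched as follows. The types and names below (RoleSketch, Constituent, findSubject, linearize) are hypothetical and do not reproduce the SentenceBuilder code; the sketch only illustrates that the subject is found through its role rather than its position, so that the input word order can be kept as given.

```java
import java.util.List;

/** Illustrative sketch: role-based subject identification with free word order. */
final class RoleSketch {

    /** Hypothetical stand-in for a sentence constituent carrying the role feature. */
    static final class Constituent {
        final String role;
        final String text;

        Constituent(String role, String text) {
            this.role = role;
            this.text = text;
        }
    }

    /** Finds the subject by its role, wherever it occurs in the input. */
    static Constituent findSubject(List<Constituent> constituents) {
        for (Constituent c : constituents) {
            if ("subject".equals(c.role)) {
                return c;
            }
        }
        throw new IllegalArgumentException("no constituent with role subject");
    }

    /** Joins the constituents in the order in which they were given. */
    static String linearize(List<Constituent> constituents) {
        StringBuilder sentence = new StringBuilder();
        for (Constituent c : constituents) {
            if (sentence.length() > 0) {
                sentence.append(' ');
            }
            sentence.append(c.text);
        }
        return sentence.toString();
    }
}
```

With the object placed before the subject in the input, the subject is still identified for verb agreement and the given order is left unchanged in the output.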
6.3 Element Builder
The ElementBuilder module in the current work is used only to organize the application
better and to facilitate communication among its modules.
Each grammatical function has a specific element builder. There are four element
builders, namely SOCElementBuilder, VAElementBuilder,
AdjectiveElementBuilder, and AdverbElementBuilder. The purpose of the element
builders is to convert the lexicalized input provided in the input XML file into
element objects. Element objects are objects of grammatical functions with their
grammatical features as instance variables. There are four element objects, namely
SOCElement, VAElement, AdjectiveElement, and AdverbElement. The
SentenceBuilder calls the appropriate element builder to construct the required
element object and return it to the SentenceBuilder. The element objects are passed
to the phrase builders to get back strings as phrases.
6.4 Phrase Formation
In Telugu the primary meaning of a sentence is expressed in the form of
word groups or phrases. The order in which the phrases are grouped in a sentence is
relatively free in Telugu when compared to languages like English. In the current
work the module that takes care of the formation of a phrase is the Phrase Builder
module, to which the SentenceBuilder module sends the objects to be
generated as phrases. The Phrase Builder can be NounPhraseBuilder,
AdjectivePhraseBuilder, VerbPhraseBuilder, or AdverbPhraseBuilder because noun
phrase, adjective phrase, verb phrase, and adverb phrase are the four main types of
phrases in Telugu. In Telugu the phrases generally exist as head-modifier
phrases and therefore in the current work the phrases are head-modifier
phrases. The noun phrase and the verb phrase play a very important role in the
formation of a sentence as the head of a head-modifier phrase. The adjective phrase
and the adverb phrase play the secondary role of being a modifier of the phrase head.
In the following sections a detailed description of the construction of all the
phrases is provided along with the morphology of the adjectives and the adverbs.
6.4.1 Noun Phrase
The noun phrases are composed of one or more nouns or pronouns or a noun
modified by one or more adjectives. Every noun phrase has an identifiable head, a
noun or a pronoun. In the current work the noun phrase is a very simple phrase
which has a head and an optional modifier which is an adjective. The construction of
the noun phrase is the responsibility of the NounPhraseBuilder.
sIwa aMxamEna pAta gattigA pAdiMxi. (Sita sang a beautiful song loudly)
The XML file that generates the above sentence is as follows:
<?xml version="1.0" encoding="UTF8" standalone="no"?>
<document>
<sentence type=" " predicatetype="verbal" respect="no">
casemarker=" " stem="basic"> sIwa</head>
</nounphrase>
<modifier pos="adjective" type="descriptive" suffix="aEna"> aMxamu</modifier>
casemarker="" stem="basic"> pAta</head>
</nounphrase>
<modifier pos="adverb" suffix="gA"> gatti</modifier>
<head pos="verb" tensemode="pasttense">pAdu</head>
</verbphrase>
</sentence>
</document>
Figure 6.1. Example Input XML Specification
In the above sentence there are two noun phrases which are passed to the noun
phrase builder as two SOCElements (see section) by the sentence builder module.
The first SOCElement is passed to the morphology engine by the
NounPhraseBuilder to get back the noun phrase “sIwa” which has only the head and
no modifier. The second SOCElement contains a noun and another element which is
the AdjectiveElement. The noun is passed to the morphology engine and the
AdjectiveElement is passed to the AdjectivePhraseBuilder. A noun group or noun
phrase in nominative (explicitly unmarked) is the subject of the clause or sentence.
In the above sentence the noun phrase “sIwa” plays the role of a subject and the noun
phrase “aMxamEna pAta” plays the role of a complement. Telugu follows the
nominative-accusative pattern, with subject-predicate agreement, rather than the
ergative-absolutive pattern. Since Telugu is a nominative-accusative language, the verb agrees
with the argument in the nominative case.
In the current work there can be one noun phrase which plays the role of a subject
and one or more noun phrases which play the role of complements. The noun phrase
in the subject role can be in the nominative (unmarked) or the dative (marked with
“ki” or “ku”).
rAmudiki Akali veswuMxi (Rama is hungry)
The above sentence is an example where the noun phrase in the subject role is
marked with the dative case.
6.4.2 Adjective Phrase
An adjective phrase is an optional modifying element in a head-modifier noun
phrase. The adjective phrase can be a single word or a group of words
which acts as the modifier of a noun. In the current work the generation of the
adjective phrase is the responsibility of the AdjectivePhraseBuilder. The adjective
phrase is currently restricted to a single word; this restriction is expected to be relaxed in later
versions. In the example XML file of Figure 6.1 the AdjectivePhraseBuilder passes
the AdjectiveElement to the morphology engine to get back the adjective phrase
“aMxamEna”. The AdjectivePhraseBuilder passes the adjective phrase “aMxamEna”
back to the NounPhraseBuilder, which finishes the construction of the noun phrase
“aMxamEna pAta”.
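The delegation described above (noun phrase builder, adjective phrase builder, morphology engine) can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the class NounPhraseSketch, the MorphologyEngine interface and the lookup-table engine are hypothetical stand-ins, and the table contains only the forms of Figure 6.1.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical. */
final class NounPhraseSketch {

    /** Stand-in for the morphology engine: maps a root and a suffix to a surface form. */
    interface MorphologyEngine {
        String realize(String root, String suffix);
    }

    /** Toy engine backed by a lookup table that holds only the forms of Figure 6.1. */
    static final class TableEngine implements MorphologyEngine {
        private final Map<String, String> forms = new HashMap<>();

        TableEngine() {
            forms.put("sIwa+", "sIwa");
            forms.put("pAta+", "pAta");
            forms.put("aMxamu+aEna", "aMxamEna");
        }

        @Override
        public String realize(String root, String suffix) {
            String form = forms.get(root + "+" + suffix);
            if (form == null) {
                throw new IllegalArgumentException("No form for " + root + "+" + suffix);
            }
            return form;
        }
    }

    /** Adjective phrase: a single inflected word, as in the current work. */
    static String adjectivePhrase(MorphologyEngine engine, String root, String suffix) {
        return engine.realize(root, suffix);
    }

    /** Noun phrase: optional modifier followed by the head; pass null for no modifier. */
    static String nounPhrase(MorphologyEngine engine, String headRoot,
                             String modifierRoot, String modifierSuffix) {
        String head = engine.realize(headRoot, "");
        if (modifierRoot == null) {
            return head;
        }
        return adjectivePhrase(engine, modifierRoot, modifierSuffix) + " " + head;
    }
}
```

With the toy table, a head-only call for “sIwa” returns “sIwa”, and the call for head “pAta” with modifier “aMxamu” and suffix “aEna” returns “aMxamEna pAta”.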
6.4.2.1 Adjective Morphology
Telugu adjectives are generally indeclinable and most often occur before the noun
they qualify. Adjectives are divided into four classes.
Class I. The first class of adjectives are called basic adjectives. The adjective roots
that occur only as adjectives belong to this class. These roots always appear before
the nouns they qualify.
Class II. The second class of adjectives are called derived adjectives. These
adjectives are derived from nouns, verbs, or adverbs.
Class III. The third class of adjectives are called positional adjectives. These
words can either be used as nouns or as adjectives depending on the position in the
sentence.
Class IV. The fourth class of adjectives are called bound adjectives. They occur in
a limited number of attributed compounds. They belong to a particular class of
adjectives, nouns and adverbs.
6.4.2.1.1 Basic Adjectives
There are very few words that can be used as adjectives only and not as anything
else. Some examples of sentences that use basic adjectives are as follows:
A illu mAxi (That house is ours)
I puswakaM nAxi (This book is mine)
nAku oka kalaM kAvAli (I want a pen)
Ame aragaMta sepu mAtlAdiMxi (She talked for half an hour)
Ayana prawiroju nannu widawAdu (He scolds me every day )
The italicized words in the above sentences are basic adjectives.
6.4.2.1.2 Derived Adjectives
Adjectives derived from other parts of speech like nouns, verbs, adverbs are called
derived adjectives. Some example sentences consisting of derived adjectives are as
follows:
iMti kappu kuruswuMxi (The roof of the house leaks)
Ayana wammudu nAku welusu (I know his younger brother)
6.4.2.1.3 Positional Adjectives
Almost all the nouns in the nominative singular can function as adjectives when
followed by a noun. Here the noun which does not take the oblique form acts as an
adjective based on its position and therefore it is called a positional adjective. All
the cardinal numbers used adjectively belong to this class of adjectives. Some
example sentences are as follows:
seru pappu. (a seer full of dal)
reMdu puswakAlu. (Two books)
ixxaru manuRulu (Two people)
6.4.2.1.4 Bound Adjectives
All words of colour, taste, and density belong to this class of adjectives.
Noun       Bound Adjective
civara     cittacivara
moVxalu    mottamoVxalu
koVsa      kottakoVsa
bayalu     battabayalu
naduma     nattanaduma
Table 6.1. Example Bound Adjectives
6.4.2.1.5 Pronominalized Adjectives
Adjectives with pronominal suffixes (vAdu (singular masculine), xi (singular
non-masculine), vAru (plural human), vi (plural non-human)) are called
pronominalized adjectives. Pronominalized adjectives are used to express
phrases like ‘a good one’ or ‘a big one’ in English.
Some examples of sentences using pronominalized adjectives are as follows:
I gaxi pexxaxi (This room is big)
I puswakAlu koVwwavi (These books are new)
vAdu goVppavAdu (He is great)
Ame piccixi (She is mad)
vAlYlYu maMci vAlYlYu (They are good people)
In sentences where an adjective is used as a predicate it takes a
pronominal suffix that agrees with the subject phrase in number and gender.
Pronominal adjectives like nA, mA, mana, nI, mI, when used as predicates, take the
appropriate pronominal suffix in the same way as other adjectives. A few example
sentences are as follows:
I illu mAxi (This house is ours)
A puswakaM nAxi (That book is mine)
I kalaM nIxi (This pen is yours)
I Uru manaxi (This village is ours)
Pronominalized adjectives can also occur in the subject position because any noun
can be pronominalized.
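The choice of pronominal suffix by number and gender can be sketched as a small lookup. This is only an illustration under stated assumptions: the class and enum names are hypothetical, and the suffix is attached by plain concatenation, which happens to reproduce the examples above (pexxaxi, koVwwavi, goVppavAdu) but ignores any sandhi that other adjectives may need.

```java
/** Illustrative sketch only; names are hypothetical and no sandhi is applied. */
final class PronominalizedAdjectiveSketch {
    enum Num { SINGULAR, PLURAL }
    enum Gender { MASCULINE, NONMASCULINE, HUMAN, NONHUMAN }

    /** Returns the pronominal suffix listed in the text for the given number and gender. */
    static String pronominalSuffix(Num number, Gender gender) {
        if (number == Num.SINGULAR && gender == Gender.MASCULINE) return "vAdu";
        if (number == Num.SINGULAR && gender == Gender.NONMASCULINE) return "xi";
        if (number == Num.PLURAL && gender == Gender.HUMAN) return "vAru";
        if (number == Num.PLURAL && gender == Gender.NONHUMAN) return "vi";
        throw new IllegalArgumentException(
                "Unsupported combination: " + number + ", " + gender);
    }

    /** Attaches the suffix to the adjective by simple concatenation. */
    static String pronominalize(String adjective, Num number, Gender gender) {
        return adjective + pronominalSuffix(number, gender);
    }
}
```

The sketch follows the convention used later in the chapter, where singular forms distinguish masculine from non-masculine and plural forms distinguish human from non-human; other combinations are rejected.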
6.4.2.2 Adjective Morphological Process
Adjectives play the role of a modifier in a noun phrase. All the categories of
adjectives discussed above can be used as the modifier in a noun phrase. In the current
work the XML file which comes as input to form the sentence contains the word
and the associated feature “suffix” of the adjective as the modifier in a noun phrase.
Figure 6.2 shows an example noun phrase along with the adjective as modifier.
<nounphrase>
<modifier pos="adjective" type="descriptive" suffix="ti">pulupu</modifier>
<head pos="noun" gender="nonmasculine" number="plural" person="third"
casemarker="" stem="basic">paMdu</head>
</nounphrase>
Figure 6.2. Adjective as a Modifier in a Noun Phrase
The noun phrase specified by the input in Figure 6.2 is “pullati palYlYu” (sour
fruits). The modifier “pullati” is a derived adjective formed by adding the suffix
“ti” to the noun “pulupu”. The formation of the modifier in Figure 6.2 can be
written along with its features as follows:
“pulupu”, adjective, descriptive, “ti” → pullati
The current work does not yet take the type attribute into consideration; it will
be included in future work. The formation of the adjective “pullati” is as follows:
pulupu → pul
pul → pulla
pulla + ti → pullati
After the final sandhi the adjective “pullati” is formed from the noun “pulupu” as the
modifier of the noun phrase.
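The three steps of this derivation can be mimicked with Java regular expressions. The sketch below is fitted to the single worked example in the text and is not the rule set of the thesis: the class name, the two patterns and the restriction to nouns ending in “upu” are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only; the rules are fitted to the "pulupu" example. */
final class DerivedAdjectiveSketch {
    // Assumed rule 1: drop the final "upu" of the noun (pulupu -> pul).
    private static final Pattern FINAL_UPU = Pattern.compile("upu$");
    // Assumed rule 2: double the final letter and add "a" (pul -> pulla).
    private static final Pattern FINAL_LETTER = Pattern.compile("([a-zA-Z])$");

    static String derive(String noun, String suffix) {
        if (!noun.endsWith("upu")) {
            // Refuse anything the sketch was not designed for.
            throw new IllegalArgumentException("Sketch only handles nouns in -upu: " + noun);
        }
        String stem = FINAL_UPU.matcher(noun).replaceFirst("");
        String geminated = FINAL_LETTER.matcher(stem).replaceFirst("$1$1a");
        return geminated + suffix; // pulla + ti -> pullati
    }
}
```

Calling the method with “pulupu” and “ti” walks through pul, pulla and finally pullati. Nouns of other shapes would need their own rules, which is why the sketch rejects them instead of guessing.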
6.4.3 Verb Phrase
Simple verbs in their finite forms are inflected for tense, followed by person, number,
and gender endings. In order to indicate aspectual, modal and voice
distinctions in the actions or states denoted by the verbs, various auxiliaries are
employed. In Telugu, the simple past, future/habitual and progressive (present) tense
forms of verbs are derived by affixing “A”, “wA”, and “wunnA” to the root/stem
directly, as illustrated below with a masculine subject:
rAmArAvu Ata AdAdu (Ramarao played a game)
rAmArAvu Ata AduwAdu (Ramarao will play a game)
rAmArAvu Ata AduwunnAdu (Ramarao is playing a game)
In the current work the responsibility of the generation of the verb phrase is with the
VerbPhraseBuilder. The verb phrase is also a head-modifier verb phrase with a verb
as the head and an adverb as the modifier of the phrase. In the example XML file of
Figure 6.1 there is only one verb phrase. The VerbPhraseBuilder gets a VAElement
which has a verb along with its features and an AdverbElement. The verb is passed
on to the morphology engine to get back the head of the verb phrase “pAdiMxi”. The
AdverbElement is passed on to the AdverbPhraseBuilder.
bAgA pAdAdu ((He) sang well)
6.4.4 Adverb Phrase
The word adverb is derived from Latin, where “ad” means “attached to”, which indicates
that an adverb is a modifier of a verb. In the current work the construction of the
adverb phrase is the responsibility of the AdverbPhraseBuilder. In the example XML
file of Figure 6.1 the AdverbPhraseBuilder passes the AdverbElement to the
morphology engine to get back the adverb phrase “gattigA”. The
AdverbPhraseBuilder then passes the adverb phrase back to the VerbPhraseBuilder,
which completes the verb phrase “gattigA pAdiMxi”.
6.4.4.1 Adverb Morphology
Adverbs generally occur as modifiers of the verb in a sentence. All the adverbs fall
into three semantic domains denoting time, place and manner. Adverbs belonging to
time and place semantic domain are morphologically nouns since they form oblique
stems and inflect with case suffixes. Some words like “muMxu”(before),
“venaka”(after), “kiMxa”(below), “pEna” (above) referring to directions occur both
as independent time-place adverbs and postpositions of complements within the
predicate phrase.
6.4.4.1.1 Adverbs of Time
Some adverbial nouns of time occur uninflected in a sentence. Some examples of
adverbial nouns that occur uninflected are illustrated in the following sentences:
nenu repu mAtlAdawAnu. (I will talk tomorrow)
vAdu rAwrulu wiruguwuMtAdu.(He roams around in the night)
Ame iMkA iMtiki rAlexu.(She didn’t come home yet)
In the above sentences the italicized words “repu”, “rAwrulu”, “iMkA” occur
uninflected in the sentence. Some adverbial nouns of time include bound particles or
suffixes like “e”, “ke”, “lo”. Some sentences that include these suffixes are as
follows:
awanu iMtinuMci veMtane vaccAdu. (He came immediately from the house)
rAmu nAku ixivarake weVlusu. (I know Ramu since long time)
In the above sentences the italicized words veVMtan + e and ixivara + ke add the
suffixes “e” and “ke”. Some adverbs are derived from nouns by the addition of “gA”,
the infinitive of “av” (to become). Some example sentences to illustrate the use of
“gA” are as follows:
nuvvu AlasyaMgA vaccAvu. (You came late)
vAdu nemmaxigA naduswunnAdu. (He is walking slowly)
Some postpositions like “warvAwa”, “maxya”, when used independently after
demonstrative adjectives like “A”, “I”, act as adverbs in the sentence. Some example
sentences to illustrate the use of those postpositions are as follows:
I maxya mA Uru veVlYlYAnu. (I went to my home town recently)
A warvAwa prasAxu vaccAdu. (Prasad came later)
6.4.4.1.2 Adverbs of place
Some adverbs of place like “akkada”, “ikkada” are used in uninflected form. Some
example sentences containing such adverbs are as follows:
akkada kurci uMxi. (The chair is there)
nuvvu ikkada kurcoV (You sit here)
Some nouns of place or direction become adverbs by the addition of “gA”. Some
examples are as follows:
awanu nAku xUraMgA kurcunnAdu (He sat far away from me)
nuvvu nAku xaggaragA uMdu (You be closer to me)
The verb “niMdu” can be used as “niMdA” in a sentence as postposition to form an
adverb. Some examples are as follows:
vAdi saMciniMdA atukulunnAyi.
Ame bIruvAniMdA puswakAlunnAyi. (Her cupboard is full of books)
6.4.4.1.3 Adverbs of manner
One way of forming manner adverb is by adding “gA” to adjectives and nouns
(which do not specify about time or place). The following is a list of adjectives and
nouns from which adverbs are formed by adding “gA”.
Adjective/Noun   Adverb
peVxxa           peVxxagA
bAgu             bAgugA
ceVdda           ceVddagA
cinna            cinnagA
meVwwa           meVwwagA
Table 6.2 Example Manner Adverbs
Some example sentences are as follows:
awanu pexxagA navvadu.(He doesn’t laugh much)
Ame bAgA caxuvuwuMxi. (She studies well)
rAxa cAla aMxaMgA vuMxi. (Radha is very beautiful)
Words like “alA”, “ilA”, “eVlA” are shorter forms of “alAgA”, “ilAgA”, “eVlAgA”
used as manner adverbs. Some example sentences are as follows:
okasAri ilA raMdi. (Come here once)
mIru kudA alA ceyyaMdi. (You also do like that)
The suffixes “gA” and “lAgA” convert nominal predicates to adverbs when used
before verbs like “un”, “kanabadu/kanipiMc”, and “natiMc”. Some examples are as follows:
awanu pexxamaniRilAgA unnAdu. (He looks like an elderly man)
I illu cinnaxigA uMxi. (This house seems to be small)
The suffix “gA” when added to nouns referring to physical or psychological states
converts them to manner adverbs. The subject in such sentences occurs in the dative
case and the finite verb is always “un”.
nAku caligA uMxi. (I am feeling cold)
vAdiki I Uru kowwagA uMxi. (This town is new to him)
The words “niMdA” and “ArA” are added to nouns to form adverbs. Some examples are as follows:
nenu kadupuniMdA annaM winnAnu. (I ate up to my brim)
annamayya xevuNNi kalYlYArA cusAdu. (Annamayya saw god)
6.4.4.2 Adverb Morphological Process
Adverbs play the role of a modifier in the verb phrase. All the categories of adverbs
discussed above can be used as a modifier in a verb phrase. In the current work the
adverb and its suffix are specified as the modifier of a verb phrase in the XML file.
An example verb phrase along with the adverb as modifier is given in Figure 6.3.
The verb phrase specified in Figure 6.3 is “cinnagA pAduwu” (singing slowly).
The modifier “cinnagA” is a manner adverb formed by adding the suffix “gA” to the
noun “cinna”.
<verbphrase type=" ">
<modifier pos="adverb" suffix="gA">cinna</modifier>
<head pos="verb" tensemode="presentparticiple">pAdu</head>
</verbphrase>
Figure 6.3. Adverb as a Modifier in a Verb Phrase
The formation of the modifier in Figure 6.3 can be written along with its features
as follows:
“cinna”, adverb, “gA” → cinnagA
The formation of the adverb “cinnagA”, which is a simple “sandhi” formation, is as
follows:
cinna + gA → cinnagA
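One possible way to read the modifier and its “suffix” feature from such an XML fragment is sketched below with the DOM parser that ships with the JDK. The class and method names are hypothetical, the thesis does not state which XML API it uses, and the root and suffix are joined by plain concatenation, which is enough for this example only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Illustrative sketch only; not the input reader of the thesis. */
final class AdverbModifierSketch {

    /** Reads the first modifier element and returns root + suffix. */
    static String realizeModifier(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element modifier = (Element) doc.getElementsByTagName("modifier").item(0);
            if (modifier == null) {
                throw new IllegalArgumentException("No modifier element found");
            }
            String root = modifier.getTextContent().trim();  // e.g. "cinna"
            String suffix = modifier.getAttribute("suffix");  // e.g. "gA"
            return root + suffix;                              // simple sandhi: concatenation
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            // Parser configuration, SAX and I/O problems are wrapped for brevity.
            throw new IllegalStateException("Could not read the XML fragment", e);
        }
    }
}
```

Given the verb phrase of Figure 6.3 as a string, the method returns “cinnagA”; the head verb would be handled separately by the morphology engine.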
6.5 Agreement in Verb
Grammatical agreement is the morphological phenomenon in which word forms co-occurring
in a clause are sensitive to each other. Predicate
agreement explains the morphological changes that occur in the predicate of
a sentence in response to the presence of some other word (subject or object) in
the sentence. The predicates in Telugu can be divided into three categories,
namely verbal, nominative and abstract. In the case of the verbal predicate in Telugu
the finite verb exhibits agreement in number, gender, and person with its subject
nominal, which is always in the nominative. The use of the accusative case for inanimate
objects is optional: the accusative suffix can be
omitted and the word written in its nominative form.
sIwa puswakA(nni) konnaxi (Sita bought a book)
sIwa puswaka(M) konnaxi (Sita bought a book)
In the first sentence above the accusative suffix is added to the
complement of the sentence. In the second sentence the accusative
suffix is not added to the complement because the use of the accusative case for
inanimate objects (puswakam, paMdu) is optional in Telugu. Both
sentences mean the same, but the second is the one which is regularly
used.
In the current thesis the predicate type can only be verbal; the other predicate
types, abstract and nominative, will be included in future versions of the
work. Some example input XML files and the surface forms generated are discussed
in the following sections to show how agreement is implemented. In the
current work the responsibility for the agreement between the nominative subject and
the verb lies with the Sentence Builder, a module which has centralized control over all
the other processing modules. The following examples discuss the agreement
of person, number, and gender (PNG) between the subject and the verb.
6.5.1 Agreement with the First Person
In Telugu language “nenu” is the pronoun for the first person. Here we look at an
example XML file which illustrates the agreement between the first person and the
verb.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<document>
<sentence type=" " predicate-type="verbal" respect="no">
<nounphrase role="subject">
<head pos="pronoun" gender="masculine" number="singular" person="first"
casemarker=" " stem="basic">nenu</head>
</nounphrase>
<verbphrase type=" ">
<head pos="verb" tensemode="futuretense">vaccu</head>
</verbphrase>
</sentence>
</document>
Fig 6.4 Example XML Specification for Agreement with First Person
The XML file in Fig 6.4 generates the following sentence:
nenu vaswAnu (I will come)
The sentence has a subject and a verb. The subject “nenu” along with its features can
be written as follows:
“nenu”, pronoun, masculine, singular, first, basic → nenu
The verb “vaswAnu” along with its features can be written as follows:
“vaccu”, verb, futuretense → vas
The future tense mode suffix “wA” is agglutinated to the word and it becomes:
vas + wA → vaswA
The sentence builder then agglutinates the PNG suffix for the first person “nu” to the
verb in the future tense as follows:
vaswA + nu → vaswAnu
In case the number feature of the XML file is “plural” and not “singular”, the
sentence is generated as follows:
manamu vaswAmu (We (including the listener) will come)
or
memu vaswAmu (We (excluding the listener) will come)
Whether the subject is “manamu” or “memu” depends on a feature called
“exclusive”, which is only used for pronouns in the first person when the number
is plural.
The subject “manamu” along with its features (the final value “no” is the value of “exclusive”) can be written as follows:
“nenu”, pronoun, masculine, plural, first, basic, no → manamu
The verb is inflected as usual except for the PNG suffix. The PNG suffix “mu”
for the first person plural is added to the verb as follows:
vaswA + mu → vaswAmu
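The effect of the number and “exclusive” features on the first person subject can be sketched as follows. The class and method names are hypothetical and the two boolean parameters merely stand in for the XML feature values; the sketch covers the surface form of the pronoun only.

```java
/** Illustrative sketch only; names and parameter types are assumptions. */
final class FirstPersonPronounSketch {

    /**
     * Surface form of the first person pronoun.
     * The "exclusive" flag is consulted only when the number is plural.
     */
    static String subject(boolean plural, boolean exclusive) {
        if (!plural) {
            return "nenu";          // I
        }
        return exclusive ? "memu"   // we, excluding the listener
                         : "manamu"; // we, including the listener
    }
}
```

The singular form ignores the flag, which matches the statement that “exclusive” is only used when the number is plural.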
6.5.2 Agreement with the Second Person
The pronoun used to represent the second person in Telugu is “nuvvu”. An
example XML file which illustrates the agreement between the second person and
the verb is provided here.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<document>
<sentence type=" " predicate-type="verbal" respect="no">
<nounphrase role="subject">
<head pos="pronoun" gender="masculine" number="singular" person="second"
casemarker=" " stem="basic">nuvvu</head>
</nounphrase>
<verbphrase type=" ">
<head pos="verb" tensemode="futuretense">vaccu</head>
</verbphrase>
</sentence>
</document>
Fig 6.5 Example XML Specification for Agreement with Second Person
The XML file in Fig 6.5 generates the following sentence:
nuvvu vaswAvu (You (singular) will come)
The subject “nuvvu” along with its features can be written as follows:
“nuvvu”, pronoun, masculine, singular, second, basic → nuvvu
The verb “vaswAvu” along with its features can be written as follows:
“vaccu”, verb, futuretense → vas
The future tense mode suffix “wA” is agglutinated to the word and it becomes:
vas + wA → vaswA
The sentence builder then agglutinates the PNG suffix for the second person “vu” to
the verb in the future tense as follows:
vaswA + vu → vaswAvu
In case the number feature of the XML file is “plural” the sentence is generated as follows:
mIru vaswAru (You (plural) will come)
The subject “mIru” along with its features can be written as follows:
“nuvvu”, pronoun, masculine, plural, second, basic → mIru
The verb is inflected as usual except for the PNG suffix. The PNG suffix “ru”
for the second person plural is added to the verb as follows:
vaswA + ru → vaswAru
As with the first person, the gender of the person makes no difference to
agreement in the second person.
6.5.3 Agreement with the Third Person Masculine
An example XML file to illustrate the agreement between the third person masculine and the
verb is provided here.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<document>
<sentence type=" " predicate-type="verbal" respect="no">
<nounphrase role="subject">
<head pos="pronoun" gender="masculine" number="singular" person="third"
casemarker=" " stem="basic">vAdu</head>
</nounphrase>
<verbphrase type=" ">
<head pos="verb" tensemode="futuretense">vaccu</head>
</verbphrase>
</sentence>
</document>
Fig 6.6 XML Specification for Agreement with Third Person Masculine
The XML file in Fig 6.6 generates the following sentence:
vAdu vaswAdu (He will come)
The subject “vAdu” along with its features can be written as follows:
“vAdu”, pronoun, masculine, singular, third, basic → vAdu
The verb “vaswAdu” along with its features can be written as follows:
“vaccu”, verb, futuretense → vas
The future tense mode suffix “wA” is agglutinated to the word and it becomes:
vas + wA → vaswA
The sentence builder then agglutinates the PNG suffix for the third person
masculine “du” to the verb in the future tense as follows:
vaswA + du → vaswAdu
In case the number feature of the XML file is “plural” the sentence is generated as follows:
vAlYlYu vaswAru (They will come)
The subject “vAlYlYu” along with its features can be written as follows:
“vAdu”, pronoun, human, plural, third, basic → vAlYlYu
When the number is plural, the gender for the third person, as shown above, is human
to denote human beings and non-human to denote animals and non-living things.
The PNG suffix “ru” for the third person, plural and human is added to the verb as follows:
vaswA + ru → vaswAru
6.5.4 Agreement with the Third Person Non-Masculine
As mentioned earlier, in Telugu nouns and pronouns denoting female
persons are treated as non-masculine gender in the singular, but in the plural those
nouns are treated as human along with the nouns denoting male persons. Here we
have an example XML file which illustrates the agreement between the
non-masculine nouns and the verb.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<document>
<sentence type=" " predicate-type="verbal" respect="no">
<nounphrase role="subject">
<head pos="pronoun" gender="nonmasculine" number="singular" person="third"
casemarker=" " stem="basic">axi</head>
</nounphrase>
<verbphrase type=" ">
<head pos="verb" tensemode="futuretense">vaccu</head>
</verbphrase>
</sentence>
</document>
Fig 6.7 Specification for Agreement with Third Person Non-Masculine
The XML file in Fig 6.7 generates the following sentence:
axi vaswAxi (She (It) will come)
The sentence has a subject and a verb. The subject “axi” along with its features can
be written as follows:
“axi”, pronoun, nonmasculine, singular, third, basic → axi
The verb “vaswAxi” along with its features can be written as follows:
“vaccu”, verb, futuretense → vas
The future tense mode suffix “wA” is agglutinated to the word and it becomes:
vas + wA → vaswA
The sentence builder then agglutinates the PNG suffix for the third person
non-masculine “xi” to the verb in the future tense as follows:
vaswA + xi → vaswAxi
In case the number feature of the XML file is “plural” the sentence is generated as follows:
vAlYlYu vaswAru (They (human) will come)
or
avi vaswAyi (Those (non-human) will come)
Which of the two sentences is generated is decided by the gender
feature, which can be “human” or “nonhuman”.
In case the gender is human the subject “vAlYlYu” is generated as described in the
previous section. The PNG suffix “ru” for the third person, plural and human is added to the verb
as follows:
vaswA + ru → vaswAru
In case the gender is non-human the subject “avi” along with its features can be
written as follows:
“axi”, pronoun, nonhuman, plural, third, basic → avi
The PNG suffix “yi” for the third person, plural and non-human is added to the verb
as follows:
vaswA + yi → vaswAyi
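The PNG suffixes used across Sections 6.5.1 to 6.5.4 can be gathered into one lookup. The sketch below is an illustration only: the class, enum and method names are hypothetical, the stem “vas” and tense suffix “wA” are taken from the worked examples, and only the future tense forms shown in this chapter are covered.

```java
/** Illustrative sketch only; covers just the future tense forms shown in the text. */
final class PngAgreementSketch {
    enum Person { FIRST, SECOND, THIRD }
    enum Num { SINGULAR, PLURAL }
    enum Gender { MASCULINE, NONMASCULINE, HUMAN, NONHUMAN }

    /** PNG suffix for the subject's features; gender matters only in the third person. */
    static String pngSuffix(Person person, Num number, Gender gender) {
        switch (person) {
            case FIRST:
                return number == Num.SINGULAR ? "nu" : "mu";
            case SECOND:
                return number == Num.SINGULAR ? "vu" : "ru";
            default:
                if (number == Num.SINGULAR) {
                    if (gender == Gender.MASCULINE) return "du";
                    if (gender == Gender.NONMASCULINE) return "xi";
                } else {
                    if (gender == Gender.HUMAN) return "ru";
                    if (gender == Gender.NONHUMAN) return "yi";
                }
                throw new IllegalArgumentException(
                        "Unsupported third person features: " + number + ", " + gender);
        }
    }

    /** Future tense form: stem + "wA" + PNG suffix, e.g. vas + wA + nu. */
    static String futureVerb(String futureStem, Person person, Num number, Gender gender) {
        return futureStem + "wA" + pngSuffix(person, number, gender);
    }
}
```

With the stem “vas”, the eight feature combinations discussed above give vaswAnu, vaswAmu, vaswAvu, vaswAru, vaswAdu, vaswAxi, vaswAru and vaswAyi.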
6.6 Word order
Languages like English which have a relatively fixed word order are called
positional languages. Telugu, unlike English, is a free word order language like most
other South Asian languages (Dravidian and Indian). The order of grammatical
functions like subjects, complements and objects is largely free. Internal changes in
the sentences or position swaps between phrases do not affect the grammatical
functions of the nominals.
In Telugu the position or order of occurrence of a noun group does not carry the
information about the karaka or theta roles, which specify the number and type of
noun phrases required syntactically by a particular verb. The information is
contained in the postpositions or the surface case endings of nouns (Akshara Bharati
et al., 1995). Therefore the relatively free order of the words does not affect the
meaning of the sentence.
rAmudu rAvanudini BANaMwo campAdu (Rama Killed Ravana with an Arrow)
rAmudu BANaMwo rAvanudini campAdu (Rama killed Ravana with an Arrow)
BANaMwo rAmudu rAvanudini campAdu (Rama killed Ravana with an Arrow)
rAmudu campAdu rAvanudini BANaMwo (Rama killed Ravana with an Arrow)
In the current work the word order is also free: the word order of the output is the same
as the word order in which the input is given. The SentenceBuilder module makes
sure that the phrases of a sentence are sent to the output module in
the same order in which they are given as input.
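Because the input order is preserved, linearization reduces to joining the realized phrases in the order received. The sketch below illustrates this design choice only; the class and method names are hypothetical and the phrases are assumed to be already realized strings.

```java
import java.util.List;

/** Illustrative sketch only; names are hypothetical. */
final class WordOrderSketch {

    /** Joins already-realized phrases in the order they were given, with no reordering. */
    static String linearize(List<String> realizedPhrases) {
        return String.join(" ", realizedPhrases);
    }
}
```

Passing the phrases rAmudu, rAvanudini, BANaMwo, campAdu yields the first example sentence above; passing them in a different order yields the corresponding variant, since the case endings rather than the positions carry the roles.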
6.7 Output Generator
The output generator is the module which produces the final output: Telugu sentences in the Unicode
character set. A sample output in Telugu for a
sentence in WX-notation is as follows:
rAmudu iMtiki vaccAdu (Sentence in WX-notation)
రాముడు ఇంటికి వచ్చాడు (Sentence in Telugu)
Telugu Unicode Chart
[Chart of the Telugu Unicode block: rows 0C0 to 0C7, columns 0 to F]
Figure 6.8 Unicode Block set for Telugu
Figure 6.8 is the Unicode block for Telugu as of Unicode version 10.0 which
contains characters for Telugu, Gondi and Lambadi languages of Andhra Pradesh
and Telangana. In its original incarnation the code points U+0C01…U+0C4D were a
direct copy of the Telugu characters A1-ED from the 1988 ISCII standard. The grey
areas indicate non-assigned code points.
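A conversion from WX-notation to Telugu Unicode could look roughly like the sketch below. It is not the Output Generator of the thesis: the class name is hypothetical, the letter table follows common WX conventions and is an assumption that covers only frequent letters, and code points are written as escapes so that the source stays ASCII. The sketch relies on the fact that each dependent vowel sign in the Telugu block lies 0x38 above its independent vowel.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; partial WX table, assumed from common WX conventions. */
final class WxToTeluguSketch {
    private static final Map<String, Character> VOWELS = new HashMap<>();
    private static final Map<String, Character> CONSONANTS = new HashMap<>();
    private static final char VIRAMA = '\u0C4D';
    private static final char ANUSVARA = '\u0C02';
    private static final char VISARGA = '\u0C03';
    private static final int VOWEL_SIGN_OFFSET = 0x38;

    static {
        String[] vowels = {"a", "A", "i", "I", "u", "U", "q", "eV", "e", "E", "oV", "o", "O"};
        int[] vowelCodes = {0x0C05, 0x0C06, 0x0C07, 0x0C08, 0x0C09, 0x0C0A, 0x0C0B,
                            0x0C0E, 0x0C0F, 0x0C10, 0x0C12, 0x0C13, 0x0C14};
        for (int i = 0; i < vowels.length; i++) {
            VOWELS.put(vowels[i], (char) vowelCodes[i]);
        }
        String[] consonants = {"k", "K", "g", "G", "f", "c", "C", "j", "J", "F",
                               "t", "T", "d", "D", "N", "w", "W", "x", "X", "n",
                               "p", "P", "b", "B", "m", "y", "r",
                               "rY", "l", "lY", "v", "S", "R", "s", "h"};
        int[] consonantCodes = {0x0C15, 0x0C16, 0x0C17, 0x0C18, 0x0C19, 0x0C1A, 0x0C1B,
                                0x0C1C, 0x0C1D, 0x0C1E, 0x0C1F, 0x0C20, 0x0C21, 0x0C22,
                                0x0C23, 0x0C24, 0x0C25, 0x0C26, 0x0C27, 0x0C28,
                                0x0C2A, 0x0C2B, 0x0C2C, 0x0C2D, 0x0C2E, 0x0C2F, 0x0C30,
                                0x0C31, 0x0C32, 0x0C33, 0x0C35, 0x0C36, 0x0C37, 0x0C38, 0x0C39};
        for (int i = 0; i < consonants.length; i++) {
            CONSONANTS.put(consonants[i], (char) consonantCodes[i]);
        }
    }

    static String transliterate(String wx) {
        StringBuilder out = new StringBuilder();
        boolean afterConsonant = false; // true while a consonant still awaits its vowel
        int i = 0;
        while (i < wx.length()) {
            // Prefer two-letter WX symbols such as eV, oV, lY and rY.
            String token = null;
            if (i + 2 <= wx.length()) {
                String two = wx.substring(i, i + 2);
                if (VOWELS.containsKey(two) || CONSONANTS.containsKey(two)) token = two;
            }
            if (token == null) token = wx.substring(i, i + 1);
            i += token.length();

            if (CONSONANTS.containsKey(token)) {
                if (afterConsonant) out.append(VIRAMA); // consonant cluster
                out.append(CONSONANTS.get(token).charValue());
                afterConsonant = true;
            } else if (VOWELS.containsKey(token)) {
                char vowel = VOWELS.get(token);
                if (!afterConsonant) {
                    out.append(vowel); // independent vowel
                } else if (!token.equals("a")) {
                    out.append((char) (vowel + VOWEL_SIGN_OFFSET)); // dependent vowel sign
                } // inherent "a" after a consonant needs no sign
                afterConsonant = false;
            } else {
                if (afterConsonant) out.append(VIRAMA); // bare consonant before a break
                afterConsonant = false;
                if (token.equals("M")) out.append(ANUSVARA);
                else if (token.equals("H")) out.append(VISARGA);
                else out.append(token); // spaces, punctuation, unknown symbols
            }
        }
        if (afterConsonant) out.append(VIRAMA);
        return out.toString();
    }
}
```

For the sample above, transliterate("rAmudu iMtiki vaccAdu") produces రాముడు ఇంటికి వచ్చాడు. Letters outside the assumed table are copied through unchanged, so a complete converter would need the full WX inventory.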
6.8 Summary
This chapter presented the details of the sentence formation mechanism used in
the current thesis, with a detailed description of the different modules used
in the construction of the Java application for Telugu sentence formation.
-------------------------CHAPTER-7
CONCLUSION
7.1 Introduction
This thesis presented research work on designing and developing a surface
realization engine (surface realizer or realizer for short) for Telugu, an Indian
language belonging to the Dravidian family of languages. A surface realizer is a
module in a natural language generation (NLG) system that accepts input
specification consisting of lexicalized sentence constituents and their associated
features to auto-generate grammatically valid sentences by applying correct
morphology, syntax and orthography.
Although surface realization engines for European languages such as English and
French have been available since the 90s, to the best of our knowledge there is no
general-purpose realization engine for any Indian language. In this context, the
current research work assumes special significance in developing a surface
realization framework that is suitable for Indian languages, demonstrated to work
with Telugu language.
The developed framework is an adaptation of the now popular SimpleNLG
framework that has recently been applied to a wide spectrum of languages including
German (Marcel Bollmann, 2011), Filipino (Ethel Ong et al., 2011), French (Vaudry
and Lapalme, 2013), and Brazilian Portuguese (Rodrigo de Oliveira and Sripada,
2014).
A major effort while building a realization engine for a new language relates to
acquiring the required linguistic knowledge. Not only can finding the required knowledge
sources be difficult, but having found the right knowledge sources, it can be
quite challenging to then acquire all the required knowledge. The SimpleNLG
framework provided the right guidance both in identifying the correct
knowledge sources and in acquiring the required
knowledge.
Because Telugu is a morphologically rich language (MRL), the realization
framework required for it should offer a central role to morphology. The SimpleNLG
framework separates morphology and syntax offering, therefore, a realization
framework that supports the independent development of the rich morphological
processing required by Telugu language. The current research work developed a
finite-state technology based framework for Telugu morphology.
Another important feature of the SimpleNLG framework is the usage of features to
represent grammatical and at times semantic information. Our adapted framework
too makes an extensive use of features to represent grammatical/semantic
information in the realization engine. These features help in performing a wide
range of operations including morphology, syntax and agreement among sentence
constituents (e.g. subject-verb agreement).
The evaluation studies carried out during the research work show that the developed
realizer generates grammatical sentences and the morphological modules can
generate grammatically correct word forms.
7.2 Critical Review
An important property of a realizer is its coverage. The SimpleNLG framework is
based on the principle that realizers should offer complete morphological coverage
while supporting only the most frequently used syntactic forms (which is one of the
reasons why the framework is called Simple). It should be noted that the SimpleNLG
framework was originally developed for realizing English, which is known to
involve more syntactic complexity than morphological complexity. While adapting
the SimpleNLG framework for the morphologically rich Telugu, it has been accepted
that achieving exhaustive coverage of Telugu morphology is only possible at the
theoretical level (the finite state morphological framework) but not necessarily in the
software. As a result, although the finite state morphological framework developed in the
current research work provides a wide-coverage theoretical basis, our evaluation
studies showed that the realizer (the software) built using the framework needs to
broaden its coverage further. This is particularly true of the coverage of noun and
verb forms, the open class words. Our approach while building the software has been
to focus on all the major types of nouns and verbs as described in our knowledge
source.
In addition to the limited morphological coverage, the current version of the
realization engine software uses the WX notation to deal with Telugu orthography,
which is not the only notation used by Telugu language technology software. Again,
this is a limitation of the software developed in the research work, but it should be
emphasized that the realization engine framework developed in the current research
is generic and extends beyond what was implemented in the software.
An ideal evaluation of the current work should aim to show that the realization
framework and the linguistic knowledge represented by the framework ensure
auto-generation of well-formed words, phrases and sentences. It is difficult to directly
perform evaluation studies on the theoretical framework. Instead, the evaluation
studies in the current work focus on showing that the realization engine software
which implements the theoretical framework generates the required well-formed
surface forms. It is worth emphasizing that further evaluation studies are required to
quantify the quality of the software output, which in turn will bring greater clarity on the
quality of the theoretical framework.
7.3 Future Work
The most important task for the future is to apply the developed framework to other
Indian languages so that the developed framework can be claimed to be suitable for
realization of Indian languages. This may involve making further refinements to the
framework that reflect the differences among the Indian languages. A deeper
understanding of these individual differences would be interesting both from a
theoretical and an applied perspective.
As argued in the previous section, the current version of the software does not
provide complete coverage of the Telugu grammar. Building on the current
version, extensions can be made to cover grammatical features currently not covered.
An important extension, in this context, is to extend the current realization engine to
generate more than one grammatically valid form for a given input. This
overgeneration feature should apply at all levels including words, phrases and sentences.
In terms of software development this is a significant extension, but one that makes the
realization software truly attractive for real world usage.
Another important direction for future work is to actually use the realization engine
as part of an NLG application. (Several small-scale efforts were made
during the current research work to apply the realization engine, none of which is
significant enough to be reported here.) This will be an important step in the
evaluation of the engine. In addition, this will also help in specifying the relative
importance of different modules and where the extensions are really required for a
given use case.
Yet another important item of future work is rebuilding the morphology modules
using finite-state tools such as JFlex. Although this will not change the quality of
the output, it would significantly improve the portability of the framework to other
Indian languages. Because the current version already uses Java’s regular
expression library, the grammatical knowledge required to write the JFlex input files
already exists, and it should not be hard to incorporate a JFlex-based lexer into the
Telugu realization engine.
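To indicate the kind of knowledge that would carry over, the sketch below expresses
two noun plural rules as ordered pattern/replacement pairs using java.util.regex. The
class name RegexPluralSketch, the rule table and the two WX-notation examples are
simplified, illustrative assumptions and are not the engine's actual rules; each pair
corresponds roughly to what a pattern/action rule in a lexer specification would encode.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: ordered regex rewrite rules for noun plurals (WX notation). */
class RegexPluralSketch {

    // Rules are tried in insertion order; the first pattern that matches wins.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        // Illustrative rule: a stem ending in -aM takes -Alu (puswakaM -> puswakAlu).
        RULES.put(Pattern.compile("^(.+)aM$"), "$1Alu");
        // Illustrative fallback: append -lu to the stem (Avu -> Avulu).
        RULES.put(Pattern.compile("^(.+)$"), "$1lu");
    }

    /** Applies the first matching rule; returns the stem unchanged if none match. */
    static String pluralize(String stem) {
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            Matcher m = rule.getKey().matcher(stem);
            if (m.matches()) {
                return m.replaceFirst(rule.getValue());
            }
        }
        return stem;
    }

    public static void main(String[] args) {
        System.out.println(pluralize("puswakaM"));
        System.out.println(pluralize("Avu"));
    }
}
```

Keeping the rules in a data table rather than in control flow is the design choice
that matters here: a table of this shape could be transcribed rule by rule into the
input file of a lexer generator.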
REFERENCES
1. Akshara Bharati, Rajeev Sangal, S.M. Bendre, Pavan Kumar, Aishwarya
2001 “Unsupervised Improvement of Morphological Analyser for
Inflectionally Rich Languages” Published in the Proceedings of NLPRS,
Tokyo, 27-30 November 2001 Report No: IIIT/TR/2001/4 pp 685-692.
2. Akshara Bharati, Rajeev Sangal, Dipti M Sharma 2007 “SSF: Shakti
Standard Format Guide” LTRC, IIIT, Hyderabad, Report No: TR-LTRC-33.
3. Akshara Bharati, Vineet Chaitanya, Rajeev Sangal 1995 “Natural Language
Processing: A Paninian Perspective” Prentice-Hall of India, New Delhi.
4. Albert Gatt and Ehud Reiter 2009 “SimpleNLG: A realization engine for
practical applications”, Proceedings of ENLG 2009, pp 90-93.
5. Alessandro Mazzei, Cristina Battaglino and Cristina Bosco 2016
“SimpleNLG-IT: adapting SimpleNLG to Italian”, Proceedings of the 9th
International Natural Language Generation conference, Edinburgh, UK,
September 5-8 2016. © 2016 Association for Computational Linguistics pp
184–192.
6. Appelt, D. 1985. “Planning English Sentences”. Cambridge University Press,
Cambridge, UK.
7. Ballesteros, M., Bohnet, B., Mille, S., & Wanner, L. 2015. “Data-driven
sentence generation with non-isomorphic trees”. In Proc. NAACL-HLT’15,
pp. 387–397.
8. Banaee, H., Ahmed, M. U., & Loutfi, A. 2013. “Towards NLG for
Physiological Data Monitoring with Body Area Networks”. In Proc.
ENLG’13, pp. 193–197.
9. Bangalore, S., & Rambow, O. 2000. Corpus-based lexical choice in Natural
Language Generation. In Proc. ACL’00, pp. 464–471.
10. Bateman, J. A. 1997. Enabling technology for multilingual natural language
generation: the KPML development environment. Natural Language
Engineering, Vol 3, Issue 1, pp. 15–55.
11. Beesley, Kenneth R, Lauri Karttunen. 2003 “Finite State Morphology”. Palo
Alto, CA: CSLI Publications.
12. Belz, A. 2008. “Automatic generation of weather forecast texts using
comprehensive probabilistic generation-space models”. Natural Language
Engineering, Vol 14, Issue 4, pp. 431-455.
13. Benoit Lavoie and Owen Rambow, 1997 “A Fast and Portable Realizer for
Text Generation Systems” Proceedings of the Fifth Conference on Applied
Natural Language Processing (ANLP97), Washington pp. 265–268.
14. Brown, C.P 1991. “The Grammar of the Telugu Language”. New Delhi:
Laurier Books Ltd.
15. Cahill, A., Forst, M., & Rohrer, C. 2007. Stochastic realisation ranking for a
free word order language. In Proc. ENLG’07, pp. 17–24.
16. Carenini, G., & Moore, J. D. 2006. Generating and evaluating evaluative
arguments. Artificial Intelligence, Vol 170 Issue 11, pp. 925–952.
17. Chen, D. L., & Mooney, R. J. 2008. Learning to sportscast: a test of
grounded language acquisition. In Proc. ICML’08, pp. 128–135.
18. Cheng, H., & Mellish, C. 2000. Capturing the interaction between
aggregation and text planning in two generation systems. In Proc. INLG ’00,
Vol. 14, pp. 186–193.
19. Dalianis, H. 1999. Aggregation in Natural Language Generation.
Computational Intelligence, Vol 15 Issue 4, pp. 384–414.
20. Dokkara S R S, Penumathsa S V, Sripada S G. 2015 “A Simple Surface
Realization Engine for Telugu” Proceedings of the 15th European Workshop
on Natural Language Generation (ENLG), Brighton Sep, pp. 1–8.
21. Dr. Ramakanth Kumar P, Shambhavi. B. R, Srividya K, Jyothi B J, Spoorti
Kundargi, Varsha Shastri G 2011 "Kannada Morphological Analyzer And
Generator Using Trie", IJCSNS International Journal Of Computer Science
And Network Security, Vol.11 Issue 1, January, pp. 112-116.
22. Elhadad, M., & Robin, J. 1996. An overview of SURGE: A reusable
comprehensive syntactic realization component. In Proceedings of the 8th
International Natural Language Generation Workshop (IWNLG’98), pp. 1–4.
23. Ethel Ong, Stephanie Abella, Lawrence Santos, and Dennis Tiu 2011 “A
Simple Surface Realizer for Filipino” 25th Pacific Asia Conference on
Language, Information and Computation, pp. 51–59, 2011.
24. Fleischman, M., & Hovy, E. H. 2002. Emotional Variation in speech-based
Natural Language Generation. In Proc. INLG’02, pp. 57–64.
25. Ganapathiraju M; Lori Levin. 2006: “TelMore: Morphological Generator for
Telugu Nouns and Verbs”, in Proceedings of Second International
Conference on Universal Digital Library, Alexandria, Egypt, pp. 17-19.
26. Gatt, A., Portet, F., Reiter, E., Hunter, J. R., Mahamood, S., Moncur, W., &
Sripada, S. 2009. From data to text in the neonatal intensive care Unit: Using
NLG technology for decision support and information management. AI
Communications, Vol 22 Issue 3, pp.153–186.
27. Girija V. R. and T. Anuradha 2017. Application of Finite State Methods in
Malayalam Text Analysis. International Journal of Computer Applications
(0975 – 8887) Volume 168 Issue 12, June 2017, pp. 43-47.
28. Goldberg, E., Driedger, N., & Kittredge, R. I. 1994. Using Natural Language
Processing to Produce Weather Forecasts. IEEE Expert, 2, pp. 45–53.
29. Goldman N 1975 “Conceptual Generation”, in Schank, R. Conceptual
Information Processing. North-Holland/Elsevier, pp. 289-372.
30. Gregory T. Stump 2001: “Inflectional Morphology: A Theory of Paradigm
Structure” Cambridge University Press.
31. Halliday, M., & Matthiessen, C. M. 2004. Introduction to Functional
Grammar (3rd edition). Hodder Arnold, London.
32. Harris, M. D. 2008. Building a large-scale commercial NLG system for an
EMR. In Proc. INLG ’08, pp. 157–160.
33. Hockett 1954: "Two models of grammatical description", in: Word, 10,
pp. 210–234. [= Readings in Linguistics, vol. I, pp. 386–399].
34. Huske-Kraus, D. 2003. Text generation in clinical medicine: A review.
Methods of information in medicine, Vol 42 Issue 1, pp. 51–60.
35. James A Moore and William C Mann 1979 “A Snapshot of KDS: A
Knowledge Delivery System” Proceedings of the Seventh Annual Meeting,
Association of Computational Linguistics, pp. 51-52.
36. James Hunter, Yvonne Freer, Albert Gatt, Ehud Reiter, Somayajulu Sripada,
Cindy Sykes, and Dave Westwater 2011 “BT-Nurse Computer Generation of
Natural Language Shift Summaries from Complex Heterogeneous Medical
Data”. Journal of the American Medical Informatics Association Sep-Oct Vol
18 Issue 5 pp. 621-624.
37. John Henry Clippinger, Jr. 1977 “Meaning and Discourse - A Computer
Model of Psychoanalytic Speech and Cognition”. The Johns Hopkins Univ.
Press, Baltimore, ISBN 0-8018-1943-1.
38. Johnson, C. D. 1972. Formal Aspects of Phonological Description. Mouton,
The Hague.
39. Kaplan, R. M. and Kay, M. 1994. Regular models of phonological rule
systems. Computational Linguistics, Vol 20 Issue 3, pp. 331–378.
40. Kasper, R. T. 1989. A Flexible Interface for Linking Applications to
Penman’s Sentence Generator. In Proc. Workshop on Speech and Natural
Language, pp. 153–158.
41. Knight, K., Hatzivassiloglou, V 1995 “NITROGEN: Two-Level, Many-
Paths Generation”. Proceedings of the ACL-95 conference. Cambridge, MA
pp. 252-260.
42. Koskenniemi, K. 1983. “Two-level morphology: A general computational
model for word-form recognition and production”. Publication 11, University
of Helsinki, Department of General Linguistics, Helsinki.
43. Krishnamurti B H, Gwynn J P L. 1985 “A Grammar of Modern Telugu”
Oxford University Press.
44. Krishnamurti B H. 1961 “Telugu Verbal Bases: a Comparative and
Descriptive Study” University of California Press, Berkeley & Los Angeles.
45. Kristina Toutanova, Hisami Suzuki, Achim Ruopp 2008: Applying
Morphology Generation Models to Machine Translation. Proceedings of
ACL-08: HLT, Columbus, Ohio, USA, pp 514–522.
46. Langkilde-Geary, I. 2000. Forest-based statistical sentence generation. In
Proc. ANLP-NAACL’00, pp. 170–177.
47. Langkilde-Geary, I., & Knight, K. 2002. HALogen Statistical Sentence
Generator. In Proc. ACL’02 (Demos), pp. 102–103.
48. Lauri Karttunen 2003 “Computing with Realizational Morphology”.
49. Lauri Karttunen and Kenneth R. Beesley 2005 “Twenty-Five Years of Finite
State Morphology”. Inquiries into Words, Constraints and Contexts.
50. Mairesse, F., & Walker, M. A. 2010. Towards personality-based user
adaptation: Psychologically informed stylistic language generation. User
Modelling and User-Adapted Interaction, Vol 20 Issue 3, pp 227–278.
51. Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2014. “Semi-supervised
learning of morphological paradigms and lexicons” in Proceedings of the
14th Conference of the European Chapter of the Association for
Computational Linguistics, Gothenburg, Sweden, April 26-30 2014 pp 569–
578.
52. Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm
classification in supervised learning of morphology. In Human Language
Technologies: The 2015 Annual Conference of the North American Chapter
of the ACL, Denver, Colorado, May 31 – June 5, 2015 pp 1024–1029.
53. Mann, W. C., & Matthiessen, C. M. 1983. Nigel: A systemic grammar for
text generation. Tech. rep., ISI, University of Southern California, Marina del
Rey, CA (Technical Report RR) pp.83-105.
54. Marcel Bollmann, 2011 “Adapting SimpleNLG to German” Proceedings of
the 13th European Workshop on Natural Language Generation (ENLG),
Nancy, France, September pp 133–138.
55. Meehan J R, 1977 “Tale-Spin an Interactive Program that writes Stories” In
proceedings of the 5th International Joint Conference on Artificial
Intelligence pp 91-98.
56. Mellish, C., Scott, D., Cahill, L., Paiva, D. S., Evans, R., & Reape, M.
(2006). A Reference Architecture for Natural Language Generation Systems.
Natural Language Engineering, Vol 12 Issue 01, pp.1–34.
57. Meteer, M. W., McDonald, D. D., Anderson, S., Forster, D., Gay, L.,
Huettner, A., & Sibun, P. 1987. “Mumble-86: Design and Implementation”.
Tech. rep., University of Massachusetts at Amherst, Amherst, MA (Technical
Report COINS 87-87).
58. Minnen G, Carroll J, Pearce D. 2000 “Robust, applied morphological
generation”. Proceedings of the 1st International Natural Language
Generation Conference, Mitzpe Ramon, Israel, pp. 201-208.
59. Molina, M., Stent, A., & Parodi, E. 2011. “Generating Automated News to
Explain the Meaning of Sensor Data”. In Gama, J., Bradley, E., & Hollmén,
J. (Eds.), Proc. IDA 2011, Springer, Berlin and Heidelberg, pp. 282–293.
60. Nizar Habash 2000 “OxyGen: A Language Independent Linearization
Engine”. J S White (Ed.): AMTA 2000, LNAI 1934, Springer-Verlag,
Berlin Heidelberg, pp 68-74.
61. O’Donnell, M. 2001. ILEX: An Architecture for a dynamic hypertext
generation system. Natural Language Engineering, Vol 7 Issue 3, pp. 225–
250.
62. Paiva, D. S., & Evans, R. 2005. Empirically-based control of natural
language generation. In Proc. ACL’05, pp. 58–65.
63. Parameswari K 2011. An Implementation of APERTIUM Morphological
Analyzer and Generator for Tamil. Language in India
www.languageinindia.com 11:5 May 2011 Special Volume: Problems of
Parsing in Indian Languages pp. 41-44.
64. Philip R Cohen and C Raymond Perrault 1979 “Elements of a Plan-Based
Theory of Speech Acts” Technical Report No: 141, Center for the Study of
Reading September pp. 1-67.
65. Pierre-Luc Vaudry and Guy Lapalme 2013 “Adapting SimpleNLG for
bilingual English-French realisation” Proceedings of the 14th European
Workshop on Natural Language Generation, Sofia, Bulgaria, August 8-9 pp
183–187.
66. PJ Antony, KP Soman 2012: “Computational morphology and natural
language parsing for Indian languages: a literature survey”. International
Journal of Computer Science and Engineering Technology. International
Journal of Scientific & Engineering Research Volume 3, Issue 3, March-2012
ISSN 2229-5518, pp 1-11.
67. Plachouras, V., Smiley, C., Bretz, H., Taylor, O., Leidner, J. L., Song, D., &
Schilder, F. 2016. Interacting with financial data using natural language. In
Proc. SIGIR’16, pp. 1121–1124.
68. Portet, F., Reiter, E., Gatt, A., Hunter, J. R., Sripada, S., Freer, Y., & Sykes,
C. 2009. Automatic generation of textual summaries from neonatal intensive
care data. Artificial Intelligence, Vol 173 Issue 7-8, pp. 789–816.
69. Richard J Hanson, Robert F. Simmons and J Slocum 1972 “Generating
English Discourse from Semantic Networks” Communications of the ACM,
October Volume 15 Issue 10, pp. 891-905.
70. Ramos-Soto, A., Bugarin, A. J., Barro, S., & Taboada, J. 2015. Linguistic
Descriptions for Automatic Generation of Textual Short-Term Weather
Forecasts on Real Prediction Data. IEEE Transactions on Fuzzy Systems,
Vol 23 Issue 1, pp. 44–57.
71. Rao G U M, Kulkarni P A. Computer Applications in Indian Languages,
Hyderabad 2006: The Centre for Distance Education, University of
Hyderabad.
72. Ratnaparkhi, A. 2000. Trainable methods for surface natural language
generation. In Proc. NAACL’00, pp. 194–201.
73. Reiter E, Dale R. 2000 “Building natural language generation systems”,
Cambridge University Press, New York.
74. Reiter, E., Robertson, R., & Osman, L. M. 2003. Lessons from a failure:
Generating tailored smoking cessation letters. Artificial Intelligence, Vol 144
Issue 1-2, pp. 41–58.
75. Reiter, E., Sripada, S., Hunter, J. R., Yu, J., & Davy, I. 2005. Choosing
words in computer-generated weather forecasts. Artificial Intelligence, Vol
167 Issue 1-2, pp. 137–169.
76. Rieser, V., & Lemon, O. 2009. Natural Language Generation as Planning
Under Uncertainty for Spoken Dialogue Systems. In Proc. EACL 2009, pp. 683–
691.
77. Roark, Brian, Sproat R. 2007 “Computational approaches to Morphology and
Syntax” Oxford.
78. Rodrigo de Oliveira, Somayajulu Sripada 2014 “Adapting SimpleNLG for
Brazilian Portuguese realisation”. Proceedings of the 8th International
Natural Language Generation Conference, Philadelphia, Pennsylvania, 19-21
June 2014 pp 93–94.
79. Siddharthan, A., Green, M., van Deemter, K., Mellish, C., & van der Wal, R.
2013. Blogging birds: Generating narratives about reintroduced species to
promote public engagement. In Proc. INLG’13, pp. 120–124.
80. Siddharthan, A., Nenkova, A., & McKeown, K. R. 2011. Information Status
Distinctions and Referring Expressions: An Empirical Study of References to
People in News Summaries. Computational Linguistics, Vol 37 Issue 4, pp.
811–842.
81. Smriti Singh, Mrugank Dalal, Vishal Vachhani, Pushpak Bhattacharyya, Om
P. Damani 2007 “Hindi Generation from Interlingua (UNL)” in Proceedings
of MT summit.
82. Sri Badri Narayanan R, Saravanan S, Soman KP. 2009 “Data Driven Suffix
List and Concatenation Algorithm for Telugu Morphological Generator”.
International Journal of Engineering Science and Technology (IJEST) ISSN :
0975-5462 Vol. 3 Issue 8 pp. 6712-6717.
83. Stock, O., Zancanaro, M., Busetta, P., Callaway, C., Krüger, A., Kruppa, M.,
Kuflik, T., Not, E., & Rocchi, C. 2007. “Adaptive, intelligent presentation of
information for the museum visitor in PEACH”. User Modeling and User-
Adapted Interaction, Vol 17 Issue 3, pp. 257–304.
84. Theune, M., Klabbers, E., de Pijper, J.-R., Krahmer, E., & Odijk, J. 2001.
From data to speech: a general approach. Natural Language Engineering, Vol
7 Issue 1, pp. 47–86.
85. Thompson H 1977 “Strategy and Tactics: A model for Language Production”
Papers from the Thirteenth Regional Meetings, Chicago Linguistic Society
pp. 651-668.
86. Turner, R., Sripada, S., Reiter, E., & Davy, I. 2008. Selecting the Content of
Textual Descriptions of Geographically Located Events in SpatioTemporal
Weather Data. In Applications and Innovations in Intelligent Systems XV,
pp. 75–88.
87. Uma Maheshwar Rao, G. and Christopher Mala 2011 “TELUGU WORD
SYNTHESIZER” International Telugu Internet Conference Proceedings,
Milpitas, California, USA 28th-30th September pp 1-8.
88. Van Deemter, K., Krahmer, E., & Theune, M. 2005. Real versus
template-based natural language generation: A false opposition?
Computational Linguistics, Vol 31 Issue 1, pp. 15–24.
89. Vaudry, P.-L., & Lapalme, G. 2013. Adapting SimpleNLG for bilingual
French English realisation. In Proc. ENLG’13, pp. 183–187.
90. Vishal Goyal, Gurpreet Singh Lehal 2011: “Hindi to Punjabi Machine
Translation System” Proceedings of the ACL-HLT 2011 System
Demonstrations, Portland, Oregon, USA, 21 June 2011 pp 1–6.
91. Walker, M. A., Rambow, O., & Rogati, M. 2002. Training a sentence planner
for spoken dialogue using boosting. Computer Speech and Language, Vol 16
Issue 3-4, pp. 409–433.
92. Walther V Hahn, Wolfgang Hoeppner, Antony Jameson, Wolfgang Wahlster,
1978 “The Anatomy of the Natural Language Dialogue System HAM-RPM”
In AJCL Microfiche 77, pp 53-67.
93. Wanner, L., Bosch, H., Bouayad-Agha, N., & Casamayor, G. 2015. Getting
the environmental information across: from the Web to the user. Expert
Systems, Vol 32 Issue 3, pp. 405–432.
94. Weizenbaum, Joseph 1966. "ELIZA—a computer program for the study of
natural language communication between man and
machine". Communications of the ACM. Vol 9: pp. 36–45.
95. Winograd T. 1972. “Understanding Natural Language”. Cognitive
Psychology, Vol 3 Issue 1, pp. 1–191.
APPENDIX
WX-Notation for Telugu
PAPERS EMANATED FROM THE THESIS
List
1) “A Simple Surface Realization Engine for Telugu”. Proceedings of the 15th
European Workshop on Natural Language Generation (ENLG), Brighton,
September 2015. © 2015 Association for Computational Linguistics pp 1-8.
2) “Verb Morphological Generator for Telugu”. Indian Journal of Science and
Technology, Vol 10 Issue 13, DOI: 10.17485/ijst/2017/v10i13/110448, April
2017 pp 1-11.
3) “Morphological Generator for Telugu Nouns and Pronouns”. International
Journal of Computer Applications (0975 – 8887) Volume 165 Issue 5, May
2017 pp 6-14.